Semi-Supervised Opinion Spam Detection

Opinion spam detection is a new and exciting area of research with a strong emphasis on statistical spam detection techniques. With the dawn of social networking, people now share a great deal about themselves and their experiences on social networking platforms, which leaves room for spammers to distort public opinion and choices. The problem of opinion spam is therefore an ever-growing concern for websites, businesses, and customers alike.

In this work, we propose a novel way of detecting opinion spam in hotel review data using a fully semi-supervised approach which, to the best of our knowledge, has never been tried before. We took inspiration from the use of unlabeled data to boost performance in document classification and successfully applied it to our problem. We developed three algorithms and evaluated their performance on labeled test data.

Using our approach, we were able to achieve an overall accuracy of 69.2% on labeled test data without the use of reviewer-based behavioral features, which is an improvement on the previous benchmark. By using reviewer-based behavioral features for the labeled training data alone, we were able to achieve an overall accuracy of 84.6% on labeled test data, which is almost as good as the previous best results.

Supervisor(s): Bojan Kolosnjaji
Status: finished
Topic: Machine Learning Methods
Author: Muhammad Bilal Javed
Submission: 2016-06-15
Type of Thesis: Master's Thesis
Proof of Concept: No
