Accurate Detection of Automatically Spun Content via Stylometric Analysis
Title | Accurate Detection of Automatically Spun Content via Stylometric Analysis |
Publication Type | Conference Paper |
Year of Publication | 2017 |
Authors | Shahid, U., Farooqi, S., Ahmad, R., Shafiq, Z., Srinivasan, P., Zaffar, F. |
Conference Name | 2017 IEEE International Conference on Data Mining (ICDM) |
Date Published | November |
ISBN Number | 978-1-5386-3835-4 |
Keywords | content spinning techniques, Dictionaries, feature extraction, Frequency measurement, Human Behavior, Metrics, Plagiarism, plagiarism detection, plagiarism detector evasion, pubcrawl, search engines, Software, spam, spammers, Spinning, spun content detection, spun documents, stylometric analysis, stylometric artifacts, stylometry, text analysis, text spinner dictionary, text spinning, text spinning software, unsolicited e-mail |
Abstract | Spammers use automated content spinning techniques to evade plagiarism detection by search engines. Text spinners help spammers in evading plagiarism detectors by automatically restructuring sentences and replacing words or phrases with their synonyms. Prior work on spun content detection relies on the knowledge about the dictionary used by the text spinning software. In this work, we propose an approach to detect spun content and its seed without needing the text spinner's dictionary. Our key idea is that text spinners introduce stylometric artifacts that can be leveraged for detecting spun documents. We implement and evaluate our proposed approach on a corpus of spun documents that are generated using a popular text spinning software. The results show that our approach can not only accurately detect whether a document is spun but also identify its source (or seed) document - all without needing the dictionary used by the text spinner. |
URL | https://ieeexplore.ieee.org/document/8215515 |
DOI | 10.1109/ICDM.2017.52 |
Citation Key | shahid_accurate_2017 |
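
The abstract's key idea is that spinning leaves stylometric artifacts. This record does not list the paper's actual features or classifier, so the following is only a minimal, generic sketch of what stylometric comparison can look like. The function-word list, the feature names, and the use of cosine similarity are illustrative assumptions, not the authors' method.

```python
import re
from collections import Counter
from math import sqrt

# Hypothetical, deliberately tiny function-word list (illustration only).
FUNCTION_WORDS = ("the", "a", "an", "of", "to", "in", "and", "or", "but", "that", "is", "by")


def tokenize(text):
    """Lowercase the text and return its word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def stylometric_features(text):
    """Return a small dict of generic style features for one document."""
    words = tokenize(text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    counts = Counter(words)
    n = len(words) or 1  # avoid division by zero on empty input
    features = {
        "avg_word_len": sum(len(w) for w in words) / n,
        "type_token_ratio": len(counts) / n,
        "avg_sentence_len": len(words) / (len(sentences) or 1),
    }
    # Relative frequency of each function word.
    for fw in FUNCTION_WORDS:
        features["fw_" + fw] = counts[fw] / n
    return features


def cosine_similarity(f1, f2):
    """Cosine similarity between two feature dicts (0.0 if either is all zeros)."""
    keys = set(f1) | set(f2)
    dot = sum(f1.get(k, 0.0) * f2.get(k, 0.0) for k in keys)
    norm1 = sqrt(sum(v * v for v in f1.values()))
    norm2 = sqrt(sum(v * v for v in f2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)


if __name__ == "__main__":
    # Made-up example sentences, not data from the paper.
    seed = "Spammers use automated tools to evade detection by search engines."
    spun = "Spammers utilize automated software to dodge detection by search engines."
    print(cosine_similarity(stylometric_features(seed), stylometric_features(spun)))
```

A comparison like this needs no spinner dictionary, which is the property the abstract emphasizes. A real detector would need a far richer feature set and a trained classifier.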