Mitigating Web Scrapers using Markup Randomization
Title | Mitigating Web Scrapers using Markup Randomization |
Publication Type | Conference Paper |
Year of Publication | 2021 |
Authors | Bolbol, Noor, Barhoom, Tawfiq |
Conference Name | 2021 Palestinian International Conference on Information and Communication Technology (PICICT) |
Date Published | September |
Keywords | Blogs, Collaboration, composability, content security, Crawlers, data mining, information and communication technology, Information Reuse, machine learning algorithms, markup HTML, middleware, middleware security, policy-based governance, pubcrawl, Randomization, security, Web crawler, Web scraping |
Abstract | Web scraping is the technique of extracting desired data in an automated way by scanning the internal links and content of a website; this activity is usually performed by systematically programmed bots. This paper explains our proposed solution to protect blog content from theft and from being copied to other destinations by mitigating scraping bots. To achieve this, we applied two steps at two levels. The first, at the main blog page level, mitigates the work of crawler bots by adding extra empty article anchors among the real articles. The second, at the article page level, adds a random number of empty and hidden spans with randomly generated text within the article's body. To assess this solution, we applied it to a local project developed in PHP with the Laravel framework and defined four criteria to measure its effectiveness. The results show that the change in file size from applying the solution is not significant, and that processing time increased by a few milliseconds, which is still within the acceptable range. Using the HTML-similarity tool, we obtained very good results that show similarity in style, with only slight changes in structure. Finally, to assess the effect on bots, a scraper bot was run again and produced the results expected from the programmed middleware. These results show that the solution is feasible to adopt and use to protect blog content. |
DOI | 10.1109/PICICT53635.2021.00038 |
Citation Key | bolbol_mitigating_2021 |
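
The two randomization steps described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (their project used PHP and Laravel middleware); it is a Python approximation under stated assumptions. The function names, the `/articles/` URL pattern, the `display:none` hiding style, and the `max_decoys` / `max_spans` limits are all hypothetical choices made for illustration.

```python
import random
import string


def random_text(rng, length=12):
    """Return a random lowercase string used as decoy content (hypothetical helper)."""
    return "".join(rng.choice(string.ascii_lowercase) for _ in range(length))


def add_decoy_anchors(article_links, rng, max_decoys=3):
    """Blog index level: interleave empty decoy anchors among the real anchors.

    Assumption: a decoy is an anchor with no visible text and a random target.
    """
    result = []
    for link in article_links:
        for _ in range(rng.randint(0, max_decoys)):
            result.append(f'<a href="/articles/{random_text(rng)}"></a>')
        result.append(link)
    return result


def add_hidden_spans(paragraphs, rng, max_spans=3):
    """Article level: insert a random number of hidden spans after each paragraph.

    Assumption: spans are hidden with an inline style and carry random text.
    """
    result = []
    for paragraph in paragraphs:
        result.append(paragraph)
        for _ in range(rng.randint(0, max_spans)):
            result.append(
                f'<span style="display:none">{random_text(rng)}</span>'
            )
    return result


if __name__ == "__main__":
    rng = random.Random()  # unseeded, so the markup differs on every response
    index = ['<a href="/articles/1">First post</a>',
             '<a href="/articles/2">Second post</a>']
    body = ["<p>Opening paragraph.</p>", "<p>Closing paragraph.</p>"]
    print("\n".join(add_decoy_anchors(index, rng)))
    print("\n".join(add_hidden_spans(body, rng)))
```

Because the number and position of the decoys change on each request, a scraper that relies on fixed element positions or selectors may collect decoy links and noise text, while the content a reader sees stays the same.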