Mitigating Web Scrapers using Markup Randomization

Submitted by grigby1 on Fri, 02/25/2022 - 1:32pm

Title	Mitigating Web Scrapers using Markup Randomization
Publication Type	Conference Paper
Year of Publication	2021
Authors	Bolbol, Noor; Barhoom, Tawfiq
Conference Name	2021 Palestinian International Conference on Information and Communication Technology (PICICT)
Date Published	September
Keywords	Blogs, Collaboration, composability, content security, Crawlers, data mining, information and communication technology, Information Reuse, machine learning algorithms, markup HTML, middleware, middleware security, policy-based governance, pubcrawl, Randomization, security, Web crawler, Web scraping
Abstract	Web scraping is the technique of extracting desired data automatically by scanning the internal links and content of a website; it is usually performed by systematically programmed bots. This paper presents our proposed solution for protecting blog content from theft, and from being copied to other destinations, by mitigating scraping bots. The solution works in two steps at two levels. The first step, at the level of the main blog page, hinders crawler bots by adding extra empty article anchors among the real articles. The second step, at the level of the article page, adds a random number of empty, hidden spans with randomly generated text within the article's body. To assess the solution, we applied it to a local project developed in PHP with the Laravel framework and defined four criteria to measure its effectiveness. The results show that the change in file size before and after applying the solution has no meaningful effect, and that processing time increased by a few milliseconds, which is still within the acceptable range. Using the HTML-similarity tool, we obtained very good results: the style remains the same, with only slight changes to the structure. Finally, to assess the effect on bots, a scraper bot was run again, and the programmed middleware produced the expected results. These results show that the solution is feasible to adopt and use for protecting blog content.
DOI	10.1109/PICICT53635.2021.00038
Citation Key	bolbol_mitigating_2021
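
The two steps described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation, which the abstract says was built as PHP middleware in Laravel; it is a Python illustration under stated assumptions. The function names, the `/articles/` URL prefix, the use of an inline `display:none` style to hide the spans, and the choice to place the random text inside each span are all hypothetical, since the abstract does not specify them.

```python
import random
import re
import string


def random_text(rng, length=12):
    """Return a random lowercase string used as decoy text."""
    return "".join(rng.choice(string.ascii_lowercase) for _ in range(length))


def randomize_index(article_links, rng=None, max_decoys=3):
    """Index-page step: mix empty decoy anchors in among the real ones.

    The "/articles/" prefix is a hypothetical URL scheme.
    """
    rng = rng or random.Random()
    mixed = list(article_links)
    for _ in range(rng.randint(1, max_decoys)):
        position = rng.randint(0, len(mixed))
        decoy = '<a href="/articles/' + random_text(rng) + '"></a>'
        mixed.insert(position, decoy)
    return mixed


def decoy_span(rng):
    """Build one hidden span (assumed hidden via inline CSS) with random text."""
    return '<span style="display:none">' + random_text(rng) + "</span>"


def randomize_article(body_html, rng=None, max_spans=3):
    """Article-page step: add a random number of hidden spans after each </p>."""
    rng = rng or random.Random()

    def add_decoys(match):
        count = rng.randint(1, max_spans)
        return match.group(0) + "".join(decoy_span(rng) for _ in range(count))

    return re.sub(r"</p>", add_decoys, body_html)


def strip_decoys(html):
    """Remove the decoy spans; used only to check the real content is intact."""
    return re.sub(r'<span style="display:none">[a-z]*</span>', "", html)


if __name__ == "__main__":
    body = "<p>First paragraph.</p><p>Second paragraph.</p>"
    print(randomize_article(body))
    print(randomize_index(['<a href="/articles/a">A</a>']))
```

Because the number, position, and text of the decoys change on every response, a scraper that relies on a fixed markup structure sees a different structure each time, while a browser renders the same visible content. In this sketch the real anchors keep their relative order and the article text is unchanged once the decoys are stripped.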
