Design, Implementation and Test of a Flexible Tor-oriented Web Mining Toolkit
Title | Design, Implementation and Test of a Flexible Tor-oriented Web Mining Toolkit |
Publication Type | Conference Paper |
Year of Publication | 2017 |
Authors | Celestini, Alessandro; Guarino, Stefano |
Conference Name | Proceedings of the 7th International Conference on Web Intelligence, Mining and Semantics |
Date Published | June 2017 |
Publisher | ACM |
Conference Location | New York, NY, USA |
ISBN Number | 978-1-4503-5225-3 |
Keywords | dark web, human behavior, human factors, pubcrawl, Tor web graph |
Abstract | Searching and retrieving information from the Web is a primary activity needed to monitor the development and usage of Web resources. Possible benefits include improving user experience (e.g. by optimizing query results) and enforcing data/user security (e.g. by identifying harmful websites). Motivated by the lack of ready-to-use solutions, in this paper we present a flexible and accessible toolkit for structure and content mining, able to crawl, download, extract and index resources from the Web. While being easily configurable to work in the "surface" Web, our suite is specifically tailored to explore the Tor dark Web, i.e. the ensemble of Web servers composing the world's most famous darknet. Notably, the toolkit is not just a Web scraper, but it includes two mining modules, respectively able to prepare content to be fed to an (external) semantic engine, and to reconstruct the graph structure of the explored portion of the Web. Other than discussing in detail the design, features and performance of our toolkit, we report the findings of a preliminary run over Tor, that clarify the potential of our solution. |
URL | https://dl.acm.org/doi/10.1145/3102254.3102266 |
DOI | 10.1145/3102254.3102266 |
Citation Key | celestini_design_2017 |