Design, Implementation and Test of a Flexible Tor-oriented Web Mining Toolkit

Title: Design, Implementation and Test of a Flexible Tor-oriented Web Mining Toolkit
Publication Type: Conference Paper
Year of Publication: 2017
Authors: Celestini, Alessandro; Guarino, Stefano
Conference Name: Proceedings of the 7th International Conference on Web Intelligence, Mining and Semantics
Date Published: June 2017
Publisher: ACM
Conference Location: New York, NY, USA
ISBN Number: 978-1-4503-5225-3
Keywords: dark web, Human Behavior, human factors, pubcrawl, tor web graph
Abstract

Searching and retrieving information from the Web is a primary activity needed to monitor the development and usage of Web resources. Possible benefits include improving user experience (e.g., by optimizing query results) and enforcing data/user security (e.g., by identifying harmful websites). Motivated by the lack of ready-to-use solutions, in this paper we present a flexible and accessible toolkit for structure and content mining, able to crawl, download, extract and index resources from the Web. While easily configurable to work in the "surface" Web, our suite is specifically tailored to explore the Tor dark Web, i.e., the ensemble of Web servers composing the world's most famous darknet. Notably, the toolkit is not just a Web scraper: it includes two mining modules, one able to prepare content to be fed to an (external) semantic engine, and the other able to reconstruct the graph structure of the explored portion of the Web. Besides discussing in detail the design, features and performance of our toolkit, we report the findings of a preliminary run over Tor, which clarify the potential of our solution.

URL: https://dl.acm.org/doi/10.1145/3102254.3102266
DOI: 10.1145/3102254.3102266
Citation Key: celestini_design_2017