A Crawler Architecture for Harvesting the Clear, Social, and Dark Web for IoT-Related Cyber-Threat Intelligence

Title: A Crawler Architecture for Harvesting the Clear, Social, and Dark Web for IoT-Related Cyber-Threat Intelligence
Publication Type: Conference Paper
Year of Publication: 2019
Authors: Koloveas, Paris; Chantzios, Thanasis; Tryfonopoulos, Christos; Skiadopoulos, Spiros
Conference Name: 2019 IEEE World Congress on Services (SERVICES)
Date Published: July 2019
Publisher: IEEE
ISBN Number: 978-1-7281-3851-0
Keywords: Computer crime, crawler architecture, Crawlers, crawling architecture, cyber security, cyber threat intelligence, cyber-security information, dark web, data harvesting, hacker forums, harvested information, Human Behavior, human factors, information gathering task, Internet of Things, IoT, IoT-related cyber-threat intelligence, language models, learning (artificial intelligence), machine learning, machine learning-based crawler, Monitoring, Open Source Software, open-source tools, pubcrawl, security, security forums, security Web sites, service-oriented architecture, social networking (online), social web, statistical language modelling techniques, Task Analysis, telecommunication security, Tools
Abstract

The clear, social, and dark web have lately been identified as rich sources of valuable cyber-security information that, given the appropriate tools and methods, may be identified, crawled, and subsequently leveraged into actionable cyber-threat intelligence. In this work, we focus on the information gathering task, and present a novel crawling architecture for transparently harvesting data from security websites in the clear web, security forums in the social web, and hacker forums/marketplaces in the dark web. The proposed architecture adopts a two-phase approach to data harvesting. In the first phase, a machine learning-based crawler is used to direct the harvesting towards websites of interest, while in the second phase, state-of-the-art statistical language modelling techniques are used to represent the harvested information in a latent low-dimensional feature space and rank it based on its potential relevance to the task at hand. The proposed architecture is realised using exclusively open-source tools, and a preliminary evaluation with crowdsourced results demonstrates its effectiveness.
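
The two-phase flow described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract names neither the crawler nor the language model, so the code below substitutes a keyword-overlap filter for the machine learning-based crawler and plain term-frequency vectors with cosine similarity for the latent low-dimensional representation. All function names, the `min_hits` threshold, and the example.org URLs are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def vectorize(text):
    """Term-frequency vector (assumed stand-in for a learned embedding)."""
    return Counter(tokenize(text))


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_relevant(page_text, topic_terms, min_hits=2):
    """Phase 1 stand-in: keep pages containing enough distinct topic terms.

    A real focused crawler would use a trained classifier here.
    """
    return len(set(tokenize(page_text)) & topic_terms) >= min_hits


def rank(pages, topic_description):
    """Phase 2 stand-in: order pages by similarity to a topic description."""
    topic_vec = vectorize(topic_description)
    scored = [(cosine(vectorize(text), topic_vec), url)
              for url, text in pages.items()]
    return sorted(scored, reverse=True)


def harvest_and_rank(pages, topic_terms, topic_description):
    """Filter pages (phase 1), then rank the survivors (phase 2)."""
    kept = {url: text for url, text in pages.items()
            if is_relevant(text, topic_terms)}
    return rank(kept, topic_description)


if __name__ == "__main__":
    # Hypothetical already-fetched pages; no network access is performed.
    pages = {
        "https://example.org/a": "New botnet exploit targets IoT camera firmware vulnerability",
        "https://example.org/b": "Recipe for chocolate cake with vanilla frosting",
        "https://example.org/c": "IoT router vulnerability disclosed, firmware patch pending",
    }
    terms = {"iot", "vulnerability", "exploit", "firmware", "botnet"}
    for score, url in harvest_and_rank(pages, terms, " ".join(sorted(terms))):
        print(f"{score:.3f}  {url}")
```

Separating the filter from the ranker mirrors the abstract's split between directing the harvest and scoring what was harvested; either stand-in could be swapped for a trained model without changing the other.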

URL: https://ieeexplore.ieee.org/document/8817166
DOI: 10.1109/SERVICES.2019.00016
Citation Key: koloveas_crawler_2019