A Crawler Architecture for Harvesting the Clear, Social, and Dark Web for IoT-Related Cyber-Threat Intelligence
Title | A Crawler Architecture for Harvesting the Clear, Social, and Dark Web for IoT-Related Cyber-Threat Intelligence |
Publication Type | Conference Paper |
Year of Publication | 2019 |
Authors | Koloveas, Paris; Chantzios, Thanasis; Tryfonopoulos, Christos; Skiadopoulos, Spiros |
Conference Name | 2019 IEEE World Congress on Services (SERVICES) |
Date Published | July 2019 |
Publisher | IEEE |
ISBN Number | 978-1-7281-3851-0 |
Keywords | Computer crime, crawler architecture, Crawlers, crawling architecture, cyber security, cyber threat intelligence, cyber-security information, dark web, data harvesting, hacker forums, harvested information, Human Behavior, human factors, information gathering task, Internet of Things, IoT, IoT-related cyber-threat intelligence, language models, learning (artificial intelligence), machine learning, machine learning-based crawler, Monitoring, Open Source Software, open-source tools, pubcrawl, security, security forums, security Web sites, service-oriented architecture, social networking (online), social web, statistical language modelling techniques, Task Analysis, telecommunication security, Tools |
Abstract | The clear, social, and dark web have lately been identified as rich sources of valuable cyber-security information that, given the appropriate tools and methods, may be identified, crawled, and subsequently leveraged into actionable cyber-threat intelligence. In this work, we focus on the information gathering task, and present a novel crawling architecture for transparently harvesting data from security websites in the clear web, security forums in the social web, and hacker forums/marketplaces in the dark web. The proposed architecture adopts a two-phase approach to data harvesting. Initially, a machine learning-based crawler is used to direct the harvesting towards websites of interest, while in the second phase state-of-the-art statistical language modelling techniques are used to represent the harvested information in a latent low-dimensional feature space and rank it based on its potential relevance to the task at hand. The proposed architecture is realised using exclusively open-source tools, and a preliminary evaluation with crowdsourced results demonstrates its effectiveness. |
URL | https://ieeexplore.ieee.org/document/8817166 |
DOI | 10.1109/SERVICES.2019.00016 |
Citation Key | koloveas_crawler_2019 |