Collecting Cyber Threat Intelligence from Hacker Forums via a Two-Stage, Hybrid Process Using Support Vector Machines and Latent Dirichlet Allocation
Title | Collecting Cyber Threat Intelligence from Hacker Forums via a Two-Stage, Hybrid Process Using Support Vector Machines and Latent Dirichlet Allocation |
Publication Type | Conference Paper |
Year of Publication | 2018 |
Authors | Deliu, I., Leichter, C., Franke, K. |
Conference Name | 2018 IEEE International Conference on Big Data (Big Data) |
ISBN Number | 978-1-5386-5035-6 |
Keywords | AV detection, Computer crime, Computer hacking, CTI, cyber security, cyber threat intelligence, Cyber Threat Intelligence (CTI), cyber threat landscape, hacker forum posts, hacker forums, hybrid machine learning model, Internet, latent Dirichlet allocation, LDA, leaked credentials, learning (artificial intelligence), machine learning, machine learning algorithms, malicious proxy servers, Malware, Metrics, nontraditional information sources, privacy, pubcrawl, Resource management, security controls, Support vector machines, SVM, text classification, threat vectors, topic modeling, two-stage hybrid process, Vocabulary |
Abstract | Traditional security controls, such as firewalls, anti-virus and IDS, are ill-equipped to help IT security and response teams keep pace with the rapid evolution of the cyber threat landscape. Cyber Threat Intelligence (CTI) can help remediate this problem by exploiting non-traditional information sources, such as hacker forums and "dark-web" social platforms. Security and response teams can use the collected intelligence to identify emerging threats. Unfortunately, when manual analysis is used to extract CTI from non-traditional sources, it is a time-consuming, error-prone and resource-intensive process. We address these issues by using a hybrid Machine Learning model that automatically searches through hacker forum posts, identifies the posts that are most relevant to cyber security and then clusters the relevant posts into estimations of the topics that the hackers are discussing. The first (identification) stage uses Support Vector Machines and the second (clustering) stage uses Latent Dirichlet Allocation. We tested our model, using data from an actual hacker forum, to automatically extract information about various threats, such as leaked credentials, malicious proxy servers and malware that evades AV detection. The results demonstrate that our method is an effective means for quickly extracting relevant and actionable intelligence that can be integrated with traditional security controls to increase their effectiveness. |
URL | https://ieeexplore.ieee.org/document/8622469 |
DOI | 10.1109/BigData.2018.8622469 |
Citation Key | deliu_collecting_2018 |
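
The two-stage process described in the abstract (a Support Vector Machine that filters security-relevant posts, followed by Latent Dirichlet Allocation that groups the surviving posts into topics) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a bag-of-words representation, a linear SVM trained with Pegasos-style stochastic subgradient descent, and LDA fitted by collapsed Gibbs sampling, all in plain Python so the example stays self-contained. The forum posts, function names and hyperparameters are hypothetical and chosen only for illustration.

```python
import random
from collections import Counter

def tokenize(text):
    """Lowercase, split on whitespace, keep alphabetic tokens only."""
    return [t for t in text.lower().split() if t.isalpha()]

def train_svm(docs, labels, lam=0.01, epochs=50, seed=0):
    """Stage 1: linear SVM on word counts, trained by stochastic
    subgradient descent on the hinge loss. Labels are +1 or -1."""
    rng = random.Random(seed)
    w, t = {}, 0
    order = list(range(len(docs)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            t += 1
            x, y = Counter(docs[i]), labels[i]
            margin = y * sum(w.get(tok, 0.0) * c for tok, c in x.items())
            for tok in w:                  # L2 regularization shrink
                w[tok] *= 1.0 - 1.0 / t
            if margin < 1:                 # hinge loss is active
                eta = 1.0 / (lam * t)
                for tok, c in x.items():
                    w[tok] = w.get(tok, 0.0) + eta * y * c
    return w

def is_relevant(w, doc):
    """A post is kept when its linear score is strictly positive."""
    return sum(w.get(tok, 0.0) for tok in doc) > 0

def lda_gibbs(docs, k=2, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Stage 2: LDA via collapsed Gibbs sampling.
    Returns document-topic counts and topic-word counts."""
    rng = random.Random(seed)
    v = len({tok for d in docs for tok in d})        # vocabulary size
    ndk = [[0] * k for _ in docs]                    # doc-topic counts
    nkw = [Counter() for _ in range(k)]              # topic-word counts
    nk = [0] * k                                     # tokens per topic
    z = [[rng.randrange(k) for _ in d] for d in docs]
    for d, doc in enumerate(docs):
        for tok, topic in zip(doc, z[d]):
            ndk[d][topic] += 1; nkw[topic][tok] += 1; nk[topic] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, tok in enumerate(doc):
                old = z[d][i]                        # remove current assignment
                ndk[d][old] -= 1; nkw[old][tok] -= 1; nk[old] -= 1
                probs = [(ndk[d][j] + alpha) * (nkw[j][tok] + beta)
                         / (nk[j] + v * beta) for j in range(k)]
                new = rng.choices(range(k), weights=probs)[0]
                z[d][i] = new                        # record resampled topic
                ndk[d][new] += 1; nkw[new][tok] += 1; nk[new] += 1
    return ndk, nkw

def top_words(nkw, topic, n=5):
    return [tok for tok, c in nkw[topic].most_common(n) if c > 0]

# Hypothetical labeled posts: +1 = security-relevant, -1 = irrelevant.
train_posts = [
    ("leaked credentials dump combo list", 1),
    ("malware crypter evades antivirus detection", 1),
    ("malicious proxy server socks list", 1),
    ("fresh database dump leaked credentials", 1),
    ("looking for gaming clan members", -1),
    ("best movie soundtrack this year", -1),
    ("happy birthday forum admin", -1),
    ("music playlist sharing thread", -1),
]
train_docs = [tokenize(p) for p, _ in train_posts]
train_labels = [y for _, y in train_posts]

# Hypothetical unlabeled posts to run through both stages.
new_posts = [
    "new leaked credentials dump combo list available",
    "undetected malware crypter evades antivirus detection today",
    "malicious proxy server socks list updated",
    "anyone looking for gaming clan members tonight",
    "happy birthday forum admin cheers",
]

weights = train_svm(train_docs, train_labels)
relevant = [d for d in map(tokenize, new_posts) if is_relevant(weights, d)]
ndk, nkw = lda_gibbs(relevant, k=2)
for topic in range(2):
    print(topic, top_words(nkw, topic))
```

Filtering first keeps the topic model from spending its topics on off-topic chatter, which is the motivation the abstract gives for the hybrid design. On a real corpus, library implementations of both stages would normally replace these hand-written ones.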