Collecting Cyber Threat Intelligence from Hacker Forums via a Two-Stage, Hybrid Process Using Support Vector Machines and Latent Dirichlet Allocation

Title: Collecting Cyber Threat Intelligence from Hacker Forums via a Two-Stage, Hybrid Process Using Support Vector Machines and Latent Dirichlet Allocation
Publication Type: Conference Paper
Year of Publication: 2018
Authors: Deliu, I., Leichter, C., Franke, K.
Conference Name: 2018 IEEE International Conference on Big Data (Big Data)
ISBN Number: 978-1-5386-5035-6
Keywords: AV detection, Computer crime, Computer hacking, CTI, cyber security, cyber threat intelligence, Cyber Threat Intelligence (CTI), cyber threat landscape, hacker forum posts, hacker forums, hybrid machine learning model, Internet, latent Dirichlet allocation, LDA, leaked credentials, learning (artificial intelligence), machine learning, machine learning algorithms, malicious proxy servers, Malware, Metrics, nontraditional information sources, privacy, pubcrawl, Resource management, security controls, Support vector machines, SVM, text classification, threat vectors, topic modeling, two-stage hybrid process, Vocabulary
Abstract

Traditional security controls, such as firewalls, anti-virus and IDS, are ill-equipped to help IT security and response teams keep pace with the rapid evolution of the cyber threat landscape. Cyber Threat Intelligence (CTI) can help remediate this problem by exploiting non-traditional information sources, such as hacker forums and "dark-web" social platforms. Security and response teams can use the collected intelligence to identify emerging threats. Unfortunately, manual analysis of non-traditional sources to extract CTI is a time-consuming, error-prone and resource-intensive process. We address these issues by using a hybrid Machine Learning model that automatically searches through hacker forum posts, identifies the posts that are most relevant to cyber security and then clusters the relevant posts into estimations of the topics that the hackers are discussing. The first (identification) stage uses Support Vector Machines and the second (clustering) stage uses Latent Dirichlet Allocation. We tested our model, using data from an actual hacker forum, to automatically extract information about various threats, such as leaked credentials, malicious proxy servers and malware that evades AV detection. The results demonstrate that our method is an effective means for quickly extracting relevant and actionable intelligence that can be integrated with traditional security controls to increase their effectiveness.

URL: https://ieeexplore.ieee.org/document/8622469
DOI: 10.1109/BigData.2018.8622469
Citation Key: deliu_collecting_2018