Collecting Cyber Threat Intelligence from Hacker Forums via a Two-Stage, Hybrid Process Using Support Vector Machines and Latent Dirichlet Allocation

Submitted by grigby1 on Fri, 03/15/2019 - 11:52am

Title	Collecting Cyber Threat Intelligence from Hacker Forums via a Two-Stage, Hybrid Process Using Support Vector Machines and Latent Dirichlet Allocation
Publication Type	Conference Paper
Year of Publication	2018
Authors	Deliu, I., Leichter, C., Franke, K.
Conference Name	2018 IEEE International Conference on Big Data (Big Data)
ISBN Number	978-1-5386-5035-6
Keywords	AV detection, Computer crime, Computer hacking, CTI, cyber security, cyber threat intelligence, Cyber Threat Intelligence (CTI), cyber threat landscape, hacker forum posts, hacker forums, hybrid machine learning model, Internet, latent Dirichlet allocation, LDA, leaked credentials, learning (artificial intelligence), machine learning, machine learning algorithms, malicious proxy servers, Malware, Metrics, nontraditional information sources, privacy, pubcrawl, Resource management, security controls, Support vector machines, SVM, text classification, threat vectors, topic modeling, two-stage hybrid process, Vocabulary
Abstract	Traditional security controls, such as firewalls, anti-virus, and IDS, are ill-equipped to help IT security and response teams keep pace with the rapid evolution of the cyber threat landscape. Cyber Threat Intelligence (CTI) can help remediate this problem by exploiting non-traditional information sources, such as hacker forums and "dark-web" social platforms. Security and response teams can use the collected intelligence to identify emerging threats. Unfortunately, manual analysis of non-traditional sources to extract CTI is a time-consuming, error-prone, and resource-intensive process. We address these issues by using a hybrid Machine Learning model that automatically searches through hacker forum posts, identifies the posts that are most relevant to cyber security, and then clusters the relevant posts into estimations of the topics that the hackers are discussing. The first (identification) stage uses Support Vector Machines and the second (clustering) stage uses Latent Dirichlet Allocation. We tested our model, using data from an actual hacker forum, to automatically extract information about various threats, such as leaked credentials, malicious proxy servers, and malware that evades AV detection. The results demonstrate that our method is an effective means for quickly extracting relevant and actionable intelligence that can be integrated with traditional security controls to increase their effectiveness.
URL	https://ieeexplore.ieee.org/document/8622469
DOI	10.1109/BigData.2018.8622469
Citation Key	deliu_collecting_2018
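
The two-stage process described in the abstract (an SVM that filters for security-relevant posts, followed by LDA that groups the surviving posts into topics) can be illustrated with a minimal, dependency-free Python sketch. Everything here is an assumption made for illustration and not the authors' implementation: the function names (tokenize, score, train_svm, lda_gibbs), the crude tokenizer, the Pegasos-style SGD used to fit the linear SVM, the collapsed Gibbs sampler used to fit LDA, the hyperparameters, and the toy posts are all hypothetical. This record does not state the paper's actual features, parameters, or data.

```python
import random
import re
from collections import Counter


def tokenize(text):
    # Keeping only words of 4+ letters is a crude stand-in for stop-word removal.
    return [w for w in re.findall(r"[a-z]+", text.lower()) if len(w) >= 4]


def score(weights, doc):
    # Dot product of the weight vector with the document's bag-of-words counts.
    return sum(weights[tok] for tok in doc)


def train_svm(docs, labels, lam=0.01, epochs=50, seed=0):
    """Stage 1: linear SVM fitted by Pegasos-style stochastic sub-gradient
    descent on the L2-regularized hinge loss. Labels are +1 or -1."""
    rng = random.Random(seed)
    weights = Counter()
    step = 0
    for _ in range(epochs):
        for i in rng.sample(range(len(docs)), len(docs)):
            step += 1
            eta = 1.0 / (lam * step)
            margin = labels[i] * score(weights, docs[i])
            for tok in weights:  # shrink all weights (regularization term)
                weights[tok] *= 1.0 - eta * lam
            if margin < 1:  # post is misclassified or inside the margin
                for tok in docs[i]:
                    weights[tok] += eta * labels[i]
    return weights


def lda_gibbs(docs, n_topics=2, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Stage 2: LDA fitted by collapsed Gibbs sampling. Returns per-document
    topic counts and per-topic word counts."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * n_topics for _ in docs]
    topic_word = [Counter() for _ in range(n_topics)]
    topic_total = [0] * n_topics

    def update(d, w, k, delta):
        doc_topic[d][k] += delta
        topic_word[k][w] += delta
        topic_total[k] += delta

    # Start from a random topic assignment for every token.
    assign = [[rng.randrange(n_topics) for _ in doc] for doc in docs]
    for d, doc in enumerate(docs):
        for w, k in zip(doc, assign[d]):
            update(d, w, k, 1)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                update(d, w, assign[d][i], -1)  # remove the token's current topic
                probs = [
                    (doc_topic[d][k] + alpha) * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(n_topics)
                ]
                assign[d][i] = rng.choices(range(n_topics), weights=probs)[0]
                update(d, w, assign[d][i], 1)
    return doc_topic, topic_word


# Hypothetical toy data, invented for illustration (not from the paper's dataset).
LABELED = [
    ("selling leaked credentials dump with passwords", 1),
    ("fresh socks proxy list for carding", 1),
    ("crypter makes malware undetected by antivirus", 1),
    ("anyone watching football tonight", -1),
    ("recommend good pizza recipe please", -1),
    ("happy birthday admin enjoy your holiday", -1),
]
UNLABELED = [
    "leaked passwords dump from gaming site",
    "new crypter keeps malware undetected",
    "proxy list updated daily, socks included",
    "pizza and football night was great",
    "credentials dump leaked again",
    "happy holiday everyone",
    "malware crypter bypasses antivirus",
]

svm = train_svm([tokenize(t) for t, _ in LABELED], [y for _, y in LABELED])
relevant = [p for p in UNLABELED if score(svm, tokenize(p)) > 0]
doc_topic, topic_word = lda_gibbs([tokenize(p) for p in relevant], n_topics=2)

for k, words in enumerate(topic_word):
    print(f"topic {k}:", [w for w, _ in words.most_common(3)])
```

In this sketch only the posts that the first stage scores above zero reach the second stage, which mirrors the identification-then-clustering order in the abstract. The topic word lists depend on the random seed and on the toy data, so they should be read as an illustration of the output shape only. A practical version would more likely rely on an established machine learning library than on hand-written samplers.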
