Machine Learning for Encrypted Malware Traffic Classification: Accounting for Noisy Labels and Non-Stationarity

Submitted by grigby1 on Mon, 06/11/2018 - 3:40pm

Title	Machine Learning for Encrypted Malware Traffic Classification: Accounting for Noisy Labels and Non-Stationarity
Publication Type	Conference Paper
Year of Publication	2017
Authors	Anderson, Blake, McGrew, David
Conference Name	Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
Publisher	ACM
Conference Location	New York, NY, USA
ISBN Number	978-1-4503-4887-4
Keywords	composability, machine learning, malware classification, malware detection, Metrics, network accountability, Network security, pubcrawl, resilience, Resiliency, TLS
Abstract	The application of machine learning for the detection of malicious network traffic has been well researched over the past several decades; it is particularly appealing when the traffic is encrypted because traditional pattern-matching approaches cannot be used. Unfortunately, the promise of machine learning has been slow to materialize in the network security domain. In this paper, we highlight two primary reasons why this is the case: inaccurate ground truth and a highly non-stationary data distribution. To demonstrate and understand the effect that these pitfalls have on popular machine learning algorithms, we design and carry out experiments that show how six common algorithms perform when confronted with real network data. With our experimental results, we identify the situations in which certain classes of algorithms underperform on the task of encrypted malware traffic classification. We offer concrete recommendations for practitioners given the real-world constraints outlined. From an algorithmic perspective, we find that the random forest ensemble method outperformed competing methods. More importantly, feature engineering was decisive; we found that iterating on the initial feature set, and including features suggested by domain experts, had a much greater impact on the performance of the classification system. For example, linear regression using the more expressive feature set easily outperformed the random forest method using a standard network traffic representation on all criteria considered. Our analysis is based on millions of TLS encrypted sessions collected over 12 months from a commercial malware sandbox and two geographically distinct, large enterprise networks.
URL	http://doi.acm.org/10.1145/3097983.3098163
DOI	10.1145/3097983.3098163
Citation Key	anderson_machine_2017

Groups:

Science of Security VO