Machine Learning for Encrypted Malware Traffic Classification: Accounting for Noisy Labels and Non-Stationarity
Title | Machine Learning for Encrypted Malware Traffic Classification: Accounting for Noisy Labels and Non-Stationarity |
Publication Type | Conference Paper |
Year of Publication | 2017 |
Authors | Anderson, Blake; McGrew, David |
Conference Name | Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining |
Publisher | ACM |
Conference Location | New York, NY, USA |
ISBN Number | 978-1-4503-4887-4 |
Keywords | composability, machine learning, malware classification, malware detection, Metrics, network accountability, Network security, pubcrawl, resilience, Resiliency, TLS |
Abstract | The application of machine learning for the detection of malicious network traffic has been well researched over the past several decades; it is particularly appealing when the traffic is encrypted because traditional pattern-matching approaches cannot be used. Unfortunately, the promise of machine learning has been slow to materialize in the network security domain. In this paper, we highlight two primary reasons why this is the case: inaccurate ground truth and a highly non-stationary data distribution. To demonstrate and understand the effect that these pitfalls have on popular machine learning algorithms, we design and carry out experiments that show how six common algorithms perform when confronted with real network data. With our experimental results, we identify the situations in which certain classes of algorithms underperform on the task of encrypted malware traffic classification. We offer concrete recommendations for practitioners given the real-world constraints outlined. From an algorithmic perspective, we find that the random forest ensemble method outperformed competing methods. More importantly, feature engineering was decisive; we found that iterating on the initial feature set, and including features suggested by domain experts, had a much greater impact on the performance of the classification system. For example, linear regression using the more expressive feature set easily outperformed the random forest method using a standard network traffic representation on all criteria considered. Our analysis is based on millions of TLS encrypted sessions collected over 12 months from a commercial malware sandbox and two geographically distinct, large enterprise networks. |
URL | http://doi.acm.org/10.1145/3097983.3098163 |
DOI | 10.1145/3097983.3098163 |
Citation Key | anderson_machine_2017 |