Automatic Application Identification from Billions of Files
Title | Automatic Application Identification from Billions of Files |
Publication Type | Conference Paper |
Year of Publication | 2017 |
Authors | Soska, Kyle; Gates, Chris; Roundy, Kevin A.; Christin, Nicolas |
Conference Name | Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining |
Publisher | ACM |
Conference Location | New York, NY, USA |
ISBN Number | 978-1-4503-4887-4 |
Keywords | aknn, application, Big Data, big data security metrics, clustering, compositionality, files, hnsw, KNN, Malware, Metadata Discovery Problem, pubcrawl, Resiliency, Scalability, security, sketch |
Abstract | Understanding how to group a set of binary files into the piece of software they belong to is highly desirable for software profiling, malware detection, or enterprise audits, among many other applications. Unfortunately, it is also extremely challenging: there is absolutely no uniformity in the ways different applications rely on different files, in how binaries are signed, or in the versioning schemes used across different pieces of software. In this paper, we show that, by combining information gleaned from a large number of endpoints (millions of computers), we can accomplish large-scale application identification automatically and reliably. Our approach relies on collecting metadata on billions of files every day, summarizing it into much smaller "sketches", and performing approximate k-nearest neighbor clustering on non-metric space representations derived from these sketches. We design and implement our proposed system using Apache Spark, show that it can process billions of files in a matter of hours, and thus could be used for daily processing. We further show our system manages to successfully identify which files belong to which application with very high precision, and adequate recall. |
URL | http://doi.acm.org/10.1145/3097983.3098196 |
DOI | 10.1145/3097983.3098196 |
Citation Key | soska_automatic_2017 |