Automatic Application Identification from Billions of Files

Title: Automatic Application Identification from Billions of Files
Publication Type: Conference Paper
Year of Publication: 2017
Authors: Soska, Kyle; Gates, Chris; Roundy, Kevin A.; Christin, Nicolas
Conference Name: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
Publisher: ACM
Conference Location: New York, NY, USA
ISBN Number: 978-1-4503-4887-4
Keywords: aknn, application, Big Data, big data security metrics, clustering, compositionality, files, hsnw, KNN, Malware, Metadata Discovery Problem, pubcrawl, Resiliency, Scalability, security, sketch
Abstract

Understanding how to group a set of binary files into the piece of software they belong to is highly desirable for software profiling, malware detection, or enterprise audits, among many other applications. Unfortunately, it is also extremely challenging: there is absolutely no uniformity in the ways different applications rely on different files, in how binaries are signed, or in the versioning schemes used across different pieces of software. In this paper, we show that, by combining information gleaned from a large number of endpoints (millions of computers), we can accomplish large-scale application identification automatically and reliably. Our approach relies on collecting metadata on billions of files every day, summarizing it into much smaller "sketches", and performing approximate k-nearest neighbor clustering on non-metric space representations derived from these sketches. We design and implement our proposed system using Apache Spark, show that it can process billions of files in a matter of hours, and thus could be used for daily processing. We further show our system manages to successfully identify which files belong to which application with very high precision, and adequate recall.
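The abstract's pipeline (summarize per-file metadata into small sketches, then find each file's nearest neighbors by sketch similarity) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the authors' method: the abstract does not say which sketch is used, so MinHash over the set of machines a file was seen on stands in for it; an exact brute-force search stands in for the approximate k-nearest-neighbor step; and the file names, machine IDs, and function names are hypothetical.

```python
import hashlib
from typing import Dict, List, Set

NUM_HASHES = 64  # sketch length; an arbitrary choice for this illustration


def _hash(seed: int, item: str) -> int:
    """Deterministic 64-bit hash of `item`, varied by `seed` via the BLAKE2b salt."""
    digest = hashlib.blake2b(
        item.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_sketch(items: Set[str], num_hashes: int = NUM_HASHES) -> List[int]:
    """Summarize a non-empty set as the minimum hash value under each seed."""
    return [min(_hash(seed, item) for item in items) for seed in range(num_hashes)]


def estimated_jaccard(a: List[int], b: List[int]) -> float:
    """Fraction of matching sketch positions, an estimate of Jaccard similarity."""
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / len(a)


def nearest_neighbors(sketches: Dict[str, List[int]], query: str, k: int) -> List[str]:
    """Exact brute-force k-NN by estimated similarity (stand-in for approximate k-NN)."""
    scored = [
        (estimated_jaccard(sketches[query], sketch), name)
        for name, sketch in sketches.items()
        if name != query
    ]
    # Highest similarity first; ties broken by name for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:k]]


if __name__ == "__main__":
    # Hypothetical data: file name -> set of machine IDs on which it was observed.
    observations = {
        "editor.exe": {"m1", "m2", "m3", "m4"},
        "editor_plugin.dll": {"m1", "m2", "m3", "m4"},
        "game.exe": {"m7", "m8", "m9"},
    }
    sketches = {name: minhash_sketch(machines) for name, machines in observations.items()}
    print(nearest_neighbors(sketches, "editor.exe", k=1))
```

The intuition is that files belonging to the same application tend to appear on the same machines, so their sketches agree in most positions. A system at the scale the abstract describes would replace the brute-force loop with an approximate index and run the computation in a distributed framework.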

URL: http://doi.acm.org/10.1145/3097983.3098196
DOI: 10.1145/3097983.3098196
Citation Key: soska_automatic_2017