Automatic Application Identification from Billions of Files

Submitted by grigby1 on Tue, 12/12/2017 - 1:20pm

Title	Automatic Application Identification from Billions of Files
Publication Type	Conference Paper
Year of Publication	2017
Authors	Soska, Kyle; Gates, Chris; Roundy, Kevin A.; Christin, Nicolas
Conference Name	Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
Publisher	ACM
Conference Location	New York, NY, USA
ISBN Number	978-1-4503-4887-4
Keywords	aknn, application, Big Data, big data security metrics, clustering, compositionality, files, hsnw, KNN, Malware, Metadata Discovery Problem, pubcrawl, Resiliency, Scalability, security, sketch
Abstract	Understanding how to group a set of binary files into the piece of software they belong to is highly desirable for software profiling, malware detection, or enterprise audits, among many other applications. Unfortunately, it is also extremely challenging: there is absolutely no uniformity in the ways different applications rely on different files, in how binaries are signed, or in the versioning schemes used across different pieces of software. In this paper, we show that, by combining information gleaned from a large number of endpoints (millions of computers), we can accomplish large-scale application identification automatically and reliably. Our approach relies on collecting metadata on billions of files every day, summarizing it into much smaller "sketches", and performing approximate k-nearest neighbor clustering on non-metric space representations derived from these sketches. We design and implement our proposed system using Apache Spark, show that it can process billions of files in a matter of hours, and thus could be used for daily processing. We further show our system manages to successfully identify which files belong to which application with very high precision, and adequate recall.
URL	http://doi.acm.org/10.1145/3097983.3098196
DOI	10.1145/3097983.3098196
Citation Key	soska_automatic_2017
