DIComP: Lightweight Data-Driven Inference of Binary Compiler Provenance with High Accuracy

Title: DIComP: Lightweight Data-Driven Inference of Binary Compiler Provenance with High Accuracy
Publication Type: Conference Paper
Year of Publication: 2022
Authors: Chen, Ligeng; He, Zhongling; Wu, Hao; Xu, Fengyuan; Qian, Yi; Mao, Bing
Conference Name: 2022 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER)
Date Published: March
Keywords: Binary Analysis, codes, Compilation Options, compiler security, compositionality, Conferences, machine learning, Metrics, Neural networks, Optimization, pubcrawl, Resiliency, Scalability, security, Software
Abstract: Binary analysis is pervasively utilized to assess software security and test for vulnerabilities without access to source code. The validity of the analysis is heavily influenced by the ability to infer information about how the code was compiled. Among this compilation information, the compiler type and optimization level, the key factors determining what binaries look like, are still difficult to infer efficiently with existing tools. In this paper, we conduct a thorough empirical study of the binary's appearance under various compilation settings and propose DIComP, a lightweight binary analysis tool based on the simplest machine learning method, which infers the compiler and optimization level from the most relevant features identified in our observations. Our comprehensive evaluations demonstrate that DIComP can fully recognize the compiler provenance and is effective in inferring the optimization level with up to 90% accuracy. It is also efficient, inferring thousands of binaries at a millisecond level with our lightweight machine learning model (1 MB).
DOI: 10.1109/SANER53432.2022.00025
Citation Key: chen_dicomp_2022