Semantic Understanding of Source and Binary Code based on Natural Language Processing

Title: Semantic Understanding of Source and Binary Code based on Natural Language Processing
Publication Type: Conference Paper
Year of Publication: 2021
Authors: Zhang, Zhongtang; Liu, Shengli; Yang, Qichao; Guo, Shichen
Conference Name: 2021 IEEE 4th Advanced Information Management, Communicates, Electronic and Automation Control Conference (IMCEC)
Date Published: June
Keywords: binary code, Binary codes, Human Behavior, LLVM IR, natural language processing, process control, pubcrawl, resilience, Resiliency, Scalability, semantic understanding, Semantics, source code, Syntactics, Training, Transforms
Abstract: With the development of open source projects, a large amount of open source code is reused in binary software, and bugs in the source code are carried into the binary code. To detect reused open source code in binaries, it is sometimes necessary to compare and analyze the similarity between source code and binary code. One of the main challenges is that the compilation process can generate different binary code representations for the same source code, depending on the compiler version, the compilation optimization options, and the target architecture, which greatly increases the difficulty of semantic similarity detection between source code and binary code. To mitigate the influence of the compilation process on the comparison of semantic similarity, this paper transforms both the source code and the binary code into LLVM intermediate representation (LLVM IR), a universal intermediate representation independent of source code and binary code. We carry out semantic feature extraction and embedding training on LLVM IR using a natural language processing model. Experimental results show that LLVM IR eliminates the influence of compilation on the syntactic differences between source code and binary code, and that the semantic features of the code are well represented and preserved.
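The idea in the abstract, comparing source and binary code after both have been brought to LLVM IR, can be illustrated with a small sketch. This is not the authors' model: the paper trains embeddings with a natural language processing model, whereas the sketch below swaps in a much simpler stand-in (token normalization followed by bag-of-tokens cosine similarity). The function names `normalize_ir` and `cosine_similarity`, the placeholder tokens `VAR`, `GLOBAL` and `CONST`, and the two IR snippets are all hypothetical and chosen only for illustration. The record does not say which tools produce the IR, so the snippets are written by hand.

```python
import math
import re
from collections import Counter

# Hypothetical tokenizer: the record does not describe the paper's tokenization.
# Matches local/global identifiers (%x, @f), integer literals, and plain words.
_TOKEN_RE = re.compile(r"[%@][\w.$-]+|-?\d+|[A-Za-z_][\w.]*")


def normalize_ir(ir_text):
    """Turn LLVM IR text into a token list with names and literals abstracted."""
    tokens = []
    for line in ir_text.splitlines():
        line = line.split(";", 1)[0]  # drop trailing LLVM IR comments
        for tok in _TOKEN_RE.findall(line):
            if tok.startswith("%"):
                tokens.append("VAR")      # local value, name is not meaningful
            elif tok.startswith("@"):
                tokens.append("GLOBAL")   # global symbol
            elif re.fullmatch(r"-?\d+", tok):
                tokens.append("CONST")    # integer literal
            else:
                tokens.append(tok)        # opcode, type, or flag, kept as-is
    return tokens


def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity between two bag-of-tokens vectors."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


# Hand-written IR: the same addition as it might appear with named values
# (from source) and with numbered values (as a lifter might emit).
ir_from_source = "%sum = add nsw i32 %a, %b\nret i32 %sum"
ir_from_binary = "%2 = add nsw i32 %0, %1 ; lifted\nret i32 %2"
ir_unrelated = "%1 = mul i32 %0, 3"

same = cosine_similarity(normalize_ir(ir_from_source), normalize_ir(ir_from_binary))
other = cosine_similarity(normalize_ir(ir_from_source), normalize_ir(ir_unrelated))
print(same > other)
```

Abstracting value names is what lets the two renderings of the same computation compare as equal here. A learned embedding, as described in the abstract, would aim to capture semantic similarity beyond such surface-level token overlap.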
DOI: 10.1109/IMCEC51613.2021.9482032
Citation Key: zhang_semantic_2021