Visible to the public Delta Encoding of Virtual-Machine Memory in the Dynamic Analysis of Malware

TitleDelta Encoding of Virtual-Machine Memory in the Dynamic Analysis of Malware
Publication TypeConference Paper
Year of Publication2016
AuthorsFowler, James E.
PublisherIEEE
ISBN Number978-1-5090-1853-6
Keywordsattribution, composability, Human Behavior, Metrics
Abstract

Malware is an ever-increasing threat to personal, corporate, and government computing systems alike. Particularly in the corporate and government sectors, the attribution of malware--including the identification of the authorship of malware as well as potentially the malefactor responsible for an attack--is of growing interest. Such malware attribution is often enabled by the fact that malware authors build on the work of others through the use of generators, libraries, and borrowed code. Determining malware phylogeny--the evolutionary history of and the derivative relations between malware--is consequently an endeavor of increasing importance, with a growing focus on the dynamic analysis of malware which involves executing a malware sample and determining the actions it takes after some period of operation. In most cases, such dynamic analysis occurs in a virtual machine, or "sandbox," in order to confine the malware to an environment in which it can do no harm to real systems. In sandbox-driven dynamic analysis of malware, a virtual machine is typically run starting from some known, malware-free baseline state. The malware is injected into the virtual machine, and the machine is allowed to run for some period of time during which the malware presumably activates. The machine is then suspended, and the current machine memory is dumped to disk. The process may then be repeated for other malware samples, each time starting from the baseline state. Stored in raw form on the disk, the dumped memory file is the same size as the virtual-machine memory, for virtual machines running modern operating systems, such memory would likely be no less than 512 MB but could be up to several GBs. If the corresponding memory dumps are to be retained for repeated analysis--as is likely to be required in order to determine a phylogeny for a large database of malware samples--lossless compression of the memory dumps is necessarily to prevent explosive disk usage. For example, the VirusShare project maintains a database of over 19 million malware samples, running these in a virtual machine with 512 MB of memory would require of 9 petabytes (PB) of storage to retain the memory dumps. In this paper, we develop a scheme for the lossless compression of memory dumps resulting from the repeated execution of malware samples in a virtual-machine sandbox. Rather than compress each memory dump individually, we capitalize on the fact that memory dumps stem from a known baseline virtual-machine state and code with respect to this baseline memory. Additionally, to further improve compression efficiency, we exploit the fact that a significant portion of the difference between the baseline memory and that of the currently running machine is the result of the loading of known executable programs and shared libraries. Consequently, we propose delta coding to compress the current virtual-machine memory dump by coding its differences with respect to a predicted memory image, with the latter formed by duplicating the loading of the executables and libraries into the baseline memory, resulting in a significant improvement in compression performance over straightforward delta coding alone. In experimental results for a body of malware samples, the proposed approach outperformed the widely used xdelta3 delta coder by approximately 20% and the popular generic gzip coder by 79%.

URLhttp://ieeexplore.ieee.org/document/7786216/
DOI10.1109/DCC.2016.66
Citation Keyfowler_delta_2016