Title | AltCC: Alternating Clustering and Classification for Batch Analysis of Malware Behavior |
Publication Type | Conference Paper |
Year of Publication | 2020 |
Authors | Ghanem, Sahar M., Aldeen, Donia Naief Saad |
Conference Name | 2020 International Symposium on Networks, Computers and Communications (ISNCC) |
Keywords | classification, Classification algorithms, clustering, Clustering algorithms, dimensionality reduction, feature extraction, Human Behavior, Malware, malware behavior, malware classification, Metrics, privacy, pubcrawl, resilience, Resiliency, scikit-learn, Sparse matrices, Tools |
Abstract | The most common goal of malware analysis is to determine whether a given binary is malware or benign. Another objective is similarity analysis of malware binaries, to understand how new samples differ from known ones. Similarity analysis helps to analyze malware with respect to samples already analyzed and guides the discovery of novel aspects that should be analyzed in more depth. In this work, we are concerned with detecting similarities and differences among malware binaries. Thousands of malware samples are created every day, and machine learning is an indispensable tool for their analysis. Previous work has studied clustering and classification as competing paradigms. In this work, however, a malware similarity analysis technique (AltCC) is proposed that alternates between clustering and classification. In addition, it assumes that the malware samples are not available all at once but are processed in batches. Initially, clustering is applied to the first batch to group similar binaries into novel malware classes. Then, the discovered classes are used to train a classifier. For the following batches, the classifier decides whether a new binary belongs to a known class or is left unclassified. The unclassified binaries are clustered and the process repeats. Malware clustering (i.e., labeling) may entail further human expert analysis, but it dramatically reduces the effort. The effectiveness of AltCC is studied using a dataset of 29,661 malware binaries that represent malware received in six consecutive days/batches. When KMeans is used to label the dataset all at once and its labeling is compared to AltCC's, the adjusted Rand index is 0.71. |
DOI | 10.1109/ISNCC49221.2020.9297176 |
Citation Key | ghanem_altcc_2020 |
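
The alternating loop described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not say which classifier is used or how a binary is judged "unclassified", so the sketch substitutes a 1-nearest-neighbor classifier with a distance cutoff. The names `altcc_sketch`, `clusters_per_round`, `max_dist`, and `make_batch` are hypothetical, and the data are synthetic 2-D points rather than malware features.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.neighbors import KNeighborsClassifier


def altcc_sketch(batches, clusters_per_round, max_dist):
    """Alternate classification and clustering over batches of feature vectors.

    clusters_per_round and max_dist are hypothetical knobs for this sketch;
    the paper's actual rejection rule and cluster counts are not in the abstract.
    Returns one integer label array per batch.
    """
    X_known = np.empty((0, batches[0].shape[1]))
    y_known = np.empty(0, dtype=int)
    next_label = 0
    out = []
    for X, k in zip(batches, clusters_per_round):
        labels = np.full(len(X), -1)  # -1 marks "unclassified"
        if len(y_known):
            # Classification step: accept a known class only when the sample
            # lies within max_dist of some already-labeled sample (assumption).
            clf = KNeighborsClassifier(n_neighbors=1).fit(X_known, y_known)
            dist, _ = clf.kneighbors(X, n_neighbors=1)
            near = dist[:, 0] <= max_dist
            if near.any():
                labels[near] = clf.predict(X[near])
        todo = labels == -1
        if k > 0 and todo.sum() >= k:
            # Clustering step: group the unclassified samples into new classes.
            km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X[todo])
            labels[todo] = km.labels_ + next_label
            next_label += k
        # Everything labeled so far becomes training data for the next batch.
        X_known = np.vstack([X_known, X[labels >= 0]])
        y_known = np.concatenate([y_known, labels[labels >= 0]])
        out.append(labels)
    return out


# Synthetic stand-in data: four well-separated "families" arriving over three batches.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0], [10.0, 10.0]])


def make_batch(families, n=30):
    return np.vstack([centers[f] + 0.5 * rng.standard_normal((n, 2)) for f in families])


batches = [make_batch([0, 1]), make_batch([0, 1, 2]), make_batch([2, 3])]
altcc_labels = np.concatenate(
    altcc_sketch(batches, clusters_per_round=[2, 1, 1], max_dist=3.0)
)

# Baseline in the spirit of the abstract: KMeans on all samples at once,
# compared with the batch-wise labels via the adjusted Rand index.
X_all = np.vstack(batches)
kmeans_labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X_all)
ari = adjusted_rand_score(kmeans_labels, altcc_labels)
n_classes = len(set(altcc_labels.tolist()))
print(f"classes discovered: {n_classes}, ARI vs. all-at-once KMeans: {ari:.2f}")
```

On toy data this clean, the batch-wise and all-at-once labelings should agree closely. The 0.71 reported in the abstract refers to the real 29,661-binary dataset, where families overlap and the choice of cluster count and rejection rule matters far more.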