Detecting Malware, Malicious URLs and Virus Using Machine Learning and Signature Matching

Submitted by grigby1 on Mon, 02/07/2022 - 4:47pm

Title	Detecting Malware, Malicious URLs and Virus Using Machine Learning and Signature Matching
Publication Type	Conference Paper
Year of Publication	2021
Authors	Acharya, Jatin, Chuadhary, Anshul, Chhabria, Anish, Jangale, Smita
Conference Name	2021 2nd International Conference for Emerging Technology (INCET)
Date Published	May
Keywords	classification, computer viruses, Frequency conversion, Human Behavior, machine learning, Malware, malware analysis, Metrics, prediction, privacy, pubcrawl, Radio frequency, Random Forest, regression, resilience, Resiliency, security, signature, Signature Matching, Trojan horses, Uniform resource locators, URL Detection, Virus
Abstract	Most personal data is now stored on electronic devices, and easy access to the internet has increased the risk of those devices being infected by viruses, malware, worms, Trojans, ransomware, or other unwanted intruders. Viruses and malware have evolved over time, so identifying malicious files has become difficult. A device can also be attacked through a single click on a forged URL. Our proposed solution uses machine learning and signature matching techniques to identify malicious programs and URLs and act upon them. To identify malware, we select key features from Portable Executable (PE) file headers and use them to train a random forest (RF) model, which scans a file and determines whether it is malicious. To identify viruses, we use signature matching: the MD5 hash of the file is compared against a virus signature database containing the MD5 hashes of identified viruses and their families. To distinguish benign from malicious URLs, we use a logistic regression model. The model relies on a tokenizer for feature extraction from the URL to be classified; the tokenizer separates the domains and sub-domains and splits the URL on every '/'. A TfidfVectorizer (Term Frequency - Inverse Document Frequency) then converts the tokens into weighted values, which are used to predict whether the URL is safe to visit. With all three modules integrated, the final application will provide full system protection against malicious software.
DOI	10.1109/INCET51464.2021.9456440
Citation Key	acharya_detecting_2021
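
Illustrative sketch (not taken from the paper): the abstract describes matching a file's MD5 hash against a signature database and tokenizing URLs on domains, sub-domains and '/'. The Python below is a minimal sketch of those two steps under stated assumptions. The names `SIGNATURES`, `SAMPLE_BYTES`, `md5_of_file`, `match_signature` and `tokenize_url` are hypothetical, the one database entry is the hash of harmless sample bytes rather than a real virus signature, and stripping the URL scheme before tokenizing is an assumption the abstract does not state.

```python
import hashlib
import re

# Hypothetical signature database: MD5 hex digest -> family label.
# The only entry is the hash of harmless sample bytes, not a real signature.
SAMPLE_BYTES = b"harmless sample payload"
SIGNATURES = {hashlib.md5(SAMPLE_BYTES).hexdigest(): "Example.TestFamily"}


def md5_of_file(path, chunk_size=65536):
    """Return the MD5 hex digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def match_signature(path, signatures=SIGNATURES):
    """Return the family label if the file's MD5 is in the database, else None."""
    return signatures.get(md5_of_file(path))


def tokenize_url(url):
    """Split a URL on every '/' and split each segment on '.'.

    Assumption: a leading scheme such as 'http://' is dropped first.
    """
    url = re.sub(r"^[a-zA-Z][a-zA-Z0-9+.-]*://", "", url)
    tokens = []
    for segment in url.split("/"):
        tokens.extend(part for part in segment.split(".") if part)
    return tokens
```

A tokenizer of this shape could be passed as the `tokenizer` argument of scikit-learn's `TfidfVectorizer`, whose output would then feed a logistic regression classifier; the paper's actual features and training data are not given in this record.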
