Title | Detecting Malware, Malicious URLs and Virus Using Machine Learning and Signature Matching |
Publication Type | Conference Paper |
Year of Publication | 2021 |
Authors | Acharya, Jatin, Chaudhary, Anshul, Chhabria, Anish, Jangale, Smita |
Conference Name | 2021 2nd International Conference for Emerging Technology (INCET) |
Date Published | May |
Keywords | classification, computer viruses, Frequency conversion, Human Behavior, machine learning, Malware, malware analysis, Metrics, prediction, privacy, pubcrawl, Radio frequency, Random Forest, regression, resilience, Resiliency, security, signature, Signature Matching, Trojan horses, Uniform resource locators, URL Detection, Virus |
Abstract | Most of our data is now stored on electronic devices, and easy access to the internet has increased the risk of those devices being infected by viruses, malware, worms, Trojans, ransomware, or other unwanted intruders. Viruses and malware have evolved over time, so identifying malicious files has become difficult. A device can also be attacked through a single click on a forged URL. Our proposed solution combines machine learning and signature matching, and its main aim is to identify malicious programs and URLs and act upon them. To identify malware, we select key features from Portable Executable (PE) file headers and use them to train a random forest (RF) model, which then scans a file and determines whether it is malicious. To identify viruses, we use signature matching: the MD5 hash of the file is compared against a virus signature database containing the MD5 hashes of identified viruses and their families. To distinguish benign from malicious URLs, we use a logistic regression model. The model uses a tokenizer for feature extraction from the URL to be classified: it separates the domains and sub-domains and splits the URL on every '/'. A TfidfVectorizer (Term Frequency - Inverse Document Frequency) then converts the tokens into weighted values, which are used to predict whether the URL is safe to visit. Integrating all three modules, the final application is intended to provide full system protection against malicious software. |
DOI | 10.1109/INCET51464.2021.9456440 |
Citation Key | acharya_detecting_2021 |
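
The signature matching step described in the abstract (comparing a file's MD5 hash against a database of known virus hashes and families) might look roughly like the sketch below. The `SIGNATURES` dictionary, the entry `Demo.Virus.A`, and the function names are hypothetical stand-ins; the paper's actual database format is not given in this record.

```python
import hashlib

# Hypothetical signature database: MD5 hex digest -> (virus name, family).
# The single entry is the MD5 of a harmless demo byte string, computed here
# so that the example needs no external data.
DEMO_PAYLOAD = b"demo-virus-payload"
SIGNATURES = {
    hashlib.md5(DEMO_PAYLOAD).hexdigest(): ("Demo.Virus.A", "demo-family"),
}


def md5_of_file(path, chunk_size=65536):
    """Hash the file in chunks so large files are not loaded into memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def match_signature(path, signatures=SIGNATURES):
    """Return (name, family) if the file's MD5 is in the database, else None."""
    return signatures.get(md5_of_file(path))
```

A dictionary lookup keeps each scan at a single hash computation plus a constant-time membership test, but an exact-hash match only catches files that are byte-for-byte identical to a known sample.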
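
The URL pipeline in the abstract (tokenize on domain labels and on every '/', then weight the tokens by TF-IDF) can be illustrated with a standard-library sketch. Everything here is an assumption made for illustration: stripping the scheme, lowercasing, and the textbook weighting `tf * log(N / df)` are choices of this example, and scikit-learn's `TfidfVectorizer`, which the abstract names, uses a smoothed and normalized variant instead.

```python
import math
import re
from collections import Counter


def tokenize_url(url):
    """Split a URL into domain labels and path segments."""
    # Drop a leading scheme such as "http://" (an assumption of this sketch).
    without_scheme = re.sub(r"^[a-zA-Z][a-zA-Z0-9+.-]*://", "", url)
    # Split on "/" for path segments and on "." for domains and sub-domains.
    return [tok.lower() for tok in re.split(r"[/.]", without_scheme) if tok]


def tfidf_weights(urls):
    """Return one {token: weight} dict per URL using tf * log(N / df)."""
    docs = [tokenize_url(u) for u in urls]
    n_docs = len(docs)
    # Document frequency: in how many URLs each token appears at least once.
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    weighted = []
    for doc in docs:
        counts = Counter(doc)
        weighted.append({
            tok: (count / len(doc)) * math.log(n_docs / doc_freq[tok])
            for tok, count in counts.items()
        })
    return weighted
```

Tokens shared by every URL in the corpus (for example a common top-level domain) receive a weight of zero under this formula, so rarer tokens carry the signal that a downstream classifier such as logistic regression would learn from.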
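
The abstract says the malware model is trained on key features selected from Portable Executable headers but does not list them. As a hedged illustration only, the sketch below reads the seven fields of the COFF file header, which follows the `PE\0\0` signature located via the `e_lfanew` offset at `0x3C`. The function name `pe_header_features` and the choice of these fields as features are assumptions of this example, not the authors' feature set.

```python
import struct

# Field names of the 20-byte COFF file header, in on-disk order.
COFF_FIELDS = (
    "Machine",
    "NumberOfSections",
    "TimeDateStamp",
    "PointerToSymbolTable",
    "NumberOfSymbols",
    "SizeOfOptionalHeader",
    "Characteristics",
)


def pe_header_features(data):
    """Return COFF header fields from raw bytes, or None if not a PE file."""
    # A PE file starts with a DOS header that begins with the magic "MZ".
    if len(data) < 0x40 or data[:2] != b"MZ":
        return None
    # e_lfanew at offset 0x3C holds the file offset of the PE signature.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if len(data) < pe_offset + 24 or data[pe_offset:pe_offset + 4] != b"PE\0\0":
        return None
    # Two 16-bit, three 32-bit, then two 16-bit little-endian fields.
    values = struct.unpack_from("<HHIIIHH", data, pe_offset + 4)
    return dict(zip(COFF_FIELDS, values))
```

Each returned dictionary could serve as one row of a feature table for a classifier such as a random forest; real feature sets for this task typically also draw on the optional header and section table, which this sketch does not parse.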