Phishing website detection framework through web scraping and data mining

Title: Phishing website detection framework through web scraping and data mining
Publication Type: Conference Paper
Year of Publication: 2017
Authors: Park, A. J., Quadari, R. N., Tsang, H. H.
Conference Name: 2017 8th IEEE Annual Information Technology, Electronics and Mobile Communication Conference (IEMCON)
Publisher: IEEE
ISBN Number: 978-1-5386-3371-7
Keywords: Crawlers, data mining, feature extraction, heuristic weights, Human Behavior, human factors, phishing, Phishing Detection, pubcrawl, Training data, Uniform resource locators, visualization, Web crawler
Abstract

Phishers often exploit users' trust in the appearance of a site by using webpages that are visually similar to an authentic site. Past research studies have tried to identify and classify the factors that contribute to the detection of phishing websites. The focus of this research is to establish a strong relationship between those identified content-based heuristics and the legitimacy of a website by analyzing training sets of both phishing and legitimate websites and, in the process, to analyze new patterns and report findings. Many existing phishing detection tools are not very accurate because they depend mostly on outdated databases of previously identified phishing websites. However, thousands of new phishing websites appear every year, targeting financial institutions, cloud storage and file hosting sites, government websites, and others. This paper presents a framework called Phishing-Detective that detects phishing websites based on existing and newly found heuristics. For this framework, a web crawler was developed to scrape the contents of phishing and legitimate websites. These contents were analyzed to rate the heuristics and the scale factor of their contribution towards the illegitimacy of a website. The data set collected by the web scraper was then analyzed using a data mining tool to find patterns and report findings. A case study shows how this framework can be used to detect a phishing website. This research is still in progress but shows a new way of finding heuristics and using the sum of their contributing weights to detect phishing websites effectively and accurately. Further development of this framework is discussed at the end of the paper.
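
The idea of scoring a page by the sum of its heuristic weights can be illustrated with a minimal Python sketch. It does not reproduce Phishing-Detective: the three heuristics, their integer weights, the threshold, and every function and class name below are hypothetical, chosen only to show how content-based flags scraped from a page might be combined into one score. The paper derives its own heuristics and weights from training data.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urlparse

# Hypothetical heuristics and weights (0-10 scale), for illustration only.
WEIGHTS = {
    "ip_in_url": 4,                # host is a raw IPv4 address
    "external_form_action": 4,     # a form submits to a different host
    "password_field_no_https": 2,  # password input on a non-HTTPS page
}
THRESHOLD = 5  # assumed cut-off; a real system would tune this on data


class FormScanner(HTMLParser):
    """Collects form action URLs and notes whether a password field exists."""

    def __init__(self):
        super().__init__()
        self.form_actions = []
        self.has_password_field = False

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "form":
            self.form_actions.append(attr.get("action") or "")
        elif tag == "input" and (attr.get("type") or "").lower() == "password":
            self.has_password_field = True


def extract_heuristics(url, html):
    """Return a dict mapping each hypothetical heuristic to True or False."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    scanner = FormScanner()
    scanner.feed(html)
    # Relative actions have no hostname, so they count as same-host.
    external = any(
        (urlparse(action).hostname or host) != host
        for action in scanner.form_actions
    )
    return {
        "ip_in_url": bool(re.fullmatch(r"\d{1,3}(\.\d{1,3}){3}", host)),
        "external_form_action": external,
        "password_field_no_https": (
            scanner.has_password_field and parsed.scheme != "https"
        ),
    }


def phishing_score(flags):
    """Sum the weights of every heuristic that fired."""
    return sum(WEIGHTS[name] for name, hit in flags.items() if hit)


def looks_like_phishing(url, html):
    """Flag the page when its summed weight reaches the assumed threshold."""
    return phishing_score(extract_heuristics(url, html)) >= THRESHOLD
```

With these assumed weights, a page served from a raw IP address over HTTP whose login form posts to another host triggers all three heuristics and is flagged, while an HTTPS page whose form posts to a relative path scores zero.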

URL: https://ieeexplore.ieee.org/document/8117212
DOI: 10.1109/IEMCON.2017.8117212
Citation Key: park_phishing_2017