"Kn0W Thy Doma1N Name": Unbiased Phishing Detection Using Domain Name Based Features
Title | "Kn0W Thy Doma1N Name": Unbiased Phishing Detection Using Domain Name Based Features |
Publication Type | Conference Paper |
Year of Publication | 2018 |
Authors | Shirazi, Hossein; Bezawada, Bruhadeshwar; Ray, Indrakshi |
Conference Name | Proceedings of the 23rd ACM on Symposium on Access Control Models and Technologies |
Publisher | ACM |
Conference Location | New York, NY, USA |
ISBN Number | 978-1-4503-5666-4 |
Keywords | biased datasets, domain name, Human Behavior, human factor, machine learning, phishing, Phishing Detection, pubcrawl |
Abstract | Phishing websites remain a persistent security threat. Thus far, machine learning approaches appear to have the best potential as defenses. However, there are two main concerns with existing machine learning approaches for phishing detection. The first is the large number of training features used and the lack of validating arguments for these feature choices. The second is that the datasets used in the literature are inadvertently biased with respect to features based on the website URL or content. To address these concerns, we put forward the intuition that the domain name of a phishing website is the tell-tale sign of phishing and holds the key to successful phishing detection. Accordingly, we design features that model the relationships, visual as well as statistical, of the domain name to the key elements of a phishing website, which are used to snare end users. The main value of our feature design is that, to bypass detection, an attacker will find it very difficult to tamper with the visual content of the phishing website without arousing the suspicion of the end user. Our feature set ensures that there is minimal or no bias with respect to a dataset. Our learning model trains with only seven features and achieves a true positive rate of 98% and a classification accuracy of 97% on a sample dataset. Compared to the state-of-the-art work, our per-instance classification is 4 times faster for legitimate websites and 10 times faster for phishing websites. Importantly, we demonstrate the shortcomings of using features based on URLs, as they are likely to be biased towards specific datasets. We show the robustness of our learning algorithm by testing on unknown live phishing URLs and achieve a high detection accuracy of 99.7%. |
URL | https://dl.acm.org/doi/10.1145/3205977.3205992 |
DOI | 10.1145/3205977.3205992 |
Citation Key | shirazi_kn0w_2018 |