Biblio
Various studies have been performed to explore the feasibility of detection of web-based attacks by machine learning techniques. False-positive and false-negative results have been reported as a major issue to be addressed to make machine learning-based detection and prevention of web-based attacks reliable and trustworthy. In our research, we tried to identify and address the root cause of the false-positive and false-negative results. In our experiment, we used the CSIC 2010 HTTP dataset, which contains the generated traffic targeted to an e-commerce web application. Our experimental results demonstrate that applying the proposed fine-tuned feature set extraction results in improved detection and classification of web-based attacks for all tested machine learning algorithms. The performance of the machine learning algorithm in the detection of attacks was evaluated by the Precision, Recall, Accuracy, and F-measure metrics. Among three tested algorithms, the J48 decision tree algorithm provided the highest True Positive rate, Precision, and Recall.
Attacks against websites are increasing rapidly with the expansion of web services. An increasing number of diversified web services make it difficult to prevent such attacks due to many known vulnerabilities in websites. To overcome this problem, it is necessary to collect the most recent attacks using decoy web honeypots and to implement countermeasures against malicious threats. Web honeypots collect not only malicious accesses by attackers but also benign accesses such as those by web search crawlers. Thus, it is essential to develop a means of automatically identifying malicious accesses from mixed collected data including both malicious and benign accesses. Specifically, detecting vulnerability scanning, which is a preliminary process, is important for preventing attacks. In this study, we focused on classification of accesses for web crawling and vulnerability scanning since these accesses are too similar to be identified. We propose a feature vector including features of collective accesses, e.g., intervals of request arrivals and the dispersion of source port numbers, obtained with multiple honeypots deployed in different networks for classification. Through evaluation using data collected from 37 honeypots in a real network, we show that features of collective accesses are advantageous for vulnerability scanning and crawler classification.