Biblio

Filters: Keyword is class imbalance
2022-05-19
Ndichu, Samuel, Ban, Tao, Takahashi, Takeshi, Inoue, Daisuke.  2021.  A Machine Learning Approach to Detection of Critical Alerts from Imbalanced Multi-Appliance Threat Alert Logs. 2021 IEEE International Conference on Big Data (Big Data). :2119–2127.
The extraordinary number of alerts generated by network intrusion detection systems (NIDS) can desensitize security analysts tasked with incident response. Security information and event management systems (SIEMs) perform some rudimentary automation but cannot replicate the decision-making process of a skilled analyst. Machine learning and artificial intelligence (AI) can detect patterns in data with appropriate training. In practice, the majority of the alert data comprises false alerts, and true alerts form only a small proportion. Consequently, a naive engine that classifies all security alerts into the majority class can yield a superficially high accuracy close to 100%. Without any correction for the class imbalance, the false alerts will dominate algorithmic predictions, resulting in poor generalization performance. We propose a machine-learning approach to address the class imbalance problem in multi-appliance security alert data and automate the security alert analysis process performed in security operations centers (SOCs). We first used the neighborhood cleaning rule (NCR) to identify and remove ambiguous, noisy, and redundant false alerts. Then, we applied the support vector machine synthetic minority oversampling technique (SVMSMOTE) to generate synthetic true alerts for training. Finally, we fit and evaluated the decision tree and random forest classifiers. In the experiments, using alert data from eight security appliances, we demonstrated that the proposed method can significantly reduce the need for manual auditing, decreasing the number of uninspected alerts and achieving a recall of 99.524%.
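The accuracy trap described in this abstract can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: the alert counts (995 false alerts, 5 true alerts) and the helper names `accuracy` and `recall` are assumptions made for the example, not figures or code from the paper.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of actual positives that were predicted positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp / (tp + fn) if (tp + fn) else 0.0


# Hypothetical alert log: 995 false alerts (label 0), 5 true alerts (label 1).
y_true = [0] * 995 + [1] * 5

# A naive engine that assigns every alert to the majority class.
y_naive = [0] * len(y_true)

naive_accuracy = accuracy(y_true, y_naive)
naive_recall = recall(y_true, y_naive)

print(f"accuracy={naive_accuracy:.3f} recall={naive_recall:.3f}")
```

The naive engine scores well on accuracy while finding none of the true alerts, which is why recall is the more informative figure for this kind of data.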
2022-03-25
Shi, Peng, Chen, Xuebing, Kong, Xiangying, Cao, Xianghui.  2021.  SE-IDS: A Sample Equalization Method for Intrusion Detection in Industrial Control System. 2021 36th Youth Academic Annual Conference of Chinese Association of Automation (YAC). :189–195.

With the continuous emergence of cyber attacks, the security of industrial control systems (ICS) has become a hot issue in academia and industry. Intrusion detection technology plays an irreplaceable role in protecting industrial systems from attacks. However, the imbalance between normal samples and attack samples seriously affects the performance of intrusion detection algorithms. This paper proposes SE-IDS, which uses generative adversarial networks (GAN) to expand the minority class so that the numbers of normal samples and attack samples are relatively balanced, and adopts particle swarm optimization (PSO) to optimize the parameters of LightGBM. Finally, we evaluated the performance of the proposed model on an industrial network dataset.

2020-11-09
Wheelus, C., Bou-Harb, E., Zhu, X..  2018.  Tackling Class Imbalance in Cyber Security Datasets. 2018 IEEE International Conference on Information Reuse and Integration (IRI). :229–232.
It is clear that cyber-attacks are a danger that must be addressed with great resolve, as they threaten the information infrastructure upon which we all depend. Many studies have been published expressing varying levels of success with machine learning approaches to combating cyber-attacks, but many modern studies still focus on training and evaluating with very outdated datasets that contain old attacks that are no longer a threat and lack data on new attacks. Recent datasets like UNSW-NB15 and SANTA have been produced to address this problem. Even so, these modern datasets suffer from class imbalance, which reduces the efficacy of predictive models trained using these datasets. Herein we evaluate several pre-processing methods for addressing the class imbalance problem, using several of the most popular machine learning algorithms and a variant of UNSW-NB15 based upon the attributes from the SANTA dataset.
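The abstract does not name the pre-processing methods it evaluates. As one common example of such a method, the sketch below shows random undersampling of the majority class; treating it as representative of the paper is an assumption, and the function name `random_undersample` and the toy labels are hypothetical.

```python
import random
from collections import Counter


def random_undersample(samples, labels, seed=0):
    """Randomly drop samples until every class matches the smallest class size."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)

    # Every class is reduced to the size of the rarest class.
    target = min(len(xs) for xs in by_class.values())

    out_samples, out_labels = [], []
    for y, xs in by_class.items():
        for x in rng.sample(xs, target):
            out_samples.append(x)
            out_labels.append(y)
    return out_samples, out_labels


# Toy data: 90 "normal" records and 10 "attack" records (hypothetical).
toy_samples = list(range(100))
toy_labels = ["normal"] * 90 + ["attack"] * 10

balanced_samples, balanced_labels = random_undersample(toy_samples, toy_labels)
print(Counter(balanced_labels))
```

Undersampling discards majority-class data, so it trades information for balance; oversampling methods make the opposite trade.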
2020-04-03
Calvert, Chad L., Khoshgoftaar, Taghi M..  2019.  Threshold Based Optimization of Performance Metrics with Severely Imbalanced Big Security Data. 2019 IEEE 31st International Conference on Tools with Artificial Intelligence (ICTAI). :1328–1334.

Proper evaluation of classifier predictive models requires the selection of appropriate metrics to gauge the effectiveness of a model's performance. The Area Under the Receiver Operating Characteristic Curve (AUC) has become the de facto standard metric for evaluating this classifier performance. However, recent studies have suggested that AUC is not necessarily the best metric for all types of datasets, especially those in which there exists a high or severe level of class imbalance. There is a need to assess which specific metrics are most beneficial to evaluate the performance of highly imbalanced big data. In this work, we evaluate the performance of eight machine learning techniques on a severely imbalanced big dataset pertaining to the cyber security domain. We analyze the behavior of six different metrics to determine which provides the best representation of a model's predictive performance. We also evaluate the impact that adjusting the classification threshold has on our metrics. Our results find that the C4.5N decision tree is the optimal learner when evaluating all presented metrics for severely imbalanced Slow HTTP DoS attack data. Based on our results, we propose that the use of AUC alone as a primary metric for evaluating highly imbalanced big data may be ineffective, and the evaluation of metrics such as F-measure and Geometric mean can offer substantial insight into the true performance of a given model.
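To make the threshold adjustment concrete, the sketch below computes the F-measure and the geometric mean of true-positive and true-negative rates at several classification thresholds. The scores, labels, candidate thresholds, and function names are all hypothetical, chosen for illustration; they are not the paper's data or code.

```python
import math


def confusion(y_true, scores, threshold):
    """Count TP, FP, TN, FN when predicting positive for score >= threshold."""
    tp = fp = tn = fn = 0
    for t, s in zip(y_true, scores):
        pred = 1 if s >= threshold else 0
        if t == 1 and pred == 1:
            tp += 1
        elif t == 0 and pred == 1:
            fp += 1
        elif t == 0 and pred == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def f_measure(tp, fp, tn, fn):
    """Harmonic mean of precision and recall (F1)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def geometric_mean(tp, fp, tn, fn):
    """Square root of (true-positive rate * true-negative rate)."""
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    tnr = tn / (tn + fp) if (tn + fp) else 0.0
    return math.sqrt(tpr * tnr)


def best_threshold(y_true, scores, thresholds, metric):
    """Return the candidate threshold that maximizes the given metric."""
    return max(thresholds, key=lambda th: metric(*confusion(y_true, scores, th)))


# Hypothetical classifier scores: 8 negatives followed by 2 positives.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.05, 0.10, 0.10, 0.15, 0.20, 0.20, 0.30, 0.45, 0.35, 0.60]
candidates = [0.20, 0.30, 0.35, 0.50]

for th in candidates:
    counts = confusion(y_true, scores, th)
    print(th, round(f_measure(*counts), 3), round(geometric_mean(*counts), 3))

best_f = best_threshold(y_true, scores, candidates, f_measure)
best_g = best_threshold(y_true, scores, candidates, geometric_mean)
```

On this toy data the default threshold of 0.5 misses one of the two positives, and a lower threshold improves both metrics. With real data the two metrics need not agree on the best threshold.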

2019-03-06
Hess, S., Satam, P., Ditzler, G., Hariri, S..  2018.  Malicious HTML File Prediction: A Detection and Classification Perspective with Noisy Data. 2018 IEEE/ACS 15th International Conference on Computer Systems and Applications (AICCSA). :1–7.

Cybersecurity plays a critical role in protecting sensitive information and the structural integrity of networked systems. As networked systems continue to expand in numbers as well as in complexity, so does the threat of malicious activity and the necessity for advanced cybersecurity solutions. Furthermore, the quantity and quality of available data on malicious content, as well as the fact that malicious activity continuously evolves, make automated protection systems for this type of environment particularly challenging. Not only is the data quality a concern, but the volume of the data can be quite small for some of the classes. This creates a class imbalance in the data used to train a classifier; however, many classifiers are not well equipped to deal with class imbalance. One such example is detecting malicious HTML files from static features. Unfortunately, collecting malicious HTML files is extremely difficult, and the collected data can be quite noisy because HTML files are mislabeled. This paper evaluates a specific application that is afflicted by these modern cybersecurity challenges: detection of malicious HTML files. Previous work presented a general framework for malicious HTML file classification that we modify in this work to use a $\chi^2$ feature selection technique and the synthetic minority oversampling technique (SMOTE). We experiment with different classifiers (i.e., AdaBoost, Gentle-Boost, RobustBoost, RusBoost, and Random Forest) and a pure detection model (i.e., Isolation Forest). We benchmark the different classifiers using SMOTE on a real dataset that contains a limited number of malicious files (40) with respect to the normal files (7,263). It was found that the modified framework performed better than the previous framework's results. However, additional evidence was found to imply that algorithms which train on both the normal and malicious samples are likely overtraining to the malicious distribution. We demonstrate the likely overtraining by determining that a subset of the malicious files, while suspicious, did not come from a malicious source.
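The core idea of SMOTE, which this abstract relies on, is to create new minority samples by interpolating between an existing minority sample and one of its nearest minority neighbors. The sketch below is a minimal standard-library version of that idea, assuming numeric feature vectors; the function name `smote_sketch`, the parameter defaults, and the toy points are hypothetical, and this is not the implementation used in the paper.

```python
import random


def smote_sketch(minority, n_synthetic, k=5, seed=0):
    """Generate synthetic minority samples by nearest-neighbor interpolation."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbors than other samples

    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        base = minority[i]

        # Rank the other minority samples by squared Euclidean distance to base.
        others = sorted(
            (j for j in range(len(minority)) if j != i),
            key=lambda j: sum((a - b) ** 2 for a, b in zip(base, minority[j])),
        )
        neighbor = minority[rng.choice(others[:k])]

        # Place the new point at a random position on the segment base -> neighbor.
        gap = rng.random()
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbor)])
    return synthetic


# Toy minority class: four points at the corners of the unit square (hypothetical).
minority_points = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote_sketch(minority_points, n_synthetic=10, k=2)
print(len(new_points), new_points[0])
```

Because every synthetic point lies between two existing minority samples, mislabeled minority samples are amplified along with genuine ones, which is relevant to the label-noise concern raised in the abstract.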