Biblio
Designing a machine learning based network intrusion detection system (IDS) with high-dimensional features can lead to prolonged classification processes. This is while low-dimensional features can reduce these processes. Moreover, classification of network traffic with imbalanced class distributions has posed a significant drawback on the performance attainable by most well-known classifiers. With the presence of imbalanced data, the known metrics may fail to provide adequate information about the performance of the classifier. This study first uses Principal Component Analysis (PCA) as a feature dimensionality reduction approach. The resulting low-dimensional features are then used to build various classifiers such as Random Forest (RF), Bayesian Network, Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA) for designing an IDS. The experimental findings with low-dimensional features in binary and multi-class classification show better performance in terms of Detection Rate (DR), F-Measure, False Alarm Rate (FAR), and Accuracy. Furthermore, in this paper, we apply a Multi-Class Combined performance metric Combi ned Mc with respect to class distribution through incorporating FAR, DR, Accuracy, and class distribution parameters. In addition, we developed a uniform distribution based balancing approach to handle the imbalanced distribution of the minority class instances in the CICIDS2017 network intrusion dataset. We were able to reduce the CICIDS2017 dataset's feature dimensions from 81 to 10 using PCA, while maintaining a high accuracy of 99.6% in multi-class and binary classification.
Severe class imbalance between the majority and minority classes in large datasets can prejudice Machine Learning classifiers toward the majority class. Our work uniquely consolidates two case studies, each utilizing three learners implemented within an Apache Spark framework, six sampling methods, and five sampling distribution ratios to analyze the effect of severe class imbalance on big data analytics. We use three performance metrics to evaluate this study: Area Under the Receiver Operating Characteristic Curve, Area Under the Precision-Recall Curve, and Geometric Mean. In the first case study, models were trained on one dataset (POST) and tested on another (SlowlorisBig). In the second case study, the training and testing dataset roles were switched. Our comparison of performance metrics shows that Area Under the Precision-Recall Curve and Geometric Mean are sensitive to changes in the sampling distribution ratio, whereas Area Under the Receiver Operating Characteristic Curve is relatively unaffected. In addition, we demonstrate that when comparing sampling methods, borderline-SMOTE2 outperforms the other methods in the first case study, and Random Undersampling is the top performer in the second case study.