Biblio
This paper introduces an ensemble model that solves the binary classification problem by combining basic logistic regression with two recent advanced paradigms: extreme gradient-boosted decision trees (XGBoost) and deep learning. To obtain the best result when integrating sub-models, we introduce a method for splitting and selecting feature sets for sub-model training. In addition to the ensemble model, we propose a flexible, robust, and highly scalable new scheme for building a composite classifier that simultaneously applies multiple layers of model decomposition and output aggregation to maximally reduce both the bias and variance (spread) components of the classification error. We demonstrate the power of our ensemble model on the problem of predicting the outcome of Hearthstone, a turn-based computer game, from game-state information. The excellent predictive performance of our model is reflected in its second place in the final ranking among 188 competing teams.
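The abstract does not spell out how the three model families are combined, but a minimal sketch of one plausible wiring is a stacked ensemble, shown below with scikit-learn stand-ins (GradientBoostingClassifier instead of XGBoost, a small MLP instead of a deep network, synthetic data, and none of the paper's feature-splitting strategy).

```python
# Hedged sketch of a heterogeneous stacking ensemble, not the authors' exact pipeline:
# logistic regression, gradient-boosted trees, and a small neural net are combined
# by a logistic-regression meta-learner trained on out-of-fold predictions.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=2000, n_features=30, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

base_models = [
    ("logreg", make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))),
    ("gbdt", GradientBoostingClassifier(random_state=0)),
    ("mlp", make_pipeline(StandardScaler(),
                          MLPClassifier(hidden_layer_sizes=(64, 32),
                                        max_iter=500, random_state=0))),
]

# Out-of-fold predictions of the base models feed the final combiner.
ensemble = StackingClassifier(estimators=base_models,
                              final_estimator=LogisticRegression(),
                              cv=5)
ensemble.fit(X_train, y_train)
print("held-out accuracy:", ensemble.score(X_test, y_test))
```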
In the last decade, numerous fake websites have been created on the World Wide Web to mimic trusted websites, with the aim of stealing financial assets from users and organizations. This form of online attack is called phishing, and it has cost the online community and the various stakeholders hundreds of millions of dollars. Therefore, effective countermeasures that can accurately detect phishing are needed. Machine learning (ML) is a popular tool for data analysis and has recently shown promising results in combating phishing when contrasted with classic anti-phishing approaches, including awareness workshops, visualization, and legal solutions. This article investigates the applicability of ML techniques to detecting phishing attacks and describes their pros and cons. In particular, different types of ML techniques are investigated to reveal the options suitable as anti-phishing tools. More importantly, we experimentally compare a large number of ML techniques on real phishing datasets with respect to different metrics. The purpose of the comparison is to reveal the advantages and disadvantages of ML predictive models and to show their actual performance on phishing attacks. The experimental results show that covering-approach models are more appropriate as anti-phishing solutions, especially for novice users, because of their simple yet effective knowledge bases in addition to their good phishing detection rate.
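A comparison of this kind typically amounts to cross-validating several classifiers under several metrics. The sketch below is illustrative only: it uses synthetic data in place of a real phishing dataset and generic scikit-learn models, since the covering-rule learners evaluated in the article are not part of scikit-learn.

```python
# Illustrative multi-metric comparison of classifiers via cross-validation;
# synthetic data stands in for a phishing dataset.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=3000, n_features=20, weights=[0.7, 0.3],
                           random_state=42)
models = {
    "decision_tree": DecisionTreeClassifier(random_state=42),
    "naive_bayes": GaussianNB(),
    "knn": KNeighborsClassifier(),
    "logreg": LogisticRegression(max_iter=1000),
}
for name, model in models.items():
    scores = cross_validate(model, X, y, cv=5,
                            scoring=["accuracy", "precision", "recall", "f1"])
    print(name, {m: round(scores[f"test_{m}"].mean(), 3)
                 for m in ["accuracy", "precision", "recall", "f1"]})
```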
Attributing the culprit of a cyber-attack is widely considered one of the major technical and policy challenges of cyber-security. Previous studies have been limited by the lack of ground truth about the individual responsible for a given attack. Here, we overcome this limitation by leveraging DEFCON capture-the-flag (CTF) exercise data, where the actual ground truth is known. In this work, we use various classification techniques to identify the culprit in a cyber-attack and find that deceptive activities account for the majority of misclassified samples. We also explore several heuristics to alleviate some of the misclassification caused by deception.
Nowadays, a typical household owns multiple digital devices that can be connected to the Internet. Advertising companies want to reach the consumers behind those devices rather than the devices themselves. However, the identity of a consumer becomes fragmented as they switch from one device to another. A naive approach is to use deterministic features such as user name, telephone number, and email address; however, consumers may refrain from giving away their personal information for privacy and security reasons. The challenge in the ICDM 2015 contest is to develop an accurate probabilistic model for predicting cross-device consumer identity without using deterministic user information. In this paper we present an accurate and scalable cross-device solution using an ensemble of Gradient Boosting Decision Trees (GBDT) and Random Forest. Our final solution ranks 9th on both the public and private leaderboards with an F0.5 score of 0.855.
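The simplest form such an ensemble can take is a probability blend of the two model families, scored with F0.5 (which weights precision above recall). The sketch below is a rough illustration on synthetic data, not the winning solution; real entries typically tune the blend weights and threshold.

```python
# Rough sketch: average predicted probabilities of GBDT and Random Forest,
# threshold, and score with F0.5.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import fbeta_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=40, weights=[0.9, 0.1],
                           random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=1)

gbdt = GradientBoostingClassifier(random_state=1).fit(X_train, y_train)
rf = RandomForestClassifier(n_estimators=300, random_state=1).fit(X_train, y_train)

# Simple 50/50 probability blend; blend weights are an illustrative choice.
proba = 0.5 * gbdt.predict_proba(X_test)[:, 1] + 0.5 * rf.predict_proba(X_test)[:, 1]
pred = (proba >= 0.5).astype(int)
print("F0.5:", fbeta_score(y_test, pred, beta=0.5))
```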
The key challenge for a datacenter network is scalability to handle many customers and their applications. In a datacenter network, packet classification plays an important role in supporting various network services. Previous algorithms store classification rules with the same length combinations in a hash table to simplify the search procedure, so the search performance of hash-based algorithms is tied to the number of hash tables. To achieve fast and scalable packet classification, we propose an algorithm, encoded rule expansion, that transforms rules into an equivalent set of rules with fewer distinct length combinations, without affecting the classification results. The new algorithm minimizes the storage penalty of the transformation and achieves a short search time; in addition, the scheme supports fast incremental updates. Our simulation results show that more than 90% of hash tables can be eliminated. The reduction of length combinations improves the speed of packet classification by an order of magnitude. The results also show that a software implementation of our scheme, without any hardware parallelism, can support up to one thousand customer VLANs and one million rules, where each rule consumes less than 60 bytes and each packet classification completes in under 50 memory accesses.
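To make the baseline concrete, the toy sketch below shows the hash-table organization the abstract refers to: rules grouped by their (source, destination) prefix-length combination, one hash table per combination, with a lookup probing each table. The paper's encoded rule expansion, which reduces the number of distinct length combinations, is not reproduced; the rules and addresses are made up.

```python
# Toy illustration of hash-based packet classification by length combination.
from collections import defaultdict

def prefix_key(ip, length):
    """Keep only the first `length` bits of a 32-bit address."""
    return ip >> (32 - length) if length else 0

rules = [  # (src_prefix, src_len, dst_prefix, dst_len, action) - hypothetical rules
    (0x0B000000, 8, 0xC0A80000, 16, "permit"),   # 11.0.0.0/8  -> 192.168.0.0/16
    (0x0A010000, 16, 0xC0A80100, 24, "deny"),    # 10.1.0.0/16 -> 192.168.1.0/24
]

tables = defaultdict(dict)  # (src_len, dst_len) -> {(src_key, dst_key): action}
for src, sl, dst, dl, action in rules:
    tables[(sl, dl)][(prefix_key(src, sl), prefix_key(dst, dl))] = action

def classify(src_ip, dst_ip):
    # One hash probe per distinct length combination, so fewer combinations
    # mean fewer probes per packet. Returns the first match found; a real
    # classifier would also track rule priorities.
    for (sl, dl), table in tables.items():
        hit = table.get((prefix_key(src_ip, sl), prefix_key(dst_ip, dl)))
        if hit is not None:
            return hit
    return "default"

print(classify(0x0A010203, 0xC0A80105))  # 10.1.2.3 -> 192.168.1.5 matches "deny"
```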
Botnets are one of the most destructive threats to cyber security. Recently, the HTTP protocol has frequently been utilized by botnets as their Command and Control (C&C) protocol. In this work, we aim to detect HTTP-based botnet activity through botnet behaviour analysis using a machine learning approach. To achieve this, we employ flow-based network traffic collected with NetFlow (via Softflowd). The proposed botnet analysis system is implemented with two different machine learning algorithms, C4.5 and Naive Bayes. Our results show that the C4.5-based classifier achieves very promising performance in detecting HTTP-based botnet activity.
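As a hedged sketch of the general setup, the snippet below trains the two classifier families named in the abstract on hypothetical per-flow features (scikit-learn's CART tree stands in for C4.5; the feature names, values, and labels are invented for illustration, not taken from the paper's dataset).

```python
# Sketch: decision tree and naive Bayes on NetFlow-style per-flow features.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 1000
# Hypothetical features: duration (s), packet count, byte count, mean inter-arrival time (s).
X = np.column_stack([
    rng.exponential(5.0, n), rng.poisson(20, n),
    rng.exponential(3000.0, n), rng.exponential(0.5, n),
])
y = rng.integers(0, 2, n)  # 1 = botnet C&C flow, 0 = benign (random labels here)

for name, clf in [("decision_tree", DecisionTreeClassifier(random_state=0)),
                  ("naive_bayes", GaussianNB())]:
    acc = cross_val_score(clf, X, y, cv=5).mean()
    print(f"{name}: mean CV accuracy = {acc:.3f}")
```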
As any veteran of the editor wars can attest, Unix users can be fiercely and irrationally attached to the commands they use and the manner in which they use them. In this work, we investigate the problem of identifying users out of a large set of candidates (25-97) through their command-line histories. Using standard algorithms and feature sets inspired by the natural-language authorship attribution literature, we demonstrate conclusively that individual users can be identified with a high degree of accuracy through their command-line behavior. Further, we report on the best-performing feature combinations, from the many thousands that are possible, in terms of both accuracy and generality. We validate our work by experimenting on three user corpora comprising data gathered over three decades at three distinct locations: the Greenberg user profile corpus (168 users), the Schonlau masquerading corpus (50 users), and the Cal Poly command history corpus (97 users). The first two are well-known corpora published in 1991 and 2001, respectively. The last was developed by the authors in a year-long study in 2014 and represents the most recent corpus of its kind. For a 50-user configuration, we find feature sets that can successfully identify users with over 90% accuracy on the Cal Poly, Greenberg, and one variant of the Schonlau corpus, and over 87% on the other Schonlau variant.
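The general approach, stripped to its essentials, is to treat a block of shell commands as a "document", extract n-gram counts, and train a per-user classifier. The sketch below uses invented two-user toy histories and a generic bag-of-n-grams pipeline; it does not reproduce the authors' feature sets or corpora.

```python
# Sketch of command-line authorship attribution with token n-grams.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy corpora: each entry is a block of shell commands from one hypothetical user.
histories = [
    "ls -la | grep py; vim main.py; git commit -am wip",
    "emacs notes.org; make test; make install",
    "ls -la; vim utils.py; git push origin master",
    "emacs config.el; make clean; make",
]
users = ["alice", "bob", "alice", "bob"]

model = make_pipeline(
    CountVectorizer(analyzer="word", token_pattern=r"\S+", ngram_range=(1, 2)),
    MultinomialNB(),
)
model.fit(histories, users)
print(model.predict(["vim app.py; git commit -am fix; ls -la"]))  # likely "alice"
```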
This paper presents a framework to identify the authors of Thai online messages. The identification is based on 53 writing attributes, and the selected algorithms are the support vector machine (SVM) and the C4.5 decision tree. Experimental results indicate that the overall accuracies achieved by the SVM and the C4.5 were 79% and 75%, respectively; this difference was not statistically significant at the 95% confidence level. As for identifying individual authors, in some cases the SVM was clearly better than the C4.5, but there were also cases where neither of them could distinguish one author from another.
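The abstract does not say which significance test was used; one standard way to compare two classifiers evaluated on the same test items is McNemar's test on their disagreements, sketched below with hypothetical counts.

```python
# Illustration only: McNemar's test on the items where two classifiers disagree.
from statsmodels.stats.contingency_tables import mcnemar

# 2x2 table of joint outcomes on shared test items (hypothetical counts):
# rows = SVM correct / wrong, columns = C4.5 correct / wrong.
table = [[140, 18],
         [12, 30]]
result = mcnemar(table, exact=True)
print(f"p-value = {result.pvalue:.3f}")  # > 0.05 -> no significant difference
```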
To deliver sample estimates with the necessary probability foundation to permit generalization from the sample data subset to the whole target population being sampled, probability sampling strategies are required to satisfy three necessary, but not sufficient, conditions: 1) all inclusion probabilities must be greater than zero in the target population to be sampled; if some sampling units have an inclusion probability of zero, then a map accuracy assessment does not represent the entire target region depicted in the map to be assessed; 2) the inclusion probabilities must be a) knowable for nonsampled units and b) known for the units selected in the sample, since the inclusion probability determines the weight attached to each sampling unit in the accuracy estimation formulas; if the inclusion probabilities are unknown, so are the estimation weights. This original work presents a novel (to the best of these authors' knowledge, the first) probability sampling protocol for quality assessment and comparison of thematic maps generated from spaceborne/airborne very high resolution images, where: 1) an original Categorical Variable Pair Similarity Index (proposed in two different formulations) is estimated as a fuzzy degree of match between a reference and a test semantic vocabulary, which may not coincide, and 2) both symbolic pixel-based thematic quality indicators (TQIs) and sub-symbolic object-based spatial quality indicators (SQIs) are estimated with a degree of uncertainty in measurement, in compliance with the well-known Quality Assurance Framework for Earth Observation (QA4EO) guidelines. Like a decision tree, any protocol (guidelines for best practice) comprises a set of rules, equivalent to structural knowledge, and an order of presentation of the rule set, known as procedural knowledge. The combination of these two levels of knowledge makes an original protocol worth more than the sum of its parts. The several degrees of novelty of the proposed probability sampling protocol are highlighted in this paper, at the levels of understanding of both structural and procedural knowledge, in comparison with related multi-disciplinary works selected from the existing literature. In the experimental session, the proposed protocol is tested for accuracy validation of preliminary classification maps automatically generated by the Satellite Image Automatic Mapper (SIAM™) software product from two WorldView-2 images and one QuickBird-2 image provided by DigitalGlobe for testing purposes. In these experiments, the collected TQIs and SQIs are statistically valid, statistically significant, consistent across maps, and in agreement with theoretical expectations, visual (qualitative) evidence, and the quantitative quality indexes of operativeness (OQIs) claimed for SIAM™ by related papers. As a subsidiary conclusion, the statistically consistent and statistically significant accuracy validation of the SIAM™ pre-classification maps proposed in this contribution, together with the OQIs claimed for SIAM™ by related works, makes the operational (automatic, accurate, near real-time, robust, scalable) SIAM™ software product eligible for opening up new inter-disciplinary research and market opportunities, in accordance with the visionary goal of the Global Earth Observation System of Systems initiative and the QA4EO international guidelines.
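The role of inclusion probabilities as estimation weights can be made concrete with the textbook Horvitz-Thompson estimator (a standard result, not a formula taken from this paper): for a probability sample $s$ with inclusion probabilities $\pi_i > 0$, a population total of a variable $y$ is estimated as below.

```latex
% Horvitz-Thompson estimator (textbook form): each sampled unit i is weighted by
% the inverse of its inclusion probability \pi_i, which is why unknown or zero
% inclusion probabilities invalidate the accuracy estimates.
\hat{Y}_{\mathrm{HT}} = \sum_{i \in s} \frac{y_i}{\pi_i}, \qquad \pi_i > 0
```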
Denial of Service (DoS) and Distributed Denial of Service (DDoS) attacks exhaust the resources of a server/service and make it unavailable to legitimate users. With the increasing use of online services and attacks on these services, the importance of Intrusion Detection Systems (IDS) for detecting DoS/DDoS attacks has also grown. The detection accuracy and CPU utilization of a data-mining-based IDS depend directly on the quality of the training dataset used to train it. Various preprocessing methods, such as normalization, discretization, and fuzzification, are used by researchers to improve the quality of the training dataset. This paper evaluates the effect of various data preprocessing methods on the detection accuracy of a DoS/DDoS attack detection IDS and shows that the numeric-to-binary preprocessing method performs better than the other methods. Experimental results obtained using the KDD 99 dataset are provided to support the efficiency of the proposed combination.
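As a hedged sketch of what numeric-to-binary preprocessing looks like in practice, the snippet below thresholds every numeric feature to {0, 1} before training the detector; synthetic data is used instead of KDD 99, and the threshold choice is illustrative rather than the paper's exact rule.

```python
# Sketch: numeric-to-binary preprocessing of IDS training features.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import Binarizer
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=4000, n_features=20, random_state=7)

# Numeric -> binary: values above the threshold become 1, the rest 0.
X_bin = Binarizer(threshold=0.0).fit_transform(X)

clf = DecisionTreeClassifier(random_state=7)
print("raw features:   ", cross_val_score(clf, X, y, cv=5).mean().round(3))
print("binary features:", cross_val_score(clf, X_bin, y, cv=5).mean().round(3))
```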