Biblio
This paper proposes and describes an active authentication model based on user profiles built from user-issued commands when interacting with GUI-based application. Previous behavioral models derived from user issued commands were limited to analyzing the user's interaction with the *Nix (Linux or Unix) command shell program. Human-computer interaction (HCI) research has explored the idea of building users profiles based on their behavioral patterns when interacting with such graphical interfaces. It did so by analyzing the user's keystroke and/or mouse dynamics. However, none had explored the idea of creating profiles by capturing users' usage characteristics when interacting with a specific application beyond how a user strikes the keyboard or moves the mouse across the screen. We obtain and utilize a dataset of user command streams collected from working with Microsoft (MS) Word to serve as a test bed. User profiles are first built using MS Word commands and identification takes place using machine learning algorithms. Best performance in terms of both accuracy and Area under the Curve (AUC) for Receiver Operating Characteristic (ROC) curve is reported using Random Forests (RF) and AdaBoost with random forests.
DeepQA is a large-scale natural language processing (NLP) question-and-answer system that responds across a breadth of structured and unstructured data, from hundreds of analytics that are combined with over 50 models, trained through machine learning. After the 2011 historic milestone of defeating the two best human players in the Jeopardy! game show, the technology behind IBM Watson, DeepQA, is undergoing gamification into real-world business problems. Gamifying a business domain for Watson is a composite of functional, content, and training adaptation for nongame play. During domain gamification for medical, financial, government, or any other business, each system change affects the machine-learning process. As opposed to the original Watson Jeopardy!, whose class distribution of positive-to-negative labels is 1:100, in adaptation the computed training instances, question-and-answer pairs transformed into true-false labels, result in a very low positive-to-negative ratio of 1:100 000. Such initial extreme class imbalance during domain gamification poses a big challenge for the Watson machine-learning pipelines. The combination of ingested corpus sets, question-and-answer pairs, configuration settings, and NLP algorithms contribute toward the challenging data state. We propose several data engineering techniques, such as answer key vetting and expansion, source ingestion, oversampling classes, and question set modifications to increase the computed true labels. In addition, algorithm engineering, such as an implementation of the Newton-Raphson logistic regression with a regularization term, relaxes the constraints of class imbalance during training adaptation. We conclude by empirically demonstrating that data and algorithm engineering are complementary and indispensable to overcome the challenges in this first Watson gamification for real-world business problems.
Botnets are one of the most destructive threats against the cyber security. Recently, HTTP protocol is frequently utilized by botnets as the Command and Communication (C&C) protocol. In this work, we aim to detect HTTP based botnet activity based on botnet behaviour analysis via machine learning approach. To achieve this, we employ flow-based network traffic utilizing NetFlow (via Softflowd). The proposed botnet analysis system is implemented by employing two different machine learning algorithms, C4.5 and Naive Bayes. Our results show that C4.5 learning algorithm based classifier obtained very promising performance on detecting HTTP based botnet activity.
Today's systems produce a rapidly exploding amount of data, and the data further derives more data, forming a complex data propagation network that we call the data's lineage. There are many reasons that users want systems to forget certain data including its lineage. From a privacy perspective, users who become concerned with new privacy risks of a system often want the system to forget their data and lineage. From a security perspective, if an attacker pollutes an anomaly detector by injecting manually crafted data into the training data set, the detector must forget the injected data to regain security. From a usability perspective, a user can remove noise and incorrect entries so that a recommendation engine gives useful recommendations. Therefore, we envision forgetting systems, capable of forgetting certain data and their lineages, completely and quickly. This paper focuses on making learning systems forget, the process of which we call machine unlearning, or simply unlearning. We present a general, efficient unlearning approach by transforming learning algorithms used by a system into a summation form. To forget a training data sample, our approach simply updates a small number of summations – asymptotically faster than retraining from scratch. Our approach is general, because the summation form is from the statistical query learning in which many machine learning algorithms can be implemented. Our approach also applies to all stages of machine learning, including feature selection and modeling. Our evaluation, on four diverse learning systems and real-world workloads, shows that our approach is general, effective, fast, and easy to use.
Segmentation of land and water regions is necessary in many applications involving analysis of remote sensing imagery. Not only is manual segmentation of these regions prone to considerable subjective variability, but the large volume of imagery collected by modern platforms makes manual segmentation extremely tedious to perform, particularly in applications that require frequent re-measurement. This paper examines a robust, semi-automated approach that utilizes simple and efficient machine learning algorithms to perform supervised classification of multi-spectral image data into land and water regions. By combining the four wavelength bands widely available in imaging platforms such as IKONOS, QuickBird, and GeoEye-1 with basic texture metrics, high quality segmentation can be achieved. An efficient workflow was created by constructing a Graphical User Interface (GUI) to these machine learning algorithms.
This paper proposes a highly scalable framework that can be applied to detect network anomaly at per flow level by constructing a meta-model for a family of machine learning algorithms or statistical data models. The approach is scalable and attainable because raw data needs to be accessed only one time and it will be processed, computed and transformed into a meta-model matrix in a much smaller size that can be resident in the system RAM. The calculation of meta-model matrix can be achieved through disposable updating operations at per row level: once a per-flow information is proceeded, it is no longer needed in calculating the meta-model matrix. While the proposed framework covers both Gaussian and non-Gaussian data, the focus of this work is on the linear regression models. Specifically, a new concept called meta-model sufficient statistics is proposed to analyze a group of models, where exact, not the approximate, results are derived. In addition, the proposed framework can quickly discover an optimal statistical or computer model from a family of candidate models without the need of rescanning the raw dataset. This suggest an extremely efficient and effectively theory and method is possible for big data security analysis.
Botnets have been a serious threat to the Internet security. With the constant sophistication and the resilience of them, a new trend has emerged, shifting botnets from the traditional desktop to the mobile environment. As in the desktop domain, detecting mobile botnets is essential to minimize the threat that they impose. Along the diverse set of strategies applied to detect these botnets, the ones that show the best and most generalized results involve discovering patterns in their anomalous behavior. In the mobile botnet field, one way to detect these patterns is by analyzing the operation parameters of this kind of applications. In this paper, we present an anomaly-based and host-based approach to detect mobile botnets. The proposed approach uses machine learning algorithms to identify anomalous behaviors in statistical features extracted from system calls. Using a self-generated dataset containing 13 families of mobile botnets and legitimate applications, we were able to test the performance of our approach in a close-to-reality scenario. The proposed approach achieved great results, including low false positive rates and high true detection rates.
In this paper, a novel method to do feature selection to detect botnets at their phase of Command and Control (C&C) is presented. A major problem is that researchers have proposed features based on their expertise, but there is no a method to evaluate these features since some of these features could get a lower detection rate than other. To this aim, we find the feature set based on connections of botnets at their phase of C&C, that maximizes the detection rate of these botnets. A Genetic Algorithm (GA) was used to select the set of features that gives the highest detection rate. We used the machine learning algorithm C4.5, this algorithm did the classification between connections belonging or not to a botnet. The datasets used in this paper were extracted from the repositories ISOT and ISCX. Some tests were done to get the best parameters in a GA and the algorithm C4.5. We also performed experiments in order to obtain the best set of features for each botnet analyzed (specific), and for each type of botnet (general) too. The results are shown at the end of the paper, in which a considerable reduction of features and a higher detection rate than the related work presented were obtained.
As increasingly more enterprises are deploying cloud file-sharing services, this adds a new channel for potential insider threats to company data and IPs. In this paper, we introduce a two-stage machine learning system to detect anomalies. In the first stage, we project the access logs of cloud file-sharing services onto relationship graphs and use three complementary graph-based unsupervised learning methods: OddBall, PageRank and Local Outlier Factor (LOF) to generate outlier indicators. In the second stage, we ensemble the outlier indicators and introduce the discrete wavelet transform (DWT) method, and propose a procedure to use wavelet coefficients with the Haar wavelet function to identify outliers for insider threat. The proposed system has been deployed in a real business environment, and demonstrated effectiveness by selected case studies.
As a plethora of wearable devices are being introduced, significant concerns exist on the privacy and security of personal data stored on these devices. Expanding on recent works of using electrocardiogram (ECG) as a modality for biometric authentication, in this work, we investigate the possibility of using personal ECG signals as the individually unique source for physical unclonable function (PUF), which eventually can be used as the key for encryption and decryption engines. We present new signal processing and machine learning algorithms that learn and extract maximally different ECG features for different individuals and minimally different ECG features for the same individual over time. Experimental results with a large 741-subject in-house ECG database show that the distributions of the intra-subject (same person) Hamming distance of extracted ECG features and the inter-subject Hamming distance have minimal overlap. 256-b random numbers generated from the ECG features of 648 (out of 741) subjects pass the NIST randomness tests.
CAPTCHA is a type of challenge-response test to ensure that the response is only generated by humans and not by computerized robots. CAPTCHA are getting harder as because usage of latest advanced pattern recognition and machine learning algorithms are capable of solving simpler CAPTCHA. However, some enhancement procedures make the CAPTCHAs too difficult to be recognized by the human. This paper resolves the problem by next generation human-friendly mini game-CAPTCHA for quantifying the usability of CAPTCHAs.
Phishing is a major concern on the Internet today and many users are falling victim because of criminal's deceitful tactics. Blacklisting is still the most common defence users have against such phishing websites, but is failing to cope with the increasing number. In recent years, researchers have devised modern ways of detecting such websites using machine learning. One such method is to create machine learnt models of URL features to classify whether URLs are phishing. However, there are varying opinions on what the best approach is for features and algorithms. In this paper, the objective is to evaluate the performance of the Random Forest algorithm using a lexical only dataset. The performance is benchmarked against other machine learning algorithms and additionally against those reported in the literature. Initial results from experiments indicate that the Random Forest algorithm performs the best yielding an 86.9% accuracy.
Software Defined Networks (SDNs) have gained prominence recently due to their flexible management and superior configuration functionality of the underlying network. SDNs, with OpenFlow as their primary implementation, allow for the use of a centralised controller to drive the decision making for all the supported devices in the network and manage traffic through routing table changes for incoming flows. In conventional networks, machine learning has been shown to detect malicious intrusion, and classify attacks such as DoS, user to root, and probe attacks. In this work, we extend the use of machine learning to improve traffic tolerance for SDNs. To achieve this, we extend the functionality of the controller to include a resilience framework, ReSDN, that incorporates machine learning to be able to distinguish DoS attacks, focussing on a neptune attack for our experiments. Our model is trained using the MIT KDD 1999 dataset. The system is developed as a module on top of the POX controller platform and evaluated using the Mininet simulator.
In the present time, there has been a huge increase in large data repositories by corporations, governments, and healthcare organizations. These repositories provide opportunities to design/improve decision-making systems by mining trends and patterns from the data set (that can provide credible information) to improve customer service (e.g., in healthcare). As a result, while data sharing is essential, it is an obligation to maintaining the privacy of the data donors as data custodians have legal and ethical responsibilities to secure confidentiality. This research proposes a 2-layer privacy preserving (2-LPP) data sanitization algorithm that satisfies ε-differential privacy for publishing sanitized data. The proposed algorithm also reduces the re-identification risk of the sanitized data. The proposed algorithm has been implemented, and tested with two different data sets. Compared to other existing works, the results obtained from the proposed algorithm show promising performance.
Denial of service (DOS) attacks are a serious threat to network security. These attacks are often sourced from virtual machines in the cloud, rather than from the attacker's own machine, to achieve anonymity and higher network bandwidth. Past research focused on analyzing traffic on the destination (victim's) side with predefined thresholds. These approaches have significant disadvantages. They are only passive defenses after the attack, they cannot use the outbound statistical features of attacks, and it is hard to trace back to the attacker with these approaches. In this paper, we propose a DOS attack detection system on the source side in the cloud, based on machine learning techniques. This system leverages statistical information from both the cloud server's hypervisor and the virtual machines, to prevent network packages from being sent out to the outside network. We evaluate nine machine learning algorithms and carefully compare their performance. Our experimental results show that more than 99.7% of four kinds of DOS attacks are successfully detected. Our approach does not degrade performance and can be easily extended to broader DOS attacks.
Reversible circuits are vulnerable to intellectual property and integrated circuit piracy. To show these vulnerabilities, a detailed understanding on how to identify the function embedded in a reversible circuit is crucial. To obtain the embedded function, one needs to know the synthesis approach used to generate the reversible circuit in the first place. We present a machine learning based scheme to identify the synthesis approach using telltale signs in the design.
In a world where traditional notions of privacy are increasingly challenged by the myriad companies that collect and analyze our data, it is important that decision-making entities are held accountable for unfair treatments arising from irresponsible data usage. Unfortunately, a lack of appropriate methodologies and tools means that even identifying unfair or discriminatory effects can be a challenge in practice. We introduce the unwarranted associations (UA) framework, a principled methodology for the discovery of unfair, discriminatory, or offensive user treatment in data-driven applications. The UA framework unifies and rationalizes a number of prior attempts at formalizing algorithmic fairness. It uniquely combines multiple investigative primitives and fairness metrics with broad applicability, granular exploration of unfair treatment in user subgroups, and incorporation of natural notions of utility that may account for observed disparities. We instantiate the UA framework in FairTest, the first comprehensive tool that helps developers check data-driven applications for unfair user treatment. It enables scalable and statistically rigorous investigation of associations between application outcomes (such as prices or premiums) and sensitive user attributes (such as race or gender). Furthermore, FairTest provides debugging capabilities that let programmers rule out potential confounders for observed unfair effects. We report on use of FairTest to investigate and in some cases address disparate impact, offensive labeling, and uneven rates of algorithmic error in four data-driven applications. As examples, our results reveal subtle biases against older populations in the distribution of error in a predictive health application and offensive racial labeling in an image tagger.
The increasing complexity and ubiquity in user connectivity, computing environments, information content, and software, mobile, and web applications transfers the responsibility of privacy management to the individuals. Hence, making it extremely difficult for users to maintain the intelligent and targeted level of privacy protection that they need and desire, while simultaneously maintaining their ability to optimally function. Thus, there is a critical need to develop intelligent, automated, and adaptable privacy management systems that can assist users in managing and protecting their sensitive data in the increasingly complex situations and environments that they find themselves in. This work is a first step in exploring the development of such a system, specifically how user personality traits and other characteristics can be used to help automate determination of user sharing preferences for a variety of user data and situations. The Big-Five personality traits of openness, conscientiousness, extroversion, agreeableness, and neuroticism are examined and used as inputs into several popular machine learning algorithms in order to assess their ability to elicit and predict user privacy preferences. Our results show that the Big-Five personality traits can be used to significantly improve the prediction of user privacy preferences in a number of contexts and situations, and so using machine learning approaches to automate the setting of user privacy preferences has the potential to greatly reduce the burden on users while simultaneously improving the accuracy of their privacy preferences and security.
To assure cyber security of an enterprise, typically SIEM (Security Information and Event Management) system is in place to normalize security events from different preventive technologies and flag alerts. Analysts in the security operation center (SOC) investigate the alerts to decide if it is truly malicious or not. However, generally the number of alerts is overwhelming with majority of them being false positive and exceeding the SOC's capacity to handle all alerts. Because of this, potential malicious attacks and compromised hosts may be missed. Machine learning is a viable approach to reduce the false positive rate and improve the productivity of SOC analysts. In this paper, we develop a user-centric machine learning framework for the cyber security operation center in real enterprise environment. We discuss the typical data sources in SOC, their work flow, and how to leverage and process these data sets to build an effective machine learning system. The paper is targeted towards two groups of readers. The first group is data scientists or machine learning researchers who do not have cyber security domain knowledge but want to build machine learning systems for security operations center. The second group of audiences are those cyber security practitioners who have deep knowledge and expertise in cyber security, but do not have machine learning experiences and wish to build one by themselves. Throughout the paper, we use the system we built in the Symantec SOC production environment as an example to demonstrate the complete steps from data collection, label creation, feature engineering, machine learning algorithm selection, model performance evaluations, to risk score generation.
Unmanned systems are increasing in number, while their manning requirements remain the same. To decrease manpower demands, machine learning techniques and autonomy are gaining traction and visibility. One barrier is human perception and understanding of autonomy. Machine learning techniques can result in “black box” algorithms that may yield high fitness, but poor comprehension by operators. However, Interactive Machine Learning (IML), a method to incorporate human input over the course of algorithm development by using neuro-evolutionary machine-learning techniques, may offer a solution. IML is evaluated here for its impact on developing autonomous team behaviors in an area search task. Initial findings show that IML-generated search plans were chosen over plans generated using a non-interactive ML technique, even though the participants trusted them slightly less. Further, participants discriminated each of the two types of plans from each other with a high degree of accuracy, suggesting the IML approach imparts behavioral characteristics into algorithms, making them more recognizable. Together the results lay the foundation for exploring how to team humans successfully with ML behavior.
The veil of anonymity provided by smartphones with pre-paid SIM cards, public Wi-Fi hotspots, and distributed networks like Tor has drastically complicated the task of identifying users of social media during forensic investigations. In some cases, the text of a single posted message will be the only clue to an author's identity. How can we accurately predict who that author might be when the message may never exceed 140 characters on a service like Twitter? For the past 50 years, linguists, computer scientists, and scholars of the humanities have been jointly developing automated methods to identify authors based on the style of their writing. All authors possess peculiarities of habit that influence the form and content of their written works. These characteristics can often be quantified and measured using machine learning algorithms. In this paper, we provide a comprehensive review of the methods of authorship attribution that can be applied to the problem of social media forensics. Furthermore, we examine emerging supervised learning-based methods that are effective for small sample sizes, and provide step-by-step explanations for several scalable approaches as instructional case studies for newcomers to the field. We argue that there is a significant need in forensics for new authorship attribution algorithms that can exploit context, can process multi-modal data, and are tolerant to incomplete knowledge of the space of all possible authors at training time.
Currently, mobile botnet attacks have shifted from computers to smartphones due to its functionality, ease to exploit, and based on financial intention. Mostly, it attacks Android due to its popularity and high usage among end users. Every day, more and more malicious mobile applications (apps) with the botnet capability have been developed to exploit end users' smartphones. Therefore, this paper presents a new mobile botnet classification based on permission and Application Programming Interface (API) calls in the smartphone. This classification is developed using static analysis in a controlled lab environment and the Drebin dataset is used as the training dataset. 800 apps from the Google Play Store have been chosen randomly to test the proposed classification. As a result, 16 permissions and 31 API calls that are most related with mobile botnet have been extracted using feature selection and later classified and tested using machine learning algorithms. The experimental result shows that the Random Forest Algorithm has achieved the highest detection accuracy of 99.4% with the lowest false positive rate of 16.1% as compared to other machine learning algorithms. This new classification can be used as the input for mobile botnet detection for future work, especially for financial matters.
Technological advances in wearable and implanted medical devices are enabling wireless body area networks to alter the current landscape of medical and healthcare applications. These systems have the potential to significantly improve real time patient monitoring, provide accurate diagnosis and deliver faster treatment. In spite of their growth, securing the sensitive medical and patient data relayed in these networks to protect patients' privacy and safety still remains an open challenge. The resource constraints of wireless medical sensors limit the adoption of traditional security measures in this domain. In this work, we propose a distributed mobile agent based intrusion detection system to secure these networks. Specifically, our autonomous mobile agents use machine learning algorithms to perform local and network level anomaly detection to detect various security attacks targeted on healthcare systems. Simulation results show that our system performs efficiently with high detection accuracy and low energy consumption.
Cloud computing is a revolution in IT technology that provides scalable, virtualized on-demand resources to the end users with greater flexibility, less maintenance and reduced infrastructure cost. These resources are supervised by different management organizations and provided over Internet using known networking protocols, standards and formats. The underlying technologies and legacy protocols contain bugs and vulnerabilities that can open doors for intrusion by the attackers. Attacks as DDoS (Distributed Denial of Service) are ones of the most frequent that inflict serious damage and affect the cloud performance. In a DDoS attack, the attacker usually uses innocent compromised computers (called zombies) by taking advantages of known or unknown bugs and vulnerabilities to send a large number of packets from these already-captured zombies to a server. This may occupy a major portion of network bandwidth of the victim cloud infrastructures or consume much of the servers time. Thus, in this work, we designed a DDoS detection system based on the C.4.5 algorithm to mitigate the DDoS threat. This algorithm, coupled with signature detection techniques, generates a decision tree to perform automatic, effective detection of signatures attacks for DDoS flooding attacks. To validate our system, we selected other machine learning techniques and compared the obtained results.
Network traffic identification has been a hot topic in network security area. The identification of abnormal traffic can detect attack traffic and helps network manager enforce corresponding security policies to prevent attacks. Support Vector Machines (SVMs) are one of the most promising supervised machine learning (ML) algorithms that can be applied to the identification of traffic in IP networks as well as detection of abnormal traffic. SVM shows better performance because it can avoid local optimization problems existed in many supervised learning algorithms. However, as a binary classification approach, SVM needs more research in multiclass classification. In this paper, we proposed an abnormal traffic identification system(ATIS) that can classify and identify multiple attack traffic applications. Each component of ATIS is introduced in detail and experiments are carried out based on ATIS. Through the test of KDD CUP dataset, SVM shows good performance. Furthermore, the comparison of experiments reveals that scaling and parameters has a vital impact on SVM training results.