Biblio
Cross-site scripting (XSS) is a scripting attack that targets web applications by injecting malicious scripts into web pages. Blind XSS is a subset of stored XSS, in which an attacker blindly deploys malicious payloads into web pages that are stored persistently on target servers. Most XSS detection techniques are inadequate for detecting blind XSS attacks. In this research, we present a machine learning based approach to detecting blind XSS attacks. Testing results help to identify malicious payloads that are likely to be stored in databases through web applications.
Stealthy attackers often disable or tamper with system monitors to hide their tracks and evade detection. In this poster, we present a data-driven technique for detecting such monitor compromise using evidential reasoning. Leveraging the fact that it is difficult for an attacker to hide from multiple, redundant monitors, we identify potential monitor compromise by combining alerts from different sets of monitors using Dempster-Shafer theory and comparing the results to find outliers. We describe our ongoing work in this area.
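As a rough illustration of the evidence-combination step, the sketch below applies Dempster's rule to the mass functions of two hypothetical monitors; the frame of discernment, mass values, and function names are illustrative assumptions rather than details from the poster.

```python
# Minimal sketch of Dempster's rule of combination for two monitors' evidence.
from itertools import product

def combine(m1, m2):
    """Combine two mass functions (dicts mapping frozenset hypotheses to mass)."""
    combined, conflict = {}, 0.0
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            combined[inter] = combined.get(inter, 0.0) + wa * wb
        else:
            conflict += wa * wb          # mass assigned to contradictory evidence
    if conflict >= 1.0:
        raise ValueError("total conflict; sources cannot be combined")
    return {h: w / (1.0 - conflict) for h, w in combined.items()}

# Hypothetical hypotheses: C = "host compromised", N = "host normal".
theta = frozenset({"C", "N"})
monitor_a = {frozenset({"C"}): 0.6, theta: 0.4}   # alerting monitor, some uncertainty
monitor_b = {frozenset({"N"}): 0.7, theta: 0.3}   # quiet monitor
print(combine(monitor_a, monitor_b))
```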
An important ingredient in a successful recipe for solving machine learning problems is the availability of a suitable dataset. However, such a dataset may have to be extracted from large unstructured and semi-structured data such as programming code, scripts, and text. In this work, we propose a plug-in based, extensible feature extraction framework, which we have prototyped as a tool. The proposed framework is demonstrated by extracting features from two different sources of semi-structured and unstructured data. The semi-structured data comprised web page and script based data, whereas the other data was taken from email data for spam filtering. The usefulness of the tool was also assessed with respect to ease of programming.
As drones attract growing interest, the drone industry has opened its market to ordinary people, and drones are now used in daily life. However, as drones have become easier for more people to use, safety and security issues have arisen because accidents are much more likely to happen: a drone may collide with people after losing control or intrude into secured properties. For safety purposes, it is essential for observers and for drones themselves to be aware of an approaching drone. In this paper, we introduce a comprehensive drone detection system based on machine learning. The system is designed to operate on camera-equipped drones. From the camera images, it infers the location of a drone in the image and its vendor model using machine classification. The system is built with the OpenCV library. We collected drone imagery and related information for the learning process. The system achieves about 89 percent accuracy.
Conditional Generative Adversarial Nets (cGAN) [1] were recently proposed as a conditional learning method that feeds extra information into the network. In this paper we propose an improved conditional GAN that uses a divided z-vector (DzGAN). The amount of computation is reduced because DzGAN can implement conditional learning using not images but a one-hot vector, by dividing the range of the z-vector (e.g. -1~1 into -1~0 and 0~1). In DzGAN, the discriminator is fed images with labels as one-hot vectors, and the generator is fed the divided z-vector (e.g. for the 10 classes of the MNIST dataset, the divided z-vector becomes z1~z10 accordingly) whose corresponding label is fed into the discriminator; in this way conditional learning is implemented. In this paper we use conditional Deep Convolutional Generative Adversarial Networks (cDCGAN) [7] instead of cGAN because cDCGAN generates clearer images than cGAN. Heuristic experiments on conditional learning that compare the amount of computation demonstrate that DzGAN is superior to cDCGAN.
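The core of the divided z-vector idea can be sketched as a sampling routine that confines each class's latent codes to its own slice of the z range; the dimensions, ranges, and function name below are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of the "divided z-vector" idea: each class label is mapped to its own
# sub-interval of the latent range, so the label is encoded in z itself rather than in
# a separate conditioning input. Dimensions and ranges are illustrative assumptions.
import numpy as np

def sample_divided_z(labels, z_dim=100, num_classes=10, low=-1.0, high=1.0):
    """Sample one latent vector per label, drawn from that label's slice of [low, high)."""
    width = (high - low) / num_classes          # e.g. 0.2 per class for 10 classes
    lo = low + labels * width                   # lower bound of each class's slice
    hi = lo + width
    # Uniform noise confined to the class-specific sub-interval, per dimension.
    return np.random.uniform(lo[:, None], hi[:, None], size=(len(labels), z_dim))

labels = np.array([0, 3, 9])
z = sample_divided_z(labels)
print(z.shape, z.min(axis=1), z.max(axis=1))    # each row stays inside its class slice
```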
Botnets are among the major threats on the Internet, used for committing cybercrimes such as DDoS attacks, stealing sensitive information, and spreading spam. Detecting modern botnets, which continuously evolve to evade detection, is a challenging issue. In this paper, we propose a machine learning based botnet detection system that is shown to be effective in identifying P2P botnets. Our approach extracts a convolutional version of effective flow-based features and trains a classification model using a feed-forward artificial neural network. The experimental results show that detection using the convolutional features is more accurate than detection using the traditional features, achieving 94.7% detection accuracy and a 2.2% false positive rate on known P2P botnet datasets. Furthermore, our system provides an additional confidence test to enhance detection performance: network traffic for which the neural network has insufficient confidence is classified further. The experiments show that this stage can increase the detection accuracy to 98.6% and decrease the false positive rate to 0.5%.
Pre-Silicon hardware Trojan detection has been studied for years. The most popular benchmark circuits are from the Trust-Hub. Their common feature is that the probability of activating hardware Trojans is very low. This leads to a series of machine learning based hardware Trojan detection methods which try to find the nets with low signal probability of 0 or 1. On the other hand, it is considered that, if the probability of activating hardware Trojans is high, these hardware Trojans can be easily found through behaviour simulations or during functional test. This paper explores the "grey zone" between these two opposite scenarios: if the activation probability of a hardware Trojan is not low enough for machine learning to detect it and is not high enough for behaviour simulation or functional test to find it, it can escape from detection. Experiments show the existence of such hardware Trojans, and this paper suggests a new set of hardware Trojan benchmark circuits for future study.
With the explosion of Automation, Autonomy, and AI technology development today, amid encouragement to put humans at the center of AI, systems engineers and user story/requirements developers need research-based guidance on how to design for human machine teaming (HMT). Insights from more than two decades of human-automation interaction research, applied in the systems engineering process, provide building blocks for designing automation, autonomy, and AI-based systems that are effective teammates for people.
The HMT Systems Engineering Guide provides this guidance based on a 2016-17 literature search and analysis of applied research. The guide provides a framework for organizing HMT research, along with a methodology for engaging with users of a system to elicit user stories and/or requirements that reflect applied research findings. The framework uses the organizing themes of Observability, Predictability, Directing Attention, Exploring the Solution Space, Directability, Adaptability, Common Ground, Calibrated Trust, Design Process, and Information Presentation.
The guide includes practice-oriented resources that can be used to bridge the gap between research and design, including a tailorable HMT Knowledge Audit interview methodology, step-by-step instructions for planning and conducting data collection sessions, and a set of general cognitive interface requirements that can be adapted to specific applications based upon domain-specific data collected.
An important source of cyber-attacks is malware, which proliferates in different forms such as botnets. Botnet malware typically looks for vulnerable devices across the Internet rather than targeting specific individuals, companies, or industries. It attempts to infect as many connected devices as possible, using their resources for automated tasks that may cause significant economic and social harm while remaining hidden from the user and the device. Thus, it becomes very difficult to detect such activity. A considerable amount of research has been conducted to detect and prevent botnet infestation. In this paper, we attempt to create a foundation for an anomaly-based intrusion detection system using a statistical learning method to improve network security and reduce human involvement in botnet detection. We focus on identifying the best features for detecting botnet activity within network traffic using a lightweight logistic regression model. The network traffic is processed by Bro, a popular network monitoring framework that provides aggregate statistics about the packets exchanged between a source and destination over a certain time interval. These statistics serve as features for a logistic regression model responsible for classifying malicious and benign traffic. Our model is easy to implement and simple to interpret. We characterized and modeled 8 different botnet families separately and as a mixed dataset. Finally, we measured the performance of our model on multiple parameters using F1 score, accuracy, and Area Under the Curve (AUC).
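A minimal sketch of this kind of pipeline, assuming per-connection aggregates exported from Bro/Zeek logs with hypothetical column names and file path, might look as follows.

```python
# Minimal sketch of the flow-statistics + logistic regression idea. The CSV file,
# column names, and label column are illustrative assumptions, not the paper's data.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, roc_auc_score

flows = pd.read_csv("conn_features.csv")                  # hypothetical aggregated Bro output
features = ["duration", "orig_bytes", "resp_bytes", "orig_pkts", "resp_pkts"]
X, y = flows[features], flows["is_botnet"]                # 1 = botnet flow, 0 = benign

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

pred = clf.predict(X_te)
print("F1:", f1_score(y_te, pred))
print("AUC:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))
```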
In this computer era, humans interact constantly with computers and networks, yet there is no complete solution to the resulting security problems. Security threats undermine trust in the Internet, and users sometimes lose confidence in the reliability of server-based access. Even though many cryptographic algorithms exist to provide integrity and authenticity of data, threats still penetrate networks through inconsistent vulnerabilities. Such vulnerable sites can take control of a user's computer and perform harmful actions without the user's consent. Although firewalls and protocols can support browsers by enforcing certain rules, they alone cannot guarantee data reliability and confidentiality. Since these problems arise from network access, we consider TCP/IP parameters as a dataset for analysis. By preprocessing TCP/IP packets we build a model over the dataset and cluster it. The dataset is then classified into regular traffic patterns and anomalous patterns using the KNN classification algorithm. Based on the patterns obtained for the normal and threat datasets, security devices and systems can set rules and guidelines and learn to take the needed action. This paper analyses how a computer can learn security actions from datasets of previously observed events.
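A minimal sketch of the KNN classification step, assuming a table of numeric TCP/IP features with hypothetical column names and file path, might look as follows.

```python
# Minimal sketch of KNN classification of traffic records into "normal" vs "anomalous".
# The column names and file path are illustrative assumptions.
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

packets = pd.read_csv("tcpip_features.csv")               # hypothetical preprocessed packets
X = packets[["src_port", "dst_port", "pkt_len", "ttl", "tcp_flags", "window_size"]]
y = packets["label"]                                      # "normal" or "anomalous"

# Scaling matters for KNN because it is distance-based.
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
print("CV accuracy:", cross_val_score(model, X, y, cv=5).mean())
```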
Phishing websites remain a persistent security threat. Thus far, machine learning approaches appear to have the best potential as defenses. However, there are two main concerns with existing machine learning approaches for phishing detection. The first is the large number of training features used and the lack of validating arguments for these feature choices. The second concern is the type of datasets used in the literature, which are inadvertently biased with respect to features based on the website URL or content. To address these concerns, we put forward the intuition that the domain name of a phishing website is the tell-tale sign of phishing and holds the key to successful phishing detection. Accordingly, we design features that model the visual as well as statistical relationships of the domain name to the key elements of a phishing website that are used to snare end users. The main value of our feature design is that, to bypass detection, an attacker would find it very difficult to tamper with the visual content of the phishing website without arousing the suspicion of the end user. Our feature set ensures that there is minimal or no bias with respect to a dataset. Our learning model trains with only seven features and achieves a true positive rate of 98% and a classification accuracy of 97% on a sample dataset. Compared to the state-of-the-art work, our per-instance classification is 4 times faster for legitimate websites and 10 times faster for phishing websites. Importantly, we demonstrate the shortcomings of using features based on URLs, as they are likely to be biased towards specific datasets. We show the robustness of our learning algorithm by testing it on unknown live phishing URLs and achieve a high detection accuracy of 99.7%.
The number of malicious Android apps has been and continues to increase rapidly. Such malware can damage or alter files or settings, install additional applications, obfuscate its behavior, propagate quickly, and so on. To identify and handle such malware, a security analyst can significantly benefit from identifying the family to which a malicious app belongs rather than only detecting that an app is malicious. To address these challenges, we present a novel machine learning-based Android malware detection and family-identification approach, RevealDroid, that operates without the need to perform complex program analyses or extract large sets of features. RevealDroid's selected features leverage categorized Android API usage, reflection-based features, and features from native binaries of apps. We assess RevealDroid for accuracy, efficiency, and obfuscation resilience using a large dataset consisting of more than 54,000 malicious and benign apps. Our experiments show that RevealDroid achieves an accuracy of 98% in detection of malware and an accuracy of 95% in determination of their families. We further demonstrate RevealDroid's superiority against state-of-the-art approaches. [URL of original paper: https://dl.acm.org/citation.cfm?id=3162625]
In this special discussion session on machine learning, the panel members discuss various issues related to building secure and low power neuromorphic systems. The security of neuromorphic systems may be discussed in terms of the reliability of the model, trust in the model, and security of the underlying hardware. The low power aspect of neuromorphic computing systems may be discussed in terms of the adoption of new devices and technologies, the adoption of new computational models, the development of heterogeneous computing frameworks, or dedicated engines for processing neuromorphic models. This session may include discussion of the design space of such supporting hardware, exploring tradeoffs between power/energy, security, scalability, hardware area, performance, and accuracy.
One of the challenges in supplying communities with wider access to scientific databases is the need for knowledge of database languages like Structured Query Language (SQL). Although the SQL language has been published in many forms, not everybody is able to write SQL queries. Another challenge is that it might not be practical to make the public aware of the structure of databases. There is a need for novice users to query relational databases using their natural language. To solve this problem, many natural language interfaces to structured databases have been developed. The goal is to provide a more intuitive method for generating database queries and delivering responses. Through social media, which makes it possible to interact with a wide section of the population, and with the help of natural language processing, researchers at the Atmospheric Radiation Measurement (ARM) Data Center at Oak Ridge National Laboratory (ORNL) have developed a concept to enable easy search and retrieval of data from several environmental data centers for the scientific community through social media. Using a machine learning framework that maps natural language text to thousands of datasets, instruments, variables, and data streams, the prototype system would allow users to request data through Twitter and receive a link (via tweet) to applicable data results on the project's search catalog tailored to their keywords. This automated identification of relevant data from various petascale archives at ORNL could increase convenience, access, and use of the project's data by the broader community. In this paper we discuss how some data-intensive projects at ORNL are using innovative ways to help in data discovery.
Physical Unclonable Functions (PUFs) have been designed for many security applications such as identification, authentication of devices, and key generation, especially for lightweight electronics. Traditional approaches to enhancing security, such as hash functions, may be expensive and resource dependent. However, modelling attacks using machine learning (ML) have shown that most PUFs are vulnerable. In this paper, a combination of a 32-bit current mirror PUF and a 16-bit arbiter PUF in 65nm CMOS technology is proposed to improve resilience against modelling attacks. Both constituent PUFs are individually vulnerable to machine learning attacks; the combination reduces the output prediction rate from 99.2% and 98.8% individually to 60%.
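For context, the sketch below shows the classic ML modelling attack on a plain arbiter PUF (parity-vector features plus a linear model) against a simulated PUF; the simulation parameters are illustrative assumptions and this is not the 65nm design evaluated in the paper.

```python
# Minimal sketch of a modelling attack on an arbiter PUF: transform each challenge into
# its parity (feature) vector and fit a linear model on challenge-response pairs.
# The PUF here is simulated with random delay weights purely for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_bits, n_crps = 16, 5000

def parity_vectors(challenges):
    """Phi_i = product over j >= i of (1 - 2*c_j), plus a constant term."""
    signs = 1 - 2 * challenges                               # 0/1 bits -> +1/-1
    phi = np.cumprod(signs[:, ::-1], axis=1)[:, ::-1]        # suffix products
    return np.hstack([phi, np.ones((len(challenges), 1))])

challenges = rng.integers(0, 2, size=(n_crps, n_bits))
w = rng.normal(size=n_bits + 1)                              # simulated stage delays
responses = (parity_vectors(challenges) @ w > 0).astype(int) # ideal arbiter responses

X_tr, X_te, y_tr, y_te = train_test_split(parity_vectors(challenges), responses, test_size=0.3)
attack = LogisticRegression(max_iter=2000).fit(X_tr, y_tr)
print("prediction rate:", attack.score(X_te, y_te))          # near 1.0 for a plain arbiter PUF
```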
In this paper, we report our work on using machine learning techniques to predict back-bending activity based on field data acquired in a local nursing home. The data are recorded by a privacy-aware compliance tracking system (PACTS). The objective of PACTS is to detect back-bending activities and issue real-time alerts to the participant when she bends her back excessively, which we hope could help the participant form good habits of using proper body mechanics when performing lifting/pulling tasks. We show that our algorithms can differentiate nursing staff's baseline and high-level bending activities by using human skeleton data without any expert rules.
Critical Infrastructures (CIs) use Supervisory Control And Data Acquisition (SCADA) systems for remote control and monitoring. Sophisticated security measures are needed to address malicious intrusions, which are steadily increasing in number and variety due to the massive spread of connectivity and the standardisation of open SCADA protocols. Traditional Intrusion Detection Systems (IDSs) cannot detect attacks that are not already present in their databases. Therefore, in this paper, we assess Machine Learning (ML) for intrusion detection in SCADA systems using a real data set collected from a gas pipeline system and provided by the Mississippi State University (MSU). The contribution of this paper is two-fold: 1) the evaluation of four techniques for missing data estimation and two techniques for data normalization, and 2) the assessment of Support Vector Machine (SVM) and Random Forest (RF) in terms of accuracy, precision, recall, and F1 score for intrusion detection. Two cases are differentiated: binary and categorical classification. Our experiments reveal that RF detects intrusions effectively, with an F1 score greater than 99%.
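A minimal sketch of such an evaluation pipeline, chaining imputation, normalization, and a Random Forest classifier, might look as follows; the feature names and file path are assumptions and the MSU dataset itself is not reproduced here.

```python
# Minimal sketch: impute missing values, normalize, classify with Random Forest, and
# report accuracy/precision/recall/F1. The data file and label column are assumptions.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

data = pd.read_csv("gas_pipeline.csv")                    # hypothetical export of the data set
X, y = data.drop(columns=["label"]), data["label"]        # binary or categorical labels

pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),           # one of several possible estimators
    ("scale", MinMaxScaler()),                            # one possible normalization choice
    ("clf", RandomForestClassifier(n_estimators=200, random_state=0)),
])

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)
pipeline.fit(X_tr, y_tr)
print(classification_report(y_te, pipeline.predict(X_te)))
```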
Software vulnerabilities are a primary concern in the IT security industry, as malicious hackers who discover these vulnerabilities can often exploit them for nefarious purposes. However, complex programs, particularly those written in a relatively low-level language like C, are difficult to fully scan for bugs, even when both manual and automated techniques are used. Since analyzing code and making sure it is securely written has proven to be a non-trivial task, both static analysis and dynamic analysis techniques have been heavily investigated; this work focuses on the former. The contribution of this paper is a demonstration of how it is possible to catch a large percentage of bugs by extracting text features from functions in C source code and analyzing them with a machine learning classifier. Relatively simple features (character count, character diversity, entropy, maximum nesting depth, arrow count, "if" count, "if" complexity, "while" count, and "for" count) were extracted from these functions, as were complex features (character n-grams, word n-grams, and suffix trees). Unexpectedly, the simple features performed better than the complex features (74% accuracy compared to 69% accuracy).
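The simple-feature idea can be sketched as follows: derive the listed lexical features from a C function's source text and feed them to a classifier; the tiny labelled examples and the choice of Random Forest are placeholders, not the paper's setup.

```python
# Minimal sketch of the simple lexical features listed above, computed from C source text.
import math
from collections import Counter
from sklearn.ensemble import RandomForestClassifier

def simple_features(func_src: str):
    counts = Counter(func_src)
    total = max(len(func_src), 1)
    entropy = -sum(c / total * math.log2(c / total) for c in counts.values())
    depth, max_depth = 0, 0
    for ch in func_src:                       # maximum brace-nesting depth
        if ch == "{":
            depth += 1
            max_depth = max(max_depth, depth)
        elif ch == "}":
            depth = max(depth - 1, 0)
    return [
        len(func_src),                        # character count
        len(counts),                          # character diversity
        entropy,                              # Shannon entropy of characters
        max_depth,                            # maximum nesting depth
        func_src.count("->"),                 # arrow count
        func_src.count("if"),                 # "if" count
        func_src.count("while"),              # "while" count
        func_src.count("for"),                # "for" count
    ]

# Hypothetical labelled functions: 1 = contains a known bug, 0 = clean.
funcs = ["int f(int *p){ if(p) return *p; return 0; }",
         "void g(char *d, char *s){ while(*s) *d++ = *s++; }"]
labels = [0, 1]
clf = RandomForestClassifier(random_state=0).fit([simple_features(f) for f in funcs], labels)
print(clf.predict([simple_features("for(;;){ if(x->y) break; }")]))
```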
The field of robotics has matured using artificial intelligence and machine learning such that intelligent robots are being developed in the form of autonomous vehicles. The anticipated widespread use of intelligent robots and their potential to do harm has raised interest in their security. This research evaluates a cyberattack on the machine learning policy of an autonomous vehicle by designing and attacking a robotic vehicle operating in a dynamic environment. The primary contribution of this research is an initial assessment of effective manipulation through an indirect attack on a robotic vehicle using the Q learning algorithm for real-time routing control. Secondly, the research highlights the effectiveness of this attack along with relevant artifact issues.
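As background for the attack surface discussed above, the sketch below shows a tabular Q-learning routing policy being learned on a toy grid, with a comment marking where an indirect attacker could bias the observed rewards; the environment and parameters are illustrative assumptions, not the paper's testbed.

```python
# Minimal sketch of the tabular Q-learning update on a toy 4x4 grid.
import numpy as np

n_states, n_actions = 16, 4                     # actions: 0=up, 1=down, 2=left, 3=right
alpha, gamma, eps = 0.1, 0.9, 0.1
Q = np.zeros((n_states, n_actions))
rng = np.random.default_rng(0)

def step(state, action):
    """Toy transition: move on the grid, reward 1 for reaching the goal state 15."""
    row, col = divmod(state, 4)
    row = min(max(row + (action == 1) - (action == 0), 0), 3)
    col = min(max(col + (action == 3) - (action == 2), 0), 3)
    nxt = row * 4 + col
    return nxt, 1.0 if nxt == 15 else 0.0, nxt == 15

for episode in range(500):
    s, done = 0, False
    while not done:
        a = rng.integers(n_actions) if rng.random() < eps else int(Q[s].argmax())
        s2, r, done = step(s, a)
        # An indirect attacker could bias r here (e.g. reward a detour) to warp the policy.
        Q[s, a] += alpha * (r + gamma * Q[s2].max() * (not done) - Q[s, a])
        s = s2

print(Q.argmax(axis=1).reshape(4, 4))           # learned routing action per grid cell
```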
The increasing publication of large amounts of theoretically anonymous data can lead to a number of attacks on people's privacy. Publishing sensitive data without exposing the data owners is generally not part of software developers' concerns. Data privacy regulations create an appropriate scenario for focusing on privacy from the perspective of the use and exploration of data that takes place in an organization. The increasing number of sanctions for privacy violations motivates a systematic comparison of three well-known machine learning algorithms in order to measure the usefulness of privacy-preserved data. The scope of the evaluation is extended by comparing them with a known privacy preservation metric. Different parameter scenarios and privacy levels are used. The use of publicly available implementations, the presentation of the methodology, the explanation of the experiments, and the analysis provide a working framework for the problem of privacy preservation. Problems are shown in measuring the usefulness of the data and its relationship with privacy preservation. The findings motivate the need to create metrics optimized for the privacy preferences of the data owners, since the risks of predicting sensitive attributes by means of machine learning techniques are not usually eliminated. In addition, it is shown that complete privacy preservation may be attainable yet cannot be measured, while adequate performance of the machine learning models of interest to the data-publishing organization must still be ensured.
Computational Intelligence (CI) algorithms/techniques are packaged in a variety of disparate frameworks/applications that all vary with respect to specific supported functionality and implementation decisions that drastically change performance. Developers looking to employ different CI techniques are faced with a series of trade-offs in selecting the appropriate library/framework. These include resource consumption, features, portability, interface complexity, ease of parallelization, etc. Considerations such as language compatibility and familiarity with a particular library make the choice of libraries even more difficult. The paper introduces MeetCI, an open source software framework for computational intelligence software design automation that facilitates the application design decisions and their software implementation process. MeetCI abstracts away specific framework details of CI techniques designed within a variety of libraries. This allows CI users to benefit from a variety of current frameworks without investigating the nuances of each library/framework. Using an XML file, developed in accordance with the specifications, the user can design a CI application generically, and utilize various CI software without having to redesign their entire technology stack. Switching between libraries in MeetCI is trivial and accessing the right library to satisfy a user's goals can be done easily and effectively. The paper discusses the framework's use in design of various applications. The design process is illustrated with four different examples from expert systems and machine learning domains, including the development of an expert system for security evaluation, two classification problems and a prediction problem with recurrent neural networks.
Traditional anti-virus technologies have failed to keep pace with the proliferation of malware due to the slow process of updating their signatures and heuristics. Similarly, time and resource limitations make manual analysis of every malware sample impractical. There is a need to learn from this vast quantity of data, containing cyber attack patterns, in an automated manner to proactively adapt to ever-evolving threats. Machine learning offers unique advantages in learning from past cyber attacks to handle future cyber threats. The purpose of this research is to propose a framework for multi-class classification of malware into well-known categories by applying different machine learning models to a corpus of malware analysis reports. These reports are generated in an automated manner through an open source malware sandbox. We applied extensive pre-modeling techniques for data cleaning, feature exploration, and feature engineering to prepare the training and test datasets. The best possible hyper-parameters are selected to build the machine learning models. These prepared datasets are then used to train the machine learning classifiers and to compare their prediction accuracy. Finally, the results are validated through a comprehensive 10-fold cross-validation methodology. The best results are achieved by a Gaussian Naive Bayes classifier, with an accuracy of 96% on a random train/test split and a 10-fold cross-validation accuracy of 91.2%. The framework can be deployed in an operational environment to learn from malware attacks and proactively adapt matching countermeasures.
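A minimal sketch of the validation step, assuming a placeholder feature matrix in place of features engineered from the sandbox reports, might look as follows.

```python
# Minimal sketch: Gaussian Naive Bayes validated with 10-fold cross-validation.
# The feature matrix and labels below are random placeholders, not real report features.
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score, StratifiedKFold

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 40))                 # placeholder for engineered report features
y = rng.integers(0, 5, size=1000)               # placeholder malware-category labels

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(GaussianNB(), X, y, cv=cv)
print("10-fold CV accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))
```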
With the exponential rise in cyber threats, organizations are now striving for better data mining techniques to analyze the security logs received from their IT infrastructures and ensure effective, automated cyber threat detection. Machine Learning (ML) based analytics for security machine data is the next emerging trend in cyber security, aimed at mining security data to uncover advanced targeted cyber threat actors and minimizing the operational overhead of maintaining static correlation rules. However, the selection of an optimal machine learning algorithm for security log analytics remains an impeding factor against the success of data science in cyber security, due to the risk of a large number of false-positive detections, especially in the case of large-scale or global Security Operations Center (SOC) environments. This brings a dire need for an efficient machine learning based cyber threat detection model capable of minimizing false detection rates. In this paper, we propose optimal machine learning algorithms with their implementation framework, based on analytical and empirical evaluations of gathered results, while using various prediction, classification, and forecasting algorithms.
With the rapid development of the information industry, applications of the Internet of Things, cloud computing, and artificial intelligence have greatly affected people's lives, and the number of network devices has grown explosively. At the same time, the more complex network environment has led to more serious network security problems, and traditional security solutions have become inefficient in this new situation. It is therefore an important task for the security industry to pursue technical progress and improve its detection and protection capabilities. Botnets have been one of the most important network security problems, especially in the last one or two years, and China has become one of the countries most endangered by botnets; the huge global impact of botnets has renewed attention to the detection problem. This paper focuses on the latest research achievements in botnet detection based on machine learning technology. It first describes the process of applying machine learning technology to cyberspace security research and introduces the structural characteristics of botnets, and then surveys machine learning approaches to botnet detection. The security features used by these solutions and the commonly used machine learning algorithms are analyzed and summarized. Finally, it summarizes the problems of existing solutions as well as the future development directions and challenges of machine learning technology in cyberspace security research.
Phishing e-mails are spam e-mails that aim to collect sensitive personal information about users via the network. Since the main purpose of this behavior is mostly to harm users financially, it is vital to detect these phishing or spam e-mails immediately to prevent unauthorized access to users' vital information. To detect phishing e-mails, a quick and robust classification method is important. Considering the billions of e-mails on the Internet, this classification process must be done in a limited time to analyze the results. In this work, we present some early results on the classification of spam email using deep learning and standard machine learning methods. We utilize word2vec to represent emails instead of using the popular keyword-based or other rule-based methods. The vector representations are then fed into a neural network to create a learning model. We tested our method on an open dataset and found accuracy levels over 96% with the deep learning classification methods, compared to the standard machine learning algorithms.
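A minimal sketch of the representation-and-classification idea, using gensim's Word2Vec and scikit-learn's MLPClassifier as stand-ins for the paper's models and two placeholder emails as data, might look as follows.

```python
# Minimal sketch: represent each email as the average of its word2vec vectors, then
# train a small feed-forward network. The example emails and labels are placeholders.
import numpy as np
from gensim.models import Word2Vec
from sklearn.neural_network import MLPClassifier

emails = [["win", "a", "free", "prize", "now"],
          ["meeting", "agenda", "attached", "for", "monday"]]
labels = [1, 0]                                   # 1 = spam/phishing, 0 = legitimate

w2v = Word2Vec(sentences=emails, vector_size=50, min_count=1, epochs=50, seed=0)

def email_vector(tokens):
    """Average the word vectors of the tokens the model knows about."""
    vecs = [w2v.wv[t] for t in tokens if t in w2v.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(w2v.wv.vector_size)

X = np.vstack([email_vector(e) for e in emails])
clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=2000, random_state=0).fit(X, labels)
print(clf.predict([email_vector(["free", "prize", "meeting"])]))
```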