Biblio
Due to recent technological developments, home appliances and electric devices are equipped with high-performance hardware. As demand for hardware devices has increased, production has become internationalized to mass-produce hardware devices at low cost, and hardware vendors outsource their products to third-party vendors. Accordingly, malicious third-party vendors can easily insert malfunctions (also known as "hardware Trojans") into their products. In this paper, we design six kinds of hardware Trojans at the gate-level netlist and apply a neural-network (NN) based hardware-Trojan detection method to them. The designed hardware Trojans differ in their trigger circuits. We insert them into normal circuits and detect the hardware Trojans using a machine-learning-based hardware-Trojan detection method with neural networks. In our experiment, we trained the NN on Trojan-infected benchmarks and performed cross-validation to evaluate the learned NN. The experimental results demonstrate that the average TPR (True Positive Rate) is 72.9% and the average TNR (True Negative Rate) is 90.0%.
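A minimal sketch of the cross-validation loop this abstract describes, assuming per-net features (e.g., fan-in counts, distances to flip-flops) have already been extracted from the gate-level netlist into a matrix X, with y = 1 marking Trojan nets. The feature dimensionality, 5-fold split, and network size are illustrative assumptions, not the paper's exact setup.

```python
# Hedged sketch: MLP-based net classification with TPR/TNR from cross-validation.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import confusion_matrix

X = np.random.rand(1000, 11)                     # placeholder net-level feature matrix
y = (np.random.rand(1000) < 0.05).astype(int)    # placeholder Trojan-net labels

tprs, tnrs = [], []
for tr, te in StratifiedKFold(n_splits=5, shuffle=True, random_state=0).split(X, y):
    clf = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500, random_state=0)
    clf.fit(X[tr], y[tr])
    tn, fp, fn, tp = confusion_matrix(y[te], clf.predict(X[te]), labels=[0, 1]).ravel()
    tprs.append(tp / (tp + fn))   # True Positive Rate: Trojan nets detected
    tnrs.append(tn / (tn + fp))   # True Negative Rate: normal nets kept
print("avg TPR %.3f, avg TNR %.3f" % (np.mean(tprs), np.mean(tnrs)))
```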
Pre-silicon hardware Trojan detection has been studied for years. The most popular benchmark circuits are from Trust-Hub. Their common feature is that the probability of activating the hardware Trojans is very low. This has led to a series of machine-learning-based hardware Trojan detection methods that try to find nets whose signal probability of being 0 or 1 is very low. On the other hand, if the probability of activating a hardware Trojan is high, the Trojan can easily be found through behaviour simulation or functional test. This paper explores the "grey zone" between these two opposite scenarios: if the activation probability of a hardware Trojan is not low enough for machine learning to detect it and not high enough for behaviour simulation or functional test to find it, it can escape detection. Experiments show the existence of such hardware Trojans, and this paper suggests a new set of hardware Trojan benchmark circuits for future study.
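A small illustration of the activation-probability idea behind the "grey zone": a typical Trust-Hub-style trigger is an AND of several rare signals, so its firing probability under random stimuli shrinks exponentially with the trigger width; intermediate widths fall between what simulation can hit and what rarity-based detectors flag. The uniform-input assumption and the trigger widths below are illustrative, not taken from the paper.

```python
# Hedged sketch: Monte-Carlo estimate of trigger activation probability.
import random

def estimated_activation_probability(trigger_width, n_vectors=100_000):
    """Estimate P(all trigger inputs are 1) under i.i.d. uniform random bits."""
    hits = sum(
        all(random.random() < 0.5 for _ in range(trigger_width))
        for _ in range(n_vectors)
    )
    return hits / n_vectors

for width in (2, 4, 8, 16):
    print(width, estimated_activation_probability(width))
# Exact value is 0.5**width: width 2 (0.25) is easy to hit in simulation,
# width 16 (~1.5e-5) is very rare; the "grey zone" lies in between.
```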
With the frequent use of Wi-Fi and hotspots that provide a wireless Internet environment, awareness of and threats to wireless AP (Access Point) security are steadily increasing. Especially when unauthorized APs are used in company, government and military facilities, there is a high possibility of exposure to various viruses and hacking attacks. It is therefore necessary to detect unauthorized APs to protect information. In this paper, we use an RTT (Round Trip Time) data set to detect authorized and unauthorized APs in a wired/wireless integrated environment, and analyze it using machine learning algorithms including SVM (Support Vector Machine), C4.5, KNN (K Nearest Neighbors) and MLP (Multilayer Perceptron). Overall, KNN shows the highest accuracy.
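A hedged sketch of the classifier comparison described above, assuming RTT measurements have been arranged as fixed-length feature vectors in X with y = 1 for unauthorized APs. scikit-learn's DecisionTreeClassifier stands in for C4.5; the synthetic RTT values only mirror the intuition that an extra wireless hop inflates round-trip times.

```python
# Hedged sketch: comparing SVM, a C4.5-style tree, KNN, and MLP on RTT features.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(2.0, 0.3, (200, 10)),    # authorized APs: shorter RTTs (ms)
               rng.normal(3.5, 0.5, (200, 10))])   # unauthorized APs: extra hop
y = np.array([0] * 200 + [1] * 200)

models = {
    "SVM": SVC(kernel="rbf"),
    "C4.5-style tree": DecisionTreeClassifier(criterion="entropy"),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "MLP": MLPClassifier(hidden_layer_sizes=(32,), max_iter=1000),
}
for name, model in models.items():
    print(name, cross_val_score(model, X, y, cv=5).mean())
```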
In practice, defenders need a more efficient network detection approach that can quickly learn new network behavioural features for network intrusion detection. Deep learning techniques have been confirmed to outperform classic approaches in many applications. Accordingly, this study focused on network intrusion detection using convolutional neural networks (CNNs) based on LeNet-5 to classify network threats. The experimental results show that the prediction accuracy of intrusion detection reaches 99.65% when more than 10,000 samples are used. The overall accuracy rate is 97.53%.
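A minimal sketch of a LeNet-5-style CNN in Keras for this kind of task, assuming flow records are preprocessed into small single-channel "images" (here padded to 32x32) and that there are a handful of threat categories. The layer sizes follow LeNet-5; the input shape and class count are assumptions, not the paper's configuration.

```python
# Hedged sketch: LeNet-5-style classifier for network-threat categories.
from tensorflow import keras
from tensorflow.keras import layers

n_classes = 5  # e.g., normal traffic plus four attack categories (illustrative)
model = keras.Sequential([
    keras.Input(shape=(32, 32, 1)),
    layers.Conv2D(6, kernel_size=5, activation="tanh"),
    layers.AveragePooling2D(pool_size=2),
    layers.Conv2D(16, kernel_size=5, activation="tanh"),
    layers.AveragePooling2D(pool_size=2),
    layers.Flatten(),
    layers.Dense(120, activation="tanh"),
    layers.Dense(84, activation="tanh"),
    layers.Dense(n_classes, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
# model.fit(X_train, y_train, epochs=..., validation_data=(X_val, y_val))
```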
Tactics, Techniques and Procedures (TTPs) in the cyber domain are important threat information that describes the behavior and attack patterns of an adversary. Timely identification of associations between TTPs can lead to effective strategies for diagnosing Cyber Threat Actors (CTAs) and their attack vectors. This study profiles the prevalence and regularities in the TTPs of CTAs. We developed a machine-learning-based framework that takes Cyber Threat Intelligence (CTI) documents as input, selects the most prevalent TTPs with high information gain as features and, based on them, mines interesting regularities between TTPs using Association Rule Mining (ARM). We evaluated the proposed framework with publicly available TTP-based CTI documents. The results show that 28 TTPs are more prevalent than the others. Our system identified 155 interesting association rules among the TTPs of CTAs. A summary of these rules is given to effectively investigate threats in the network.
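A hedged sketch of the two stages described above: rank TTPs by information gain (mutual information) against a report label, keep only the most informative ones, then mine association rules over the surviving TTP columns with mlxtend. The DataFrame layout (one row per CTI report, one boolean column per TTP ID), the labels, and the thresholds are illustrative assumptions.

```python
# Hedged sketch: information-gain feature selection followed by ARM over TTPs.
import pandas as pd
from sklearn.feature_selection import mutual_info_classif
from mlxtend.frequent_patterns import apriori, association_rules

# reports: boolean TTP-occurrence matrix; labels: e.g., the associated threat actor
reports = pd.DataFrame({"T1059": [1, 1, 0, 1], "T1566": [1, 0, 1, 1],
                        "T1027": [0, 1, 1, 1], "T1003": [0, 0, 1, 0]}).astype(bool)
labels = pd.Series(["APT-A", "APT-A", "APT-B", "APT-B"])

ig = pd.Series(mutual_info_classif(reports, labels, discrete_features=True),
               index=reports.columns)
top_ttps = ig.nlargest(3).index                  # keep the most informative TTPs

frequent = apriori(reports[top_ttps], min_support=0.5, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.6)
print(rules[["antecedents", "consequents", "support", "confidence"]])
```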
Traditional security controls, such as firewalls, anti-virus and IDS, are ill-equipped to help IT security and response teams keep pace with the rapid evolution of the cyber threat landscape. Cyber Threat Intelligence (CTI) can help remediate this problem by exploiting non-traditional information sources, such as hacker forums and "dark-web" social platforms. Security and response teams can use the collected intelligence to identify emerging threats. Unfortunately, when manual analysis is used to extract CTI from non-traditional sources, it is a time-consuming, error-prone and resource-intensive process. We address these issues with a hybrid machine learning model that automatically searches through hacker forum posts, identifies the posts that are most relevant to cyber security and then clusters the relevant posts into estimations of the topics that the hackers are discussing. The first (identification) stage uses Support Vector Machines and the second (clustering) stage uses Latent Dirichlet Allocation. We tested our model, using data from an actual hacker forum, to automatically extract information about various threats such as leaked credentials, malicious proxy servers, and malware that evades AV detection. The results demonstrate that our method is an effective means of quickly extracting relevant and actionable intelligence that can be integrated with traditional security controls to increase their effectiveness.
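A minimal sketch of the two-stage pipeline this abstract describes: an SVM filters forum posts for security relevance, then LDA clusters the relevant posts into topics. The tiny post list, labels, and number of topics are illustrative only.

```python
# Hedged sketch: SVM relevance filter followed by LDA topic clustering.
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.decomposition import LatentDirichletAllocation

posts = ["selling fresh dumps and leaked credentials",
         "new crypter bypasses AV detection",
         "anyone going to the meetup this weekend?",
         "free socks5 proxy list, low detection rate"]
relevant = [1, 1, 0, 1]                            # labels for the identification stage

tfidf = TfidfVectorizer()
svm = LinearSVC().fit(tfidf.fit_transform(posts), relevant)
keep = [p for p in posts if svm.predict(tfidf.transform([p]))[0] == 1]

counts = CountVectorizer()
lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(counts.fit_transform(keep))
terms = counts.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    print(f"topic {k}:", [terms[i] for i in topic.argsort()[-5:]])
```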
The recently developed deep belief network (DBN) has been shown to be an effective methodology for solving time series forecasting problems. However, the performance of a DBN depends heavily on an appropriate setting of its hyperparameters. At present, random search, grid search and Bayesian optimization are the most common methods of hyperparameter optimization. As an alternative, a state-of-the-art derivative-free optimizer, negative correlation search (NCS), is adopted in this paper to decide the sizes of the DBN and the learning rates during training. A comparative analysis is performed between the proposed method and other popular techniques in a time series forecasting experiment based on two types of time series datasets. The experimental results statistically confirm that the proposed model obtains better prediction results than conventional neural network models.
Deep neural networks (DNNs) are effective machine learning models for solving a large class of recognition problems, including the classification of nonlinearly separable patterns. The applications of DNNs are, however, limited by the large size and high energy consumption of the networks. Recently, stochastic computation (SC) has been considered for implementing DNNs to reduce the hardware cost. However, it requires a large number of random number generators (RNGs), which lower the energy efficiency of the network. To overcome these limitations, we propose the design of an energy-efficient deep belief network (DBN) based on stochastic computation. An approximate SC activation unit (A-SCAU) is designed to implement different types of activation functions in the neurons. The A-SCAU is immune to signal correlations, so the RNGs can be shared among all neurons in the same layer with no accuracy loss. The area and energy of the proposed design are 5.27% and 3.31% (or 26.55% and 29.89%) of a 32-bit floating-point (or an 8-bit fixed-point) implementation. It is shown that the proposed SC-DBN design achieves a higher classification accuracy than the fixed-point implementation. The accuracy is only 0.12% lower than that of the floating-point design at a similar computation speed, but with a significantly lower energy consumption.
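For readers unfamiliar with stochastic computation, the background idea this design builds on is small enough to show directly: a value p in [0, 1] is encoded as a random bitstream whose fraction of 1s equals p, and multiplication reduces to a bitwise AND of two independent streams. This is only the textbook SC primitive, not a reproduction of the A-SCAU itself.

```python
# Hedged sketch: unipolar stochastic-computing multiplication via bitwise AND.
import numpy as np

rng = np.random.default_rng(0)

def to_stream(p, length=4096):
    """Unipolar stochastic encoding: each bit is 1 with probability p."""
    return rng.random(length) < p

a, b = 0.75, 0.40
product_stream = to_stream(a) & to_stream(b)   # an AND gate acts as a multiplier
print(product_stream.mean())                   # ~ a * b = 0.30, up to SC noise
```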
Cybersecurity plays a critical role in protecting sensitive information and the structural integrity of networked systems. As networked systems continue to expand in number as well as in complexity, so does the threat of malicious activity and the necessity for advanced cybersecurity solutions. Furthermore, both the quantity and quality of available data on malicious content, as well as the fact that malicious activity continuously evolves, make automated protection systems for this type of environment particularly challenging. Not only is data quality a concern, but the volume of data can be quite small for some of the classes. This creates a class imbalance in the data used to train a classifier; however, many classifiers are not well equipped to deal with class imbalance. One such example is detecting malicious HTML files from static features. Unfortunately, collecting malicious HTML files is extremely difficult, and the resulting data can be quite noisy because HTML files are often mislabeled. This paper evaluates a specific application that is afflicted by these modern cybersecurity challenges: detection of malicious HTML files. Previous work presented a general framework for malicious HTML file classification that we modify in this work to use a $\chi^2$ feature selection technique and the synthetic minority oversampling technique (SMOTE). We experiment with different classifiers (i.e., AdaBoost, GentleBoost, RobustBoost, RusBoost, and Random Forest) and a pure detection model (i.e., Isolation Forest). We benchmark the different classifiers using SMOTE on a real dataset that contains a limited number of malicious files (40) with respect to the normal files (7,263). It was found that the modified framework performed better than the previous framework. However, additional evidence was found to imply that algorithms which train on both the normal and malicious samples are likely overtraining to the malicious distribution. We demonstrate the likely overtraining by determining that a subset of the malicious files, while suspicious, did not come from a malicious source.
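A hedged sketch of the two additions described above, chi-squared feature selection and SMOTE oversampling, placed in front of one of the evaluated classifiers (Random Forest). Using imbalanced-learn's Pipeline keeps SMOTE confined to the training folds during cross-validation; the synthetic data only mimics the 40-vs-7,263 imbalance in spirit.

```python
# Hedged sketch: chi2 selection + SMOTE + Random Forest under class imbalance.
import numpy as np
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.integers(0, 50, size=(2000, 100))      # non-negative static HTML features
y = np.array([1] * 20 + [0] * 1980)            # heavily imbalanced labels

pipe = Pipeline([
    ("chi2", SelectKBest(chi2, k=30)),
    ("smote", SMOTE(k_neighbors=5, random_state=0)),
    ("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
])
print(cross_val_score(pipe, X, y, cv=5, scoring="roc_auc").mean())
```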
At a time when all it takes to open a Twitter account is a mobile phone, the act of authenticating information encountered on social media becomes very complex, especially when we lack measures to verify digital identities in the first place. Because the platform supports anonymity, fake news generated by dubious sources has been observed to travel much faster and farther than real news. Hence, we need valid measures to identify authors of misinformation to avert these consequences. Researchers propose different authorship attribution techniques to approach this kind of problem. However, because tweets are limited to 280 characters, finding a suitable authorship attribution technique is a challenge. This research aims to classify authors of tweets by comparing machine learning methods such as logistic regression and naive Bayes. The processes of this application are fetching of tweets, pre-processing, feature extraction, and developing a machine learning model for classification. This paper illustrates the text classification process for authorship using machine learning techniques. In total, 46,895 tweets were used as training and testing data, and unique features specific to Twitter were extracted. Several steps were performed in the pre-processing phase, including removal of short texts, removal of stop-words and punctuation, and tokenizing and stemming of texts. This approach transforms the pre-processed data into a set of feature vectors in Python. Logistic regression and naive Bayes algorithms were applied to the set of feature vectors for training and testing the classifier. The logistic-regression-based classifier gave the highest accuracy of 91.1%, compared to 89.8% for the naive Bayes classifier.
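A minimal sketch of the final classification step described above, assuming tweets have already been cleaned and paired with author labels. TF-IDF stands in for the paper's Twitter-specific features; the tiny corpus is illustrative only.

```python
# Hedged sketch: logistic regression vs. naive Bayes for authorship attribution.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

tweets = ["shipping the new build tonight", "coffee first, bugs later",
          "monday again, send help", "deploy went smooth, zero rollbacks",
          "who broke the CI pipeline", "weekend hike photos incoming"]
authors = ["dev_a", "dev_a", "dev_b", "dev_a", "dev_b", "dev_b"]

X = TfidfVectorizer().fit_transform(tweets)
X_tr, X_te, y_tr, y_te = train_test_split(X, authors, test_size=0.33, random_state=0)

for name, clf in [("logistic regression", LogisticRegression(max_iter=1000)),
                  ("naive Bayes", MultinomialNB())]:
    clf.fit(X_tr, y_tr)
    print(name, accuracy_score(y_te, clf.predict(X_te)))
```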
E-mail communication is one of today's indispensable communication methods. The widespread use of e-mail has brought about some problems, the most important of which is spam (unwanted) e-mail, often composed of advertisements or offensive content and sent without the recipient's request. This study aims to analyze the content of e-mails written in Turkish with the help of the Naive Bayes Classifier and the Vector Space Model from machine learning, to determine whether these e-mails are spam, and to classify them. Both methods are subjected to different evaluation criteria and their performances are compared.
Spam emails have been a chronic issue in computer security. They are very costly economically and extremely dangerous for computers and networks. Despite the emergence of social networks and other Internet-based information exchange venues, dependence on email communication has increased over the years, and this dependence has resulted in an urgent need to improve spam filters. Although many spam filters have been created to help prevent spam emails from entering a user's inbox, there is a lack of research focusing on text modifications. Currently, Naive Bayes is one of the most popular methods of spam classification because of its simplicity and efficiency. Naive Bayes is also very accurate; however, it is unable to correctly classify emails when they contain leetspeak or diacritics. Thus, in this paper, we implement a novel algorithm for enhancing the accuracy of the Naive Bayes spam filter so that it can detect text modifications and correctly classify the email as spam or ham. Our Python algorithm combines semantic-based, keyword-based, and machine learning algorithms to increase the accuracy of Naive Bayes compared to SpamAssassin by over two hundred percent. Additionally, we have discovered a relationship between the length of the email and the spam score, indicating that Bayesian poisoning, a controversial topic, is actually a real phenomenon utilized by spammers.
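A hedged sketch of the text-normalization idea motivating the algorithm above: map common leetspeak substitutions and strip diacritics before tokens reach a Naive Bayes filter, so that "v1agr4" and "viagra" count as the same feature. The substitution table is a small illustrative subset, not the paper's full rule set.

```python
# Hedged sketch: leetspeak and diacritic normalization before spam filtering.
import unicodedata

LEET_MAP = str.maketrans({"0": "o", "1": "i", "3": "e", "4": "a",
                          "5": "s", "7": "t", "@": "a", "$": "s"})

def normalize(text: str) -> str:
    # strip diacritics (e.g., é -> e), lowercase, then undo digit/symbol swaps
    no_diacritics = "".join(c for c in unicodedata.normalize("NFKD", text)
                            if not unicodedata.combining(c))
    return no_diacritics.lower().translate(LEET_MAP)

print(normalize("FRÉE V1AGR4 n0w!!!"))   # -> "free viagra now!!!"
```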
Short Message Service (SMS) is nowadays one of the most used ways of communication in the electronic world. While much research exists on email spam detection, far less insight is available into spam within SMS messages. This might be because the frequency of spam in these short messages is much lower than in emails. This paper presents different ways of analyzing spam in SMS and a new pre-processing method to obtain an accurate dataset of spam messages. This dataset was then used with different algorithms to find the best-performing one in terms of both accuracy and recall. The Random Forest algorithm was then implemented in a real-world application library written in C# for cross-platform .NET development. This library is capable of using a pre-built model to classify a new dataset as spam or ham.
The security of image steganography is an important basis for evaluating steganography algorithms. Steganography has recently made great progress in its long-term confrontation with steganalysis. To improve the security of image steganography, steganography must be able to resist detection by steganalysis algorithms. Traditional embedding-based steganography embeds the secret information into the content of an image, which unavoidably leaves a trace of the modification that can be detected by increasingly advanced machine-learning-based steganalysis algorithms. The concept of steganography without embedding (SWE), which does not need to modify the data of the carrier image, emerged to evade detection by machine-learning-based steganalysis algorithms. In this paper, we propose a novel image SWE method based on deep convolutional generative adversarial networks. We map the secret information into a noise vector and use the trained generator neural network model to generate the carrier image based on the noise vector. No modification or embedding operations are required during the process of image generation, and the information contained in the image can be extracted successfully by another neural network, called the extractor, after training. The experimental results show that this method has the advantages of highly accurate information extraction and a strong ability to resist detection by state-of-the-art image steganalysis algorithms.
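A small illustration of the "map the secret into the noise vector" step described above: each secret bit selects which half of the noise range the corresponding coordinate is drawn from, so the bits can be read back from the vector that conditions the generator. The interval scheme is an illustrative assumption; the DCGAN generator and the learned extractor network are not reproduced here.

```python
# Hedged sketch: encoding secret bits into a generator noise vector and back.
import numpy as np

rng = np.random.default_rng(0)

def bits_to_noise(bits):
    # bit 0 -> coordinate in [-1, -0.05], bit 1 -> coordinate in [0.05, 1]
    u = rng.uniform(0.05, 1.0, size=len(bits))   # keep away from 0 for robustness
    return np.where(np.array(bits) == 1, u, -u)

def noise_to_bits(z):
    return (z > 0).astype(int)

secret = [1, 0, 1, 1, 0, 0, 1, 0]
z = bits_to_noise(secret)          # z would be fed to the trained generator
print(noise_to_bits(z).tolist())   # -> [1, 0, 1, 1, 0, 0, 1, 0]
```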
This paper presents an assessment of continuous verification using linguistic style as a cognitive biometric. In stylometry, it is widely known that linguistic style is highly characteristic of authorship when represented in ways that capture authorial style at the character, lexical, syntactic, and semantic levels. In this work, we provide a contrast to previous efforts by formulating verification as a one-class classification problem using Isolation Forests. Our approach demonstrates the usefulness of this classifier for accurately verifying the genuine user, and yields recognition accuracy exceeding 98% using very small training samples of 50- and 100-character blocks.
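A hedged sketch of the one-class setup described above: fit an Isolation Forest only on the genuine user's text blocks, represented by character n-gram features, and flag unfamiliar blocks as anomalies. The block contents, n-gram range, and contamination value are illustrative assumptions.

```python
# Hedged sketch: one-class stylometric verification with an Isolation Forest.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import IsolationForest

genuine_blocks = ["the quick brown fox jumps over the lazy dog " * 3,
                  "pack my box with five dozen liquor jugs please " * 3,
                  "how vexingly quick daft zebras jump around here " * 3]
unknown_block = ["completely different style. short. terse. clipped. " * 3]

vec = TfidfVectorizer(analyzer="char", ngram_range=(2, 3))
X_genuine = vec.fit_transform(genuine_blocks).toarray()
X_unknown = vec.transform(unknown_block).toarray()

clf = IsolationForest(n_estimators=200, contamination=0.1, random_state=0)
clf.fit(X_genuine)                 # trained only on the genuine user's blocks
print(clf.predict(X_unknown))      # +1 = accepted as genuine, -1 = rejected
```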
It is well known that distributed cyber attacks simultaneously launched from many hosts have caused the most serious problems in recent years, including privacy leakage and denial of service. Thus, how to detect those attacks at an early stage has become an important and urgent topic in the cyber security community. For this purpose, recognizing C&C (Command & Control) communication between compromised bots and the C&C server becomes crucially important, because C&C communication occurs in the preparation phase of distributed attacks. Although signature-based attack detection has long been applied in practice, it is well known that it cannot efficiently deal with new kinds of attacks. In recent years, ML (machine learning)-based detection methods have been studied widely. In those methods, feature selection is obviously very important to the detection performance. We previously utilized up to 55 features to pick out C&C traffic in order to accomplish early detection of DDoS attacks. In this work, we try to answer the question: are all of those features really necessary? We mainly investigate how the detection performance changes as features are removed, starting from those with the lowest importance, and we try to clarify which features deserve attention for early detection of distributed attacks. We use honeypot data collected from 2008 to 2013. SVM (Support Vector Machine) and PCA (Principal Component Analysis) are utilized for feature selection, and SVM and RF (Random Forest) are used for building the classifiers. We find that the detection performance generally improves as more features are utilized. However, after the number of features reaches around 40, the detection performance does not change much even if more features are used. It is also verified that, in some specific cases, more features do not always mean better detection performance. We also discuss the 10 important features that have the biggest influence on classification.
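A hedged sketch of the pruning experiment described above: rank features by Random Forest importance, then retrain while keeping only the top-k and observe how accuracy moves. The synthetic 55-feature data only mirrors the shape of the problem, not the honeypot traffic itself, and RF importance stands in for the paper's SVM/PCA-based selection.

```python
# Hedged sketch: feature-importance ranking and feature-count pruning curve.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=55, n_informative=20,
                           random_state=0)
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
order = np.argsort(rf.feature_importances_)[::-1]     # most important first

for k in (55, 40, 20, 10):
    score = cross_val_score(RandomForestClassifier(n_estimators=200, random_state=0),
                            X[:, order[:k]], y, cv=5).mean()
    print(f"top {k} features: accuracy {score:.3f}")
```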
Two-factor authentication (2FA) popularly works by verifying something the user knows (a password) and something she possesses (a token, popularly instantiated with a smart phone). Conventional 2FA systems require extra interaction, such as typing a verification code, which is not very user-friendly. For improved user experience, recent work aims at zero-effort 2FA, in which a smart phone placed close to a computer (where the user enters her username/password into a browser to log into a server) automatically assists with the authentication. To prove her possession of the smart phone, the user needs to prove the phone is on the login spot, which reduces zero-effort 2FA to co-presence detection. In this paper, we propose SoundAuth, a secure zero-effort 2FA mechanism based on (two kinds of) ambient audio signals. SoundAuth looks for signs of proximity by having the browser and the smart phone compare both their surrounding sounds and certain unpredictable near-ultrasounds; if significant distinguishability is found, SoundAuth rejects the login request. We regard the comparison of the ambient signals as a classification problem and employ a machine learning technique to analyze the audio signals. Experiments with real login attempts show that SoundAuth is not only comparable to existing schemes in terms of utility, but also outperforms them in resilience to attacks. SoundAuth can be easily deployed as it is readily supported by most smart phones and major browsers.
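A small sketch of the co-presence intuition behind this scheme: compute the peak normalized cross-correlation between the browser's and the phone's ambient recordings and use it as a similarity feature for the classifier. The snippet lengths and threshold-style reading are illustrative; the actual system feeds several such features to a trained model rather than a single correlation score.

```python
# Hedged sketch: ambient-sound similarity as a co-presence feature.
import numpy as np

def peak_xcorr(a, b):
    a = (a - a.mean()) / (a.std() + 1e-12)
    b = (b - b.mean()) / (b.std() + 1e-12)
    return np.max(np.correlate(a, b, mode="full")) / len(a)

rng = np.random.default_rng(0)
ambient = rng.normal(size=2000)                        # short snippet of "room" sound
phone_nearby = ambient + 0.1 * rng.normal(size=2000)   # same room, slight noise
phone_elsewhere = rng.normal(size=2000)                # different room entirely

print(peak_xcorr(ambient, phone_nearby))     # close to 1: likely co-present
print(peak_xcorr(ambient, phone_elsewhere))  # much smaller: reject the login
```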
The attack graph approach is a common tool for network security analysis. However, analyzing attack graphs can be complicated and difficult depending on the size of the graph. This paper presents an approximate analysis approach for attack graphs based on Q-learning. First, we employ multi-host multi-stage vulnerability analysis (MulVAL) to generate an attack graph for a given network topology. Then we refine the attack graph and generate a simplified graph called a transition graph. Next, we use a Q-learning model to find possible attack routes that an attacker could use to compromise the security of the network. Finally, we evaluate the approach by applying it to a typical IT network scenario with specific services, network configurations, and vulnerabilities.
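A minimal sketch of the Q-learning step described above, run on a toy transition graph (node 0 is the attacker's foothold, node 3 the critical asset). The graph, rewards, step cost, and hyperparameters are illustrative assumptions, not the output of the MulVAL-based construction.

```python
# Hedged sketch: tabular Q-learning over a toy attack transition graph.
import random

graph = {0: [1, 2], 1: [3], 2: [4], 4: [3], 3: []}   # simplified transition graph
reward = {3: 100}                                    # reaching the asset pays off
alpha, gamma, episodes = 0.5, 0.9, 500

Q = {(s, a): 0.0 for s, succs in graph.items() for a in succs}
for _ in range(episodes):
    s = 0
    while graph[s]:
        a = random.choice(graph[s])                  # pure exploration for brevity
        future = max((Q[(a, n)] for n in graph[a]), default=0.0)
        # each hop costs 1, so shorter routes to the asset are preferred
        Q[(s, a)] += alpha * (reward.get(a, 0) - 1 + gamma * future - Q[(s, a)])
        s = a

# Greedy route extracted from Q approximates the most attractive attack path.
s, route = 0, [0]
while graph[s]:
    s = max(graph[s], key=lambda n: Q[(s, n)])
    route.append(s)
print(route)
```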
Feature extraction and feature selection are the first tasks in pre-processing input logs in order to detect cybersecurity threats and attacks using data mining techniques from the field of Artificial Intelligence. When it comes to analyzing heterogeneous data derived from different sources, these tasks are time-consuming and difficult to manage efficiently. In this paper, we present an approach for handling feature extraction and feature selection utilizing machine learning algorithms for security analytics of heterogeneous data derived from different network sensors. The approach is implemented in Apache Spark, using its Python API, pyspark.
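A hedged sketch of the PySpark side of such an approach: assemble per-record sensor fields into a feature vector and apply chi-squared feature selection. The column names and toy rows are illustrative assumptions; the paper's actual feature set comes from heterogeneous network sensors.

```python
# Hedged sketch: feature assembly and chi-squared selection in pyspark.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler, ChiSqSelector

spark = SparkSession.builder.appName("security-analytics-sketch").getOrCreate()

rows = [(120.0, 3.0, 1.0, 0), (4500.0, 40.0, 0.0, 1), (95.0, 2.0, 1.0, 0)]
df = spark.createDataFrame(rows, ["bytes_out", "failed_logins", "dns_ok", "label"])

assembler = VectorAssembler(inputCols=["bytes_out", "failed_logins", "dns_ok"],
                            outputCol="features")
selector = ChiSqSelector(numTopFeatures=2, featuresCol="features",
                         labelCol="label", outputCol="selected_features")

assembled = assembler.transform(df)
selected = selector.fit(assembled).transform(assembled)
selected.select("selected_features", "label").show(truncate=False)
```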