Biblio
The enormous growth of Internet-based traffic exposes corporate networks to a wide variety of vulnerabilities. Intrusive traffic disrupts normal network operation by consuming corporate resources and time. Efficient ways of identifying, protecting against, and mitigating intrusive incidents enhance productivity. Because an Intrusion Detection System (IDS) is hosted both in the network and at the user machine level to monitor malicious traffic in the network and on individual computers, it is one of the critical components of network and host security. Unsupervised anomaly traffic detection techniques are improving over time. This research aims to find an efficient classifier that detects anomalous traffic in the NSL-KDD dataset with high accuracy and a minimal error rate by experimenting with five machine learning techniques. Five binary classifiers, Stochastic Gradient Descent, Random Forest, Logistic Regression, Support Vector Machine, and Sequential Model, are tested and validated to produce the result. The outcome demonstrates that the Random Forest classifier outperforms the other four classifiers both with and without normalization of the dataset.
Since the term “Fog Computing” was coined by Cisco Systems in 2012, the security and privacy issues of this promising paradigm have remained open challenges. Among the various security challenges, access control is a crucial concern for all cloud computing-like systems (e.g. fog computing, mobile edge computing) in the IoT era. Assigning the precise level of access in such an inherently scalable, heterogeneous and dynamic environment is not easy. This work defines the uncertainty challenge for the authentication phase of access control in fog computing: on the one hand, fog has a number of characteristics that amplify uncertainty in authentication, and on the other hand, applying traditional access control models does not result in a flexible and resilient solution. We therefore propose a novel prediction model based on an extension of the Attribute Based Access Control (ABAC) model. Our data-driven model is able to handle uncertainty in authentication and to consider the mobility of mobile edge devices when handling authentication. We built our model using and comparing four supervised classification algorithms, namely Decision Tree, Naïve Bayes, Logistic Regression and Support Vector Machine. Our model achieves 88.14% authentication accuracy using Logistic Regression.
Code churn has been successfully used to identify defect-inducing changes in software development. Our recent analysis of cross-release code churn showed that several design metrics exhibit moderate correlation with the number of defects in complex systems. The goal of this paper is to explore whether cross-release code churn can be used to identify critical design changes and contribute to the prediction of defects for software in evolution. In our case study, we used two types of data from consecutive releases of open-source projects, with and without cross-release code churn, to build standard prediction models. The prediction models were trained on earlier releases and tested on the following ones, and their performance was evaluated in terms of AUC, GM and the effort-aware measure Pop. The comparison of their performance was used to answer our research question. The results showed that the prediction model performs better when cross-release code churn is included. A practical implication of this research is that cross-release code churn can aid in the safe planning of the next release in software development.
Anomaly detection on security logs is receiving more and more attention. Authentication events are an important component of security logs, and producing trustworthy and accurate predictions minimizes the effort cyber-experts spend on false attacks. Observed events are classified into Normal, for legitimate user behavior, and Malicious, for malevolent actions. These classes are consistently and severely imbalanced, which makes the classification problem harder; in the commonly used Los Alamos dataset, the malicious class comprises only 0.00033% of the total. This work proposes a novel method to extract advanced composite features, and a supervised learning technique for classifying authentication logs reliably; the models are Random Forest, LogitBoost, Logistic Regression, and ultimately Majority Voting, which leverages the predictions of the previous models and gives the final prediction for each authentication event. We measure the performance of our experiments using the False Negative Rate and False Positive Rate. Overall, we achieve a False Negative Rate of 0 (i.e. no attack was missed) and, on average, a False Positive Rate of 0.0019.
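The majority-voting step and the two error rates named in the abstract above can be illustrated with a minimal sketch. The three model outputs and the ground-truth labels below are made-up toy data, and the function names `majority_vote` and `error_rates` are hypothetical; this is not the authors' implementation.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the most common label among one event's model predictions."""
    return Counter(predictions).most_common(1)[0][0]


def error_rates(y_true, y_pred, positive=1):
    """Return (false negative rate, false positive rate) for binary labels."""
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    positives = sum(1 for t in y_true if t == positive)
    negatives = len(y_true) - positives
    fnr = fn / positives if positives else 0.0
    fpr = fp / negatives if negatives else 0.0
    return fnr, fpr


# Toy per-event predictions from three models (1 = malicious, 0 = normal).
model_outputs = [
    [0, 0, 1, 1, 0, 1],  # hypothetical model A
    [0, 1, 1, 0, 0, 1],  # hypothetical model B
    [0, 0, 1, 1, 1, 0],  # hypothetical model C
]
y_true = [0, 0, 1, 1, 0, 0]

# zip(*model_outputs) regroups the predictions event by event.
y_vote = [majority_vote(votes) for votes in zip(*model_outputs)]
fnr, fpr = error_rates(y_true, y_vote)
print(y_vote, fnr, fpr)
```

With an odd number of binary voters there are no ties, so the vote is always well defined. On heavily imbalanced data such as that described above, FNR and FPR are more informative than plain accuracy.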
An important source of cyber-attacks is malware, which proliferates in different forms such as botnets. Botnet malware typically looks for vulnerable devices across the Internet, rather than targeting specific individuals, companies or industries. It attempts to infect as many connected devices as possible, using their resources for automated tasks that may cause significant economic and social harm while remaining hidden from the user and device. Thus, it becomes very difficult to detect such activity. A considerable amount of research has been conducted to detect and prevent botnet infestation. In this paper, we attempt to create a foundation for an anomaly-based intrusion detection system using a statistical learning method to improve network security and reduce human involvement in botnet detection. We focus on identifying the best features to detect botnet activity within network traffic using a lightweight logistic regression model. The network traffic is processed by Bro, a popular network monitoring framework which provides aggregate statistics about the packets exchanged between a source and destination over a certain time interval. These statistics serve as features for a logistic regression model responsible for classifying malicious and benign traffic. Our model is easy to implement and simple to interpret. We characterized and modeled 8 different botnet families separately and as a mixed dataset. Finally, we measured the performance of our model on multiple parameters using F1 score, accuracy and Area Under Curve (AUC).
Monitoring circuits are widely applied in radiation environments, and it is important to study circuit reliability under radiation effects. In this paper, an intelligent analysis method based on a Deep Belief Network (DBN) and Support Vector methods is proposed, based on analysis of radiation experiments on the monitoring circuit. The Total Ionizing Dose (TID) of the monitoring circuit is used to identify the circuit degradation trend. First, the output waveforms of the monitoring circuit are obtained by irradiating it at different TID levels. Subsequently, the Deep Belief Network model is trained to extract the features of the circuit signal. Finally, the Support Vector Machine (SVM) and Support Vector Regression (SVR) are applied to classify and predict the remaining useful life (RUL) of the monitoring circuit. According to the experimental results, the performance of DBN-SVM exceeds that of the DBN method for feature extraction and classification, and SVR is effective for predicting the degradation.
At a time when all it takes to open a Twitter account is a mobile phone, the act of authenticating information encountered on social media becomes very complex, especially when we lack measures to verify digital identities in the first place. Because the platform supports anonymity, fake news generated by dubious sources has been observed to travel much faster and farther than real news. Hence, we need valid measures to identify authors of misinformation to avert these consequences. Researchers propose different authorship attribution techniques to approach this kind of problem. However, because tweets are made up of only 280 characters, finding a suitable authorship attribution technique is a challenge. This research aims to classify authors of tweets by comparing machine learning methods such as logistic regression and naive Bayes. The processes of this application are fetching of tweets, pre-processing, feature extraction, and developing a machine learning model for classification. This paper illustrates the authorship text classification process using machine learning techniques. In total, 46,895 tweets were used as both training and testing data, and unique features specific to Twitter were extracted. Several steps were performed in the pre-processing phase, including removal of short texts, removal of stop-words and punctuation, and tokenizing and stemming of texts. This approach transforms the pre-processed data into a set of feature vectors in Python. Logistic regression and naive Bayes algorithms were applied to the set of feature vectors for the training and testing of the classifier. The logistic regression based classifier gave the highest accuracy of 91.1%, compared to the naive Bayes classifier with 89.8%.
This study seeks to investigate how the development of e-government services impacts cybersecurity. The study uses correlation and multiple regression to analyse two sets of global data: the e-government development index of the 2015 United Nations e-government survey and the 2015 International Telecommunication Union global cybersecurity development index (GCI 2015). After analysing the various contextual factors affecting e-government development, the study found that various composite measures of e-government development are significantly correlated with cybersecurity development. The study therefore contributes to the understanding of the relationship between e-government and cybersecurity development. The authors developed a model to highlight this relationship and have validated the model using empirical data. This is expected to provide guidance on specific dimensions of e-government services that will stimulate the development of cybersecurity. The study provides a basis for understanding the patterns in cybersecurity development and has implications for policy makers in developing trust and confidence for the adoption of e-government services.
Multi-robot transfer learning allows a robot to use data generated by a second, similar robot to improve its own behavior. The potential advantages are reducing the time of training and the unavoidable risks that exist during the training phase. Transfer learning algorithms aim to find an optimal transfer map between different robots. In this paper, we investigate, through a theoretical study of single-input single-output (SISO) systems, the properties of such optimal transfer maps. We first show that the optimal transfer learning map is, in general, a dynamic system. The main contribution of the paper is to provide an algorithm for determining the properties of this optimal dynamic map including its order and regressors (i.e., the variables it depends on). The proposed algorithm does not require detailed knowledge of the robots' dynamics, but relies on basic system properties easily obtainable through simple experimental tests. We validate the proposed algorithm experimentally through an example of transfer learning between two different quadrotor platforms. Experimental results show that an optimal dynamic map, with correct properties obtained from our proposed algorithm, achieves 60-70% reduction of transfer learning error compared to the cases when the data is directly transferred or transferred using an optimal static map.
Accurate short-term traffic flow forecasting is of great significance for real-time traffic control, guidance and management. The k-nearest neighbor (k-NN) model is a classic data-driven method which is relatively effective yet simple to implement for short-term traffic flow forecasting. In the conventional prediction mechanism of the k-NN model, the k nearest neighbors' outputs, weighted by the similarities between the current traffic flow vector and historical traffic flow vectors, are used directly to generate prediction values, so the prediction results are often not ideal. It is observed that there are always some outliers in the k nearest neighbors' outputs, which may have a bad influence on the prediction value, and that the local similarities between the current traffic flow and historical traffic flows at the current sampling period should have greater relevance to the prediction value. In this paper, we focus on improving the prediction mechanism of the k-NN model and propose a k-nearest neighbor locally search regression algorithm (k-LSR). The k-LSR algorithm uses a local search strategy to find the optimal nearest neighbors' outputs and weights them by local similarities to forecast short-term traffic flow, thereby improving the prediction mechanism of the k-NN model. The proposed algorithm is tested on actual data and compared with other algorithms in terms of performance. We use the root mean squared error (RMSE) as the evaluation indicator. The comparison results show that the k-LSR algorithm is more successful than the k-NN and k-nearest neighbor locally weighted regression (k-LWR) algorithms in forecasting short-term traffic flow, which demonstrates the superiority and practicability of the proposed algorithm.
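The conventional similarity-weighted k-NN prediction mechanism that the abstract above sets out to improve can be sketched as follows. This is only the baseline, not k-LSR, whose local search step is not specified here. Using inverse Euclidean distance as the similarity weight is an assumption, and the function names and toy history are hypothetical.

```python
import math


def knn_forecast(current, history, k=3, eps=1e-9):
    """Similarity-weighted k-NN forecast.

    history is a list of (flow_vector, next_value) pairs.
    Inverse Euclidean distance is assumed as the similarity measure.
    """
    # Rank historical vectors by distance to the current flow vector.
    scored = sorted((math.dist(current, vec), nxt) for vec, nxt in history)
    nearest = scored[:k]
    # eps avoids division by zero when a historical vector matches exactly.
    weights = [1.0 / (d + eps) for d, _ in nearest]
    weighted_sum = sum(w * value for w, (_, value) in zip(weights, nearest))
    return weighted_sum / sum(weights)


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


# Toy history: three consecutive flow readings and the reading that followed.
history = [
    ([10.0, 12.0, 14.0], 16.0),
    ([10.0, 12.0, 15.0], 17.0),
    ([30.0, 32.0, 34.0], 36.0),
    ([50.0, 52.0, 54.0], 56.0),
]
prediction = knn_forecast([10.0, 12.0, 14.0], history, k=2)
print(prediction)
```

Because every one of the k neighbors contributes to the weighted average, a single outlying neighbor output shifts the forecast, which is the weakness the abstract points out.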
In order to investigate the relationships among applied DC current, excitation source, excitation loop current, sensitivity and induced voltage of the detecting winding, and their effect on the performance of a magnetic modulator, this paper measures the initial permeability, maximum permeability, saturation magnetic induction intensity, remanent magnetic induction intensity, coercivity, saturated magnetic field intensity, magnetization curve, permeability curve and hysteresis loop of the 1J85 permalloy main core of the magnetic modulator using the ballistic method. On this foundation, the curve fitting tool of MATLAB is employed, and the multiple regression method is adopted to comprehensively compare and analyze the sum of squares due to error (SSE), coefficient of determination (R-square), degree-of-freedom adjusted coefficient of determination (Adjusted R-square), and root mean squared error (RMSE) of the fitting results. Finally, a B-H curve mathematical model based on the sum of an arc-hyperbolic sine function and a polynomial is established.
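The four goodness-of-fit statistics listed in the abstract above can be computed as in the sketch below. It assumes the common definitions in which the residual degrees of freedom equal the number of data points minus the number of fitted coefficients; the function name `fit_metrics` and the sample data are hypothetical.

```python
import math


def fit_metrics(y, y_hat, n_coeffs):
    """Return SSE, R-square, adjusted R-square and RMSE for a fitted curve.

    n_coeffs is the number of fitted coefficients; the residual degrees of
    freedom are assumed to be len(y) - n_coeffs.
    """
    n = len(y)
    mean_y = sum(y) / n
    sse = sum((a - b) ** 2 for a, b in zip(y, y_hat))  # sum of squares due to error
    sst = sum((a - mean_y) ** 2 for a in y)            # total sum of squares
    dof = n - n_coeffs                                 # residual degrees of freedom
    return {
        "sse": sse,
        "r_square": 1.0 - sse / sst,
        "adj_r_square": 1.0 - (sse / dof) / (sst / (n - 1)),
        "rmse": math.sqrt(sse / dof),
    }


# Toy measurements and fitted values from a hypothetical two-coefficient model.
measured = [1.0, 2.0, 3.0, 4.0, 5.0]
fitted = [1.1, 1.9, 3.2, 3.8, 5.0]
metrics = fit_metrics(measured, fitted, n_coeffs=2)
print(metrics)
```

Adjusted R-square penalizes extra coefficients, which matters when comparing candidate B-H models that have different numbers of terms.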
In order to solve the problem of millimeter wave (mm-wave) antenna impedance mismatch in 5G communication systems, a Particle Swarm Ant Colony Optimization (PSACO) algorithm is proposed to optimize the antenna patch parameters. It is shown that the proposed method can effectively achieve impedance matching at the 28 GHz center frequency, and that the return loss characteristic is clearly improved. At the same time, a nonlinear regression model is used to capture the nonlinear relationship between the resonant frequency and the patch parameters. The Elman Neural Network (Elman NN) model is used to verify the reliability of PSACO and the nonlinear regression model. Patch parameters optimized by PSACO were substituted into the nonlinear relationship, yielding an error within 2%. The method proposed in this paper improves efficiency in antenna design.
Over the years cybercriminals have misused the Domain Name System (DNS) - a critical component of the Internet - to gain profit. Despite this persisting trend, little empirical information about the security of Top-Level Domains (TLDs) and of the overall 'health' of the DNS ecosystem exists. In this paper, we present security metrics for this ecosystem and measure the operational values of such metrics using three representative phishing and malware datasets. We benchmark entire TLDs against the rest of the market. We explicitly distinguish these metrics from the idea of measuring security performance, because the measured values are driven by multiple factors, not just by the performance of the particular market player. We consider two types of security metrics: occurrence of abuse and persistence of abuse. In conjunction, they provide a good understanding of the overall health of a TLD. We demonstrate that attackers abuse a variety of free services with good reputation, affecting not only the reputation of those services, but of entire TLDs. We find that, when normalized by size, old TLDs like .com host more bad content than new generic TLDs. We propose a statistical regression model to analyze how the different properties of TLD intermediaries relate to abuse counts. We find that next to TLD size, abuse is positively associated with domain pricing (i.e. registries who provide free domain registrations witness more abuse). Last but not least, we observe a negative relation between the DNSSEC deployment rate and the count of phishing domains.
Generalized regression neural network clustering analysis is used to effectively cluster five kinds of network intrusion behavior modes. First, intrusion data is divided into five categories using the fuzzy C-means clustering algorithm. Then, the samples that are closest to the center of each class in the clustering results are taken as the training samples for the generalized regression neural network, and the output of the trained network is the intrusion category to which each sample belongs. The experimental results show that the new algorithm achieves higher classification accuracy for network intrusion types, which can provide more reliable data support for the prevention of network intrusion.
Researchers have proposed several security requirements identification methods for up-front requirements engineering (RE). However, in open source software (OSS) projects, developers use lightweight representations and refine requirements frequently by writing comments. They also tend to discuss security aspects in comments by providing code snippets, attachments, and external resource links. Since most security requirements identification methods in up-front RE are based on textual information retrieval techniques, these methods are not suitable for OSS projects or just-in-time RE. In our study, we propose a new model based on logistic regression to identify security requirements in OSS projects. We used five metrics to build security requirements identification models and tested the performance of these metrics by applying those models to three OSS projects. Our results show that four out of five metrics achieved high performance in intra-project testing.
Identity masking methods have been developed in recent years for use in multiple applications aimed at protecting privacy. There is only limited work, however, targeted at evaluating the effectiveness of these methods, with only a handful of studies testing identity masking effectiveness for human perceivers. Here, we employed human participants to evaluate identity masking algorithms on video data of drivers, which contains subtle movements of the face and head. We evaluated the effectiveness of the “personalized supervised bilinear regression method for Facial Action Transfer (FAT)” de-identification algorithm. We also evaluated an edge-detection filter, as an alternate “fill-in” method when face tracking failed due to abrupt or fast head motions. Our primary goal was to develop methods for human-based evaluation of the effectiveness of identity masking. To this end, we designed and conducted two experiments to address the effectiveness of masking in preventing recognition and in preserving action perception. 1- How effective is an identity masking algorithm? We conducted a face recognition experiment and employed Signal Detection Theory (SDT) to measure human accuracy and decision bias. The accuracy results show that both masks (FAT mask and edge-detection) are effective, but that neither completely eliminated recognition. However, the decision bias data suggest that both masks altered the participants' response strategy and made them less likely to affirm identity. 2- How effectively does the algorithm preserve actions? We conducted two experiments on facial behavior annotation. Results showed that masking had a negative effect on annotation accuracy for the majority of actions, with differences across action types. Notably, the FAT mask preserved actions better than the edge-detection mask. To our knowledge, this is the first study to evaluate a de-identification method aimed at preserving facial actions employing human evaluators in a laboratory setting.
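Signal Detection Theory separates accuracy (sensitivity, d') from decision bias (criterion, c), the two quantities referred to in the abstract above. The sketch below uses the standard equal-variance Gaussian formulas; the log-linear correction that keeps rates away from 0 and 1 is an assumption, as are the function name and the response counts, which are illustrative only and not data from the study.

```python
from statistics import NormalDist


def sdt_measures(hits, misses, false_alarms, correct_rejections):
    """Return (d_prime, criterion) from recognition response counts.

    A log-linear correction (add 0.5 to each count, 1 to each total) is
    assumed so that the inverse normal CDF is always defined.
    """
    hit_rate = (hits + 0.5) / (hits + misses + 1)
    fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1)
    z = NormalDist().inv_cdf  # inverse of the standard normal CDF
    d_prime = z(hit_rate) - z(fa_rate)
    criterion = -0.5 * (z(hit_rate) + z(fa_rate))
    return d_prime, criterion


# Illustrative counts: an unbiased observer and a conservative observer.
d_unbiased, c_unbiased = sdt_measures(40, 10, 10, 40)
d_conservative, c_conservative = sdt_measures(20, 30, 5, 45)
print(d_unbiased, c_unbiased)
print(d_conservative, c_conservative)
```

A positive criterion indicates a conservative response strategy, that is, a participant who is less likely to say "same identity" regardless of how discriminable the faces are.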
In this paper we present the results of research on automatic extremist text detection. For this purpose, an experimental dataset in the Russian language was created. Under Russian legislation, we cannot make it publicly available. We compared various classification methods (multinomial naive Bayes, logistic regression, linear SVM, random forest, and gradient boosting) and evaluated the contribution of differentiating features (lexical, semantic and psycholinguistic) to classification quality. The results of the experiments show that psycholinguistic and semantic features are promising for extremist text detection.
Multi-state logic presents a promising avenue for more-than-Moore scaling, since efficient implementation of multi-valued logic (MVL) can significantly reduce switching and interconnection requirements and result in significant benefits compared to binary CMOS. So far, traditional approaches lag behind binary CMOS due to: (a) reliance on logic decomposition approaches [4][5][6] that result in many multi-valued minterms [4], complex polynomials [5], and decision diagrams [6], which are difficult to implement, and (b) emulation of multi-valued computation and communication through binary switches and media, which requires data conversion and large circuits. In this paper, we propose a fundamentally different approach for MVL decomposition, merging concepts from data science and nanoelectronics to tackle these problems. (a) First, we perform linear regression on all inputs and outputs of a multi-valued function, and find an expression that fits most input and output combinations. For unmatched combinations, we perform successive regressions to find linear expressions. Next, using our novel visual pattern matching technique, we find conditions based on input and output conditions to select each expression. These expressions, along with the associated selection criteria, ensure that the correct output can be reached for all possible inputs of a specific function. Our selection of a regression model to find linear expressions, coefficients and conditions allows efficient hardware implementation. We discuss an approach for solving problem (b) and show an example of a quaternary sum circuit. Our estimates show a 65.6% saving in switching components compared with a 4-bit CMOS adder.
This paper introduces an ensemble model that solves the binary classification problem by combining basic Logistic Regression with two recent advanced paradigms: extreme gradient boosted decision trees (xgboost) and deep learning. To obtain the best result when integrating sub-models, we introduce a solution to split and select sets of features for sub-model training. In addition to the ensemble model, we propose a flexible, robust and highly scalable new scheme for building a composite classifier that simultaneously implements multiple layers of model decomposition and output aggregation to maximally reduce both the bias and variance (spread) components of classification errors. We demonstrate the power of our ensemble model on the problem of predicting the outcome of Hearthstone, a turn-based computer game, based on game state information. The excellent predictive performance of our model is confirmed by its second place in the final ranking among 188 competing teams.
This paper builds on previous research that selects proper surrogate nodes for a fast recovery mechanism in an industrial IoT (Internet of Things) environment, which uses a variety of sensors to collect data and exchange it in real time to create added value. We propose a way to automatically decide the number of surrogate nodes in differently deployed industrial IoT environments so as to minimize the system recovery time when a central server such as an IoT gateway fails. We use a network simulator to measure the recovery time as a function of the number of selected surrogate nodes, according to the sub-devices connected to the IoT gateway.
This paper proposes a highly scalable framework that can be applied to detect network anomalies at the per-flow level by constructing a meta-model for a family of machine learning algorithms or statistical data models. The approach is scalable and attainable because raw data needs to be accessed only once; it is processed, computed and transformed into a much smaller meta-model matrix that can reside in system RAM. The meta-model matrix can be calculated through disposable updating operations at the per-row level: once a per-flow record has been processed, it is no longer needed for calculating the meta-model matrix. While the proposed framework covers both Gaussian and non-Gaussian data, the focus of this work is on linear regression models. Specifically, a new concept called meta-model sufficient statistics is proposed to analyze a group of models, for which exact, not approximate, results are derived. In addition, the proposed framework can quickly discover an optimal statistical or computer model from a family of candidate models without the need to rescan the raw dataset. This suggests that an extremely efficient and effective theory and method are possible for big data security analysis.
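The general idea behind one-pass sufficient statistics for linear regression can be sketched as follows: each record updates a small cross-product matrix and is then discarded, and any candidate model over a subset of columns can be fitted exactly from that matrix alone. This illustrates ordinary least squares via the normal equations; it is not the paper's meta-model matrix, whose exact definition is not given here. All function names and the flow records are hypothetical.

```python
def update(gram, row):
    """Add one record's outer product to the running matrix.

    After this call the record itself is no longer needed.
    """
    for i, a in enumerate(row):
        for j, b in enumerate(row):
            gram[i][j] += a * b


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(a[i]) + [b[i]] for i in range(n)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_from_gram(gram, feature_idx, target_idx):
    """Least-squares coefficients using only the accumulated matrix."""
    xtx = [[gram[i][j] for j in feature_idx] for i in feature_idx]
    xty = [gram[i][target_idx] for i in feature_idx]
    return solve(xtx, xty)


# Hypothetical flow records: [1 (intercept), packets, duration, bytes],
# constructed so that bytes = 2 + 3 * packets exactly.
flows = [
    [1.0, 1.0, 2.0, 5.0],
    [1.0, 2.0, 1.0, 8.0],
    [1.0, 3.0, 4.0, 11.0],
    [1.0, 4.0, 3.0, 14.0],
    [1.0, 5.0, 6.0, 17.0],
]
gram = [[0.0] * 4 for _ in range(4)]
for record in flows:  # single pass over the raw data
    update(gram, record)

# Two candidate models fitted without rescanning the records.
coef_small = fit_from_gram(gram, feature_idx=[0, 1], target_idx=3)
coef_full = fit_from_gram(gram, feature_idx=[0, 1, 2], target_idx=3)
print(coef_small, coef_full)
```

The matrix size depends only on the number of columns, not on the number of flows, which is why it can stay in memory however large the raw dataset is.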
Novice programmers exhibit a repertoire of affective states over time when they are learning computer programming. The modeling of frustration is important because it signals the need for pedagogical intervention for a student who may otherwise lose confidence and interest in learning. In this paper, contextual and keystroke features of students within a Java tutoring system are used to detect the frustration of a student within a programming exercise session. Compared to the psychological sensors used in other studies, contextual and keystroke logs are less obtrusive, and the equipment used (a keyboard) is ubiquitous in most learning environments. Logistic regression with lasso regularization is used for the modeling to prevent over-fitting. The results showed that a model that uses only contextual and keystroke features achieved a prediction accuracy of 0.67 and a recall of 0.833. Thus, we conclude that it is possible to detect the frustration of a student with an adequate level of accuracy by distilling both the contextual and keystroke logs within the tutoring system.
Multichannel sensor systems are widely used in condition monitoring for effective failure prevention of critical equipment or processes. However, loss of sensor readings due to malfunctions of sensors and/or communication has long been a hurdle to reliable operations of such integrated systems. Moreover, asynchronous data sampling and/or limited data transmission are usually seen in multiple sensor channels. To reliably perform fault diagnosis and prognosis in such operating environments, a data recovery method based on functional principal component analysis (FPCA) can be utilized. However, traditional FPCA methods are not robust to outliers and their capabilities are limited in recovering signals with strongly skewed distributions (i.e., lack of symmetry). This paper provides a robust data-recovery method based on functional data analysis to enhance the reliability of multichannel sensor systems. The method not only considers the possibly skewed distribution of each channel of signal trajectories, but is also capable of recovering missing data for both individual and correlated sensor channels with asynchronous data that may be sparse as well. In particular, grand median functions, rather than classical grand mean functions, are utilized for robust smoothing of sensor signals. Furthermore, the relationship between the functional scores of two correlated signals is modeled using multivariate functional regression to enhance the overall data-recovery capability. An experimental flow-control loop that mimics the operation of coolant-flow loop in a multimodular integral pressurized water reactor is used to demonstrate the effectiveness and adaptability of the proposed data-recovery method. The computational results illustrate that the proposed method is robust to outliers and more capable than the existing FPCA-based method in terms of the accuracy in recovering strongly skewed signals. In addition, turbofan engine data are also analyzed to verify the capability of the proposed method in recovering non-skewed signals.