Biblio

Filters: Keyword is unsupervised learning
2020-05-22
Yan, Donghui, Wang, Yingjie, Wang, Jin, Wang, Honggang, Li, Zhenpeng.  2018.  K-nearest Neighbor Search by Random Projection Forests. 2018 IEEE International Conference on Big Data (Big Data). :4775–4781.
K-nearest neighbor (kNN) search has wide applications in many areas, including data mining, machine learning, statistics and many applied domains. Inspired by the success of ensemble methods and the flexibility of tree-based methodology, we propose random projection forests, rpForests, for kNN search. rpForests finds kNNs by aggregating results from an ensemble of random projection trees, each constructed recursively through a series of carefully chosen random projections. rpForests achieves remarkable accuracy in terms of fast decay in the missing rate of kNNs and in the discrepancy in the kNN distances. rpForests has a very low computational complexity. The ensemble nature of rpForests makes it easy to run in parallel on multicore or clustered computers; the running time is expected to be nearly inversely proportional to the number of cores or machines. We give theoretical insights by showing the exponential decay of the probability that neighboring points would be separated by ensemble random projection trees when the ensemble size increases. Our theory can be used to refine the choice of random projections in the growth of trees, and experiments show that the effect is remarkable.
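The ensemble idea in this abstract can be illustrated with a minimal sketch. It is not the authors' rpForests implementation: it assumes plain Gaussian random directions with a median split (the paper refines how projections are chosen), and the names `build_tree`, `build_forest`, `leaf_for` and `query_knn` are hypothetical.

```python
import math
import random


def build_tree(points, indices, leaf_size, rng):
    """Recursively split `indices` at the median of a random projection."""
    if len(indices) <= leaf_size:
        return {"leaf": indices}
    dim = len(points[0])
    # Assumption: an unrefined Gaussian random direction.
    direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    proj = {i: sum(d * x for d, x in zip(direction, points[i])) for i in indices}
    ordered = sorted(indices, key=proj.get)
    mid = len(ordered) // 2
    return {
        "direction": direction,
        "threshold": proj[ordered[mid]],
        "left": build_tree(points, ordered[:mid], leaf_size, rng),
        "right": build_tree(points, ordered[mid:], leaf_size, rng),
    }


def build_forest(points, n_trees=10, leaf_size=10, seed=0):
    """Grow `n_trees` independent random projection trees (leaf_size >= 1)."""
    rng = random.Random(seed)
    all_indices = list(range(len(points)))
    return [build_tree(points, all_indices, leaf_size, rng) for _ in range(n_trees)]


def leaf_for(tree, query):
    """Route a query down one tree and return the indices in its leaf."""
    while "leaf" not in tree:
        p = sum(d * x for d, x in zip(tree["direction"], query))
        tree = tree["left"] if p < tree["threshold"] else tree["right"]
    return tree["leaf"]


def query_knn(forest, points, query, k):
    """Pool the leaf candidates from every tree, then rank them exactly."""
    candidates = set()
    for tree in forest:
        candidates.update(leaf_for(tree, query))
    return sorted(candidates, key=lambda i: math.dist(points[i], query))[:k]
```

A single tree can separate true neighbors with one unlucky split; pooling the leaves of several independently grown trees is what lowers the chance of missing them, at the cost of a larger candidate set to rank.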
2020-05-18
Sel, İlhami, Hanbay, Davut.  2019.  E-Mail Classification Using Natural Language Processing. 2019 27th Signal Processing and Communications Applications Conference (SIU). :1–4.
Thanks to the rapid increase in technology and electronic communications, e-mail has become a serious communication tool. In many applications such as business correspondence, reminders, academic notices, and web page memberships, e-mail is used as the primary means of communication. Even if we ignore spam e-mails, hundreds of e-mails are still received every day. In order to determine the importance of received e-mails, the subject or content of each e-mail must be checked. In this study we propose an unsupervised system to classify received e-mails. The coordinates of received e-mails are determined by a natural language processing method, the Word2Vec algorithm. According to their similarities, the processed data are grouped by the k-means algorithm with an unsupervised training model. In this study, 10517 e-mails were used in training. The success of the system is tested on a test group of 200 e-mails. In the test phase, the M3 model (window size 3, min. word frequency 10, skip-gram) achieved the highest success rate (91%). The obtained results are evaluated in Section VI.
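The grouping step described here can be sketched with a small k-means (Lloyd's algorithm) routine. This is only an illustration under stated assumptions: the input vectors stand in for e-mail embeddings that a Word2Vec model would produce, the embedding step itself is not shown, and the function name `kmeans` and its parameters are hypothetical rather than taken from the paper.

```python
import math
import random


def kmeans(vectors, k, n_iter=50, seed=0):
    """Group vectors into k clusters with Lloyd's algorithm."""
    rng = random.Random(seed)
    # Assumption: initial centroids are k distinct input vectors chosen at random.
    centroids = [list(v) for v in rng.sample(vectors, k)]
    labels = [0] * len(vectors)
    for _ in range(n_iter):
        # Assignment step: attach each vector to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(v, centroids[c]))
            for v in vectors
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids
```

In practice each e-mail would first be reduced to a single vector (for example by averaging its word vectors, which is an assumption here and not a detail stated in the abstract) before being passed to `kmeans`.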
2020-05-11
Mirza, Ali H., Cosan, Selin.  2018.  Computer network intrusion detection using sequential LSTM Neural Networks autoencoders. 2018 26th Signal Processing and Communications Applications Conference (SIU). :1–4.
In this paper, we introduce a sequential autoencoder framework using long short-term memory (LSTM) neural networks for computer network intrusion detection. We exploit the dimensionality reduction and feature extraction property of the autoencoder framework to efficiently carry out the reconstruction process. Furthermore, we use the LSTM networks to handle the sequential nature of the computer network data. We assign a threshold value based on cross-validation in order to classify whether the incoming network data sequence is anomalous or not. Moreover, the proposed framework can work on both fixed and variable length data sequences and works efficiently for unforeseen and unpredictable network attacks. We then also use the unsupervised version of the LSTM, GRU, Bi-LSTM and Neural Networks. Through a comprehensive set of experiments, we demonstrate that our proposed sequential intrusion detection framework performs well and is dynamic, robust and scalable.
2020-01-27
Farag, Nadine, El-Seoud, Samir Abou, McKee, Gerard, Hassan, Ghada.  2019.  Bullying Hurts: A Survey on Non-Supervised Techniques for Cyber-Bullying Detection. Proceedings of the 2019 8th International Conference on Software and Information Engineering. :85–90.
The contemporary period is scarred by the predominant place of social media in everyday life. Despite social media being a useful tool for communication and social gathering, it also offers opportunities for harmful criminal activities. One of these activities is cyber-bullying, enabled through the abuse and mistreatment of the internet as a means of bullying others virtually. As a way of minimising this occurrence, research into computer-based detection of cyber-bullying is carried out by the scientific research community. An extensive literature search shows that supervised learning techniques are the most commonly used methods for cyber-bullying detection. However, some non-supervised techniques and other approaches have proven to be effective for cyber-bullying detection. This paper, therefore, surveys recent research on non-supervised techniques and offers some suggestions for future research in textual-based cyber-bullying detection, including detecting roles, detecting emotional state, automated annotation and stylometric methods.
Hsu, Hsiao-Tzu, Jong, Gwo-Jia, Chen, Jhih-Hao, Jhe, Ciou-Guo.  2019.  Improve Iot Security System Of Smart-Home By Using Support Vector Machine. 2019 IEEE 4th International Conference on Computer and Communication Systems (ICCCS). :674–677.
The traditional smart-home is designed to integrate the concept of the Internet of Things (IoT) into our home environment and to improve the comfort of the home. It connects electrical products and household goods to the network, and then monitors and controls them. However, this paper takes home safety as the main axis of research. It combines the established concept of the smart-home with machine learning technology to improve the whole smart-home system. Through systematic self-learning, it automatically figures out whether a situation is normal or abnormal, and reports abnormal situations to remind building occupants of safety. At the same time, it saves human resource costs. This paper first makes a rules table as the basic criteria, then manually labels a part of the data collected by the traditional Internet of Things smart-home, which includes the opening and closing of doors and windows, the starting and stopping of motors, the connection and interruption of the system, and the time at which each datum is sent. It then uses the Support Vector Machine (SVM) algorithm to classify, build models, and train them. The resulting model is applied to our smart-home system. Finally, we verify the accuracy of anomaly reporting in our system.
2020-01-21
Aldairi, Maryam, Karimi, Leila, Joshi, James.  2019.  A Trust Aware Unsupervised Learning Approach for Insider Threat Detection. 2019 IEEE 20th International Conference on Information Reuse and Integration for Data Science (IRI). :89–98.

With the rapidly increasing connectivity in cyberspace, insider threat is becoming a huge concern. Insider threat detection from system logs poses a tremendous challenge for human analysts. Analyzing log files of an organization is a key component of an insider threat detection and mitigation program. Emerging machine learning approaches show tremendous potential for performing complex and challenging data analysis tasks that would benefit the next generation of insider threat detection systems. However, with huge sets of heterogeneous data to analyze, applying machine learning techniques effectively and efficiently to such a complex problem is not straightforward. In this paper, we extract a concise set of features from the system logs while trying to prevent loss of meaningful information and to provide accurate and actionable intelligence. We investigate two unsupervised anomaly detection algorithms for insider threat detection and draw a comparison between different structures of the system logs, including a daily dataset and a periodically aggregated one. We use the generated anomaly score from the previous cycle as the trust score of each user fed to the next period's model and show its importance and impact in detecting insiders. Furthermore, we consider the psychometric score of users in our model and check its effectiveness in predicting insiders. As far as we know, our model is the first one to take the psychometric score of users into consideration for insider threat detection. Finally, we evaluate our proposed approach on the CERT insider threat dataset (v4.2) and show how it outperforms previous approaches.

2019-08-12
Wang, Bingning, Liu, Kang, Zhao, Jun.  2018.  Deep Semantic Hashing with Multi-Adversarial Training. Proceedings of the 27th ACM International Conference on Information and Knowledge Management. :1453–1462.
As the amount of data has grown rapidly over recent decades, binary hashing has become an attractive approach for fast search over large databases, in which high-dimensional data such as images, video or text is mapped into a low-dimensional binary code. Searching in this Hamming space is extremely efficient and is independent of the data size. Many methods have been proposed to learn this binary mapping. However, to make the binary codes conserve the input information, previous works mostly resort to mean squared error, which is prone to losing a lot of input information [11]. On the other hand, most previous works adopt a norm constraint or approximation on the hidden representation to make it as close as possible to binary, but the norm constraint is so strict that it harms the expressiveness and flexibility of the code. In this paper, to generate desirable binary codes, we introduce two adversarial training procedures to the hashing process. We replace the L2 reconstruction error with an adversarial training process to make the codes preserve their input information, and we apply another adversarial learning discriminator on the hidden codes to make them approximately binary. With the adversarial training process, the generated codes get close to binary while also conserving the input information. We conduct comprehensive experiments on both supervised and unsupervised hashing applications and achieve new state-of-the-art results on many image hashing benchmarks.
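To make the search side of binary hashing concrete, the following sketch shows how real-valued hidden representations could be turned into binary codes and compared by Hamming distance. It assumes a simple sign threshold for binarization and a linear scan over the database; the adversarial training that the paper uses to learn the codes is not shown, and the names `binarize`, `hamming` and `nearest_codes` are hypothetical.

```python
def binarize(vector):
    """Pack a real-valued vector into an integer code, one bit per dimension."""
    code = 0
    for value in vector:
        # Assumption: a plain sign threshold decides each bit.
        code = (code << 1) | (1 if value > 0 else 0)
    return code


def hamming(a, b):
    """Number of bit positions in which two codes differ."""
    return bin(a ^ b).count("1")


def nearest_codes(query_code, database_codes, k):
    """Indices of the k database codes closest to the query in Hamming space."""
    order = sorted(
        range(len(database_codes)),
        key=lambda i: hamming(query_code, database_codes[i]),
    )
    return order[:k]
```

Each comparison is one XOR plus a bit count, which is why search over such codes is cheap; production systems often go further and use the code as a hash-table key instead of scanning.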
2019-03-15
Bian, R., Xue, M., Wang, J..  2018.  Building Trusted Golden Models-Free Hardware Trojan Detection Framework Against Untrustworthy Testing Parties Using a Novel Clustering Ensemble Technique. 2018 17th IEEE International Conference On Trust, Security And Privacy In Computing And Communications/ 12th IEEE International Conference On Big Data Science And Engineering (TrustCom/BigDataSE). :1458–1463.

As a result of the globalization of integrated circuits (ICs) design and fabrication process, ICs are becoming vulnerable to hardware Trojans. Most of the existing hardware Trojan detection works suppose that the testing stage is trustworthy. However, testing parties may conspire with malicious attackers to modify the results of hardware Trojan detection. In this paper, we propose a trusted and robust hardware Trojan detection framework against untrustworthy testing parties exploiting a novel clustering ensemble method. The proposed technique can expose the malicious modifications on Trojan detection results introduced by untrustworthy testing parties. Compared with the state-of-the-art detection methods, the proposed technique does not require fabricated golden chips or simulated golden models. The experiment results on ISCAS89 benchmark circuits show that the proposed technique can resist modifications robustly and detect hardware Trojans with decent accuracy (up to 91%).

2019-01-31
Wang, Siqi, Zeng, Yijie, Liu, Qiang, Zhu, Chengzhang, Zhu, En, Yin, Jianping.  2018.  Detecting Abnormality Without Knowing Normality: A Two-Stage Approach for Unsupervised Video Abnormal Event Detection. Proceedings of the 26th ACM International Conference on Multimedia. :636–644.

Abnormal event detection in video surveillance is a valuable but challenging problem. Most methods adopt a supervised setting that requires collecting videos with only normal events for training. However, very few attempts are made under unsupervised setting that detects abnormality without priorly knowing normal events. Existing unsupervised methods detect drastic local changes as abnormality, which overlooks the global spatio-temporal context. This paper proposes a novel unsupervised approach, which not only avoids manually specifying normality for training as supervised methods do, but also takes the whole spatio-temporal context into consideration. Our approach consists of two stages: First, normality estimation stage trains an autoencoder and estimates the normal events globally from the entire unlabeled videos by a self-adaptive reconstruction loss thresholding scheme. Second, normality modeling stage feeds the estimated normal events from the previous stage into one-class support vector machine to build a refined normality model, which can further exclude abnormal events and enhance abnormality detection performance. Experiments on various benchmark datasets reveal that our method is not only able to outperform existing unsupervised methods by a large margin (up to 14.2% AUC gain), but also favorably yields comparable or even superior performance to state-of-the-art supervised methods.

2018-11-19
Zhao, Zhi-Lin, Wang, Chang-Dong, Lin, Kun-Yu, Lai, Jian-Huang.  2017.  Missing Value Learning. Proceedings of the 2017 ACM on Conference on Information and Knowledge Management. :2427–2430.

Missing values are common in many machine learning problems, and much effort has been made to handle missing data to improve the performance of the learned model. Sometimes, our task is not to train a model using unlabeled/labeled data with missing values but to process examples according to the values of some specified features. So, there is an urgent need to develop a method to predict those missing values. In this paper, we focus on learning from the known values to estimate each missing value as closely as possible to the true one. It is difficult to predict missing values because we do not know the structure of the data matrix and some missing values may relate to other missing values. We solve the problem by recovering the complete data matrix under three reasonable constraints: feature relationship, upper recovery error bound and class relationship. The proposed algorithm can deal with both unlabeled and labeled data, and a generative adversarial idea is used on labeled data to transfer knowledge. Extensive experiments have been conducted to show the effectiveness of the proposed algorithms.

2018-09-05
Chen, Yizheng, Nadji, Yacin, Kountouras, Athanasios, Monrose, Fabian, Perdisci, Roberto, Antonakakis, Manos, Vasiloglou, Nikolaos.  2017.  Practical Attacks Against Graph-based Clustering. Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security. :1125–1142.
Graph modeling allows numerous security problems to be tackled in a general way; however, little work has been done to understand the ability of graph-based systems to withstand adversarial attacks. We design and evaluate two novel graph attacks against a state-of-the-art network-level, graph-based detection system. Our work highlights areas in adversarial machine learning that have not yet been addressed, specifically: graph-based clustering techniques, and a global feature space where realistic attackers without perfect knowledge must be accounted for (by the defenders) in order to be practical. Even though less informed attackers can evade graph clustering with low cost, we show that some practical defenses are possible.
2018-07-06
Biggio, Battista, Rieck, Konrad, Ariu, Davide, Wressnegger, Christian, Corona, Igino, Giacinto, Giorgio, Roli, Fabio.  2014.  Poisoning Behavioral Malware Clustering. Proceedings of the 2014 Workshop on Artificial Intelligent and Security Workshop. :27–36.
Clustering algorithms have become a popular tool in computer security to analyze the behavior of malware variants, identify novel malware families, and generate signatures for antivirus systems. However, the suitability of clustering algorithms for security-sensitive settings has been recently questioned by showing that they can be significantly compromised if an attacker can exercise some control over the input data. In this paper, we revisit this problem by focusing on behavioral malware clustering approaches, and investigate whether and to what extent an attacker may be able to subvert these approaches through a careful injection of samples with poisoning behavior. To this end, we present a case study on Malheur, an open-source tool for behavioral malware clustering. Our experiments not only demonstrate that this tool is vulnerable to poisoning attacks, but also that it can be significantly compromised even if the attacker can only inject a very small percentage of attacks into the input data. As a remedy, we discuss possible countermeasures and highlight the need for more secure clustering algorithms.
2018-04-04
Nawaratne, R., Bandaragoda, T., Adikari, A., Alahakoon, D., Silva, D. De, Yu, X..  2017.  Incremental knowledge acquisition and self-learning for autonomous video surveillance. IECON 2017 - 43rd Annual Conference of the IEEE Industrial Electronics Society. :4790–4795.

The world is witnessing a remarkable increase in the usage of video surveillance systems. Besides fulfilling an imperative security and safety purpose, it also contributes towards operations monitoring, hazard detection and facility management in industry/smart factory settings. Most existing surveillance techniques use hand-crafted features analyzed using standard machine learning pipelines for action recognition and event detection. A key shortcoming of such techniques is the inability to learn from unlabeled video streams. The entire video stream is unlabeled when the requirement is to detect irregular, unforeseen and abnormal behaviors, anomalies. Recent developments in intelligent high-level video analysis have been successful in identifying individual elements in a video frame. However, the detection of anomalies in an entire video feed requires incremental and unsupervised machine learning. This paper presents a novel approach that incorporates high-level video analysis outcomes with incremental knowledge acquisition and self-learning for autonomous video surveillance. The proposed approach is capable of detecting changes that occur over time and separating irregularities from re-occurrences, without the prerequisite of a labeled dataset. We demonstrate the proposed approach using a benchmark video dataset and the results confirm its validity and usability for autonomous video surveillance.

2018-03-26
Afshar, Ardavan, Ho, Joyce C., Dilkina, Bistra, Perros, Ioakeim, Khalil, Elias B., Xiong, Li, Sunderam, Vaidy.  2017.  CP-ORTHO: An Orthogonal Tensor Factorization Framework for Spatio-Temporal Data. Proceedings of the 25th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems. :67:1–67:4.

Extracting patterns and deriving insights from spatio-temporal data finds many target applications in various domains, such as in urban planning and computational sustainability. Due to their inherent capability of simultaneously modeling the spatial and temporal aspects of multiple instances, tensors have been successfully used to analyze such spatio-temporal data. However, standard tensor factorization approaches often result in components that are highly overlapping, which hinders the practitioner's ability to interpret them without advanced domain knowledge. In this work, we tackle this challenge by proposing a tensor factorization framework, called CP-ORTHO, to discover distinct and easily-interpretable patterns from multi-modal, spatio-temporal data. We evaluate our approach on real data reflecting taxi drop-off activity. CP-ORTHO provides more distinct and interpretable patterns than prior art, as measured via relevant quantitative metrics, without compromising the solution's accuracy. We observe that CP-ORTHO is fast, in that it achieves this result in 5x less time than the most accurate competing approach.

2018-02-06
Pappu, Aasish, Blanco, Roi, Mehdad, Yashar, Stent, Amanda, Thadani, Kapil.  2017.  Lightweight Multilingual Entity Extraction and Linking. Proceedings of the Tenth ACM International Conference on Web Search and Data Mining. :365–374.

Text analytics systems often rely heavily on detecting and linking entity mentions in documents to knowledge bases for downstream applications such as sentiment analysis, question answering and recommender systems. A major challenge for this task is to be able to accurately detect entities in new languages with limited labeled resources. In this paper we present an accurate and lightweight, multilingual named entity recognition (NER) and linking (NEL) system. The contributions of this paper are three-fold: 1) Lightweight named entity recognition with competitive accuracy; 2) Candidate entity retrieval that uses search click-log data and entity embeddings to achieve high precision with a low memory footprint; and 3) efficient entity disambiguation. Our system achieves state-of-the-art performance on TAC KBP 2013 multilingual data and on English AIDA CONLL data.

2017-12-28
Boucher, A., Badri, M..  2017.  Predicting Fault-Prone Classes in Object-Oriented Software: An Adaptation of an Unsupervised Hybrid SOM Algorithm. 2017 IEEE International Conference on Software Quality, Reliability and Security (QRS). :306–317.

Many fault-proneness prediction models have been proposed in literature to identify fault-prone code in software systems. Most of the approaches use fault data history and supervised learning algorithms to build these models. However, since fault data history is not always available, some approaches also suggest using semi-supervised or unsupervised fault-proneness prediction models. The HySOM model, proposed in literature, uses function-level source code metrics to predict fault-prone functions in software systems, without using any fault data. In this paper, we adapt the HySOM approach for object-oriented software systems to predict fault-prone code at class-level granularity using object-oriented source code metrics. This adaptation makes it easier to prioritize the efforts of the testing team as unit tests are often written for classes in object-oriented software systems, and not for methods. Our adaptation also generalizes one main element of the HySOM model, which is the calculation of the source code metrics threshold values. We conducted an empirical study using 12 public datasets. Results show that the adaptation of the HySOM model for class-level fault-proneness prediction improves the consistency and the performance of the model. We additionally compared the performance of the adapted model to supervised approaches based on the Naive Bayes Network, ANN and Random Forest algorithms.

2017-12-12
Feng, W., Yan, W., Wu, S., Liu, N..  2017.  Wavelet transform and unsupervised machine learning to detect insider threat on cloud file-sharing. 2017 IEEE International Conference on Intelligence and Security Informatics (ISI). :155–157.

As increasingly more enterprises are deploying cloud file-sharing services, this adds a new channel for potential insider threats to company data and IPs. In this paper, we introduce a two-stage machine learning system to detect anomalies. In the first stage, we project the access logs of cloud file-sharing services onto relationship graphs and use three complementary graph-based unsupervised learning methods: OddBall, PageRank and Local Outlier Factor (LOF) to generate outlier indicators. In the second stage, we ensemble the outlier indicators and introduce the discrete wavelet transform (DWT) method, and propose a procedure to use wavelet coefficients with the Haar wavelet function to identify outliers for insider threat. The proposed system has been deployed in a real business environment, and demonstrated effectiveness by selected case studies.
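The second stage mentioned in this abstract relies on wavelet coefficients computed with the Haar wavelet. As a rough illustration only, the sketch below computes a one-level Haar discrete wavelet transform of a sequence of outlier-indicator values and flags the pairs whose detail coefficient is large. The thresholding rule and the names `haar_dwt` and `flag_pairs` are assumptions made for this example; the abstract does not specify the authors' exact procedure.

```python
import math


def haar_dwt(signal):
    """One-level Haar DWT: returns (approximation, detail) coefficients."""
    if len(signal) % 2:
        raise ValueError("signal length must be even")
    scale = math.sqrt(2.0)
    pairs = [(signal[i], signal[i + 1]) for i in range(0, len(signal), 2)]
    approx = [(a + b) / scale for a, b in pairs]
    detail = [(a - b) / scale for a, b in pairs]
    return approx, detail


def flag_pairs(signal, threshold):
    """Indices of sample pairs whose detail coefficient exceeds the threshold.

    Assumption: a large absolute detail coefficient marks an abrupt change.
    """
    _, detail = haar_dwt(signal)
    return [i for i, d in enumerate(detail) if abs(d) > threshold]
```

Detail coefficients are near zero where neighboring values agree and large where they jump, so a spike in an otherwise flat indicator series stands out in the detail band.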

2017-10-19
Zhang, Chenwei, Xie, Sihong, Li, Yaliang, Gao, Jing, Fan, Wei, Yu, Philip S..  2016.  Multi-source Hierarchical Prediction Consolidation. Proceedings of the 25th ACM International on Conference on Information and Knowledge Management. :2251–2256.
In big data applications such as healthcare data mining, due to privacy concerns, it is necessary to collect predictions from multiple information sources for the same instance, with raw features being discarded or withheld when aggregating multiple predictions. Besides, crowd-sourced labels need to be aggregated to estimate the ground truth of the data. Due to the imperfection caused by predictive models or human crowdsourcing workers, noisy and conflicting information is ubiquitous and inevitable. Although state-of-the-art aggregation methods have been proposed to handle label spaces with flat structures, as the label space is becoming more and more complicated, aggregation under a label hierarchical structure becomes necessary but has been largely ignored. These label hierarchies can be quite informative as they are usually created by domain experts to make sense of highly complex label correlations such as protein functionality interactions or disease relationships. We propose a novel multi-source hierarchical prediction consolidation method that effectively exploits the complicated hierarchical label structures to resolve the noisy and conflicting information that inherently originates from multiple imperfect sources. We formulate the problem as an optimization problem with a closed-form solution. The consolidation result is inferred in a totally unsupervised, iterative fashion. Experimental results on both synthetic and real-world data sets show the effectiveness of the proposed method over existing alternatives.
2015-05-04
Pratanwanich, N., Lio, P..  2014.  Who Wrote This? Textual Modeling with Authorship Attribution in Big Data. Data Mining Workshop (ICDMW), 2014 IEEE International Conference on. :645–652.

By representing large corpora with concise and meaningful elements, topic-based generative models aim to reduce the dimension and understand the content of documents. Those techniques originally analyzed only the words in the documents, but their extensions now accommodate meta-data such as authorship information, which has proved useful for textual modeling. The importance of learning authorship is to extract author interests and assign authors to anonymous texts. The Author-Topic (AT) model, an unsupervised learning technique, successfully exploits authorship information to model both documents and author interests using topic representations. However, the AT model makes the simplifying assumption that each author contributes equally to multiple-author documents. To overcome this limitation, we assume that authors make different degrees of contribution to a document by using a Dirichlet distribution. This automatically transforms the unsupervised AT model into the Supervised Author-Topic (SAT) model, which brings the novelty of authorship prediction on anonymous texts. The SAT model outperforms the AT model for identifying authors of documents written by either single authors or multiple authors, with a better Receiver Operating Characteristic (ROC) curve and a significantly higher Area Under Curve (AUC). The SAT model not only achieves performance competitive with state-of-the-art techniques, e.g. random forests, but also maintains the characteristics of unsupervised models for information discovery, i.e. word distributions of topics, author interests, and author contributions.
 

2015-05-01
Mohagheghi, S..  2014.  Integrity Assessment Scheme for Situational Awareness in Utility Automation Systems. Smart Grid, IEEE Transactions on. 5:592–601.

Today's more reliable communication technology, together with the availability of higher computational power, has paved the way for the introduction of more advanced automation systems based on distributed intelligence and multi-agent technology. However, the abundance of data, while making these systems more powerful, can at the same time act as their biggest vulnerability. In a web of interconnected devices and components functioning within an automation framework, the potential impact of a malfunction in a single device, either through internal failure or external damage/intrusion, may lead to detrimental side-effects spread across the whole underlying system. The potentially large number of devices, along with their inherent interrelations and interdependencies, may hinder the ability of human operators to interpret events, identify their scope of impact and take remedial actions if necessary. Through utilization of the concepts of graph-theoretic fuzzy cognitive maps (FCM) and expert systems, this paper puts forth a solution that is able to reveal weak links and vulnerabilities of an automation system, should it become exposed to partial internal failure or external damage. A case study has been performed on the IEEE 34-bus test distribution system to show the efficiency of the proposed scheme.
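For readers unfamiliar with fuzzy cognitive maps, the sketch below shows how activation can propagate through a weighted graph of concepts. It assumes the common Kosko-style update with a sigmoid squashing function, where `weights[j][i]` is the influence of concept `j` on concept `i`; the paper's own FCM formulation and its expert-system component may differ, and the names `fcm_step` and `fcm_run` are hypothetical.

```python
import math


def fcm_step(state, weights):
    """One synchronous FCM update: each concept sums its weighted inputs."""
    n = len(state)
    new_state = []
    for i in range(n):
        total = sum(weights[j][i] * state[j] for j in range(n))
        # Assumption: a logistic sigmoid keeps activations in (0, 1).
        new_state.append(1.0 / (1.0 + math.exp(-total)))
    return new_state


def fcm_run(state, weights, max_iter=100, tol=1e-6):
    """Iterate until activations stop changing or max_iter is reached."""
    for _ in range(max_iter):
        next_state = fcm_step(state, weights)
        if max(abs(a - b) for a, b in zip(next_state, state)) < tol:
            return next_state
        state = next_state
    return state
```

Setting one concept's activation to represent a failed device and running the map to a steady state gives a rough picture of which other concepts are affected, which is the kind of impact-scope question the abstract raises.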