Biblio

Filters: Keyword is semi-supervised learning
2023-08-03
Thai, Ho Huy, Hieu, Nguyen Duc, Van Tho, Nguyen, Hoang, Hien Do, Duy, Phan The, Pham, Van-Hau.  2022.  Adversarial AutoEncoder and Generative Adversarial Networks for Semi-Supervised Learning Intrusion Detection System. 2022 RIVF International Conference on Computing and Communication Technologies (RIVF). :584–589.
As one of the defensive solutions against cyberattacks, an Intrusion Detection System (IDS) plays an important role in observing the network state and alerting on suspicious actions that can break down the system. There have been many attempts to adopt Machine Learning (ML) in IDS to achieve high performance in intrusion detection. However, all of them necessitate a large amount of labeled data. In addition, labeling attack data is a time-consuming and expensive manual operation, which makes existing ML methods difficult to deploy in a new system or yields lower results due to a lack of labels on pre-trained data. To address these issues, we propose a semi-supervised IDS model that leverages Generative Adversarial Networks (GANs) and Adversarial AutoEncoder (AAE), called a semi-supervised adversarial autoencoder (SAAE). Our SAAE experimental results on two public datasets for benchmarking ML-based IDS, NF-CSE-CIC-IDS2018 and NF-UNSW-NB15, demonstrate the effectiveness of AAE and GAN when only a small amount of labeled data is available. In particular, our approach outperforms other ML methods with the highest detection rates in spite of the scarcity of labeled data for model training, even with only 1% labeled data.
ISSN: 2162-786X
2023-03-17
Woralert, Chutitep, Liu, Chen, Blasingame, Zander.  2022.  HARD-Lite: A Lightweight Hardware Anomaly Realtime Detection Framework Targeting Ransomware. 2022 Asian Hardware Oriented Security and Trust Symposium (AsianHOST). :1–6.
Recent years have witnessed a surge in ransomware attacks. In particular, many new ransomware variants have continued to emerge, employing more advanced techniques for distributing the payload while avoiding detection. This renders traditional static ransomware detection mechanisms ineffective. In this paper, we present our Hardware Anomaly Realtime Detection - Lightweight (HARD-Lite) framework, which employs a semi-supervised machine learning method to detect ransomware using low-level hardware information. By using an LSTM network with a weighted majority voting ensemble and an exponential moving average, we are able to take into consideration the temporal aspect of hardware-level information formed as time series in order to detect deviation in system behavior, thereby increasing the detection accuracy whilst reducing the number of false positives. Tested against various ransomware across multiple families, HARD-Lite has demonstrated remarkable effectiveness, successfully detecting all cases tested. Moreover, with a hierarchical design that distributes the classifier from the user machine under monitoring to a server machine, HARD-Lite also scales well.
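Note: the abstract above names two generic building blocks, an exponential moving average and a weighted majority vote. The following is a minimal sketch of those two ideas only, not the HARD-Lite implementation. The function names, the smoothing factor `alpha`, and the 0/1 vote encoding (1 = anomalous) are assumptions made for illustration.

```python
from typing import List, Sequence


def exponential_moving_average(values: Sequence[float], alpha: float = 0.5) -> List[float]:
    """Smooth a series with s_t = alpha * x_t + (1 - alpha) * s_{t-1}.

    The first smoothed value is the first observation itself.
    `alpha` is a hypothetical smoothing factor in (0, 1].
    """
    smoothed: List[float] = []
    for x in values:
        if not smoothed:
            smoothed.append(float(x))
        else:
            smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed


def weighted_majority_vote(votes: Sequence[int], weights: Sequence[float]) -> int:
    """Combine binary votes (1 = anomalous, 0 = normal) using per-voter weights.

    Returns 1 only if the total weight behind "anomalous" strictly exceeds
    the total weight behind "normal".
    """
    if len(votes) != len(weights):
        raise ValueError("votes and weights must have the same length")
    weight_anomalous = sum(w for v, w in zip(votes, weights) if v == 1)
    weight_normal = sum(w for v, w in zip(votes, weights) if v == 0)
    return 1 if weight_anomalous > weight_normal else 0
```

In a pipeline of this kind, the smoothing step would typically be applied to a noisy per-window anomaly score before thresholding, and the vote would combine the decisions of several detectors. How the paper wires the two together is not stated in the abstract.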
2023-02-17
Yerima, Suleiman Y., Bashar, Abul.  2022.  Semi-supervised novelty detection with one class SVM for SMS spam detection. 2022 29th International Conference on Systems, Signals and Image Processing (IWSSIP). CFP2255E-ART:1–4.
The volume of SMS messages sent on a daily basis globally has continued to grow significantly over the past years. Hence, mobile phones are becoming increasingly vulnerable to SMS spam messages, thereby exposing users to the risk of fraud and theft of personal data. Filtering of messages to detect and eliminate SMS spam is now a critical functionality for which different types of machine learning approaches are still being explored. In this paper, we propose a system for detecting SMS spam using a semi-supervised novelty detection approach based on a one-class SVM classifier. The system is built as an anomaly detector that learns only from normal SMS messages, thus enabling detection models to be implemented in the absence of labelled SMS spam training examples. We evaluated our proposed system using a benchmark dataset consisting of 747 SMS spam and 4827 non-spam messages. The results show that our proposed method outperformed the traditional supervised machine learning approaches based on binary, frequency or TF-IDF bag-of-words. The overall accuracy was 98% with a 100% SMS spam detection rate and only around a 3% false positive rate.
ISSN: 2157-8702
2023-01-06
Franci, Adriano, Cordy, Maxime, Gubri, Martin, Papadakis, Mike, Traon, Yves Le.  2022.  Influence-Driven Data Poisoning in Graph-Based Semi-Supervised Classifiers. 2022 IEEE/ACM 1st International Conference on AI Engineering – Software Engineering for AI (CAIN). :77–87.
Graph-based Semi-Supervised Learning (GSSL) is a practical solution to learn from a limited amount of labelled data together with a vast amount of unlabelled data. However, due to their reliance on the known labels to infer the unknown labels, these algorithms are sensitive to data quality. It is therefore essential to study the potential threats related to the labelled data, more specifically, label poisoning. In this paper, we propose a novel data poisoning method which efficiently approximates the result of label inference to identify the inputs which, if poisoned, would produce the highest number of incorrectly inferred labels. We extensively evaluate our approach on three classification problems under 24 different experimental settings each. Compared to the state of the art, our influence-driven attack produces an average increase of error rate 50% higher, while being faster by multiple orders of magnitude. Moreover, our method can inform engineers of inputs that deserve investigation (relabelling them) before training the learning model. We show that relabelling one-third of the poisoned inputs (selected based on their influence) reduces the poisoning effect by 50%.
DOI: 10.1145/3522664.3528606
2022-04-12
Lavi, Bahram, Nascimento, José, Rocha, Anderson.  2021.  Semi-Supervised Feature Embedding for Data Sanitization in Real-World Events. ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). :2495–2499.
With the rapid growth of data sharing through social media networks, determining relevant data items concerning a particular subject becomes paramount. We address the issue of establishing which images represent an event of interest through a semi-supervised learning technique. The method learns consistent and shared features related to an event (from a small set of examples) to propagate them to an unlabeled set. We investigate the behavior of five image feature representations considering low- and high-level features and their combinations. We evaluate the effectiveness of the feature embedding approach on five collected datasets from real-world events.
2022-03-25
Alibrahim, Hussain, Ludwig, Simone A.  2021.  Investigation of Domain Name System Attack Clustering using Semi-Supervised Learning with Swarm Intelligence Algorithms. 2021 IEEE Symposium Series on Computational Intelligence (SSCI). :1–9.

Domain Name System (DNS) is the Internet's system for converting alphabetic names into numeric IP addresses. It is one of the early and vulnerable network protocols, which has several security loopholes that have been exploited repeatedly over the years. The clustering task for the automatic recognition of these attacks uses machine learning approaches based on semi-supervised learning. A family of bio-inspired algorithms, well known as Swarm Intelligence (SI) methods, have recently emerged to meet the requirements for the clustering task and have been successfully applied to various real-world clustering problems. In this paper, Particle Swarm Optimization (PSO), Artificial Bee Colony (ABC), and Kmeans, which is one of the most popular cluster algorithms, have been applied. Furthermore, hybrid algorithms consisting of Kmeans and PSO, and Kmeans and ABC have been proposed for the clustering process. The Canadian Institute for Cybersecurity (CIC) data set has been used for this investigation. In addition, different measures of clustering performance have been used to compare the different algorithms.

2022-02-07
Abdelmonem, Salma, Seddik, Shahd, El-Sayed, Rania, Kaseb, Ahmed S.  2021.  Enhancing Image-Based Malware Classification Using Semi-Supervised Learning. 2021 3rd Novel Intelligent and Leading Emerging Sciences Conference (NILES). :125–128.
Malicious software (malware) creators are constantly mutating malware files in order to avoid detection, resulting in hundreds of millions of new malware samples every year. Therefore, most malware files are unlabeled due to the time and cost needed to label them manually. This makes it very challenging to perform malware detection, i.e., deciding whether a file is malware or not, and malware classification, i.e., determining the family of the malware. Most solutions use supervised learning (e.g., ResNet and VGG), whose accuracy degrades significantly when labeled data is scarce. To solve this problem, this paper proposes a semi-supervised learning model for image-based malware classification. In this model, malware files are represented as grayscale images, and semi-supervised learning is carefully selected to handle the plethora of unlabeled data. Our proposed model is an enhanced version of the Π-model, which makes it more accurate and consistent. Experiments show that our proposed model outperforms the original Π-model by 4% in accuracy and three other supervised models by 6% in accuracy, especially when the ratio of labeled samples is as low as 20%.
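Note: the abstract above builds on the Π-model, whose training objective is commonly described as a supervised loss on the labeled samples plus a weighted consistency term between two stochastic predictions for every sample. The following is a minimal sketch of that baseline objective only. The paper's enhancements are not described in the abstract and are not shown here. The function name, argument layout, and the use of `None` to mark unlabeled samples are assumptions for illustration.

```python
import math
from typing import List, Optional, Sequence


def pi_model_loss(
    pred_a: Sequence[Sequence[float]],
    pred_b: Sequence[Sequence[float]],
    labels: Sequence[Optional[int]],
    consistency_weight: float = 1.0,
) -> float:
    """Cross-entropy on labeled samples plus weighted mean squared
    difference between two stochastic predictions on all samples.

    pred_a, pred_b: class-probability vectors from two forward passes of the
    same inputs under different augmentation/dropout (assumed given).
    labels: class index for labeled samples, None for unlabeled ones.
    """
    if not pred_a:
        return 0.0

    supervised = 0.0
    num_labeled = 0
    consistency = 0.0
    for p_a, p_b, y in zip(pred_a, pred_b, labels):
        # Consistency term: applies to labeled and unlabeled samples alike.
        consistency += sum((a - b) ** 2 for a, b in zip(p_a, p_b)) / len(p_a)
        # Supervised term: only where a label is available.
        if y is not None:
            supervised += -math.log(max(p_a[y], 1e-12))
            num_labeled += 1

    consistency /= len(pred_a)
    if num_labeled:
        supervised /= num_labeled
    return supervised + consistency_weight * consistency
```

In practice the consistency weight is usually ramped up over the first training epochs. A fixed value is used here to keep the sketch short.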
2020-08-17
Regol, Florence, Pal, Soumyasundar, Coates, Mark.  2019.  Node Copying for Protection Against Graph Neural Network Topology Attacks. 2019 IEEE 8th International Workshop on Computational Advances in Multi-Sensor Adaptive Processing (CAMSAP). :709–713.
Adversarial attacks can affect the performance of existing deep learning models. With the increased interest in graph-based machine learning techniques, there have been investigations which suggest that these models are also vulnerable to attacks. In particular, corruptions of the graph topology can severely degrade the performance of graph-based learning algorithms. This is because the prediction capability of these algorithms relies mostly on the similarity structure imposed by the graph connectivity. Therefore, detecting the location of the corruption and correcting the induced errors becomes crucial. There has been some recent work which tackles the detection problem; however, these methods do not address the effect of the attack on the downstream learning task. In this work, we propose an algorithm that uses node copying to mitigate the degradation in classification that is caused by adversarial attacks. The proposed methodology is applied only after the model for the downstream task is trained, and the added computation cost scales well for large graphs. Experimental results show the effectiveness of our approach for several real-world datasets.
2020-06-12
Li, Wenyue, Yin, Jihao, Han, Bingnan, Zhu, Hongmei.  2019.  Generative Adversarial Network with Folded Spectrum for Hyperspectral Image Classification. IGARSS 2019 - 2019 IEEE International Geoscience and Remote Sensing Symposium. :883–886.

Hyperspectral images (HSIs) offer abundant spectral information but only limited labeled data, which motivates semi-supervised spectral-based classification methods. How the spectral information is used has a significant effect on classification accuracy. In this paper, we propose a novel semi-supervised method based on a generative adversarial network (GAN) with folded spectrum (FS-GAN). Specifically, the original spectral vector is folded into a 2D square spectrum as the input of the GAN, which can generate spectral texture and provide a larger receptive field over both adjacent and non-adjacent spectral bands for deep feature extraction. The generated fake folded spectrum and the labeled and unlabeled real folded spectra are then fed to the discriminator for semi-supervised learning. A feature matching strategy is applied to prevent model collapse. Extensive experimental comparisons demonstrate the effectiveness of the proposed method.
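Note: the folding step described in the abstract above can be illustrated as reshaping a 1D spectral vector into a square 2D grid, so that bands far apart in the vector become vertical neighbours. The following is a minimal sketch under stated assumptions. The abstract does not say how the paper orders the bands or handles band counts that are not perfect squares, so the row-major fill and the zero padding used here, along with the name `fold_spectrum`, are hypothetical choices.

```python
import math
from typing import List, Sequence


def fold_spectrum(spectrum: Sequence[float], pad_value: float = 0.0) -> List[List[float]]:
    """Fold a 1D spectral vector into a square 2D grid, row by row.

    The side length is the smallest integer whose square holds every band.
    Trailing cells are filled with `pad_value` (an assumption, see note above).
    """
    n = len(spectrum)
    side = math.isqrt(n)
    if side * side < n:
        side += 1
    padded = list(spectrum) + [pad_value] * (side * side - n)
    return [padded[row * side:(row + 1) * side] for row in range(side)]
```

For example, a 5-band vector becomes a 3 x 3 grid in which band 1 sits directly above band 4, which is the kind of non-adjacent neighbourhood a 2D convolution can then cover.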

Liu, Junfu, Chen, Keming, Xu, Guangluan, Li, Hao, Yan, Menglong, Diao, Wenhui, Sun, Xian.  2019.  Semi-Supervised Change Detection Based on Graphs with Generative Adversarial Networks. IGARSS 2019 - 2019 IEEE International Geoscience and Remote Sensing Symposium. :74–77.

In this paper, we present a semi-supervised remote sensing change detection method based on a graph model with Generative Adversarial Networks (GANs). Firstly, the multi-temporal remote sensing change detection problem is converted into a problem of semi-supervised learning on a graph that contains a majority of unlabeled nodes and a few labeled nodes. Then, GANs are adopted to generate samples in a competitive manner and help improve the classification accuracy. Finally, a binary change map is produced by classifying the unlabeled nodes into a certain class with the help of both the labeled nodes and the unlabeled nodes on the graph. Experiments carried out on several very high resolution remote sensing image data sets demonstrate the effectiveness of our method.

2019-02-08
Zügner, Daniel, Akbarnejad, Amir, Günnemann, Stephan.  2018.  Adversarial Attacks on Neural Networks for Graph Data. Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. :2847–2856.
Deep learning models for graphs have achieved strong performance for the task of node classification. Despite their proliferation, currently there is no study of their robustness to adversarial attacks. Yet, in domains where they are likely to be used, e.g. the web, adversaries are common. Can deep learning models for graphs be easily fooled? In this work, we introduce the first study of adversarial attacks on attributed graphs, specifically focusing on models exploiting ideas of graph convolutions. In addition to attacks at test time, we tackle the more challenging class of poisoning/causative attacks, which focus on the training phase of a machine learning model. We generate adversarial perturbations targeting the node's features and the graph structure, thus taking the dependencies between instances into account. Moreover, we ensure that the perturbations remain unnoticeable by preserving important data characteristics. To cope with the underlying discrete domain, we propose an efficient algorithm, Nettack, exploiting incremental computations. Our experimental study shows that the accuracy of node classification drops significantly even when performing only a few perturbations. Furthermore, our attacks are transferable: the learned attacks generalize to other state-of-the-art node classification models and unsupervised approaches, and likewise are successful even when only limited knowledge about the graph is given.
2018-03-19
Ghosh, Shalini, Das, Ariyam, Porras, Phil, Yegneswaran, Vinod, Gehani, Ashish.  2017.  Automated Categorization of Onion Sites for Analyzing the Darkweb Ecosystem. Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. :1793–1802.

Onion sites on the darkweb operate using the Tor Hidden Service (HS) protocol to shield their locations on the Internet, which (among other features) enables these sites to host malicious and illegal content while being resistant to legal action and seizure. Identifying and monitoring such illicit sites in the darkweb is of high relevance to the Computer Security and Law Enforcement communities. We have developed an automated infrastructure that crawls and indexes content from onion sites into a large-scale data repository, called LIGHTS, with over 100M pages. In this paper we describe Automated Tool for Onion Labeling (ATOL), a novel scalable analysis service developed to conduct a thematic assessment of the content of onion sites in the LIGHTS repository. ATOL has three core components – (a) a novel keyword discovery mechanism (ATOLKeyword) which extends analyst-provided keywords for different categories by suggesting new descriptive and discriminative keywords that are relevant for the categories; (b) a classification framework (ATOLClassify) that uses the discovered keywords to map onion site content to a set of categories when sufficient labeled data is available; (c) a clustering framework (ATOLCluster) that can leverage information from multiple external heterogeneous knowledge sources, ranging from domain expertise to Bitcoin transaction data, to categorize onion content in the absence of sufficient supervised data. The paper presents empirical results of ATOL on onion datasets derived from the LIGHTS repository, and additionally benchmarks ATOL's algorithms on the publicly available 20 Newsgroups dataset to demonstrate the reproducibility of its results. On the LIGHTS dataset, ATOLClassify gives a 12% performance gain over an analyst-provided baseline, while ATOLCluster gives a 7% improvement over state-of-the-art semi-supervised clustering algorithms. We also discuss how ATOL has been deployed and externally evaluated, as part of the LIGHTS system.

2017-11-27
Meng, Q., Shameng, Wen, Chao, Feng, Chaojing, Tang.  2016.  Predicting buffer overflow using semi-supervised learning. 2016 9th International Congress on Image and Signal Processing, BioMedical Engineering and Informatics (CISP-BMEI). :1959–1963.

Vulnerability detection is difficult and time-consuming work, so making full use of unlabeled data is both necessary and helpful. Accordingly, in this paper a method is proposed to predict buffer overflow based on semi-supervised learning. We first employ Antlr to extract the AST from C/C++ source files; then, according to the 22 buffer overflow attribute taxonomies, a 22-dimensional vector is extracted from every function in the AST; finally, the vectors are used to train a classifier to predict buffer overflow vulnerabilities. The experiment and evaluation indicate that our method is correct and efficient.

2017-09-15
Alabdulmohsin, Ibrahim, Han, YuFei, Shen, Yun, Zhang, XiangLiang.  2016.  Content-Agnostic Malware Detection in Heterogeneous Malicious Distribution Graph. Proceedings of the 25th ACM International on Conference on Information and Knowledge Management. :2395–2400.

Malware detection has been widely studied by analysing either file dropping relationships or characteristics of the file distribution network. This paper, for the first time, studies a global heterogeneous malware delivery graph fusing file dropping relationships and the topology of the file distribution network. The integration offers a unique ability to structure the end-to-end distribution relationship. However, it yields large heterogeneous graphs to analyse. In our study, an average daily generated graph has more than 4 million edges and 2.7 million nodes that differ in type, such as IPs, URLs, and files. We propose a novel Bayesian label propagation model to unify the multi-source information, including content-agnostic features of different node types and topological information of the heterogeneous network. Our approach does not need to examine the source code or inspect the dynamic behaviour of a binary. Instead, it estimates the maliciousness of a given file through a semi-supervised label propagation procedure, which has a linear time complexity w.r.t. the number of nodes and edges. The evaluation on 567 million real-world download events validates that our proposed approach efficiently detects malware with high accuracy.
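Note: the general idea of semi-supervised label propagation on a graph, as mentioned in the abstract above, can be sketched as repeatedly averaging neighbour scores while keeping the known labels fixed. The following is a plain, generic label propagation sketch, not the Bayesian model proposed in the paper. The function name, the 0.5 starting score for unlabeled nodes, and the fixed iteration count are assumptions for illustration.

```python
from typing import Dict, Hashable, Iterable, List, Tuple


def propagate_labels(
    edges: Iterable[Tuple[Hashable, Hashable]],
    seeds: Dict[Hashable, float],
    num_iters: int = 20,
) -> Dict[Hashable, float]:
    """Spread maliciousness scores over an undirected graph.

    edges: pairs of node ids (e.g. IPs, URLs, files).
    seeds: known scores in [0, 1] (1.0 = malicious, 0.0 = benign); these
           nodes are clamped and never updated.
    Unlabeled nodes start at 0.5 and take the mean of their neighbours'
    scores on every iteration.
    """
    neighbours: Dict[Hashable, List[Hashable]] = {}
    for u, v in edges:
        neighbours.setdefault(u, []).append(v)
        neighbours.setdefault(v, []).append(u)
    for node in seeds:
        neighbours.setdefault(node, [])

    scores = {node: seeds.get(node, 0.5) for node in neighbours}
    for _ in range(num_iters):
        updated = {}
        for node, nbrs in neighbours.items():
            if node in seeds:
                updated[node] = seeds[node]  # clamp labeled nodes
            elif nbrs:
                updated[node] = sum(scores[n] for n in nbrs) / len(nbrs)
            else:
                updated[node] = scores[node]  # isolated node: nothing to average
        scores = updated
    return scores
```

Each iteration visits every node once and every edge twice, so the cost per iteration in this sketch grows linearly with the number of nodes and edges.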

2017-03-20
Han, YuFei, Shen, Yun.  2016.  Accurate Spear Phishing Campaign Attribution and Early Detection. Proceedings of the 31st Annual ACM Symposium on Applied Computing. :2079–2086.

There is growing evidence that spear phishing campaigns are increasingly pervasive and sophisticated, and remain the starting points of more advanced attacks. The current campaign identification and attribution process relies heavily on manual effort and is inefficient in gathering intelligence in a timely manner. Ideally, we could automatically attribute spear phishing emails to known campaigns and achieve early detection of new campaigns using a limited number of labelled emails as seeds. In this paper, we introduce four categories of email profiling features that capture various characteristics of spear phishing emails. Building on these features, we implement and evaluate an affinity graph based semi-supervised learning model for campaign attribution and detection. We demonstrate that our system, using only 25 labelled emails, achieves 0.9 F1 score with a 0.01 false positive rate in known campaign attribution, and is able to detect previously unknown spear phishing campaigns, achieving 100% 'darkmoon', over 97% of 'samkams' and 91% of 'bisrala' campaign detection using 246 labelled emails in our experiments.