Visible to the public Biblio

Filters: Keyword is Semisupervised learning  [Clear All Filters]
2023-08-03
Thai, Ho Huy, Hieu, Nguyen Duc, Van Tho, Nguyen, Hoang, Hien Do, Duy, Phan The, Pham, Van-Hau.  2022.  Adversarial AutoEncoder and Generative Adversarial Networks for Semi-Supervised Learning Intrusion Detection System. 2022 RIVF International Conference on Computing and Communication Technologies (RIVF). :584–589.
As one of the defensive solutions against cyberattacks, an Intrusion Detection System (IDS) plays an important role in observing the network state and alerting suspicious actions that can break down the system. There are many attempts of adopting Machine Learning (ML) in IDS to achieve high performance in intrusion detection. However, all of them necessitate a large amount of labeled data. In addition, labeling attack data is a time-consuming and expensive human-labor operation, it makes existing ML methods difficult to deploy in a new system or yields lower results due to a lack of labels on pre-trained data. To address these issues, we propose a semi-supervised IDS model that leverages Generative Adversarial Networks (GANs) and Adversarial AutoEncoder (AAE), called a semi-supervised adversarial autoencoder (SAAE). Our SAAE experimental results on two public datasets for benchmarking ML-based IDS, including NF-CSE-CIC-IDS2018 and NF-UNSW-NB15, demonstrate the effectiveness of AAE and GAN in case of using only a small number of labeled data. In particular, our approach outperforms other ML methods with the highest detection rates in spite of the scarcity of labeled data for model training, even with only 1% labeled data.
ISSN: 2162-786X
2023-06-29
Wang, Zhichao.  2022.  Deep Learning Methods for Fake News Detection. 2022 IEEE 2nd International Conference on Data Science and Computer Application (ICDSCA). :472–475.

Nowadays, although it is much more convenient to obtain news with social media and various news platforms, the emergence of all kinds of fake news has become a headache and urgent problem that needs to be solved. Currently, the fake news recognition algorithm for fake news mainly uses GCN, including some other niche algorithms such as GRU, CNN, etc. Although all fake news verification algorithms can reach quite a high accuracy with sufficient datasets, there is still room for improvement for unsupervised learning and semi-supervised. This article finds that the accuracy of the GCN method for fake news detection is basically about 85% through comparison with other neural network models, which is satisfactory, and proposes that the current field lacks a unified training dataset, and that in the future fake news detection models should focus more on semi-supervised learning and unsupervised learning.

2023-02-03
Chakraborty, Joymallya, Majumder, Suvodeep, Tu, Huy.  2022.  Fair-SSL: Building fair ML Software with less data. 2022 IEEE/ACM International Workshop on Equitable Data & Technology (FairWare). :1–8.
Ethical bias in machine learning models has become a matter of concern in the software engineering community. Most of the prior software engineering works concentrated on finding ethical bias in models rather than fixing it. After finding bias, the next step is mitigation. Prior researchers mainly tried to use supervised approaches to achieve fairness. However, in the real world, getting data with trustworthy ground truth is challenging and also ground truth can contain human bias. Semi-supervised learning is a technique where, incrementally, labeled data is used to generate pseudo-labels for the rest of data (and then all that data is used for model training). In this work, we apply four popular semi-supervised techniques as pseudo-labelers to create fair classification models. Our framework, Fair-SSL, takes a very small amount (10%) of labeled data as input and generates pseudo-labels for the unlabeled data. We then synthetically generate new data points to balance the training data based on class and protected attribute as proposed by Chakraborty et al. in FSE 2021. Finally, classification model is trained on the balanced pseudo-labeled data and validated on test data. After experimenting on ten datasets and three learners, we find that Fair-SSL achieves similar performance as three state-of-the-art bias mitigation algorithms. That said, the clear advantage of Fair-SSL is that it requires only 10% of the labeled training data. To the best of our knowledge, this is the first SE work where semi-supervised techniques are used to fight against ethical bias in SE ML models. To facilitate open science and replication, all our source code and datasets are publicly available at https://github.com/joymallyac/FairSSL. CCS CONCEPTS • Software and its engineering → Software creation and management; • Computing methodologies → Machine learning. ACM Reference Format: Joymallya Chakraborty, Suvodeep Majumder, and Huy Tu. 2022. Fair-SSL: Building fair ML Software with less data. In International Workshop on Equitable Data and Technology (FairWare ‘22), May 9, 2022, Pittsburgh, PA, USA. ACM, New York, NY, USA, 8 pages. https://doi.org/10.1145/3524491.3527305
2023-01-20
Khan, Rashid, Saxena, Neetesh, Rana, Omer, Gope, Prosanta.  2022.  ATVSA: Vehicle Driver Profiling for Situational Awareness. 2022 IEEE European Symposium on Security and Privacy Workshops (EuroS&PW). :348–357.

Increasing connectivity and automation in vehicles leads to a greater potential attack surface. Such vulnerabilities within vehicles can also be used for auto-theft, increasing the potential for attackers to disable anti-theft mechanisms implemented by vehicle manufacturers. We utilize patterns derived from Controller Area Network (CAN) bus traffic to verify driver “behavior”, as a basis to prevent vehicle theft. Our proposed model uses semi-supervised learning that continuously profiles a driver, using features extracted from CAN bus traffic. We have selected 15 key features and obtained an accuracy of 99% using a dataset comprising a total of 51 features across 10 different drivers. We use a number of data analysis algorithms, such as J48, Random Forest, JRip and clustering, using 94K records. Our results show that J48 is the best performing algorithm in terms of training and testing (1.95 seconds and 0.44 seconds recorded, respectively). We also analyze the effect of using a sliding window on algorithm performance, altering the size of the window to identify the impact on prediction accuracy.

2023-01-06
Franci, Adriano, Cordy, Maxime, Gubri, Martin, Papadakis, Mike, Traon, Yves Le.  2022.  Influence-Driven Data Poisoning in Graph-Based Semi-Supervised Classifiers. 2022 IEEE/ACM 1st International Conference on AI Engineering – Software Engineering for AI (CAIN). :77—87.
Graph-based Semi-Supervised Learning (GSSL) is a practical solution to learn from a limited amount of labelled data together with a vast amount of unlabelled data. However, due to their reliance on the known labels to infer the unknown labels, these algorithms are sensitive to data quality. It is therefore essential to study the potential threats related to the labelled data, more specifically, label poisoning. In this paper, we propose a novel data poisoning method which efficiently approximates the result of label inference to identify the inputs which, if poisoned, would produce the highest number of incorrectly inferred labels. We extensively evaluate our approach on three classification problems under 24 different experimental settings each. Compared to the state of the art, our influence-driven attack produces an average increase of error rate 50% higher, while being faster by multiple orders of magnitude. Moreover, our method can inform engineers of inputs that deserve investigation (relabelling them) before training the learning model. We show that relabelling one-third of the poisoned inputs (selected based on their influence) reduces the poisoning effect by 50%. ACM Reference Format: Adriano Franci, Maxime Cordy, Martin Gubri, Mike Papadakis, and Yves Le Traon. 2022. Influence-Driven Data Poisoning in Graph-Based Semi-Supervised Classifiers. In 1st Conference on AI Engineering - Software Engineering for AI (CAIN’22), May 16–24, 2022, Pittsburgh, PA, USA. ACM, New York, NY, USA, 11 pages. https://doi.org/10.1145/3522664.3528606
2022-07-05
Parizad, Ali, Hatziadoniu, Constantine.  2021.  False Data Detection in Power System Under State Variables' Cyber Attacks Using Information Theory. 2021 IEEE Power and Energy Conference at Illinois (PECI). :1—8.
State estimation (SE) plays a vital role in the reliable operation of modern power systems, gives situational awareness to the operators, and is employed in different functions of the Energy Management System (EMS), such as Optimal Power Flow (OPF), Contingency Analysis (CA), power market mechanism, etc. To increase SE's accuracy and protect it from compromised measurements, Bad Data Detection (BDD) algorithm is employed. However, the integration of Information and Communication Technologies (ICT) into the modern power system makes it a complicated cyber-physical system (CPS). It gives this opportunity to an adversary to find some loopholes and flaws, penetrate to CPS layer, inject false data, bypass existing BDD schemes, and consequently, result in security and stability issues. This paper employs a semi-supervised learning method to find normal data patterns and address the False Data Injection Attack (FDIA) problem. Based on this idea, the Probability Distribution Functions (PDFs) of measurement variations are derived for training and test data sets. Two distinct indices, i.e., Absolute Distance (AD) and Relative Entropy (RE), a concept in Information Theory, are utilized to find the distance between these two PDFs. In case an intruder compromises data, the related PDF changes. However, we demonstrate that AD fails to detect these changes. On the contrary, the RE index changes significantly and can properly detect FDIA. This proposed method can be used in a real-time attack detection process where the larger RE index indicates the possibility of an attack on the real-time data. To investigate the proposed methodology's effectiveness, we utilize the New York Independent System Operator (NYISO) data (Jan.-Dec. 2019) with a 5-minute resolution and map it to the IEEE 14-bus test system, and prepare an appropriate data set. After that, two different case studies (attacks on voltage magnitude ( Vm), and phase angle (θ)) with different attack parameters (i.e., 0.90, 0.95, 0.98, 1.02, 1.05, and 1.10) are defined to assess the impact of an attack on the state variables at different buses. The results show that RE index is a robust and reliable index, appropriate for real-time applications, and can detect FDIA in most of the defined case studies.
2022-04-12
Lavi, Bahram, Nascimento, José, Rocha, Anderson.  2021.  Semi-Supervised Feature Embedding for Data Sanitization in Real-World Events. ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). :2495—2499.
With the rapid growth of data sharing through social media networks, determining relevant data items concerning a particular subject becomes paramount. We address the issue of establishing which images represent an event of interest through a semi-supervised learning technique. The method learns consistent and shared features related to an event (from a small set of examples) to propagate them to an unlabeled set. We investigate the behavior of five image feature representations considering low- and high-level features and their combinations. We evaluate the effectiveness of the feature embedding approach on five collected datasets from real-world events.
2022-03-25
Alibrahim, Hussain, Ludwig, Simone A..  2021.  Investigation of Domain Name System Attack Clustering using Semi-Supervised Learning with Swarm Intelligence Algorithms. 2021 IEEE Symposium Series on Computational Intelligence (SSCI). :01—09.

Domain Name System (DNS) is the Internet's system for converting alphabetic names into numeric IP addresses. It is one of the early and vulnerable network protocols, which has several security loopholes that have been exploited repeatedly over the years. The clustering task for the automatic recognition of these attacks uses machine learning approaches based on semi-supervised learning. A family of bio-inspired algorithms, well known as Swarm Intelligence (SI) methods, have recently emerged to meet the requirements for the clustering task and have been successfully applied to various real-world clustering problems. In this paper, Particle Swarm Optimization (PSO), Artificial Bee Colony (ABC), and Kmeans, which is one of the most popular cluster algorithms, have been applied. Furthermore, hybrid algorithms consisting of Kmeans and PSO, and Kmeans and ABC have been proposed for the clustering process. The Canadian Institute for Cybersecurity (CIC) data set has been used for this investigation. In addition, different measures of clustering performance have been used to compare the different algorithms.

2022-02-07
Wang, Shuwei, Wang, Qiuyun, Jiang, Zhengwei, Wang, Xuren, Jing, Rongqi.  2021.  A Weak Coupling of Semi-Supervised Learning with Generative Adversarial Networks for Malware Classification. 2020 25th International Conference on Pattern Recognition (ICPR). :3775–3782.
Malware classification helps to understand its purpose and is also an important part of attack detection. And it is also an important part of discovering attacks. Due to continuous innovation and development of artificial intelligence, it is a trend to combine deep learning with malware classification. In this paper, we propose an improved malware image rescaling algorithm (IMIR) based on local mean algorithm. Its main goal of IMIR is to reduce the loss of information from samples during the process of converting binary files to image files. Therefore, we construct a neural network structure based on VGG model, which is suitable for image classification. In the real world, a mass of malware family labels are inaccurate or lacking. To deal with this situation, we propose a novel method to train the deep neural network by Semi-supervised Generative Adversarial Network (SGAN), which only needs a small amount of malware that have accurate labels about families. By integrating SGAN with weak coupling, we can retain the weak links of supervised part and unsupervised part of SGAN. It improves the accuracy of malware classification by making classifiers more independent of discriminators. The results of experimental demonstrate that our model achieves exhibiting favorable performance. The recalls of each family in our data set are all higher than 93.75%.
Abdelmonem, Salma, Seddik, Shahd, El-Sayed, Rania, Kaseb, Ahmed S..  2021.  Enhancing Image-Based Malware Classification Using Semi-Supervised Learning. 2021 3rd Novel Intelligent and Leading Emerging Sciences Conference (NILES). :125–128.
Malicious software (malware) creators are constantly mutating malware files in order to avoid detection, resulting in hundreds of millions of new malware every year. Therefore, most malware files are unlabeled due to the time and cost needed to label them manually. This makes it very challenging to perform malware detection, i.e., deciding whether a file is malware or not, and malware classification, i.e., determining the family of the malware. Most solutions use supervised learning (e.g., ResNet and VGG) whose accuracy degrades significantly with the lack of abundance of labeled data. To solve this problem, this paper proposes a semi-supervised learning model for image-based malware classification. In this model, malware files are represented as grayscale images, and semi-supervised learning is carefully selected to handle the plethora of unlabeled data. Our proposed model is an enhanced version of the ∏-model, which makes it more accurate and consistent. Experiments show that our proposed model outperforms the original ∏-model by 4% in accuracy and three other supervised models by 6% in accuracy especially when the ratio of labeled samples is as low as 20%.
Gao, Tan, Li, Xudong, Chen, Wen.  2021.  Co-training For Image-Based Malware Classification. 2021 IEEE Asia-Pacific Conference on Image Processing, Electronics and Computers (IPEC). :568–572.
A malware detection model based on semi-supervised learning is proposed in the paper. Our model includes mainly three parts: malware visualization, feature extraction, and classification. Firstly, the malware visualization converts malware into grayscale images; then the features of the images are extracted to reflect the coding patterns of malware; finally, a collaborative learning model is applied to malware detections using both labeled and unlabeled software samples. The proposed model was evaluated based on two commonly used benchmark datasets. The results demonstrated that compared with traditional methods, our model not only reduced the cost of sample labeling but also improved the detection accuracy through incorporating unlabeled samples into the collaborative learning process, thereby achieved higher classification performance.
2021-09-21
Yan, Fan, Liu, Jia, Gu, Liang, Chen, Zelong.  2020.  A Semi-Supervised Learning Scheme to Detect Unknown DGA Domain Names Based on Graph Analysis. 2020 IEEE 19th International Conference on Trust, Security and Privacy in Computing and Communications (TrustCom). :1578–1583.
A large amount of malware families use the domain generation algorithms (DGA) to randomly generate a large amount of domain names. It is a good way to bypass conventional blacklists of domain names, because we cannot predict which of the randomly generated domain names are selected for command and control (C&C) communications. An effective approach for detecting known DGA families is to investigate the malware with reverse engineering to find the adopted generation algorithms. As reverse engineering cannot handle the variants of DGA families, some researches leverage supervised learning to find new variants. However, the explainability of supervised learning is low and cannot find previously unseen DGA families. In this paper, we propose a graph-based semi-supervised learning scheme to track the evolution of known DGA families and find previously unseen DGA families. With a domain relation graph, we can clearly figure out how new variants relate to known DGA domain names, which induces better explainability. We deployed the proposed scheme on real network scenarios and show that the proposed scheme can not only comprehensively and precisely find known DGA families, but also can find new DGA families which have not seen before.
2020-06-12
Li, Wenyue, Yin, Jihao, Han, Bingnan, Zhu, Hongmei.  2019.  Generative Adversarial Network with Folded Spectrum for Hyperspectral Image Classification. IGARSS 2019 - 2019 IEEE International Geoscience and Remote Sensing Symposium. :883—886.

Hyperspectral image (HSIs) with abundant spectral information but limited labeled dataset endows the rationality and necessity of semi-supervised spectral-based classification methods. Where, the utilizing approach of spectral information is significant to classification accuracy. In this paper, we propose a novel semi-supervised method based on generative adversarial network (GAN) with folded spectrum (FS-GAN). Specifically, the original spectral vector is folded to 2D square spectrum as input of GAN, which can generate spectral texture and provide larger receptive field over both adjacent and non-adjacent spectral bands for deep feature extraction. The generated fake folded spectrum, the labeled and unlabeled real folded spectrum are then fed to the discriminator for semi-supervised learning. A feature matching strategy is applied to prevent model collapse. Extensive experimental comparisons demonstrate the effectiveness of the proposed method.

Liu, Junfu, Chen, Keming, Xu, Guangluan, Li, Hao, Yan, Menglong, Diao, Wenhui, Sun, Xian.  2019.  Semi-Supervised Change Detection Based on Graphs with Generative Adversarial Networks. IGARSS 2019 - 2019 IEEE International Geoscience and Remote Sensing Symposium. :74—77.

In this paper, we present a semi-supervised remote sensing change detection method based on graph model with Generative Adversarial Networks (GANs). Firstly, the multi-temporal remote sensing change detection problem is converted as a problem of semi-supervised learning on graph where a majority of unlabeled nodes and a few labeled nodes are contained. Then, GANs are adopted to generate samples in a competitive manner and help improve the classification accuracy. Finally, a binary change map is produced by classifying the unlabeled nodes to a certain class with the help of both the labeled nodes and the unlabeled nodes on graph. Experimental results carried on several very high resolution remote sensing image data sets demonstrate the effectiveness of our method.

Min, Congwen, Li, Yi, Fang, Li, Chen, Ping.  2019.  Conditional Generative Adversarial Network on Semi-supervised Learning Task. 2019 IEEE 5th International Conference on Computer and Communications (ICCC). :1448—1452.

Semi-supervised learning has recently gained increasingly attention because it can combine abundant unlabeled data with carefully labeled data to train deep neural networks. However, common semi-supervised methods deeply rely on the quality of pseudo labels. In this paper, we proposed a new semi-supervised learning method based on Generative Adversarial Network (GAN), by using discriminator to learn the feature of both labeled and unlabeled data, instead of generating pseudo labels that cannot all be correct. Our approach, semi-supervised conditional GAN (SCGAN), builds upon the conditional GAN model, extending it to semi-supervised learning by changing the discriminator's output to a classification output and a real or false output. We evaluate our approach with basic semi-supervised model on MNIST dataset. It shows that our approach achieves the classification accuracy with 84.15%, outperforming the basic semi-supervised model with 72.94%, when labeled data are 1/600 of all data.

2020-02-10
Zhan, Ying, Qin, Jin, Huang, Tao, Wu, Kang, Hu, Dan, Zhao, Zhengang, Wang, Yuntao, Cao, Ying, Jiao, RunCheng, Medjadba, Yasmine et al..  2019.  Hyperspectral Image Classification Based on Generative Adversarial Networks with Feature Fusing and Dynamic Neighborhood Voting Mechanism. IGARSS 2019 - 2019 IEEE International Geoscience and Remote Sensing Symposium. :811–814.

Classifying Hyperspectral images with few training samples is a challenging problem. The generative adversarial networks (GAN) are promising techniques to address the problems. GAN constructs an adversarial game between a discriminator and a generator. The generator generates samples that are not distinguishable by the discriminator, and the discriminator determines whether or not a sample is composed of real data. In this paper, by introducing multilayer features fusion in GAN and a dynamic neighborhood voting mechanism, a novel algorithm for HSIs classification based on 1-D GAN was proposed. Extracting and fusing multiple layers features in discriminator, and using a little labeled samples, we fine-tuned a new sample 1-D CNN spectral classifier for HSIs. In order to improve the accuracy of the classification, we proposed a dynamic neighborhood voting mechanism to classify the HSIs with spatial features. The obtained results show that the proposed models provide competitive results compared to the state-of-the-art methods.

2017-11-27
Meng, Q., Shameng, Wen, Chao, Feng, Chaojing, Tang.  2016.  Predicting buffer overflow using semi-supervised learning. 2016 9th International Congress on Image and Signal Processing, BioMedical Engineering and Informatics (CISP-BMEI). :1959–1963.

As everyone knows vulnerability detection is a very difficult and time consuming work, so taking advantage of the unlabeled data sufficiently is needed and helpful. According the above reality, in this paper a method is proposed to predict buffer overflow based on semi-supervised learning. We first employ Antlr to extract AST from C/C++ source files, then according to the 22 buffer overflow attributes taxonomies, a 22-dimension vector is extracted from every function in AST, at last, the vector is leveraged to train a classifier to predict buffer overflow vulnerabilities. The experiment and evaluation indicate our method is correct and efficient.

2015-05-05
SHAR, L., Briand, L., Tan, H..  2014.  Web Application Vulnerability Prediction using Hybrid Program Analysis and Machine Learning. Dependable and Secure Computing, IEEE Transactions on. PP:1-1.

Due to limited time and resources, web software engineers need support in identifying vulnerable code. A practical approach to predicting vulnerable code would enable them to prioritize security auditing efforts. In this paper, we propose using a set of hybrid (static+dynamic) code attributes that characterize input validation and input sanitization code patterns and are expected to be significant indicators of web application vulnerabilities. Because static and dynamic program analyses complement each other, both techniques are used to extract the proposed attributes in an accurate and scalable way. Current vulnerability prediction techniques rely on the availability of data labeled with vulnerability information for training. For many real world applications, past vulnerability data is often not available or at least not complete. Hence, to address both situations where labeled past data is fully available or not, we apply both supervised and semi-supervised learning when building vulnerability predictors based on hybrid code attributes. Given that semi-supervised learning is entirely unexplored in this domain, we describe how to use this learning scheme effectively for vulnerability prediction. We performed empirical case studies on seven open source projects where we built and evaluated supervised and semi-supervised models. When cross validated with fully available labeled data, the supervised models achieve an average of 77 percent recall and 5 percent probability of false alarm for predicting SQL injection, cross site scripting, remote code execution and file inclusion vulnerabilities. With a low amount of labeled data, when compared to the supervised model, the semi-supervised model showed an average improvement of 24 percent higher recall and 3 percent lower probability of false alarm, thus suggesting semi-supervised learning may be a preferable solution for many real world applications where vulnerability data is missing.