Biblio

Filters: Keyword is BERT
2023-09-18
Ding, Zhenquan, Xu, Hui, Guo, Yonghe, Yan, Longchuan, Cui, Lei, Hao, Zhiyu.  2022.  Mal-Bert-GCN: Malware Detection by Combining Bert and GCN. 2022 IEEE International Conference on Trust, Security and Privacy in Computing and Communications (TrustCom). :175–183.
With the dramatic increase in malicious software, the sophistication and innovation of malware have grown over the years. In particular, dynamic analysis based on deep neural networks has shown high accuracy in malware detection. However, most existing methods employ only the raw API sequence feature, which cannot accurately reflect the actual behavior of malicious programs in detail. The relationships between API calls are critical for detecting suspicious behavior. Therefore, this paper proposes a malware detection method based on graph neural networks. We first connect the API sequences executed by different processes to build a directed process graph. Then, we apply BERT to encode the API sequence of each process into a node embedding, which captures the semantic execution information inside the process. Finally, we employ a GCN to mine deep semantic information from the directed process graph and node embeddings. In addition to presenting the design, we implemented and evaluated our method on a dataset of 10,000 malware samples and 10,000 benign programs. The results show that the precision and recall of our detection model reach 97.84% and 97.83%, verifying the effectiveness of the proposed method.
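A minimal sketch of the GCN propagation step the abstract describes, H' = ReLU(Â · H · W), over a tiny directed "process graph" in plain Python. The graph, node features (which stand in for the BERT embeddings of each process's API sequence), and weights are all invented for illustration; this is not the paper's implementation.

```python
# One GCN propagation step over a directed process graph.
# Node features stand in for BERT sentence embeddings of API sequences.

def gcn_layer(adj, feats, weight):
    """Aggregate neighbor features with a normalized adjacency matrix,
    then apply a linear map and ReLU."""
    n = len(adj)
    # Add self-loops and row-normalize the adjacency matrix (A_hat).
    a_hat = [[adj[i][j] + (1 if i == j else 0) for j in range(n)]
             for i in range(n)]
    for row in a_hat:
        s = sum(row)
        for j in range(n):
            row[j] /= s
    # Aggregate: H_agg = A_hat . H
    agg = [[sum(a_hat[i][k] * feats[k][d] for k in range(n))
            for d in range(len(feats[0]))] for i in range(n)]
    # Transform: H' = ReLU(H_agg . W)
    out_dim = len(weight[0])
    return [[max(0.0, sum(agg[i][d] * weight[d][o] for d in range(len(weight))))
             for o in range(out_dim)] for i in range(n)]

# Three processes; process 0 spawned processes 1 and 2 (edges 0->1, 0->2).
adj = [[0, 1, 1],
       [0, 0, 0],
       [0, 0, 0]]
feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # stand-in node embeddings
weight = [[1.0, 0.0], [0.0, 1.0]]              # identity map, for clarity

h1 = gcn_layer(adj, feats, weight)
print(h1)
```

With the identity weight, node 0's output is the mean of its own and its children's features, while the leaf nodes keep their own features.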
2023-09-01
Sumoto, Kensuke, Kanakogi, Kenta, Washizaki, Hironori, Tsuda, Naohiko, Yoshioka, Nobukazu, Fukazawa, Yoshiaki, Kanuka, Hideyuki.  2022.  Automatic labeling of the elements of a vulnerability report CVE with NLP. 2022 IEEE 23rd International Conference on Information Reuse and Integration for Data Science (IRI). :164–165.
Common Vulnerabilities and Exposures (CVE) databases contain information about vulnerabilities in software products and source code. If the individual elements of CVE descriptions can be extracted and structured, the data can be used to search and analyze CVE descriptions. Herein we propose a method to label each element in CVE descriptions by applying Named Entity Recognition (NER). For NER, we use BERT, a transformer-based natural language processing model. Using NER with machine learning can label information from CVE descriptions even when there are some distortions in the data. An experiment involving manually prepared label information for 1,000 CVE descriptions shows that the proposed method achieves about 0.81 precision and about 0.89 recall. In addition, we devise a way to train on the data by dividing it by label. Our proposed method can be used to label each element of a CVE description automatically.
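A sketch of the token-level precision/recall computation used to score NER labels like the ones above. The tag set and the toy gold/predicted sequences are invented; "O" marks tokens outside any labeled element.

```python
# Token-level precision/recall for NER-style labels.

def precision_recall(gold, pred):
    """Precision and recall over non-"O" token labels."""
    tp = sum(1 for g, p in zip(gold, pred) if p != "O" and p == g)
    pred_pos = sum(1 for p in pred if p != "O")
    gold_pos = sum(1 for g in gold if g != "O")
    precision = tp / pred_pos if pred_pos else 0.0
    recall = tp / gold_pos if gold_pos else 0.0
    return precision, recall

# Invented tags for a toy CVE description fragment.
gold = ["B-PRODUCT", "I-PRODUCT", "O", "B-VERSION", "O", "B-IMPACT"]
pred = ["B-PRODUCT", "I-PRODUCT", "O", "B-VERSION", "B-IMPACT", "O"]

p, r = precision_recall(gold, pred)
print(p, r)  # -> 0.75 0.75
```

Entity-level (span) matching, which many NER evaluations use instead, would score the same predictions more strictly.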
2023-06-09
Sun, Zeyu, Zhang, Chi.  2022.  Research on Relation Extraction of Fusion Entity Enhancement and Shortest Dependency Path based on BERT. 2022 IEEE 10th Joint International Information Technology and Artificial Intelligence Conference (ITAIC). 10:766–770.
Deep learning models rely on word features and position features of the text to achieve good results in text relation extraction tasks. However, previous studies have failed to make full use of the semantic information contained in sentence dependency syntax trees, and data sparseness and noise propagation still affect classification models. The BERT (Bidirectional Encoder Representations from Transformers) pre-trained language model provides better representations for natural language processing tasks, and entity-enhancement methods have proved effective in relation extraction tasks. Therefore, this paper combines the shortest dependency path with an entity-enhanced BERT pre-trained language model to reduce the impact of noise terms on the classification model and obtain more semantically expressive feature representations. The algorithm is tested on the SemEval-2010 Task 8 English relation extraction dataset, where the final model reaches an F1 value of 0.881.
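The shortest dependency path (SDP) between two entities can be found by a breadth-first search over the dependency tree, treating head-dependent edges as undirected. The toy parse below is invented; the paper feeds the resulting path into its BERT-based model.

```python
# Shortest dependency path between two entity tokens via BFS.
from collections import deque

def shortest_dependency_path(edges, start, goal):
    """BFS over the (undirected) dependency tree from start to goal."""
    graph = {}
    for head, dep in edges:
        graph.setdefault(head, set()).add(dep)
        graph.setdefault(dep, set()).add(head)
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in graph.get(path[-1], ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return []

# Toy parse of "The fire inside destroyed the house" (head -> dependent).
edges = [("destroyed", "fire"), ("fire", "The"), ("fire", "inside"),
         ("destroyed", "house"), ("house", "the")]
sdp = shortest_dependency_path(edges, "fire", "house")
print(sdp)  # -> ['fire', 'destroyed', 'house']
```

Note how the SDP drops the noise words ("The", "inside", "the") that motivate the paper's use of it.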
2023-02-03
Ni, Xuming, Zheng, Jianxin, Guo, Yu, Jin, Xu, Li, Ling.  2022.  Predicting severity of software vulnerability based on BERT-CNN. 2022 International Conference on Computer Engineering and Artificial Intelligence (ICCEAI). :711–715.
Software vulnerabilities threaten the security of computer systems, and more and more vulnerabilities have been discovered and disclosed in recent years. For each detected vulnerability, analysts examine its characteristics and, using a vulnerability scoring system, determine its severity level so that the most serious vulnerabilities can be dealt with first. In recent years, description-based methods have been used to predict the severity level of a vulnerability. However, traditional text processing methods grasp only the superficial meaning of the text and ignore important contextual information. Therefore, this paper proposes an innovative method, called BERT-CNN, which combines the task-specific layer of BERT with a CNN to capture important contextual information in the text. First, we use BERT to process the vulnerability description and other information, including Access Gained, Attack Origin, and Authentication Required, to generate feature vectors. These feature vectors and the corresponding severity levels are then fed into a CNN, whose parameters are learned during training. Finally, the fine-tuned BERT and the trained CNN are used to predict the severity level of a vulnerability. The results show that our method outperforms the state-of-the-art method with an F1-score of 91.31%.
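The CNN stage of such a pipeline typically slides a filter over the sequence of feature vectors and max-pools the responses. A minimal sketch with invented vectors (standing in for the BERT outputs) and an invented filter; the paper's actual architecture is not shown:

```python
# 1-D convolution plus max-over-time pooling on a sequence of vectors.

def conv1d_maxpool(seq, kernel):
    """Slide a kernel over the sequence of vectors, return the max response."""
    k = len(kernel)
    responses = []
    for i in range(len(seq) - k + 1):
        r = sum(seq[i + j][d] * kernel[j][d]
                for j in range(k) for d in range(len(kernel[0])))
        responses.append(r)
    return max(responses)

# Stand-ins for per-token feature vectors from BERT.
seq = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0], [0.5, 0.5]]
kernel = [[1.0, 0.0], [0.0, 1.0]]  # one width-2 filter

best = conv1d_maxpool(seq, kernel)
print(best)  # -> 3.0
```

A real model would use many such filters and feed the pooled values to a softmax over severity levels.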
2023-02-02
Pujar, Saurabh, Zheng, Yunhui, Buratti, Luca, Lewis, Burn, Morari, Alessandro, Laredo, Jim, Postlethwait, Kevin, Görn, Christoph.  2022.  Varangian: A Git Bot for Augmented Static Analysis. 2022 IEEE/ACM 19th International Conference on Mining Software Repositories (MSR). :766–767.
The complexity and scale of modern software programs often lead to overlooked programming errors and security vulnerabilities. Developers often rely on automatic tools, like static analysis tools, to look for bugs and vulnerabilities. Static analysis tools are widely used because they can understand nontrivial program behaviors, scale to millions of lines of code, and detect subtle bugs. However, they are known to generate an excess of false alarms, which hinders their utilization: it is counterproductive for developers to go through a long list of reported issues only to find a few true positives. One proposed way to suppress false positives is to use machine learning to identify them. However, training machine learning models requires good-quality labeled datasets. For this purpose, we developed D2A [3], a differential-analysis-based approach that uses the commit history of a code repository to create a labeled dataset of Infer [2] static analysis output.
2022-10-06
He, Bingjun, Chen, Jianfeng.  2021.  Named Entity Recognition Method in Network Security Domain Based on BERT-BiLSTM-CRF. 2021 IEEE 21st International Conference on Communication Technology (ICCT). :508–512.
With the increasing number of network threats, knowledge graphs offer an effective way to quickly analyze threats from the mass of network security texts, and named entity recognition in the network security domain is an important task in constructing such graphs. To address the difficulty of identifying key Chinese entity information in network-security-related text, a named entity recognition model based on BERT-BiLSTM-CRF is proposed. The model adopts the BERT pre-trained model to obtain word vectors that incorporate the preceding and subsequent context, and these word vectors are fed to the subsequent BiLSTM and CRF modules for encoding and label decoding. Test results show that the model performs well on a network security domain dataset, outperforming LSTM-CRF, BERT-LSTM-CRF, BERT-CRF, and other models with F1 = 93.81%.
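The CRF layer in such models picks the best tag sequence by Viterbi decoding over per-token tag scores (from the BiLSTM) plus tag-transition scores. A minimal sketch with an invented three-tag set and invented scores; the penalty on the O → I transition shows how the CRF enforces well-formed label sequences:

```python
# Viterbi decoding over emission and transition scores, as in a CRF layer.

def viterbi(emissions, transitions, tags):
    """Return the highest-scoring tag sequence."""
    n = len(emissions)
    score = {t: emissions[0][t] for t in tags}
    back = []
    for i in range(1, n):
        new_score, ptr = {}, {}
        for t in tags:
            best_prev = max(tags, key=lambda p: score[p] + transitions[(p, t)])
            new_score[t] = (score[best_prev] + transitions[(best_prev, t)]
                            + emissions[i][t])
            ptr[t] = best_prev
        score, back = new_score, back + [ptr]
    last = max(tags, key=lambda t: score[t])
    path = [last]
    for ptr in reversed(back):       # follow back-pointers
        path.append(ptr[path[-1]])
    return path[::-1]

tags = ["B", "I", "O"]
# Forbid O -> I with a large penalty; mild bonus for B -> I.
transitions = {(p, t): 0.0 for p in tags for t in tags}
transitions[("O", "I")] = -10.0
transitions[("B", "I")] = 1.0
emissions = [{"B": 2.0, "I": 0.0, "O": 1.0},    # invented BiLSTM scores
             {"B": 0.0, "I": 1.5, "O": 1.4},
             {"B": 0.0, "I": 0.0, "O": 2.0}]

path = viterbi(emissions, transitions, tags)
print(path)  # -> ['B', 'I', 'O']
```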
2022-07-15
Wang, Yan, Allouache, Yacine, Joubert, Christian.  2021.  A Staffing Recommender System based on Domain-Specific Knowledge Graph. 2021 Eighth International Conference on Social Network Analysis, Management and Security (SNAMS). :1–6.
In the economic environment, job matching is a persistent challenge involving the evolution of knowledge and skills, and a good match between skills and jobs can stimulate economic growth. A recommender system (RecSys) for job matching can help candidates find future jobs relevant to their preferences. However, RecSys still suffers from cold start and data sparsity, and content-based filtering requires data adapted to the specific staffing tasks for Bidirectional Encoder Representations from Transformers (BERT). In this paper, we propose a job RecSys based on skills and locations using a domain-specific Knowledge Graph (KG). The system has three parts: a pipeline of Named Entity Recognition (NER) and Relation Extraction (RE) using BERT; a standardization system for pre-processing, semantic enrichment, and semantic similarity measurement; and a domain-specific KG. Two different relations in the KG are computed by cosine similarity and Term Frequency-Inverse Document Frequency (TF-IDF), respectively. The raw data used in the staffing RecSys comprise 3,000 job-offer descriptions from Indeed, 126 Curricula Vitae (CV) in English from Kaggle, and 106 CV in French from Linx of Capgemini Engineering. The staffing RecSys is integrated under a microservices architecture, and its autonomy and effectiveness are verified experimentally using Discounted Cumulative Gain (DCG). Finally, we propose several potential directions for future research.
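The two relation weights named in the abstract, TF-IDF and cosine similarity, can be sketched in a few lines. The tiny "documents" (tokenized job offers) are invented for illustration; real systems normalize and smooth these formulas in various ways.

```python
# TF-IDF term weighting and cosine similarity between vectors.
import math

def tfidf(term, doc, docs):
    """Term frequency in doc, scaled by inverse document frequency."""
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in docs if term in d)
    return tf * math.log(len(docs) / df)

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Invented tokenized job offers.
docs = [["python", "ml", "nlp"], ["java", "spring"], ["python", "spark"]]
w = tfidf("python", docs[0], docs)   # "python" appears in 2 of 3 docs
c = cosine([1.0, 2.0], [2.0, 4.0])   # parallel vectors

print(round(w, 4))
print(round(c, 4))  # -> 1.0
```

Rare skills get higher TF-IDF weight ("ml" here would outweigh "python"), which is why the paper uses it to weight skill relations in the KG.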
2022-05-19
Qing-chao, Ni, Cong-jue, Yin, Dong-hua, Zhao.  2021.  Research on Small Sample Text Classification Based on Attribute Extraction and Data Augmentation. 2021 IEEE 6th International Conference on Cloud Computing and Big Data Analytics (ICCCBDA). :53–57.
With the development of deep learning and natural language processing technology, as well as the continuous disclosure of judicial data such as judicial documents, legal intelligence has gradually become a research hotspot. Crime classification is an important branch of text classification that can help legal professionals improve their work efficiency. In practice, however, the sample data are small and the distribution of crime categories is imbalanced. To address these two problems, BERT is used as the encoder to cope with the small data volume, and an attribute extraction network is added to handle the imbalanced distribution. The model achieves 90.35% accuracy on the small-sample dataset with an F1 value of 67.62, close to the best model performance under sufficient data. Finally, a text augmentation method based on back-translation is proposed and tested with different models; it improves the LSTM model to some extent but does not improve BERT.
Zhang, Cheng, Yamana, Hayato.  2021.  Improving Text Classification Using Knowledge in Labels. 2021 IEEE 6th International Conference on Big Data Analytics (ICBDA). :193–197.
Various algorithms and models have been proposed to address text classification tasks; however, they rarely consider incorporating the additional knowledge hidden in class labels. We argue that this hidden information leads to better classification accuracy. In this study, instead of encoding the labels into numerical values, we incorporate the knowledge in the labels into the original model without changing the model architecture. We combine the output of an original classification model with the relatedness calculated from the embeddings of a sequence and a keyword set, where the keyword set is a word set representing the knowledge in the labels. Usually it is generated from the classes, though it can also be customized by the users. The experimental results show that our proposed method achieves statistically significant improvements in text classification tasks. The source code and experimental details of this study can be found on GitHub: https://github.com/HeroadZ/KiL.
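The combination idea in the abstract can be sketched as: add to each class score a relatedness term between the input and that class's keyword set. Everything below is an illustrative stand-in: the scores, keyword sets, and the simple overlap-based relatedness (the paper computes relatedness from embeddings).

```python
# Combining classifier scores with label-keyword relatedness.

def relatedness(text_tokens, keyword_set):
    """Toy stand-in: fraction of class keywords occurring in the text."""
    hits = sum(1 for k in keyword_set if k in text_tokens)
    return hits / len(keyword_set)

def combine(model_scores, text_tokens, keyword_sets, alpha=0.5):
    """Add weighted relatedness to each class's raw score."""
    return {c: model_scores[c] + alpha * relatedness(text_tokens, keyword_sets[c])
            for c in model_scores}

keyword_sets = {"sports": {"game", "team", "score"},
                "finance": {"market", "stock", "bank"}}
model_scores = {"sports": 0.40, "finance": 0.45}   # raw classifier output
tokens = {"the", "team", "won", "the", "game"}

final = combine(model_scores, tokens, keyword_sets)
best = max(final, key=final.get)
print(best)  # label knowledge flips the decision to "sports"
```

Without the relatedness term the classifier would pick "finance"; the label knowledge corrects it, which is the effect the paper reports.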
2022-05-06
Bai, Zilong, Hu, Beibei.  2021.  A Universal Bert-Based Front-End Model for Mandarin Text-To-Speech Synthesis. ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). :6074–6078.
The front-end text processing module is an essential part that significantly influences the intelligibility and naturalness of a Mandarin text-to-speech system. For commercial text-to-speech systems, the Mandarin front-end should meet the requirements of high accuracy and low latency while remaining maintainable. In this paper, we propose a universal BERT-based model that can be used for various tasks in the Mandarin front-end without changing its architecture. The feature extractor and classifiers in the model are shared across several sub-tasks, which improves expandability and maintainability. We trained and evaluated the model on polyphone disambiguation, text normalization, and prosodic boundary prediction, both as single-task modules and with multi-task learning. Results show that the model maintains high performance in the single-task setting and achieves higher accuracy and lower latency as a multi-task module, indicating that the proposed universal front-end model is a promising, maintainable Mandarin front-end for commercial applications.
2022-04-13
Issifu, Abdul Majeed, Ganiz, Murat Can.  2021.  A Simple Data Augmentation Method to Improve the Performance of Named Entity Recognition Models in Medical Domain. 2021 6th International Conference on Computer Science and Engineering (UBMK). :763–768.
Easy Data Augmentation was originally developed for text classification tasks. It consists of four basic operations: synonym replacement, random insertion, random deletion, and random swap, which yield accuracy improvements on several deep neural network models. In this study we apply these methods to a new domain by augmenting Named Entity Recognition datasets from the medical domain. Although augmentation is much more difficult here, because named entities consist of words or word groups within the sentences, we show that we can still improve named entity recognition performance.
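Two of the four operations, random swap and random deletion, can be sketched for the NER setting: since labels must move with tokens, each operates on (token, tag) pairs. The example sentence and tags are invented, and protecting entity tokens from deletion is one plausible adaptation, not necessarily the paper's.

```python
# Random swap and random deletion over (token, tag) pairs for NER data.
import random

def random_swap(pairs, rng):
    """Swap two random (token, tag) pairs, keeping labels attached."""
    pairs = pairs[:]
    i, j = rng.sample(range(len(pairs)), 2)
    pairs[i], pairs[j] = pairs[j], pairs[i]
    return pairs

def random_deletion(pairs, rng, p=0.2):
    """Drop non-entity ("O") tokens with probability p; never drop entities."""
    kept = [pair for pair in pairs if pair[1] != "O" or rng.random() > p]
    return kept or pairs[:1]   # never return an empty sentence

pairs = [("patient", "O"), ("given", "O"), ("aspirin", "B-DRUG"), ("daily", "O")]
rng = random.Random(0)
print(random_swap(pairs, rng))
print(random_deletion(pairs, rng))
```

Synonym replacement and random insertion would need a thesaurus or embedding lookup and are omitted here.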
2022-04-12
Evangelatos, Pavlos, Iliou, Christos, Mavropoulos, Thanassis, Apostolou, Konstantinos, Tsikrika, Theodora, Vrochidis, Stefanos, Kompatsiaris, Ioannis.  2021.  Named Entity Recognition in Cyber Threat Intelligence Using Transformer-based Models. 2021 IEEE International Conference on Cyber Security and Resilience (CSR). :348–353.
The continuous increase in the sophistication of threat actors over the years has made actionable threat intelligence a critical part of the defence against them. Such Cyber Threat Intelligence is published daily on several online sources, including vulnerability databases, CERT feeds, and social media, as well as on forums and web pages from the Surface and the Dark Web. Named Entity Recognition (NER) techniques can extract this information in an actionable form from such sources. In this paper we investigate how the latest advances in the NER domain, and in particular transformer-based models, can facilitate this process. To this end, we use DNRTI, a dataset for NER in threat intelligence containing more than 300 threat intelligence reports from open-source websites. Our experimental results demonstrate that transformer-based techniques are very effective in extracting cybersecurity-related named entities, considerably outperforming the previous state-of-the-art approaches tested on DNRTI.
2022-03-10
Ozan, Şükrü, Taşar, D. Emre.  2021.  Auto-tagging of Short Conversational Sentences using Natural Language Processing Methods. 2021 29th Signal Processing and Communications Applications Conference (SIU). :1–4.
In this study, we aim to find a method to auto-tag sentences specific to a domain. Our training data comprise short conversational sentences extracted from chats between a company's customer representatives and website visitors. We manually tagged approximately 14 thousand visitor inputs into ten basic categories, which will later be used in a transformer-based language model with attention mechanisms toward the ultimate goal of a chatbot application that can produce meaningful dialogue. We considered three different state-of-the-art models and report their auto-tagging capabilities. We achieved the best performance with the Bidirectional Encoder Representations from Transformers (BERT) model. Implementations of the models used in these experiments can be cloned from our GitHub repository and tested on similar auto-tagging problems with little effort.
2021-06-01
Ming, Kun.  2020.  Chinese Coreference Resolution via Bidirectional LSTMs using Word and Token Level Representations. 2020 16th International Conference on Computational Intelligence and Security (CIS). :73–76.
Coreference resolution is an important task in natural language processing. Most existing methods use word-level representations only, ignoring a large amount of information in the texts. To address this issue, we investigate how to improve Chinese coreference resolution using span-level semantic representations. Specifically, we propose a model that acquires word and character representations through pre-trained Skip-Gram embeddings and pre-trained BERT, then explicitly leverages span-level information by running bidirectional LSTMs over these representations. Experiments on the CoNLL-2012 shared task demonstrate that the proposed model achieves a 62.95% F1-score, outperforming our baseline methods.