Biblio

List
Filter

Found 165 results

Filters: Keyword is natural language processing [Clear All Filters]

2020-05-18

Peng, Tianrui, Harris, Ian, Sawa, Yuki. 2018. Detecting Phishing Attacks Using Natural Language Processing and Machine Learning. 2018 IEEE 12th International Conference on Semantic Computing (ICSC). :300–301.

Phishing attacks are one of the most common and least defended security threats today. We present an approach which uses natural language processing techniques to analyze text and detect inappropriate statements which are indicative of phishing attacks. Our approach is novel compared to previous work because it focuses on the natural language text contained in the attack, performing semantic analysis of the text to detect malicious intent. To demonstrate the effectiveness of our approach, we have evaluated it using a large benchmark set of phishing emails.

Zong, Zhaorong, Hong, Changchun. 2018. On Application of Natural Language Processing in Machine Translation. 2018 3rd International Conference on Mechanical, Control and Computer Engineering (ICMCCE). :506–510.

Natural language processing is the core of machine translation. In the history, its development process is almost the same as machine translation, and the two complement each other. This article compares the natural language processing of statistical corpora with neural machine translation and concludes the natural language processing: Neural machine translation has the advantage of deep learning, which is very suitable for dealing with the high dimension, label-free and big data of natural language, therefore, its application is more general and reflects the power of big data and big data thinking.

2020-03-23

Xu, Yilin, Ge, Weimin, Li, Xiaohong, Feng, Zhiyong, Xie, Xiaofei, Bai, Yude. 2019. A Co-Occurrence Recommendation Model of Software Security Requirement. 2019 International Symposium on Theoretical Aspects of Software Engineering (TASE). :41–48.

To guarantee the quality of software, specifying security requirements (SRs) is essential for developing systems, especially for security-critical software systems. However, using security threat to determine detailed SR is quite difficult according to Common Criteria (CC), which is too confusing and technical for non-security specialists. In this paper, we propose a Co-occurrence Recommend Model (CoRM) to automatically recommend software SRs. In this model, the security threats of product are extracted from security target documents of software, in which the related security requirements are tagged. In order to establish relationships between software security threat and security requirement, semantic similarities between different security threat is calculated by Skip-thoughts Model. To evaluate our CoRM model, over 1000 security target documents of 9 types software products are exploited. The results suggest that building a CoRM model via semantic similarity is feasible and reliable.

2020-01-28

KADOGUCHI, Masashi, HAYASHI, Shota, HASHIMOTO, Masaki, OTSUKA, Akira. 2019. Exploring the Dark Web for Cyber Threat Intelligence Using Machine Leaning. 2019 IEEE International Conference on Intelligence and Security Informatics (ISI). :200–202.

In recent years, cyber attack techniques are increasingly sophisticated, and blocking the attack is more and more difficult, even if a kind of counter measure or another is taken. In order for a successful handling of this situation, it is crucial to have a prediction of cyber attacks, appropriate precautions, and effective utilization of cyber intelligence that enables these actions. Malicious hackers share various kinds of information through particular communities such as the dark web, indicating that a great deal of intelligence exists in cyberspace. This paper focuses on forums on the dark web and proposes an approach to extract forums which include important information or intelligence from huge amounts of forums and identify traits of each forum using methodologies such as machine learning, natural language processing and so on. This approach will allow us to grasp the emerging threats in cyberspace and take appropriate measures against malicious activities.

2019-12-16

Karve, Shreya, Nagmal, Arati, Papalkar, Sahil, Deshpande, S. A.. 2018. Context Sensitive Conversational Agent Using DNN. 2018 Second International Conference on Electronics, Communication and Aerospace Technology (ICECA). :475–478.

We investigate a method of building a closed domain intelligent conversational agent using deep neural networks. A conversational agent is a dialog system intended to converse with a human, with a coherent structure. Our conversational agent uses a retrieval based model that identifies the intent of the input user query and maps it to a knowledge base to return appropriate results. Human conversations are based on context, but existing conversational agents are context insensitive. To overcome this limitation, our system uses a simple stack based context identification and storage system. The conversational agent generates responses according to the current context of conversation. allowing more human-like conversations.

Alam, Mehreen. 2018. Neural Encoder-Decoder based Urdu Conversational Agent. 2018 9th IEEE Annual Ubiquitous Computing, Electronics Mobile Communication Conference (UEMCON). :901–905.

Conversational agents have very much become part of our lives since the renaissance of neural network based "neural conversational agents". Previously used manually annotated and rule based methods lacked the scalability and generalization capabilities of the neural conversational agents. A neural conversational agent has two parts: at one end an encoder understands the question while the other end a decoder prepares and outputs the corresponding answer to the question asked. Both the parts are typically designed using recurrent neural network and its variants and trained in an end-to-end fashion. Although conversation agents for other languages have been developed, Urdu language has seen very less progress in building of conversational agents. Especially recent state of the art neural network based techniques have not been explored yet. In this paper, we design an attention driven deep encoder-decoder based neural conversational agent for Urdu language. Overall, we make following contributions we (i) create a dataset of 5000 question-answer pairs, and (ii) present a new deep encoder-decoder based conversational agent for Urdu language. For our work, we limit the knowledge base of our agent to general knowledge regarding Pakistan. Our best model has the BLEU score of 58 and gives syntactically and semantically correct answers in majority of the cases.

2019-11-25

Zuin, Gianlucca, Chaimowicz, Luiz, Veloso, Adriano. 2018. Learning Transferable Features For Open-Domain Question Answering. 2018 International Joint Conference on Neural Networks (IJCNN). :1–8.

Corpora used to learn open-domain Question-Answering (QA) models are typically collected from a wide variety of topics or domains. Since QA requires understanding natural language, open-domain QA models generally need very large training corpora. A simple way to alleviate data demand is to restrict the domain covered by the QA model, leading thus to domain-specific QA models. While learning improved QA models for a specific domain is still challenging due to the lack of sufficient training data in the topic of interest, additional training data can be obtained from related topic domains. Thus, instead of learning a single open-domain QA model, we investigate domain adaptation approaches in order to create multiple improved domain-specific QA models. We demonstrate that this can be achieved by stratifying the source dataset, without the need of searching for complementary data unlike many other domain adaptation approaches. We propose a deep architecture that jointly exploits convolutional and recurrent networks for learning domain-specific features while transferring domain-shared features. That is, we use transferable features to enable model adaptation from multiple source domains. We consider different transference approaches designed to learn span-level and sentence-level QA models. We found that domain-adaptation greatly improves sentence-level QA performance, and span-level QA benefits from sentence information. Finally, we also show that a simple clustering algorithm may be employed when the topic domains are unknown and the resulting loss in accuracy is negligible.

2019-11-12

Zhang, Xian, Ben, Kerong, Zeng, Jie. 2018. Cross-Entropy: A New Metric for Software Defect Prediction. 2018 IEEE International Conference on Software Quality, Reliability and Security (QRS). :111-122.

Defect prediction is an active topic in software quality assurance, which can help developers find potential bugs and make better use of resources. To improve prediction performance, this paper introduces cross-entropy, one common measure for natural language, as a new code metric into defect prediction tasks and proposes a framework called DefectLearner for this process. We first build a recurrent neural network language model to learn regularities in source code from software repository. Based on the trained model, the cross-entropy of each component can be calculated. To evaluate the discrimination for defect-proneness, cross-entropy is compared with 20 widely used metrics on 12 open-source projects. The experimental results show that cross-entropy metric is more discriminative than 50% of the traditional metrics. Besides, we combine cross-entropy with traditional metric suites together for accurate defect prediction. With cross-entropy added, the performance of prediction models is improved by an average of 2.8% in F1-score.

Ferenc, Rudolf, Heged\H us, Péter, Gyimesi, Péter, Antal, Gábor, Bán, Dénes, Gyimóthy, Tibor. 2019. Challenging Machine Learning Algorithms in Predicting Vulnerable JavaScript Functions. 2019 IEEE/ACM 7th International Workshop on Realizing Artificial Intelligence Synergies in Software Engineering (RAISE). :8-14.

The rapid rise of cyber-crime activities and the growing number of devices threatened by them place software security issues in the spotlight. As around 90% of all attacks exploit known types of security issues, finding vulnerable components and applying existing mitigation techniques is a viable practical approach for fighting against cyber-crime. In this paper, we investigate how the state-of-the-art machine learning techniques, including a popular deep learning algorithm, perform in predicting functions with possible security vulnerabilities in JavaScript programs. We applied 8 machine learning algorithms to build prediction models using a new dataset constructed for this research from the vulnerability information in public databases of the Node Security Project and the Snyk platform, and code fixing patches from GitHub. We used static source code metrics as predictors and an extensive grid-search algorithm to find the best performing models. We also examined the effect of various re-sampling strategies to handle the imbalanced nature of the dataset. The best performing algorithm was KNN, which created a model for the prediction of vulnerable functions with an F-measure of 0.76 (0.91 precision and 0.66 recall). Moreover, deep learning, tree and forest based classifiers, and SVM were competitive with F-measures over 0.70. Although the F-measures did not vary significantly with the re-sampling strategies, the distribution of precision and recall did change. No re-sampling seemed to produce models preferring high precision, while re-sampling strategies balanced the IR measures.

2019-09-26

Jackson, K. A., Bennett, B. T.. 2018. Locating SQL Injection Vulnerabilities in Java Byte Code Using Natural Language Techniques. SoutheastCon 2018. :1-5.

With so much our daily lives relying on digital devices like personal computers and cell phones, there is a growing demand for code that not only functions properly, but is secure and keeps user data safe. However, ensuring this is not such an easy task, and many developers do not have the required skills or resources to ensure their code is secure. Many code analysis tools have been written to find vulnerabilities in newly developed code, but this technology tends to produce many false positives, and is still not able to identify all of the problems. Other methods of finding software vulnerabilities automatically are required. This proof-of-concept study applied natural language processing on Java byte code to locate SQL injection vulnerabilities in a Java program. Preliminary findings show that, due to the high number of terms in the dataset, using singular decision trees will not produce a suitable model for locating SQL injection vulnerabilities, while random forest structures proved more promising. Still, further work is needed to determine the best classification tool.

2019-08-05

Sorokine, Alex, Thakur, Gautam, Palumbo, Rachel. 2018. Machine Learning to Improve Retrieval by Category in Big Volunteered Geodata. Proceedings of the 12th Workshop on Geographic Information Retrieval. :4:1–4:2.

Nowadays, Volunteered Geographic Information (VGI) is commonly used in research and practical applications. However, the quality assurance of such a geographic data remains a problem. In this study we use machine learning and natural language processing to improve record retrieval by category (e.g. restaurant, museum, etc.) from Wikimapia Points of Interest data. We use textual information contained in VGI records to evaluate its ability to determine the category label. The performance of the trained classifier is evaluated on the complete dataset and then is compared with its performance on regional subsets. Preliminary analysis shows significant difference in the classifier performance across the regions. Such geographic differences will have a significant effect on data enrichment efforts such as labeling entities with missing categories.

2019-06-10

Tran, T. K., Sato, H., Kubo, M.. 2018. One-Shot Learning Approach for Unknown Malware Classification. 2018 5th Asian Conference on Defense Technology (ACDT). :8-13.

Early detection of new kinds of malware always plays an important role in defending the network systems. Especially, if intelligent protection systems could themselves detect an existence of new malware types in their system, even with a very small number of malware samples, it must be a huge benefit for the organization as well as the social since it help preventing the spreading of that kind of malware. To deal with learning from few samples, term ``one-shot learning'' or ``fewshot learning'' was introduced, and mostly used in computer vision to recognize images, handwriting, etc. An approach introduced in this paper takes advantage of One-shot learning algorithms in solving the malware classification problem by using Memory Augmented Neural Network in combination with malware's API calls sequence, which is a very valuable source of information for identifying malware behavior. In addition, it also use some advantages of the development in Natural Language Processing field such as word2vec, etc. to convert those API sequences to numeric vectors before feeding to the one-shot learning network. The results confirm very good accuracies compared to the other traditional methods.

2019-04-01

Stein, G., Peng, Q.. 2018. Low-Cost Breaking of a Unique Chinese Language CAPTCHA Using Curriculum Learning and Clustering. 2018 IEEE International Conference on Electro/Information Technology (EIT). :0595–0600.

Text-based CAPTCHAs are still commonly used to attempt to prevent automated access to web services. By displaying an image of distorted text, they attempt to create a challenge image that OCR software can not interpret correctly, but a human user can easily determine the correct response to. This work focuses on a CAPTCHA used by a popular Chinese language question-and-answer website and how resilient it is to modern machine learning methods. While the majority of text-based CAPTCHAs focus on transcription tasks, the CAPTCHA solved in this work is based on localization of inverted symbols in a distorted image. A convolutional neural network (CNN) was created to evaluate the likelihood of a region in the image belonging to an inverted character. It is used with a feature map and clustering to identify potential locations of inverted characters. Training of the CNN was performed using curriculum learning and compared to other potential training methods. The proposed method was able to determine the correct response in 95.2% of cases of a simulated CAPTCHA and 67.6% on a set of real CAPTCHAs. Potential methods to increase difficulty of the CAPTCHA and the success rate of the automated solver are considered.

2019-03-06

Wang, Jiawen, Wang, Wai Ming, Tian, Zonggui, Li, Zhi. 2018. Classification of Multiple Affective Attributes of Customer Reviews: Using Classical Machine Learning and Deep Learning. Proceedings of the 2Nd International Conference on Computer Science and Application Engineering. :94:1-94:5.

Affective1 engineering is a methodology of designing products by collecting customer affective needs and translating them into product designs. It usually begins with questionnaire surveys to collect customer affective demands and responses. However, this process is expensive, which can only be conducted periodically in a small scale. With the rapid development of e-commerce, a larger number of customer product reviews are available on the Internet. Many studies have been done using opinion mining and sentiment analysis. However, the existing studies focus on the polarity classification from a single perspective (such as positive and negative). The classification of multiple affective attributes receives less attention. In this paper, 3-class classifications of four different affective attributes (i.e. Soft-Hard, Appealing-Unappealing, Handy-Bulky, and Reliable-Shoddy) are performed by using two classical machine learning algorithms (i.e. Softmax regression and Support Vector Machine) and two deep learning methods (i.e. Restricted Boltzmann machines and Deep Belief Network) on an Amazon dataset. The results show that the accuracy of deep learning methods is above 90%, while the accuracy of classical machine learning methods is about 64%. This indicates that deep learning methods are significantly better than classical machine learning methods.

2019-03-04

Lin, F., Beadon, M., Dixit, H. D., Vunnam, G., Desai, A., Sankar, S.. 2018. Hardware Remediation at Scale. 2018 48th Annual IEEE/IFIP International Conference on Dependable Systems and Networks Workshops (DSN-W). :14–17.

Large scale services have automated hardware remediation to maintain the infrastructure availability at a healthy level. In this paper, we share the current remediation flow at Facebook, and how it is being monitored. We discuss a class of hardware issues that are transient and typically have higher rates during heavy load. We describe how our remediation system was enhanced to be efficient in detecting this class of issues. As hardware and systems change in response to the advancement in technology and scale, we have also utilized machine learning frameworks for hardware remediation to handle the introduction of new hardware failure modes. We present an ML methodology that uses a set of predictive thresholds to monitor remediation efficiency over time. We also deploy a recommendation system based on natural language processing, which is used to recommend repair actions for efficient diagnosis and repair. We also describe current areas of research that will enable us to improve hardware availability further.

Husari, G., Niu, X., Chu, B., Al-Shaer, E.. 2018. Using Entropy and Mutual Information to Extract Threat Actions from Cyber Threat Intelligence. 2018 IEEE International Conference on Intelligence and Security Informatics (ISI). :1–6.

With the rapid growth of the cyber attacks, cyber threat intelligence (CTI) sharing becomes essential for providing advance threat notice and enabling timely response to cyber attacks. Our goal in this paper is to develop an approach to extract low-level cyber threat actions from publicly available CTI sources in an automated manner to enable timely defense decision making. Specifically, we innovatively and successfully used the metrics of entropy and mutual information from Information Theory to analyze the text in the cybersecurity domain. Combined with some basic NLP techniques, our framework, called ActionMiner has achieved higher precision and recall than the state-of-the-art Stanford typed dependency parser, which usually works well in general English but not cybersecurity texts.

Buck, Joshua W., Perugini, Saverio, Nguyen, Tam V.. 2018. Natural Language, Mixed-initiative Personal Assistant Agents. Proceedings of the 12th International Conference on Ubiquitous Information Management and Communication. :82:1–82:8.

The increasing popularity and use of personal voice assistant technologies, such as Siri and Google Now, is driving and expanding progress toward the long-term and lofty goal of using artificial intelligence to build human-computer dialog systems capable of understanding natural language. While dialog-based systems such as Siri support utterances communicated through natural language, they are limited in the flexibility they afford to the user in interacting with the system and, thus, support primarily action-requesting and information-seeking tasks. Mixed-initiative interaction, on the other hand, is a flexible interaction technique where the user and the system act as equal participants in an activity, and is often exhibited in human-human conversations. In this paper, we study user support for mixed-initiative interaction with dialog-based systems through natural language using a bag-of-words model and k-nearest-neighbor classifier. We study this problem in the context of a toolkit we developed for automated, mixed-initiative dialog system construction, involving a dialog authoring notation and management engine based on lambda calculus, for specifying and implementing task-based, mixed-initiative dialogs. We use ordering at Subway through natural language, human-computer dialogs as a case study. Our results demonstrate that the dialogs authored with our toolkit support the end user's completion of a natural language, human-computer dialog in a mixed-initiative fashion. The use of natural language in the resulting mixed-initiative dialogs afford the user the ability to experience multiple self-directed paths through the dialog and makes the flexibility in communicating user utterances commensurate with that in dialog completion paths—an aspect missing from commercial assistants like Siri.

2019-02-22

Neal, T., Sundararajan, K., Woodard, D.. 2018. Exploiting Linguistic Style as a Cognitive Biometric for Continuous Verification. 2018 International Conference on Biometrics (ICB). :270-276.

This paper presents an assessment of continuous verification using linguistic style as a cognitive biometric. In stylometry, it is widely known that linguistic style is highly characteristic of authorship using representations that capture authorial style at character, lexical, syntactic, and semantic levels. In this work, we provide a contrast to previous efforts by implementing a one-class classification problem using Isolation Forests. Our approach demonstrates the usefulness of this classifier for accurately verifying the genuine user, and yields recognition accuracy exceeding 98% using very small training samples of 50 and 100-character blocks.

Vysotska, V., Lytvyn, V., Hrendus, M., Kubinska, S., Brodyak, O.. 2018. Method of Textual Information Authorship Analysis Based on Stylometry. 2018 IEEE 13th International Scientific and Technical Conference on Computer Sciences and Information Technologies (CSIT). 2:9-16.

The paper dwells on the peculiarities of stylometry technologies usage to determine the style of the author publications. Statistical linguistic analysis of the author's text allows taking advantage of text content monitoring based on Porter stemmer and NLP methods to determine the set of stop words. The latter is used in the methods of stylometry to determine the ownership of the analyzed text to a specific author in percentage points. There is proposed a formal approach to the definition of the author's style of the Ukrainian text in the article. The experimental results of the proposed method for determining the ownership of the analyzed text to a particular author upon the availability of the reference text fragment are obtained. The study was conducted on the basis of the Ukrainian scientific texts of a technical area.

Petrík, Juraj, Chudá, Daniela. 2018. Source Code Authorship Approaches Natural Language Processing. Proceedings of the 19th International Conference on Computer Systems and Technologies. :58-61.

This paper proposed method for source code authorship attribution using modern natural language processing methods. Our method based on text embedding with convolutional recurrent neural network reaches 94.5% accuracy within 500 authors in one dataset, which outperformed many state of the art models for authorship attribution. Our approach is dealing with source code as with natural language texts, so it is potentially programming language independent with more potential of future improving.

2019-02-08

Isaacson, D. M.. 2018. The ODNI-OUSD(I) Xpress Challenge: An Experimental Application of Artificial Intelligence Techniques to National Security Decision Support. 2018 IEEE 8th Annual Computing and Communication Workshop and Conference (CCWC). :104-109.

Current methods for producing and disseminating analytic products contribute to the latency of relaying actionable information and analysis to the U.S. Intelligence Community's (IC's) principal customers, U.S. policymakers and warfighters. To circumvent these methods, which can often serve as a bottleneck, we report on the results of a public prize challenge that explored the potential for artificial intelligence techniques to generate useful analytic products. The challenge tasked solvers to develop algorithms capable of searching and processing nearly 15,000 unstructured text files into a 1-2 page analytic product without human intervention; these analytic products were subsequently evaluated and scored using established IC methodologies and criteria. Experimental results from this challenge demonstrate the promise for the ma-chine generation of analytic products to ensure that the IC warns and informs in a more timely fashion.

2019-01-31

Rodríguez, Juan M., Merlino, Hernán D., Pesado, Patricia, García-Martínez, Ramón. 2018. Evaluation of Open Information Extraction Methods Using Reuters-21578 Database. Proceedings of the 2Nd International Conference on Machine Learning and Soft Computing. :87–92.

The following article shows the precision, the recall and the F1-measure for three knowledge extraction methods under Open Information Extraction paradigm. These methods are: ReVerb, OLLIE and ClausIE. For the calculation of these three measures, a representative sample of Reuters-21578 was used; 103 newswire texts were taken randomly from that database. A big discrepancy was observed, after analyzing the obtained results, between the expected and the observed precision for ClausIE. In order to save the observed gap in ClausIE precision, a simple improvement is proposed for the method. Although the correction improved the precision of Clausie, ReVerb turned out to be the most precise method; however ClausIE is the one with the better F1-measure.

2019-01-21

Ayoade, G., Chandra, S., Khan, L., Hamlen, K., Thuraisingham, B.. 2018. Automated Threat Report Classification over Multi-Source Data. 2018 IEEE 4th International Conference on Collaboration and Internet Computing (CIC). :236–245.

With an increase in targeted attacks such as advanced persistent threats (APTs), enterprise system defenders require comprehensive frameworks that allow them to collaborate and evaluate their defense systems against such attacks. MITRE has developed a framework which includes a database of different kill-chains, tactics, techniques, and procedures that attackers employ to perform these attacks. In this work, we leverage natural language processing techniques to extract attacker actions from threat report documents generated by different organizations and automatically classify them into standardized tactics and techniques, while providing relevant mitigation advisories for each attack. A naïve method to achieve this is by training a machine learning model to predict labels that associate the reports with relevant categories. In practice, however, sufficient labeled data for model training is not always readily available, so that training and test data come from different sources, resulting in bias. A naïve model would typically underperform in such a situation. We address this major challenge by incorporating an importance weighting scheme called bias correction that efficiently utilizes available labeled data, given threat reports, whose categories are to be automatically predicted. We empirically evaluated our approach on 18,257 real-world threat reports generated between year 2000 and 2018 from various computer security organizations to demonstrate its superiority by comparing its performance with an existing approach.

2019-01-16

Hendler, Danny, Kels, Shay, Rubin, Amir. 2018. Detecting Malicious PowerShell Commands Using Deep Neural Networks. Proceedings of the 2018 on Asia Conference on Computer and Communications Security. :187–197.

Microsoft's PowerShell is a command-line shell and scripting language that is installed by default on Windows machines. Based on Microsoft's .NET framework, it includes an interface that allows programmers to access operating system services. While PowerShell can be configured by administrators for restricting access and reducing vulnerabilities, these restrictions can be bypassed. Moreover, PowerShell commands can be easily generated dynamically, executed from memory, encoded and obfuscated, thus making the logging and forensic analysis of code executed by PowerShell challenging. For all these reasons, PowerShell is increasingly used by cybercriminals as part of their attacks' tool chain, mainly for downloading malicious contents and for lateral movement. Indeed, a recent comprehensive technical report by Symantec dedicated to PowerShell's abuse by cybercrimials [52] reported on a sharp increase in the number of malicious PowerShell samples they received and in the number of penetration tools and frameworks that use PowerShell. This highlights the urgent need of developing effective methods for detecting malicious PowerShell commands. In this work, we address this challenge by implementing several novel detectors of malicious PowerShell commands and evaluating their performance. We implemented both "traditional" natural language processing (NLP) based detectors and detectors based on character-level convolutional neural networks (CNNs). Detectors' performance was evaluated using a large real-world dataset. Our evaluation results show that, although our detectors (and especially the traditional NLP-based ones) individually yield high performance, an ensemble detector that combines an NLP-based classifier with a CNN-based classifier provides the best performance, since the latter classifier is able to detect malicious commands that succeed in evading the former. Our analysis of these evasive commands reveals that some obfuscation patterns automatically detected by the CNN classifier are intrinsically difficult to detect using the NLP techniques we applied. Our detectors provide high recall values while maintaining a very low false positive rate, making us cautiously optimistic that they can be of practical value.

2018-11-14

Adams, S., Carter, B., Fleming, C., Beling, P. A.. 2018. Selecting System Specific Cybersecurity Attack Patterns Using Topic Modeling. 2018 17th IEEE International Conference On Trust, Security And Privacy In Computing And Communications/ 12th IEEE International Conference On Big Data Science And Engineering (TrustCom/BigDataSE). :490–497.

One challenge for cybersecurity experts is deciding which type of attack would be successful against the system they wish to protect. Often, this challenge is addressed in an ad hoc fashion and is highly dependent upon the skill and knowledge base of the expert. In this study, we present a method for automatically ranking attack patterns in the Common Attack Pattern Enumeration and Classification (CAPEC) database for a given system. This ranking method is intended to produce suggested attacks to be evaluated by a cybersecurity expert and not a definitive ranking of the "best" attacks. The proposed method uses topic modeling to extract hidden topics from the textual description of each attack pattern and learn the parameters of a topic model. The posterior distribution of topics for the system is estimated using the model and any provided text. Attack patterns are ranked by measuring the distance between each attack topic distribution and the topic distribution of the system using KL divergence.