Biblio

Filters: Keyword is text mining
2023-02-03
Ouamour, S., Sayoud, H..  2022.  Computational Identification of Author Style on Electronic Libraries - Case of Lexical Features. 2022 5th International Symposium on Informatics and its Applications (ISIA). :1–4.
In the present work, we present a thorough study conducted on a digital library, called the HAT corpus, for the purpose of authorship attribution. A dataset of 300 documents written by 100 different authors was extracted from the web digital library and processed for author style analysis. All the documents are related to the travel topic and written in Arabic. Three important rules in stylometry should be respected: a minimum document size, the same topic for all documents, and the same genre. In this work, we made a particular effort to respect those conditions during corpus preparation. Three lexical features, fixed-length words, rare words, and suffixes, are used and evaluated using a centroid-based Manhattan distance. The identification approach shows promising results, with an accuracy of about 0.94.
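As a rough sketch of the centroid-based Manhattan distance scheme described in the abstract (the feature vectors are left abstract here; this is not the authors' exact lexical feature extraction):

```python
import numpy as np

def attribute_author(train_vecs, train_labels, test_vec):
    """Assign test_vec to the author whose feature centroid is nearest in L1 (Manhattan) distance."""
    authors = sorted(set(train_labels))
    centroids = {a: np.mean([v for v, l in zip(train_vecs, train_labels) if l == a], axis=0)
                 for a in authors}
    return min(authors, key=lambda a: np.abs(test_vec - centroids[a]).sum())
```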
2022-11-02
Li, Lishuang, Lian, Ruiyuan, Lu, Hongbin.  2021.  Document-Level Biomedical Relation Extraction with Generative Adversarial Network and Dual-Attention Multi-Instance Learning. 2021 IEEE International Conference on Bioinformatics and Biomedicine (BIBM). :438–443.
Document-level relation extraction (RE) aims to extract relations among entities within a document, which is more complex than its sentence-level counterpart, especially in biomedical text mining. Chemical-disease relation (CDR) extraction aims to extract complex semantic relationships between chemical and disease entities in documents. In order to identify relations within and across multiple sentences at the same time, existing methods try to build different document-level heterogeneous graphs. However, the entity relation representations captured by these models do not make full use of the document information and disregard the noise introduced in the process of integrating various information. In this paper, we propose a novel model, DAM-GAN, for document-level biomedical RE, which extracts entity-level and mention-level representations of relation instances with R-GCN and Dual-Attention Multi-Instance Learning (DAM) respectively, and eliminates the noise with a Generative Adversarial Network (GAN). Entity-level representations of relation instances model the semantic information of all entity pairs from the perspective of the whole document, while the mention-level representations model it from the perspective of the mention pairs related to these entity pairs in different sentences. Therefore, entity- and mention-level representations can be better integrated to represent relation instances. Experimental results demonstrate that our model achieves superior performance on the public document-level biomedical RE dataset BioCreative V Chemical Disease Relation (CDR).
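For intuition, the multi-instance attention pooling at the heart of approaches like DAM can be sketched generically (a plain attention-MIL layer, assumed for illustration; this is not the DAM-GAN architecture itself):

```python
import torch
import torch.nn as nn

class AttentionMIL(nn.Module):
    """Pool a bag of mention-pair vectors into one relation-instance vector."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, mentions):              # mentions: (n_mentions, dim)
        weights = torch.softmax(self.score(mentions), dim=0)  # attention over the bag
        return (weights * mentions).sum(dim=0)                # weighted sum = bag vector

bag = torch.randn(5, 128)                     # 5 mention pairs of one entity pair
print(AttentionMIL(128)(bag).shape)           # torch.Size([128])
```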
2022-05-19
Zhang, Cheng, Yamana, Hayato.  2021.  Improving Text Classification Using Knowledge in Labels. 2021 IEEE 6th International Conference on Big Data Analytics (ICBDA). :193–197.
Various algorithms and models have been proposed to address text classification tasks; however, they rarely consider incorporating the additional knowledge hidden in class labels. We argue that the hidden information in class labels leads to better classification accuracy. In this study, instead of encoding the labels into numerical values, we incorporated the knowledge in the labels into the original model without changing the model architecture. We combined the output of an original classification model with the relatedness calculated based on the embeddings of a sequence and a keyword set. A keyword set is a word set representing the knowledge in the labels. Usually it is generated from the classes, though it can also be customized by the users. The experimental results show that our proposed method achieved statistically significant improvements in text classification tasks. The source code and experimental details of this study can be found on GitHub: https://github.com/HeroadZ/KiL.
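A minimal sketch of fusing a classifier's output with sequence-keyword relatedness (the cosine scoring, mean-pooled keyword embeddings, and fusion weight alpha are illustrative assumptions, not the paper's exact formulation):

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)

def fused_scores(logits, seq_emb, keyword_embs, alpha=0.5):
    """Combine model logits with relatedness between a sequence embedding
    and per-class keyword-set embeddings (mean of each class's keyword vectors)."""
    relatedness = np.array([cosine(seq_emb, np.mean(kw, axis=0)) for kw in keyword_embs])
    return alpha * logits + (1 - alpha) * relatedness
```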
2022-04-13
Solanke, Abiodun A., Chen, Xihui, Ramírez-Cruz, Yunior.  2021.  Pattern Recognition and Reconstruction: Detecting Malicious Deletions in Textual Communications. 2021 IEEE International Conference on Big Data (Big Data). :2574–2582.
Digital forensic artifacts aim to provide evidence from digital sources for attributing blame to suspects, assessing their intents, corroborating their statements or alibis, etc. Textual data is a significant source of artifacts and can take various forms, for instance communications: e-mails, memos, tweets, and text messages are all examples of textual communications. Complex statistical, linguistic, and other scientific procedures can be manually applied to this data to uncover significant clues that point the way to factual information. While expert investigators can undertake this task, there is a possibility that critical information is missed or overlooked. The primary objective of this work is to aid investigators by partially automating the detection of suspicious e-mail deletions. Our approach consists of building a dynamic graph to represent the temporal evolution of communications, and then using a Variational Graph Autoencoder to detect possible e-mail deletions in this graph. Our model uses multiple types of features for representing node and edge attributes, some based on metadata of the messages and the rest extracted from the contents using natural language processing and text mining techniques. We use the autoencoder to detect missing edges, which we interpret as potential deletions, and to reconstruct their features, from which we derive hypotheses about the topics of deleted messages. We conducted an empirical evaluation of our model on the Enron e-mail dataset, which shows that our model is able to accurately detect a significant proportion of missing communications and to reconstruct the corresponding topic vectors.
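For intuition, a bare-bones variational graph autoencoder that scores candidate edges might look like the following (a self-contained PyTorch sketch under simplifying assumptions: dense adjacency, a single-layer encoder, and an inner-product decoder; not the authors' full model):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyVGAE(nn.Module):
    def __init__(self, in_dim, hid_dim=32):
        super().__init__()
        self.mu = nn.Linear(in_dim, hid_dim)      # mean of latent node embeddings
        self.logvar = nn.Linear(in_dim, hid_dim)  # log-variance of latent embeddings

    def forward(self, x, adj):
        # x: (N, in_dim) node features; adj: (N, N) float adjacency with self-loops.
        h = adj @ x                                # one hop of neighborhood aggregation
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization
        scores = torch.sigmoid(z @ z.t())          # inner-product decoder: edge probabilities
        return scores, mu, logvar

def loss_fn(scores, adj, mu, logvar):
    recon = F.binary_cross_entropy(scores, adj)                    # reconstruction term
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())  # KL to N(0, I)
    return recon + kl

# Non-edges that receive high scores can be flagged as candidate deletions.
```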
2022-05-19
Kuilboer, Jean-Pierre, Stull, Tristan.  2021.  Text Analytics and Big Data in the Financial domain. 2021 16th Iberian Conference on Information Systems and Technologies (CISTI). :1–4.
This research provides some insights into the application of text mining and Natural Language Processing (NLP). The application domain is consumer complaints about financial institutions in the USA. As an advanced analytics discipline embedded within the Big Data paradigm, the practice of text analytics contains elements of emergent knowledge processes. Since our experiment should be able to scale up, we make use of a pipeline based on Spark-NLP. The usage scenario involves adapting the model to a specific industrial context, using the dataset offered by the "Consumer Financial Protection Bureau" to illustrate the application.
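As a hedged illustration, a minimal Spark-NLP preprocessing pipeline for complaint text could be assembled like this (stage choices and column names are assumptions; the paper does not specify its exact pipeline):

```python
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer, Normalizer, StopWordsCleaner

document = DocumentAssembler().setInputCol("complaint_text").setOutputCol("document")
tokens = Tokenizer().setInputCols(["document"]).setOutputCol("token")
normalized = Normalizer().setInputCols(["token"]).setOutputCol("normalized")
cleaned = StopWordsCleaner().setInputCols(["normalized"]).setOutputCol("clean_tokens")

pipeline = Pipeline(stages=[document, tokens, normalized, cleaned])
# model = pipeline.fit(df); result = model.transform(df)  # df: Spark DataFrame of complaints
```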
Rabbani, Mustafa Raza, Bashar, Abu, Atif, Mohd, Jreisat, Ammar, Zulfikar, Zehra, Naseem, Yusra.  2021.  Text mining and visual analytics in research: Exploring the innovative tools. 2021 International Conference on Decision Aid Sciences and Application (DASA). :1087–1091.
The aim of the study is to present an advanced overview and potential applications of the innovative tools, software, and methods used for data visualization, text mining, scientific mapping, and bibliometric analysis. Text mining and data visualization have been topics of research for several years for academic researchers and practitioners. With the advancement of technology and innovation in data analysis techniques, many online and offline software tools are available for text mining and visualization. The purpose of this study is to present an advanced overview of the latest, sophisticated, and innovative tools available for this purpose. A unique characteristic of this study is that it provides an overview, with examples, of the five most widely adopted software tools in social science research: VOSviewer, Biblioshiny, Gephi, HistCite, and CiteSpace. This study will contribute to the academic literature and will help researchers and practitioners apply these tools in future research to present their findings in a more scientific manner.
2022-02-24
Zhou, Andy, Sultana, Kazi Zakia, Samanthula, Bharath K..  2021.  Investigating the Changes in Software Metrics after Vulnerability Is Fixed. 2021 IEEE International Conference on Big Data (Big Data). :5658–5663.
Preventing software vulnerabilities while writing code is one of the most effective ways to avoid cyber attacks on any developed system. Although developers follow some standard guiding principles for ensuring secure code, the code can still have security bottlenecks and be compromised by an attacker. Therefore, assessing software security while developing code can help developers write vulnerability-free code. Researchers have already focused on metrics-based and text-mining-based software vulnerability prediction models. The metrics-based models showed higher precision in predicting vulnerabilities, although their recall rate is low. In addition, current research has not investigated the impact of individual software metrics on the occurrence of vulnerabilities. The main objective of this paper is to track the changes in every software metric after the developer fixes a particular vulnerability. The results of our research will potentially motivate further research on building more accurate vulnerability prediction models based on the appropriate software metrics. In particular, we compared a total of 250 files from Apache Tomcat and Apache CXF. These files were extracted from the Apache database and were chosen because Apache released them as vulnerable in their publicly available security advisories. Using a static analysis tool, metrics of the targeted vulnerable files and the relevant fixed files (files where vulnerable code was removed by the developers) were extracted and compared. We show that eight of the 40 metrics have an average increase of 2% from vulnerable to fixed files. These metrics include CountDeclClass, CountDeclClassMethod, CountDeclClassVariable, CountDeclInstanceVariable, CountDeclMethodDefault, CountLineCode, MaxCyclomaticStrict, and MaxNesting. This study will help developers assess software security by utilizing software metrics in secure coding practices.
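The kind of comparison described, the average per-metric change between vulnerable and fixed files, can be reproduced along these lines (the CSV layout is an assumption; the metric names follow those listed above):

```python
import pandas as pd

# Assumed layout: one row per file version, columns = metric names (e.g., CountLineCode).
vulnerable = pd.read_csv("vulnerable_metrics.csv")
fixed = pd.read_csv("fixed_metrics.csv")

# Average percentage change per metric from vulnerable to fixed versions.
pct_change = ((fixed.mean() - vulnerable.mean()) / vulnerable.mean()) * 100
print(pct_change.sort_values(ascending=False).head(8))
```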
2022-05-19
Fuentalba, Diego, Durán, Claudia, Guillaume, Charles, Carrasco, Raúl, Gutierrez, Sebastián, Pinto, Oscar.  2021.  Text Analytics Architecture in IoT Systems. 2021 Third South American Colloquium on Visible Light Communications (SACVLC). :01–06.
Management control and monitoring of production activities in intelligent environments in underground mines must be aligned with the strategies and objectives of each agent. In operations, the local structure of each service must be fault-tolerant, and large amounts of data must be transmitted online to executives so they can make effective and efficient decisions. The paper proposes an architecture that enables strategic text analysis on Internet of Things devices through task partitioning with multi-agent systems, and it evaluates the feasibility of the design by building a prototype that improves communication. The results validate the system's design: a Raspberry Pi can execute text mining algorithms and agents in about 3 seconds for 197 texts. This work emphasizes multiple agents for text analytics because the algorithms, together with the agents, use about 70% of a Raspberry Pi CPU.
2021-01-11
Zhao, F., Skums, P., Zelikovsky, A., Sevigny, E. L., Swahn, M. H., Strasser, S. M., Huang, Y., Wu, Y..  2020.  Computational Approaches to Detect Illicit Drug Ads and Find Vendor Communities Within Social Media Platforms. IEEE/ACM Transactions on Computational Biology and Bioinformatics. :1–1.
The opioid abuse epidemic represents a major public health threat to global populations. The role social media may play in facilitating illicit drug trade is largely unknown due to limited research. However, it is known that social media use among adults in the US is widespread, that there is vast capability for online promotion of illegal drugs with delayed or limited deterrence of such messaging, and that general commercial sale applications provide safeguards for transactions but do not discriminate between legal and illegal sale transactions. These characteristics of the social media environment present challenges to surveillance, which is needed for advancing knowledge of online drug markets and the role they play in drug abuse and overdose deaths. In this paper, we present a computational framework developed to automatically detect illicit drug ads and communities of vendors. The SVM- and CNN-based methods for detecting illicit drug ads, and a matrix-factorization-based method for discovering overlapping communities, have been extensively validated on a large dataset collected from Google+, Flickr, and Tumblr. Pilot test results demonstrate that our computational methods can effectively identify illicit drug ads and detect vendor communities with accuracy. These methods hold promise to advance scientific knowledge surrounding the role social media may play in perpetuating the drug abuse epidemic.
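For flavor, an SVM-based ad detector of the sort mentioned can be prototyped in a few lines with scikit-learn (a generic sketch with toy stand-in data; the paper's features and corpus are not reproduced):

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Toy stand-in posts; the real study used data from Google+, Flickr, and Tumblr.
posts = ["buy oxy discreet shipping", "lovely hiking photos from today"]
labels = [1, 0]  # 1 = illicit drug ad, 0 = benign

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
clf.fit(posts, labels)
print(clf.predict(["discreet overnight shipping, message to buy"]))
```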
2021-02-22
Bhagat, V., J, B. R..  2020.  Natural Language Processing on Diverse Data Layers Through Microservice Architecture. 2020 IEEE International Conference for Innovation in Technology (INOCON). :1–6.
With the rapid growth in Natural Language Processing (NLP), industries of all types find a need to analyze massive amounts of data. Sentiment analysis is becoming an ever more exciting area for practitioners and researchers in text mining and NLP; the process involves calculating various sentiments with the help of text mining. In addition, the world is connected through information technology, and businesses are moving toward the next step of development to make their systems more intelligent. Microservices have fulfilled the need for development platforms that help developers use various development tools (languages and applications) efficiently. With data analysis central to business growth, data security becomes a major concern for developers. This paper presents a solution that keeps data secure by providing data scientists the required access without disturbing the base system software. It discusses the data storage and exchange policies of microservices through a common JavaScript Object Notation (JSON) response, performing sentiment analysis on customer data fetched from various microservices through secured APIs.
2021-11-29
Somsakul, Supawit, Prom-on, Santitham.  2020.  On the Network and Topological Analyses of Legal Documents Using Text Mining Approach. 2020 1st International Conference on Big Data Analytics and Practices (IBDAP). :1–6.
This paper presents a computational study of Thai legal documents using text mining and a network analytic approach. The Thai legal system relies heavily on existing judicial rulings, so legal documents contain complex relationships and require careful examination. The objective of this study is to use text mining to model the relationships between these legal documents and draw useful insights. The study uncovered a structure of document relationships in the form of a network that reflects meaningful relations between legal documents. This could potentially be developed further into a document retrieval system based on how documents are related in the network.
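A toy version of such a document network, linking documents whose texts mention one another, can be built with networkx (the citation-detection rule below is a placeholder assumption):

```python
import networkx as nx

docs = {"ruling_A": "as decided in ruling_B",
        "ruling_B": "original judgment",
        "ruling_C": "following ruling_A and ruling_B"}

G = nx.DiGraph()
for name, text in docs.items():
    G.add_node(name)
    for other in docs:
        if other != name and other in text:  # placeholder citation detection
            G.add_edge(name, other)

print(nx.pagerank(G))  # centrality as a rough importance signal for retrieval
```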
2021-08-02
Pereira, José D’Abruzzo.  2020.  Techniques and Tools for Advanced Software Vulnerability Detection. 2020 IEEE International Symposium on Software Reliability Engineering Workshops (ISSREW). :123–126.
Software is frequently deployed with vulnerabilities that may allow hackers to gain access to the system or its information, leading to monetary or reputation losses. Although there are many techniques to detect software vulnerabilities, their effectiveness is far from acceptable, especially in large software projects, as shown by several research works. This Ph.D. work aims to study the combination of different techniques to improve the effectiveness of vulnerability detection (increasing the detection rate and decreasing the number of false positives). Static Code Analysis (SCA) has a good detection rate and is the central technique of this work. However, as SCA reports many false positives, we will study the combination of various SCA tools and the integration with other detection approaches (e.g., software metrics) to improve vulnerability detection capabilities. We will also study the use of such combinations to prioritize the reported vulnerabilities and thus guide the development efforts and fixes in resource-constrained projects.
2020-02-17
Rodriguez, Ariel, Okamura, Koji.  2019.  Generating Real Time Cyber Situational Awareness Information Through Social Media Data Mining. 2019 IEEE 43rd Annual Computer Software and Applications Conference (COMPSAC). 2:502–507.
With the rise of the internet, many new data sources have emerged that can be used to gain insights into the cyber threat landscape and allow us to better prepare for cyber attacks before they happen. With this in mind, we present an end-to-end, real-time cyber situational awareness system which aims to efficiently retrieve security-relevant information from the social networking site Twitter.com. This system classifies and aggregates the retrieved data and provides real-time cyber situational awareness information based on sentiment analysis and data analytics techniques. This research will assist security analysts in evaluating the level of cyber risk in their organization and in proactively taking actions to plan and prepare for potential attacks before they happen, and it contributes a cybersecurity tweet dataset to the field.
2020-05-18
Bakhtin, Vadim V., Isaeva, Ekaterina V..  2019.  New TSBuilder: Shifting towards Cognition. 2019 IEEE Conference of Russian Young Researchers in Electrical and Electronic Engineering (EIConRus). :179–181.
The paper reviews a project on the automation of term system construction. TSBuilder (Term System Builder) was developed in 2014 as a multilayer Rosenblatt perceptron for supervised machine learning, namely the identification of one- to three-word terms in natural language texts and their rigid categorization. The program is being modified to reduce the rigidity of categorization, which will bring text mining more in line with human thinking. We are expanding the range of parameters (semantic, morphological, and syntactic) for categorization, removing the restriction of the term length of three words, using convolution on a continuous sequence of terms, and presenting the probabilities of a term falling into different categories. The neural network will not assign a single category to a term but will give N answers (where N is the number of predefined classes), each of which is a probability O ∈ [0, 1] that the term belongs to the given class.
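The per-class probability output described can be illustrated with a tiny PyTorch head (a sketch of the general idea, not the TSBuilder implementation): independent sigmoid units give each term a membership probability in [0, 1] for every class instead of a single hard label.

```python
import torch
import torch.nn as nn

class SoftCategorizer(nn.Module):
    def __init__(self, in_dim, n_classes):
        super().__init__()
        self.fc = nn.Linear(in_dim, n_classes)

    def forward(self, term_features):
        # Independent sigmoids: N probabilities in [0, 1], one per class,
        # instead of a single rigid category assignment.
        return torch.sigmoid(self.fc(term_features))

model = SoftCategorizer(in_dim=64, n_classes=10)
probs = model(torch.randn(1, 64))  # tensor of 10 class-membership probabilities
```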
2020-07-30
Srisopha, Kamonphop, Phonsom, Chukiat, Lin, Keng, Boehm, Barry.  2019.  Same App, Different Countries: A Preliminary User Reviews Study on Most Downloaded iOS Apps. 2019 IEEE International Conference on Software Maintenance and Evolution (ICSME). :76–80.
Prior work on mobile app reviews has demonstrated that user reviews contain a wealth of information and are seen as a potential source of requirements. However, most of the studies in this area have focused on mining and analyzing user reviews from the US App Store, leaving the reviews of users from other countries unexplored. In this paper, we seek to understand, by analyzing user reviews, whether users from other countries perceive the same apps differently than users from the US. We retrieve 300,643 user reviews of the 15 most downloaded iOS apps of 2018, published directly by Apple, from nine English-speaking countries over the course of 5 months. We manually classify 3,358 reviews into several software quality and improvement factors. We leverage a random-forest-based algorithm to identify factors that can be used to differentiate reviews between the US and other countries. Our preliminary results show that all countries have some factors that are proportionally inconsistent with the US.
2020-10-29
Xylogiannopoulos, Konstantinos F., Karampelas, Panagiotis, Alhajj, Reda.  2019.  Text Mining for Malware Classification Using Multivariate All Repeated Patterns Detection. 2019 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM). :887–894.
Mobile phones have nowadays become a commodity for the majority of people. Using them, people are able to access the world of the Internet and connect with their friends, their colleagues at work, or even unknown people with common interests. This proliferation of mobile devices has also been seen as an opportunity for cyber criminals to deceive smartphone users and steal their money directly or indirectly: directly by accessing their bank accounts through their smartphones, or indirectly by blackmailing them or selling their private data, such as photos and credit card data, to third parties. This is usually achieved by installing malware on smartphones that masks its malevolent payload as a legitimate application, advertised to users in the hope that they will install it on their devices. Thus, any existing application can easily be modified by integrating malware and then presented as a legitimate one. In response to this, scientists have proposed a number of malware detection and classification methods using a variety of techniques. Even though several of them achieve relatively high precision in malware classification, there is still room for improvement. In this paper, we propose a text mining method based on all-repeated-pattern detection, which uses the decompiled files of an application in order to classify a suspicious application into one of the known malware families. In experiments on a real malware dataset, the methodology correctly classified, without any misclassification, all randomly selected malware applications across 3 categories with 3 different families each.
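To give a feel for repeated-pattern detection on decompiled sources, a simplified token-level stand-in for the authors' all-repeated-patterns algorithm (which is far more scalable) might look like:

```python
from collections import Counter

def repeated_ngrams(tokens, n_max=4):
    """Collect all token n-grams (n = 2..n_max) that occur more than once."""
    patterns = Counter()
    for n in range(2, n_max + 1):
        for i in range(len(tokens) - n + 1):
            patterns[tuple(tokens[i:i + n])] += 1
    return {p: c for p, c in patterns.items() if c > 1}

code_tokens = "invoke getDeviceId send invoke getDeviceId send encrypt".split()
print(repeated_ngrams(code_tokens))  # repeated patterns as a malware-family fingerprint
```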
2020-11-20
Han, H., Wang, Q., Chen, C..  2019.  Policy Text Analysis Based on Text Mining and Fuzzy Cognitive Map. 2019 15th International Conference on Computational Intelligence and Security (CIS). :142–146.
With the introduction of computational methods, the amount of material and the processing accuracy of policy text analysis have been greatly improved. In this paper, text mining (TM) and latent semantic analysis (LSA) were used to collect policy documents and extract policy elements from them. A fuzzy association rule mining (FARM) technique and partial association tests (PA) were used to discover the causal relationships and impact degrees between elements, and a fuzzy cognitive map (FCM) was developed to deduce the evolution of elements through a soft computing method. This non-interventionist approach avoids the validity defects caused by researchers' subjective bias and provides policy makers with more objective policy suggestions from a neutral perspective. To illustrate the accuracy of this method, this study took policies related to state-owned capital layout adjustment as an example and showed that the method can effectively analyze policy text.
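The FCM inference step can be summarized by the standard update rule, in which concept activations are repeatedly propagated through the weight matrix and squashed (a generic FCM sketch; the paper's weights come from FARM and are not reproduced here):

```python
import numpy as np

def fcm_step(state, W, lam=1.0):
    """One fuzzy cognitive map update: A(t+1) = sigmoid(A(t) @ W + A(t))."""
    return 1.0 / (1.0 + np.exp(-lam * (state @ W + state)))

W = np.array([[0.0, 0.6, 0.0],   # illustrative causal weights between 3 policy elements
              [0.0, 0.0, 0.4],
              [0.2, 0.0, 0.0]])
state = np.array([1.0, 0.0, 0.0])  # activate the first element
for _ in range(10):                # iterate toward (near) convergence
    state = fcm_step(state, W)
print(state)
```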
2019-02-14
Georgakopoulos, Spiros V., Tasoulis, Sotiris K., Vrahatis, Aristidis G., Plagianakos, Vassilis P..  2018.  Convolutional Neural Networks for Toxic Comment Classification. Proceedings of the 10th Hellenic Conference on Artificial Intelligence. :35:1-35:6.
A flood of information is produced on a daily basis through global internet usage, arising from the online interactive communications among users. While this situation contributes significantly to the quality of human life, it unfortunately involves enormous dangers, since online texts with high toxicity can cause personal attacks, online harassment, and bullying behaviors. This has engaged both industry and the research community in the last few years, and there have been several attempts to identify an efficient model for online toxic comment prediction. However, these steps are still in their infancy, and new approaches and frameworks are required. In parallel, the constant data explosion makes the construction of new machine learning tools for managing this information an imperative need. Thankfully, advances in hardware, cloud computing, and big data management allow the development of Deep Learning approaches that have shown very promising performance so far. For text classification in particular, the use of Convolutional Neural Networks (CNN) has recently been proposed, approaching text analytics in a modern manner that emphasizes the structure of words in a document. In this work, we employ this approach to discover toxic comments in a large pool of documents provided by a current Kaggle competition regarding Wikipedia's talk page edits. To justify this decision, we compare CNNs against the traditional bag-of-words approach for text analysis combined with a selection of algorithms proven to be very effective in text classification. The reported results provide enough evidence that CNNs enhance toxic comment classification, reinforcing research interest in this direction.
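A compact Conv1D text classifier in the spirit of the paper might look like this in Keras (hyperparameters are illustrative defaults, not the authors' configuration):

```python
from tensorflow import keras
from tensorflow.keras import layers

vocab_size, seq_len = 20000, 200
model = keras.Sequential([
    layers.Embedding(vocab_size, 128, input_length=seq_len),  # word embeddings
    layers.Conv1D(64, 5, activation="relu"),                  # n-gram-like filters
    layers.GlobalMaxPooling1D(),                              # strongest filter response
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),                    # toxic vs. non-toxic
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```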
Eclarin, Bobby A., Fajardo, Arnel C., Medina, Ruji P..  2018.  A Novel Feature Hashing With Efficient Collision Resolution for Bag-of-Words Representation of Text Data. Proceedings of the 2Nd International Conference on Natural Language Processing and Information Retrieval. :12-16.
Text mining is widely used in many areas, transforming unstructured text data from sources such as patient records, social media networks, insurance data, and news into an invaluable source of information. The Bag-of-Words (BoW) representation is a means of extracting features from text data for use in modeling. In text classification, a word in a document is assigned a weight according to its frequency within the document and its frequency across different documents; words together with their weights thus form the BoW. One way to handle voluminous data is the feature hashing method, or hashing trick. However, collisions are inevitable and might change the result of the whole process of feature generation and selection. Using a vector data structure, lookup performance is improved while resolving collisions, and memory usage is also efficient.
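The hashing trick, and the kind of collision bookkeeping the paper targets, can be sketched as follows (the per-bucket chaining shown is a simplified assumption about the proposed vector-based resolution):

```python
from collections import defaultdict

def hashed_bow(tokens, n_buckets=1024):
    """Feature hashing with per-bucket vectors so colliding words stay distinguishable."""
    buckets = defaultdict(list)           # bucket index -> list of (word, count)
    counts = defaultdict(int)
    for t in tokens:
        counts[t] += 1
    for word, c in counts.items():
        idx = hash(word) % n_buckets      # the hashing trick
        buckets[idx].append((word, c))    # collisions resolved by chaining in a vector
    return buckets

print(hashed_bow("the cat sat on the mat".split()))
```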
2019-09-26
Jackson, K. A., Bennett, B. T..  2018.  Locating SQL Injection Vulnerabilities in Java Byte Code Using Natural Language Techniques. SoutheastCon 2018. :1-5.
With so much of our daily lives relying on digital devices like personal computers and cell phones, there is a growing demand for code that not only functions properly but is secure and keeps user data safe. However, ensuring this is not an easy task, and many developers do not have the required skills or resources to ensure their code is secure. Many code analysis tools have been written to find vulnerabilities in newly developed code, but this technology tends to produce many false positives and is still not able to identify all of the problems. Other methods of finding software vulnerabilities automatically are required. This proof-of-concept study applied natural language processing to Java byte code to locate SQL injection vulnerabilities in a Java program. Preliminary findings show that, due to the high number of terms in the dataset, single decision trees will not produce a suitable model for locating SQL injection vulnerabilities, while random forest structures proved more promising. Still, further work is needed to determine the best classification tool.
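The decision-tree-versus-random-forest comparison reported can be mirrored with scikit-learn on bag-of-terms features (toy stand-in data; the actual byte-code term extraction is not shown):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Stand-in "byte code term" documents; 1 = SQL-injection-prone pattern.
docs = ["ldc SELECT append invokevirtual executeQuery",
        "ldc SELECT prepareStatement setString executeQuery"]
labels = [1, 0]

X = CountVectorizer().fit_transform(docs)
for clf in (DecisionTreeClassifier(), RandomForestClassifier(n_estimators=100)):
    clf.fit(X, labels)
    print(type(clf).__name__, clf.score(X, labels))
```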
2020-10-12
Chung, Wingyan, Liu, Jinwei, Tang, Xinlin, Lai, Vincent S. K..  2018.  Extracting Textual Features of Financial Social Media to Detect Cognitive Hacking. 2018 IEEE International Conference on Intelligence and Security Informatics (ISI). :244–246.
Social media increasingly reflect and influence the behavior of humans and financial markets. Cognitive hacking leverages the influence of social media to spread deceptive information with an intent to gain abnormal profits illegally or to cause losses. Measuring the information content in financial social media can be useful for identifying these attacks. In this paper, we developed an approach to identifying social media features that correlate with abnormal returns of the stocks of companies vulnerable to becoming targets of cognitive hacking. To test the approach, we collected price data and 865,289 social media messages on four technology companies from July 2017 to June 2018, and extracted features that contributed to abnormal stock movements. Preliminary results show that terms that are simple, motivate action, incite emotion, and use exaggeration rank high among the features of messages associated with abnormal price movements. We also provide selected messages to illustrate the use of these features in potential cognitive hacking attacks.
2019-03-04
Husari, G., Niu, X., Chu, B., Al-Shaer, E..  2018.  Using Entropy and Mutual Information to Extract Threat Actions from Cyber Threat Intelligence. 2018 IEEE International Conference on Intelligence and Security Informatics (ISI). :1–6.
With the rapid growth of cyber attacks, cyber threat intelligence (CTI) sharing becomes essential for providing advance threat notice and enabling timely response to cyber attacks. Our goal in this paper is to develop an approach to extract low-level cyber threat actions from publicly available CTI sources in an automated manner to enable timely defense decision making. Specifically, we applied the metrics of entropy and mutual information from information theory to analyze text in the cybersecurity domain. Combined with some basic NLP techniques, our framework, called ActionMiner, has achieved higher precision and recall than the state-of-the-art Stanford typed-dependency parser, which usually works well on general English but not on cybersecurity texts.
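The two information-theoretic quantities at the core of the approach are easy to compute from corpus counts (a generic sketch of Shannon entropy and pointwise mutual information; ActionMiner's exact scoring is not reproduced here):

```python
import math
from collections import Counter

def entropy(successors):
    """Shannon entropy of the word distribution following a term."""
    total = sum(successors.values())
    return -sum((c / total) * math.log2(c / total) for c in successors.values())

def pmi(p_xy, p_x, p_y):
    """Pointwise mutual information of a word pair (e.g., verb + object)."""
    return math.log2(p_xy / (p_x * p_y))

after_open = Counter({"file": 5, "socket": 3, "connection": 2})
print(entropy(after_open))    # higher entropy: "open" takes varied objects
print(pmi(0.01, 0.05, 0.04))  # strongly associated pair
```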
2018-04-11
Spanos, Georgios, Angelis, Lefteris, Toloudis, Dimitrios.  2017.  Assessment of Vulnerability Severity Using Text Mining. Proceedings of the 21st Pan-Hellenic Conference on Informatics. :49:1–49:6.
Software vulnerabilities are closely associated with information systems security, a major and critical field in today's technology. Vulnerabilities constitute a constant and increasing threat to various aspects of everyday life, especially safety and the economy, since the social impact of the problems they cause is complicated and often unpredictable. Although there is an entire research branch in software engineering that deals with the identification and elimination of vulnerabilities, the growing complexity of software products and the variability of software production procedures contribute to the ongoing occurrence of vulnerabilities. Hence, another area being developed in parallel focuses on the study and management of the vulnerabilities that have already been reported and registered in databases. The information contained in such databases includes a textual description and a number of metrics related to each vulnerability. The purpose of this paper is to investigate to what extent the assessment of vulnerability severity can be inferred directly from the corresponding textual description, or in other words, to examine the informative power of the description with respect to vulnerability severity. For this purpose, text mining techniques, i.e., text analysis and three different classification methods (decision trees, neural networks, and support vector machines), were employed. The application of text mining to a sample of 70,678 vulnerabilities from a public data source shows that the description itself is a reliable and highly accurate source of information for vulnerability prioritization.
2017-12-28
Stuckman, J., Walden, J., Scandariato, R..  2017.  The Effect of Dimensionality Reduction on Software Vulnerability Prediction Models. IEEE Transactions on Reliability. 66:17–37.
Statistical prediction models can be an effective technique to identify vulnerable components in large software projects. Two aspects of vulnerability prediction models have a profound impact on their performance: 1) the features (i.e., the characteristics of the software) that are used as predictors and 2) the way those features are used in the setup of the statistical learning machinery. In a previous work, we compared models based on two different types of features: software metrics and term frequencies (text mining features). In this paper, we broaden the set of models we compare by investigating an array of techniques for the manipulation of said features. These techniques fall under the umbrella of dimensionality reduction and have the potential to improve the ability of a prediction model to localize vulnerabilities. We explore the role of dimensionality reduction through a series of cross-validation and cross-project prediction experiments. Our results show that in the case of software metrics, a dimensionality reduction technique based on confirmatory factor analysis provided an advantage when performing cross-project prediction, yielding the best F-measure for the predictions in five out of six cases. In the case of text mining, feature selection can make the prediction computationally faster, but no dimensionality reduction technique provided any other notable advantage.
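For orientation, the two flavors of feature manipulation compared, selection for text mining features and reduction for software metrics, can be sketched with scikit-learn (generic examples; PCA stands in here for the paper's confirmatory factor analysis):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.decomposition import PCA

# Feature selection on term features (the text mining side).
docs = ["strcpy buffer overflow unchecked", "validated input length check"]
labels = [1, 0]
X_text = TfidfVectorizer().fit_transform(docs)
X_selected = SelectKBest(chi2, k=3).fit_transform(X_text, labels)

# Dimensionality reduction on software metrics (the metrics side).
metrics = np.random.rand(20, 10)              # 20 components x 10 metrics
reduced = PCA(n_components=3).fit_transform(metrics)
print(X_selected.shape, reduced.shape)        # (2, 3) (20, 3)
```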
2018-02-27
Nembhard, F., Carvalho, M., Eskridge, T..  2017.  A Hybrid Approach to Improving Program Security. 2017 IEEE Symposium Series on Computational Intelligence (SSCI). :1–8.
The security of computer programs and systems is a very critical issue. Given the number of attacks launched on computer networks and software, businesses and IT professionals are taking steps to ensure that their information systems are as secure as possible. However, many programmers do not think about adding security to their programs until their projects are near completion. This is a major mistake because a system is only as secure as its weakest link. If security is viewed as an afterthought, it is highly likely that the resulting system will have a large number of vulnerabilities, which could be exploited by attackers. One of the reasons programmers overlook adding security to their code is that it is viewed as a complicated or time-consuming process. This paper presents a tool that will help programmers think more about security and add security tactics to their code with ease. We created a model that learns from existing open source projects and documentation using machine learning and text mining techniques. Our tool contains a module that runs in the background to analyze code as the programmer types and offers suggestions of where security could be included. In addition, our tool fetches existing open source implementations of cryptographic algorithms and sample code from repositories to aid programmers in adding security easily to their projects.