Biblio

Filters: Keyword is topic modeling
2023-04-28
Suryotrisongko, Hatma, Ginardi, Hari, Ciptaningtyas, Henning Titi, Dehqan, Saeed, Musashi, Yasuo.  2022.  Topic Modeling for Cyber Threat Intelligence (CTI). 2022 Seventh International Conference on Informatics and Computing (ICIC). :1–7.
Topic modeling algorithms from the natural language processing (NLP) discipline have been used for various applications, for instance in product recommendation systems for e-commerce. In this paper, we briefly reviewed topic modeling applications and then described our proposed idea of utilizing topic modeling approaches for cyber threat intelligence (CTI) applications. We improved the previous work by implementing BERTopic and Top2Vec approaches, enabling users to select their preferred pre-trained text/sentence embedding model, and supporting various languages. We implemented our proposed idea as the new topic modeling module for the Open Web Application Security Project (OWASP) Maryam: Open-Source Intelligence (OSINT) framework. We also described our experiment results using a leaked hacker forum dataset (nulled.io) to attract more researchers and open-source communities to participate in the Maryam project of OWASP Foundation.
2022-05-19
Kuilboer, Jean-Pierre, Stull, Tristan.  2021.  Text Analytics and Big Data in the Financial domain. 2021 16th Iberian Conference on Information Systems and Technologies (CISTI). :1–4.
This research attempts to provide some insights on the application of text mining and Natural Language Processing (NLP). The application domain is consumer complaints about financial institutions in the USA. As an advanced analytics discipline embedded within the Big Data paradigm, the practice of text analytics contains elements of emergent knowledge processes. Since our experiment should be able to scale up, we make use of a pipeline based on Spark-NLP. The usage scenario is adapting the model to a specific industrial context and using the dataset offered by the "Consumer Financial Protection Bureau" to illustrate the application.
2020-11-20
Antoniadis, I. I., Chatzidimitriou, K. C., Symeonidis, A. L..  2019.  Security and Privacy for Smart Meters: A Data-Driven Mapping Study. 2019 IEEE PES Innovative Smart Grid Technologies Europe (ISGT-Europe). :1–5.
Smart metering systems have been gaining popularity as a vital part of the general smart grid paradigm. Naturally, as new technologies arise to cover this emerging field, so do security and privacy related issues regarding the energy consumer's personal data. These challenges impose the need for the development of new methods through a better understanding of the state-of-the-art. This paper aims at identifying the main categories of security and privacy techniques utilized in smart metering systems from a three-point perspective: i) a field research survey, ii) EU initiatives and findings towards the same direction and iii) a data-driven analysis of the state-of-the-art and the identification of its main topics (or themes) using topic modeling techniques. Detailed quantitative results of this analysis, such as semantic interpretation of the identified topics and a graph representation of the topic trends over time, are presented.
2019-03-15
Deliu, I., Leichter, C., Franke, K..  2018.  Collecting Cyber Threat Intelligence from Hacker Forums via a Two-Stage, Hybrid Process Using Support Vector Machines and Latent Dirichlet Allocation. 2018 IEEE International Conference on Big Data (Big Data). :5008–5013.

Traditional security controls, such as firewalls, anti-virus and IDS, are ill-equipped to help IT security and response teams keep pace with the rapid evolution of the cyber threat landscape. Cyber Threat Intelligence (CTI) can help remediate this problem by exploiting non-traditional information sources, such as hacker forums and "dark-web" social platforms. Security and response teams can use the collected intelligence to identify emerging threats. Unfortunately, manual analysis to extract CTI from non-traditional sources is a time-consuming, error-prone and resource-intensive process. We address these issues by using a hybrid Machine Learning model that automatically searches through hacker forum posts, identifies the posts that are most relevant to cyber security and then clusters the relevant posts into estimations of the topics that the hackers are discussing. The first (identification) stage uses Support Vector Machines and the second (clustering) stage uses Latent Dirichlet Allocation. We tested our model, using data from an actual hacker forum, to automatically extract information about various threats such as leaked credentials, malicious proxy servers, malware that evades AV detection, etc. The results demonstrate our method is an effective means for quickly extracting relevant and actionable intelligence that can be integrated with traditional security controls to increase their effectiveness.

2018-11-14
Adams, S., Carter, B., Fleming, C., Beling, P. A..  2018.  Selecting System Specific Cybersecurity Attack Patterns Using Topic Modeling. 2018 17th IEEE International Conference On Trust, Security And Privacy In Computing And Communications/ 12th IEEE International Conference On Big Data Science And Engineering (TrustCom/BigDataSE). :490–497.

One challenge for cybersecurity experts is deciding which type of attack would be successful against the system they wish to protect. Often, this challenge is addressed in an ad hoc fashion and is highly dependent upon the skill and knowledge base of the expert. In this study, we present a method for automatically ranking attack patterns in the Common Attack Pattern Enumeration and Classification (CAPEC) database for a given system. This ranking method is intended to produce suggested attacks to be evaluated by a cybersecurity expert and not a definitive ranking of the "best" attacks. The proposed method uses topic modeling to extract hidden topics from the textual description of each attack pattern and learn the parameters of a topic model. The posterior distribution of topics for the system is estimated using the model and any provided text. Attack patterns are ranked by measuring the distance between each attack topic distribution and the topic distribution of the system using KL divergence.
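
The ranking step described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the topic distributions are made-up numbers, the pattern names are hypothetical placeholders rather than real CAPEC entries, and the direction of the divergence (attack pattern relative to system) is an assumption, since the abstract does not state it.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same topics.

    eps is a small smoothing constant (an assumption of this sketch)
    that avoids division by zero when q has a zero entry.
    """
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )

def rank_attack_patterns(system_topics, pattern_topics):
    """Return pattern names ordered from smallest to largest divergence."""
    scores = {
        name: kl_divergence(dist, system_topics)
        for name, dist in pattern_topics.items()
    }
    return sorted(scores, key=scores.get)

# Hypothetical topic distributions over three topics.
system_topics = [0.7, 0.2, 0.1]
pattern_topics = {
    "pattern_a": [0.6, 0.3, 0.1],
    "pattern_b": [0.1, 0.1, 0.8],
    "pattern_c": [0.7, 0.2, 0.1],
}

ranking = rank_attack_patterns(system_topics, pattern_topics)
print(ranking)
```

Patterns whose topic distribution is closest to the system's appear first, which matches the abstract's framing of the output as suggestions for an expert to review rather than a definitive ranking.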

2018-02-06
Zhang, Y., Mao, W., Zeng, D..  2017.  Topic Evolution Modeling in Social Media Short Texts Based on Recurrent Semantic Dependent CRP. 2017 IEEE International Conference on Intelligence and Security Informatics (ISI). :119–124.

Social media has become an important platform for people to express opinions, share information and communicate with others. Detecting and tracking topics from social media can help people grasp essential information and facilitate many security-related applications. As social media texts are usually short, traditional topic evolution models built on LDA or HDP often suffer from the data sparsity problem. Recently proposed topic evolution models are more suitable for short texts, but they require the number of topics to be specified manually and keep it fixed across time periods. To address these issues, in this paper, we propose a nonparametric topic evolution model for social media short texts. We first propose the recurrent semantic dependent Chinese restaurant process (rsdCRP), which is a nonparametric process incorporating word embeddings to capture semantic similarity information. Then we combine rsdCRP with word co-occurrence modeling and build our short-text oriented topic evolution model sdTEM. We carry out experimental studies on a Twitter dataset. The results demonstrate the effectiveness of our method to monitor social media topic evolution compared to the baseline methods.
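
For readers unfamiliar with the underlying process, the following sketch samples from the plain Chinese restaurant process, which shows why the number of topics (tables) need not be fixed in advance. It is not rsdCRP: the recurrence across time periods and the word-embedding similarity described in the abstract are omitted, and the function name, the `alpha` concentration parameter, and the seed are choices made for this illustration.

```python
import random

def sample_crp(n_customers, alpha, seed=0):
    """Seat customers one by one under a standard Chinese restaurant process.

    Each customer joins an existing table with probability proportional to
    its current size, or opens a new table with probability proportional
    to alpha (alpha must be positive).
    """
    rng = random.Random(seed)
    table_sizes = []   # number of customers at each table
    assignments = []   # table index chosen by each customer
    for _ in range(n_customers):
        # The last weight corresponds to opening a new table.
        weights = table_sizes + [alpha]
        table = rng.choices(range(len(weights)), weights=weights)[0]
        if table == len(table_sizes):
            table_sizes.append(1)
        else:
            table_sizes[table] += 1
        assignments.append(table)
    return assignments, table_sizes

assignments, table_sizes = sample_crp(n_customers=50, alpha=1.0)
print(len(table_sizes), "tables for", len(assignments), "customers")
```

The number of tables grows with the data instead of being set beforehand, which is the nonparametric property the abstract relies on.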

2017-09-19
Hyun, Yoonjin, Kim, Namgyu.  2016.  Detecting Blog Spam Hashtags Using Topic Modeling. Proceedings of the 18th Annual International Conference on Electronic Commerce: E-Commerce in Smart Connected World. :43:1–43:6.

Tremendous amounts of data are generated daily. Accordingly, unstructured text data that is distributed through news, blogs, and social media has gained much attention from many researchers as this data contains abundant information about various consumers' opinions. However, as the usefulness of text data is increasing, attempts to gain profits by distorting text data maliciously or non-maliciously are also increasing. In this sense, various types of spam detection techniques have been studied to prevent the side effects of spamming. The most representative studies include e-mail spam detection, web spam detection, and opinion spam detection. "Spam" is recognized on the basis of three characteristics and actions: (1) if a certain user is recognized as a spammer, then all content created by that user should be recognized as spam; (2) if certain content is exposed to other users (regardless of the users' intention), then content is recognized as spam; and (3) any content that contains malicious or non-malicious false information is recognized as spam. Many studies have been performed to solve type (1) and type (2) spamming by analyzing various metadata, such as user networks and spam words. In the case of type (3), however, relatively few studies have been conducted because it is difficult to determine the veracity of a certain word or information. In this study, we regard a hashtag that is irrelevant to the content of a blog post as spam and devise a methodology to detect such spam hashtags.

2015-05-05
Eun Hee Ko, Klabjan, D..  2014.  Semantic Properties of Customer Sentiment in Tweets. Advanced Information Networking and Applications Workshops (WAINA), 2014 28th International Conference on. :657–663.

An increasing number of people are using online social networking services (SNSs), and a significant amount of information related to experiences in consumption is shared in this new media form. Text mining is an emerging technique for mining useful information from the web. We aim to discover semantic patterns in consumers' discussions on social media, in particular in tweets. Specifically, the purposes of this study are twofold: 1) finding similarity and dissimilarity between two sets of textual documents that include consumers' sentiment polarities (positive vs. negative opinions) and 2) deriving actual content from the textual data that has a semantic trend. The considered tweets include consumers' opinions on US retail companies (e.g., Amazon, Walmart). Cosine similarity and K-means clustering methods are used to achieve the former goal, and Latent Dirichlet Allocation (LDA), a popular topic modeling algorithm, is used for the latter purpose. This is the first study that discovers semantic properties of textual data in a consumption context beyond sentiment analysis. In addition to major findings, we applied LDA to the same data and drew latent topics that represent consumers' positive opinions and negative opinions on social media.
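
As a rough illustration of the cosine similarity measure mentioned in this abstract, the sketch below compares two short texts using raw word counts. The whitespace tokenization, the lack of any term weighting, and the example sentences are assumptions of this sketch; the abstract does not say how the authors built their document vectors.

```python
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Cosine similarity between the bag-of-words count vectors of two texts."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    # Only words present in both texts contribute to the dot product.
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0

# Made-up example texts, not taken from the study's data.
positive = "fast shipping and great prices"
negative = "slow shipping and rude support"
print(round(cosine_similarity(positive, negative), 3))
```

Values near 1 indicate texts with similar vocabulary and values near 0 indicate little overlap.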