Visible to the public Biblio

Filters: Keyword is text classification  [Clear All Filters]
2022-09-09
Raafat, Maryam A., El-Wakil, Rania Abdel-Fattah, Atia, Ayman.  2021.  Comparative study for Stylometric analysis techniques for authorship attribution. 2021 International Mobile, Intelligent, and Ubiquitous Computing Conference (MIUCC). :176—181.
A text is a meaningful source of information. Capturing the right patterns in written text gives metrics to measure and infer to what extent this text belongs or is relevant to a specific author. This research aims to introduce a new feature that goes more in deep in the language structure. The feature introduced is based on an attempt to differentiate stylistic changes among authors according to the different sentence structure each author uses. The study showed the effect of introducing this new feature to machine learning models to enhance their performance. It was found that the prediction of authors was enhanced by adding sentence structure as an additional feature as the f1\_scores increased by 0.3% and when normalizing the data and adding the feature it increased by 5%.
2022-05-19
Zhang, Cheng, Yamana, Hayato.  2021.  Improving Text Classification Using Knowledge in Labels. 2021 IEEE 6th International Conference on Big Data Analytics (ICBDA). :193–197.
Various algorithms and models have been proposed to address text classification tasks; however, they rarely consider incorporating the additional knowledge hidden in class labels. We argue that hidden information in class labels leads to better classification accuracy. In this study, instead of encoding the labels into numerical values, we incorporated the knowledge in the labels into the original model without changing the model architecture. We combined the output of an original classification model with the relatedness calculated based on the embeddings of a sequence and a keyword set. A keyword set is a word set to represent knowledge in the labels. Usually, it is generated from the classes while it could also be customized by the users. The experimental results show that our proposed method achieved statistically significant improvements in text classification tasks. The source code and experimental details of this study can be found on Github11https://github.com/HeroadZ/KiL.
2022-04-19
Luo, Jing, Xu, Guoqing.  2021.  XSS Attack Detection Methods Based on XLNet and GRU. 2021 4th International Conference on Robotics, Control and Automation Engineering (RCAE). :171–175.
With the progress of science and technology and the development of Internet technology, Internet technology has penetrated into various industries in today’s society. But this explosive growth is also troubling information security. Among them, XSS (cross-site scripting vulnerability) is one of the most influential vulnerabilities in Internet applications in recent years. Traditional network security detection technology is becoming more and more weak in the new network environment, and deep learning methods such as CNN and RNN can only learn the spatial or timing characteristics of data samples in a single way. In this paper, a generalized self-regression pretraining model XLNet and GRU XSS attack detection method is proposed, the self-regression pretrained model XLNet is introduced and combined with GRU to learn the time series and spatial characteristics of the data, and the generalization capability of the model is improved by using dropout. Faced with the increasingly complex and ever-changing XSS payload, this paper refers to the character-level convolution to establish a dictionary to encode the data samples, thus preserving the characteristics of the original data and improving the overall efficiency, and then transforming it into a two-dimensional spatial matrix to meet XLNet’s input requirements. The experimental results on the Github data set show that the accuracy of this method is 99.92 percent, the false positive rate is 0.02 percent, the accuracy rate is 11.09 percent higher than that of the DNN method, the false positive rate is 3.95 percent lower, and other evaluation indicators are better than GRU, CNN and other comparative methods, which can improve the detection accuracy and system stability of the whole detection system. This multi-model fusion method can make full use of the advantages of each model to improve the accuracy of system detection, on the other hand, it can also enhance the stability of the system.
2022-01-25
Marulli, Fiammetta, Balzanella, Antonio, Campanile, Lelio, Iacono, Mauro, Mastroianni, Michele.  2021.  Exploring a Federated Learning Approach to Enhance Authorship Attribution of Misleading Information from Heterogeneous Sources. 2021 International Joint Conference on Neural Networks (IJCNN). :1–8.
Authorship Attribution (AA) is currently applied in several applications, among which fraud detection and anti-plagiarism checks: this task can leverage stylometry and Natural Language Processing techniques. In this work, we explored some strategies to enhance the performance of an AA task for the automatic detection of false and misleading information (e.g., fake news). We set up a text classification model for AA based on stylometry exploiting recurrent deep neural networks and implemented two learning tasks trained on the same collection of fake and real news, comparing their performances: one is based on Federated Learning architecture, the other on a centralized architecture. The goal was to discriminate potential fake information from true ones when the fake news comes from heterogeneous sources, with different styles. Preliminary experiments show that a distributed approach significantly improves recall with respect to the centralized model. As expected, precision was lower in the distributed model. This aspect, coupled with the statistical heterogeneity of data, represents some open issues that will be further investigated in future work.
2021-11-29
Hu, Shengze, He, Chunhui, Ge, Bin, Liu, Fang.  2020.  Enhanced Word Embedding Method in Text Classification. 2020 6th International Conference on Big Data and Information Analytics (BigDIA). :18–22.
For the task of natural language processing (NLP), Word embedding technology has a certain impact on the accuracy of deep neural network algorithms. Considering that the current word embedding method cannot realize the coexistence of words and phrases in the same vector space. Therefore, we propose an enhanced word embedding (EWE) method. Before completing the word embedding, this method introduces a unique sentence reorganization technology to rewrite all the sentences in the original training corpus. Then, all the original corpus and the reorganized corpus are merged together as the training corpus of the distributed word embedding model, so as to realize the coexistence problem of words and phrases in the same vector space. We carried out experiment to demonstrate the effectiveness of the EWE algorithm on three classic benchmark datasets. The results show that the EWE method can significantly improve the classification performance of the CNN model.
Zhang, Qiang, Chai, Bo, Song, Bochuan, Zhao, Jingpeng.  2020.  A Hierarchical Fine-Tuning Based Approach for Multi-Label Text Classification. 2020 IEEE 5th International Conference on Cloud Computing and Big Data Analytics (ICCCBDA). :51–54.
Hierarchical Text classification has recently become increasingly challenging with the growing number of classification labels. In this paper, we propose a hierarchical fine-tuning based approach for hierarchical text classification. We use the ordered neurons LSTM (ONLSTM) model by combining the embedding of text and parent category for hierarchical text classification with a large number of categories, which makes full use of the connection between the upper-level and lower-level labels. Extensive experiments show that our model outperforms the state-of-the-art hierarchical model at a lower computation cost.
2020-12-14
Yu, L., Chen, L., Dong, J., Li, M., Liu, L., Zhao, B., Zhang, C..  2020.  Detecting Malicious Web Requests Using an Enhanced TextCNN. 2020 IEEE 44th Annual Computers, Software, and Applications Conference (COMPSAC). :768–777.
This paper proposes an approach that combines a deep learning-based method and a traditional machine learning-based method to efficiently detect malicious requests Web servers received. The first few layers of Convolutional Neural Network for Text Classification (TextCNN) are used to automatically extract powerful semantic features and in the meantime transferable statistical features are defined to boost the detection ability, specifically Web request parameter tampering. The semantic features from TextCNN and transferable statistical features from artificially-designing are grouped together to be fed into Support Vector Machine (SVM), replacing the last layer of TextCNN for classification. To facilitate the understanding of abstract features in form of numerical data in vectors extracted by TextCNN, this paper designs trace-back functions that map max-pooling outputs back to words in Web requests. After investigating the current available datasets for Web attack detection, HTTP Dataset CSIC 2010 is selected to test and verify the proposed approach. Compared with other deep learning models, the experimental results demonstrate that the approach proposed in this paper is competitive with the state-of-the-art.
2020-11-02
Pan, C., Huang, J., Gong, J., Yuan, X..  2019.  Few-Shot Transfer Learning for Text Classification With Lightweight Word Embedding Based Models. IEEE Access. 7:53296–53304.
Many deep learning architectures have been employed to model the semantic compositionality for text sequences, requiring a huge amount of supervised data for parameters training, making it unfeasible in situations where numerous annotated samples are not available or even do not exist. Different from data-hungry deep models, lightweight word embedding-based models could represent text sequences in a plug-and-play way due to their parameter-free property. In this paper, a modified hierarchical pooling strategy over pre-trained word embeddings is proposed for text classification in a few-shot transfer learning way. The model leverages and transfers knowledge obtained from some source domains to recognize and classify the unseen text sequences with just a handful of support examples in the target problem domain. The extensive experiments on five datasets including both English and Chinese text demonstrate that the simple word embedding-based models (SWEMs) with parameter-free pooling operations are able to abstract and represent the semantic text. The proposed modified hierarchical pooling method exhibits significant classification performance in the few-shot transfer learning tasks compared with other alternative methods.
2020-09-28
Liu, Kai, Zhou, Yun, Wang, Qingyong, Zhu, Xianqiang.  2019.  Vulnerability Severity Prediction With Deep Neural Network. 2019 5th International Conference on Big Data and Information Analytics (BigDIA). :114–119.
High frequency of network security incidents has also brought a lot of negative effects and even huge economic losses to countries, enterprises and individuals in recent years. Therefore, more and more attention has been paid to the problem of network security. In order to evaluate the newly included vulnerability text information accurately, and to reduce the workload of experts and the false negative rate of the traditional method. Multiple deep learning methods for vulnerability text classification evaluation are proposed in this paper. The standard Cross Site Scripting (XSS) vulnerability text data is processed first, and then classified using three kinds of deep neural networks (CNN, LSTM, TextRCNN) and one kind of traditional machine learning method (XGBoost). The dropout ratio of the optimal CNN network, the epoch of all deep neural networks and training set data were tuned via experiments to improve the fit on our target task. The results show that the deep learning methods evaluate vulnerability risk levels better, compared with traditional machine learning methods, but cost more time. We train our models in various training sets and test with the same testing set. The performance and utility of recurrent convolutional neural networks (TextRCNN) is highest in comparison to all other methods, which classification accuracy rate is 93.95%.
2020-08-28
BOUGHACI, Dalila, BENMESBAH, Mounir, ZEBIRI, Aniss.  2019.  An improved N-grams based Model for Authorship Attribution. 2019 International Conference on Computer and Information Sciences (ICCIS). :1—6.

Authorship attribution is the problem of studying an anonymous text and finding the corresponding author in a set of candidate authors. In this paper, we propose a method based on N-grams model for the problem of authorship attribution. Several measures are used to assign an anonymous text to an author. The different variants of the proposed method are implemented and validated on PAN benchmarks. The numerical results are encouraging and demonstrate the benefit of the proposed idea.

2020-05-18
Sel, Slhami, Hanbay, Davut.  2019.  E-Mail Classification Using Natural Language Processing. 2019 27th Signal Processing and Communications Applications Conference (SIU). :1–4.
Thanks to the rapid increase in technology and electronic communications, e-mail has become a serious communication tool. In many applications such as business correspondence, reminders, academic notices, web page memberships, e-mail is used as primary way of communication. If we ignore spam e-mails, there remain hundreds of e-mails received every day. In order to determine the importance of received e-mails, the subject or content of each e-mail must be checked. In this study we proposed an unsupervised system to classify received e-mails. Received e-mails' coordinates are determined by a method of natural language processing called as Word2Vec algorithm. According to the similarities, processed data are grouped by k-means algorithm with an unsupervised training model. In this study, 10517 e-mails were used in training. The success of the system is tested on a test group of 200 e-mails. In the test phase M3 model (window size 3, min. Word frequency 10, Gram skip) consolidated the highest success (91%). Obtained results are evaluated in section VI.
2020-01-27
Tang, Xuemei, Liang, Shichen, Liu, Zhiying.  2019.  Authorship Attribution of The Golden Lotus Based on Text Classification Methods. Proceedings of the 2019 3rd International Conference on Innovation in Artificial Intelligence. :69–72.

In this paper, we explore the authorship attribution of The Golden Lotus using the traditional machine learning method of text classification. There are four candidate authors: Shizhen Wang, Wei Xu, Kaixian Li and Zhideng Wang. We choose The Golden Lotus's poems and four candidate authors' poems as data set. According to the characteristics of Chinese ancient poem, we choose Chinese character, rhyme, genre and overlapped word as features. We use six supervised machine learning algorithms, including Logistic Regression, Random Forests, Decision Tree and Naive Bayes, SVM and KNN classifiers respectively for text binary classification and multi-classification. According to two experiments results, the style of writing of Wei Xu's poems is the most similar to that of The Golden Lotus. It is proved that among four authors, Wei Xu most likely be the author of The Golden Lotus.

2019-03-15
Deliu, I., Leichter, C., Franke, K..  2018.  Collecting Cyber Threat Intelligence from Hacker Forums via a Two-Stage, Hybrid Process Using Support Vector Machines and Latent Dirichlet Allocation. 2018 IEEE International Conference on Big Data (Big Data). :5008-5013.

Traditional security controls, such as firewalls, anti-virus and IDS, are ill-equipped to help IT security and response teams keep pace with the rapid evolution of the cyber threat landscape. Cyber Threat Intelligence (CTI) can help remediate this problem by exploiting non-traditional information sources, such as hacker forums and "dark-web" social platforms. Security and response teams can use the collected intelligence to identify emerging threats. Unfortunately, when manual analysis is used to extract CTI from non-traditional sources, it is a time consuming, error-prone and resource intensive process. We address these issues by using a hybrid Machine Learning model that automatically searches through hacker forum posts, identifies the posts that are most relevant to cyber security and then clusters the relevant posts into estimations of the topics that the hackers are discussing. The first (identification) stage uses Support Vector Machines and the second (clustering) stage uses Latent Dirichlet Allocation. We tested our model, using data from an actual hacker forum, to automatically extract information about various threats such as leaked credentials, malicious proxy servers, malware that evades AV detection, etc. The results demonstrate our method is an effective means for quickly extracting relevant and actionable intelligence that can be integrated with traditional security controls to increase their effectiveness.

2019-03-04
Aborisade, O., Anwar, M..  2018.  Classification for Authorship of Tweets by Comparing Logistic Regression and Naive Bayes Classifiers. 2018 IEEE International Conference on Information Reuse and Integration (IRI). :269–276.

At a time when all it takes to open a Twitter account is a mobile phone, the act of authenticating information encountered on social media becomes very complex, especially when we lack measures to verify digital identities in the first place. Because the platform supports anonymity, fake news generated by dubious sources have been observed to travel much faster and farther than real news. Hence, we need valid measures to identify authors of misinformation to avert these consequences. Researchers propose different authorship attribution techniques to approach this kind of problem. However, because tweets are made up of only 280 characters, finding a suitable authorship attribution technique is a challenge. This research aims to classify authors of tweets by comparing machine learning methods like logistic regression and naive Bayes. The processes of this application are fetching of tweets, pre-processing, feature extraction, and developing a machine learning model for classification. This paper illustrates the text classification for authorship process using machine learning techniques. In total, there were 46,895 tweets used as both training and testing data, and unique features specific to Twitter were extracted. Several steps were done in the pre-processing phase, including removal of short texts, removal of stop-words and punctuations, tokenizing and stemming of texts as well. This approach transforms the pre-processed data into a set of feature vector in Python. Logistic regression and naive Bayes algorithms were applied to the set of feature vectors for the training and testing of the classifier. The logistic regression based classifier gave the highest accuracy of 91.1% compared to the naive Bayes classifier with 89.8%.

2019-02-25
Peng, W., Huang, L., Jia, J., Ingram, E..  2018.  Enhancing the Naive Bayes Spam Filter Through Intelligent Text Modification Detection. 2018 17th IEEE International Conference On Trust, Security And Privacy In Computing And Communications/ 12th IEEE International Conference On Big Data Science And Engineering (TrustCom/BigDataSE). :849–854.

Spam emails have been a chronic issue in computer security. They are very costly economically and extremely dangerous for computers and networks. Despite of the emergence of social networks and other Internet based information exchange venues, dependence on email communication has increased over the years and this dependence has resulted in an urgent need to improve spam filters. Although many spam filters have been created to help prevent these spam emails from entering a user's inbox, there is a lack or research focusing on text modifications. Currently, Naive Bayes is one of the most popular methods of spam classification because of its simplicity and efficiency. Naive Bayes is also very accurate; however, it is unable to correctly classify emails when they contain leetspeak or diacritics. Thus, in this proposes, we implemented a novel algorithm for enhancing the accuracy of the Naive Bayes Spam Filter so that it can detect text modifications and correctly classify the email as spam or ham. Our Python algorithm combines semantic based, keyword based, and machine learning algorithms to increase the accuracy of Naive Bayes compared to Spamassassin by over two hundred percent. Additionally, we have discovered a relationship between the length of the email and the spam score, indicating that Bayesian Poisoning, a controversial topic, is actually a real phenomenon and utilized by spammers.

2019-02-14
Georgakopoulos, Spiros V., Tasoulis, Sotiris K., Vrahatis, Aristidis G., Plagianakos, Vassilis P..  2018.  Convolutional Neural Networks for Toxic Comment Classification. Proceedings of the 10th Hellenic Conference on Artificial Intelligence. :35:1-35:6.
Flood of information is produced in a daily basis through the global internet usage arising from the online interactive communications among users. While this situation contributes significantly to the quality of human life, unfortunately it involves enormous dangers, since online texts with high toxicity can cause personal attacks, online harassment and bullying behaviors. This has triggered both industrial and research community in the last few years while there are several attempts to identify an efficient model for online toxic comment prediction. However, these steps are still in their infancy and new approaches and frameworks are required. On parallel, the data explosion that appears constantly, makes the construction of new machine learning computational tools for managing this information, an imperative need. Thankfully advances in hardware, cloud computing and big data management allow the development of Deep Learning approaches appearing very promising performance so far. For text classification in particular the use of Convolutional Neural Networks (CNN) have recently been proposed approaching text analytics in a modern manner emphasizing in the structure of words in a document. In this work, we employ this approach to discover toxic comments in a large pool of documents provided by a current Kaggle's competition regarding Wikipedia's talk page edits. To justify this decision we choose to compare CNNs against the traditional bag-of-words approach for text analysis combined with a selection of algorithms proven to be very effective in text classification. The reported results provide enough evidence that CNN enhance toxic comment classification reinforcing research interest towards this direction.
Zhu, Yimin, Woo, Simon S..  2018.  Adversarial Product Review Generation with Word Replacements. Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security. :2324-2326.

Machine learning algorithms including Deep Neural Networks (DNNs) have shown great success in many different areas. However, they are frequently susceptible to adversarial examples, which are maliciously crafted inputs to fool machine learning classifiers. On the other hand, humans cannot distinguish between non-adversarial and adversarial inputs. In this work, we focus on creating adversarial examples to change the polarity of positive and negative reviews with Amazon product review dataset. We introduce a simple heuristics algorithm to construct adversarial product reviews by replacing words with semantically and synthetically similar synonyms. We evaluate our approach against the state-of-the-art CNN-BLSTM classifier. Our preliminary results show the performance drop of the classifier against the adversarial examples. We also present the defense mechanism using adversarial training.

2019-01-16
Gao, J., Lanchantin, J., Soffa, M. L., Qi, Y..  2018.  Black-Box Generation of Adversarial Text Sequences to Evade Deep Learning Classifiers. 2018 IEEE Security and Privacy Workshops (SPW). :50–56.

Although various techniques have been proposed to generate adversarial samples for white-box attacks on text, little attention has been paid to a black-box attack, which is a more realistic scenario. In this paper, we present a novel algorithm, DeepWordBug, to effectively generate small text perturbations in a black-box setting that forces a deep-learning classifier to misclassify a text input. We develop novel scoring strategies to find the most important words to modify such that the deep classifier makes a wrong prediction. Simple character-level transformations are applied to the highest-ranked words in order to minimize the edit distance of the perturbation. We evaluated DeepWordBug on two real-world text datasets: Enron spam emails and IMDB movie reviews. Our experimental results indicate that DeepWordBug can reduce the classification accuracy from 99% to 40% on Enron and from 87% to 26% on IMDB. Our results strongly demonstrate that the generated adversarial sequences from a deep-learning model can similarly evade other deep models.

2018-04-11
Deliu, I., Leichter, C., Franke, K..  2017.  Extracting Cyber Threat Intelligence from Hacker Forums: Support Vector Machines versus Convolutional Neural Networks. 2017 IEEE International Conference on Big Data (Big Data). :3648–3656.

Hacker forums and other social platforms may contain vital information about cyber security threats. But using manual analysis to extract relevant threat information from these sources is a time consuming and error-prone process that requires a significant allocation of resources. In this paper, we explore the potential of Machine Learning methods to rapidly sift through hacker forums for relevant threat intelligence. Utilizing text data from a real hacker forum, we compared the text classification performance of Convolutional Neural Network methods against more traditional Machine Learning approaches. We found that traditional machine learning methods, such as Support Vector Machines, can yield high levels of performance that are on par with Convolutional Neural Network algorithms.

2018-02-06
Huang, Lulu, Matwin, Stan, de Carvalho, Eder J., Minghim, Rosane.  2017.  Active Learning with Visualization for Text Data. Proceedings of the 2017 ACM Workshop on Exploratory Search and Interactive Data Analytics. :69–74.

Labeled datasets are always limited, and oftentimes the quantity of labeled data is a bottleneck for data analytics. This especially affects supervised machine learning methods, which require labels for models to learn from the labeled data. Active learning algorithms have been proposed to help achieve good analytic models with limited labeling efforts, by determining which additional instance labels will be most beneficial for learning for a given model. Active learning is consistent with interactive analytics as it proceeds in a cycle in which the unlabeled data is automatically explored. However, in active learning users have no control of the instances to be labeled, and for text data, the annotation interface is usually document only. Both of these constraints seem to affect the performance of an active learning model. We hypothesize that visualization techniques, particularly interactive ones, will help to address these constraints. In this paper, we implement a pilot study of visualization in active learning for text classification, with an interactive labeling interface. We compare the results of three experiments. Early results indicate that visualization improves high-performance machine learning model building with an active learning algorithm.

2018-01-10
Devyatkin, D., Smirnov, I., Ananyeva, M., Kobozeva, M., Chepovskiy, A., Solovyev, F..  2017.  Exploring linguistic features for extremist texts detection (on the material of Russian-speaking illegal texts). 2017 IEEE International Conference on Intelligence and Security Informatics (ISI). :188–190.

In this paper we present results of a research on automatic extremist text detection. For this purpose an experimental dataset in the Russian language was created. According to the Russian legislation we cannot make it publicly available. We compared various classification methods (multinomial naive Bayes, logistic regression, linear SVM, random forest, and gradient boosting) and evaluated the contribution of differentiating features (lexical, semantic and psycholinguistic) to classification quality. The results of experiments show that psycholinguistic and semantic features are promising for extremist text detection.

2017-11-20
You, L., Li, Y., Wang, Y., Zhang, J., Yang, Y..  2016.  A deep learning-based RNNs model for automatic security audit of short messages. 2016 16th International Symposium on Communications and Information Technologies (ISCIT). :225–229.

The traditional text classification methods usually follow this process: first, a sentence can be considered as a bag of words (BOW), then transformed into sentence feature vector which can be classified by some methods, such as maximum entropy (ME), Naive Bayes (NB), support vector machines (SVM), and so on. However, when these methods are applied to text classification, we usually can not obtain an ideal result. The most important reason is that the semantic relations between words is very important for text categorization, however, the traditional method can not capture it. Sentiment classification, as a special case of text classification, is binary classification (positive or negative). Inspired by the sentiment analysis, we use a novel deep learning-based recurrent neural networks (RNNs)model for automatic security audit of short messages from prisons, which can classify short messages(secure and non-insecure). In this paper, the feature of short messages is extracted by word2vec which captures word order information, and each sentence is mapped to a feature vector. In particular, words with similar meaning are mapped to a similar position in the vector space, and then classified by RNNs. RNNs are now widely used and the network structure of RNNs determines that it can easily process the sequence data. We preprocess short messages, extract typical features from existing security and non-security short messages via word2vec, and classify short messages through RNNs which accept a fixed-sized vector as input and produce a fixed-sized vector as output. The experimental results show that the RNNs model achieves an average 92.7% accuracy which is higher than SVM.