Biblio

Found 17 results

Filters: Keyword is text categorization

2023-02-17

Georgieva-Trifonova, Tsvetanka. 2022. Research on Filtering Feature Selection Methods for E-Mail Spam Detection by Applying K-NN Classifier. 2022 International Congress on Human-Computer Interaction, Optimization and Robotic Applications (HORA). :1–4.

In the present paper, the application of filtering methods to select features when detecting email spam using the K-NN classifier is examined. The experiments include computation of the accuracy and F-measure of the e-mail texts classification with different methods for feature selection, different number of selected features and two ways to find the distance between dataset examples when executing K-NN classifier - Euclidean distance and cosine similarity. The obtained results are summarized and analyzed.
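
The two ways of comparing dataset examples named in the abstract above, Euclidean distance and cosine similarity, can be sketched as follows. This is a minimal illustration and not the paper's implementation: the function names, the toy feature vectors, and the choice k=3 are all assumptions made for the example.

```python
import math
from collections import Counter


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def knn_predict(train, query, k=3, metric="euclidean"):
    """Majority vote among the k training examples closest to `query`.

    `train` is a list of (vector, label) pairs (hypothetical toy format).
    """
    if metric == "euclidean":
        # Smaller distance means closer.
        ranked = sorted(train, key=lambda item: euclidean_distance(item[0], query))
    elif metric == "cosine":
        # Larger similarity means closer, so sort on the negated value.
        ranked = sorted(train, key=lambda item: -cosine_similarity(item[0], query))
    else:
        raise ValueError(f"unknown metric: {metric}")
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical two-feature vectors, e.g. counts of two selected terms.
train = [
    ([3, 0], "spam"), ([4, 1], "spam"), ([5, 0], "spam"),
    ([0, 2], "ham"), ([1, 3], "ham"),
]
query = [0.5, 0.0]
print(knn_predict(train, query, k=3, metric="euclidean"))
print(knn_predict(train, query, k=3, metric="cosine"))
```

On this toy data the two metrics disagree for the short query vector: Euclidean distance is sensitive to vector length, while cosine similarity only compares direction.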

2023-02-03

Praveen, Sivakami, Dcouth, Alysha, Mahesh, A S. 2022. NoSQL Injection Detection Using Supervised Text Classification. 2022 2nd International Conference on Intelligent Technologies (CONIT). :1–5.

For a long time, SQL injection has been considered one of the most serious security threats. NoSQL databases are becoming increasingly popular as big data and cloud computing technologies progress. NoSQL injection attacks are designed to take advantage of applications that employ NoSQL databases. NoSQL injections can be particularly harmful because they allow unrestricted code execution. In this paper we use supervised learning and natural language processing to construct a model to detect NoSQL injections. Our model is designed to work with MongoDB, CouchDB, CassandraDB, and Couchbase queries. Our model has achieved an F1 score of 0.95 as established by 10-fold cross validation.

2022-09-09

Raafat, Maryam A., El-Wakil, Rania Abdel-Fattah, Atia, Ayman. 2021. Comparative study for Stylometric analysis techniques for authorship attribution. 2021 International Mobile, Intelligent, and Ubiquitous Computing Conference (MIUCC). :176–181.

A text is a meaningful source of information. Capturing the right patterns in written text gives metrics to measure and infer to what extent this text belongs or is relevant to a specific author. This research aims to introduce a new feature that goes deeper into the language structure. The feature introduced is based on an attempt to differentiate stylistic changes among authors according to the different sentence structure each author uses. The study showed the effect of introducing this new feature to machine learning models to enhance their performance. It was found that the prediction of authors was enhanced by adding sentence structure as an additional feature: the F1 scores increased by 0.3%, and by 5% when the data was also normalized.

Gonçalves, Luís, Vimieiro, Renato. 2021. Approaching authorship attribution as a multi-view supervised learning task. 2021 International Joint Conference on Neural Networks (IJCNN). :1–8.

Authorship attribution is the problem of identifying the author of texts based on the author's writing style. It is usually assumed that the writing style contains traits inaccessible to conscious manipulation and can thus be safely used to identify the author of a text. Several style markers have been proposed in the literature; nevertheless, there is still no consensus on which best represent the choices of authors. Here we assume an agnostic viewpoint on the dispute for the best set of features that represents an author's writing style. We rather investigate how different sources of information may unveil different aspects of an author's style, complementing each other to improve the overall process of authorship attribution. For this we model authorship attribution as a multi-view learning task. We assess the effectiveness of our proposal by applying it to a set of well-studied corpora. We compare the performance of our proposal to the state-of-the-art approaches for authorship attribution. We thoroughly analyze how the multi-view approach improves on methods that use a single data source. We confirm that our approach improves both the accuracy and the consistency of the methods and discuss how these improvements are beneficial for linguists and domain specialists.

2022-05-19

Anusha, M, Leelavathi, R. 2021. Analysis on Sentiment Analytics Using Deep Learning Techniques. 2021 Fifth International Conference on I-SMAC (IoT in Social, Mobile, Analytics and Cloud) (I-SMAC). :542–547.

Sentiment analytics is the process of applying natural language processing and methods for text-based information to define and extract subjective knowledge of the text. Natural language processing and text classifications can deal with limited corpus data and more attention has been gained by semantic texts and word embedding methods. Deep learning is a powerful method that learns different layers of representations or qualities of information and produces state-of-the-art prediction results. In different applications of sentiment analytics, deep learning methods are used at the sentence, document, and aspect levels. This review paper is based on the main difficulties in the sentiment assessment stage that significantly affect sentiment score, pooling, and polarity detection. The most popular deep learning methods are a Convolution Neural Network and Recurrent Neural Network. Finally, a comparative study is made with a vast literature survey using deep learning models.

Qing-chao, Ni, Cong-jue, Yin, Dong-hua, Zhao. 2021. Research on Small Sample Text Classification Based on Attribute Extraction and Data Augmentation. 2021 IEEE 6th International Conference on Cloud Computing and Big Data Analytics (ICCCBDA). :53–57.

With the development of deep learning and the progress of natural language processing technology, as well as the continuous disclosure of judicial data such as judicial documents, legal intelligence has gradually become a research hot spot. The crime classification task is an important branch of text classification, which can help people related to the law to improve their work efficiency. However, in actual research, the sample data is small and the distribution of crime categories is not balanced. To solve these two problems, BERT was used as the encoder to address the small data volume, and an attribute extraction network was added to address the unbalanced distribution. An accuracy of 90.35% was achieved on the small-sample data set, with an F1 value of 67.62, which was close to the best model performance under sufficient data. Finally, a text enhancement method based on back-translation technology is proposed. Experiments with different models show that the LSTM model improves to some extent, whereas BERT does not.

Zhang, Cheng, Yamana, Hayato. 2021. Improving Text Classification Using Knowledge in Labels. 2021 IEEE 6th International Conference on Big Data Analytics (ICBDA). :193–197.

Various algorithms and models have been proposed to address text classification tasks; however, they rarely consider incorporating the additional knowledge hidden in class labels. We argue that hidden information in class labels leads to better classification accuracy. In this study, instead of encoding the labels into numerical values, we incorporated the knowledge in the labels into the original model without changing the model architecture. We combined the output of an original classification model with the relatedness calculated based on the embeddings of a sequence and a keyword set. A keyword set is a word set to represent knowledge in the labels. Usually, it is generated from the classes while it could also be customized by the users. The experimental results show that our proposed method achieved statistically significant improvements in text classification tasks. The source code and experimental details of this study can be found on GitHub: https://github.com/HeroadZ/KiL.

Wu, Juan. 2021. Long Text Filtering in English Translation based on LSTM Semantic Association. 2021 Fifth International Conference on I-SMAC (IoT in Social, Mobile, Analytics and Cloud) (I-SMAC). :740–743.

Translation studies is one of the fastest growing interdisciplinary research fields in the world today. Business English is an urgent research direction in the field of translation studies. To some extent, the quality of business English translation directly determines the success or failure of international trade and the economic benefits. On the basis of sequence information encoding and decoding model of LSTM, this paper proposes a strategy combining attention mechanism with bidirectional LSTM model to handle the question of feature extraction of text information. The proposed method reduces the semantic complexity and improves the overall correlation accuracy. The experimental results show its advantages.

2022-04-13

Issifu, Abdul Majeed, Ganiz, Murat Can. 2021. A Simple Data Augmentation Method to Improve the Performance of Named Entity Recognition Models in Medical Domain. 2021 6th International Conference on Computer Science and Engineering (UBMK). :763–768.

Easy Data Augmentation was originally developed for text classification tasks. It consists of four basic methods: Synonym Replacement, Random Insertion, Random Deletion, and Random Swap. They yield accuracy improvements on several deep neural network models. In this study we apply these methods to a new domain. We augment Named Entity Recognition datasets from the medical domain. Although the augmentation task is much more difficult due to the nature of named entities, which consist of words or word groups in the sentences, we show that we can improve the named entity recognition performance.
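
The four operations listed in the abstract above can be sketched on a plain token list as follows. This is only an illustration of the general idea: the tiny `SYNONYMS` table, the function names, and the parameters are hypothetical, and the sketch does not handle the entity labels that the paper's Named Entity Recognition setting would have to keep aligned.

```python
import random

# Hypothetical synonym table; a real setup would use a thesaurus resource.
SYNONYMS = {"pain": ["ache"], "severe": ["intense", "acute"]}


def synonym_replacement(tokens, n, rng):
    """Replace up to n tokens that have a known synonym."""
    out = list(tokens)
    candidates = [i for i, t in enumerate(out) if t in SYNONYMS]
    rng.shuffle(candidates)
    for i in candidates[:n]:
        out[i] = rng.choice(SYNONYMS[out[i]])
    return out


def random_insertion(tokens, n, rng):
    """Insert a synonym of a random known token at a random position, n times."""
    out = list(tokens)
    for _ in range(n):
        known = [t for t in out if t in SYNONYMS]
        if not known:
            break
        synonym = rng.choice(SYNONYMS[rng.choice(known)])
        out.insert(rng.randint(0, len(out)), synonym)
    return out


def random_swap(tokens, n, rng):
    """Swap two randomly chosen positions, n times."""
    out = list(tokens)
    if len(out) < 2:
        return out
    for _ in range(n):
        i, j = rng.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out


def random_deletion(tokens, p, rng):
    """Drop each token with probability p, keeping at least one token."""
    if not tokens:
        return []
    kept = [t for t in tokens if rng.random() > p]
    return kept if kept else [rng.choice(tokens)]


rng = random.Random(0)  # fixed seed so the example is repeatable
sentence = "patient reports severe pain in chest".split()
print(synonym_replacement(sentence, 1, rng))
print(random_insertion(sentence, 1, rng))
print(random_swap(sentence, 1, rng))
print(random_deletion(sentence, 0.2, rng))
```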

2022-01-25

Marulli, Fiammetta, Balzanella, Antonio, Campanile, Lelio, Iacono, Mauro, Mastroianni, Michele. 2021. Exploring a Federated Learning Approach to Enhance Authorship Attribution of Misleading Information from Heterogeneous Sources. 2021 International Joint Conference on Neural Networks (IJCNN). :1–8.

Authorship Attribution (AA) is currently applied in several applications, among which fraud detection and anti-plagiarism checks: this task can leverage stylometry and Natural Language Processing techniques. In this work, we explored some strategies to enhance the performance of an AA task for the automatic detection of false and misleading information (e.g., fake news). We set up a text classification model for AA based on stylometry exploiting recurrent deep neural networks and implemented two learning tasks trained on the same collection of fake and real news, comparing their performances: one is based on Federated Learning architecture, the other on a centralized architecture. The goal was to discriminate potential fake information from true ones when the fake news comes from heterogeneous sources, with different styles. Preliminary experiments show that a distributed approach significantly improves recall with respect to the centralized model. As expected, precision was lower in the distributed model. This aspect, coupled with the statistical heterogeneity of data, represents some open issues that will be further investigated in future work.

2021-11-29

Hu, Shengze, He, Chunhui, Ge, Bin, Liu, Fang. 2020. Enhanced Word Embedding Method in Text Classification. 2020 6th International Conference on Big Data and Information Analytics (BigDIA). :18–22.

For natural language processing (NLP) tasks, word embedding technology has a certain impact on the accuracy of deep neural network algorithms. Current word embedding methods cannot realize the coexistence of words and phrases in the same vector space; we therefore propose an enhanced word embedding (EWE) method. Before completing the word embedding, this method introduces a unique sentence reorganization technology to rewrite all the sentences in the original training corpus. Then, the original corpus and the reorganized corpus are merged together as the training corpus of the distributed word embedding model, so as to realize the coexistence of words and phrases in the same vector space. We carried out experiments to demonstrate the effectiveness of the EWE algorithm on three classic benchmark datasets. The results show that the EWE method can significantly improve the classification performance of the CNN model.

Zhang, Qiang, Chai, Bo, Song, Bochuan, Zhao, Jingpeng. 2020. A Hierarchical Fine-Tuning Based Approach for Multi-Label Text Classification. 2020 IEEE 5th International Conference on Cloud Computing and Big Data Analytics (ICCCBDA). :51–54.

Hierarchical text classification has recently become increasingly challenging with the growing number of classification labels. In this paper, we propose a hierarchical fine-tuning based approach for hierarchical text classification. We use the ordered neurons LSTM (ONLSTM) model by combining the embedding of text and parent category for hierarchical text classification with a large number of categories, which makes full use of the connection between the upper-level and lower-level labels. Extensive experiments show that our model outperforms the state-of-the-art hierarchical model at a lower computation cost.

2020-11-02

Pan, C., Huang, J., Gong, J., Yuan, X. 2019. Few-Shot Transfer Learning for Text Classification With Lightweight Word Embedding Based Models. IEEE Access. 7:53296–53304.

Many deep learning architectures have been employed to model the semantic compositionality for text sequences, requiring a huge amount of supervised data for parameters training, making it unfeasible in situations where numerous annotated samples are not available or even do not exist. Different from data-hungry deep models, lightweight word embedding-based models could represent text sequences in a plug-and-play way due to their parameter-free property. In this paper, a modified hierarchical pooling strategy over pre-trained word embeddings is proposed for text classification in a few-shot transfer learning way. The model leverages and transfers knowledge obtained from some source domains to recognize and classify the unseen text sequences with just a handful of support examples in the target problem domain. The extensive experiments on five datasets including both English and Chinese text demonstrate that the simple word embedding-based models (SWEMs) with parameter-free pooling operations are able to abstract and represent the semantic text. The proposed modified hierarchical pooling method exhibits significant classification performance in the few-shot transfer learning tasks compared with other alternative methods.
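
As a rough illustration of parameter-free pooling over word embeddings, the sketch below averages embeddings inside a sliding window and then takes the element-wise maximum across windows, which is the plain hierarchical pooling idea. It is not the modified strategy proposed in the paper, whose details are not given in the abstract; the function name, window size, and toy vectors are assumptions.

```python
def hierarchical_pool(embeddings, window=2):
    """Average-pool each sliding window of word vectors, then max-pool the results.

    `embeddings` is a list of equal-length lists of floats, one per word.
    """
    if not embeddings:
        raise ValueError("need at least one word embedding")
    dim = len(embeddings[0])
    n = len(embeddings)
    if n <= window:
        # Text shorter than the window: treat it as a single window.
        windows = [embeddings]
    else:
        windows = [embeddings[i:i + window] for i in range(n - window + 1)]
    # Step 1: mean of the vectors inside each window (keeps local word order).
    averaged = [
        [sum(vec[d] for vec in win) / len(win) for d in range(dim)]
        for win in windows
    ]
    # Step 2: element-wise maximum across all window averages.
    return [max(avg[d] for avg in averaged) for d in range(dim)]


# Toy 2-dimensional "embeddings" for a three-word text (made-up values).
text_vectors = [[1.0, 0.0], [3.0, 2.0], [5.0, -4.0]]
print(hierarchical_pool(text_vectors, window=2))
```

The resulting fixed-length vector can be fed to any simple classifier; no pooling parameters are learned, which is what makes this family of models attractive when few labelled examples exist.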

2020-09-28

Liu, Kai, Zhou, Yun, Wang, Qingyong, Zhu, Xianqiang. 2019. Vulnerability Severity Prediction With Deep Neural Network. 2019 5th International Conference on Big Data and Information Analytics (BigDIA). :114–119.

In recent years, the high frequency of network security incidents has brought many negative effects and even huge economic losses to countries, enterprises and individuals. Therefore, more and more attention has been paid to the problem of network security. In order to evaluate newly included vulnerability text information accurately, and to reduce the workload of experts and the false negative rate of the traditional method, multiple deep learning methods for vulnerability text classification evaluation are proposed in this paper. The standard Cross Site Scripting (XSS) vulnerability text data is processed first, and then classified using three kinds of deep neural networks (CNN, LSTM, TextRCNN) and one kind of traditional machine learning method (XGBoost). The dropout ratio of the optimal CNN network, the epoch of all deep neural networks and training set data were tuned via experiments to improve the fit on our target task. The results show that the deep learning methods evaluate vulnerability risk levels better than traditional machine learning methods, but cost more time. We train our models on various training sets and test with the same testing set. The performance and utility of the recurrent convolutional neural network (TextRCNN) are the highest among all methods, with a classification accuracy of 93.95%.

2020-08-28

BOUGHACI, Dalila, BENMESBAH, Mounir, ZEBIRI, Aniss. 2019. An improved N-grams based Model for Authorship Attribution. 2019 International Conference on Computer and Information Sciences (ICCIS). :1–6.

Authorship attribution is the problem of studying an anonymous text and finding the corresponding author in a set of candidate authors. In this paper, we propose a method based on N-grams model for the problem of authorship attribution. Several measures are used to assign an anonymous text to an author. The different variants of the proposed method are implemented and validated on PAN benchmarks. The numerical results are encouraging and demonstrate the benefit of the proposed idea.

2019-02-25

Popovac, M., Karanovic, M., Sladojevic, S., Arsenovic, M., Anderla, A. 2018. Convolutional Neural Network Based SMS Spam Detection. 2018 26th Telecommunications Forum (TELFOR). :1–4.

SMS spam refers to undesired text messages. Machine learning methods for anti-spam filters have been noticeably effective in categorizing spam messages. The dataset used in this research is known as Tiago's dataset. A crucial step in the experiment was data preprocessing, which involved reducing text to lower case, tokenization, and removing stopwords. A Convolutional Neural Network was the proposed method for classification. The model's overall accuracy was 98.4%. The obtained model can be used as a tool in many applications.
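
The preprocessing steps mentioned in the abstract above (lower-casing, tokenization, stopword removal) can be sketched as below. The regular expression, the short `STOPWORDS` set, and the function name are assumptions for illustration; the paper does not state which tokenizer or stopword list was used.

```python
import re

# Small hypothetical stopword list; real lists are considerably longer.
STOPWORDS = {"a", "an", "the", "you", "have", "to", "is", "and", "of", "for"}


def preprocess(message):
    """Lower-case the message, split it into word tokens, drop stopwords."""
    lowered = message.lower()
    # Keep runs of letters, digits, and apostrophes as tokens.
    tokens = re.findall(r"[a-z0-9']+", lowered)
    return [t for t in tokens if t not in STOPWORDS]


print(preprocess("You have WON a FREE prize! Call 0800 now."))
```

The resulting token list would then be mapped to integer ids or embeddings before being passed to a classifier such as a convolutional network.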

2017-11-20

You, L., Li, Y., Wang, Y., Zhang, J., Yang, Y. 2016. A deep learning-based RNNs model for automatic security audit of short messages. 2016 16th International Symposium on Communications and Information Technologies (ISCIT). :225–229.

Traditional text classification methods usually follow this process: first, a sentence is treated as a bag of words (BOW), then transformed into a sentence feature vector which can be classified by methods such as maximum entropy (ME), Naive Bayes (NB), support vector machines (SVM), and so on. However, when these methods are applied to text classification, we usually cannot obtain an ideal result. The most important reason is that the semantic relations between words are very important for text categorization, but the traditional methods cannot capture them. Sentiment classification, as a special case of text classification, is binary classification (positive or negative). Inspired by sentiment analysis, we use a novel deep learning-based recurrent neural networks (RNNs) model for automatic security audit of short messages from prisons, which can classify short messages (secure and non-secure). In this paper, the feature of short messages is extracted by word2vec which captures word order information, and each sentence is mapped to a feature vector. In particular, words with similar meaning are mapped to a similar position in the vector space, and then classified by RNNs. RNNs are now widely used and the network structure of RNNs determines that it can easily process sequence data. We preprocess short messages, extract typical features from existing security and non-security short messages via word2vec, and classify short messages through RNNs which accept a fixed-sized vector as input and produce a fixed-sized vector as output. The experimental results show that the RNNs model achieves an average 92.7% accuracy which is higher than SVM.