Biblio
Analysis on Sentiment Analytics Using Deep Learning Techniques. 2021 Fifth International Conference on I-SMAC (IoT in Social, Mobile, Analytics and Cloud) (I-SMAC). :542–547.
2021. Sentiment analytics is the process of applying natural language processing and methods for text-based information to define and extract subjective knowledge of the text. Natural language processing and text classifications can deal with limited corpus data and more attention has been gained by semantic texts and word embedding methods. Deep learning is a powerful method that learns different layers of representations or qualities of information and produces state-of-the-art prediction results. In different applications of sentiment analytics, deep learning methods are used at the sentence, document, and aspect levels. This review paper is based on the main difficulties in the sentiment assessment stage that significantly affect sentiment score, pooling, and polarity detection. The most popular deep learning methods are a Convolution Neural Network and Recurrent Neural Network. Finally, a comparative study is made with a vast literature survey using deep learning models.
A Visual Analytics Approach for the Diagnosis of Heterogeneous and Multidimensional Machine Maintenance Data. 2021 IEEE 14th Pacific Visualization Symposium (PacificVis). :196–205.
2021. Analysis of large, high-dimensional, and heterogeneous datasets is challenging as no one technique is suitable for visualizing and clustering such data in order to make sense of the underlying information. For instance, heterogeneous logs detailing machine repair and maintenance in an organization often need to be analyzed to diagnose errors and identify abnormal patterns, formalize root-cause analyses, and plan preventive maintenance. Such real-world datasets are also beset by issues such as inconsistent and/or missing entries. To conduct an effective diagnosis, it is important to extract and understand patterns from the data with support from analytic algorithms (e.g., finding that certain kinds of machine complaints occur more in the summer) while involving the human-in-the-loop. To address these challenges, we adopt existing techniques for dimensionality reduction (DR) and clustering of numerical, categorical, and text data dimensions, and introduce a visual analytics approach that uses multiple coordinated views to connect DR + clustering results across each kind of the data dimension stated. To help analysts label the clusters, each clustering view is supplemented with techniques and visualizations that contrast a cluster of interest with the rest of the dataset. Our approach assists analysts to make sense of machine maintenance logs and their errors. Then the gained insights help them carry out preventive maintenance. We illustrate and evaluate our approach through use cases and expert studies respectively, and discuss generalization of the approach to other heterogeneous data.
Research on Small Sample Text Classification Based on Attribute Extraction and Data Augmentation. 2021 IEEE 6th International Conference on Cloud Computing and Big Data Analytics (ICCCBDA). :53–57.
2021. With the development of deep learning and the progress of natural language processing technology, as well as the continuous disclosure of judicial data such as judicial documents, legal intelligence has gradually become a research hot spot. The crime classification task is an important branch of text classification, which can help people working in law improve their work efficiency. However, in actual research, the sample data is small and the distribution of crime categories is unbalanced. To address these two problems, BERT was used as the encoder to handle the small data volume, and an attribute extraction network was added to handle the unbalanced distribution. An accuracy of 90.35% was achieved on a small-sample data set, with an F1 value of 67.62, which was close to the best model performance under sufficient data. In addition, a text enhancement method based on back-translation is proposed, and experiments are conducted with different models. The LSTM model improves to some extent, but BERT does not.
Deeply Multi-channel guided Fusion Mechanism for Natural Scene Text Detection. 2021 7th International Conference on Big Data and Information Analytics (BigDIA). :149–156.
2021. Scene text detection methods have developed greatly in the past few years. However, because of the diversity of text backgrounds in natural scenes, previous methods often failed when detecting more complicated text instances (e.g., super-long text and arbitrarily shaped text). In this paper, a text detection method based on multi-channel bounding box fusion is designed to address the problem. Firstly, a convolutional neural network is used as the basic network for feature extraction, producing a shallow text feature map and a deep semantic text feature map. Secondly, the whole convolutional network is used for upsampling of the feature map and fusion of the feature maps at each layer, so as to obtain pixel-level text and non-text classification results. Then, two independent text detection box channels are designed: a bounding box regression channel and a channel that obtains the bounding box directly from the score map. Finally, the result is obtained by applying a multi-channel bounding box fusion mechanism to the detection boxes of the two channels. Experiments on ICDAR2013 and ICDAR2015 demonstrate that the proposed method achieves competitive results in scene text detection.
G-TADOC: Enabling Efficient GPU-Based Text Analytics without Decompression. 2021 IEEE 37th International Conference on Data Engineering (ICDE). :1679–1690.
2021. Text analytics directly on compression (TADOC) has proven to be a promising technology for big data analytics. GPUs are extremely popular accelerators for data analytics systems. Unfortunately, no work so far shows how to utilize GPUs to accelerate TADOC. We describe G-TADOC, the first framework that provides GPU-based text analytics directly on compression, effectively enabling efficient text analytics on GPUs without decompressing the input data. G-TADOC solves three major challenges. First, TADOC involves a large number of dependencies, which makes it difficult to exploit massive parallelism on a GPU. We develop a novel fine-grained thread-level workload scheduling strategy for GPU threads, which partitions heavily-dependent loads adaptively in a fine-grained manner. Second, in developing G-TADOC, thousands of GPU threads writing to the same result buffer leads to inconsistency, while directly using locks and atomic operations leads to large synchronization overheads. We develop a memory pool with thread-safe data structures on GPUs to handle such difficulties. Third, maintaining the sequence information among words is essential for lossless compression. We design a sequence-support strategy, which maintains high GPU parallelism while ensuring sequence information. Our experimental evaluations show that G-TADOC provides 31.1× average speedup compared to state-of-the-art TADOC.
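The general idea of running analytics directly on compressed text can be illustrated with a small CPU-only sketch. It assumes a grammar-style compression in which each rule is a sequence of words and references to other rules; the grammar layout, the `("rule", name)` convention, and the function names are hypothetical choices for illustration and are not taken from G-TADOC, whose GPU scheduling, memory pool, and sequence support are not modeled here.

```python
from collections import Counter
from functools import lru_cache

# Hypothetical grammar-compressed text: each rule maps to a sequence of
# symbols. A symbol is either a plain word (str) or a reference to
# another rule, written here as a ("rule", name) tuple.
GRAMMAR = {
    "R0": [("rule", "R1"), "and", ("rule", "R1"), "again"],
    "R1": ["text", ("rule", "R2"), "text"],
    "R2": ["analytics", "on", "compression"],
}


def word_count(grammar, root="R0"):
    """Count words without expanding the grammar into the full text."""

    @lru_cache(maxsize=None)
    def counts_for(rule_name):
        total = Counter()
        for symbol in grammar[rule_name]:
            if isinstance(symbol, tuple):
                # Reuse the cached counts of the referenced rule, so a
                # rule that appears many times is processed only once.
                total.update(counts_for(symbol[1]))
            else:
                total[symbol] += 1
        return total

    return counts_for(root)


def decompress(grammar, root="R0"):
    """Reference expansion, used only to check the result."""
    words = []
    for symbol in grammar[root]:
        if isinstance(symbol, tuple):
            words.extend(decompress(grammar, symbol[1]))
        else:
            words.append(symbol)
    return words
```

The dependencies between rules (R0 needs R1, which needs R2) are the kind of ordering constraint the abstract describes as an obstacle to massive parallelism.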
Text mining and visual analytics in research: Exploring the innovative tools. 2021 International Conference on Decision Aid Sciences and Application (DASA). :1087–1091.
2021. The aim of the study is to present an advanced overview and potential application of the innovative tools, software, and methods used for data visualization, text mining, scientific mapping, and bibliometric analysis. Text mining and data visualization have been a topic of research for several years for academic researchers and practitioners. With the advancement in technology and innovation in data analysis techniques, there are many online and offline software tools available for text mining and visualisation. The purpose of this study is to present an advanced overview of the latest, sophisticated, and innovative tools available for this purpose. The unique characteristic of this study is that it provides an overview with examples of the five most adopted software tools in social science research: VOSviewer, Biblioshiny, Gephi, HistCite, and CiteSpace. This study will contribute to the academic literature and will help researchers and practitioners apply these tools in future research to present their findings in a more scientific manner.
API Pipeline for Visualising Text Analytics Features of Twitter Texts. 2021 International Conference of Women in Data Science at Taif University (WiDSTaif). :1–6.
2021. Twitter text analysis is quite useful in analysing the emotions, sentiments, and feedback of consumers on products and services. This helps service providers and manufacturers to improve their products and services, address serious issues before they lead to a crisis, and improve business acumen. Twitter texts also form a data source for various research studies. They are used in topic analysis, sentiment analysis, content analysis, and thematic analysis. In this paper, we present a pipeline for searching, analysing, and visualizing the text analytics features of Twitter texts using web APIs. It allows researchers and other interested users to build a simple yet powerful Twitter text analytics tool.
Improving Text Classification Using Knowledge in Labels. 2021 IEEE 6th International Conference on Big Data Analytics (ICBDA). :193–197.
2021. Various algorithms and models have been proposed to address text classification tasks; however, they rarely consider incorporating the additional knowledge hidden in class labels. We argue that hidden information in class labels leads to better classification accuracy. In this study, instead of encoding the labels into numerical values, we incorporated the knowledge in the labels into the original model without changing the model architecture. We combined the output of an original classification model with the relatedness calculated based on the embeddings of a sequence and a keyword set. A keyword set is a word set to represent knowledge in the labels. Usually, it is generated from the classes while it could also be customized by the users. The experimental results show that our proposed method achieved statistically significant improvements in text classification tasks. The source code and experimental details of this study can be found on GitHub (https://github.com/HeroadZ/KiL).
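One possible reading of "combining the output of a classification model with relatedness" is sketched below. The abstract does not give the combination rule, so everything here is an assumption made for illustration: cosine similarity as the relatedness measure, the mean of keyword embeddings as the label representation, a softmax over similarities, and the mixing weight `alpha`. The toy embeddings are invented.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def combine(model_probs, seq_embedding, keyword_embeddings, alpha=0.5):
    """Blend classifier probabilities with label-keyword relatedness.

    alpha is a hypothetical mixing weight, not a value from the paper.
    """
    relatedness = softmax([
        cosine(seq_embedding, mean_vector(vectors))
        for vectors in keyword_embeddings
    ])
    return [
        (1 - alpha) * p + alpha * r
        for p, r in zip(model_probs, relatedness)
    ]


# Toy 2-D embeddings, invented for illustration only.
keyword_embeddings = [
    [[1.0, 0.0], [0.9, 0.1]],   # keyword set for class 0
    [[0.0, 1.0], [0.1, 0.9]],   # keyword set for class 1
]
model_probs = [0.5, 0.5]        # an undecided base classifier
sequence = [0.8, 0.2]           # embedding of the input sequence
scores = combine(model_probs, sequence, keyword_embeddings)
```

In this toy case the base classifier is undecided, and the relatedness term tips the blended score toward class 0, whose keywords lie closer to the input embedding. The base model itself is left unchanged, which matches the abstract's statement that the architecture is not modified.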
Text Analytics Architecture in IoT Systems. 2021 Third South American Colloquium on Visible Light Communications (SACVLC). :01–06.
2021. Management control and monitoring of production activities in intelligent environments in subway mines must be aligned with the strategies and objectives of each agent. It is required that in operations, the local structure of each service is fault-tolerant and that large amounts of data are transmitted online to executives to make effective and efficient decisions. The paper proposes an architecture that enables strategic text analysis on the Internet of Things devices through task partitioning with multiple agent systems and evaluates the feasibility of the design by building a prototype that improves communication. The results validate the system's design because Raspberry Pi can execute text mining algorithms and agents in about 3 seconds for 197 texts. This work emphasizes multiple agents for text analytics because the algorithms, along with the agents, use about 70% of a Raspberry Pi CPU.
Long Text Filtering in English Translation based on LSTM Semantic Association. 2021 Fifth International Conference on I-SMAC (IoT in Social, Mobile, Analytics and Cloud) (I-SMAC). :740–743.
2021. Translation studies is one of the fastest growing interdisciplinary research fields in the world today. Business English is an urgent research direction in the field of translation studies. To some extent, the quality of business English translation directly determines the success or failure of international trade and the economic benefits. On the basis of sequence information encoding and decoding model of LSTM, this paper proposes a strategy combining attention mechanism with bidirectional LSTM model to handle the question of feature extraction of text information. The proposed method reduces the semantic complexity and improves the overall correlation accuracy. The experimental results show its advantages.
Text Analytics and Big Data in the Financial domain. 2021 16th Iberian Conference on Information Systems and Technologies (CISTI). :1–4.
2021. This research attempts to provide some insights on the application of text mining and Natural Language Processing (NLP). The application domain is consumer complaints about financial institutions in the USA. As an advanced analytics discipline embedded within the Big Data paradigm, the practice of text analytics contains elements of emergent knowledge processes. Since our experiment should be able to scale up, we make use of a pipeline based on Spark-NLP. The usage scenario is adapting the model to a specific industrial context and using the dataset offered by the "Consumer Financial Protection Bureau" to illustrate the application.
Individual versus Computer-Supported Collaborative Self-Explanations: How Do Their Writing Analytics Differ? 2020 IEEE 20th International Conference on Advanced Learning Technologies (ICALT). :132–134.
2020. Researchers have demonstrated the effectiveness of self-explanations (SE) as an instructional practice and study strategy. However, there is a lack of work studying the characteristics of SE responses prompted by collaborative activities. In this paper, we use writing analytics to investigate differences between SE text responses resulting from individual versus collaborative learning activities. A Coh-Metrix analysis suggests that students in the collaborative SE activity demonstrated a higher level of comprehension. Future research should explore how writing analytics can be incorporated into CSCL systems to support student performance of SE activities.
Determining Worker Type from Legal Text Data Using Machine Learning. 2020 IEEE Intl Conf on Dependable, Autonomic and Secure Computing, Intl Conf on Pervasive Intelligence and Computing, Intl Conf on Cloud and Big Data Computing, Intl Conf on Cyber Science and Technology Congress (DASC/PiCom/CBDCom/CyberSciTech). :444–450.
2020. This project addresses a classic employment law question in Canada and elsewhere using a machine learning approach: how do we know whether a worker is an employee or an independent contractor? This is a central issue for self-represented litigants insofar as these two legal categories entail very different rights and employment protections. In this interdisciplinary research study, we collaborated with the Conflict Analytics Lab to develop machine learning models aimed at determining whether a worker is an employee or an independent contractor. We present a number of supervised learning models, including a neural network model, that we implemented using data labeled by law researchers, and we compare the accuracy of the models. Our neural network model achieved an accuracy rate of 91.5%. A critical discussion follows to identify the key features in the data that influence the accuracy of our models and to provide insights about the case outcomes.
Enhanced Word Embedding Method in Text Classification. 2020 6th International Conference on Big Data and Information Analytics (BigDIA). :18–22.
2020. For natural language processing (NLP) tasks, word embedding technology has a certain impact on the accuracy of deep neural network algorithms. Because current word embedding methods cannot realize the coexistence of words and phrases in the same vector space, we propose an enhanced word embedding (EWE) method. Before completing the word embedding, this method introduces a unique sentence reorganization technology to rewrite all the sentences in the original training corpus. Then, the original corpus and the reorganized corpus are merged together as the training corpus of the distributed word embedding model, so as to realize the coexistence of words and phrases in the same vector space. We carried out experiments to demonstrate the effectiveness of the EWE algorithm on three classic benchmark datasets. The results show that the EWE method can significantly improve the classification performance of the CNN model.
Deep Learning for Text Detection and Recognition in Complex Engineering Diagrams. 2020 International Joint Conference on Neural Networks (IJCNN). :1–7.
2020. Engineering drawings such as Piping and Instrumentation Diagrams contain a vast amount of text data which is essential to identify shapes, pipeline activities, and tags, amongst others. These diagrams are often stored in undigitised format, such as paper copy, meaning the information contained within the diagrams is not readily accessible to inspect and use for further data analytics. In this paper, we make use of the benefits of recent deep learning advances by selecting models for both text detection and text recognition, and apply them to the digitisation of text from within real-world complex engineering diagrams. Results show that 90% of text strings were detected, including vertical text strings; however, certain non-text diagram elements were detected as text. Text strings were obtained by the text recognition method for 86% of detected text instances. The findings show that whilst the chosen Deep Learning methods were able to detect and recognise text which occurred in simple scenarios, more complex representations of text, including text strings located in close proximity to other drawing elements, were highlighted as a remaining challenge.
Classification Between Machine Translated Text and Original Text By Part Of Speech Tagging Representation. 2020 IEEE 7th International Conference on Data Science and Advanced Analytics (DSAA). :739–740.
2020. Classifiers that distinguish machine-translated text from original text often tokenize on the vocabulary of the corpora. With N-grams larger than unigrams, one can create a model that estimates a decision boundary based on the word frequency probability distribution; however, this approach is exponentially expensive because of high dimensionality and sparsity. Instead, we let samples of the corpora be represented by part-of-speech tags, which form a significantly smaller vocabulary. With fewer trigram permutations, we can create a model from the trigram frequency probability distribution. In this paper, we explore less conventional ways of approaching techniques for handling documents, dictionaries, and the like.
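The representation described above can be sketched as a trigram frequency distribution over part-of-speech tags. This is a minimal illustration under stated assumptions: the tag sequence is hypothetical and supplied by hand, the function name is invented, and the tagger and the classifier that would consume these features are not shown.

```python
from collections import Counter


def pos_trigram_distribution(tags):
    """Relative frequency of each part-of-speech trigram in one sample."""
    trigrams = [tuple(tags[i:i + 3]) for i in range(len(tags) - 2)]
    counts = Counter(trigrams)
    total = sum(counts.values())
    return {gram: n / total for gram, n in counts.items()} if total else {}


# Hypothetical pre-tagged sample; a real pipeline would obtain the tags
# from a part-of-speech tagger, which is outside this sketch.
tags = ["DET", "NOUN", "VERB", "DET", "NOUN", "VERB", "ADV"]
dist = pos_trigram_distribution(tags)
```

Because a tag set has only a few dozen symbols, the number of possible trigrams stays far below that of word trigrams, which is the dimensionality argument the abstract makes.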
A Hierarchical Fine-Tuning Based Approach for Multi-Label Text Classification. 2020 IEEE 5th International Conference on Cloud Computing and Big Data Analytics (ICCCBDA). :51–54.
2020. Hierarchical Text classification has recently become increasingly challenging with the growing number of classification labels. In this paper, we propose a hierarchical fine-tuning based approach for hierarchical text classification. We use the ordered neurons LSTM (ONLSTM) model by combining the embedding of text and parent category for hierarchical text classification with a large number of categories, which makes full use of the connection between the upper-level and lower-level labels. Extensive experiments show that our model outperforms the state-of-the-art hierarchical model at a lower computation cost.
A New Approach to Use Big Data Tools to Substitute Unstructured Data Warehouse. 2020 IEEE Conference on Big Data and Analytics (ICBDA). :26–31.
2020. Data warehouses and big data have become the trend for organising data effectively. Business data originate from various kinds of sources in different forms, from conventional structured data to unstructured data, and are the input for producing useful information essential for business sustainability. This research navigates the complicated designs of common big data and data warehousing technologies to propose an effective approach for using these technologies to design and build an unstructured textual data warehouse, a crucial tool for most enterprises for decision making and gaining competitive business advantage. Using the IBM BigInsights Text Analytics, PostgreSQL, and Pentaho tools, an unstructured data warehouse was implemented and worked well with unstructured text from Amazon review datasets. The proposed approach creates a practical solution for building an unstructured data warehouse.
Study of Extractive Text Summarizer Using The Elmo Embedding. 2020 Fourth International Conference on I-SMAC (IoT in Social, Mobile, Analytics and Cloud) (I-SMAC). :829–834.
2020. In recent times, data overload has become a major problem in education, news, blogs, social media, and other fields. With such a vast and growing amount of text data, it has become challenging for a human to extract only the valuable information in a concise form. Summarizing the text enables a reader to retrieve the relevant and useful content. Text summarization extracts information from a document and generates a short, concise version of it. One widely used approach is the automatic text summarizer, which analyzes large textual data and condenses it into short summaries containing the valuable information. Automatic text summarizers are divided into two types: 1) extractive and 2) abstractive. This article examines the extractive approach, in which a model generates a concise summary by picking the most relevant sentences from the text document. This paper focuses on retrieving the valuable information using the ELMo embedding in extractive text summarization. ELMo is a contextual embedding that has been used previously by many researchers in abstractive text summarization, but this paper focuses on using it in an extractive text summarizer.
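The extractive step, picking the most relevant sentences, can be sketched as ranking sentences by their similarity to the whole document. This sketch deliberately swaps in a plain bag-of-words count vector where the paper uses ELMo embeddings, so that it runs with the standard library only. The scoring rule (cosine similarity to the document vector), the sentence splitter, and all function names are assumptions for illustration, not the paper's method.

```python
import math
import re
from collections import Counter


def embed(text):
    """Stand-in embedding: a bag-of-words count vector (NOT ELMo)."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(u, v):
    """Cosine similarity of two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(
        sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def summarize(text, k=2):
    """Return the k sentences most similar to the whole document, in order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    doc_vec = embed(text)
    ranked = sorted(range(len(sentences)),
                    key=lambda i: cosine(embed(sentences[i]), doc_vec),
                    reverse=True)
    # Restore the original sentence order for readability.
    return [sentences[i] for i in sorted(ranked[:k])]


document = ("Text summarization shortens text. Cats sleep. "
            "Extractive summarization picks sentences from the text.")
summary = summarize(document, k=2)
```

Replacing `embed` with a contextual sentence embedding would keep the ranking logic unchanged, provided `cosine` is adapted to dense vectors.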
Comparison of Full-Text Articles and Abstracts for Visual Trend Analytics through Natural Language Processing. 2020 24th International Conference Information Visualisation (IV). :360–367.
2020. Scientific publications are an essential resource for detecting emerging trends and innovations at a very early stage, far earlier than patents may allow. Visual Analytics systems enable a deep analysis by applying commonly unsupervised machine learning methods and investigating a mass amount of data. A main question from the Visual Analytics viewpoint in this context is whether abstracts of scientific publications provide a similar analysis capability compared to their corresponding full texts. This would allow a mass amount of text documents to be extracted much faster. In this paper, we compare the topic extraction methods LSI and LDA using full-text articles and their corresponding abstracts to determine which method and which data are better suited for a Visual Analytics system for Technology and Corporate Foresight. Based on an easily replicable natural language processing approach, we further investigate the impact of lemmatization for LDA and LSI. The comparison is performed qualitatively and quantitatively to capture both human perception in visual systems and coherence values. Based on an application scenario, a visual trend analytics system illustrates the outcomes.
On the Network and Topological Analyses of Legal Documents Using Text Mining Approach. 2020 1st International Conference on Big Data Analytics and Practices (IBDAP). :1–6.
2020. This paper presents a computational study of Thai legal documents using a text mining and network analytic approach. Thai legal systems rely heavily on existing judicial rulings. Thus, legal documents contain complex relationships and require careful examination. The objective of this study is to use text mining to model relationships between these legal documents and draw useful insights. The study found a structure of document relationships in the form of a network that reflects meaningful relations among legal documents. This can potentially be developed further into a document retrieval system based on how documents are related in the network.
Interactive Mixed Brushing: Integrated Text and Visual Based Data Exploration. Proceedings of Computer Graphics International 2018. :77-86.
2018. Linking and brushing is an essential technique for interactive data exploration and analysis that leverages coordinated multiple views to identify, select, and combine data points of interest. We propose to augment this technique by directly exploring data space using textual queries. Textual and visual queries are freely combined and modified during the data exploration process. Visual queries are used to refine the results of textual queries and vice versa. This mixed brushing integrates procedural, textual, and visual based data exploration to provide a unified approach to brushing. We also propose an interface, the Text Query Browser View, that allows users to specify and edit data queries as well as to browse the data query history. Further, we argue why an interactive, on-demand data aggregation and derivation is necessary, and we provide a flexible mechanism that supports it. We have implemented the proposed approach within an existing visualization tool using a client-server architecture. The approach was illustrated and evaluated using two example data sets.
Convolutional Neural Networks for Toxic Comment Classification. Proceedings of the 10th Hellenic Conference on Artificial Intelligence. :35:1-35:6.
2018. A flood of information is produced daily through global internet usage, arising from online interactive communication among users. While this contributes significantly to the quality of human life, it also involves enormous dangers, since online texts with high toxicity can cause personal attacks, online harassment, and bullying behaviors. This has drawn the attention of both the industrial and research communities in the last few years, and there have been several attempts to identify an efficient model for online toxic comment prediction. However, these efforts are still in their infancy, and new approaches and frameworks are required. In parallel, the constant data explosion makes the construction of new machine learning computational tools for managing this information an imperative need. Thankfully, advances in hardware, cloud computing, and big data management allow the development of Deep Learning approaches that have shown very promising performance so far. For text classification in particular, the use of Convolutional Neural Networks (CNN) has recently been proposed, approaching text analytics in a modern manner that emphasizes the structure of words in a document. In this work, we employ this approach to discover toxic comments in a large pool of documents provided by a current Kaggle competition regarding Wikipedia's talk page edits. To justify this decision, we compare CNNs against the traditional bag-of-words approach for text analysis combined with a selection of algorithms proven to be very effective in text classification. The reported results provide enough evidence that CNNs enhance toxic comment classification, reinforcing research interest in this direction.
A Novel Feature Hashing With Efficient Collision Resolution for Bag-of-Words Representation of Text Data. Proceedings of the 2nd International Conference on Natural Language Processing and Information Retrieval. :12-16.
2018. Text Mining is widely used in many areas, transforming unstructured text data from sources such as patient records, social media networks, insurance data, and news, among others, into an invaluable source of information. The Bag of Words (BoW) representation is a means of extracting features from text data for use in modeling. In text classification, a word in a document is assigned a weight according to its frequency within the document and across different documents; words together with their weights therefore form the BoW. One way to handle voluminous data is to use the feature hashing method, or hashing trick. However, collisions are inevitable and might change the result of the whole process of feature generation and selection. Using a vector data structure, lookup performance is improved while resolving collisions, and memory usage remains efficient.
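The hashing trick with collision handling can be sketched as follows. The abstract does not specify the paper's resolution scheme, so this example assumes one simple option: each hash bucket holds a small list of `[token, count]` entries, so colliding tokens keep separate counts. The class name, the bucket count, and the use of `zlib.crc32` as the hash function are illustrative choices, not details from the paper.

```python
import zlib


class HashedBagOfWords:
    """Hashing-trick bag of words whose buckets keep colliding tokens apart."""

    def __init__(self, num_buckets=8):
        self.num_buckets = num_buckets
        # Each bucket is a small list (vector) of [token, count] entries.
        self.buckets = [[] for _ in range(num_buckets)]

    def _index(self, token):
        # crc32 is deterministic across runs, unlike the built-in hash().
        return zlib.crc32(token.encode("utf-8")) % self.num_buckets

    def add(self, token):
        bucket = self.buckets[self._index(token)]
        for entry in bucket:
            if entry[0] == token:
                entry[1] += 1
                return
        bucket.append([token, 1])  # new token, possibly a collision

    def count(self, token):
        for stored, n in self.buckets[self._index(token)]:
            if stored == token:
                return n
        return 0


bow = HashedBagOfWords(num_buckets=2)  # tiny table to force collisions
for word in "the cat sat on the mat the end".split():
    bow.add(word)
```

With six distinct tokens and only two buckets, collisions are guaranteed, yet each token's count stays exact. A plain hashing trick without resolution would instead sum the counts of all tokens that share a bucket.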
Text Analysis for Decision Making Under Adversarial Environments. Proceedings of the 10th Hellenic Conference on Artificial Intelligence. :39:1-39:6.
2018. Sentiment analysis and other practices for text analytics on social media rely on publicly available and editable collections of data for training and evaluation. These data collections are subject to poisoning and data contamination attacks by adversaries having an interest in misleading the results of the performed analysis. We present the problem of adversarial text mining with a focus on decision making and we suggest cross-discipline, cross-application and cross-model strategies for more robust analyses. Our approach is practitioner-centric and is based on broadly-used interpretable models with applications in decision making.