Biblio

List
Filter

Found 101 results

Filters: Keyword is text analysis [Clear All Filters]

2020-08-28

BOUGHACI, Dalila, BENMESBAH, Mounir, ZEBIRI, Aniss. 2019. An improved N-grams based Model for Authorship Attribution. 2019 International Conference on Computer and Information Sciences (ICCIS). :1—6.

Authorship attribution is the problem of studying an anonymous text and finding the corresponding author in a set of candidate authors. In this paper, we propose a method based on N-grams model for the problem of authorship attribution. Several measures are used to assign an anonymous text to an author. The different variants of the proposed method are implemented and validated on PAN benchmarks. The numerical results are encouraging and demonstrate the benefit of the proposed idea.

Khomytska, Iryna, Teslyuk, Vasyl. 2019. Mathematical Methods Applied for Authorship Attribution on the Phonological Level. 2019 IEEE 14th International Conference on Computer Sciences and Information Technologies (CSIT). 3:7—11.

The proposed combination of statistical methods has proved efficient for authorship attribution. The complex analysis method based on the proposed combination of statistical methods has made it possible to minimize the number of phoneme groups by which the authorial differentiation of texts has been done.

Jafariakinabad, Fereshteh, Hua, Kien A.. 2019. Style-Aware Neural Model with Application in Authorship Attribution. 2019 18th IEEE International Conference On Machine Learning And Applications (ICMLA). :325—328.

Writing style is a combination of consistent decisions associated with a specific author at different levels of language production, including lexical, syntactic, and structural. In this paper, we introduce a style-aware neural model to encode document information from three stylistic levels and evaluate it in the domain of authorship attribution. First, we propose a simple way to jointly encode syntactic and lexical representations of sentences. Subsequently, we employ an attention-based hierarchical neural network to encode the syntactic and semantic structure of sentences in documents while rewarding the sentences which contribute more to capturing the writing style. Our experimental results, based on four benchmark datasets, reveal the benefits of encoding document information from all three stylistic levels when compared to the baseline methods in the literature.

Khomytska, Iryna, Teslyuk, Vasyl. 2019. The Software for Authorship and Style Attribution. 2019 IEEE 15th International Conference on the Experience of Designing and Application of CAD Systems (CADSM). :1—4.

A new program has been developed for style and authorship attribution. Differentiation of styles by transcription symbols has proved to be efficient The novel approach involves a combination of two ways of transforming texts into their transcription variants. The java programming language makes it possible to improve efficiency of style and authorship attribution.

2020-07-27

Gorodnichev, Mikhail G., Kochupalov, Alexander E., Gematudinov, Rinat A.. 2018. Asynchronous Rendering of Texts in iOS Applications. 2018 IEEE International Conference "Quality Management, Transport and Information Security, Information Technologies" (IT QM IS). :643–645.

This article is devoted to new asynchronous methods for rendering text information in mobile applications for iOS operating system.

2020-07-24

Wu, Chuxin, Zhang, Peng, Liu, Hongwei, Liu, Yuhong. 2019. Multi-keyword Ranked Searchable Encryption Supporting CP-ABE Test. 2019 Computing, Communications and IoT Applications (ComComAp). :220—225.

Internet of Things (IoT) and cloud computing are promising technologies that change the way people communicate and live. As the data collected through IoT devices often involve users' private information and the cloud is not completely trusted, users' private data are usually encrypted before being uploaded to cloud for security purposes. Searchable encryption, allowing users to search over the encrypted data, extends data flexibility on the premise of security. In this paper, to achieve the accurate and efficient ciphertext searching, we present an efficient multi-keyword ranked searchable encryption scheme supporting ciphertext-policy attribute-based encryption (CP-ABE) test (MRSET). For efficiency, numeric hierarchy supporting ranked search is introduced to reduce the dimensions of vectors and matrices. For practicality, CP-ABE is improved to support access right test, so that only documents that the user can decrypt are returned. The security analysis shows that our proposed scheme is secure, and the experimental result demonstrates that our scheme is efficient.

2020-07-13

Agrawal, Shriyansh, Sanagavarapu, Lalit Mohan, Reddy, YR. 2019. FACT - Fine grained Assessment of web page CredibiliTy. TENCON 2019 - 2019 IEEE Region 10 Conference (TENCON). :1088–1097.

With more than a trillion web pages, there is a plethora of content available for consumption. Search Engine queries invariably lead to overwhelming information, parts of it relevant and some others irrelevant. Often the information provided can be conflicting, ambiguous, and inconsistent contributing to the loss of credibility of the content. In the past, researchers have proposed approaches for credibility assessment and enumerated factors influencing the credibility of web pages. In this work, we detailed a WEBCred framework for automated genre-aware credibility assessment of web pages. We developed a tool based on the proposed framework to extract web page features instances and identify genre a web page belongs to while assessing it's Genre Credibility Score ( GCS). We validated our approach on `Information Security' dataset of 8,550 URLs with 171 features across 7 genres. The supervised learning algorithm, Gradient Boosted Decision Tree classified genres with 88.75% testing accuracy over 10 fold cross-validation, an improvement over the current benchmark. We also examined our approach on `Health' domain web pages and had comparable results. The calculated GCS correlated 69% with crowdsourced Web Of Trust ( WOT) score and 13% with algorithm based Alexa ranking across 5 Information security groups. This variance in correlation states that our GCS approach aligns with human way ( WOT) as compared to algorithmic way (Alexa) of web assessment in both the experiments.

2020-07-09

Feyisetan, Oluwaseyi, Diethe, Tom, Drake, Thomas. 2019. Leveraging Hierarchical Representations for Preserving Privacy and Utility in Text. 2019 IEEE International Conference on Data Mining (ICDM). :210—219.

Guaranteeing a certain level of user privacy in an arbitrary piece of text is a challenging issue. However, with this challenge comes the potential of unlocking access to vast data stores for training machine learning models and supporting data driven decisions. We address this problem through the lens of dx-privacy, a generalization of Differential Privacy to non Hamming distance metrics. In this work, we explore word representations in Hyperbolic space as a means of preserving privacy in text. We provide a proof satisfying dx-privacy, then we define a probability distribution in Hyperbolic space and describe a way to sample from it in high dimensions. Privacy is provided by perturbing vector representations of words in high dimensional Hyperbolic space to obtain a semantic generalization. We conduct a series of experiments to demonstrate the tradeoff between privacy and utility. Our privacy experiments illustrate protections against an authorship attribution algorithm while our utility experiments highlight the minimal impact of our perturbations on several downstream machine learning models. Compared to the Euclidean baseline, we observe \textbackslashtextgreater 20x greater guarantees on expected privacy against comparable worst case statistics.

2020-07-06

Balouchestani, Arian, Mahdavi, Mojtaba, Hallaj, Yeganeh, Javdani, Delaram. 2019. SANUB: A new method for Sharing and Analyzing News Using Blockchain. 2019 16th International ISC (Iranian Society of Cryptology) Conference on Information Security and Cryptology (ISCISC). :139–143.

Millions of news are being exchanged daily among people. With the appearance of the Internet, the way of broadcasting news has changed and become faster, however it caused many problems. For instance, the increase in the speed of broadcasting news leads to an increase in the speed of fake news creation. Fake news can have a huge impression on societies. Additionally, the existence of a central entity, such as news agencies, could lead to fraud in the news broadcasting process, e.g. generating fake news and publishing them for their benefits. Since Blockchain technology provides a reliable decentralized network, it can be used to publish news. In addition, Blockchain with the help of decentralized applications and smart contracts can provide a platform in which fake news can be detected through public participation. In this paper, we proposed a new method for sharing and analyzing news to detect fake news using Blockchain, called SANUB. SANUB provides features such as publishing news anonymously, news evaluation, reporter validation, fake news detection and proof of news ownership. The results of our analysis show that SANUB outperformed the existing methods.

2020-05-22

Platonov, A.V., Poleschuk, E.A., Bessmertny, I. A., Gafurov, N. R.. 2018. Using quantum mechanical framework for language modeling and information retrieval. 2018 IEEE 12th International Conference on Application of Information and Communication Technologies (AICT). :1—4.

This article shows the analogy between natural language texts and quantum-like systems on the example of the Bell test calculating. The applicability of the well-known Bell test for texts in Russian is investigated. The possibility of using this test for the text separation on the topics corresponding to the user query in information retrieval system is shown.

Khadilkar, Kunal, Kulkarni, Siddhivinayak, Bone, Poojarani. 2018. Plagiarism Detection Using Semantic Knowledge Graphs. 2018 Fourth International Conference on Computing Communication Control and Automation (ICCUBEA). :1—6.

Every day, huge amounts of unstructured text is getting generated. Most of this data is in the form of essays, research papers, patents, scholastic articles, book chapters etc. Many plagiarism softwares are being developed to be used in order to reduce the stealing and plagiarizing of Intellectual Property (IP). Current plagiarism softwares are mainly using string matching algorithms to detect copying of text from another source. The drawback of some of such plagiarism softwares is their inability to detect plagiarism when the structure of the sentence is changed. Replacement of keywords by their synonyms also fails to be detected by these softwares. This paper proposes a new method to detect such plagiarism using semantic knowledge graphs. The method uses Named Entity Recognition as well as semantic similarity between sentences to detect possible cases of plagiarism. The doubtful cases are visualized using semantic Knowledge Graphs for thorough analysis of authenticity. Rules for active and passive voice have also been considered in the proposed methodology.

2020-05-18

Bakhtin, Vadim V., Isaeva, Ekaterina V.. 2019. New TSBuilder: Shifting towards Cognition. 2019 IEEE Conference of Russian Young Researchers in Electrical and Electronic Engineering (EIConRus). :179–181.

The paper reviews a project on the automation of term system construction. TSBuilder (Term System Builder) was developed in 2014 as a multilayer Rosenblatt's perceptron for supervised machine learning, namely 1-3 word terms identification in natural language texts and their rigid categorization. The program is being modified to reduce the rigidity of categorization which will bring text mining more in line with human thinking.We are expanding the range of parameters (semantical, morphological, and syntactical) for categorization, removing the restriction of the term length of three words, using convolution on a continuous sequence of terms, and present the probabilities of a term falling into different categories. The neural network will not assign a single category to a term but give N answers (where N is the number of predefined classes), each of which O ∈ [0, 1] is the probability of the term to belong to a given class.

Panahandeh, Mahnaz, Ghanbari, Shirin. 2019. Correction of Spaces in Persian Sentences for Tokenization. 2019 5th Conference on Knowledge Based Engineering and Innovation (KBEI). :670–674.

The exponential growth of the Internet and its users and the emergence of Web 2.0 have caused a large volume of textual data to be created. Automatic analysis of such data can be used in making decisions. As online text is created by different producers with different styles of writing, pre-processing is a necessity prior to any processes related to natural language tasks. An essential part of textual preprocessing prior to the recognition of the word vocabulary is normalization, which includes the correction of spaces that particularly in the Persian language this includes both full-spaces between words and half-spaces. Through the review of user comments within social media services, it can be seen that in many cases users do not adhere to grammatical rules of inserting both forms of spaces, which increases the complexity of the identification of words and henceforth, reducing the accuracy of further processing on the text. In this study, current issues in the normalization and tokenization of preprocessing tools within the Persian language and essentially identifying and correcting the separation of words are and the correction of spaces are proposed. The results obtained and compared to leading preprocessing tools highlight the significance of the proposed methodology.

Zhu, Meng, Yang, Xudong. 2019. Chinese Texts Classification System. 2019 IEEE 2nd International Conference on Information and Computer Technologies (ICICT). :149–152.

In this article, we designed an automatic Chinese text classification system aiming to implement a system for classifying news texts. We propose two improved classification algorithms as two different choices for users to choose and then our system uses the chosen method for the obtaining of the classified result of the input text. There are two improved algorithms, one is k-Bayes using hierarchy conception based on NB method in machine learning field and another one adds attention layer to the convolutional neural network in deep learning field. Through experiments, our results showed that improved classification algorithms had better accuracy than based algorithms and our system is useful for making classifying news texts more reasonably and effectively.

Kermani, Fatemeh Hojati, Ghanbari, Shirin. 2019. Extractive Persian Summarizer for News Websites. 2019 5th International Conference on Web Research (ICWR). :85–89.

Automatic extractive text summarization is the process of condensing textual information while preserving the important concepts. The proposed method after performing pre-processing on input Persian news articles generates a feature vector of salient sentences from a combination of statistical, semantic and heuristic methods and that are scored and concatenated accordingly. The scoring of the salient features is based on the article's title, proper nouns, pronouns, sentence length, keywords, topic words, sentence position, English words, and quotations. Experimental results on measurements including recall, F-measure, ROUGE-N are presented and compared to other Persian summarizers and shown to provide higher performance.

Sel, Slhami, Hanbay, Davut. 2019. E-Mail Classification Using Natural Language Processing. 2019 27th Signal Processing and Communications Applications Conference (SIU). :1–4.

Thanks to the rapid increase in technology and electronic communications, e-mail has become a serious communication tool. In many applications such as business correspondence, reminders, academic notices, web page memberships, e-mail is used as primary way of communication. If we ignore spam e-mails, there remain hundreds of e-mails received every day. In order to determine the importance of received e-mails, the subject or content of each e-mail must be checked. In this study we proposed an unsupervised system to classify received e-mails. Received e-mails' coordinates are determined by a method of natural language processing called as Word2Vec algorithm. According to the similarities, processed data are grouped by k-means algorithm with an unsupervised training model. In this study, 10517 e-mails were used in training. The success of the system is tested on a test group of 200 e-mails. In the test phase M3 model (window size 3, min. Word frequency 10, Gram skip) consolidated the highest success (91%). Obtained results are evaluated in section VI.

Fahad, S.K. Ahammad, Yahya, Abdulsamad Ebrahim. 2018. Inflectional Review of Deep Learning on Natural Language Processing. 2018 International Conference on Smart Computing and Electronic Enterprise (ICSCEE). :1–4.

In the age of knowledge, Natural Language Processing (NLP) express its demand by a huge range of utilization. Previously NLP was dealing with statically data. Contemporary time NLP is doing considerably with the corpus, lexicon database, pattern reorganization. Considering Deep Learning (DL) method recognize artificial Neural Network (NN) to nonlinear process, NLP tools become increasingly accurate and efficient that begin a debacle. Multi-Layer Neural Network obtaining the importance of the NLP for its capability including standard speed and resolute output. Hierarchical designs of data operate recurring processing layers to learn and with this arrangement of DL methods manage several practices. In this paper, this resumed striving to reach a review of the tools and the necessary methodology to present a clear understanding of the association of NLP and DL for truly understand in the training. Efficiency and execution both are improved in NLP by Part of speech tagging (POST), Morphological Analysis, Named Entity Recognition (NER), Semantic Role Labeling (SRL), Syntactic Parsing, and Coreference resolution. Artificial Neural Networks (ANN), Time Delay Neural Networks (TDNN), Recurrent Neural Network (RNN), Convolution Neural Networks (CNN), and Long-Short-Term-Memory (LSTM) dealings among Dense Vector (DV), Windows Approach (WA), and Multitask learning (MTL) as a characteristic of Deep Learning. After statically methods, when DL communicate the influence of NLP, the individual form of the NLP process and DL rule collaboration was started a fundamental connection.

Peng, Tianrui, Harris, Ian, Sawa, Yuki. 2018. Detecting Phishing Attacks Using Natural Language Processing and Machine Learning. 2018 IEEE 12th International Conference on Semantic Computing (ICSC). :300–301.

Phishing attacks are one of the most common and least defended security threats today. We present an approach which uses natural language processing techniques to analyze text and detect inappropriate statements which are indicative of phishing attacks. Our approach is novel compared to previous work because it focuses on the natural language text contained in the attack, performing semantic analysis of the text to detect malicious intent. To demonstrate the effectiveness of our approach, we have evaluated it using a large benchmark set of phishing emails.

2020-04-20

Raber, Frederic, Krüger, Antonio. 2018. Deriving Privacy Settings for Location Sharing: Are Context Factors Always the Best Choice? 2018 IEEE Symposium on Privacy-Aware Computing (PAC). :86–94.

Research has observed context factors like occasion and time as influential factors for predicting whether or not to share a location with online friends. In other domains like social networks, personality was also found to play an important role. Furthermore, users are seeking a fine-grained disclosement policy that also allows them to display an obfuscated location, like the center of the current city, to some of their friends. In this paper, we observe which context factors and personality measures can be used to predict the correct privacy level out of seven privacy levels, which include obfuscation levels like center of the street or current city. Our results show that a prediction is possible with a precision 20% better than a constant value. We will give design indications to determine which context factors should be recorded, and how much the precision can be increased if personality and privacy measures are recorded using either a questionnaire or automated text analysis.

Raber, Frederic, Krüger, Antonio. 2018. Deriving Privacy Settings for Location Sharing: Are Context Factors Always the Best Choice? 2018 IEEE Symposium on Privacy-Aware Computing (PAC). :86–94.

2020-04-10

Bagui, Sikha, Nandi, Debarghya, Bagui, Subhash, White, Robert Jamie. 2019. Classifying Phishing Email Using Machine Learning and Deep Learning. 2019 International Conference on Cyber Security and Protection of Digital Services (Cyber Security). :1—2.

In this work, we applied deep semantic analysis, and machine learning and deep learning techniques, to capture inherent characteristics of email text, and classify emails as phishing or non -phishing.

2020-03-30

Mao, Huajian, Chi, Chenyang, Yu, Jinghui, Yang, Peixiang, Qian, Cheng, Zhao, Dongsheng. 2019. QRStream: A Secure and Convenient Method for Text Healthcare Data Transferring. 2019 41st Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC). :3458–3462.

With the increasing of health awareness, the users become more and more interested in their daily health information and healthcare activities results from healthcare organizations. They always try to collect them together for better usage. Traditionally, the healthcare data is always delivered by paper format from the healthcare organizations, and it is not easy and convenient for data usage and management. They would have to translate these data on paper to digital version which would probably introduce mistakes into the data. It would be necessary if there is a secure and convenient method for electronic health data transferring between the users and the healthcare organizations. However, for the security and privacy problems, almost no healthcare organization provides a stable and full service for health data delivery. In this paper, we propose a secure and convenient method, QRStream, which splits original health data and loads them onto QR code frame streaming for the data transferring. The results shows that QRStream can transfer text health data smoothly with an acceptable performance, for example, transferring 10K data in 10 seconds.

2020-03-18

Wu, Chia-Feng, Ti, Yen-Wu, Kuo, Sy-Yen, Yu, Chia-Mu. 2019. Benchmarking Dynamic Searchable Symmetric Encryption with Search Pattern Hiding. 2019 International Conference on Intelligent Computing and its Emerging Applications (ICEA). :65–69.

Searchable symmetric encryption (SSE) is an important technique for cloud computing. SSE allows encrypted critical data stored on an untrusted cloud server to be searched using keywords, returning correct data, but the keywords and data content are unknown by the server. However, an SSE database is not practical because the data is generally frequently modified even when stored on a remote server, since the server cannot update the encrypted data without decryption. Dynamic searchable symmetric encryption (DSSE) is designed to support this requirement. DSSE allows adding or deleting encrypted data on the server without decryption. Many DSSE systems have been proposed, based on link-list structures or blind storage (a new primitive). Each has advantages and drawbacks regarding function, extensibility, and efficiency. For a real system, the most important aspect is the tradeoff between performance and security. Therefore, we implemented several DSSE systems to compare their efficiency and security, and identify the various disadvantages with a view to developing an improved system.

2020-03-02

Sultana, Kazi Zakia, Chong, Tai-Yin. 2019. A Proposed Approach to Build an Automated Software Security Assessment Framework using Mined Patterns and Metrics. 2019 IEEE International Conference on Computational Science and Engineering (CSE) and IEEE International Conference on Embedded and Ubiquitous Computing (EUC). :176–181.

Software security is a major concern of the developers who intend to deliver a reliable software. Although there is research that focuses on vulnerability prediction and discovery, there is still a need for building security-specific metrics to measure software security and vulnerability-proneness quantitatively. The existing methods are either based on software metrics (defined on the physical characteristics of code; e.g. complexity or lines of code) which are not security-specific or some generic patterns known as nano-patterns (Java method-level traceable patterns that characterize a Java method or function). Other methods predict vulnerabilities using text mining approaches or graph algorithms which perform poorly in cross-project validation and fail to be a generalized prediction model for any system. In this paper, we envision to construct an automated framework that will assist developers to assess the security level of their code and guide them towards developing secure code. To accomplish this goal, we aim to refine and redefine the existing nano-patterns and software metrics to make them more security-centric so that they can be used for measuring the software security level of a source code (either file or function) with higher accuracy. In this paper, we present our visionary approach through a series of three consecutive studies where we (1) will study the challenges of the current software metrics and nano-patterns in vulnerability prediction, (2) will redefine and characterize the nano-patterns and software metrics so that they can capture security-specific properties of code and measure the security level quantitatively, and finally (3) will implement an automated framework for the developers to automatically extract the values of all the patterns and metrics for the given code segment and then flag the estimated security level as a feedback based on our research results. We accomplished some preliminary experiments and presented the results which indicate that our vision can be practically implemented and will have valuable implications in the community of software security.

2020-01-28

Calot, Enrique P., Ierache, Jorge S., Hasperué, Waldo. 2019. Document Typist Identification by Classification Metrics Applying Keystroke Dynamics Under Unidealised Conditions. 2019 International Conference on Document Analysis and Recognition Workshops (ICDARW). 8:19–24.

Keystroke Dynamics is the study of typing patterns and rhythm for personal identification and traits. Keystrokes may be analysed as fixed text such as passwords or as continuous typed text such as documents. This paper reviews different classification metrics for continuous text, such as the A and R metrics, Canberra, Manhattan and Euclidean and introduces a variant of the Minkowski distance. To test the metrics, we adopted a substantial dataset containing 239 thousand records acquired under real, harsh, and unidealised conditions. We propose a new parameter for the Minkowski metric, and we reinforce another for the A metric, as initially stated by its authors.