Bibliography
In earlier work we attempted to propose an unbreakable CAPTCHA and found that a time limit is effective in preventing computers from recognizing characters accurately, whereas, given unlimited time, computers can eventually recognize any text-based CAPTCHA. One common way to hinder character recognition by computers is distortion, and adding noise is also effective. However, these kinds of prevention also make it difficult for human beings to recognize the characters. As a solution to these problems, our team proposed an effective text-based CAPTCHA algorithm based on amodal completion. Our CAPTCHA imposes a large computational cost on computers, while amodal completion helps human beings recognize the characters almost instantly. The CAPTCHA has since evolved with aftereffects and combinations of complementary colors. We evaluated it against deep learning, which is attracting the most attention because it is faster and more accurate than existing machine recognition methods. In this paper, we add jagged lines to the edges of the characters, since edges are among the most important cues for recognition by deep learning. We also evaluate how much the jagged lines reduce recognition by human beings and how much they hinder recognition by computers, and we confirm the effect of our method on deep learning.
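As a rough illustration of the kind of edge perturbation described (not the authors' algorithm), the sketch below toggles random pixels along the outline of a binary glyph to produce a jagged edge; the toggle probability and the toy glyph are illustrative assumptions.

```python
import numpy as np

def jag_edges(glyph, toggle_prob=0.35, seed=0):
    """Roughen the outline of a binary glyph (1 = ink, 0 = background)
    by randomly toggling pixels that lie on the foreground/background edge."""
    rng = np.random.default_rng(seed)
    g = glyph.astype(bool)
    # A pixel is an edge pixel if any 4-neighbour differs from it.
    padded = np.pad(g, 1, mode="edge")
    neighbours = [padded[:-2, 1:-1], padded[2:, 1:-1],
                  padded[1:-1, :-2], padded[1:-1, 2:]]
    edge = np.zeros_like(g)
    for n in neighbours:
        edge |= (n != g)
    # Randomly flip a fraction of the edge pixels to create a jagged outline.
    flips = edge & (rng.random(g.shape) < toggle_prob)
    return (g ^ flips).astype(np.uint8)

# Toy 'I'-shaped glyph, just for demonstration.
glyph = np.zeros((12, 8), dtype=np.uint8)
glyph[2:10, 3:5] = 1
print(jag_edges(glyph))
```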
Community question answering (cQA) has become an important topic due to the popularity of cQA archives on the Web. This paper focuses on addressing the lexical gap problem in question retrieval. Question retrieval in cQA archives aims to find existing questions that are semantically equivalent or relevant to the queried questions; the lexical gap, however, poses a new challenge for question retrieval in cQA. In this paper, we propose to model and learn distributed word representations using the category metadata of cQA pages for question retrieval, via two novel category-powered models: a basic model called MB-NET and an enhanced model called ME-NET, which can better learn the distributed word representations and alleviate the lexical gap problem. To deal with the variable size of the sets of word representation vectors, we employ the Fisher kernel framework to transform them into fixed-length vectors. Experimental results on large-scale English and Chinese cQA data sets show that our proposed approaches significantly outperform state-of-the-art retrieval models for question retrieval in cQA. Moreover, large-scale automatic evaluation experiments show that promising and significant performance improvements can be achieved.
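For illustration of the general Fisher-kernel idea (mapping a variable-size set of word vectors to a fixed-length vector), here is a minimal sketch that keeps only the gradients with respect to the GMM means, a common Fisher-vector simplification; the embedding dimensionality, number of Gaussians, and random vectors are illustrative assumptions, not the paper's settings.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fisher_vector(word_vecs, gmm):
    """Map an (n_words, dim) matrix to a fixed-length vector via the
    gradient of the GMM log-likelihood w.r.t. the component means."""
    X = np.atleast_2d(word_vecs)
    gamma = gmm.predict_proba(X)                   # (n_words, K) soft assignments
    diff = X[:, None, :] - gmm.means_[None, :, :]  # (n_words, K, dim)
    grad_mu = np.einsum("nk,nkd->kd", gamma, diff / gmm.covariances_[None, :, :])
    grad_mu /= X.shape[0] * np.sqrt(gmm.weights_)[:, None]  # rough Fisher normalization
    return grad_mu.ravel()                         # fixed length K * dim

rng = np.random.default_rng(0)
vocab_vectors = rng.normal(size=(1000, 50))        # stand-in for learned embeddings
gmm = GaussianMixture(n_components=4, covariance_type="diag",
                      random_state=0).fit(vocab_vectors)

question = rng.normal(size=(7, 50))                # 7 word vectors -> one 200-d vector
print(fisher_vector(question, gmm).shape)          # (200,)
```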
Protecting Critical Infrastructures (CIs) against contemporary cyber attacks has become a crucial as well as complex task. Modern attack campaigns, such as Advanced Persistent Threats (APTs), leverage weaknesses in an organization's business processes and exploit vulnerabilities in several systems to hit their target. Although their life-cycle can last for months, these campaigns typically go undetected until they achieve their goal. They usually aim at exfiltrating data or causing service disruptions, and can also undermine human safety. Novel detection techniques and incident handling approaches are therefore required to effectively protect CI networks and react to this type of threat in a timely manner. Correlating large amounts of data collected from a multitude of relevant sources is necessary, and sometimes required by national authorities, to establish cyber situational awareness and allow suitable countermeasures to be adopted promptly in case of an attack. In this paper we propose three novel methods for security information correlation designed to discover relevant insights and support the establishment of cyber situational awareness.
Institutions use the information security (InfoSec) policy document as a set of rules and guidelines to govern the use of institutional information resources. However, a common problem is that these policies are often not followed or complied with. This study explores the extent to which the problem lies with the policy documents themselves. InfoSec policies are written in natural language, which is prone to ambiguity and misinterpretation. Consequently, such policies may be ambiguous, making them hard, if not impossible, for users to comply with. A case study approach with content analysis was adopted, and the research explores the extent of the problem using a case study of an educational institution in South Africa.
Keystroke dynamics analysis has been applied successfully to the verification of passwords and other fixed short texts as a means of reducing their inherent security limitations, because their length and the fact that they are typed often make their characteristic timings fairly stable. Free-text analysis, on the other hand, was neglected until recent years due to the inherent difficulty of dealing with short-term behavioral noise and long-term effects on typing rhythm. In this paper we examine finite context modeling of keystroke dynamics in free text and report promising results for user verification on an extensive data set, collected in a real-world environment outside the laboratory, which we make publicly available.
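As a simplified illustration of context-based timing models for free-text keystrokes (a per-digraph latency profile, not the paper's finite context modeling scheme), consider the sketch below; the timings, tolerance, and scoring rule are illustrative assumptions.

```python
from collections import defaultdict
import statistics

def digraph_latencies(events):
    """events: list of (key, press_time_ms). Returns {(k1, k2): [latencies]}."""
    model = defaultdict(list)
    for (k1, t1), (k2, t2) in zip(events, events[1:]):
        model[(k1, k2)].append(t2 - t1)
    return model

def score(profile, sample, tolerance=1.5):
    """Fraction of sample digraph latencies within `tolerance` standard
    deviations of the enrolled user's mean latency for that digraph."""
    hits = total = 0
    for digraph, lats in digraph_latencies(sample).items():
        if digraph not in profile or len(profile[digraph]) < 2:
            continue
        mu = statistics.mean(profile[digraph])
        sd = statistics.stdev(profile[digraph]) or 1.0
        for lat in lats:
            total += 1
            hits += abs(lat - mu) <= tolerance * sd
    return hits / total if total else 0.0

enrolled = digraph_latencies([("t", 0), ("h", 110), ("e", 205),
                              ("t", 400), ("h", 505), ("e", 610)])
print(score(enrolled, [("t", 0), ("h", 120), ("e", 222)]))
```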
Detecting early trends indicating cognitive decline can allow older adults to better manage their health, but current assessments present barriers that preclude such continuous monitoring by consumers. To explore the effects of cognitive status on computer interaction patterns, the authors collected typed text samples from older adults with and without pre-mild cognitive impairment (PreMCI) and constructed statistical models from keystroke and linguistic features to differentiate between the two groups. Using both feature sets, they obtained a 77.1 percent correct classification rate with 70.6 percent sensitivity, 83.3 percent specificity, and a 0.808 area under the curve (AUC). These results are in line with current assessments for MCI, a more advanced condition, but are obtained with an unobtrusive method. This research contributes a combination of features for text and keystroke analysis and enhances understanding of how clinicians, or older adults themselves, might monitor for PreMCI through patterns in typed text. It has implications for embedded systems that can enable healthcare providers and consumers to proactively and continuously monitor changes in cognitive function.
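For reference, the reported metrics (sensitivity, specificity, AUC) for a binary PreMCI-vs-control classifier can be computed as sketched below with scikit-learn; the toy labels and scores are placeholders, not the study's data.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

# Toy ground-truth labels (1 = PreMCI, 0 = control) and classifier scores.
y_true = np.array([1, 1, 1, 0, 0, 0, 1, 0])
y_score = np.array([0.9, 0.7, 0.4, 0.2, 0.35, 0.1, 0.8, 0.6])
y_pred = (y_score >= 0.5).astype(int)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)   # true positive rate
specificity = tn / (tn + fp)   # true negative rate
auc = roc_auc_score(y_true, y_score)
print(f"sensitivity={sensitivity:.3f} specificity={specificity:.3f} AUC={auc:.3f}")
```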
Document image binarization is performed to segment foreground text from the background in badly degraded documents. In this paper, a comprehensive survey is conducted of several state-of-the-art document image binarization techniques. After describing these techniques, their performance is compared using various evaluation metrics that are widely used in document image analysis and recognition. On the basis of this comparison, the adaptive contrast method is found to be the best performing method. Accordingly, the partial results that we have obtained with the adaptive contrast method are reported, and its mathematical model and block diagram are described in detail.
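The adaptive contrast method itself combines several steps; as a rough, much simpler illustration of adaptive document binarization, the sketch below uses the Sauvola local threshold from scikit-image on a synthetic degraded page (window size, k, and the toy image are illustrative assumptions).

```python
import numpy as np
from skimage.filters import threshold_sauvola

def binarize(gray, window_size=25, k=0.2):
    """Return a binary image: True where the pixel is foreground (ink)."""
    thresh = threshold_sauvola(gray, window_size=window_size, k=k)
    return gray < thresh   # darker than the local threshold -> text

# Synthetic degraded page: dark strokes on an unevenly lit, noisy background.
rng = np.random.default_rng(0)
page = 0.8 + 0.1 * rng.random((64, 64))        # bright, noisy background
page[20:24, 10:50] = 0.2                       # a dark "text" stroke
print(binarize(page).sum(), "foreground pixels")
```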
As a highly valuable cultural heritage, palm leaf manuscripts offer a new challenge for document analysis systems due to the specific characteristics of their physical support. In the search for an optimal binarization method for palm leaf manuscript images, creating new ground truth binarized images is a necessary step. However, because ground truthing involves human intervention, the effect of subjectivity on the construction of the ground truth binarized image must be analysed and reported. In this paper, we present an experiment under real conditions to analyse the existence of human subjectivity in the construction of ground truth binarized images of palm leaf manuscripts and to quantify the resulting ground truth variability with several binarization evaluation metrics.
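Two of the commonly used binarization evaluation metrics (F-measure and PSNR) can quantify how far one binarized image deviates from another, e.g. two annotators' ground truths; a minimal sketch with toy images follows (the images are illustrative, not manuscript data).

```python
import numpy as np

def f_measure(pred, gt):
    """pred, gt: boolean arrays, True = foreground."""
    tp = np.logical_and(pred, gt).sum()
    precision = tp / pred.sum() if pred.sum() else 0.0
    recall = tp / gt.sum() if gt.sum() else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def psnr(pred, gt):
    """PSNR between two binary images treated as 0/1 intensities."""
    mse = np.mean((pred.astype(float) - gt.astype(float)) ** 2)
    return float("inf") if mse == 0 else 10 * np.log10(1.0 / mse)

gt = np.zeros((32, 32), dtype=bool); gt[10:20, 5:25] = True   # one annotator's strokes
alt = gt.copy(); alt[10:20, 23:27] = True                     # a second annotator's version
print(f"F-measure={f_measure(alt, gt):.3f}  PSNR={psnr(alt, gt):.2f} dB")
```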
The development of the internet has been accompanied by the growth of another domain: cyber-crime. Users can be exposed to illegal activity, so it has become important to make the technology reliable. Phishing techniques prominently involve email messages: phishing emails are socially engineered messages that either link to a hosted phishing website or carry malware code that executes some action when the URL is clicked. Lexically analyzing the URLs can enhance performance and help to differentiate between legitimate email and phishing URLs. As assessed in this study, combining lexical analysis of phishing URLs with textual analysis of the email makes classification successful and results in highly precise anti-phishing.
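As an illustration of the kind of lexical URL features such a classifier might use, here is a short sketch; the feature set and the suspicious-token list are illustrative assumptions, not the study's exact features. The resulting dictionaries could be fed to any standard classifier.

```python
from urllib.parse import urlparse

SUSPICIOUS_TOKENS = ("login", "verify", "update", "secure", "account")  # illustrative list

def lexical_features(url):
    """Simple lexical features often used to separate phishing from benign URLs."""
    parsed = urlparse(url)
    host = parsed.netloc
    return {
        "url_length": len(url),
        "host_length": len(host),
        "num_dots": host.count("."),
        "num_hyphens": host.count("-"),
        "num_digits": sum(c.isdigit() for c in url),
        "has_at_symbol": "@" in url,
        "has_ip_host": host.replace(".", "").isdigit(),
        "suspicious_tokens": sum(tok in url.lower() for tok in SUSPICIOUS_TOKENS),
    }

print(lexical_features("http://secure-login.example-bank.accounts-verify.tk/update?id=123"))
```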
With the growing success of big data, many challenges have appeared. Timeliness, scalability and privacy are the main problems that researchers attempt to address. Privacy preservation is now a highly active domain of research, and many works and concepts have emerged within this theme. One of these concepts is de-identification: a specific area that consists of finding and removing sensitive information, either by replacing it, encrypting it or adding noise to it, using techniques such as cryptography and data mining. In this report, we present a new model for de-identification of textual data using a particular artificial immune system algorithm known as CLONALG.
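For readers unfamiliar with CLONALG, the sketch below shows the generic clonal selection loop at its core (select the best antibodies, clone them, hypermutate inversely to affinity, reselect); the bit-string encoding and toy affinity function are illustrative stand-ins, not the paper's de-identification setup.

```python
import random

def clonalg(affinity, length=20, pop_size=20, n_select=5, clones_per=4,
            generations=60, seed=1):
    """Generic CLONALG over bit strings: select, clone, hypermutate, replace."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=affinity, reverse=True)
        clones = []
        for rank, antibody in enumerate(pop[:n_select]):
            # Higher-affinity antibodies receive a lower mutation rate.
            rate = 0.5 * (rank + 1) / n_select
            for _ in range(clones_per):
                clones.append([b ^ (rng.random() < rate) for b in antibody])
        pop = sorted(pop + clones, key=affinity, reverse=True)[:pop_size]
    return max(pop, key=affinity)

target = [1, 0] * 10
best = clonalg(lambda ab: sum(a == t for a, t in zip(ab, target)))
print(best, "affinity:", sum(a == t for a, t in zip(best, target)))
```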
With the advancement of technology, the world has not only become a better place to live in but has also lost the privacy and security of shared data. Information in any form is never safe from unauthorized access. In this paper we propose an approach for preserving data using visual cryptography: text displayed in sixteen-segment form is broken into two shares that reveal no information about the original images. With this process we have obtained satisfactory results in statistical and structural tests.
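For context, the classical (2,2) visual secret sharing scheme works as sketched below: each secret pixel expands into a 2x2 block; the two shares carry identical blocks for white pixels and complementary blocks for black ones, so stacking (OR-ing) the shares reveals the secret while each share alone looks random. The tiny secret image is illustrative.

```python
import numpy as np

PATTERNS = [np.array([[1, 0], [0, 1]]), np.array([[0, 1], [1, 0]])]  # 2x2 half-black blocks

def make_shares(secret, seed=0):
    """secret: 2-D array, 1 = black pixel. Returns two shares (1 = black subpixel)."""
    rng = np.random.default_rng(seed)
    h, w = secret.shape
    s1 = np.zeros((2 * h, 2 * w), dtype=np.uint8)
    s2 = np.zeros_like(s1)
    for y in range(h):
        for x in range(w):
            p = PATTERNS[rng.integers(2)]
            s1[2*y:2*y+2, 2*x:2*x+2] = p
            # White pixel: same pattern in both shares. Black: complementary pattern.
            s2[2*y:2*y+2, 2*x:2*x+2] = p if secret[y, x] == 0 else 1 - p
    return s1, s2

secret = np.array([[1, 0, 1], [0, 1, 0]])
a, b = make_shares(secret)
stacked = a | b               # stacking transparencies ~ pixel-wise OR
print(stacked)                # black secret pixels become fully black 2x2 blocks
```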
Unstructured data mining has become topical recently due to the availability of high-dimensional and voluminous digital content (known as "Big Data") across the enterprise spectrum. Relational Database Management Systems (RDBMS) have been employed over the past decades for content storage and management, but the ever-growing heterogeneity of today's data calls for a new storage approach. Thus, NoSQL databases have emerged as the preferred storage facility, since they support unstructured data storage. This creates the need to explore efficient data mining techniques for such NoSQL systems, since the available tools and frameworks designed for RDBMS are often not directly applicable. In this paper, we focus on mining topics and terms, based on clustering, in document-based NoSQL stores. This is achieved by adapting the architectural design of an analytics-as-a-service framework and employing the Viterbi algorithm to enhance the accuracy of term classification in the system. The results from pilot testing of our work show higher accuracy in comparison to some previously proposed techniques, such as parallel search.
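For reference, a standard Viterbi decoder of the kind that could assign the most likely sequence of term classes is sketched below; the tiny two-state model and probabilities are illustrative, not the system's actual parameters.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable state sequence for the observations."""
    V = [{s: (start_p[s] * emit_p[s][observations[0]], None) for s in states}]
    for obs in observations[1:]:
        V.append({s: max(((V[-1][prev][0] * trans_p[prev][s] * emit_p[s][obs], prev)
                          for prev in states), key=lambda x: x[0]) for s in states})
    # Backtrack from the best final state.
    state = max(states, key=lambda s: V[-1][s][0])
    path = [state]
    for layer in reversed(V[1:]):
        state = layer[state][1]
        path.append(state)
    return list(reversed(path))

states = ("topic", "generic")
start = {"topic": 0.4, "generic": 0.6}
trans = {"topic": {"topic": 0.7, "generic": 0.3},
         "generic": {"topic": 0.3, "generic": 0.7}}
emit = {"topic": {"nosql": 0.5, "data": 0.3, "the": 0.2},
        "generic": {"nosql": 0.1, "data": 0.3, "the": 0.6}}
print(viterbi(["the", "nosql", "data"], states, start, trans, emit))
```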
Interactive visualization provides valuable support for exploring, analyzing, and understanding textual documents. Certain tasks, however, require that insights derived from visual abstractions are verified by a human expert perusing the source text. So far, this problem is typically solved by offering overview-detail techniques, which present different views with different levels of abstractions. This often leads to problems with visual continuity. Focus-context techniques, on the other hand, succeed in accentuating interesting subsections of large text documents but are normally not suited for integrating visual abstractions. With VarifocalReader we present a technique that helps to solve some of these approaches' problems by combining characteristics from both. In particular, our method simplifies working with large and potentially complex text documents by simultaneously offering abstract representations of varying detail, based on the inherent structure of the document, and access to the text itself. In addition, VarifocalReader supports intra-document exploration through advanced navigation concepts and facilitates visual analysis tasks. The approach enables users to apply machine learning techniques and search mechanisms as well as to assess and adapt these techniques. This helps to extract entities, concepts and other artifacts from texts. In combination with the automatic generation of intermediate text levels through topic segmentation for thematic orientation, users can test hypotheses or develop interesting new research questions. To illustrate the advantages of our approach, we provide usage examples from literature studies.
Vector space models (VSMs) are mathematically well-defined frameworks that have been widely used in text processing. In these models, high-dimensional, often sparse vectors represent text units. In an application, the similarity of vectors -- and hence of the text units that they represent -- is computed by a distance formula. The high dimensionality of the vectors, however, is a barrier to the performance of methods that employ VSMs. Consequently, a dimensionality reduction technique is employed to alleviate this problem. This paper introduces a new method, called Random Manhattan Indexing (RMI), for the construction of L1 normed VSMs at reduced dimensionality. RMI combines the construction of a VSM and dimension reduction into an incremental, and thus scalable, procedure. To attain its goal, RMI employs sparse Cauchy random projections.
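To convey the general idea (a sketch, not the exact RMI construction), one can project high-dimensional vectors with a sparse random matrix whose nonzero entries are Cauchy-distributed and estimate the L1 distance from the median of absolute coordinate-wise differences in the reduced space; the sparsity level, dimensions, and rescaling are illustrative assumptions.

```python
import numpy as np

def sparse_cauchy_matrix(dim_in, dim_out, density=0.1, seed=0):
    """Random projection matrix with Cauchy-distributed nonzeros (L1-friendly)."""
    rng = np.random.default_rng(seed)
    mask = rng.random((dim_in, dim_out)) < density
    return np.where(mask, rng.standard_cauchy((dim_in, dim_out)), 0.0)

def l1_estimate(px, py):
    """Estimate the L1 distance from two projected vectors via the sample
    median of |px - py| (the median estimator of a Cauchy scale)."""
    return np.median(np.abs(px - py))

rng = np.random.default_rng(1)
x, y = rng.random(10000), rng.random(10000)        # stand-ins for sparse term vectors
R = sparse_cauchy_matrix(10000, 400)
print("true L1:", np.abs(x - y).sum())
# Divide by the density to roughly correct for the sparse mask.
print("estimated L1:", l1_estimate(x @ R, y @ R) / 0.1)
```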
DeepQA is a large-scale natural language processing (NLP) question-and-answer system that responds across a breadth of structured and unstructured data, from hundreds of analytics that are combined with over 50 models, trained through machine learning. After the 2011 historic milestone of defeating the two best human players in the Jeopardy! game show, the technology behind IBM Watson, DeepQA, is undergoing gamification into real-world business problems. Gamifying a business domain for Watson is a composite of functional, content, and training adaptation for nongame play. During domain gamification for medical, financial, government, or any other business, each system change affects the machine-learning process. As opposed to the original Watson Jeopardy!, whose class distribution of positive-to-negative labels is 1:100, in adaptation the computed training instances, question-and-answer pairs transformed into true-false labels, result in a very low positive-to-negative ratio of 1:100 000. Such initial extreme class imbalance during domain gamification poses a big challenge for the Watson machine-learning pipelines. The combination of ingested corpus sets, question-and-answer pairs, configuration settings, and NLP algorithms contribute toward the challenging data state. We propose several data engineering techniques, such as answer key vetting and expansion, source ingestion, oversampling classes, and question set modifications to increase the computed true labels. In addition, algorithm engineering, such as an implementation of the Newton-Raphson logistic regression with a regularization term, relaxes the constraints of class imbalance during training adaptation. We conclude by empirically demonstrating that data and algorithm engineering are complementary and indispensable to overcome the challenges in this first Watson gamification for real-world business problems.
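For readers unfamiliar with the optimization step mentioned above, the sketch below shows Newton-Raphson (IRLS) updates for L2-regularized logistic regression; the toy data and regularization strength are illustrative, not Watson's training setup.

```python
import numpy as np

def newton_logreg(X, y, lam=1.0, iters=25):
    """L2-regularized logistic regression fitted by Newton-Raphson updates."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ w))            # predicted probabilities
        grad = X.T @ (p - y) + lam * w              # gradient of regularized loss
        W = p * (1 - p)                             # Hessian weights (diagonal)
        H = (X * W[:, None]).T @ X + lam * np.eye(d)
        w -= np.linalg.solve(H, grad)               # Newton step
    return w

rng = np.random.default_rng(0)
X = np.hstack([np.ones((200, 1)), rng.normal(size=(200, 2))])   # bias + 2 features
true_w = np.array([-0.5, 2.0, -1.0])
y = (rng.random(200) < 1 / (1 + np.exp(-X @ true_w))).astype(float)
print(newton_logreg(X, y, lam=0.1))
```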
In this paper we propose a Twitter sentiment analytics approach that mines opinion polarity about a given topic. Most current semantic sentiment analytics depends on polarity lexicons; however, many key tone words are frequently bipolar. We demonstrate a technique that accommodates the bipolarity of tone words through a context-sensitive tone lexicon learning mechanism, where the context is modeled by the semantic neighborhood of the main target. Performance analysis shows that the ability to contextualize tone word polarity significantly improves accuracy.
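As a toy illustration of resolving a bipolar tone word from its context (not the paper's learned lexicon), the sketch below scores the word's neighborhood against small positive and negative seed lists; the seed lexicons and window size are illustrative assumptions.

```python
POSITIVE_SEEDS = {"great", "love", "win", "fast", "up"}       # illustrative seed lexicons
NEGATIVE_SEEDS = {"bad", "hate", "crash", "slow", "down"}

def contextual_polarity(tone_word, tweet_tokens, window=3):
    """Decide the polarity of a bipolar tone word from its local context."""
    if tone_word not in tweet_tokens:
        return 0
    i = tweet_tokens.index(tone_word)
    context = tweet_tokens[max(0, i - window): i] + tweet_tokens[i + 1: i + 1 + window]
    score = (sum(tok in POSITIVE_SEEDS for tok in context)
             - sum(tok in NEGATIVE_SEEDS for tok in context))
    return (score > 0) - (score < 0)               # +1 positive, -1 negative, 0 undecided

# "unpredictable" reads negative for software but positive for a plot twist.
print(contextual_polarity("unpredictable",
                          "the app is slow and unpredictable and tends to crash".split()))
print(contextual_polarity("unpredictable",
                          "love the unpredictable ending what a great movie".split()))
```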
An increasing number of people are using online social networking services (SNSs), and a significant amount of information related to consumption experiences is shared in this new media form. Text mining is an emerging technique for mining useful information from the web. We aim at discovering semantic patterns, in particular in tweets, in consumers' discussions on social media. Specifically, the purposes of this study are twofold: 1) finding similarity and dissimilarity between two sets of textual documents that reflect consumers' sentiment polarities, i.e., positive vs. negative opinions, and 2) deriving actual content with a semantic trend from the textual data. The considered tweets include consumers' opinions on US retail companies (e.g., Amazon, Walmart). Cosine similarity and K-means clustering are used to achieve the former goal, and Latent Dirichlet Allocation (LDA), a popular topic modeling algorithm, is used for the latter purpose. This is the first study to discover semantic properties of textual data in a consumption context beyond sentiment analysis. In addition to the major findings, we apply LDA to the same data and draw latent topics that represent consumers' positive and negative opinions on social media.
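The pipeline named above (TF-IDF cosine similarity, K-means clustering, LDA topic modeling) can be sketched with scikit-learn as follows; the placeholder tweets, vectorizer settings, and cluster/topic counts are illustrative assumptions, not the study's configuration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.cluster import KMeans
from sklearn.decomposition import LatentDirichletAllocation

positive = ["fast delivery and great prices", "love the easy returns"]
negative = ["order arrived late again", "terrible support and late refund"]
tweets = positive + negative

# 1) Similarity between the positive and negative sets of tweets.
tfidf = TfidfVectorizer(stop_words="english").fit(tweets)
sim = cosine_similarity(tfidf.transform(positive), tfidf.transform(negative))
print("cross-set cosine similarity:\n", sim)

# 2) K-means clustering of all tweets in TF-IDF space.
print("clusters:", KMeans(n_clusters=2, n_init=10,
                          random_state=0).fit_predict(tfidf.transform(tweets)))

# 3) LDA topics over raw term counts.
counts = CountVectorizer(stop_words="english").fit(tweets)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts.transform(tweets))
terms = counts.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    print(f"topic {k}:", [terms[i] for i in topic.argsort()[-3:]])
```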
The huge amount of user log data collected by search engine providers creates new opportunities to understand user loyalty and defection behavior at an unprecedented scale. However, it also poses a great challenge to analyze this behavior and glean insights from the complex, large-scale data. In this paper, we introduce LoyalTracker, a visual analytics system for tracking user loyalty and switching behavior towards multiple search engines from vast amounts of user log data. We propose a new interactive visualization technique (the flow view) based on a flow metaphor, which conveys a visual summary of the loyalty dynamics of thousands of users over time. Two other visualization techniques, a density map and a word cloud, are integrated to enable analysts to gain further insight into the patterns identified by the flow view. Case studies and interviews with domain experts were conducted to demonstrate the usefulness of our technique in understanding user loyalty and switching behavior in search engines.
This paper describes a high-performance and space-efficient memory-resident datastore for text analytics systems, based on a hash table for fast access, a dynamic trie for staging, and a list of Level-Order Unary Degree Sequence (LOUDS) tries for compactness. We achieve efficient memory allocation and data placement by placing frequently accessed keys in the hash table and infrequently accessed keys in the LOUDS tries, without using conventional cache algorithms. Our algorithm also dynamically changes the memory allocation sizes for these data structures according to the remaining available memory. This technique yields 38.6% to 52.9% better throughput than a double-array trie, a conventional fast and compact datastore.
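To make the LOUDS representation concrete, the sketch below encodes a trie breadth-first, emitting one '1' per child followed by a '0' for each node and storing the child labels in level order; the nested-dict trie is an illustrative stand-in for the paper's datastore, and rank/select navigation is omitted.

```python
from collections import deque

def to_louds(trie):
    """Encode a nested-dict trie into a LOUDS bit string plus level-order labels."""
    bits, labels = ["10"], []               # "10" is the super-root pointing at the root
    queue = deque([trie])
    while queue:
        node = queue.popleft()
        bits.append("1" * len(node) + "0")  # one '1' per child, then a terminating '0'
        for label in sorted(node):
            labels.append(label)
            queue.append(node[label])
    return "".join(bits), labels

# Keys "to", "tea", "ten" stored as a character trie.
trie = {"t": {"o": {}, "e": {"a": {}, "n": {}}}}
bits, labels = to_louds(trie)
print(bits)     # 10 10 110 110 0 0 0  -> super-root, root, 't', 'e', 'o', 'a', 'n'
print(labels)   # ['t', 'e', 'o', 'a', 'n']
```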
Word clouds have emerged as a straightforward and visually appealing visualization method for text. They are used in various contexts as a means to provide an overview by distilling text down to those words that appear with the highest frequency. Typically, this is done in a static way as pure text summarization. We think, however, that this simple yet powerful visualization paradigm holds larger potential for text analytics. In this work, we explore the usefulness of word clouds for general text analysis tasks. We developed a prototypical system called the Word Cloud Explorer that relies entirely on word clouds as a visualization method. It equips them with advanced natural language processing, sophisticated interaction techniques, and context information. We show how this approach can be effectively used to solve text analysis tasks and evaluate it in a qualitative user study.
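The "distillation" step underlying any word cloud is essentially a stop-word-filtered frequency count; a minimal sketch (the stop-word list and length cutoff are illustrative):

```python
from collections import Counter
import re

STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it", "its",
             "are", "so", "for", "on", "this"}

def top_words(text, k=5):
    """Return the k most frequent non-stop-words, i.e. what a word cloud would emphasize."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(t for t in tokens if t not in STOPWORDS and len(t) > 2).most_common(k)

sample = ("Word clouds distill a text down to its most frequent words. "
          "The frequent words are drawn larger, so the cloud gives a quick overview of the text.")
print(top_words(sample))
```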
Formal methods, models and tools for social big data analytics are largely limited to graph-theoretical approaches, such as social network analysis (SNA) informed by relational sociology, and there are no other unified modeling approaches to social big data that integrate the conceptual, formal and software realms. In this paper, we first present and discuss a theory and conceptual model of social data. Second, we outline a formal model based on set theory and discuss the semantics of the formal model with a real-world social data example from Facebook. Third, we briefly present and discuss the Social Data Analytics Tool (SODATO), which realizes the conceptual model in software and provisions social data analysis based on the conceptual and formal models. Fourth and last, based on the formal model and sentiment analysis of text, we present a method for profiling artifacts and actors and apply it to the analysis of big social data collected from the Facebook page of the fast fashion company H&M.
Keeping up with rapid advances in research in various fields of engineering and technology is a challenging task. Decision makers, including academics, program managers, venture capital investors, industry leaders and funding agencies, not only need to be abreast of the latest developments but must also be able to assess the effect of growth in certain areas on their core business. Although analyst agencies like Gartner and McKinsey provide such reports for some areas, thought leaders of all organisations still need to amass data from heterogeneous collections like research publications, analyst reports, patent applications and competitor information to finalize their own strategies. Text mining and data analytics researchers have been looking at integrating statistics, text analytics and information visualization to aid the process of retrieval and analysis. In this paper, we present our work on automated topical analysis and insight generation from large heterogeneous text collections of publications and patents. While most of the earlier work in this area provides search-based platforms, ours is an integrated platform for search and analysis. We present several methods and techniques that help in the analysis and better comprehension of search results, as well as methods for generating insights about emerging and popular trends in research, along with contextual differences between academic research and patenting profiles. We also present novel techniques for presenting topic evolution, helping users understand how a particular area has evolved over time.