Biblio
Text analytics systems often rely heavily on detecting entity mentions in documents and linking them to knowledge bases for downstream applications such as sentiment analysis, question answering, and recommender systems. A major challenge for this task is accurately detecting entities in new languages with limited labeled resources. In this paper we present an accurate, lightweight, multilingual named entity recognition (NER) and linking (NEL) system. The contributions of this paper are three-fold: 1) Lightweight named entity recognition with competitive accuracy; 2) Candidate entity retrieval that uses search click-log data and entity embeddings to achieve high precision with a low memory footprint; and 3) Efficient entity disambiguation. Our system achieves state-of-the-art performance on TAC KBP 2013 multilingual data and on English AIDA-CoNLL data.
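As an illustration of the candidate-retrieval idea only (the alias table, entities, and embedding vectors below are invented, not taken from the paper), one might rank knowledge-base candidates for a mention by the similarity between their entity embeddings and an embedding of the mention's context:

    import numpy as np

    # Toy alias table as might be mined from search click logs:
    # surface form -> candidate knowledge-base entities (hypothetical data).
    alias_candidates = {
        "jaguar": ["Jaguar_Cars", "Jaguar_(animal)"],
    }

    # Hypothetical low-dimensional entity embeddings.
    entity_embeddings = {
        "Jaguar_Cars": np.array([0.9, 0.1, 0.0]),
        "Jaguar_(animal)": np.array([0.1, 0.8, 0.3]),
    }

    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    def link_mention(surface_form, context_embedding):
        """Return candidates for a mention, ranked by similarity to its context."""
        candidates = alias_candidates.get(surface_form.lower(), [])
        scored = [(cosine(entity_embeddings[c], context_embedding), c) for c in candidates]
        return sorted(scored, reverse=True)

    # Context vector for a sentence about cars (also made up for illustration).
    print(link_mention("Jaguar", np.array([0.8, 0.2, 0.1])))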
Self-describing key-value data formats such as JSON are becoming increasingly popular as application developers choose to avoid the rigidity imposed by the relational model. Database systems designed for these self-describing formats, such as MongoDB, encourage users to use denormalized, heavily nested data models so that relationships across records and other schema information need not be predefined or standardized. Such data models contribute to long-term development complexity, as their lack of explicit entity and relationship tracking burdens new developers unfamiliar with the dataset. Furthermore, the large amount of data repetition present in such data layouts can introduce update anomalies and poor scan performance, which reduce both the quality and performance of analytics over the data. In this paper we present an algorithm that automatically transforms the denormalized, nested data commonly found in NoSQL systems into traditional relational data that can be stored in a standard RDBMS. This process includes a schema generation algorithm that discovers relationships across the attributes of the denormalized datasets in order to organize those attributes into relational tables. It further includes a matching algorithm that discovers sets of attributes that represent overlapping entities and merges those sets together. These algorithms reduce data repetition, allow the use of data analysis tools targeted at relational data, accelerate scan-intensive algorithms over the data, and help users gain a semantic understanding of complex, nested datasets.
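A minimal sketch of the general transformation the abstract describes, not the paper's actual algorithm: nested arrays of objects become child tables with foreign keys back to the parent row, and nested objects become prefixed columns (the document and table names are invented):

    import json

    def flatten(doc, table, tables, parent_key=None, parent_id=None):
        """Flatten one nested document into relational-style tables (dicts of rows)."""
        row = {"_id": len(tables.setdefault(table, [])) + 1}
        if parent_key is not None:
            row[parent_key] = parent_id
        for key, value in doc.items():
            if isinstance(value, list) and value and isinstance(value[0], dict):
                # Nested array of objects -> child table with a foreign key.
                for child in value:
                    flatten(child, table + "_" + key, tables, table + "_id", row["_id"])
            elif isinstance(value, dict):
                # Nested object -> prefixed columns in the current row.
                for child_key, child_value in value.items():
                    row[key + "_" + child_key] = child_value
            else:
                row[key] = value
        tables[table].append(row)
        return tables

    order = json.loads('{"customer": {"name": "Ada"}, '
                       '"items": [{"sku": "A1", "qty": 2}, {"sku": "B7", "qty": 1}]}')
    print(flatten(order, "orders", {}))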
Online-activity-generated digital traces provide opportunities for novel services and unique insights, as demonstrated in, for example, research on mining software repositories. The inability to link these traces within and among systems, such as Twitter, GitHub, or Reddit, inhibits advances in this area. Furthermore, no single approach to integrating data from these disparate sources is likely to work. We aim to build Foreseer, an extensible framework for designing and evaluating identity matching techniques for public, large, and low-accuracy operational data. Foreseer consists of three functionally independent components designed to address the issues of discovery and preparation, storage and representation, and analysis and linking of traces from disparate online sources. The framework includes a domain-specific language for manipulating traces, generating insights, and building novel services. We have applied it in a pilot study of roughly 10TB of data from Twitter, Reddit, and StackExchange, including roughly 6M distinct entities, and, using basic matching techniques, found roughly 83,000 matches among these sources. We plan to add further entity extraction and identification algorithms and data from other sources, and to design tools for facilitating dynamic ingestion and tagging of incoming data on more robust infrastructure using Apache Spark or another distributed processing framework. We will then evaluate the utility and effectiveness of the framework in applications ranging from identifying malicious contributors in software repositories to evaluating the utility of privacy-preservation schemes.
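A toy illustration of the kind of basic matching technique the pilot study mentions, assuming identities are compared only on normalized user and display names (the records and threshold below are invented, not from the abstract):

    from difflib import SequenceMatcher

    # Hypothetical per-source identity records (screen name, display name).
    twitter = [{"id": "t1", "name": "jdoe_dev", "display": "Jane Doe"}]
    reddit = [{"id": "r9", "name": "jdoe-dev", "display": "jane doe"}]

    def normalize(s):
        return "".join(ch for ch in s.lower() if ch.isalnum())

    def similarity(a, b):
        return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

    def match(source_a, source_b, threshold=0.9):
        """Pair records whose normalized names are nearly identical."""
        pairs = []
        for a in source_a:
            for b in source_b:
                score = max(similarity(a["name"], b["name"]),
                            similarity(a["display"], b["display"]))
                if score >= threshold:
                    pairs.append((a["id"], b["id"], round(score, 2)))
        return pairs

    print(match(twitter, reddit))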