Biblio
In today's computerized and information-based society, individuals are constantly presented with vast amounts of text data, ranging from news articles, scientific publications, product reviews, to a wide range of textual information from social media. To extract value from these large, multi-domain pools of text, it is of great importance to gain an understanding of entities and their relationships. In this tutorial, we introduce data-driven methods to recognize typed entities of interest in massive, domain-specific text corpora. These methods can automatically identify token spans as entity mentions in documents and label their fine-grained types (e.g., people, product and food) in a scalable way. Since these methods do not rely on annotated data, predefined typing schema or hand-crafted features, they can be quickly adapted to a new domain, genre and language. We demonstrate on real datasets including various genres (e.g., news articles, discussion forum posts, and tweets), domains (general vs. bio-medical domains) and languages (e.g., English, Chinese, Arabic, and even low-resource languages like Hausa and Yoruba) how these typed entities aid in knowledge discovery and management.
In this era of information explosion, conflicts are often encountered when information is provided by multiple sources. Traditional truth discovery task aims to identify the truth the most trustworthy information, from conflicting sources in different scenarios. In this kind of tasks, truth is regarded as a fixed value or a set of fixed values. However, in a number of real-world cases, objective truth existence cannot be ensured and we can only identify single or multiple reliable facts from opinions. Different from traditional truth discovery task, we address this uncertainty and introduce the concept of trustworthy opinion of an entity, treat it as a random variable, and use its distribution to describe consistency or controversy, which is particularly difficult for data which can be numerically measured, i.e. quantitative information. In this study, we focus on the quantitative opinion, propose an uncertainty-aware approach called Kernel Density Estimation from Multiple Sources (KDEm) to estimate its probability distribution, and summarize trustworthy information based on this distribution. Experiments indicate that KDEm not only has outstanding performance on the classical numeric truth discovery task, but also shows good performance on multi-modality detection and anomaly detection in the uncertain-opinion setting.
A Cyber-Physical System (CPS) integrates physical devices (i.e., sensors) with cyber (i.e., informational) components to form a context sensitive system that responds intelligently to dynamic changes in real-world situations. Such a system has wide applications in the scenarios of traffic control, battlefield surveillance, environmental monitoring, and so on. A core element of CPS is the collection and assessment of information from noisy, dynamic, and uncertain physical environments integrated with many types of cyber-space resources. The potential of this integration is unbounded. To achieve this potential the raw data acquired from the physical world must be transformed into useable knowledge in real-time. Therefore, CPS brings a new dimension to knowledge discovery because of the emerging synergism of the physical and the cyber. The various properties of the physical world must be addressed in information management and knowledge discovery. This paper discusses the problems of mining sensor data in CPS: With a large number of wireless sensors deployed in a designated area, the task is real time detection of intruders that enter the area based on noisy sensor data. The framework of IntruMine is introduced to discover intruders from untrustworthy sensor data. IntruMine first analyzes the trustworthiness of sensor data, then detects the intruders' locations, and verifies the detections based on a graph model of the relationships between sensors and intruders.