RIDIR: Collaborative Research: Analytical tools for text based social data integration

Project Details

Performance Period

Sep 01, 2017 - Aug 31, 2019

Institution(s)

University of California-San Diego

Award Number


When something happens in the world -- such as a natural disaster, an election, a protest, or a policy change -- many types of media record different accounts of the same event. Newspapers, social media posts, and government documents all provide their own versions of events, stored in different formats. Because each source offers its own perspective, synthesizing these accounts vastly increases our ability to learn about both the events themselves and the dynamics of the media environment. Yet social scientists are limited in their capacity to access these perspectives because there are few tools for automatically combining such accounts into one integrated analysis. This project will provide a rich infrastructure for integrating texts from diverse sources that document the same social phenomenon. Such integration often reveals much about the underlying social dynamics.

This project will develop a tool to integrate documents that come in different formats but give accounts of the same or closely related events, through four main methods. First, the tool will allow users to align documents by topic while accounting for structural and stylistic differences between documents. Second, the tool will compile different types of documents by a shared event or entity. Third, the tool will allow users to provide schemas for combining semi-structured documents. Last, the tool will facilitate data fusion by identifying and resolving contradictions among multiple sources. The tool will be sufficiently flexible to fit multiple research purposes, allow for human feedback to assist with integration, and facilitate reproducibility by creating a common resource that can be the basis of future research by a whole community of scholars. The system itself will be applicable to almost any set of unstructured text data and will be relevant to questions across the social sciences.
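
To make the second and fourth methods more concrete, the following is a minimal sketch of compiling records by a shared event and then flagging contradictions among sources. It is not the project's implementation: the record layout, the field names (`source`, `event`, `attendance`), the function names, and all values are hypothetical and invented purely for illustration. Majority voting stands in for whatever resolution strategy the tool actually uses.

```python
from collections import defaultdict

# Hypothetical, made-up records: each is one source's account of an event.
records = [
    {"source": "newspaper", "event": "event-A", "attendance": 5000},
    {"source": "social_media", "event": "event-A", "attendance": 12000},
    {"source": "government", "event": "event-A", "attendance": 5000},
    {"source": "newspaper", "event": "event-B", "attendance": None},
]


def group_by_event(records):
    """Compile records from different sources under their shared event key."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record["event"]].append(record)
    return dict(grouped)


def find_conflicts(grouped, field):
    """Return, per event, the distinct values reported for `field` and
    which sources reported each one. Events with agreement are omitted."""
    conflicts = {}
    for event, event_records in grouped.items():
        sources_by_value = defaultdict(list)
        for record in event_records:
            value = record.get(field)
            if value is not None:  # skip sources that are silent on the field
                sources_by_value[value].append(record["source"])
        if len(sources_by_value) > 1:
            conflicts[event] = dict(sources_by_value)
    return conflicts


def resolve_by_majority(conflict):
    """Pick the value backed by the most sources; return None on a tie
    so that the case can be left for human review."""
    ranked = sorted(conflict.items(), key=lambda kv: len(kv[1]), reverse=True)
    if len(ranked) > 1 and len(ranked[0][1]) == len(ranked[1][1]):
        return None
    return ranked[0][0]


grouped = group_by_event(records)
conflicts = find_conflicts(grouped, "attendance")
for event, conflict in conflicts.items():
    print(event, conflict, "->", resolve_by_majority(conflict))
```

Returning `None` on a tie rather than choosing arbitrarily leaves room for the human feedback that the description above mentions.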