EAGER: SaTC: Tracking Semantic Change in Medical Information

Project Details

Performance Period

May 15, 2018 - Apr 30, 2020

Institution(s)

SUNY at Stony Brook

Award Number


Changes in the meaning of information as it passes through cyberspace can mislead those who access it. This project will develop a new dataset and algorithms to identify and categorize medical information according to whether it remains true to its original meaning or undergoes distortion. Instead of imposing an external true/false label on this information, the project examines the series of changes within the news coverage itself that gradually lead to a deviation from the original medical claims. Identifying important differences between original medical articles and news stories is a challenging, high-risk, high-reward venture. Broader impacts of this work include benefits to the research community through novel contributions to understanding temporal changes in natural language information, as well as social benefits in the form of improved informational tools such as question answering. For the medical domain in particular, understanding temporal distortions and deviations from actual medical findings can reduce occurrences of harmful health choices, for instance by embedding the research outcomes in news, social media, or search engines.

This project will develop a large dataset of medical scientific publications and record how their characteristics change over time across news coverage by designing and developing discrete time-series representations of entities and their attributes and relations. This task will provide the basis for designing and implementing machine learning tasks that exploit stylometric features in natural language in conjunction with temporal distributions to identify and categorize such changes. The research will go beyond current approaches, which are limited to true/false classification of individual articles, and will therefore be able to identify and analyze information change in narratives, including semantic changes and nuances and selective emphasis of related information. The research entails unsupervised and semi-supervised machine learning approaches with bootstrapping. It explores a binary labeling task to distinguish distorted pieces of information from those that are faithful to the scientific finding, and a multi-label categorization task to learn the type of semantic change occurring through time. The dataset will be disseminated via an archival location for natural language processing resources, such as the Linguistic Data Consortium (https://www.ldc.upenn.edu/), to facilitate long-term availability to other researchers, and BitBucket or GitHub will be used for the development, maintenance, sharing, and archiving of code.
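
As a rough illustration of what a discrete time-series representation of entities and attributes might look like, the following Python sketch groups dated mentions into fixed-width time buckets and reports, per bucket, the fraction of mentions that differ from the source claim. It is not the project's method: the `Mention` record, the function names, the weekly bucket width, the example values, and the use of plain string equality as a stand-in for semantic comparison are all assumptions made for this example.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Mention:
    """One observation of an entity attribute in a dated document (hypothetical schema)."""
    published: date
    entity: str
    attribute: str
    value: str


def build_series(mentions, start, bucket_days=7):
    """Group mention values into fixed-width time buckets per (entity, attribute)."""
    series = defaultdict(lambda: defaultdict(list))
    for m in mentions:
        bucket = (m.published - start).days // bucket_days
        series[(m.entity, m.attribute)][bucket].append(m.value)
    return series


def deviation_by_bucket(series, source_values):
    """Per bucket, the fraction of mentions whose value differs from the source claim.

    String inequality is an assumed placeholder for a real semantic comparison.
    """
    result = {}
    for key, buckets in series.items():
        reference = source_values[key]
        result[key] = {
            bucket: sum(value != reference for value in values) / len(values)
            for bucket, values in sorted(buckets.items())
        }
    return result


# Invented example data: a source claim and how news coverage restates it over time.
KEY = ("coffee", "effect_on_heart_disease")
SOURCE = {KEY: "associated_with_lower_risk"}
START = date(2018, 5, 15)
MENTIONS = [
    Mention(date(2018, 5, 16), *KEY, "associated_with_lower_risk"),
    Mention(date(2018, 5, 18), *KEY, "associated_with_lower_risk"),
    Mention(date(2018, 5, 25), *KEY, "associated_with_lower_risk"),
    Mention(date(2018, 5, 27), *KEY, "prevents"),
    Mention(date(2018, 6, 5), *KEY, "prevents"),
]

deviation = deviation_by_bucket(build_series(MENTIONS, START), SOURCE)
print(deviation[KEY])
```

In this toy data the deviation rises from week to week as "associated with lower risk" drifts toward "prevents". A per-bucket sequence of this kind is one possible input to the binary (distorted versus faithful) and multi-label (type of change) tasks described above.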