SaTC: CORE: Small: Toward Fully Automated Data-Driven Analysis of Web Censorship

Project Details

Performance Period

Oct 01, 2018 - Sep 30, 2021

Institution(s)

Carnegie-Mellon University

Award Number


Countries take very different positions on freedom of expression, and some may increasingly be using online filtering tools for web censorship. While the research community has devoted significant attention to the problem of anonymous communication, many other aspects of censorship and anonymization have not yet been studied in detail. We do not know, for instance, whether it is possible to automatically infer which techniques censors use and which topics they target. Grounded in empirical measurements, this project first aims to produce quantitative models of the techniques used in practice for web censorship on a global scale, and of the suspected targets of that censorship. Informed by these measurements, a secondary goal is to devise novel techniques for automated collection of censorship data with minimal human involvement. The team will integrate project findings and techniques into coursework and, by making measurement data publicly available, will provide a useful resource to researchers in fields beyond computer security (for example, sociology or political science), as well as to non-governmental organizations and policy-makers interested in freedom of speech issues.

At a technical level, the project will identify and construct reliable and representative corpora of web pages to test for censorship, and gather measurements of censorship occurrences on a global scale and over long intervals of time. A key objective will be to minimize human involvement during data collection to reduce risk to individuals, while still seeking to collect data from a wide mix of vantage points representing a variety of countries and organizations. This will involve developing techniques for improving geolocation of requests, combining domain name service-based measurement techniques with the use of virtual private networks and servers to increase coverage, avoiding detection by censors while taking these measurements, developing techniques to distinguish network or server errors from acts of censorship, addressing issues of cross-language access and censorship, and accounting for changes in censors' behavior over time. The team will analyze these data to determine whether one can automatically detect the occurrence of censorship with minimal assumptions about the form censorship may take, and whether one can automatically extract the probable topic of the censored material. Informed by these inferences about censored topics, the team will investigate the extent of keyword-based censorship in the wild, and how actual censoring practice aligns with published accounts of censor behavior, both from censors themselves and from other organizations that monitor censorship.
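
As a rough illustration of what distinguishing network or server errors from acts of censorship might look like, the Python sketch below compares one observation taken from a vantage point against a control observation of the same page taken from an uncensored network. This is not the project's method. The `Observation` record, the `classify` function, the label names, and the 30% body-length tolerance are all assumptions made for this example, and a real pipeline would need repeated measurements and far more careful handling of content delivery networks and dynamic pages.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class Observation:
    """One fetch attempt for a URL (hypothetical record layout)."""
    resolved_ips: FrozenSet[str]  # addresses returned by DNS; empty if resolution failed
    http_status: Optional[int]    # None if no HTTP response was received
    body_length: int              # size of the response body in bytes; 0 if none


def classify(control: Observation, test: Observation,
             length_tolerance: float = 0.3) -> str:
    """Label a vantage-point observation relative to a control observation.

    The labels flag anomalies worth a closer look; they are not proof of
    censorship. The tolerance value is an arbitrary illustrative choice.
    """
    # If the control fetch itself failed, the server or the wider network
    # may be at fault, so nothing can be attributed to a censor.
    if not control.resolved_ips or control.http_status is None:
        return "inconclusive"

    # DNS failed at the vantage point, or returned addresses that share
    # nothing with the control answer. (CDNs can cause this legitimately.)
    if not test.resolved_ips or test.resolved_ips.isdisjoint(control.resolved_ips):
        return "dns_anomaly"

    # DNS agreed, but no HTTP response arrived (reset, timeout, drop).
    if test.http_status is None:
        return "connection_anomaly"

    # A response arrived with a different status code (e.g. a block page).
    if test.http_status != control.http_status:
        return "http_anomaly"

    # Same status, but the body size differs by more than the tolerance.
    if control.body_length > 0:
        relative_diff = abs(test.body_length - control.body_length) / control.body_length
        if relative_diff > length_tolerance:
            return "content_anomaly"

    return "consistent"
```

In this sketch a failed control fetch always yields `inconclusive`, which reflects the idea that a page unreachable from everywhere is more likely a server error than an act of censorship. The remaining labels only rank how far down the DNS, connection, and HTTP layers the vantage point's view agreed with the control.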