Very few publicly available network traces contain application-level data, because sharing such data creates an enormous privacy risk. Application-level data is rich in personal and private information, such as names and social security numbers, that criminals can monetize. Yet such data is necessary for realistic testing of research products and for understanding trends in networking and network applications.
This project develops Critter-at-home, a publicly accessible, diverse and fresh archive of content-rich network data contributed by volunteer users. Users join the Critter overlay whenever they are online, offering their data to interested researchers. The privacy of data contributors is protected by several means. First, contributors may opt to host their own data on their machines, thus retaining full control over it. Second, we process contributed data to sanitize all personal and private information (PPI), and we encrypt it. Third, no human apart from the contributor ever accesses the data itself, raw or PPI-sanitized. Instead, researchers query the data via our Critter-at-home framework and receive aggregate statistics (counts, distributions, etc.) of the traffic features they query for. Fourth, all contact with a contributor is at her discretion and takes place through an anonymous network, where contributor identities are hidden.
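The aggregate-only query model described above can be illustrated with a minimal sketch. This is not the Critter-at-home implementation: the function name `aggregate_query`, the record layout, and the sample values are all hypothetical, chosen only to show how a query could return counts and distributions of a traffic feature without exposing any individual record.

```python
from collections import Counter

def aggregate_query(records, feature):
    """Return only aggregate statistics for one traffic feature.

    `records` is a hypothetical list of already-sanitized flow records
    (plain dicts); the records themselves are never returned to the caller.
    """
    values = [r[feature] for r in records if feature in r]
    counts = Counter(values)
    total = sum(counts.values())
    # Relative frequency of each observed value (empty if nothing matched).
    distribution = {v: c / total for v, c in counts.items()} if total else {}
    return {"count": total, "distribution": distribution}

# Hypothetical sanitized records held on a contributor's machine.
records = [
    {"protocol": "HTTP", "bytes": 512},
    {"protocol": "HTTP", "bytes": 2048},
    {"protocol": "DNS", "bytes": 64},
    {"protocol": "SMTP", "bytes": 1024},
]

result = aggregate_query(records, "protocol")
print(result)
```

In this sketch the researcher sees only the total number of matching records and the share of each protocol, which mirrors the idea that queries yield statistics rather than data.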
The archive this project creates will greatly advance security research by providing the data needed for validation and for data mining. It will also be valuable to the broader networking community, e.g., for realistic traffic generation, as ground truth in traffic classification, and for many other purposes.