Biblio
In local differential privacy (LDP), each user perturbs her data locally before sending the noisy data to a data collector. The latter then analyzes the data to obtain useful statistics. Unlike the setting of centralized differential privacy, in LDP the data collector never gains access to the exact values of sensitive data, which protects not only the privacy of data contributors but also the collector itself against the risk of potential data leakage. Existing LDP solutions in the literature are mostly limited to the case that each user possesses a tuple of numeric or categorical values, and the data collector computes basic statistics such as counts or mean values. To the best of our knowledge, no existing work tackles more complex data mining tasks such as heavy hitter discovery over set-valued data. In this paper, we present a systematic study of heavy hitter mining under LDP. We first review existing solutions, extend them to the heavy hitter estimation, and explain why their effectiveness is limited. We then propose LDPMiner, a two-phase mechanism for obtaining accurate heavy hitters with LDP. The main idea is to first gather a candidate set of heavy hitters using a portion of the privacy budget, and focus the remaining budget on refining the candidate set in a second phase, which is much more efficient budget-wise than obtaining the heavy hitters directly from the whole dataset. We provide both in-depth theoretical analysis and extensive experiments to compare LDPMiner against adaptations of previous solutions. The results show that LDPMiner significantly improves over existing methods. More importantly, LDPMiner successfully identifies the majority true heavy hitters in practical settings.
Malicious domains are key components to a variety of cyber attacks. Several recent techniques are proposed to identify malicious domains through analysis of DNS data. The general approach is to build classifiers based on DNS-related local domain features. One potential problem is that many local features, e.g., domain name patterns and temporal patterns, tend to be not robust. Attackers could easily alter these features to evade detection without affecting much their attack capabilities. In this paper, we take a complementary approach. Instead of focusing on local features, we propose to discover and analyze global associations among domains. The key challenges are (1) to build meaningful associations among domains; and (2) to use these associations to reason about the potential maliciousness of domains. For the first challenge, we take advantage of the modus operandi of attackers. To avoid detection, malicious domains exhibit dynamic behavior by, for example, frequently changing the malicious domain-IP resolutions and creating new domains. This makes it very likely for attackers to reuse resources. It is indeed commonly observed that over a period of time multiple malicious domains are hosted on the same IPs and multiple IPs host the same malicious domains, which creates intrinsic association among them. For the second challenge, we develop a graph-based inference technique over associated domains. Our approach is based on the intuition that a domain having strong associations with known malicious domains is likely to be malicious. Carefully established associations enable the discovery of a large set of new malicious domains using a very small set of previously known malicious ones. Our experiments over a public passive DNS database show that the proposed technique can achieve high true positive rates (over 95%) while maintaining low false positive rates (less than 0.5%). Further, even with a small set of known malicious domains (a couple of hundreds), our technique can discover a large set of potential malicious domains (in the scale of up to tens of thousands).