Visible to the public Streaming Algorithms for Robust Distinct Elements

TitleStreaming Algorithms for Robust Distinct Elements
Publication TypeConference Paper
Year of Publication2016
AuthorsChen, Di, Zhang, Qin
Conference NameProceedings of the 2016 International Conference on Management of Data
PublisherACM
Conference LocationNew York, NY, USA
ISBN Number978-1-4503-3531-7
KeywordsAlgorithm, composability, distinct elements, hash algorithms, Metrics, noisy data, pubcrawl, Resiliency, Scalability, streaming algorithms
Abstract

We study the problem of estimating distinct elements in the data stream model, which has a central role in traffic monitoring, query optimization, data mining and data integration. Different from all previous work, we study the problem in the noisy data setting, where two different looking items in the stream may reference the same entity (determined by a distance function and a threshold value), and the goal is to estimate the number of distinct entities in the stream. In this paper, we formalize the problem of robust distinct elements, and develop space and time-efficient streaming algorithms for datasets in the Euclidean space, using a novel technique we call bucket sampling. We also extend our algorithmic framework to other metric spaces by establishing a connection between bucket sampling and the theory of locality sensitive hashing. Moreover, we formally prove that our algorithms are still effective under small distinct elements ambiguity. Our experiments demonstrate the practicality of our algorithms.

URLhttp://doi.acm.org/10.1145/2882903.2882915
DOI10.1145/2882903.2882915
Citation Keychen_streaming_2016