Enhancing First Story Detection Using Word Embeddings
Title | Enhancing First Story Detection Using Word Embeddings |
Publication Type | Conference Paper |
Year of Publication | 2016 |
Authors | Moran, Sean, McCreadie, Richard, Macdonald, Craig, Ounis, Iadh |
Conference Name | Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval |
Publisher | ACM |
Conference Location | New York, NY, USA |
ISBN Number | 978-1-4503-4069-4 |
Keywords | document expansion, locality sensitive hashing, Metrics, nearest neighbor search, nearest neighbour search, paraphrase, pubcrawl, streaming data, Twitter |
Abstract | In this paper we show how word embeddings can be used to increase the effectiveness of a state-of-the-art Locality Sensitive Hashing (LSH) based first story detection (FSD) system over a standard tweet corpus. Vocabulary mismatch, in which related tweets use different words, is a serious hindrance to the effectiveness of a modern FSD system. In this case, a tweet could be flagged as a first story even if a related tweet, which uses different but synonymous words, was already returned as a first story. In this work, we propose a novel approach to mitigate this problem of lexical variation, based on tweet expansion. In particular, we propose to expand tweets with semantically related paraphrases identified via automatically mined word embeddings over a background tweet corpus. Through experimentation on a large data stream comprised of 50 million tweets, we show that FSD effectiveness can be improved by 9.5% over a state-of-the-art FSD system. |
URL | http://doi.acm.org/10.1145/2911451.2914719 |
DOI | 10.1145/2911451.2914719 |
Citation Key | moran_enhancing_2016 |
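
Illustrative sketch (not from the paper): the abstract describes expanding tweets with semantically related terms drawn from word embeddings, so that related tweets with different vocabulary look more alike. The Python below is a minimal sketch of that general idea only. The embedding values, the function names (`nearest_terms`, `expand_tweet`, `jaccard`), the parameters `k` and `min_sim`, and the use of Jaccard overlap as a stand-in for the similarity an FSD system would compute are all assumptions made for illustration; the paper's actual paraphrase mining and LSH pipeline are not reproduced here.

```python
import math

# Hypothetical, hand-made 3-dimensional "embeddings" for illustration only.
# Real embeddings would be learned from a background tweet corpus.
TOY_EMBEDDINGS = {
    "quake": (0.9, 0.1, 0.0),
    "earthquake": (0.88, 0.12, 0.02),
    "tremor": (0.8, 0.2, 0.1),
    "goal": (0.0, 0.9, 0.3),
    "football": (0.1, 0.85, 0.35),
    "japan": (0.3, 0.1, 0.9),
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def nearest_terms(word, embeddings, k=2, min_sim=0.95):
    """Return up to k vocabulary terms whose cosine similarity to `word`
    is at least min_sim. Unknown words yield an empty list."""
    if word not in embeddings:
        return []
    scored = [
        (cosine(embeddings[word], vec), other)
        for other, vec in embeddings.items()
        if other != word
    ]
    scored.sort(reverse=True)
    return [other for sim, other in scored[:k] if sim >= min_sim]


def expand_tweet(tokens, embeddings, k=2, min_sim=0.95):
    """Append related terms to a tokenised tweet, keeping the original
    tokens first and avoiding duplicates."""
    expanded = list(tokens)
    seen = set(tokens)
    for token in tokens:
        for related in nearest_terms(token, embeddings, k, min_sim):
            if related not in seen:
                expanded.append(related)
                seen.add(related)
    return expanded


def jaccard(a, b):
    """Jaccard overlap of two token lists (assumed stand-in similarity)."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


if __name__ == "__main__":
    tweet_a = ["quake", "hits", "japan"]
    tweet_b = ["earthquake", "in", "japan"]
    print("overlap before:", jaccard(tweet_a, tweet_b))
    print("overlap after: ", jaccard(
        expand_tweet(tweet_a, TOY_EMBEDDINGS),
        expand_tweet(tweet_b, TOY_EMBEDDINGS),
    ))
```

In this toy setting, two tweets about the same event share only one token before expansion and several after it, which is the kind of vocabulary mismatch the abstract identifies as a cause of spurious first-story flags. The thresholds here are arbitrary; in practice they would need tuning against a labelled stream.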