Small-Term Distribution for Disk-Based Search
Title | Small-Term Distribution for Disk-Based Search |
Publication Type | Conference Paper |
Year of Publication | 2017 |
Authors | Kane, Andrew; Tompa, Frank Wm. |
Conference Name | Proceedings of the 2017 ACM Symposium on Document Engineering |
Publisher | ACM |
Conference Location | New York, NY, USA |
ISBN Number | 978-1-4503-4689-4 |
Keywords | distributed search engines, index compression, index partitioning, information retrieval, Metrics, pubcrawl, query performance, resilience, Resiliency, run-time efficiency, Scalability, Web Caching |
Abstract | A disk-based search system distributes a large index across multiple disks on one or more machines, where documents are typically assigned to disks at random in order to achieve load balancing. However, random distribution degrades clustering, which is required for efficient index compression. Using the GOV2 dataset, we demonstrate the effect of various ordering techniques on index compression, and then quantify the effect of various document distribution approaches on compression and load balancing. We explore runtime performance by simulating a disk-based search system for a scaled-out 10xGOV2 index over ten disks using two standard approaches, document and term distribution, as well as a hybrid approach: small-term distribution. We find that small-term distribution has the best performance, especially in the presence of list caching, and argue that this rarely discussed distribution approach can improve disk-based search performance for many real-world installations. |
URL | https://dl.acm.org/citation.cfm?doid=3103010.3103022 |
DOI | 10.1145/3103010.3103022 |
Citation Key | kane_small-term_2017 |