Small-Term Distribution for Disk-Based Search

Title: Small-Term Distribution for Disk-Based Search
Publication Type: Conference Paper
Year of Publication: 2017
Authors: Kane, Andrew; Tompa, Frank Wm.
Conference Name: Proceedings of the 2017 ACM Symposium on Document Engineering
Publisher: ACM
Conference Location: New York, NY, USA
ISBN Number: 978-1-4503-4689-4
Keywords: distributed search engines, index compression, index partitioning, information retrieval, Metrics, pubcrawl, query performance, resilience, Resiliency, run-time efficiency, Scalability, Web Caching
Abstract

A disk-based search system distributes a large index across multiple disks on one or more machines, where documents are typically assigned to disks at random in order to achieve load balancing. However, random distribution degrades clustering, which is required for efficient index compression. Using the GOV2 dataset, we demonstrate the effect of various ordering techniques on index compression, and then quantify the effect of various document distribution approaches on compression and load balancing. We explore runtime performance by simulating a disk-based search system for a scaled-out 10xGOV2 index over ten disks using two standard approaches, document and term distribution, as well as a hybrid approach: small-term distribution. We find that small-term distribution has the best performance, especially in the presence of list caching, and argue that this rarely discussed distribution approach can improve disk-based search performance for many real-world installations.

URL: https://dl.acm.org/citation.cfm?doid=3103010.3103022
DOI: 10.1145/3103010.3103022
Citation Key: kane_small-term_2017