Biblio
A disk-based search system distributes a large index across multiple disks on one or more machines, where documents are typically assigned to disks at random in order to achieve load balancing. However, random distribution degrades clustering, which is required for efficient index compression. Using the GOV2 dataset, we demonstrate the effect of various ordering techniques on index compression, and then quantify the effect of various document distribution approaches on compression and load balancing. We explore runtime performance by simulating a disk-based search system for a scaled-out 10xGOV2 index over ten disks using two standard approaches, document and term distribution, as well as a hybrid approach: small-term distribution. We find that small-term distribution has the best performance, especially in the presence of list caching, and argue that this rarely discussed distribution approach can improve disk-based search performance for many real-world installations.
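The link between clustering and compression can be illustrated with a small sketch. It assumes posting lists are stored as variable-byte coded d-gaps, which is a common scheme but not necessarily the one used in the work above. The helper names `vbyte_len` and `compressed_size` and the document-ID ranges are hypothetical, chosen only for illustration.

```python
import random


def vbyte_len(n: int) -> int:
    """Bytes needed to store n >= 0 using 7 payload bits per byte."""
    length = 1
    while n >= 128:
        n >>= 7
        length += 1
    return length


def compressed_size(doc_ids) -> int:
    """Bytes used when a posting list is stored as v-byte coded d-gaps."""
    total = 0
    prev = -1  # so the first gap is doc_id + 1, and every gap is >= 1
    for doc_id in sorted(doc_ids):
        total += vbyte_len(doc_id - prev)
        prev = doc_id
    return total


# Assumption: a term whose 1,000 postings sit in one tight cluster of IDs,
# versus the same number of postings scattered over a million documents.
clustered = range(50_000, 51_000)
rng = random.Random(0)
scattered = rng.sample(range(1_000_000), 1_000)

print("clustered bytes:", compressed_size(clustered))
print("scattered bytes:", compressed_size(scattered))
```

With clustered IDs almost every gap is 1 and fits in a single byte, whereas scattered IDs produce gaps that mostly need two bytes or more. This is the effect that random assignment of documents to disks tends to work against.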
An Enterprise Content Management (ECM) system must withstand many queries to its access control subsystem in order to check permissions in support of browsing-oriented operations. This leads us to choose a subject-oriented representation for access control (i.e., maintaining a permissions list for each subject). Additionally, if identifiers (OIDs) are assigned to objects in a breadth-first traversal of the object hierarchy, we will encounter many contiguous OIDs when browsing under one object (e.g., a folder). Based on these observations, we present a space-efficient data structure specifically tailored for representing permissions lists in ECM systems. In addition to achieving space efficiency, our data structure makes the operations to check, grant, or revoke a permission very fast. Furthermore, our design supports fast union and intersection of two or more permissions lists (for determining the effective permissions inherited from several users' groups or the common permissions among sets of users). Finally, the data structure scales as the number of objects and subjects grows. We evaluate our design by comparing it against a compressed (WAH) bitmap-based representation and a hashing-based representation, using both synthetic and real-world data under both random and breadth-first OID numbering schemes.
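The abstract does not describe the data structure itself, so the following is only a minimal sketch of one way to exploit contiguous OIDs: a per-subject permissions list kept as sorted, disjoint runs of OIDs. The class name `RunList` and all of its methods are hypothetical and are not taken from the work above.

```python
from bisect import bisect_right


class RunList:
    """Permissions list for one subject, stored as sorted [start, end] OID runs."""

    def __init__(self, runs=()):
        self.runs = self._normalize(runs)

    @staticmethod
    def _normalize(runs):
        """Sort runs and merge any that overlap or are adjacent."""
        merged = []
        for start, end in sorted(runs):
            if merged and start <= merged[-1][1] + 1:
                merged[-1][1] = max(merged[-1][1], end)
            else:
                merged.append([start, end])
        return merged

    def check(self, oid):
        """Binary search for the last run starting at or before oid."""
        i = bisect_right(self.runs, [oid, float("inf")]) - 1
        return i >= 0 and self.runs[i][0] <= oid <= self.runs[i][1]

    def grant(self, oid):
        # Simple rather than fast: re-normalize after adding a one-OID run.
        self.runs = self._normalize(self.runs + [[oid, oid]])

    def revoke(self, oid):
        """Remove oid, splitting the run that contains it if necessary."""
        out = []
        for start, end in self.runs:
            if start <= oid <= end:
                if start < oid:
                    out.append([start, oid - 1])
                if oid < end:
                    out.append([oid + 1, end])
            else:
                out.append([start, end])
        self.runs = out

    def union(self, other):
        return RunList(self.runs + other.runs)

    def intersection(self, other):
        """Two-pointer sweep over both sorted run lists."""
        out, i, j = [], 0, 0
        a, b = self.runs, other.runs
        while i < len(a) and j < len(b):
            lo = max(a[i][0], b[j][0])
            hi = min(a[i][1], b[j][1])
            if lo <= hi:
                out.append([lo, hi])
            if a[i][1] < b[j][1]:
                i += 1
            else:
                j += 1
        return RunList(out)
```

Under breadth-first OID numbering, a folder with a hundred readable children collapses to a single run such as `RunList([(100, 199)])`, so the space cost depends on the number of runs rather than the number of objects. Under random numbering the same permissions would fragment into many short runs, which is presumably why the evaluation considers both schemes.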