Compression-Based Selective Sampling for Learning to Rank

Title: Compression-Based Selective Sampling for Learning to Rank
Publication Type: Conference Paper
Year of Publication: 2016
Authors: Silva, Rodrigo M., Gomes, Guilherme C.M., Alvim, Mário S., Gonçalves, Marcos A.
Conference Name: Proceedings of the 25th ACM International Conference on Information and Knowledge Management
Publisher: ACM
Conference Location: New York, NY, USA
ISBN Number: 978-1-4503-4073-1
Keywords: active learning, composability, compression, compressive sampling, learning to rank, privacy, pubcrawl, resiliency, selective sampling
Abstract

Learning to rank (L2R) algorithms use a labeled training set to generate a ranking model that can later be used to rank new query results. These training sets are very costly and laborious to produce, requiring human annotators to assess the relevance or order of the documents in relation to a query. Active learning (AL) algorithms are able to reduce the labeling effort by actively sampling an unlabeled set and choosing data instances that maximize the effectiveness of a learning function. But AL methods require constant supervision, as documents have to be labeled at each round of the process. In this paper, we propose that certain characteristics of unlabeled L2R datasets allow for an unsupervised, compression-based selection process to be used to create small and yet highly informative and effective initial sets that can later be labeled and used to bootstrap an L2R system. We implement our ideas through a novel unsupervised selective sampling method, which we call Cover and which has several advantages over AL methods tailored to L2R. First, it does not need an initial labeled seed set and can select documents from scratch. Second, selected documents do not need to be labeled as the iterations of the method progress, since the method is unsupervised (i.e., no learning model needs to be updated). Thus, an arbitrarily sized training set can be selected without human intervention, depending on the available budget. Third, the method is efficient and can be run on unlabeled collections containing millions of query-document instances. We run various experiments with two important L2R benchmarking collections to show that the proposed method allows for the creation of small, yet very effective training sets. It achieves full-training-like performance with less than 10% of the original training sets selected, outperforming the baselines in both effectiveness and scalability.
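
The Cover algorithm itself is specified in the paper; as a rough illustration of the general compression-based idea the abstract outlines (and not the authors' actual method), the sketch below uses zlib's compressed length as a cheap proxy for information content and greedily picks the candidate instance that is least compressible given the instances already selected. All function names and the toy data are hypothetical.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length of the zlib-compressed representation of `data`."""
    return len(zlib.compress(data))


def compression_gain(pool: bytes, candidate: bytes) -> int:
    """Extra compressed bytes needed to encode `candidate` given `pool`.

    A large gain suggests the candidate carries information
    not already present in the selected pool.
    """
    return compressed_size(pool + candidate) - compressed_size(pool)


def select_by_compression(instances: list[bytes], budget: int) -> list[int]:
    """Greedy, unsupervised selection: repeatedly add the instance with
    the largest conditional compressed size, i.e. the least redundant one,
    until the labeling budget is exhausted."""
    selected_ids: list[int] = []
    pool = b""
    remaining = set(range(len(instances)))
    while remaining and len(selected_ids) < budget:
        best = max(remaining, key=lambda i: compression_gain(pool, instances[i]))
        selected_ids.append(best)
        pool += instances[best]
        remaining.remove(best)
    return selected_ids


if __name__ == "__main__":
    # Toy "query-document" feature rows serialized as text (hypothetical data).
    docs = [b"0.1 0.2 0.3", b"0.1 0.2 0.3", b"0.9 0.8 0.7", b"0.5 0.5 0.5"]
    # The duplicate row scores a low gain once its twin is in the pool,
    # so the selection favors dissimilar rows.
    print(select_by_compression(docs, budget=2))
```

Note that, like the method described in the abstract, this sketch needs no labels and no model updates: the stopping point is purely the available budget.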

URL: http://doi.acm.org/10.1145/2983323.2983813
DOI: 10.1145/2983323.2983813
Citation Key: silva_compression-based_2016