Modeling and Learning Distributed Word Representation with Metadata for Question Retrieval

Title: Modeling and Learning Distributed Word Representation with Metadata for Question Retrieval
Publication Type: Journal Article
Year of Publication: 2017
Authors: Zhou, G., Huang, J. X.
Journal: IEEE Transactions on Knowledge and Data Engineering
Volume: 29
Pagination: 1226–1239
Date Published: June
ISSN: 1041-4347
Keywords: Aggregates, basic category powered model, category information metadata, category powered models, community question answering, compositionality, Computational modeling, Context modeling, cQA archives, distributed processing, distributed word representation learning, distributed word representation modeling, fisher kernel, fixed-length vectors, information retrieval, Internet, Kernel, Knowledge discovery, large-scale automatic evaluation experiments, large-scale Chinese cQA data sets, large-scale English cQA data sets, learning (artificial intelligence), lexical gap problem, MB-NET, meta data, metadata, Metadata Discovery Problem, natural language processing, performance improvements, pubcrawl, question answering (information retrieval), question retrieval, Resiliency, Scalability, Semantics, text analysis, text mining, Vectors, web, word processing
Abstract

Community question answering (cQA) has become an important issue due to the popularity of cQA archives on the Web. This paper focuses on addressing the lexical gap problem in question retrieval. Question retrieval in cQA archives aims to find the existing questions that are semantically equivalent or relevant to the queried questions. However, the lexical gap problem brings a new challenge for question retrieval in cQA. In this paper, we propose to model and learn distributed word representations with metadata of category information within cQA pages for question retrieval using two novel category powered models. One is a basic category powered model called MB-NET, and the other is an enhanced category powered model called ME-NET, which can better learn the distributed word representations and alleviate the lexical gap problem. To deal with the variable size of word representation vectors, we employ the Fisher kernel framework to transform them into fixed-length vectors. Experimental results on large-scale English and Chinese cQA data sets show that our proposed approaches can significantly outperform state-of-the-art retrieval models for question retrieval in cQA. Moreover, we further evaluate our approaches in large-scale automatic evaluation experiments. The evaluation results show that promising and significant performance improvements can be achieved.

DOI: 10.1109/TKDE.2017.2665625
Citation Key: zhou_modeling_2017