Biblio
Multimedia data in many disciplines are heterogeneous and are often represented in multiple views, which makes cross-modal search techniques both necessary and useful. The problem is challenging because of the heterogeneity of data with multiple modalities, the multiple views within each modality, and the diversity of data categories. In this paper, we propose a novel multi-view cross-modal hashing method named Multi-view Collective Tensor Decomposition (MCTD) to fuse these data effectively. MCTD exploits the complementary features extracted from multiple modalities and views while simultaneously discovering multiple separated subspaces by leveraging the data categories as supervision information. Our contributions are summarized as follows: 1) we exploit tensor modeling to obtain a better representation of the complementary features and to redefine a latent representation space; 2) we propose a block-diagonal loss that explicitly pursues a more discriminative latent tensor space by exploiting supervision information; 3) we propose a new feature projection method to characterize the data and to generate the latent representation for incoming queries. An iterative updating algorithm is proposed to optimize the objective function designed for MCTD. Experimental results show that MCTD achieves state-of-the-art precision compared with competing methods.
Recognizing Families In the Wild (RFIW) is a large-scale, multi-track automatic kinship recognition evaluation that supports both kinship verification and family classification at scales much larger than ever before. It was organized as a Data Challenge Workshop hosted in conjunction with ACM Multimedia 2017. This was made possible by the largest image collection that supports kin-based vision tasks. In this manuscript, we summarize the evaluation protocols, the progress made, and some technical background and performance ratings of the algorithms used, and we discuss promising directions for both researchers and engineers to take next in this line of work.
Heterogeneous face recognition aims to identify or verify a person's identity by matching facial images of different modalities. In practice, its performance is known to be strongly affected by modality inconsistency, appearance occlusions, illumination variations and expressions. In this paper, a new method named ensemble of sparse cross-modal metrics is proposed to tackle these challenging issues. In particular, a weak sparse cross-modal metric learning method is first developed to measure distances between samples of two modalities. It learns to adjust rank-one cross-modal metrics to satisfy two sets of triplet-based cross-modal distance constraints in a compact form. Meanwhile, a group-based feature selection is performed to enforce that features in the same position of the two modalities are selected simultaneously. By neglecting features that contribute to "noise" in the face regions (eyeglasses, expressions and so on), the performance of the learned weak metrics can be markedly improved. Finally, an ensemble framework is incorporated to combine the results of the differently learned sparse metrics into a strong one. Extensive experiments on various face datasets demonstrate the benefit of such feature selection, especially when heavy occlusions exist. The proposed ensemble metric learning is shown to be superior to several state-of-the-art methods in heterogeneous face recognition.
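The building blocks named in the abstract above (a rank-one cross-modal metric, a triplet-based distance constraint, position-wise group sparsity, and an ensemble of weak metrics) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract gives no formulas, so the distance form (u·x - v·y)^2, the hinge relaxation of the triplet constraint, the L2,1-style group penalty, the weighted-sum ensemble, and all function and parameter names are assumptions made for illustration. The sketch also assumes both modalities use feature vectors of the same length, so that "the same position" is well defined.

```python
import numpy as np

def rank_one_distance(x, y, u, v):
    """Squared distance under one (assumed) rank-one cross-modal metric.

    x is a sample from modality A and y a sample from modality B;
    u and v project each modality onto a shared 1-D axis.
    """
    return float((u @ x - v @ y) ** 2)

def triplet_hinge(x, y_pos, y_neg, u, v, margin=1.0):
    """Hinge penalty for one triplet: the same-identity sample y_pos should be
    closer to x than the different-identity sample y_neg by at least `margin`."""
    d_pos = rank_one_distance(x, y_pos, u, v)
    d_neg = rank_one_distance(x, y_neg, u, v)
    return max(0.0, margin + d_pos - d_neg)

def group_sparsity(u, v):
    """L2,1-style penalty over position-wise groups (u[j], v[j]).

    Driving a whole group to zero discards feature position j in both
    modalities at once, which mirrors the simultaneous selection idea.
    """
    return float(np.sqrt(u ** 2 + v ** 2).sum())

def ensemble_distance(x, y, metrics, weights):
    """Combine several weak rank-one metrics as a weighted sum (an assumed rule)."""
    return sum(w * rank_one_distance(x, y, u, v)
               for w, (u, v) in zip(weights, metrics))
```

In this form, a feature position that is zeroed in both `u` and `v` (for example, one covering an occluded eye region) has no influence on the distance, which is the effect the abstract attributes to the group-based feature selection.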