Bibliography
Code clone detection is an important problem for software maintenance and evolution. Many approaches consider either structure or identifiers, but none of the existing detection techniques model both sources of information. These techniques also depend on generic, handcrafted features to represent code fragments. We introduce learning-based detection techniques where everything for representing terms and fragments in source code is mined from the repository. Our code analysis supports a framework, which relies on deep learning, for automatically linking patterns mined at the lexical level with patterns mined at the syntactic level. We evaluated our novel learning-based approach for code clone detection with respect to feasibility from the point of view of software maintainers. We sampled and manually evaluated 398 file- and 480 method-level pairs across eight real-world Java systems; 93% of the file- and method-level samples were evaluated to be true positives. Among the true positives, we found pairs mapping to all four clone types. We compared our approach to a traditional structure-oriented technique and found that our learning-based approach detected clones that were either undetected or suboptimally reported by the prominent tool Deckard. Our results affirm that our learning-based approach is suitable for clone detection and a tenable technique for researchers.
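As a rough illustration of the detection step, the sketch below embeds code terms as vectors and compares fragment embeddings by cosine similarity. It is a minimal sketch, not the paper's full pipeline: the random term vectors and mean pooling stand in for embeddings mined from the repository and the recursive, syntax-guided composition the paper describes.

```python
import numpy as np

# Hypothetical pretrained term embeddings; in the paper these are mined
# from the repository, here random vectors stand in for illustration.
rng = np.random.default_rng(0)
vocab = {"get", "value", "set", "index", "list"}
term_vec = {t: rng.normal(size=32) for t in vocab}

def embed_fragment(tokens):
    """Compose term vectors into a fragment vector (mean pooling stands
    in for the paper's recursive composition over the syntax tree)."""
    vecs = [term_vec[t] for t in tokens if t in term_vec]
    return np.mean(vecs, axis=0) if vecs else np.zeros(32)

def is_clone(frag_a, frag_b, threshold=0.95):
    """Flag two fragments as clone candidates if their embeddings are close."""
    a, b = embed_fragment(frag_a), embed_fragment(frag_b)
    cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
    return cos >= threshold

print(is_clone(["get", "value", "index"], ["get", "value", "list"]))
```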
Automated human facial image de-identification is a much-needed technology for privacy-preserving social media and intelligent surveillance applications. Rather than the usual face blurring techniques, in this work we propose to achieve facial anonymity by slightly modifying existing facial images into "averaged faces" so that the corresponding identities are difficult to uncover. This approach preserves the aesthetics of the facial images while achieving the goal of privacy protection. In particular, we explore deep learning-based facial identity-preserving (FIP) features. Unlike conventional face descriptors, the FIP features can significantly reduce intra-identity variances while maintaining inter-identity distinctions. By suppressing and tinkering with FIP features, we achieve the goal of k-anonymity facial image de-identification while preserving desired utilities. Using a face database, we successfully demonstrate that the resulting "averaged faces" still preserve the aesthetics of the original images while defying facial image identity recognition.
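The k-anonymity step can be pictured as averaging in feature space. The following is a minimal sketch under assumed conditions: `features` holds hypothetical FIP vectors for a set of faces, and a reconstruction network (assumed, not shown) would decode the averaged feature back into an image.

```python
import numpy as np

def k_anonymize(features, idx, k=5):
    """Replace one face's FIP feature with the average over its k nearest
    identities, so the result is indistinguishable among at least k people.
    `features` is an (n_faces, d) array of hypothetical FIP vectors."""
    dists = np.linalg.norm(features - features[idx], axis=1)
    nearest = np.argsort(dists)[:k]          # includes idx itself
    return features[nearest].mean(axis=0)    # the "averaged face" feature

# A decoder network (assumed) would then render the anonymized image:
#   img = decoder(k_anonymize(F, i))
```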
This paper describes our proposed method targeting the MSR Image Recognition Challenge MS-Celeb-1M. The challenge is to recognize one million celebrities from their face images captured in the real world. The challenge provides a large-scale dataset crawled from the Web, which contains a large number of celebrities with many images for each subject. Given a new testing image, the challenge requires an identity for the image and the corresponding confidence score. To complete the challenge, we propose a two-stage approach consisting of data cleaning and multi-view deep representation learning. The data cleaning effectively reduces the noise level of the training data and thus improves the performance of deep learning-based face recognition models. The multi-view representation learning enables the learned face representations to be more specific and discriminative. Thus, the difficulty of recognizing faces among a huge number of subjects is substantially reduced. Our proposed method achieves a coverage of 46.1% at 95% precision on the random set and a coverage of 33.0% at 95% precision on the hard set of this challenge.
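One common way to realize the cleaning stage is centroid-based outlier filtering; the sketch below illustrates that idea, not the paper's exact procedure, and the feature space and keep ratio are assumptions.

```python
import numpy as np

def clean_class(feats, keep_ratio=0.8):
    """Keep the fraction of a celebrity's images closest to the class
    centroid in (assumed) deep-feature space; the rest are treated as
    crawl noise. A simplified stand-in for the paper's cleaning stage."""
    centroid = feats.mean(axis=0)
    dists = np.linalg.norm(feats - centroid, axis=1)
    keep = np.argsort(dists)[: int(len(feats) * keep_ratio)]
    return keep  # indices of images retained for training
```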
This paper focuses on the practically important problem of matching a real-world product photo to exactly the same item(s) in online shopping sites. The task is extremely challenging because the user photos (i.e., the queries in this scenario) are often captured in uncontrolled environments, while the product images in online shops are mostly taken by professionals with clean backgrounds and perfect lighting conditions. To tackle the problem, we study deep network architectures and training schemes, with the goal of learning a robust deep feature representation that is able to bridge the domain gap between the user photos and the online product images. Our contributions are two-fold. First, we propose an alternative to the popular contrastive loss used in Siamese deep networks, namely robust contrastive loss, where we "relax" the penalty on positive pairs to alleviate over-fitting. Second, a multi-task fine-tuning approach is introduced to learn a better feature representation, which not only incorporates knowledge from the provided training photo pairs, but also explores additional information from the large ImageNet dataset to regularize the fine-tuning procedure. Experiments on two challenging real-world datasets demonstrate that both the robust contrastive loss and the multi-task fine-tuning approach are effective, leading to very promising results with a time cost suitable for real-time retrieval.
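To make the "relaxed positive penalty" idea concrete, here is a minimal PyTorch sketch: positive pairs are only penalized once their distance exceeds a positive margin, while negatives use the standard hinge term. The margin values are illustrative placeholders, and the paper's exact formulation may differ.

```python
import torch

def robust_contrastive_loss(x1, x2, y, pos_margin=0.3, neg_margin=1.0):
    """Contrastive loss with a relaxed positive term. y is 1 for matching
    pairs and 0 for non-matching pairs; positive pairs closer than
    pos_margin incur no loss, which is intended to reduce over-fitting."""
    d = torch.nn.functional.pairwise_distance(x1, x2)
    pos = torch.clamp(d - pos_margin, min=0).pow(2)   # relaxed positive term
    neg = torch.clamp(neg_margin - d, min=0).pow(2)   # standard negative term
    return (y * pos + (1 - y) * neg).mean()
```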
Maintaining a clean and hygienic civic environment is an indispensable yet formidable task, especially in developing countries. With the aim of engaging citizens to track and report on their neighborhoods, this paper presents a novel smartphone app, called SpotGarbage, which detects and coarsely segments garbage regions in a user-clicked geo-tagged image. The app utilizes the proposed deep architecture of fully convolutional networks for detecting garbage in images. The model has been trained on a newly introduced Garbage In Images (GINI) dataset, achieving a mean accuracy of 87.69%. The paper also proposes optimizations in the network architecture, resulting in a reduction of 87.9% in memory usage and 96.8% in prediction time with no loss in accuracy, facilitating its usage in resource-constrained smartphones.
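A fully convolutional detector of this kind can be pictured as convolutional feature extraction followed by a 1x1 classifier and upsampling to a coarse per-pixel garbage map. The sketch below is a toy stand-in, not the app's actual architecture or its memory/latency optimizations.

```python
import torch.nn as nn

class TinyFCN(nn.Module):
    """Toy fully convolutional net: two stride-2 conv layers downsample the
    image 4x, a 1x1 conv scores each cell for garbage, and bilinear
    upsampling restores the input resolution as a coarse segmentation map."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.classifier = nn.Conv2d(32, 1, 1)  # per-cell garbage score
        self.up = nn.Upsample(scale_factor=4, mode="bilinear",
                              align_corners=False)

    def forward(self, x):
        return self.up(self.classifier(self.features(x)))
```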
Nowadays, a typical household owns multiple digital devices that can be connected to the Internet. Advertising companies always want to seamlessly reach the consumers behind the devices rather than the devices themselves. However, the identity of consumers becomes fragmented as they switch from one device to another. A naive attempt is to use deterministic features such as user name, telephone number, and email address. However, consumers might refrain from giving away their personal information for privacy and security reasons. The challenge in the ICDM 2015 contest is to develop an accurate probabilistic model for predicting cross-device consumer identity without using the deterministic user information. In this paper we present an accurate and scalable cross-device solution using an ensemble of Gradient Boosting Decision Trees (GBDT) and Random Forest. Our final solution ranks 9th on both the public and private leaderboards with an F0.5 score of 0.855.
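The ensemble can be sketched as averaging the two models' predicted probabilities for each candidate device pair and scoring with F0.5. The snippet below uses synthetic data and placeholder hyper-parameters, not the winning configuration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import fbeta_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for candidate device-pair features and match labels.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

gbdt = GradientBoostingClassifier().fit(X_train, y_train)
rf = RandomForestClassifier(n_estimators=200).fit(X_train, y_train)

# Average the two probability estimates, threshold, and score with F0.5,
# the contest metric (precision weighted over recall).
proba = 0.5 * gbdt.predict_proba(X_val)[:, 1] + 0.5 * rf.predict_proba(X_val)[:, 1]
pred = (proba >= 0.5).astype(int)
print(fbeta_score(y_val, pred, beta=0.5))
```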