Biblio

Found 31 results

Filters: Keyword is Metadata Discovery Problem

2022-02-24

Castellano, Giovanna, Vessio, Gennaro. 2021. Deep Convolutional Embedding for Digitized Painting Clustering. 2020 25th International Conference on Pattern Recognition (ICPR). :2708–2715.

Clustering artworks is difficult for several reasons. On the one hand, recognizing meaningful patterns in accordance with domain knowledge and visual perception is extremely difficult. On the other hand, applying traditional clustering and feature reduction techniques to the highly dimensional pixel space can be ineffective. To address these issues, we propose to use a deep convolutional embedding model for digitized painting clustering, in which the task of mapping the raw input data to an abstract, latent space is jointly optimized with the task of finding a set of cluster centroids in this latent feature space. Quantitative and qualitative experimental results show the effectiveness of the proposed method. The model is also capable of outperforming other state-of-the-art deep clustering approaches to the same problem. The proposed method can be useful for several art-related tasks, in particular visual link retrieval and historical knowledge discovery in painting datasets.

Singh, Parwinder, Acharya, Kartikeya Satish, Beliatis, Michail J., Presser, Mirko. 2021. Semantic Search System For Real Time Occupancy. 2021 IEEE International Conference on Internet of Things and Intelligence Systems (IoTaIS). :49–55.

This paper presents an IoT-enabled real-time occupancy semantic search system leveraging the ETSI-defined context information and interface meta-model standard "Next Generation Service Interface for Linked Data" (NGSI-LD). It facilitates interoperability, integration and federation of information exchange related to spatial infrastructure among geo-distributed deployed IoT entities, different stakeholders, and process domains. This system, in the presented use case, solves the problem of ad hoc booking of meetings in real time through semantic discovery of spatial data and metadata related to room occupancy and thus enables optimum utilization of spatial infrastructure in university campuses. Therefore, the proposed system has the capability to save on effort, cost and productivity in institutional spatial management contexts in the long run, as well as to provide a new enriched user experience in smart public buildings. Additionally, the system empowers different stakeholders to plan, forecast and fulfill their spatial infrastructure requirements through semantic data search analysis and real-time data-driven planning. The initial performance results of the system have shown quick-response semantic discovery of data and metadata (mostly <2 seconds). The proposed system would be a stepping stone towards smart management of spatial infrastructure which offers scalability, federation, a vendor-agnostic ecosystem, seamless interoperability and integration, and security by design. The proposed system provides the fundamental work for its extension and potential in relevant spatial domains of the future.

2021-08-05

Ren, Xiaoli, Li, Xiaoyong, Deng, Kefeng, Ren, Kaijun, Zhou, Aolong, Song, Junqiang. 2020. Bringing Semantics to Support Ocean FAIR Data Services with Ontologies. 2020 IEEE International Conference on Services Computing (SCC). :30–37.

With the increasing attention to the ocean and the development of data-intensive sciences, a large amount of ocean data has been acquired by various observing platforms and sensors, which poses new challenges to data management and utilization. Typically, nowadays we aim to move ocean data management toward the FAIR principles of being findable, accessible, interoperable, and reusable. However, the data produced and managed by different organizations, with wide diversity, various structures and increasing volume, are hard to make FAIR, and one of the most critical reasons is the lack of unified data representation and publication methods. In this paper, we propose novel techniques to try to solve the problem by introducing semantics with ontologies. Specifically, we first propose a unified semantic model named OEDO to represent ocean data by defining the concepts of the ocean observing field, specifying the relations between the concepts, and describing the properties with ocean metadata. Then, we further optimize the state-of-the-art quick service query list (QSQL) data structure by extending the domain concepts with WordNet to improve data discovery. Moreover, based on the OEDO model and the optimized QSQL, we propose an ocean data service publishing method called DOLP to improve data discovery and data access. Finally, we conduct extensive experiments to demonstrate the effectiveness and efficiency of our proposals.

Ramasubramanian, Muthukumaran, Muhammad, Hassan, Gurung, Iksha, Maskey, Manil, Ramachandran, Rahul. 2020. ES2Vec: Earth Science Metadata Keyword Assignment using Domain-Specific Word Embeddings. 2020 SoutheastCon. :1–6.

Earth science metadata keyword assignment is a challenging problem. Dataset curators select appropriate keywords from the Global Change Master Directory (GCMD) set of keywords. The keywords are an integral part of search and discovery of these datasets. Hence, the selection of keywords is crucial in increasing the discoverability of datasets. Utilizing machine learning techniques, we provide users with automated keyword suggestions as an improved approach to complement manual selection. We trained a machine learning model that leverages the semantic embedding ability of Word2Vec models to process abstracts and suggest relevant keywords. A user interface tool we built to assist data curators in the assignment of such keywords is also described.

Bogatu, Alex, Fernandes, Alvaro A. A., Paton, Norman W., Konstantinou, Nikolaos. 2020. Dataset Discovery in Data Lakes. 2020 IEEE 36th International Conference on Data Engineering (ICDE). :709–720.

Data analytics stands to benefit from the increasing availability of datasets that are held without their conceptual relationships being explicitly known. When collected, these datasets form a data lake from which, by processes like data wrangling, specific target datasets can be constructed that enable value-adding analytics. Given the potential vastness of such data lakes, the issue arises of how to pull out of the lake those datasets that might contribute to wrangling out a given target. We refer to this as the problem of dataset discovery in data lakes and this paper contributes an effective and efficient solution to it. Our approach uses features of the values in a dataset to construct hash-based indexes that map those features into a uniform distance space. This makes it possible to define similarity distances between features and to take those distances as measurements of relatedness w.r.t. a target table. Given the latter (and exemplar tuples), our approach returns the most related tables in the lake. We provide a detailed description of the approach and report on empirical results for two forms of relatedness (unionability and joinability) comparing them with prior work, where pertinent, and showing significant improvements in all of precision, recall, target coverage, indexing and discovery times.
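
As a rough illustration of ranking lake columns by relatedness to a target column, the sketch below scores candidates with exact Jaccard similarity over value sets. This is a deliberate simplification and not the paper's method: the abstract describes hash-based indexes over several features, whereas this example compares raw value sets directly. The function names and the sample columns are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections of values (0.0 if both are empty)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def rank_columns(target, lake):
    """Return (column name, score) pairs, most similar to the target first."""
    scores = [(name, jaccard(target, values)) for name, values in lake.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Hypothetical target column and data-lake columns.
target = ["DE", "FR", "IT", "ES"]
lake = {
    "sales.country": ["DE", "FR", "IT", "PL"],
    "hr.country_code": ["DE", "FR"],
    "products.sku": ["A1", "B2"],
}

ranking = rank_columns(target, lake)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

Exact set comparison costs time proportional to the column sizes for every candidate, which is the cost that hash-based sketches are typically used to avoid in large lakes.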

Alecakir, Huseyin, Kabukcu, Muhammet, Can, Burcu, Sen, Sevil. 2020. Discovering Inconsistencies between Requested Permissions and Application Metadata by using Deep Learning. 2020 International Conference on Information Security and Cryptology (ISCTURKEY). :56–56.

Android gives us an opportunity to extract meaningful information from metadata. From the security point of view, missing important information in the metadata of an application could be a sign of a suspicious application, which could be directed for extensive analysis. In particular, the usage of dangerous permissions is expected to be explained in app descriptions. The permission-to-description fidelity problem in the literature aims to discover such inconsistencies between the usage of permissions and descriptions. This study proposes a new method based on natural language processing and recurrent neural networks. The effect of user reviews on finding such inconsistencies is also investigated in addition to application descriptions. The experimental results show that high precision is obtained by the proposed solution, and the proposed method could be used for triage of Android applications.

Wang, Xiaowen, Huang, Yan. 2020. Research on Semantic Based Metadata Method of SWIM Information Service. 2020 IEEE 2nd International Conference on Civil Aviation Safety and Information Technology (ICCASIT). :1121–1125.

Semantic metadata is an important means to promote the integration of information and services and improve the level of search and discovery automation. To address the problems that service metadata descriptions are difficult for machines to handle and that information metadata descriptions are lacking in current SWIM information services, this paper analyzes the methods of semantic empowerment of metadata and the mainstream semantic metadata standards related to air traffic control systems, and constructs a SWIM information and service semantic metadata model based on semantic expansion. The method of semantic metadata model mapping is given from the two aspects of service and data, which can be used to improve the level of information sharing and intelligent processing.

2020-12-11

Sabek, I., Chandramouli, B., Minhas, U. F. 2019. CRA: Enabling Data-Intensive Applications in Containerized Environments. 2019 IEEE 35th International Conference on Data Engineering (ICDE). :1762–1765.

Today, a modern data center hosts a wide variety of applications comprising batch, interactive, machine learning, and streaming applications. In this paper, we factor out the commonalities in a large majority of these applications into a generic dataflow layer called Common Runtime for Applications (CRA). In parallel, another trend, containerization technologies (e.g., Docker), has taken a serious hold on cloud-scale data centers, with direct implications for building the next generation of data center applications. Container orchestrators (e.g., Kubernetes) have made deployment a lot easier, and they solve many infrastructure-level problems, e.g., service discovery, auto-restart, and replication. For best-in-class performance, there is a need to marry the next-generation applications with containerization technologies. To that end, CRA leverages and builds upon the containerization and resource orchestration capabilities of Kubernetes/Docker, and makes it easy to build a wide range of cloud-edge applications on top. To the best of our knowledge, we are the first to present a cloud native runtime for building data center applications. We show the efficiency of CRA through various micro-benchmarking experiments.

Liu, F., Li, J., Wang, Y., Li, L. 2019. Kubestorage: A Cloud Native Storage Engine for Massive Small Files. 2019 6th International Conference on Behavioral, Economic and Socio-Cultural Computing (BESC). :1–4.

Cloud Native, the emerging computing infrastructure, has become a new trend in cloud computing, especially after the development of containerization technologies such as Docker and LXD and of orchestration systems for them such as Kubernetes and Swarm. With the growing popularity of Cloud Native, the following problems have been raised: (i) Most Cloud Native applications were designed to make full use of the cloud platform, but their file storage has not been completely optimized for it. (ii) The traditional file system is designed as a utility for storing and retrieving files, usually built into the kernel of the operating system. But when it is deployed at large scale, such as on a network storage server that is shared by thousands of computing instances and stores millions of files, it becomes slow and even unstable. (iii) Most storage solutions use metadata for faster tracking of files, but the metadata itself takes up a lot of space, and its capacity is usually limited. If the file system stores metadata directly on the hard disk without caching, the tracking of massive numbers of small files will be a lot slower. (iv) The traditional object storage solution cannot provide enough features, such as caching and auto replication, to make itself practical on the cloud. This paper proposes a new storage engine based on the well-known Haystack storage engine, optimized in terms of service discovery and automated fault tolerance to make it more suitable for Cloud Native infrastructure, deployment and applications. We use the object storage model to meet large and high-frequency file storage needs, offering a simple and unified set of APIs for applications to access. We also take advantage of Kubernetes' sophisticated and automated toolchains to make cloud storage easier to deploy, more flexible to scale, and more stable to run.

Zhou, Y., Zeng, Z. 2019. Info-Retrieval with Relevance Feedback using Hybrid Learning Scheme for RS Image. 2019 International Conference on Cyber-Enabled Distributed Computing and Knowledge Discovery (CyberC). :135–138.

Relevance feedback can be considered a learning problem. It has been used extensively to improve the performance of multimedia information retrieval. In this paper, after discussing relevance feedback for content-based image retrieval (CBIR), we propose a hybrid learning scheme for multi-target retrieval (MTR) with relevance feedback. Suppose an object-level symbolic image database (SID) combining image metadata and a feature model has been constructed. During the interactive query for a remote sensing image, we calculate the similarity metric to obtain the relevant image sets from the image library. To further improve the precision of image retrieval, a hybrid learning scheme parameter also needs to be chosen. As a result, our hybrid learning scheme combines an expectation maximization algorithm (EMA), used for retrieving the most relevant images from the SID, with a support vector machine (SVM) with relevance feedback, used for learning from the feedback information. Experimental results show that our hybrid learning scheme with relevance feedback on MTR can improve performance and accuracy compared with the basic algorithms.

Kumar, S., Vasthimal, D. K. 2019. Raw Cardinality Information Discovery for Big Datasets. 2019 IEEE 5th Intl Conference on Big Data Security on Cloud (BigDataSecurity), IEEE Intl Conference on High Performance and Smart Computing, (HPSC) and IEEE Intl Conference on Intelligent Data and Security (IDS). :200–205.

Real-time discovery of all the different types of unique attributes within unstructured data is a challenging problem when dealing with multiple petabytes of unstructured data every day. Popular discovery solutions, such as creating offline jobs to uniquely identify attributes or running aggregation queries on raw data sets, limit real-time discovery use cases and often result in poor resource utilization. Discovery information must be treated as a problem parallel to that of efficiently storing raw data sets on back-end big data systems. Solving the discovery problem by creating a parallel discovery data store infrastructure has multiple benefits, as it allows the actual search queries to be channeled against the raw data set in a much more focused manner instead of being spread across the entire data set. Such focused search queries and data separation are far more performant and require a smaller compute and memory footprint.

Correia, A., Fonseca, B., Paredes, H., Schneider, D., Jameel, S. 2019. Development of a Crowd-Powered System Architecture for Knowledge Discovery in Scientific Domains. 2019 IEEE International Conference on Systems, Man and Cybernetics (SMC). :1372–1377.

A substantial amount of work is often overlooked due to the exponential rate of growth in global scientific output across all disciplines. Current approaches for addressing this issue are usually limited in scope and often restrict the possibility of obtaining multidisciplinary views in practice. To tackle this problem, researchers can now leverage an ecosystem of citizens, volunteers and crowd workers to perform complex tasks that are difficult for either humans or machines to solve alone. Motivated by the idea that human crowds and computer algorithms have complementary strengths, we present an approach where the machine will learn from crowd behavior in an iterative way. This approach is embodied in the architecture of SciCrowd, a crowd-powered human-machine hybrid system designed to improve the analysis and processing of large amounts of publication records. To validate the proposal's feasibility, a prototype was developed and an initial evaluation was conducted to measure its robustness and reliability. We conclude this paper with a set of implications for design.

Xie, J., Zhang, M., Ma, Y. 2019. Using Format Migration and Preservation Metadata to Support Digital Preservation of Scientific Data. 2019 IEEE 10th International Conference on Software Engineering and Service Science (ICSESS). :1–6.

With the development of e-Science and data-intensive scientific discovery, scientific data must remain available for the long term, so that valuable scientific data can be discovered and re-used for downstream investigations, either alone or in combination with newly generated data. As such, the preservation of scientific data not only makes experiments reproducible and verifiable, but also lets other scientists raise new questions to promote research and innovation. In this paper, we focus on the two main problems of digital preservation: format migration and preservation metadata. Format migration includes both format verification and object transformation. The system architecture of format migration and preservation metadata is presented, mapping rules of object transformation are analyzed, data fixity, integrity and authenticity, digital signatures and so on are discussed, and an example is shown in detail.

Zhang, W., Byna, S., Niu, C., Chen, Y. 2019. Exploring Metadata Search Essentials for Scientific Data Management. 2019 IEEE 26th International Conference on High Performance Computing, Data, and Analytics (HiPC). :83–92.

Scientific experiments and observations store massive amounts of data in various scientific file formats. Metadata, which describes the characteristics of the data, is commonly used to sift through massive datasets in order to locate data of interest to scientists. Several indexing data structures (such as hash tables, tries, self-balancing search trees, sparse arrays, etc.) have been developed as part of efforts to provide an efficient method for locating target data. However, the efficient choice of an indexing data structure remains unclear in the context of scientific data management, due to the lack of investigation into metadata, metadata queries, and corresponding data structures. In this study, we perform a systematic study of the metadata search essentials in the context of scientific data management. We study a real-world astronomy observation dataset and explore the characteristics of the metadata in the dataset. We also study possible metadata queries based on the discovery of the metadata characteristics and evaluate different data structures for various types of metadata attributes. Our evaluation on a real-world dataset suggests that a trie is a suitable data structure when prefix/suffix queries are required; otherwise a hash table should be used. We conclude our study with a summary of our findings. These findings provide a guideline and offer insights for developing metadata indexing methodologies for scientific applications.
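
The trade-off between a trie and a hash table for metadata lookups can be sketched as follows. This is a minimal illustration, not the study's implementation: the class names, the record identifiers, and the sample attribute values are all hypothetical. The trie answers prefix queries by walking to the prefix node and collecting its subtree, while a plain dictionary (a hash table) only answers exact-match queries.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # next character -> child node
        self.record_ids = []  # ids of records whose value ends at this node


class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, value, record_id):
        node = self.root
        for ch in value:
            node = node.children.setdefault(ch, TrieNode())
        node.record_ids.append(record_id)

    def prefix_search(self, prefix):
        """Return the sorted ids of all records whose value starts with prefix."""
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []
        found, stack = [], [node]
        while stack:  # iterative walk over the subtree below the prefix
            current = stack.pop()
            found.extend(current.record_ids)
            stack.extend(current.children.values())
        return sorted(found)


# Hypothetical metadata attribute values keyed by record id.
records = {1: "galaxy_ngc1300", 2: "galaxy_m31", 3: "quasar_3c273"}

trie = Trie()
exact_index = {}  # hash table: full value -> record ids
for record_id, name in records.items():
    trie.insert(name, record_id)
    exact_index.setdefault(name, []).append(record_id)

print(trie.prefix_search("galaxy_"))         # prefix query served by the trie
print(exact_index.get("quasar_3c273", []))   # exact-match query served by the hash table
```

A suffix query can be served the same way by inserting each value reversed into a second trie and searching for the reversed suffix.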

2019-09-04

Paiker, N., Ding, X., Curtmola, R., Borcea, C. 2018. Context-Aware File Discovery System for Distributed Mobile-Cloud Apps. 2018 IEEE International Conference on Cloud Computing Technology and Science (CloudCom). :198–203.

Recent research has proposed middleware to enable efficient distributed apps over mobile-cloud platforms. This paper presents a Context-Aware File Discovery Service (CAFDS) that allows distributed mobile-cloud applications to find and access files of interest shared by collaborating users. CAFDS enables programmers to search for files defined by context and content features, such as location, creation time, or the presence of certain object types within an image file. CAFDS provides low latency through a cloud-based metadata server, which uses a decision tree to locate the nearest files that satisfy the context and content features requested by applications. We implemented CAFDS in Android and Linux. Experimental results show CAFDS achieves substantially lower latency than peer-to-peer solutions that cannot leverage context information.

Lawson, M., Lofstead, J. 2018. Using a Robust Metadata Management System to Accelerate Scientific Discovery at Extreme Scales. 2018 IEEE/ACM 3rd International Workshop on Parallel Data Storage Data Intensive Scalable Computing Systems (PDSW-DISCS). :13–23.

Our previous work, which can be referred to as EMPRESS 1.0, showed that rich metadata management provides a relatively low-overhead approach to facilitating insight from scale-up scientific applications. However, this system did not provide the functionality needed for a viable production system or address whether such a system could scale. Therefore, we have extended our previous work to create EMPRESS 2.0, which incorporates the features required for a useful production system. Through a discussion of EMPRESS 2.0, this paper explores how to incorporate rich query functionality, fault tolerance, and atomic operations into a scalable, storage system independent metadata management system that is easy to use. This paper demonstrates that such a system offers significant performance advantages over HDF5, providing metadata querying that is 150X to 650X faster, and can greatly accelerate post-processing. Finally, since the current implementation of EMPRESS 2.0 relies on an RDBMS, this paper demonstrates that an RDBMS is a viable technology for managing data-oriented metadata.

Maltitz, M. von, Smarzly, S., Kinkelin, H., Carle, G. 2018. A management framework for secure multiparty computation in dynamic environments. NOMS 2018 - 2018 IEEE/IFIP Network Operations and Management Symposium. :1–7.

Secure multiparty computation (SMC) is a promising technology for privacy-preserving collaborative computation. In recent years, several feasibility studies have shown its practical applicability in different fields. However, it is recognized that the administration and management overhead of SMC solutions is still a problem. A vital next step is the incorporation of SMC in the emerging fields of the Internet of Things and (smart) dynamic environments. In these settings, the properties of these contexts make utilization of SMC even more challenging, since some vital premises for its application regarding environmental stability and preliminary configuration are not initially fulfilled. We bridge this gap by providing FlexSMC, a management and orchestration framework for SMC which supports the discovery of nodes, supports trust establishment between them, and makes SMC sessions robust by handling node failures and communication interruptions. The practical evaluation of FlexSMC shows that it enables the application of SMC in dynamic environments with reasonable performance penalties and computation durations, allowing soft real-time and interactive use cases.

Vanjari, M. S. P., Balsaraf, M. K. P. 2018. Efficient Exploration of Algorithm in Scholarly Big Data Document. 2018 International Conference on Information, Communication, Engineering and Technology (ICICET). :1–5.

Algorithms are developed, analyzed, and applied throughout the computer field and are used to build new applications. They are used to find solutions to problems under different conditions, by transforming the problems into algorithmic ones to which standard algorithms are applied. The number of scholarly digital documents is increasing day by day. AlgorithmSeer is a search engine for algorithms whose main aim is to provide a large algorithm database. It automatically finds and extracts algorithms from this large collection of documents, enabling algorithm indexing, searching, discovery, and analysis. A novel set of scalable techniques used by AlgorithmSeer to identify and extract algorithm representations from a large collection of scholarly documents is proposed. In addition, users with different levels of knowledge can access particularly important and relevant textual content on the platform and highlight portions of it. The highlighted documents can be shared with others in support of lectures and self-learning. However, learners at different levels cannot use the highlighted parts of the text with the same level of understanding. We address the problem of predicting new highlights for partially highlighted documents.

Xiong, M., Li, A., Xie, Z., Jia, Y. 2018. A Practical Approach to Answer Extraction for Constructing QA Solution. 2018 IEEE Third International Conference on Data Science in Cyberspace (DSC). :398–404.

Question Answering (QA) systems play an increasingly important role in the Internet age. Internet users increasingly rely on QA to obtain knowledge and solve problems, especially in the modern agricultural field. However, the answer quality in QA varies widely with the level of the agricultural experts, so answer quality assessment is important. Due to the lexical gap between questions and answers, the existing approaches are not quite satisfactory. A practical approach, RCAS, is proposed to rank the candidate answers; it utilizes support sets to reduce the impact of the lexical gap between questions and answers. First, similar questions are retrieved and support sets are produced from their high-quality answers. Based on the assumption that high-quality answers also have intrinsic similarity, the quality of candidate answers is then evaluated through their distance from the support sets. Second, unlike the existing approaches, prior knowledge from similar question-answer pairs is used to bridge the lexical and semantic gaps between questions and answers. Experiments are conducted on approximately 0.15 million question-answer pairs about agriculture, dietetics and food from Yahoo! Answers. The results show that our approach can rank the candidate answers more precisely.

Liang, J., Jiang, L., Cao, L., Li, L., Hauptmann, A. 2018. Focal Visual-Text Attention for Visual Question Answering. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. :6135–6143.

Recent insights on language and vision with neural networks have been successfully applied to simple single-image visual question answering. However, to tackle real-life question answering problems on multimedia collections such as personal photos, we have to look at whole collections with sequences of photos or videos. When answering questions from a large collection, a natural problem is to identify snippets to support the answer. In this paper, we describe a novel neural network called Focal Visual-Text Attention network (FVTA) for collective reasoning in visual question answering, where both visual and text sequence information such as images and text metadata are presented. FVTA introduces an end-to-end approach that makes use of a hierarchical process to dynamically determine what media and what time to focus on in the sequential data to answer the question. FVTA can not only answer the questions well but also provide the justifications on which the system's answers are based. FVTA achieves state-of-the-art performance on the MemexQA dataset and competitive results on the MovieQA dataset.

Sefati, Shahin, Saadatpanah, Parsa, Sayyadi, Hassan, Neumann, Jan. 2018. Conversational Content Discovery via Comcast X1 Voice Interface. Proceedings of the 12th ACM Conference on Recommender Systems. :489–489.

The global market for intelligent voice-enabled devices is expanding at a fast pace. Comcast, one of the largest cable providers in the US with about 30 million users, has recently reinvented the way that customers can discover and access content on an entertainment platform by introducing a voice remote control for its Xfinity X1 entertainment platform. Spoken language input allows the customer to express what they are interested in on their terms, which has made it significantly more convenient for the users to find their favorite TV channel or movie compared to the traditional limits of a screen menu navigated with the keys of a TV remote. The more natural user experience via voice interface results in voice queries that are considerably more complex to handle compared to channel numbers typed in or movie titles selected on screen, and this poses a challenge for the platform to understand the user intent and find the appropriate action for millions of voice queries that we receive every day. This also makes it necessary to adapt the underlying content recommendation algorithms to incorporate the richer intent context from the users. We describe some of the key components of our voice-powered content discovery platform that specifically address these issues. We discuss how we leverage multimodal data including voice queries and a large database of metadata to enable a more natural search experience via voice queries for finding relevant movies, TV shows or even a specific episode of a series. We describe the models that encode semantic similarities between the content and their metadata to allow users to search for places, people, and topics using keywords or phrases that do not explicitly appear in the movie/show titles as is traditionally the case. We describe how this category of voice search queries can be framed as a recommendation problem.
Even though voice input is extremely powerful for capturing the intent of our customers, the freedom to say anything also makes it more difficult for a voice remote user to know the range of possible queries that are supported by our system. We show how we can leverage millions of voice queries that we receive every day to build and train a deep learning-based recommender system that produces different types of recommendations such as educational suggestions and tips for voice commands that the platform supports. Finally, it is important to consider that the true potential of the voice-powered entertainment experience is the result of the fusion of intents expressed in language with navigation of content on the screen via the remote navigation buttons. For all the applications and features discussed in this talk, our recommendation systems are adapted to provide the most relevant suggestions whether the voice interface is initiating the action, navigating through the results rendered on the TV screen, or narrowing down the set of results by allowing the user to ask follow-up queries or select buttons.

2017-12-12

Soska, Kyle, Gates, Chris, Roundy, Kevin A., Christin, Nicolas. 2017. Automatic Application Identification from Billions of Files. Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. :2021–2030.

Understanding how to group a set of binary files into the piece of software they belong to is highly desirable for software profiling, malware detection, or enterprise audits, among many other applications. Unfortunately, it is also extremely challenging: there is absolutely no uniformity in the ways different applications rely on different files, in how binaries are signed, or in the versioning schemes used across different pieces of software. In this paper, we show that, by combining information gleaned from a large number of endpoints (millions of computers), we can accomplish large-scale application identification automatically and reliably. Our approach relies on collecting metadata on billions of files every day, summarizing it into much smaller "sketches", and performing approximate k-nearest neighbor clustering on non-metric space representations derived from these sketches. We design and implement our proposed system using Apache Spark, show that it can process billions of files in a matter of hours, and thus could be used for daily processing. We further show our system manages to successfully identify which files belong to which application with very high precision, and adequate recall.

Sürer, Özge. 2017. Improving Similarity Measures Using Ontological Data. Proceedings of the Eleventh ACM Conference on Recommender Systems. :416–420.

The representation of structural data is important to capture the pattern between features. Interrelations between variables provide information beyond the standard variables. In this study, we show how ontology information may be used in a recommender system to increase the efficiency of predictions. We propose two alternative similarity measures that incorporate the structural data representation. Experiments show that our ontology-based approach delivers improved classification accuracy when the dimension increases.

Shaabani, Nuhad, Meinel, Christoph. 2017. Incremental Discovery of Inclusion Dependencies. Proceedings of the 29th International Conference on Scientific and Statistical Database Management. :2:1–2:12.

Inclusion dependencies form one of the most fundamental classes of integrity constraints. Their importance in classical data management is reinforced by modern applications such as data profiling, data cleaning, entity resolution and schema matching. Their discovery in an unknown dataset is at the core of any data analysis effort. Therefore, several research approaches have focused on their efficient discovery in a given, static dataset. However, none of these approaches are appropriate for applications on dynamic datasets, such as transactional datasets, scientific applications, and social networks. In these cases, discovery techniques should be able to efficiently update the inclusion dependencies after an update in the dataset, without reprocessing the entire dataset. We present the first approach for incrementally updating the unary inclusion dependencies. In particular, our approach is based on the concept of attribute clustering from which the unary inclusion dependencies are efficiently derivable. We incrementally update the clusters after each update of the dataset. Updating the clusters does not need to access the dataset because of special data structures designed to efficiently support the updating process. We perform an exhaustive analysis of our approach by applying it to large datasets with several hundred attributes and more than 116,200,000 million tuples. The results show that the incremental discovery significantly reduces the runtime needed by the static discovery. This reduction in the runtime is up to 99.9996 % for both inserts and deletes.
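To make the notion of a unary inclusion dependency concrete: attribute A is included in attribute B when every value occurring in A also occurs in B. The minimal Python sketch below maintains a value-to-attributes index under inserts and deletes and derives the dependencies from that index rather than from the raw rows. It is an illustrative simplification and not the authors' algorithm or data structures. The class name `UnaryIndIndex` and its methods are hypothetical.

```python
from collections import Counter, defaultdict


class UnaryIndIndex:
    """Hypothetical helper: tracks which attributes contain each value."""

    def __init__(self, attributes):
        self.attributes = list(attributes)
        # Per-attribute multiset of values, so deletes know when a value vanishes.
        self.counts = {a: Counter() for a in self.attributes}
        # Inverted index: value -> set of attributes that currently contain it.
        self.attrs_of_value = defaultdict(set)

    def insert(self, row):
        for attr, value in row.items():
            self.counts[attr][value] += 1
            self.attrs_of_value[value].add(attr)

    def delete(self, row):
        # Assumes the row was inserted earlier.
        for attr, value in row.items():
            self.counts[attr][value] -= 1
            if self.counts[attr][value] == 0:
                del self.counts[attr][value]
                self.attrs_of_value[value].discard(attr)
                if not self.attrs_of_value[value]:
                    del self.attrs_of_value[value]

    def inds(self):
        """Return all pairs (A, B) such that A is included in B.

        An attribute with no values is vacuously included in every other one.
        """
        result = set()
        for a in self.attributes:
            candidates = set(self.attributes) - {a}
            for value in self.counts[a]:
                # B survives only if it contains every value of A.
                candidates &= self.attrs_of_value[value]
            result.update((a, b) for b in candidates)
        return result


index = UnaryIndIndex(["A", "B"])
index.insert({"A": 1, "B": 1})
index.insert({"A": 1, "B": 2})
before = index.inds()          # A's values {1} all occur in B's values {1, 2}
index.insert({"A": 3, "B": 2})
after_insert = index.inds()    # 3 does not occur in B, so A is no longer included
index.delete({"A": 3, "B": 2})
after_delete = index.inds()
```

Each insert or delete touches only the index entries for the values in the affected row. Deriving the dependencies still scans the distinct values per attribute, which a more refined structure would avoid.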

Zhou, G., Huang, J. X.. 2017. Modeling and Learning Distributed Word Representation with Metadata for Question Retrieval. IEEE Transactions on Knowledge and Data Engineering. 29:1226–1239.

Community question answering (cQA) has become an important issue due to the popularity of cQA archives on the Web. This paper focuses on addressing the lexical gap problem in question retrieval. Question retrieval in cQA archives aims to find the existing questions that are semantically equivalent or relevant to the queried questions. However, the lexical gap problem brings a new challenge for question retrieval in cQA. In this paper, we propose to model and learn distributed word representations with metadata of category information within cQA pages for question retrieval using two novel category powered models. One is a basic category powered model called MB-NET and the other is an enhanced category powered model called ME-NET, which can better learn the distributed word representations and alleviate the lexical gap problem. To deal with the variable size of word representation vectors, we employ the Fisher kernel framework to transform them into fixed-length vectors. Experimental results on large-scale English and Chinese cQA data sets show that our proposed approaches can significantly outperform state-of-the-art retrieval models for question retrieval in cQA. Moreover, we further evaluate our approaches in large-scale automatic evaluation experiments. The evaluation results show that promising and significant performance improvements can be achieved.