Visible to the public Biblio

Filters: Keyword is query processing  [Clear All Filters]
2020-09-04
Merhav, Neri, Cohen, Asaf.  2019.  Universal Randomized Guessing with Application to Asynchronous Decentralized Brute—Force Attacks. 2019 IEEE International Symposium on Information Theory (ISIT). :485—489.
Consider the problem of guessing a random vector X by submitting queries (guesses) of the form "Is X equal to x?" until an affirmative answer is obtained. A key figure of merit is the number of queries required until the right vector is guessed, termed the guesswork. The goal is to devise a guessing strategy which minimizes a certain guesswork moment. We study a universal, decentralized scenario where the guesser does not know the distribution of X, and is not allowed to prepare a list of words to be guessed in advance, or to remember its past guesses. Such a scenario is useful, for example, if bots within a Botnet carry out a brute-force attack to guess a password or decrypt a message, yet cannot coordinate the guesses or even know how many bots actually participate in the attack. We devise universal decentralized guessing strategies, first, for memoryless sources, and then generalize them to finite-state sources. For both, we derive the guessing exponent and prove its asymptotic optimality by deriving a matching converse. The strategies are based on randomized guessing using a universal distribution. We also extend the results to guessing with side information (SI). Finally, we design simple algorithms for sampling from the universal distributions.
2020-08-24
Liu, Hongling.  2019.  Research on Feasibility Path of Technology Supervision and Technology Protection in Big Data Environment. 2019 International Conference on Intelligent Transportation, Big Data Smart City (ICITBS). :293–296.
Big data will bring revolutionary changes from life to thinking for society as a whole. At the same time, the massive data and potential value of big data are subject to many security risks. Aiming at the above problems, a data privacy protection model for big data platform is proposed. First, the data privacy protection model of big data for data owners is introduced in detail, including protocol design, logic design, complexity analysis and security analysis. Then, the query privacy protection model of big data for ordinary users is introduced in detail, including query protocol design and query mode design. Complexity analysis and safety analysis are performed. Finally, a stand-alone simulation experiment is built for the proposed privacy protection model. Experimental data is obtained and analyzed. The feasibility of the privacy protection model is verified.
2020-08-13
Yu, Lili, Su, Xiaoguang, Zhang, Lei.  2019.  Collaboration-Based Location Privacy Protection Method. 2019 IEEE 2nd International Conference on Electronics Technology (ICET). :639—643.
In the privacy protection method based on user collaboration, all participants and collaborators must share the maximum anonymity value set in the anonymous group. No user can get better quality of service by reducing the anonymity requirement. In this paper, a privacy protection algorithm random-QBE, which divides query information into blocks and exchanges randomly, is proposed. Through this method, personalized anonymity, query diversity and location anonymity in user cooperative privacy protection can be realized. And through multi-hop communication between collaborative users, this method can also satisfy the randomness of anonymous location, so that the location of the applicant is no longer located in the center of the anonymous group, which further increases the ability of privacy protection. Experiments show that the algorithm can complete the processing in a relatively short time and is suitable for deployment in real environment to protect user's location privacy.
2020-08-03
Juuti, Mika, Szyller, Sebastian, Marchal, Samuel, Asokan, N..  2019.  PRADA: Protecting Against DNN Model Stealing Attacks. 2019 IEEE European Symposium on Security and Privacy (EuroS P). :512–527.
Machine learning (ML) applications are increasingly prevalent. Protecting the confidentiality of ML models becomes paramount for two reasons: (a) a model can be a business advantage to its owner, and (b) an adversary may use a stolen model to find transferable adversarial examples that can evade classification by the original model. Access to the model can be restricted to be only via well-defined prediction APIs. Nevertheless, prediction APIs still provide enough information to allow an adversary to mount model extraction attacks by sending repeated queries via the prediction API. In this paper, we describe new model extraction attacks using novel approaches for generating synthetic queries, and optimizing training hyperparameters. Our attacks outperform state-of-the-art model extraction in terms of transferability of both targeted and non-targeted adversarial examples (up to +29-44 percentage points, pp), and prediction accuracy (up to +46 pp) on two datasets. We provide take-aways on how to perform effective model extraction attacks. We then propose PRADA, the first step towards generic and effective detection of DNN model extraction attacks. It analyzes the distribution of consecutive API queries and raises an alarm when this distribution deviates from benign behavior. We show that PRADA can detect all prior model extraction attacks with no false positives.
2020-07-20
Pengcheng, Li, Yi, Jinfeng, Zhang, Lijun.  2018.  Query-Efficient Black-Box Attack by Active Learning. 2018 IEEE International Conference on Data Mining (ICDM). :1200–1205.
Deep neural network (DNN) as a popular machine learning model is found to be vulnerable to adversarial attack. This attack constructs adversarial examples by adding small perturbations to the raw input, while appearing unmodified to human eyes but will be misclassified by a well-trained classifier. In this paper, we focus on the black-box attack setting where attackers have almost no access to the underlying models. To conduct black-box attack, a popular approach aims to train a substitute model based on the information queried from the target DNN. The substitute model can then be attacked using existing white-box attack approaches, and the generated adversarial examples will be used to attack the target DNN. Despite its encouraging results, this approach suffers from poor query efficiency, i.e., attackers usually needs to query a huge amount of times to collect enough information for training an accurate substitute model. To this end, we first utilize state-of-the-art white-box attack methods to generate samples for querying, and then introduce an active learning strategy to significantly reduce the number of queries needed. Besides, we also propose a diversity criterion to avoid the sampling bias. Our extensive experimental results on MNIST and CIFAR-10 show that the proposed method can reduce more than 90% of queries while preserve attacking success rates and obtain an accurate substitute model which is more than 85% similar with the target oracle.
2020-07-10
Cai, Zhipeng, Miao, Dongjing, Li, Yingshu.  2019.  Deletion Propagation for Multiple Key Preserving Conjunctive Queries: Approximations and Complexity. 2019 IEEE 35th International Conference on Data Engineering (ICDE). :506—517.

This paper studies the deletion propagation problem in terms of minimizing view side-effect. It is a problem funda-mental to data lineage and quality management which could be a key step in analyzing view propagation and repairing data. The investigated problem is a variant of the standard deletion propagation problem, where given a source database D, a set of key preserving conjunctive queries Q, and the set of views V obtained by the queries in Q, we try to identify a set T of tuples from D whose elimination prevents all the tuples in a given set of deletions on views △V while preserving any other results. The complexity of this problem has been well studied for the case with only a single query. Dichotomies, even trichotomies, for different settings are developed. However, no results on multiple queries are given which is a more realistic case. We study the complexity and approximations of optimizing the side-effect on the views, i.e., find T to minimize the additional damage on V after removing all the tuples of △V. We focus on the class of key-preserving conjunctive queries which is a dichotomy for the single query case. It is surprising to find that except the single query case, this problem is NP-hard to approximate within any constant even for a non-trivial set of multiple project-free conjunctive queries in terms of view side-effect. The proposed algorithm shows that it can be approximated within a bound depending on the number of tuples of both V and △V. We identify a class of polynomial tractable inputs, and provide a dynamic programming algorithm to solve the problem. Besides data lineage, study on this problem could also provide important foundations for the computational issues in data repairing. Furthermore, we introduce some related applications of this problem, especially for query feedback based data cleaning.

2020-06-22
Tong, Dong, Yong, Zeng, Mengli, Liu, Zhihong, Liu, Jianfeng, Ma, Xiaoyan, Zhu.  2019.  A Topology Based Differential Privacy Scheme for Average Path Length Query. 2019 International Conference on Networking and Network Applications (NaNA). :350–355.
Differential privacy is heavily used in privacy protection due to it provides strong protection against private data. The existing differential privacy scheme mainly implements the privacy protection of nodes or edges in the network by perturbing the data query results. Most of them cannot meet the privacy protection requirements of multiple types of information. In order to overcome these issues, a differential privacy security mechanism with average path length (APL) query is proposed in this paper, which realize the privacy protection of both network vertices and edge weights. Firstly, by describing APL, the reasons for choosing this attribute as the query function are analyzed. Secondly, global sensitivity of APL query under the need of node privacy protection and edge-weighted privacy protection is proved. Finally, the relationship between data availability and privacy control parameters in differential privacy is analyzed through experiments.
2020-06-08
Fang, Bo, Hua, Zhongyun, Huang, Hejiao.  2019.  Locality-Sensitive Hashing Scheme Based on Heap Sort of Hash Bucket. 2019 14th International Conference on Computer Science Education (ICCSE). :5–10.
Nearest neighbor search (NNS) is one of the current popular research directions, which widely used in machine learning, pattern recognition, image detection and so on. In the low dimension data, based on tree search method can get good results. But when the data dimension goes up, that will produce a curse of dimensional. The proposed Locality-Sensitive Hashing algorithm (LSH) greatly improves the efficiency of nearest neighbor query for high dimensional data. But the algorithm relies on the building a large number of hash table, which makes the space complexity very high. C2LSH based on dynamic collision improves the disadvantage of LSH, but its disadvantage is that it needs to detect the collision times of a large number of data points which Increased query time. Therefore, Based on LSH algorithm, later researchers put forward many improved algorithms, but still not ideal.In this paper, we put forward Locality-Sensitive Hashing Scheme Based on Heap Sort of Hash Bucket (HSLSH) algorithm aiming at the shortcomings of LSH and C2LSH. Its main idea is to take advantage of the efficiency of heapsort in massive data sorting to improve the efficiency of nearest neighbor query. It only needs to rely on a small number of hash functions can not only overcome the shortcoming of LSH need to build a large number of hash table, and avoids defects of C2LSH. Experiments show that our algorithm is more than 20% better than C2LSH in query accuracy and 40% percent lower in query time.
2020-05-22
Devarakonda, Ranjeet, Giansiracusa, Michael, Kumar, Jitendra.  2018.  Machine Learning and Social Media to Mine and Disseminate Big Scientific Data. 2018 IEEE International Conference on Big Data (Big Data). :5312—5315.

One of the challenges in supplying the communities with wider access to scientific databases is the need for knowledge of database languages like Structured Query Language (SQL). Although the SQL language has been published in many forms, not everybody is able to write SQL queries. Another challenge is that it might not be practical to make the public aware of the structure of databases. There is a need for novice users to query relational databases using their natural language. To solve this problem, many natural language interfaces to structured databases have been developed. The goal is to provide a more intuitive method for generating database queries and delivering responses. Through social media, which makes it possible to interact with a wide section of the population, and with the help of natural language processing, researchers at the Atmospheric Radiation Measurement (ARM) Data Center at Oak Ridge National Laboratory (ORNL) have developed a concept to enable easy search and retrieval of data from several environmental data centers for the scientific community through social media.Using a machine learning framework that maps natural language text to thousands of datasets, instruments, variables, and data streams, the prototype system would allow users to request data through Twitter and receive a link (via tweet) to applicable data results on the project's search catalog tailored to their key words. This automated identification of relevant data from various petascale archives at ORNL could increase convenience, access, and use of the project's data by the broader community. In this paper we discuss how some data-intensive projects at ORNL are using innovative ways to help in data discovery.

Kate, Abhilasha, Kamble, Satish, Bodkhe, Aishwarya, Joshi, Mrunal.  2018.  Conversion of Natural Language Query to SQL Query. 2018 Second International Conference on Electronics, Communication and Aerospace Technology (ICECA). :488—491.

This paper present an approach to automate the conversion of Natural Language Query to SQL Query effectively. Structured Query Language is a powerful tool for managing data held in a relational database management system. To retrieve or manage data user have to enter the correct SQL Query. But the users who don't have any knowledge about SQL are unable to retrieve the required data. To overcome this we proposed a model in Natural Language Processing for converting the Natural Language Query to SQL query. This helps novice user to get required content without knowing any complex details about SQL. This system can also deal with complex queries. This system is designed for Training and Placement cell officers who work on student database but don't have any knowledge about SQL. In this system, user can also enter the query using speech. System will convert speech into the text format. This query will get transformed to SQL query. System will execute the query and gives output to the user.

Sheth, Utsav, Dutta, Sanghamitra, Chaudhari, Malhar, Jeong, Haewon, Yang, Yaoqing, Kohonen, Jukka, Roos, Teemu, Grover, Pulkit.  2018.  An Application of Storage-Optimal MatDot Codes for Coded Matrix Multiplication: Fast k-Nearest Neighbors Estimation. 2018 IEEE International Conference on Big Data (Big Data). :1113—1120.
We propose a novel application of coded computing to the problem of the nearest neighbor estimation using MatDot Codes (Fahim et al., Allerton'17) that are known to be optimal for matrix multiplication in terms of recovery threshold under storage constraints. In approximate nearest neighbor algorithms, it is common to construct efficient in-memory indexes to improve query response time. One such strategy is Multiple Random Projection Trees (MRPT), which reduces the set of candidate points over which Euclidean distance calculations are performed. However, this may result in a high memory footprint and possibly paging penalties for large or high-dimensional data. Here we propose two techniques to parallelize MRPT that exploit data and model parallelism respectively by dividing both the data storage and the computation efforts among different nodes in a distributed computing cluster. This is especially critical when a single compute node cannot hold the complete dataset in memory. We also propose a novel coded computation strategy based on MatDot codes for the model-parallel architecture that, in a straggler-prone environment, achieves the storage-optimal recovery threshold, i.e., the number of nodes that are required to serve a query. We experimentally demonstrate that, in the absence of straggling, our distributed approaches require less query time than execution on a single processing node, providing near-linear speedups with respect to the number of worker nodes. Our experiments on real systems with simulated straggling, we also show that in a straggler-prone environment, our strategy achieves a faster query execution than the uncoded strategy.
Kang, Hyunjoong, Hong, Sanghyun, Lee, Kookjin, Park, Noseong, Kwon, Soonhyun.  2018.  On Integrating Knowledge Graph Embedding into SPARQL Query Processing. 2018 IEEE International Conference on Web Services (ICWS). :371—374.
SPARQL is a standard query language for knowledge graphs (KGs). However, it is hard to find correct answer if KGs are incomplete or incorrect. Knowledge graph embedding (KGE) enables answering queries on such KGs by inferring unknown knowledge and removing incorrect knowledge. Hence, our long-term goal in this line of research is to propose a new framework that integrates KGE and SPARQL, which opens various research problems to be addressed. In this paper, we solve one of the most critical problems, that is, optimizing the performance of nearest neighbor (NN) search. In our evaluations, we demonstrate that the search time of state-of-the-art NN search algorithms is improved by 40% without sacrificing answer accuracy.
Ito, Toshitaka, Itotani, Yuri, Wakabayashi, Shin'ichi, Nagayama, Shinobu, Inagi, Masato.  2018.  A Nearest Neighbor Search Engine Using Distance-Based Hashing. 2018 International Conference on Field-Programmable Technology (FPT). :150—157.
This paper proposes an FPGA-based nearest neighbor search engine for high-dimensional data, in which nearest neighbor search is performed based on distance-based hashing. The proposed hardware search engine implements a nearest neighbor search algorithm based on an extension of flexible distance-based hashing (FDH, for short), which finds an exact solution with high probability. The proposed engine is a parallel processing and pipelined circuit so that search results can be obtained in a short execution time. Experimental results show the effectiveness and efficiency of the proposed engine.
Ahsan, Ramoza, Bashir, Muzammil, Neamtu, Rodica, Rundensteiner, Elke A., Sarkozy, Gabor.  2019.  Nearest Neighbor Subsequence Search in Time Series Data. 2019 IEEE International Conference on Big Data (Big Data). :2057—2066.
Continuous growth in sensor data and other temporal sequence data necessitates efficient retrieval and similarity search support on these big time series datasets. However, finding exact similarity results, especially at the granularity of subsequences, is known to be prohibitively costly for large data sets. In this paper, we thus propose an efficient framework for solving this exact subsequence similarity match problem, called TINN (TIme series Nearest Neighbor search). Exploiting the range interval diversity properties of time series datasets, TINN captures similarity at two levels of abstraction, namely, relationships among subsequences within each long time series and relationships across distinct time series in the data set. These relationships are compactly organized in an augmented relationship graph model, with the former relationships encoded in similarity vectors at TINN nodes and the later captured by augmented edge types in the TINN Graph. Query processing strategy deploy novel pruning techniques on the TINN Graph, including node skipping, vertical and horizontal pruning, to significantly reduce the number of time series as well as subsequences to be explored. Comprehensive experiments on synthetic and real world time series data demonstrate that our TINN model consistently outperforms state-of-the-art approaches while still guaranteeing to retrieve exact matches.
Abdelhadi, Ameer M.S., Bouganis, Christos-Savvas, Constantinides, George A..  2019.  Accelerated Approximate Nearest Neighbors Search Through Hierarchical Product Quantization. 2019 International Conference on Field-Programmable Technology (ICFPT). :90—98.
A fundamental recurring task in many machine learning applications is the search for the Nearest Neighbor in high dimensional metric spaces. Towards answering queries in large scale problems, state-of-the-art methods employ Approximate Nearest Neighbors (ANN) search, a search that returns the nearest neighbor with high probability, as well as techniques that compress the dataset. Product-Quantization (PQ) based ANN search methods have demonstrated state-of-the-art performance in several problems, including classification, regression and information retrieval. The dataset is encoded into a Cartesian product of multiple low-dimensional codebooks, enabling faster search and higher compression. Being intrinsically parallel, PQ-based ANN search approaches are amendable for hardware acceleration. This paper proposes a novel Hierarchical PQ (HPQ) based ANN search method as well as an FPGA-tailored architecture for its implementation that outperforms current state of the art systems. HPQ gradually refines the search space, reducing the number of data compares and enabling a pipelined search. The mapping of the architecture on a Stratix 10 FPGA device demonstrates over ×250 speedups over current state-of-the-art systems, opening the space for addressing larger datasets and/or improving the query times of current systems.
Rattaphun, Munlika, Prayoonwong, Amorntip, Chiu, Chih- Yi.  2019.  Indexing in k-Nearest Neighbor Graph by Hash-Based Hill-Climbing. 2019 16th International Conference on Machine Vision Applications (MVA). :1—4.
A main issue in approximate nearest neighbor search is to achieve an excellent tradeoff between search accuracy and computation cost. In this paper, we address this issue by leveraging k-nearest neighbor graph and hill-climbing to accelerate vector quantization in the query assignment process. A modified hill-climbing algorithm is proposed to traverse k-nearest neighbor graph to find closest centroids for a query, rather than calculating the query distances to all centroids. Instead of using random seeds in the original hill-climbing algorithm, we generate high-quality seeds based on the hashing technique. It can boost the query assignment efficiency due to a better start-up in hill-climbing. We evaluate the experiment on the benchmarks of SIFT1M and GIST1M datasets, and show the proposed hashing-based seed generation effectively improves the search performance.
Li, Xiaodong.  2019.  DURS: A Distributed Method for k-Nearest Neighbor Search on Uncertain Graphs. 2019 20th IEEE International Conference on Mobile Data Management (MDM). :377—378.
Large graphs are increasingly prevalent in mobile networks, social networks, traffic networks and biological networks. These graphs are often uncertain, where edges are augmented with probabilities that indicates the chance to exist. Recently k-nearest neighbor search has been studied within the field of uncertain graphs, but the scalability and efficiency issues are not well solved. Moreover, solutions are implemented on a single machine and thus cannot fit large uncertain graphs. In this paper, we develop a framework, called DURS, to distribute k-nearest neighbor search into several machines and re-partition the uncertain graphs to balance the work loads and reduce the communication costs. Evaluation results show that DURS is essential to make the system scalable when answering k-nearest neighbor queries on uncertain graphs.
2020-04-20
Khan, Muhammad Imran, Foley, Simon N., O'Sullivan, Barry.  2019.  PriDe: A Quantitative Measure of Privacy-Loss in Interactive Querying Settings. 2019 10th IFIP International Conference on New Technologies, Mobility and Security (NTMS). :1–5.
This paper presents, PriDe, a model to measure the deviation of an analyst's (user) querying behaviour from normal querying behaviour. The deviation is measured in terms of privacy, that is to say, how much of the privacy loss has incurred due to this shift in querying behaviour. The shift is represented in terms of a score - a privacy-loss score, the higher the score the more the loss in privacy. Querying behaviour of analysts are modelled using n-grams of SQL query and subsequently, behavioural profiles are constructed. Profiles are then compared in terms of privacy resulting in a quantified score indicating the privacy loss.
Xiang, Wei.  2019.  An Efficient Location Privacy Preserving Model based on Geohash. 2019 6th International Conference on Behavioral, Economic and Socio-Cultural Computing (BESC). :1–5.
With the rapid development of location-aware mobile devices, location-based services have been widely used. When LBS (Location Based Services) bringing great convenience and profits, it also brings great hidden trouble, among which user privacy security is one of them. The paper builds a LBS privacy protection model and develops algorithm depend on the technology of one dimensional coding of Geohash geographic information. The results of experiments and data measurements show that the model the model has reached k-anonymity effect and has good performance in avoiding attacking from the leaked information in a continuous query with the user's background knowledge. It also has a preferable performance in time cost of system process.
Sule, Rupali, Chaudhari, Sangita.  2018.  Preserving Location Privacy in Geosocial Applications using Error Based Transformation. 2018 International Conference on Smart City and Emerging Technology (ICSCET). :1–4.
Geo-social applications deal with constantly sharing user's current geographic information in terms of location (Latitude and Longitude). Such application can be used by many people to get information about their surrounding with the help of their friend's locations and their recommendations. But without any privacy protection, these systems can be easily misused by tracking the users. We are proposing Error Based Transformation (ERB) approach for location transformation which provides significantly improved location privacy without adding uncertainty in to query results or relying on strong assumptions about server security. The key insight is to apply secure user-specific, distance-preserving coordinate transformations to all location data shared with the server. Only the friends of a user can get exact co-ordinates by applying inverse transformation with secret key shared with them. Servers can evaluate all location queries correctly on transformed data. ERB privacy mechanism guarantee that servers are unable to see or infer actual location data from the transformed data. ERB privacy mechanism is successful against a powerful adversary model where prototype measurements used to show that it provides with very little performance overhead making it suitable for today's mobile device.
Sule, Rupali, Chaudhari, Sangita.  2018.  Preserving Location Privacy in Geosocial Applications using Error Based Transformation. 2018 International Conference on Smart City and Emerging Technology (ICSCET). :1–4.
Geo-social applications deal with constantly sharing user's current geographic information in terms of location (Latitude and Longitude). Such application can be used by many people to get information about their surrounding with the help of their friend's locations and their recommendations. But without any privacy protection, these systems can be easily misused by tracking the users. We are proposing Error Based Transformation (ERB) approach for location transformation which provides significantly improved location privacy without adding uncertainty in to query results or relying on strong assumptions about server security. The key insight is to apply secure user-specific, distance-preserving coordinate transformations to all location data shared with the server. Only the friends of a user can get exact co-ordinates by applying inverse transformation with secret key shared with them. Servers can evaluate all location queries correctly on transformed data. ERB privacy mechanism guarantee that servers are unable to see or infer actual location data from the transformed data. ERB privacy mechanism is successful against a powerful adversary model where prototype measurements used to show that it provides with very little performance overhead making it suitable for today's mobile device.
2020-03-30
Souza, Renan, Azevedo, Leonardo, Lourenço, Vítor, Soares, Elton, Thiago, Raphael, Brandão, Rafael, Civitarese, Daniel, Brazil, Emilio, Moreno, Marcio, Valduriez, Patrick et al..  2019.  Provenance Data in the Machine Learning Lifecycle in Computational Science and Engineering. 2019 IEEE/ACM Workflows in Support of Large-Scale Science (WORKS). :1–10.
Machine Learning (ML) has become essential in several industries. In Computational Science and Engineering (CSE), the complexity of the ML lifecycle comes from the large variety of data, scientists' expertise, tools, and workflows. If data are not tracked properly during the lifecycle, it becomes unfeasible to recreate a ML model from scratch or to explain to stackholders how it was created. The main limitation of provenance tracking solutions is that they cannot cope with provenance capture and integration of domain and ML data processed in the multiple workflows in the lifecycle, while keeping the provenance capture overhead low. To handle this problem, in this paper we contribute with a detailed characterization of provenance data in the ML lifecycle in CSE; a new provenance data representation, called PROV-ML, built on top of W3C PROV and ML Schema; and extensions to a system that tracks provenance from multiple workflows to address the characteristics of ML and CSE, and to allow for provenance queries with a standard vocabulary. We show a practical use in a real case in the O&G industry, along with its evaluation using 239,616 CUDA cores in parallel.
Miao, Hui, Deshpande, Amol.  2019.  Understanding Data Science Lifecycle Provenance via Graph Segmentation and Summarization. 2019 IEEE 35th International Conference on Data Engineering (ICDE). :1710–1713.
Increasingly modern data science platforms today have non-intrusive and extensible provenance ingestion mechanisms to collect rich provenance and context information, handle modifications to the same file using distinguishable versions, and use graph data models (e.g., property graphs) and query languages (e.g., Cypher) to represent and manipulate the stored provenance/context information. Due to the schema-later nature of the metadata, multiple versions of the same files, and unfamiliar artifacts introduced by team members, the resulting "provenance graphs" are quite verbose and evolving; further, it is very difficult for the users to compose queries and utilize this valuable information just using standard graph query model. In this paper, we propose two high-level graph query operators to address the verboseness and evolving nature of such provenance graphs. First, we introduce a graph segmentation operator, which queries the retrospective provenance between a set of source vertices and a set of destination vertices via flexible boundary criteria to help users get insight about the derivation relationships among those vertices. We show the semantics of such a query in terms of a context-free grammar, and develop efficient algorithms that run orders of magnitude faster than state-of-the-art. Second, we propose a graph summarization operator that combines similar segments together to query prospective provenance of the underlying project. The operator allows tuning the summary by ignoring vertex details and characterizing local structures, and ensures the provenance meaning using path constraints. We show the optimal summary problem is PSPACE-complete and develop effective approximation algorithms. We implement the operators on top of Neo4j, evaluate our query techniques extensively, and show the effectiveness and efficiency of the proposed methods.
Tabassum, Anika, Nady, Anannya Islam, Rezwanul Huq, Mohammad.  2019.  Mathematical Formulation and Implementation of Query Inversion Techniques in RDBMS for Tracking Data Provenance. 2019 7th International Conference on Information and Communication Technology (ICoICT). :1–6.
Nowadays the massive amount of data is produced from different sources and lots of applications are processing these data to discover insights. Sometimes we may get unexpected results from these applications and it is not feasible to trace back to the data origin manually to find the source of errors. To avoid this problem, data must be accompanied by the context of how they are processed and analyzed. Especially, data-intensive applications like e-Science always require transparency and therefore, we need to understand how data has been processed and transformed. In this paper, we propose mathematical formulation and implementation of query inversion techniques to trace the provenance of data in a relational database management system (RDBMS). We build mathematical formulations of inverse queries for most of the relational algebra operations and show the formula for join operations in this paper. We, then, implement these formulas of inversion techniques and the experiment shows that our proposed inverse queries can successfully trace back to original data i.e. finding data provenance.
2020-03-18
Mohd Kamal, Ahmad Akmal Aminuddin, Iwamura, Keiichi.  2019.  Searchable Encryption Using Secret-Sharing Scheme for Multiple Keyword Search Using Conjunctive and Disjunctive Searching. 2019 IEEE Intl Conf on Dependable, Autonomic and Secure Computing, Intl Conf on Pervasive Intelligence and Computing, Intl Conf on Cloud and Big Data Computing, Intl Conf on Cyber Science and Technology Congress (DASC/PiCom/CBDCom/CyberSciTech). :149–156.
The main searching functions realized by searchable encryption can be divided into searching using one query and searching using multiple queries. Searchable encryption using one query has been widely studied and researched; however, few methods of searchable encryption can accommodate search using multiple queries. In addition, most of the method proposed thus far utilize the concept of index search. Therefore, a new problem exists, in which an additional process of updating or deleting an index when new documents are added or removed is required. Hence, the overall computation cost increases. Another problem is that a document that is not registered in the index cannot be searched. Therefore, herein, using a secret-sharing scheme that is known to offer a low computational cost, we propose a method that can realize both logical conjunctive (AND) and logical disjunctive (OR) search over multiple conditions, without the construction of any index. Hence, we can realize direct searching over sentences, thus achieving a more efficient search method.