Bibliography
Recently, a large number of studies on privacy-preserving data publishing have been conducted. We find that most k-anonymity algorithms, when used in service-oriented settings, fail to consider the distribution characteristics of attribute values in the data and the differences in contribution among quasi-identifier attributes. In this paper, the importance of these distribution characteristics and contribution differences to anonymization results is illustrated. To maximize the utility of released data, a service-oriented adaptive anonymity algorithm is proposed. We establish a dispersion-degree model to quantify the characteristics of attribute value distribution and introduce the concept of a utility weight related to the contribution of each quasi-identifier attribute. A priority coefficient and a characterization coefficient of partition quality are defined to adaptively optimize the selection of the split dimension and split value during anonymity-group partitioning, which reduces unnecessary information loss and further improves the utility of the anonymized data. The rationality and validity of the algorithm are verified by theoretical analysis and multiple experiments.
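As a rough illustration of the kind of adaptive partitioning described above, the sketch below performs Mondrian-style anonymity-group splitting in which the split dimension is chosen by a weighted dispersion score. The dispersion metric, the per-attribute utility weights, the value of k, and the toy records are illustrative assumptions standing in for the paper's dispersion-degree model and utility weights, not its actual algorithm.

```python
# Illustrative Mondrian-style sketch; dispersion metric, weights, k, and data are assumptions.
from statistics import median, pstdev

def dispersion(values):
    # Spread of a quasi-identifier within the current partition (stand-in metric).
    return pstdev(values) if len(values) > 1 else 0.0

def partition(records, qi_dims, weights, k):
    """Recursively split records into anonymity groups of size >= k."""
    if len(records) < 2 * k:
        return [records]                               # cannot split without violating k
    # Choose the split dimension with the largest weighted dispersion.
    scores = {d: weights[d] * dispersion([r[d] for r in records]) for d in qi_dims}
    dim = max(scores, key=scores.get)
    split = median(r[dim] for r in records)            # median as the split value
    left = [r for r in records if r[dim] <= split]
    right = [r for r in records if r[dim] > split]
    if len(left) < k or len(right) < k:
        return [records]                               # unbalanced split, keep group intact
    return partition(left, qi_dims, weights, k) + partition(right, qi_dims, weights, k)

# Toy usage: records are (age, zipcode) pairs, both treated as quasi-identifiers.
data = [(25, 47906), (27, 47907), (33, 47601), (35, 47602), (52, 46201), (55, 46202)]
print(partition(data, qi_dims=[0, 1], weights={0: 1.0, 1: 0.5}, k=2))
```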
This paper describes a machine-assistance approach to grading decisions for values that may be missing or need validation, using a mathematical, algebraic form of an expert system instead of the traditional textual or logic forms, and builds a neural-network-like computational graph structure. The expert system is organized into input, hidden, and output layers, which provide a structured approach to organizing the knowledge base; this abstraction is reusable for data-migration applications in big data, cyber, and relational databases. The approach is further enhanced with a Bayesian probability tree that grades the confidence of candidate values, instead of the traditional grading of rule probabilities, and estimates the most probable value in light of all evidence presented. This is groundwork for a machine learning (ML) expert-system approach in a form closer to a neural-network node structure.
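The following sketch illustrates the Bayesian grading of candidate values mentioned above: priors over candidate values for a missing field are updated with likelihoods contributed by independent rules, and the most probable value is selected. The candidate values, priors, and rule likelihoods are illustrative assumptions, not the paper's knowledge base or its probability-tree structure.

```python
# Illustrative only: the candidate values, priors, and rule likelihoods are assumptions.
def grade_candidates(priors, evidence_likelihoods):
    """Return posterior P(value | all evidence), assuming independent evidence."""
    posterior = dict(priors)
    for likelihoods in evidence_likelihoods:           # one dict per piece of evidence
        posterior = {v: posterior[v] * likelihoods.get(v, 1e-9) for v in posterior}
    total = sum(posterior.values())
    return {v: p / total for v, p in posterior.items()}

# Example: infer a missing "country" field from two rules (currency and phone prefix).
priors = {"UK": 0.4, "US": 0.4, "AU": 0.2}
evidence = [
    {"UK": 0.90, "US": 0.05, "AU": 0.05},   # rule 1: currency field says GBP
    {"UK": 0.70, "US": 0.10, "AU": 0.20},   # rule 2: phone prefix is ambiguous
]
posterior = grade_candidates(priors, evidence)
print(posterior, "-> most probable value:", max(posterior, key=posterior.get))
```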
Model explanations based on pure observational data cannot compute the effects of features reliably, due to their inability to estimate how altering one factor affects the rest. We argue that explanations should be based on the causal model of the data and on derived intervened causal models, which represent the data distribution under interventions. With these models, we can compute counterfactuals: new samples that tell us how the model reacts to feature changes in our input. We propose a novel explanation methodology based on causal counterfactuals and identify the limitations of current image generative models in their application to counterfactual creation.
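A minimal sketch of the distinction drawn above, assuming a toy two-variable structural causal model: intervening on x1 also updates its causal descendant x2, whereas a purely observational perturbation changes x1 while leaving x2 inconsistent. The structural equations, noise value, and downstream model are illustrative assumptions.

```python
# Illustrative SCM; structural equations, noise, and the downstream model are assumptions.
def scm(x1, u2):
    """Structural equation: x2 is caused by x1 plus exogenous noise u2."""
    return x1, 2.0 * x1 + u2

def model(x1, x2):
    # The predictive model whose behaviour we want to explain.
    return 0.5 * x1 + 0.3 * x2

u2 = 0.1                                   # exogenous noise (recovered by abduction in practice)
x1, x2 = scm(1.0, u2)                      # factual sample
cf_x1, cf_x2 = scm(2.0, u2)                # counterfactual under do(x1 := 2.0): x2 is recomputed

print("factual prediction        :", model(x1, x2))
print("counterfactual prediction :", model(cf_x1, cf_x2))
print("naive feature perturbation:", model(2.0, x2))   # x2 left inconsistent with x1
```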
As network services develop, people's privacy requirements continue to increase. Beyond providing anonymous communication for users, it is also necessary to protect the anonymity of servers. At the same time, the dark net carries many threatening criminal messages, yet many researchers lack the tools or expertise to study dark-net threat intelligence. This paper therefore designs a Hadoop-based framework for hidden (dark-net) threat intelligence. The framework uses HDFS as the underlying storage system and builds an HBase-based distributed database to store and manage threat intelligence information. According to the heterogeneous structure of each forum, a web crawler collects data through the anonymous Tor tool. The framework is used to identify the characteristics of key dark-net criminal networks, providing a basis for later dark-net research.
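A minimal sketch of the collection path described above: fetch a forum page through a local Tor SOCKS proxy and store the raw page in an HBase table via the Thrift gateway. It assumes a Tor client listening on 127.0.0.1:9050, an HBase Thrift server reachable at a host named hbase-master, and a pre-created table threat_intel with column family raw; the onion address is a placeholder, not a real source.

```python
# Illustrative only: proxy address, HBase host/table, and the onion URL are placeholders.
import requests            # pip install requests[socks]
import happybase           # pip install happybase

TOR_PROXIES = {
    "http":  "socks5h://127.0.0.1:9050",   # socks5h lets Tor resolve .onion names
    "https": "socks5h://127.0.0.1:9050",
}

def crawl_page(url):
    resp = requests.get(url, proxies=TOR_PROXIES, timeout=60)
    resp.raise_for_status()
    return resp.text

def store_page(row_key, html):
    conn = happybase.Connection("hbase-master")        # HBase Thrift gateway
    conn.table("threat_intel").put(row_key, {b"raw:html": html.encode("utf-8")})
    conn.close()

if __name__ == "__main__":
    url = "http://exampleforumxxxxxxxx.onion/board/1"   # placeholder address
    store_page(b"forum1|board1|page1", crawl_page(url))
```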
Big data processing systems are becoming increasingly present in cloud workloads. Consequently, they are starting to incorporate more sophisticated mechanisms from traditional database and distributed systems. In this work we focus on caching policies, which for big data raise important new challenges. Not only must they respond to new variants of the trade-off between hit rate, response time, and the space consumed by the cache, but they must do so at possibly higher volume and velocity than web and database workloads. Previous caching policies have not been tested experimentally with big data workloads. We address these challenges in this work. We propose the Read Density family of policies, a principled approach that quantifies the utility of cached objects through a family of utility functions depending on the frequency of reads of an object. We further design the Approximate Histogram, a policy-based technique built on an array of counters, which promises runtime- and space-efficient computation of the metric required by the cache policy. Through trace-based simulation, we evaluate the caching policies from the Read Density family and compare them with over ten state-of-the-art alternatives, using two workload traces representative of big data processing, collected from commercial Spark and MapReduce deployments. While we achieve performance comparable to the state of the art with fewer parameters, meaningful performance improvements for big data workloads remain elusive.
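The sketch below conveys the flavor of a frequency-driven caching policy backed by a fixed array of counters, in the spirit of the Read Density / Approximate Histogram combination described above. The utility function, counter-array size, and object sizes are illustrative assumptions, not the paper's actual policies.

```python
# Illustrative policy: the utility function, counter-array size, and sizes are assumptions.
class FrequencyUtilityCache:
    def __init__(self, capacity_bytes, num_counters=1024):
        self.capacity = capacity_bytes
        self.used = 0
        self.store = {}                      # key -> (value, size in bytes)
        self.counters = [0] * num_counters   # approximate per-key read counts

    def _count(self, key, bump=False):
        idx = hash(key) % len(self.counters)
        if bump:
            self.counters[idx] += 1
        return self.counters[idx]

    def _utility(self, key, size):
        # Utility grows with read frequency and shrinks with the space consumed.
        return self._count(key) / float(size)

    def get(self, key):
        self._count(key, bump=True)          # record the read even on a miss
        entry = self.store.get(key)
        return entry[0] if entry else None

    def put(self, key, value, size):
        while self.used + size > self.capacity and self.store:
            victim = min(self.store, key=lambda k: self._utility(k, self.store[k][1]))
            self.used -= self.store.pop(victim)[1]
        if self.used + size <= self.capacity:
            self.store[key] = (value, size)
            self.used += size

cache = FrequencyUtilityCache(capacity_bytes=100)
cache.put("blockA", b"...", size=60)
cache.get("blockA")                          # one read recorded for blockA
cache.put("blockB", b"...", size=60)         # evicts the lowest-utility object to make room
```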
With the rapid increase in the use of mobile devices in people's daily lives, mobile data traffic has exploded in recent years. In the edge computing environment, where edge servers are deployed close to mobile users, caching popular data on edge servers can give mobile users fast access to those data and reduce the traffic between mobile users and the centralized cloud. Existing studies consider the data caching problem with a focus on reducing network delay and improving mobile devices' energy efficiency. In this paper, we attack the data caching problem in the edge computing environment from the perspective of service providers, who would like to maximize the revenue obtained from caching their data. This problem is complicated because data caching produces benefits at a cost, and there is usually a trade-off between the two. We formulate the data caching problem as an integer programming problem that maximizes the revenue of the service provider while satisfying a constraint on data access latency. Extensive experiments are conducted on a real-world dataset containing the locations of edge servers and mobile users, and the results show that our approach significantly outperforms the baseline approaches.
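A minimal sketch, using PuLP, of the kind of integer program described above: binary placement variables select which data items to cache on which edge servers so that revenue is maximized while average access latency stays below a bound. The revenue, latency, and capacity numbers, and the exact constraint forms, are illustrative assumptions rather than the paper's formulation or dataset.

```python
# Illustrative model and data: revenue, latency, capacities, and constraints are assumptions.
from pulp import LpProblem, LpMaximize, LpVariable, lpSum, LpBinary, PULP_CBC_CMD, value

servers, items = ["s1", "s2"], ["d1", "d2", "d3"]
revenue  = {("s1", "d1"): 5, ("s1", "d2"): 3, ("s1", "d3"): 2,
            ("s2", "d1"): 4, ("s2", "d2"): 6, ("s2", "d3"): 1}
latency  = {("s1", "d1"): 10, ("s1", "d2"): 20, ("s1", "d3"): 15,
            ("s2", "d1"): 25, ("s2", "d2"): 10, ("s2", "d3"): 30}
capacity = {"s1": 2, "s2": 2}                 # max items cached per server
max_avg_latency = 18

prob = LpProblem("edge_caching", LpMaximize)
x = {(s, d): LpVariable(f"x_{s}_{d}", cat=LpBinary) for s in servers for d in items}

prob += lpSum(revenue[k] * x[k] for k in x)                    # objective: total revenue
for s in servers:                                              # per-server capacity
    prob += lpSum(x[(s, d)] for d in items) <= capacity[s]
# Average latency of the chosen placements must stay below the bound (linearized form).
prob += lpSum(latency[k] * x[k] for k in x) <= max_avg_latency * lpSum(x[k] for k in x)

prob.solve(PULP_CBC_CMD(msg=False))
print([k for k, v in x.items() if value(v) > 0.5])
```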
In mobile wireless sensor networks (MWSNs), data imprecision is a common problem, and decision making in real-time applications can be greatly affected by even a minor error. Although many existing techniques exploit the spatio-temporal characteristics of mobile environments, few measure the trustworthiness of sensor data accuracy. We propose a unique online, context-aware data cleaning method that measures trustworthiness by first reducing the candidate set through the analysis of trust parameters used in financial markets theory. Sensors with similar trajectory behaviors are assigned trust scores estimated through the calculation of "betas" to find the most accurate data to trust. Instead of placing all trust in a single candidate sensor's data to perform the cleaning, a Diversified Trust Portfolio (DTP) is generated from the selected set of spatially autocorrelated candidate sensors. Our results show that samples cleaned by the proposed method exhibit lower percent error than two well-known and effective data cleaning algorithms in the tested outdoor and indoor scenarios.
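A minimal sketch of the portfolio idea described above: each candidate sensor's recent readings are compared to the target sensor's series via a "beta" (covariance over variance, as in financial markets theory), candidates whose betas are closest to 1 receive the highest weights, and the cleaned value is the weighted combination rather than a single sensor's reading. The window length and the weighting rule are illustrative assumptions.

```python
# Illustrative only: the reading windows and the beta-based weighting rule are assumptions.
import numpy as np

def beta(candidate, reference):
    """Beta of a candidate series with respect to the reference series."""
    return np.cov(candidate, reference)[0, 1] / np.var(reference, ddof=1)

def clean_value(reference_window, candidate_windows, candidate_current):
    betas = np.array([beta(c, reference_window) for c in candidate_windows])
    # Trust weights: candidates whose behaviour tracks the reference (beta near 1) weigh more.
    weights = 1.0 / (1e-6 + np.abs(betas - 1.0))
    weights /= weights.sum()
    return float(np.dot(weights, candidate_current))

# Toy usage: the target sensor's recent readings and three neighbouring sensors.
reference = [20.1, 20.4, 20.8, 21.0]
candidates = [[20.0, 20.3, 20.9, 21.1],     # tracks the reference closely
              [19.0, 19.2, 19.1, 19.3],     # flat, weak co-movement
              [25.0, 26.5, 24.0, 27.0]]     # noisy outlier
print(clean_value(reference, candidates, candidate_current=[21.2, 19.4, 28.0]))
```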
The rapid development of the Internet has recently resulted in massive information overload. This information is usually represented by high-dimensional feature vectors in many related applications such as recognition, classification, and retrieval. These applications need efficient indexing and search methods for such large-scale, high-dimensional databases, which is typically a challenging task. Some efforts have addressed this problem to some extent, but most are implemented on a single machine and are therefore not suitable for large-scale databases. In this paper, we present a novel data index structure and nearest-neighbor search algorithm implemented on Apache Spark. We impose a grid on the database and index data by non-empty grid cells. This grid-based index structure is simple and easy to implement in parallel. Moreover, we build a scalable KNN graph on the grid cells, which increases the efficiency of the index structure at low cost in a parallel implementation. Finally, experiments on both public and synthetic databases show that the proposed methods achieve high overall performance in both efficiency and accuracy.
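A minimal sketch of the grid index described above, in plain Python rather than Spark: points are bucketed by grid cell, and a nearest-neighbor query only examines the query point's cell and its adjacent cells. The cell size, the 2-D setting, and the data are illustrative assumptions; the paper distributes the same cell-level structure across Spark partitions and additionally builds a KNN graph over the non-empty cells.

```python
# Illustrative only: cell size, 2-D data, and the single-ring search are assumptions.
from collections import defaultdict
from math import dist, floor

CELL = 1.0   # grid cell width

def cell_of(p):
    return (floor(p[0] / CELL), floor(p[1] / CELL))

def build_index(points):
    index = defaultdict(list)               # non-empty cells only
    for p in points:
        index[cell_of(p)].append(p)
    return index

def nearest(index, q):
    cx, cy = cell_of(q)
    # Examine the query cell and its 8 neighbours; a full implementation widens
    # the ring (or follows the KNN graph over cells) until a neighbour is found.
    candidates = [p for dx in (-1, 0, 1) for dy in (-1, 0, 1)
                  for p in index.get((cx + dx, cy + dy), [])]
    return min(candidates, key=lambda p: dist(p, q)) if candidates else None

points = [(0.2, 0.3), (0.9, 1.1), (2.5, 2.4), (3.7, 0.2)]
print(nearest(build_index(points), (1.0, 1.0)))    # -> (0.9, 1.1)
```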
Rapid advancement in wearable technology has unlocked tremendous potential for applications in the medical domain. Among the challenges in making the technology more useful for medical purposes is the lack of confidence in the data generated and communicated, and incentives have already led to attacks on such systems. We propose a novel lightweight scheme to securely log the data from body-worn sensing devices by utilizing neighboring devices as witnesses that store fingerprints of the data in Bloom filters, to be used later for forensics. Medical data from each sensor are stored at various locations in the system in chronological epoch-level blocks chained together, similar to a blockchain. Besides secure logging, the scheme also protects other contextual information such as localization and timestamping. We demonstrate the effectiveness of the scheme through experimental results, define its performance parameters, and quantify their cost-benefit trade-offs through simulation.
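A minimal sketch of the witness-side logging described above: a neighboring device inserts a fingerprint of each observed epoch-level block into a Bloom filter, and an auditor can later test whether a claimed block was actually witnessed. The filter size, hash construction, and epoch-block format are illustrative assumptions.

```python
# Illustrative only: filter size, hash construction, and the epoch-block format are assumptions.
import hashlib

class BloomFilter:
    def __init__(self, num_bits=4096, num_hashes=4):
        self.bits = bytearray(num_bits // 8)
        self.num_bits = num_bits
        self.num_hashes = num_hashes

    def _positions(self, item: bytes):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(bytes([i]) + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: bytes):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def probably_contains(self, item: bytes) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))

# The witness stores the fingerprint of an observed epoch block; an auditor checks it later.
witness = BloomFilter()
epoch_block = b"sensor42|epoch17|hr=61,62,63|prev=ab12..."
witness.add(hashlib.sha256(epoch_block).digest())
print(witness.probably_contains(hashlib.sha256(epoch_block).digest()))        # True
print(witness.probably_contains(hashlib.sha256(b"tampered block").digest()))  # False (w.h.p.)
```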
Named Data Networking (NDN) is a clean-slate future Internet architecture. NDN provides intelligent data retrieval using name-based symmetrical forwarding of Interest/Data packets and in-network caching. The continually increasing demand for rapid dissemination of large-scale scientific data is driving the use of NDN in data-intensive science experiments. In this paper, we establish an intercontinental NDN testbed, on which we design and implement an NDN-based application targeting climate science as an example data-intensive science application, with features that differentiate it from previous studies. We experimentally justify the use of NDN for climate science on the intercontinental network through performance comparisons between classical delivery techniques and NDN-based climate data delivery.
Outsourcing services to third-party providers comes with a high security cost: the providers must be fully trusted. Using trusted hardware can help, but current trusted execution environments do not adequately support services that process very large datasets. We present LASTGT, a system that bridges this gap by supporting the execution of self-contained services over a large state, with a small and generic trusted computing base (TCB). LASTGT uses widely deployed trusted hardware to guarantee integrity and verifiability of the execution on a remote platform, and it securely supplies data to the service through simple techniques based on virtual memory. As a result, LASTGT is general and applicable to many scenarios such as computational genomics and databases, as we show in our experimental evaluation based on an implementation of LASTGT on a secure hypervisor. We also describe a possible implementation on Intel SGX.
Blockchain is an emerging technology for decentralized and transactional data sharing across a large network of untrusted participants. It enables new forms of distributed software architectures in which components can reach agreement on their shared state without trusting a central integration point or any particular participating component. Viewing the blockchain as a software connector helps make explicit the important architectural considerations about the resulting performance and quality attributes (for example, security, privacy, scalability, and sustainability) of the system. Based on our experience in several projects using blockchain, in this paper we provide rationales to support the architectural decision of whether to employ a decentralized blockchain as opposed to other software solutions, such as traditional shared data storage. Additionally, we explore specific implications of using the blockchain as a software connector, including design trade-offs regarding quality attributes.
Data sharing is an important functionality in cloud storage. Cloud storage providers are responsible for keeping the data available and accessible and for keeping the physical environment protected and running. The goal is to share data with others in cloud storage securely, efficiently, and flexibly. A new public-key cryptosystem is proposed that produces constant-size ciphertexts, so that efficient delegation of decryption rights for any set of ciphertexts is achievable. The novelty is that one can aggregate any set of secret keys and make them as compact as a single key, while encompassing the power of all the keys being aggregated. This compact aggregate key can easily be sent to others or stored on a smart card with very limited secure storage. In KAC (key-aggregate cryptosystem), a user encrypts each file with its own key, and aggregate keys covering two or more files are formed using a tree structure. In this way, the user can share multiple files with a single key at a time.
Recent technology shifts such as cloud computing, the Internet of Things, and big data lead to a significant transfer of sensitive data out of trusted edge networks. To counter the resulting privacy concerns, we must ensure that this sensitive data is not inadvertently forwarded to third parties, used for unintended purposes, or handled and stored in violation of legal requirements. Related work proposes to solve this challenge by annotating data with privacy policies before the data leaves the control sphere of its owner. However, we find that existing privacy policy languages are either not flexible enough or require excessive processing, storage, or bandwidth resources, which prevents their widespread deployment. To fill this gap, we propose CPPL, a Compact Privacy Policy Language that compresses privacy policies by taking advantage of flexibly specifiable domain knowledge. Our evaluation shows that CPPL reduces policy sizes by two orders of magnitude compared to related work and can check several thousand policies per second. This allows for individual per-data-item policies in the context of cloud computing, the Internet of Things, and big data.
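A toy illustration of the compression idea described above: when shared domain knowledge fixes the possible purposes and storage locations in advance, a policy can be encoded as a few bits referencing that specification instead of as verbose text. The domain lists, field widths, and example policy are illustrative assumptions and do not reflect CPPL's actual encoding.

```python
# Illustrative only: the domain lists, field widths, and example policy are assumptions.
PURPOSES  = ["billing", "analytics", "research", "marketing"]   # shared domain specification
LOCATIONS = ["eu", "us", "own-premises", "any"]

def encode(policy):
    """Pack a policy into one byte: 2 bits purpose, 2 bits location, 1 bit delete flag."""
    word = (PURPOSES.index(policy["purpose"]) << 3) \
         | (LOCATIONS.index(policy["location"]) << 1) \
         | int(policy["delete_after_use"])
    return word.to_bytes(1, "big")

def decode(blob):
    word = blob[0]
    return {"purpose": PURPOSES[(word >> 3) & 0b11],
            "location": LOCATIONS[(word >> 1) & 0b11],
            "delete_after_use": bool(word & 1)}

policy = {"purpose": "research", "location": "eu", "delete_after_use": True}
blob = encode(policy)                        # 1 byte instead of a verbose textual policy
print(len(blob), decode(blob))
```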
The fuzzy c-means algorithm is used to identify clusters of similar objects within a data set, but it cannot be applied directly to incomplete data. In this paper, we propose a novel fuzzy c-means algorithm for clustering incomplete data based on the size of the missing-attribute interval. In the new algorithm, the incomplete data set is transformed into an interval data set according to the nearest-neighbor rule. Each missing attribute value is replaced by the median of the corresponding interval, and the interval size is added as an additional property of the incomplete record to control its effect in clustering. Experiments on standard UCI data sets show that our approach outperforms other clustering methods for incomplete data.
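A minimal sketch of the preprocessing plus standard fuzzy c-means: a missing attribute is turned into an interval spanning that attribute's values among the record's nearest complete neighbors, the interval median is imputed, and plain FCM runs on the completed data. The neighbor count, fuzzifier m, cluster count, and toy data are illustrative assumptions; the paper additionally feeds the interval size into the clustering as an extra property.

```python
# Illustrative only: neighbour count, fuzzifier m, cluster count, and data are assumptions.
import numpy as np

def impute_by_neighbour_interval(X, n_neighbours=2):
    X = X.astype(float).copy()
    complete = X[~np.isnan(X).any(axis=1)]               # records with no missing values
    for i, row in enumerate(X):
        for j in np.where(np.isnan(row))[0]:
            known = ~np.isnan(row)
            d = np.linalg.norm(complete[:, known] - row[known], axis=1)
            neigh = complete[np.argsort(d)[:n_neighbours], j]
            X[i, j] = (neigh.min() + neigh.max()) / 2.0   # median of the neighbour interval
    return X

def fuzzy_c_means(X, c=2, m=2.0, iters=50):
    U = np.random.dirichlet(np.ones(c), size=len(X))      # initial membership matrix
    for _ in range(iters):
        W = U ** m
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-10
        inv = 1.0 / d ** (2.0 / (m - 1.0))
        U = inv / inv.sum(axis=1, keepdims=True)          # standard FCM membership update
    return centers, U

X = np.array([[1.0, 1.1], [0.9, np.nan], [5.0, 5.2], [np.nan, 4.9]])
centers, U = fuzzy_c_means(impute_by_neighbour_interval(X))
print(centers)
```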