Biblio
Security and privacy of big data becomes challenging as data grows and more accessible by more and more clients. Large-scale data storage is becoming a necessity for healthcare, business segments, government departments, scientific endeavors and individuals. Our research will focus on the privacy, security and how we can make sure that big data is secured. Managing security policy is a challenge that our framework will handle for big data. Privacy policy needs to be integrated, flexible, context-aware and customizable. We will build a framework to receive data from customer and then analyze data received, extract privacy policy and then identify the sensitive data. In this paper we will present the techniques for privacy policy which will be created to be used in our framework.
Many of the game-changing innovations the Internet brought and continues to bring to all of our daily professional and private lifes come with privacy-related costs. The more day-to-day activities are based on the Internet, the more personal data are generated, collected, stored and used. Big Data, Internet of Things, cyber-physical-systems and similar trends will be based on even more personal information all of us use and produce constantly. Three major points are to be noted here: First, there is no common European or even worldwide agreement whether and in how far these collections need to be limited. There is, though, no common privacy law âĂŞ neither in Europe nore worldwide. Second, laws that do exist constantly fail in steering the developments. Technology innovations come so fast, are so disruptive and so market-demand driven, that an ex-post control by law and courts constantly comes late and/or is circumvented and/or ignored. Third, lack of consensus and lack of steering lead to huge data accumulations and market monopolies built up very quickly and held by very few companies working on a global level with data driven business models. These early movers are in many cases in very dominant market positions making it not only more difficult to regulate their behavior but also to keep the markets open for future competitors. This workshop will evaluate current European and international attempts to deal with this situation. Although all four panelists have a legal background, the meeting will be less interested in an in-depth review of existing laws and their impact, but more in the underlying technological and ethical principles (and their inconsistencies) leading to the sitation described. Specific attention will be attributed to technology driven attempts to deal with the situation, such as privacy by design, privacy by default, usable privacy etc.
In recent years, prodigious explosion of social network services may trigger new business models. However, it has negative aspects such as personal information spill or spamming, as well. Amongst conventional spam detection approaches, the studies which are based on vertex degrees or Local Clustering Coefficient have been caused false positive results so that normal vertices can be specified as spammers. In this paper, we propose a novel approach by employing the circuit structure in the social networks, which demonstrates the advantages of our work through the experiment.
Online-activity-generated digital traces provide opportunities for novel services and unique insights as demonstrated in, for example, research on mining software repositories. The inability to link these traces within and among systems, such as Twitter, GitHub, or Reddit, inhibit the advances in this area. Furthermore, no single approach to integrate data from these disparate sources is likely to work. We aim to design Foreseer, an extensible framework, to design and evaluate identity matching techniques for public, large, and low-accuracy operational data. Foreseer consists of three functionally independent components designed to address the issues of discovery and preparation, storage and representation, and analysis and linking of traces from disparate online sources. The framework includes a domain specific language for manipulating traces, generating insights, and building novel services. We have applied it in a pilot study of roughly 10TB of data from Twitter, Reddit, and StackExchange including roughly 6M distinct entities and, using basic matching techniques, found roughly 83,000 matches among these sources. We plan to add additional entity extraction and identification algorithms, data from other sources, and design tools for facilitating dynamic ingestion and tagging of incoming data on a more robust infrastructure using Apache Spark or another distributed processing framework. We will then evaluate the utility and effectiveness of the framework in applications ranging from identifying malicious contributors in software repositories to the evaluation of the utility of privacy preservation schemes.
The rapid growth in the size and scope of datasets in science and technology has created a need for novel foundational perspectives on data analysis that blend the inferential and computational sciences. That classical perspectives from these fields are not adequate to address emerging problems in "Big Data" is apparent from their sharply divergent nature at an elementary level-in computer science, the growth of the number of data points is a source of "complexity" that must be tamed via algorithms or hardware, whereas in statistics, the growth of the number of data points is a source of "simplicity" in that inferences are generally stronger and asymptotic results can be invoked. On a formal level, the gap is made evident by the lack of a role for computational concepts such as "runtime" in core statistical theory and the lack of a role for statistical concepts such as "risk" in core computational theory. I present several research vignettes aimed at bridging computation and statistics, including the problem of inference under privacy and communication constraints, and ways to exploit parallelism so as to trade off the speed and accuracy of inference.
Cloud computing has been a great enabler for both the Internet of Things and Big Data. However, as with all new computing developments, development of the technology is usually much faster than consideration for, and development of, solutions for security and privacy. In a previous paper, we proposed that a unikernel solution could be used to improve security and privacy in a cloud scenario. In this paper, we outline how we might apply this approach to the Internet of Things, which can demonstrate an improvement over existing approaches.
As more and more cyber security incident data ranging from systems logs to vulnerability scan results are collected, manually analyzing these collected data to detect important cyber security events become impossible. Hence, data mining techniques are becoming an essential tool for real-world cyber security applications. For example, a report from Gartner [gartner12] claims that "Information security is becoming a big data analytics problem, where massive amounts of data will be correlated, analyzed and mined for meaningful patterns". Of course, data mining/analytics is a means to an end where the ultimate goal is to provide cyber security analysts with prioritized actionable insights derived from big data. This raises the question, can we directly apply existing techniques to cyber security applications? One of the most important differences between data mining for cyber security and many other data mining applications is the existence of malicious adversaries that continuously adapt their behavior to hide their actions and to make the data mining models ineffective. Unfortunately, traditional data mining techniques are insufficient to handle such adversarial problems directly. The adversaries adapt to the data miner's reactions, and data mining algorithms constructed based on a training dataset degrades quickly. To address these concerns, over the last couple of years new and novel data mining techniques which is more resilient to such adversarial behavior are being developed in machine learning and data mining community. We believe that lessons learned as a part of this research direction would be beneficial for cyber security researchers who are increasingly applying machine learning and data mining techniques in practice. To give an overview of recent developments in adversarial data mining, in this three hour long tutorial, we introduce the foundations, the techniques, and the applications of adversarial data mining to cyber security applications. We first introduce various approaches proposed in the past to defend against active adversaries, such as a minimax approach to minimize the worst case error through a zero-sum game. We then discuss a game theoretic framework to model the sequential actions of the adversary and the data miner, while both parties try to maximize their utilities. We also introduce a modified support vector machine method and a relevance vector machine method to defend against active adversaries. Intrusion detection and malware detection are two important application areas for adversarial data mining models that will be discussed in details during the tutorial. Finally, we discuss some practical guidelines on how to use adversarial data mining ideas in generic cyber security applications and how to leverage existing big data management tools for building data mining algorithms for cyber security.
Even though some seem to think privacy is dead, we are all still wearing clothes, as Bruce Schneier observed at a recent conference on surveillance[1]. Yet big data and big data analytics are leaving some of us feeling a bit more naked than before. This talk will provide some personal observations on privacy today and then outline some research areas where progress is needed to enable society to gain the benefits of analyzing large datasets without giving up more privacy than necessary. Not since the early 1970s, when computing pioneer Willis Ware chaired the committee that produced the initial Fair Information Practice Principles [2] has privacy been so much in the U.S. public eye. Snowden's revelations, as well as a growing awareness that merely living our lives seems to generate an expanding "digital exhaust." Have triggered many workshops and meetings. A national strategy for privacy research is in preparation by a Federal interagency group. The ability to analyze large datasets rapidly and to extract commercially useful insights from them is spawning new industries. Must this industrial growth come at the cost of substantial privacy intrusions?
The usual approach to security for cloud-hosted applications is strong separation. However, it is often the case that the same data is used by different applications, particularly given the increase in data-driven (`big data' and IoT) applications. We argue that access control for the cloud should no longer be application-specific but should be data-centric, associated with the data that can flow between applications. Indeed, the data may originate outside cloud services from diverse sources such as medical monitoring, environmental sensing etc. Information Flow Control (IFC) potentially offers data-centric, system-wide data access control. It has been shown that IFC can be provided at operating system level as part of a PaaS offering, with an acceptable overhead. In this paper we consider how IFC can be integrated with application-specific access control, transparently from application developers, while building from simple IFC primitives, access control policies that align with the data management obligations of cloud providers and tenants.
The massive amount of data that is being collected by today's society has the potential to advance scientific knowledge and boost innovations. However, people often lack sufficient computing resources to analyze their large-scale data in a cost-effective and timely way. Cloud computing offers access to vast computing resources on an on-demand and pay-per-use basis, which is a practical way for people to analyze their huge data sets. However, since their data contain sensitive information that needs to be kept secret for ethical, security, or legal reasons, many people are reluctant to adopt cloud computing. For the first time in the literature, we propose a secure outsourcing algorithm for large-scale quadratic programs (QPs), which is one of the most fundamental problems in data analysis. Specifically, based on simple linear algebra operations, we design a low-complexity QP transformation that protects the private data in a QP. We show that the transformed QP is computationally indistinguishable under a chosen plaintext attack (CPA), i.e., CPA-secure. We then develop a parallel algorithm to solve the transformed QP at the cloud, and efficiently find the solution to the original QP at the user. We implement the proposed algorithm on the Amazon Elastic Compute Cloud (EC2) and a laptop. We find that our proposed algorithm offers significant time savings for the user and is scalable to the size of the QP.
While cloud computing has become an attractive platform for supporting data intensive applications, a major obstacle to the adoption of cloud computing in sectors such as health and defense is the privacy risk associated with releasing datasets to third-parties in the cloud for analysis. A widely-adopted technique for data privacy preservation is to anonymize data via local recoding. However, most existing local-recoding techniques are either serial or distributed without directly optimizing scalability, thus rendering them unsuitable for big data applications. In this paper, we propose a highly scalable approach to local-recoding anonymization in cloud computing, based on Locality Sensitive Hashing (LSH). Specifically, a novel semantic distance metric is presented for use with LSH to measure the similarity between two data records. Then, LSH with the MinHash function family can be employed to divide datasets into multiple partitions for use with MapReduce to parallelize computation while preserving similarity. By using our efficient LSH-based scheme, we can anonymize each partition through the use of a recursive agglomerative \$k\$-member clustering algorithm. Extensive experiments on real-life datasets show that our approach significantly improves the scalability and time-efficiency of local-recoding anonymization by orders of magnitude over existing approaches.
- « first
- ‹ previous
- 1
- 2
- 3