Biblio
In the era of Big Data, software systems are affected by growing complexity in both their functional and non-functional requirements. As more and more people use software applications over the web, recognizing whether traffic is malicious or legitimate becomes a challenge. The traffic load on security controllers, as well as the complexity of the security rules used to detect attacks, can grow to levels where current solutions may not suffice. In this work, we propose a hierarchical distributed architecture for security control that partitions responsibility and workload among many security controllers. In addition, our architecture offers a simpler way of defining security rules, so that security can be enforced at an operational level rather than a development level.
NoSQL databases have become popular with enterprises due to their scalable and flexible storage management of big data. Nevertheless, their popularity also raises security concerns. Most NoSQL databases lack secure data encryption, relying on developers to implement cryptographic methods at the application level or in a middleware layer that wraps the database. While this approach protects the integrity of data, it makes executing queries more difficult. We were motivated to design a system that not only provides NoSQL databases with the necessary data security but also supports query execution over encrypted data. Furthermore, exploiting the distributed nature of NoSQL databases to deliver high performance and scalability under massive client access is another important challenge. In this research, we introduce Crypt-NoSQL, the first prototype to support query execution over encrypted data on NoSQL databases with high performance. Three different models of Crypt-NoSQL are proposed, and their performance was evaluated with the Yahoo! Cloud Serving Benchmark (YCSB) under an enormous number of clients. Our experimental results show that Crypt-NoSQL can process queries over encrypted data with high performance and scalability. Guidance on establishing service level agreements (SLAs) for Crypt-NoSQL as a cloud service is also proposed.
Cloud computing is emerging as a foundational technology for the future and, building on the concept of utility computing, is often claimed to represent a wholly new paradigm for viewing and accessing computational resources. Because of security concerns, many customers hesitate to move their sensitive data to the cloud, despite great interest in cloud-based computing. Security is a serious problem, since many organisations present an attractive target for intruders, and these concerns will continue to slow the advancement of cloud computing if left unaddressed. Hence, recent investigation and insight point toward honeypots. Distributed Denial of Service (DDoS) is an attack that threatens the availability of cloud services, so it is essential to examine the main features of DDoS defence procedures. This paper presents specific techniques that have been applied against DDoS attacks, outlining these approaches and the technologies used for particular kinds of malfunctioning in the cloud.
Over the last several decades, the rapid development of information technology and computer performance has accelerated the generation, transportation and accumulation of digital data, which has come to be called "Big Data". In this context, researchers and companies are eager to utilize the data to create new value or manage a wide range of issues, and much focus is being placed on "Data Science" to extract useful information (knowledge) from digital data. Data Science has developed since the 1800s from several independent fields such as Mathematics/Operations Research, Computer Science, Data Engineering, Visualization and Statistics, and in recent years Artificial Intelligence has converged with this stream. At the same time, national projects have been established to utilize data for society while addressing concerns about security and privacy. In this paper, through a detailed analysis of the history of this field, the processes of development and integration among related fields are discussed, along with a comparison between Japan and the United States. This paper also includes a brief discussion of future directions.
In this paper, the network intrusion detection problem domain is described mathematically, and a novel IDS detection model based on immune mechanisms is designed. We study the key modules of the IDS, the detector tolerance module, and the IDS detection algorithms in depth. The continuous bit matching algorithm for computing affinity is then improved through further analysis. At the same time, we adopt controllable variation and random variation, as well as dynamic demotion, to improve the dynamic clonal selection algorithm. Finally, experimental simulations verify that the novel artificial immune algorithm achieves a better detection rate and a lower noise factor.
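The abstract does not spell out the improved matching algorithm, but the baseline it builds on, the continuous (r-contiguous) bit matching rule used to compute affinity between detectors and antigens, can be sketched as follows. This is a minimal illustration of that rule only, not the paper's improved variant.

```python
def affinity(detector: str, antigen: str) -> int:
    """Affinity under the continuous (r-contiguous) bit matching rule:
    the length of the longest run of positions on which the two
    equal-length bit strings agree."""
    run = best = 0
    for d, a in zip(detector, antigen):
        run = run + 1 if d == a else 0
        best = max(best, run)
    return best

def matches(detector: str, antigen: str, r: int) -> bool:
    """A detector recognises an antigen if at least r contiguous bits match."""
    return affinity(detector, antigen) >= r

# The strings below agree on 3 contiguous positions (indices 1-3).
assert affinity("1011", "0011") == 3
assert matches("1011", "0011", r=3) and not matches("1011", "0011", r=4)
```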
Cloud computing enables the outsourcing of big data analytics, where a third-party server is responsible for data management and processing. In this paper, we consider the outsourcing model in which a third-party server provides record matching as a service. In particular, given a target record, the service provider returns all records from the outsourced dataset that match the target according to specific distance metrics. Identifying matching records in databases plays an important role in information integration and entity resolution. A major security concern of this outsourcing paradigm is whether the service provider returns the correct record matching results. To address this problem, we design EARRING, an Efficient Authentication of outsouRced Record matchING framework. EARRING requires the service provider to construct a verification object (VO) for the record matching results. From the VO, the client is able to catch any incorrect result at low computational cost. Experimental results on real-world datasets demonstrate the efficiency of EARRING.
This paper describes the various malware datasets that we have obtained permission to host at the University of Arizona as part of a National Science Foundation funded project, as well as other malware datasets for which we are in the process of obtaining hosting permission. It also discusses preliminary work we have carried out on malware analysis using big data platforms.
Open Science Big Data is emerging as an important area of research and software development. Although there are several high quality frameworks for Big Data, additional capabilities are needed for Open Science Big Data. These include data provenance, citable reusable data, data sources providing links to research literature, relationships to other data and theories, transparent analysis/reproducibility, data privacy, new optimizations/advanced algorithms, data curation, data storage and transfer. An important part of science is explanation of results, ideally leading to theory formation. In this paper, we examine means for supporting the use of theory in big data analytics as well as using big data to assist in theory formation. One approach is to fit data in a way that is compatible with some theory, existing or new. Functional Data Analysis allows precise fitting of data as well as penalties for lack of smoothness or even departure from theoretical expectations. This paper discusses principal differential analysis and related techniques for fitting data where, for example, a time-based process is governed by an ordinary differential equation. Automation in theory formation is also considered. Case studies in the fields of computational economics and finance are considered.
Understanding how to group a set of binary files into the piece of software they belong to is highly desirable for software profiling, malware detection, or enterprise audits, among many other applications. Unfortunately, it is also extremely challenging: there is absolutely no uniformity in the ways different applications rely on different files, in how binaries are signed, or in the versioning schemes used across different pieces of software. In this paper, we show that, by combining information gleaned from a large number of endpoints (millions of computers), we can accomplish large-scale application identification automatically and reliably. Our approach relies on collecting metadata on billions of files every day, summarizing it into much smaller "sketches", and performing approximate k-nearest neighbor clustering on non-metric space representations derived from these sketches. We design and implement our proposed system using Apache Spark, show that it can process billions of files in a matter of hours, and thus could be used for daily processing. We further show our system manages to successfully identify which files belong to which application with very high precision, and adequate recall.
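The abstract leaves the exact sketching technique unspecified; as one plausible illustration, the following minimal Python sketch uses MinHash-style set summaries (an assumption, not the paper's actual design) to show how per-endpoint file metadata can be compressed into fixed-size sketches and compared with (here, brute-force) nearest-neighbour search.

```python
import hashlib

def minhash_sketch(file_ids, num_perm: int = 64):
    """Summarise a (possibly huge) set of file identifiers into a fixed-size
    sketch: for each of num_perm seeded hash functions, keep the minimum value."""
    return [
        min(int.from_bytes(
                hashlib.blake2b(f"{seed}:{f}".encode(), digest_size=8).digest(), "big")
            for f in file_ids)
        for seed in range(num_perm)
    ]

def estimated_jaccard(a, b) -> float:
    """The fraction of matching minima estimates the Jaccard similarity
    of the original file sets."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def nearest_neighbours(query, sketches, k: int = 5):
    """Brute-force k-NN by estimated similarity; a production system would use
    an approximate index instead of scanning every endpoint."""
    return sorted(sketches.items(),
                  key=lambda kv: estimated_jaccard(query, kv[1]),
                  reverse=True)[:k]

# Usage: sketch two endpoints' file sets and compare them.
s1 = minhash_sketch({"app.exe", "lib1.dll", "lib2.dll"})
s2 = minhash_sketch({"app.exe", "lib1.dll", "other.dll"})
print(estimated_jaccard(s1, s2))
```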
Security and privacy of big data become challenging as data grows and becomes accessible to more and more clients. Large-scale data storage is becoming a necessity for healthcare, business segments, government departments, scientific endeavors and individuals. Our research focuses on privacy and security and on how to ensure that big data is secured. Managing security policy for big data is a challenge that our framework addresses. Privacy policy needs to be integrated, flexible, context-aware and customizable. We build a framework that receives data from customers, analyzes the received data, extracts the privacy policy and then identifies the sensitive data. In this paper we present the techniques for creating the privacy policies that will be used in our framework.
Recently, increasing interconnectivity has led to a rising number of IoT-enabled devices in botnets, which are currently used for large-scale DDoS attacks. To keep track of these malicious activities, honeypots have proven to be a vital tool. We developed and set up a distributed, highly scalable WAN honeypot with an attached backend infrastructure for sophisticated processing of the gathered data. To make the processed data understandable, we designed a graphical frontend that displays all relevant information obtained from the data. We group attacks originating from one source within a short period of time into sessions, which enriches the data and enables more in-depth analysis. We produced common statistics on usernames, passwords, username/password combinations, password lengths, originating country and more. From the information gathered, we were able to identify common dictionaries used for brute-force login attacks, as well as more sophisticated statistics such as login attempts per session and attack efficiency.
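As an illustration of the session grouping and credential statistics described above, the following minimal Python sketch groups login attempts from one source into sessions when the gap between attempts is short (the 5-minute threshold is an assumption, not taken from the paper) and tallies the most common usernames, passwords and combinations.

```python
from collections import Counter
from datetime import timedelta

SESSION_GAP = timedelta(minutes=5)  # assumed threshold; the paper does not state one

def group_sessions(events):
    """Group login attempts from the same source IP into sessions.

    `events` is an iterable of (timestamp, src_ip, username, password) tuples,
    assumed to be sorted by timestamp. A new session starts when the gap since
    the previous attempt from that IP exceeds SESSION_GAP."""
    sessions = {}   # src_ip -> list of sessions, each a list of events
    last_seen = {}  # src_ip -> timestamp of the previous attempt
    for ts, ip, user, pw in events:
        if ip not in sessions or ts - last_seen[ip] > SESSION_GAP:
            sessions.setdefault(ip, []).append([])
        sessions[ip][-1].append((ts, user, pw))
        last_seen[ip] = ts
    return sessions

def credential_stats(events, top: int = 10):
    """Common usernames, passwords and username/password combinations."""
    users = Counter(u for _, _, u, _ in events)
    passwords = Counter(p for _, _, _, p in events)
    combos = Counter((u, p) for _, _, u, p in events)
    return users.most_common(top), passwords.most_common(top), combos.most_common(top)
```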
This paper proposes a highly scalable framework for detecting network anomalies at the per-flow level by constructing a meta-model for a family of machine learning algorithms or statistical data models. The approach is scalable and practical because the raw data needs to be accessed only once: it is processed, computed and transformed into a much smaller meta-model matrix that can reside in system RAM. The meta-model matrix is computed through disposable per-row update operations: once a per-flow record has been processed, it is no longer needed for computing the meta-model matrix. While the proposed framework covers both Gaussian and non-Gaussian data, the focus of this work is on linear regression models. Specifically, a new concept called meta-model sufficient statistics is proposed to analyze a group of models, for which exact, not approximate, results are derived. In addition, the proposed framework can quickly discover an optimal statistical or computational model from a family of candidate models without rescanning the raw dataset. This suggests that an extremely efficient and effective theory and method are possible for big data security analysis.
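A minimal sketch of the one-pass idea for the linear regression case: each per-flow record is folded into the sufficient statistics X'X, X'y and y'y and then discarded, after which an exact OLS fit (and residual sum of squares) for the full model or any candidate sub-model can be obtained without rescanning the raw data. Class and method names are illustrative, not the paper's.

```python
import numpy as np

class RegressionSufficientStats:
    """Accumulate X'X, X'y, y'y and n in a single pass over per-flow records."""
    def __init__(self, num_features: int):
        self.xtx = np.zeros((num_features, num_features))
        self.xty = np.zeros(num_features)
        self.yty = 0.0
        self.n = 0

    def update(self, x: np.ndarray, y: float) -> None:
        """Fold one per-flow feature vector x and response y into the statistics;
        the raw record is not needed afterwards."""
        self.xtx += np.outer(x, x)
        self.xty += y * x
        self.yty += y * y
        self.n += 1

    def fit(self, columns=None):
        """Exact OLS fit for the full model, or for any subset of feature
        columns, derived from the accumulated statistics alone."""
        idx = np.arange(self.xtx.shape[0]) if columns is None else np.asarray(columns)
        beta = np.linalg.solve(self.xtx[np.ix_(idx, idx)], self.xty[idx])
        rss = self.yty - self.xty[idx] @ beta  # residual sum of squares
        return beta, rss

# Usage: stream flows once, then compare two candidate models without rescanning.
stats = RegressionSufficientStats(num_features=4)
for x, y in [(np.random.rand(4), np.random.rand()) for _ in range(1000)]:
    stats.update(x, y)
full_beta, full_rss = stats.fit()
sub_beta, sub_rss = stats.fit(columns=[0, 2])
```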
Big data is the next frontier for innovation, competition, and productivity. It underlies major trends such as social networks, mobile devices, healthcare, and stock markets. Big data is efficiently stored in the cloud because of its high-volume, high-velocity and high-variety data resources. Unauthorized user access is the gravest threat to big data in the cloud environment because of remote file storage. Attribute-Based Encryption (ABE) is an efficient access control procedure for guaranteeing end-to-end security for big data in the cloud. Most existing ABE schemes are based on bilinear pairings. In this paper, we construct a novel ABE scheme for big data in the cloud, based on quadratic residues and attribute union and building on the fundamental theorem of arithmetic.
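The abstract does not detail the construction, but the number-theoretic primitive it names, quadratic residuosity, can be illustrated with Euler's criterion. This sketch shows only that primitive, not the proposed ABE scheme.

```python
def is_quadratic_residue(a: int, p: int) -> bool:
    """Euler's criterion: for an odd prime p and a not divisible by p,
    a is a quadratic residue modulo p iff a^((p-1)/2) ≡ 1 (mod p)."""
    return pow(a % p, (p - 1) // 2, p) == 1

# 2 is a quadratic residue mod 7 (3*3 = 9 ≡ 2 mod 7); 3 is not.
assert is_quadratic_residue(2, 7) and not is_quadratic_residue(3, 7)
```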
Based on Storm, a distributed, reliable, fault-tolerant real-time data stream processing system, we propose a web intrusion detection system. The system applies machine learning, TF-IDF (Term Frequency–Inverse Document Frequency) based feature selection, and an optimised cosine similarity algorithm to detect attacks and malicious behavior in real time with a low false positive rate and a high detection rate, protecting the security of user data. Comparative experiments show that the system improves the intrusion recognition rate and false positive rate to some extent and can carry out intrusion detection effectively.
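A minimal sketch (outside Storm) of the feature pipeline the abstract names: TF-IDF vectorisation of HTTP request strings followed by cosine similarity against known malicious payloads. The example payloads and the 0.5 threshold are illustrative assumptions, not the paper's configuration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

known_attacks = [
    "GET /index.php?id=1' OR '1'='1",           # SQL injection pattern (example)
    "GET /search?q=<script>alert(1)</script>",  # XSS pattern (example)
]
incoming = ["GET /index.php?id=42' OR '1'='1 --", "GET /home"]

# Character n-gram TF-IDF captures payload fragments regardless of tokenisation.
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))
attack_vectors = vectorizer.fit_transform(known_attacks)
request_vectors = vectorizer.transform(incoming)

# Score each incoming request by its highest similarity to any known attack.
scores = cosine_similarity(request_vectors, attack_vectors).max(axis=1)
for request, score in zip(incoming, scores):
    flagged = score > 0.5  # illustrative threshold
    print(f"{score:.2f}  {'ALERT' if flagged else 'ok   '}  {request}")
```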
We propose a modular framework which deploys state-of-the-art techniques in dynamic pattern matching as well as machine learning algorithms for Big Data predictive and behavioural analytics to detect threats and attacks in Managed File Transfer and collaboration platforms. We leverage the kill chain model by looking for indicators of compromise, whether for long-term attacks such as Advanced Persistent Threats, zero-day attacks, or DDoS attacks. The proposed engine can act complementary to existing security services such as SIEMs, IDSs, IPSs and firewalls.
The rapid growth in the size and scope of datasets in science and technology has created a need for novel foundational perspectives on data analysis that blend the inferential and computational sciences. That classical perspectives from these fields are not adequate to address emerging problems in "Big Data" is apparent from their sharply divergent nature at an elementary level: in computer science, the growth of the number of data points is a source of "complexity" that must be tamed via algorithms or hardware, whereas in statistics, the growth of the number of data points is a source of "simplicity" in that inferences are generally stronger and asymptotic results can be invoked. On a formal level, the gap is made evident by the lack of a role for computational concepts such as "runtime" in core statistical theory and the lack of a role for statistical concepts such as "risk" in core computational theory. I present several research vignettes aimed at bridging computation and statistics, including the problem of inference under privacy and communication constraints, and ways to exploit parallelism so as to trade off the speed and accuracy of inference.
Modelling network traffic is gaining importance as a way to counter modern security threats of ever-increasing sophistication. It is, however, surprisingly difficult and costly to construct reliable classifiers on top of telemetry data, due to the variety and complexity of signals that no human can manage to interpret in full. Obtaining training data with a sufficiently large and variable body of labels can thus be seen as a prohibitive problem. The goal of this work is to detect infected computers by observing their HTTP(S) traffic collected from network sensors, which are typically proxy servers or network firewalls, while relying on only minimal human input in the model training phase. We propose a discriminative model that makes decisions based on all of a computer's traffic observed during a predefined time window (5 minutes in our case). The model is trained on traffic samples collected over equally sized time windows for a large number of computers, where the only labels needed are (human) verdicts about the computer as a whole (presumed infected vs. presumed clean). As part of training, the model itself learns discriminative patterns in traffic targeted at individual servers and constructs the final high-level classifier on top of them. We show that the classifier performs with very high precision and demonstrate that the learned traffic patterns can be interpreted as Indicators of Compromise. We implement the discriminative model as a neural network with a special structure reflecting two stacked multi-instance problems. The main advantages of the proposed configuration include not only improved accuracy and the ability to learn from gross labels, but also automatic learning of the server types (together with their detectors) that are typically visited by infected computers.
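A minimal sketch of a two-level multi-instance network of the kind described: flow feature vectors are embedded and pooled per contacted server, server embeddings are pooled per computer, and a single per-computer label supervises training. Layer sizes and the mean-pooling choice are assumptions for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class TwoLevelMIL(nn.Module):
    """Two stacked multi-instance levels: flows -> server bags -> computer."""
    def __init__(self, flow_dim: int, hidden: int = 64):
        super().__init__()
        self.flow_net = nn.Sequential(nn.Linear(flow_dim, hidden), nn.ReLU())
        self.server_net = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU())
        self.classifier = nn.Linear(hidden, 1)

    def forward(self, servers):
        # `servers` is a list with one tensor per contacted server,
        # each of shape (num_flows_to_that_server, flow_dim).
        server_embs = [self.flow_net(flows).mean(dim=0) for flows in servers]  # pool flows per server
        computer_emb = self.server_net(torch.stack(server_embs)).mean(dim=0)   # pool servers per computer
        return self.classifier(computer_emb)  # logit: presumed infected vs. presumed clean

# Usage: one 5-minute window for one computer that contacted two servers.
window = [torch.randn(12, 10), torch.randn(3, 10)]
model = TwoLevelMIL(flow_dim=10)
logit = model(window)
loss = nn.functional.binary_cross_entropy_with_logits(logit, torch.ones(1))
```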
With the increasing capture of photos and their proliferation on social media, there is a pressing need for a more intuitive and versatile image search and exploration system. Image search systems have long been confined to the bounds of 2D legacy screens and the keyword text box. With the recent advances in Virtual Reality (VR) technology, a move towards an immersive VR environment will redefine the image navigation experience. To this end, we propose a VR platform that gathers images from various sources and addresses the 5 Ws of image search: what, where, when, who and why. We achieve this by providing the user with two modes of interactive exploration: (i) a mode that allows graph-based navigation of an image dataset, using a steering-wheel visualization, along multiple dimensions such as time, location, visual concept and people; and (ii) a mode that provides intuitive exploration of the image dataset using a logical hierarchy of visual concepts. Our contributions include creating a VR image exploration experience that is intuitive and allows image navigation along multiple dimensions.
We introduce a system-level Simulation and Analysis Engine (SAE) framework based on dynamic binary instrumentation for fine-grained and customizable instruction-level introspection of everything that executes on the processor. SAE can instrument the BIOS, kernel, drivers, and user processes. It can also instrument multiple systems simultaneously using a single instrumentation interface, which is essential for studying scale-out applications. SAE is an x86 instruction set simulator designed specifically to enable rapid prototyping, evaluation, and validation of architectural extensions and program analysis tools using its flexible APIs. It is fast enough to execute full platform workloads (a modern operating system can boot in a few minutes), thus enabling research, evaluation, and validation of complex functionalities related to multicore configurations, virtualization, security, and more. To reach high speeds, SAE couples tightly with a virtual platform and employs both a just-in-time (JIT) compiler that helps simulate simple instructions efficiently and a fast interpreter for simulating new or complex instructions. We describe SAE's architecture and instrumentation engine design and show the framework's usefulness for single- and multi-system architectural and program analysis studies.
One essential requirement for supporting analytics for Big Medical Data systems is the provision of a suitable level of traceability to data or processes ('Items') in large volumes of data. Systems should be designed from the outset to support usage of such Items across the spectrum of medical use and over time in order to promote traceability, to simplify maintenance and to assist analytics. The philosophy proposed in this paper is to design medical data systems using a 'description-driven' approach in which meta-data and the description of medical items are saved alongside the data, simplifying item re-use over time and thereby enabling the traceability of these items over time and their use in analytics. Details are given of a big data system in neuroimaging to demonstrate aspects of provenance data capture, collaborative analysis and longitudinal information traceability. Evidence is presented that the description-driven approach leads to simplicity of design and ease of maintenance following the adoption of a unified approach to Item management.
With the advent of social networks and cloud computing, the amount of multimedia data produced and communicated within social networks is rapidly increasing. In the meantime, social networking platforms based on cloud computing have made multimedia big data sharing in social networks easier and more efficient. The growth of social multimedia, as demonstrated by social networking sites such as Facebook and YouTube, combined with advances in multimedia content analysis, underscores potential risks for malicious use, such as illegal copying, piracy, plagiarism, and misappropriation. Therefore, secure multimedia sharing and traitor tracing have become critical and urgent issues in social networks. In this article, a joint fingerprinting and encryption (JFE) scheme based on the tree-structured Haar wavelet transform (TSHWT) is proposed with the purpose of protecting media distribution in social network environments. The motivation is to map the hierarchical community structure of social networks onto the tree structure of the Haar wavelet transform for fingerprinting and encryption. First, a fingerprint code is produced using social network analysis (SNA). Second, the content is decomposed by the TSHWT based on the structure of the fingerprint code. Then, the content is fingerprinted and encrypted in the TSHWT domain. Finally, the encrypted contents are delivered to users via hybrid multicast-unicast. The proposed method is, to the best of our knowledge, the first scalable JFE method for fingerprinting and encryption in the TSHWT domain using SNA. The use of fingerprinting along with encryption based on SNA not only provides a double layer of protection for social multimedia sharing in social network environments but also avoids the big data superposition effect. Theoretical analysis and experimental results show the effectiveness of the proposed JFE scheme.
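The tree-structured decomposition itself is guided by the fingerprint code, which the abstract does not detail; the building block it relies on, a single level of the standard Haar wavelet transform, can be sketched as follows. This is only the generic Haar step, not the TSHWT construction.

```python
import numpy as np

def haar_step(signal: np.ndarray):
    """One level of the Haar wavelet transform: pairwise scaled averages
    (approximation) and differences (detail). Requires an even-length signal."""
    even, odd = signal[0::2], signal[1::2]
    approx = (even + odd) / np.sqrt(2.0)
    detail = (even - odd) / np.sqrt(2.0)
    return approx, detail

def inverse_haar_step(approx: np.ndarray, detail: np.ndarray) -> np.ndarray:
    """Reconstruct the original signal from one decomposition level."""
    out = np.empty(2 * len(approx))
    out[0::2] = (approx + detail) / np.sqrt(2.0)
    out[1::2] = (approx - detail) / np.sqrt(2.0)
    return out

# Round trip: decompose and perfectly reconstruct a small signal.
x = np.array([4.0, 6.0, 10.0, 12.0])
a, d = haar_step(x)
assert np.allclose(inverse_haar_step(a, d), x)
```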
Distributed Denial-of-Service (DDoS) attacks have steadily gained in popularity over the last decade, their intensity ranging from mere nuisance to severe. The increased number of attacks, combined with the loss of revenue for the targets, has given rise to a market for DDoS Protection Service (DPS) providers, to whom victims can outsource the cleansing of their traffic by using traffic diversion. In this paper, we investigate the adoption of cloud-based DPSs worldwide, focusing on nine leading providers. Our outlook on adoption is based on active DNS measurements. We introduce a methodology that allows us, for a given domain name, to determine whether traffic diversion to a DPS is in effect; it also allows us to distinguish various methods of traffic diversion and protection. For our analysis we use a long-term, large-scale dataset that covers well over 50% of all names in the global domain namespace, in daily snapshots, over a period of 1.5 years. Our results show that DPS adoption has grown by 1.24x in our measurement period, a prominent trend compared to the overall expansion of the namespace. Our study also reveals that adoption is often led by big players such as large Web hosters, which activate or deactivate DDoS protection for millions of domain names at once.
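A minimal sketch of the kind of active DNS check such a methodology implies, assuming the dnspython library: resolve a domain's CNAME and NS records and look for name patterns of known DPS providers. The patterns listed are illustrative examples only, not the paper's set of nine providers or its actual classification rules.

```python
import dns.resolver  # dnspython

# Illustrative provider name fragments; the real study uses a curated provider list.
DPS_PATTERNS = ["incapdns", "cloudflare", "akamai", "edgekey"]

def diverted_to_dps(domain: str) -> bool:
    """Return True if the domain's CNAME or NS records point at a known DPS."""
    names = []
    for rdtype in ("CNAME", "NS"):
        try:
            names += [r.to_text().lower() for r in dns.resolver.resolve(domain, rdtype)]
        except (dns.resolver.NoAnswer, dns.resolver.NXDOMAIN, dns.resolver.NoNameservers):
            continue
    return any(pattern in name for name in names for pattern in DPS_PATTERNS)

# Usage: diverted_to_dps("www.example.com")
```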
The image and multimedia data produced by individuals and enterprises is increasing every day. Motivated by the advances in cloud computing, there is a growing need to outsource such computationally intensive image feature detection tasks to the cloud for its economical computing resources and on-demand ubiquitous access. However, concerns over the effective protection of private image and multimedia data when outsourcing it to cloud platforms have become the major barrier that impedes the further adoption of cloud computing techniques over massive amounts of image and multimedia data. To address this fundamental challenge, we study the state-of-the-art image feature detection algorithms and focus on the Scale-Invariant Feature Transform (SIFT), which is one of the most important local feature detection algorithms and has been broadly employed in different areas, including object recognition, image matching, robotic mapping, and so on. We analyze and model the privacy requirements in outsourcing SIFT computation and propose the Secure Scale-Invariant Feature Transform (SecSIFT), a high-performance privacy-preserving SIFT feature detection system. In contrast to previous works, the proposed design is not restricted by the efficiency limitations of current homomorphic encryption schemes. In our design, we decompose and distribute the computation procedures of the original SIFT algorithm to a set of independent, cooperative cloud servers and keep the outsourced computation procedures as simple as possible to avoid utilizing a computationally expensive homomorphic encryption scheme. The proposed SecSIFT enables implementation with practical computation and communication complexity. Extensive experimental results demonstrate that SecSIFT performs comparably to the original SIFT on image benchmarks while preserving privacy in an efficient way.
Data-intensive computing research and technology developments offer the potential for significant improvements in several security log management challenges. Approaches to address the complexity, timeliness, expense, diversity, and noise issues have been identified. These improvements are motivated by the increasingly important role of analytics. Machine learning and expert systems that incorporate attack patterns are providing greater detection insights. Finding actionable indicators requires the analysis to combine security event log data with other network data, such as access control lists, making the big-data problem even bigger. Automation of threat intelligence is recognized as incomplete, with limited adoption of standards. With limited progress in anomaly signature detection, movement towards using expert systems has been identified as the path forward. Techniques focus on matching the behaviors of attackers to patterns of abnormal activity in the network. The need to stream, parse, and analyze large volumes of small, semi-structured data files can be feasibly addressed through a variety of techniques identified by researchers. This report highlights research in key areas, including protection of the data, performance of the systems, and network bandwidth utilization.
The usual approach to security for cloud-hosted applications is strong separation. However, it is often the case that the same data is used by different applications, particularly given the increase in data-driven ('big data' and IoT) applications. We argue that access control for the cloud should no longer be application-specific but should be data-centric, associated with the data that can flow between applications. Indeed, the data may originate outside cloud services from diverse sources such as medical monitoring, environmental sensing, etc. Information Flow Control (IFC) potentially offers data-centric, system-wide data access control. It has been shown that IFC can be provided at the operating system level as part of a PaaS offering, with an acceptable overhead. In this paper we consider how IFC can be integrated with application-specific access control, transparently to application developers, while building from simple IFC primitives access control policies that align with the data management obligations of cloud providers and tenants.