Biblio
We present a novel Cyber Security analytics framework. We demonstrate a comprehensive cyber security monitoring system to construct cyber security correlated events with feature selection to anticipate behaviour based on various sensors.
The massive amount of data that is being collected by today's society has the potential to advance scientific knowledge and boost innovations. However, people often lack sufficient computing resources to analyze their large-scale data in a cost-effective and timely way. Cloud computing offers access to vast computing resources on an on-demand and pay-per-use basis, which is a practical way for people to analyze their huge data sets. However, since their data contain sensitive information that needs to be kept secret for ethical, security, or legal reasons, many people are reluctant to adopt cloud computing. For the first time in the literature, we propose a secure outsourcing algorithm for large-scale quadratic programs (QPs), which is one of the most fundamental problems in data analysis. Specifically, based on simple linear algebra operations, we design a low-complexity QP transformation that protects the private data in a QP. We show that the transformed QP is computationally indistinguishable under a chosen plaintext attack (CPA), i.e., CPA-secure. We then develop a parallel algorithm to solve the transformed QP at the cloud, and efficiently find the solution to the original QP at the user. We implement the proposed algorithm on the Amazon Elastic Compute Cloud (EC2) and a laptop. We find that our proposed algorithm offers significant time savings for the user and is scalable to the size of the QP.
The dramatically growing demand of Cyber Physical and Social Computing (CPSC) has enabled a variety of novel channels to reach services in the financial industry. Combining cloud systems with multimedia big data is a novel approach for Financial Service Institutions (FSIs) to diversify service offerings in an efficient manner. However, the security issue is still a great issue in which the service availability often conflicts with the security constraints when the service media channels are varied. This paper focuses on this problem and proposes a novel approach using the Semantic-Based Access Control (SBAC) techniques for acquiring secure financial services on multimedia big data in cloud computing. The proposed approach is entitled IntercroSsed Secure Big Multimedia Model (2SBM), which is designed to secure accesses between various media through the multiple cloud platforms. The main algorithms supporting the proposed model include the Ontology-Based Access Recognition (OBAR) Algorithm and the Semantic Information Matching (SIM) Algorithm. We implement an experimental evaluation to prove the correctness and adoptability of our proposed scheme.
Multimedia has been exponentially increasing as the biggest big data, which consist of video clips, images, and audio files. Processing and analyzing them on a cloud data center have become a preferred solution that can utilize the large pool of cloud resources to address the problems caused by the tremendous amount of unstructured multimedia data. However, there exist many challenges in processing multimedia big data on a cloud data center, such as multimedia data representation approach, an efficient networking model, and an estimation method for traffic patterns. The primary purpose of this article is to develop a novel tensor-based software-defined networking model on a cloud data center for multimedia big-data computation and communication. First, an overview of the proposed framework is provided, in which the functions of the representative modules are briefly illustrated. Then, three models,—forwarding tensor, control tensor, and transition tensor—are proposed for management of networking devices and prediction of network traffic patterns. Finally, two algorithms about single-mode and multimode tensor eigen-decomposition are developed, and the incremental method is employed for efficiently updating the generated eigen-vector and eigen-tensor. Experimental results reveal that the proposed framework is feasible and efficient to handle multimedia big data on a cloud data center.
While cloud computing has become an attractive platform for supporting data intensive applications, a major obstacle to the adoption of cloud computing in sectors such as health and defense is the privacy risk associated with releasing datasets to third-parties in the cloud for analysis. A widely-adopted technique for data privacy preservation is to anonymize data via local recoding. However, most existing local-recoding techniques are either serial or distributed without directly optimizing scalability, thus rendering them unsuitable for big data applications. In this paper, we propose a highly scalable approach to local-recoding anonymization in cloud computing, based on Locality Sensitive Hashing (LSH). Specifically, a novel semantic distance metric is presented for use with LSH to measure the similarity between two data records. Then, LSH with the MinHash function family can be employed to divide datasets into multiple partitions for use with MapReduce to parallelize computation while preserving similarity. By using our efficient LSH-based scheme, we can anonymize each partition through the use of a recursive agglomerative \$k\$-member clustering algorithm. Extensive experiments on real-life datasets show that our approach significantly improves the scalability and time-efficiency of local-recoding anonymization by orders of magnitude over existing approaches.
While attacks on information systems have for most practical purposes binary outcomes (information was manipulated/eavesdropped, or not), attacks manipulating the sensor or control signals of Industrial Control Systems (ICS) can be tuned by the attacker to cause a continuous spectrum in damages. Attackers that want to remain undetected can attempt to hide their manipulation of the system by following closely the expected behavior of the system, while injecting just enough false information at each time step to achieve their goals. In this work, we study if attack-detection can limit the impact of such stealthy attacks. We start with a comprehensive review of related work on attack detection schemes in the security and control systems community. We then show that many of those works use detection schemes that are not limiting the impact of stealthy attacks. We propose a new metric to measure the impact of stealthy attacks and how they relate to our selection on an upper bound on false alarms. We finally show that the impact of such attacks can be mitigated in several cases by the proper combination and configuration of detection schemes. We demonstrate the effectiveness of our algorithms through simulations and experiments using real ICS testbeds and real ICS systems.
Proactive security review and test efforts are a necessary component of the software development lifecycle. Since resource limitations often preclude reviewing, testing and fortifying the entire code base, prioritizing what code to review/test can improve a team's ability to find and remove more vulnerabilities that are reachable by an attacker. One way that professionals perform this prioritization is the identification of the attack surface of software systems. However, identifying the attack surface of a software system is non-trivial. The goal of this poster is to present the concept of a risk-based attack surface approximation based on crash dump stack traces for the prioritization of security code rework efforts. For this poster, we will present results from previous efforts in the attack surface approximation space, including studies on its effectiveness in approximating security relevant code for Windows and Firefox. We will also discuss future research directions for attack surface approximation, including discovery of additional metrics from stack traces and determining how many stack traces are required for a good approximation.
Given a history of detected malware attacks, can we predict the number of malware infections in a country? Can we do this for different malware and countries? This is an important question which has numerous implications for cyber security, right from designing better anti-virus software, to designing and implementing targeted patches to more accurately measuring the economic impact of breaches. This problem is compounded by the fact that, as externals, we can only detect a fraction of actual malware infections. In this paper we address this problem using data from Symantec covering more than 1.4 million hosts and 50 malware spread across 2 years and multiple countries. We first carefully design domain-based features from both malware and machine-hosts perspectives. Secondly, inspired by epidemiological and information diffusion models, we design a novel temporal non-linear model for malware spread and detection. Finally we present ESM, an ensemble-based approach which combines both these methods to construct a more accurate algorithm. Using extensive experiments spanning multiple malware and countries, we show that ESM can effectively predict malware infection ratios over time (both the actual number and trend) upto 4 times better compared to several baselines on various metrics. Furthermore, ESM's performance is stable and robust even when the number of detected infections is low.
Security requirements around software systems have become more stringent as society becomes more interconnected via the Internet. New ways of prioritizing security efforts are needed so security professionals can use their time effectively to find security vulnerabilities or prevent them from occurring in the first place. The goal of this work is to help software development teams prioritize security efforts by approximating the attack surface of a software system via stack trace analysis. Automated attack surface approximation is a technique that uses crash dump stack traces to predict what code may contain exploitable vulnerabilities. If a code entity (a binary, file or function) appears on stack traces, then Attack Surface Approximation (ASA) considers that code entity is on the attack surface of the software system. We also explore whether number of appearances of code on stack traces correlates with where security vulnerabilities are found. To date, feasibility studies of ASA have been performed on Windows 8 and 8.1, and Mozilla Firefox. The results from these studies indicate that ASA may be useful for practitioners trying to secure their software systems. We are now working towards establishing the ground truth of what the attack surface of software systems is, along with looking at how ASA could change over time, among other metrics.
The threat of DDOS and other cyberattacks has increased during the last decade. In addition to the radical increase in the number of attacks, they are also becoming more sophisticated with the targets ranging from ordinary users to service providers and even critical infrastructure. According to some resources, the sophistication of attacks is increasing faster than the mitigating actions against them. For example determining the location of the attack origin is becoming impossible as cyber attackers employ specific means to evade detection of the attack origin by default, such as using proxy services and source address spoofing. The purpose of this paper is to initiate discussion about effective Internet Protocol traceback mechanisms that are needed to overcome this problem. We propose an approach for traceback that is based on extensive use of security metrics before (proactive) and during (reactive) the attacks.
In this paper, we describe the results of several experiments designed to test two dynamic network moving target defenses against a propagating data exfiltration attack. We designed a collection of metrics to assess the costs to mission activities and the benefits in the face of attacks and evaluated the impacts of the moving target defenses in both areas. Experiments leveraged Siege's Cyber-Quantification Framework to automatically provision the networks used in the experiment, install the two moving target defenses, collect data, and analyze the results. We identify areas in which the costs and benefits of the two moving target defenses differ, and note some of their unique performance characteristics.
Most cyber network attacks begin with an adversary gaining a foothold within the network and proceed with lateral movement until a desired goal is achieved. The mechanism by which lateral movement occurs varies but the basic signature of hopping between hosts by exploiting vulnerabilities is the same. Because of the nature of the vulnerabilities typically exploited, lateral movement is very difficult to detect and defend against. In this paper we define a dynamic reachability graph model of the network to discover possible paths that an adversary could take using different vulnerabilities, and how those paths evolve over time. We use this reachability graph to develop dynamic machine-level and network-level impact scores. Lateral movement mitigation strategies which make use of our impact scores are also discussed, and we detail an example using a freely available data set.
When reasoning about software security, researchers and practitioners use the phrase ``attack surface'' as a metaphor for risk. Enumerate and minimize the ways attackers can break in then risk is reduced and the system is better protected, the metaphor says. But software systems are much more complicated than their surfaces. We propose function- and file-level attack surface metrics–-proximity and risky walk–-that enable fine-grained risk assessment. Our risky walk metric is highly configurable: we use PageRank on a probability-weighted call graph to simulate attacker behavior of finding or exploiting a vulnerability. We provide evidence-based guidance for deploying these metrics, including an extensive parameter tuning study. We conducted an empirical study on two large open source projects, FFmpeg and Wireshark, to investigate the potential correlation between our metrics and historical post-release vulnerabilities. We found our metrics to be statistically significantly associated with vulnerable functions/files with a small-to-large Cohen's d effect size. Our prediction model achieved an increase of 36% (in FFmpeg) and 27% (in Wireshark) in the average value of F-measure over a base model built with SLOC and coupling metrics. Our prediction model outperformed comparable models from prior literature with notable improvements: 58% reduction in false negative rate, 81% reduction in false positive rate, and 548% increase in F-measure. These metrics advance vulnerability prevention by [(a)] being flexible in terms of granularity, performing better than vulnerability prediction literature, and being tunable so that practitioners can tailor the metrics to their products and better assess security risk.
The success or failure of a mobile application (`app') is largely determined by user ratings. Users frequently make their app choices based on the ratings of apps in comparison with similar, often competing apps. Users also expect apps to continually provide new features while maintaining quality, or the ratings drop. At the same time apps must also be secure, but is there a historical trade-off between security and ratings? Or are app store ratings a more all-encompassing measure of product maturity? We used static analysis tools to collect security-related metrics in 38,466 Android apps from the Google Play store. We compared the rate of an app's permission misuse, number of requested permissions, and Androrisk score, against its user rating. We found that high-rated apps have statistically significantly higher security risk metrics than low-rated apps. However, the correlations are weak. This result supports the conventional wisdom that users are not factoring security risks into their ratings in a meaningful way. This could be due to several reasons including users not placing much emphasis on security, or that the typical user is unable to gauge the security risk level of the apps they use everyday.
Android malware is becoming very effective in evading detection techniques, and traditional malware detection techniques are demonstrating their weaknesses. Signature based detection shows at least two drawbacks: first, the detection is possible only after the malware has been identified, and the time needed to produce and distribute the signature provides attackers with window of opportunities for spreading the malware in the wild. For solving this problem, different approaches that try to characterize the malicious behavior through the invoked system and API calls emerged. Unfortunately, several evasion techniques have proven effective to evade detection based on system and API calls. In this paper, we propose an approach for capturing the malicious behavior in terms of device resource consumption (using a thorough set of features), which is much more difficult to camouflage. We describe a procedure, and the corresponding practical setting, for extracting those features with the aim of maximizing their discriminative power. Finally, we describe the promising results we obtained experimenting on more than 2000 applications, on which our approach exhibited an accuracy greater than 99%.
Now-a-days for most of the organizations across the globe, two important IT initiatives are: Big Data Analytics and Cloud Computing. Big Data Analytics can provide valuables insight that can create competitiveness and generate increased revenues. Cloud Computing can enhance productivity and efficiencies thus reducing cost. Cloud Computing offers groups of servers, storages and various networking resources. It enables environment of Big Data to processes voluminous, high velocity and varied formats of Big Data.
This paper presents a six-layer Aluminum Industry 4.0 architecture for the aluminum production and full lifecycle supply chain management. It integrates a series of innovative technologies, including the IoT sensing physical system, industrial cloud platform for data management, model-driven and big data driven analysis & decision making, standardization & securitization intelligent control and management, as well as visual monitoring and backtracking process etc. The main relevant control models are studied. The applications of real-time accurate perception & intelligent decision technology in the aluminum electrolytic industry are introduced.
Enormous amount of educational data has been accumulated through Massive Open Online Courses (MOOCs), as well as commercial and non-commercial learning platforms. This is in addition to the educational data released by US government since 2012 to facilitate disruption in education by making data freely available. The high volume, variety and velocity of collected data necessitate use of big data tools and storage systems such as distributed databases for storage and Apache Spark for analysis. This tutorial will introduce researchers and faculty to real-world applications involving data mining and predictive analytics in learning sciences. In addition, the tutorial will introduce statistics required to validate and accurately report results. Topics will cover how big data is being used to transform education. Specifically, we will demonstrate how exploratory data analysis, data mining, predictive analytics, machine learning, and visualization techniques are being applied to educational big data to improve learning and scale insights driven from millions of student's records. The tutorial will be held over a half day and will be hands on with pre-posted material. Due to the interdisciplinary nature of work, the tutorial appeals to researchers from a wide range of backgrounds including big data, predictive analytics, learning sciences, educational data mining, and in general, those interested in how big data analytics can transform learning. As a prerequisite, attendees are required to have familiarity with at least one programming language.
Cyber-attacks have been evolved in a way to be more sophisticated by employing combinations of attack methodologies with greater impacts. For instance, Advanced Persistent Threats (APTs) employ a set of stealthy hacking processes running over a long period of time, making it much hard to detect. With this trend, the importance of big-data security analytics has taken greater attention since identifying such latest attacks requires large-scale data processing and analysis. In this paper, we present SEAS-MR (Security Event Aggregation System over MapReduce) that facilitates scalable security event aggregation for comprehensive situation analysis. The introduced system provides the following three core functions: (i) periodic aggregation, (ii) on-demand aggregation, and (iii) query support for effective analysis. We describe our design and implementation of the system over MapReduce and high-level query languages, and report our experimental results collected through extensive settings on a Hadoop cluster for performance evaluation and design impacts.
Space utilization are important elements for a smart city to determine how well public space are being utilized. Such information could also provide valuable feedback to the urban developer on what are the factors that impact space utilization. The spatial and temporal information for space utilization can be studied and further analyzed to generate insights about that particular space. In our research context, these elements are translated to part of big data and Internet of things (IoT) to eliminate the need of on site investigation. However, there are a number of challenges for large scale deployment, eg. hardware cost, computation capability, communication bandwidth, scalability, data fragmentation, and resident privacy etc. In this paper, we designed and prototype a Renewable Wireless Sensor Network (RWSN), which addressed the aforementioned challenges. Finally, analyzed results based on initial data collected is presented.
Many emerging applications, from domains such as healthcare and oil & gas, require several data processing systems for complex analytics. This demo paper showcases system, a framework that provides multi-platform task execution for such applications. It features a three-layer data processing abstraction and a new query optimization approach for multi-platform settings. We will demonstrate the strengths of system by using real-world scenarios from three different applications, namely, machine learning, data cleaning, and data fusion.
Modelling network traffic is gaining importance to counter modern security threats of ever increasing sophistication. It is though surprisingly difficult and costly to construct reliable classifiers on top of telemetry data due to the variety and complexity of signals that no human can manage to interpret in full. Obtaining training data with sufficiently large and variable body of labels can thus be seen as a prohibitive problem. The goal of this work is to detect infected computers by observing their HTTP(S) traffic collected from network sensors, which are typically proxy servers or network firewalls, while relying on only minimal human input in the model training phase. We propose a discriminative model that makes decisions based on a computer's all traffic observed during a predefined time window (5 minutes in our case). The model is trained on traffic samples collected over equally-sized time windows for a large number of computers, where the only labels needed are (human) verdicts about the computer as a whole (presumed infected vs. presumed clean). As part of training, the model itself learns discriminative patterns in traffic targeted to individual servers and constructs the final high-level classifier on top of them. We show the classifier to perform with very high precision, and demonstrate that the learned traffic patterns can be interpreted as Indicators of Compromise. We implement the discriminative model as a neural network with special structure reflecting two stacked multi instance problems. The main advantages of the proposed configuration include not only improved accuracy and ability to learn from gross labels, but also automatic learning of server types (together with their detectors) that are typically visited by infected computers.
The data processing capabilities of MapReduce systems pioneered with the on-demand scalability of cloud computing have enabled the Big Data revolution. However, the data controllers/owners worried about the privacy and accountability impact of storing their data in the cloud infrastructures as the existing cloud computing solutions provide very limited control on the underlying systems. The intuitive approach - encrypting data before uploading to the cloud - is not applicable to MapReduce computation as the data analytics tasks are ad-hoc defined in the MapReduce environment using general programming languages (e.g, Java) and homomorphic encryption methods that can scale to big data do not exist. In this paper, we address the challenges of determining and detecting unauthorized access to data stored in MapReduce based cloud environments. To this end, we introduce alarm raising honeypots distributed over the data that are not accessed by the authorized MapReduce jobs, but only by the attackers and/or unauthorized users. Our analysis shows that unauthorized data accesses can be detected with reasonable performance in MapReduce based cloud environments.
With the growing observed success of big data use, many challenges appeared. Timeless, scalability and privacy are the main problems that researchers attempt to figure out. Privacy preserving is now a highly active domain of research, many works and concepts had seen the light within this theme. One of these concepts is the de-identification techniques. De-identification is a specific area that consists of finding and removing sensitive information either by replacing it, encrypting it or adding a noise to it using several techniques such as cryptography and data mining. In this report, we present a new model of de-identification of textual data using a specific Immune System algorithm known as CLONALG.