Biblio
IT system risk assessments are indispensable due to increasing cyber threats within our ever-growing IT systems. Moreover, laws and regulations urge organizations to conduct risk assessments regularly. Even though there exist several risk management frameworks and methodologies, they are in general high level, not defining the risk metrics, risk metrics values and the detailed risk assessment formulas for different risk views. To address this need, we define a novel risk assessment methodology specific to IT systems. Our model is quantitative, both asset and vulnerability centric and defines low and high level risk metrics. High level risk metrics are defined in two general categories; base and attack graph-based. In our paper, we provide a detailed explanation of formulations in each category and make our implemented software publicly available for those who are interested in applying the proposed methodology to their IT systems.
Edge Computing is a scheme to improve the performance, latency and security guidelines for IoT applications. However, edge deployment of an application also comes with additional complexity in management, an increased attack surface for security vulnerability, and could potentially result in a more expensive solution. As a result, the conditions under which an edge deployment of IoT applications delivers a better solution is not always obvious. Metrics which would be able to predict whether or not an IoT application is suitable for edge deployment can provide useful insights to address this question. In this paper, we examine the key performance indicators for IoT applications, namely the responsiveness, scalability and cost models for different types of IoT applications. Our analysis identifies that network centrality of an IoT application is a key characteristic which determines whether or not an IoT application is a good candidate for edge deployment. We discuss the different measures of network centrality that can be used to characterize applications, and the relative performance of edge deployment compared to centralized deployment for various IoT applications.
In the process of big data analysis and processing, a key concern blocking users from storing and processing their data in the cloud is their misgivings about the security and performance of cloud services. There is an urgent need to develop an approach that can help each cloud service provider (CSP) to demonstrate that their infrastructure and service behavior can meet the users' expectations. However, most of the prior research work focused on validating the process compliance of cloud service without an accurate description of the basic service behaviors, and could not measure the security capability. In this paper, we propose a novel approach to verify cloud service security conformance called CloudSec, which reduces the description gap between the cloud provider and customer through modeling cloud service behaviors (CloudBeh Model) and security SLA (SecSLA Model). These models enable a systematic integration of security constraints and service behavior into cloud while using UPPAAL to check the conformance, which can not only check CloudBeh performance metrics conformance, but also verify whether the security constraints meet the SecSLA. The proposed approach is validated through case study and experiments with a cloud storage service based on OpenStack, which illustrates CloudSec approach effectiveness and can be applied in real cloud scenarios.
The increasing complexity and ubiquity in user connectivity, computing environments, information content, and software, mobile, and web applications transfers the responsibility of privacy management to the individuals. Hence, making it extremely difficult for users to maintain the intelligent and targeted level of privacy protection that they need and desire, while simultaneously maintaining their ability to optimally function. Thus, there is a critical need to develop intelligent, automated, and adaptable privacy management systems that can assist users in managing and protecting their sensitive data in the increasingly complex situations and environments that they find themselves in. This work is a first step in exploring the development of such a system, specifically how user personality traits and other characteristics can be used to help automate determination of user sharing preferences for a variety of user data and situations. The Big-Five personality traits of openness, conscientiousness, extroversion, agreeableness, and neuroticism are examined and used as inputs into several popular machine learning algorithms in order to assess their ability to elicit and predict user privacy preferences. Our results show that the Big-Five personality traits can be used to significantly improve the prediction of user privacy preferences in a number of contexts and situations, and so using machine learning approaches to automate the setting of user privacy preferences has the potential to greatly reduce the burden on users while simultaneously improving the accuracy of their privacy preferences and security.
In this paper, we review big data characteristics and security challenges in the cloud and visit different cloud domains and security regulations. We propose using integrated auditing for secure data storage and transaction logs, real-time compliance and security monitoring, regulatory compliance, data environment, identity and access management, infrastructure auditing, availability, privacy, legality, cyber threats, and granular auditing to achieve big data security. We apply a stochastic process model to conduct security analyses in availability and mean time to security failure. Potential future works are also discussed.
Enterprises usually provide strong controls to prevent cyberattacks and inadvertent leakage of data to external entities. However, in the case where employees and data scientists have legitimate access to analyze and derive insights from the data, there are insufficient controls and employees are usually permitted access to all information about the customers of the enterprise including sensitive and private information. Though it is important to be able to identify useful patterns of one's customers for better customization and service, customers' privacy must not be sacrificed to do so. We propose an alternative — a framework that will allow privacy preserving data analytics over big data. In this paper, we present an efficient and scalable framework for Apache Spark, a cluster computing framework, that provides strong privacy guarantees for users even in the presence of an informed adversary, while still providing high utility for analysts. The framework, titled Shade, includes two mechanisms — SparkLAP, which provides Laplacian perturbation based on a user's query and SparkSAM, which uses the contents of the database itself in order to calculate the perturbation. We show that the performance of Shade is substantially better than earlier differential privacy systems without loss of accuracy, particularly when run on datasets small enough to fit in memory, and find that SparkSAM can even exceed performance of an identical nonprivate Spark query.
In the age of Big Data, we are witnessing a huge proliferation of digital data capturing our lives and our surroundings. Data privacy is a critical barrier to data analytics and privacy-preserving data disclosure becomes a key aspect to leveraging large-scale data analytics due to serious privacy risks. Traditional privacy-preserving data publishing solutions have focused on protecting individual's private information while considering all aggregate information about individuals as safe for disclosure. This paper presents a new privacy-aware data disclosure scheme that considers group privacy requirements of individuals in bipartite association graph datasets (e.g., graphs that represent associations between entities such as customers and products bought from a pharmacy store) where even aggregate information about groups of individuals may be sensitive and need protection. We propose the notion of $ε$g-Group Differential Privacy that protects sensitive information of groups of individuals at various defined group protection levels, enabling data users to obtain the level of information entitled to them. Based on the notion of group privacy, we develop a suite of differentially private mechanisms that protect group privacy in bipartite association graphs at different group privacy levels based on specialization hierarchies. We evaluate our proposed techniques through extensive experiments on three real-world association graph datasets and our results demonstrate that the proposed techniques are effective, efficient and provide the required guarantees on group privacy.
With the development of modern logistics industry railway freight enterprises as the main traditional logistics enterprises, the service mode is facing many problems. In the era of big data, for railway freight enterprises, coordinated development and sharing of information resources have become the requirements of the times, while how to protect the privacy of citizens has become one of the focus issues of the public. To prevent the disclosure or abuse of the citizens' privacy information, the citizens' privacy needs to be preserved in the process of information opening and sharing. However, most of the existing privacy preserving models cannot to be used to resist attacks with continuously growing background knowledge. This paper presents the method of applying differential privacy to protect associated data, which can be shared in railway freight service association information. First, the original service data need to slice by optimal shard length, then differential method and apriori algorithm is used to add Laplace noise in the Candidate sets. Thus the citizen's privacy information can be protected even if the attacker gets strong background knowledge. Last, sharing associated data to railway information resource partners. The steps and usefulness of the discussed privacy preservation method is illustrated by an example.
In smart grid, large quantities of data is collected from various applications, such as smart metering substation state monitoring, electric energy data acquisition, and smart home. Big data acquired in smart grid applications is usually sensitive. For instance, in order to dispatch accurately and support the dynamic price, lots of smart meters are installed at user's house to collect the real-time data, but all these collected data are related to user privacy. In this paper, we propose a data aggregation scheme based on secret sharing with fault tolerance in smart grid, which ensures that control center gets the integrated data without revealing user's privacy. Meanwhile, we also consider fault tolerance during the data aggregation. At last, we analyze the security of our scheme and carry out experiments to validate the results.
In the recent years, we have observed the development of several connected and mobile devices intended for daily use. This development has come with many risks that might not be perceived by the users. These threats are compromising when an unauthorized entity has access to private big data generated through the user objects in the Internet of Things. In the literature, many solutions have been proposed in order to protect the big data, but the security remains a challenging issue. This work is carried out with the aim to provide a solution to the access control to the big data and securing the localization of their generator objects. The proposed models are based on Attribute Based Encryption, CHORD protocol and $μ$TESLA. Through simulations, we compare our solutions to concurrent protocols and we show its efficiency in terms of relevant criteria.
Location privacy has become a significant challenge of big data. Particularly, by the advantage of big data handling tools availability, huge location data can be managed and processed easily by an adversary to obtain user private information from Location-Based Services (LBS). So far, many methods have been proposed to preserve user location privacy for these services. Among them, dummy-based methods have various advantages in terms of implementation and low computation costs. However, they suffer from the spatiotemporal correlation issue when users submit consecutive requests. To solve this problem, a practical hybrid location privacy protection scheme is presented in this paper. The proposed method filters out the correlated fake location data (dummies) before submissions. Therefore, the adversary can not identify the user's real location. Evaluations and experiments show that our proposed filtering technique significantly improves the performance of existing dummy-based methods and enables them to effectively protect the user's location privacy in the environment of big data.
Smart energy meters record electricity consumption and generation at fine-grained intervals, and are among the most widely deployed sensors in the world. Energy data embeds detailed information about a building's energy-efficiency, as well as the behavior of its occupants, which academia and industry are actively working to extract. In many cases, either inadvertently or by design, these third-parties only have access to anonymous energy data without an associated location. The location of energy data is highly useful and highly sensitive information: it can provide important contextual information to improve big data analytics or interpret their results, but it can also enable third-parties to link private behavior derived from energy data with a particular location. In this paper, we present Weatherman, which leverages a suite of analytics techniques to localize the source of anonymous energy data. Our key insight is that energy consumption data, as well as wind and solar generation data, largely correlates with weather, e.g., temperature, wind speed, and cloud cover, and that every location on Earth has a distinct weather signature that uniquely identifies it. Weatherman represents a serious privacy threat, but also a potentially useful tool for researchers working with anonymous smart meter data. We evaluate Weatherman's potential in both areas by localizing data from over one hundred smart meters using a weather database that includes data from over 35,000 locations. Our results show that Weatherman localizes coarse (one-hour resolution) energy consumption, wind, and solar data to within 16.68km, 9.84km, and 5.12km, respectively, on average, which is more accurate using much coarser resolution data than prior work on localizing only anonymous solar data using solar signatures.
The main issue with big data in cloud is the processed or used always need to be by third party. It is very important for the owners of data or clients to trust and to have the guarantee of privacy for the information stored in cloud or analyzed as big data. The privacy models studied in previous research showed that privacy infringement for big data happened because of limitation, privacy guarantee rate or dissemination of accurate data which is obtainable in the data set. In addition, there are various privacy models. In order to determine the best and the most appropriate model to be applied in the future, which also guarantees big data privacy, it is necessary to invest in research and study. In the next part, we surfed some of the privacy models in order to determine the advantages and disadvantages of each model in privacy assurance for big data in cloud. The present study also proposes combined Diff-Anonym algorithm (K-anonymity and differential models) to provide data anonymity with guarantee to keep balance between ambiguity of private data and clarity of general data.
Knowledge work such as summarizing related research in preparation for writing, typically requires the extraction of useful information from scientific literature. Nowadays the primary source of information for researchers comes from electronic documents available on the Web, accessible through general and academic search engines such as Google Scholar or IEEE Xplore. Yet, the vast amount of resources makes retrieving only the most relevant results a difficult task. As a consequence, researchers are often confronted with loads of low-quality or irrelevant content. To address this issue we introduce a novel system, which combines a rich, interactive Web-based user interface and different visualization approaches. This system enables researchers to identify key phrases matching current information needs and spot potentially relevant literature within hierarchical document collections. The chosen context was the collection and summarization of related work in preparation for scientific writing, thus the system supports features such as bibliography and citation management, document metadata extraction and a text editor. This paper introduces the design rationale and components of the PaperViz. Moreover, we report the insights gathered in a formative design study addressing usability.
Text analytics systems often rely heavily on detecting and linking entity mentions in documents to knowledge bases for downstream applications such as sentiment analysis, question answering and recommender systems. A major challenge for this task is to be able to accurately detect entities in new languages with limited labeled resources. In this paper we present an accurate and lightweight, multilingual named entity recognition (NER) and linking (NEL) system. The contributions of this paper are three-fold: 1) Lightweight named entity recognition with competitive accuracy; 2) Candidate entity retrieval that uses search click-log data and entity embeddings to achieve high precision with a low memory footprint; and 3) efficient entity disambiguation. Our system achieves state-of-the-art performance on TAC KBP 2013 multilingual data and on English AIDA CONLL data.
Labeled datasets are always limited, and oftentimes the quantity of labeled data is a bottleneck for data analytics. This especially affects supervised machine learning methods, which require labels for models to learn from the labeled data. Active learning algorithms have been proposed to help achieve good analytic models with limited labeling efforts, by determining which additional instance labels will be most beneficial for learning for a given model. Active learning is consistent with interactive analytics as it proceeds in a cycle in which the unlabeled data is automatically explored. However, in active learning users have no control of the instances to be labeled, and for text data, the annotation interface is usually document only. Both of these constraints seem to affect the performance of an active learning model. We hypothesize that visualization techniques, particularly interactive ones, will help to address these constraints. In this paper, we implement a pilot study of visualization in active learning for text classification, with an interactive labeling interface. We compare the results of three experiments. Early results indicate that visualization improves high-performance machine learning model building with an active learning algorithm.
The applications of 3D Virtual Environments are taking giant leaps with more sophisticated 3D user interfaces and immersive technologies. Interactive 3D and Virtual Reality platforms present a great opportunity for data analytics and can represent large amounts of data to help humans in decision making and insight. For any of the above to be effective, it is essential to understand the characteristics of these interfaces in displaying different types of content. Text is an essential and widespread content and legibility acts as an important criterion to determine the style, size and quantity of the text to be displayed. This study evaluates the maximum amount of text per visual angle, that is, the maximum density of text that will be legible in a virtual environment displayed on different platforms. We used Extensible 3D (X3D) to provide the portable (cross-platform) stimuli. The results presented here are based on a user study conducted in DeepSix (a tiled LCD display with 5750×2400 resolution) and the Hypercube (an immersive CAVE-style active stereo projection system with three walls and floor at 2560×2560 pixels active stereo per wall). We found that more legible text can be displayed on an immersive projection due to its larger Field of Regard; in the immersive case, stereo versus monoscopic rendering did not have a significant effect on legibility.
Information technology graduates reach industry and innovate for the future after completing demanding degrees. Upper division college courses require long hours of work on class projects and exams. Some students have hopes of completing their degrees, but are deterred due to many different issues. Instructors can monitor students' progress based on their assignments, projects, and exams. Judging students' understanding and potential for success becomes more difficult when handling large classes. In this paper we utilize IBM Text Analytics Web Tooling on large amounts of unstructured text data collected from past assignments, exams, and discussions to help professors make assessments faster for large classes. In particular, we focus on an Information Security course offered at San Jose State University and use its classroom-generated data to determine if the extracted information provides strong insights for professors to help struggling students. We examine these issues through exploratory analysis.
Autoencoders have been successful in learning meaningful representations from image datasets. However, their performance on text datasets has not been widely studied. Traditional autoencoders tend to learn possibly trivial representations of text documents due to their confoundin properties such as high-dimensionality, sparsity and power-law word distributions. In this paper, we propose a novel k-competitive autoencoder, called KATE, for text documents. Due to the competition between the neurons in the hidden layer, each neuron becomes specialized in recognizing specific data patterns, and overall the model can learn meaningful representations of textual data. A comprehensive set of experiments show that KATE can learn better representations than traditional autoencoders including denoising, contractive, variational, and k-sparse autoencoders. Our model also outperforms deep generative models, probabilistic topic models, and even word representation models (e.g., Word2Vec) in terms of several downstream tasks such as document classification, regression, and retrieval.
Social media has become an important platform for people to express opinions, share information and communicate with others. Detecting and tracking topics from social media can help people grasp essential information and facilitate many security-related applications. As social media texts are usually short, traditional topic evolution models built based on LDA or HDP often suffer from the data sparsity problem. Recently proposed topic evolution models are more suitable for short texts, but they need to manually specify topic number which is fixed during different time period. To address these issues, in this paper, we propose a nonparametric topic evolution model for social media short texts. We first propose the recurrent semantic dependent Chinese restaurant process (rsdCRP), which is a nonparametric process incorporating word embeddings to capture semantic similarity information. Then we combine rsdCRP with word co-occurrence modeling and build our short-text oriented topic evolution model sdTEM. We carry out experimental studies on Twitter dataset. The results demonstrate the effectiveness of our method to monitor social media topic evolution compared to the baseline methods.
Guidelines, directives, and policy statements are usually presented in ``linear'' text form - word after word, page after page. However necessary, this practice impedes full understanding, obscures feedback dynamics, hides mutual dependencies and cascading effects and the like, - even when augmented with tables and diagrams. The net result is often a checklist response as an end in itself. All this creates barriers to intended realization of guidelines and undermines potential effectiveness. We present a solution strategy using text as ``data'', transforming text into a structured model, and generate a network views of the text(s), that we then can use for vulnerability mapping, risk assessments and control point analysis. We apply this approach using two NIST reports on cybersecurity of smart grid, more than 600 pages of text. Here we provide a synopsis of approach, methods, and tools. (Elsewhere we consider (a) system-wide level, (b) aviation e-landscape, (c) electric vehicles, and (d) SCADA for smart grid).
In our era, most of the communication between people is realized in the form of electronic messages and especially through smart mobile devices. As such, the written text exchanged suffers from bad use of punctuation, misspelling words, continuous chunk of several words without spaces, tables, internet addresses etc. which make traditional text analytics methods difficult or impossible to be applied without serious effort to clean the dataset. Our proposed method in this paper can work in massive noisy and scrambled texts with minimal preprocessing by removing special characters and spaces in order to create a continuous string and detect all the repeated patterns very efficiently using the Longest Expected Repeated Pattern Reduced Suffix Array (LERP-RSA) data structure and a variant of All Repeated Patterns Detection (ARPaD) algorithm. Meta-analyses of the results can further assist a digital forensics investigator to detect important information to the chunk of text analyzed.
Critical information systems strongly rely on event logging techniques to collect data, such as housekeeping/error events, execution traces and dumps of variables, into unstructured text logs. Event logs are the primary source to gain actionable intelligence from production systems. In spite of the recognized importance, system/application logs remain quite underutilized in security analytics when compared to conventional and structured data sources, such as audit traces, network flows and intrusion detection logs. This paper proposes a method to measure the occurrence of interesting activity (i.e., entries that should be followed up by analysts) within textual and heterogeneous runtime log streams. We use an entropy-based approach, which makes no assumptions on the structure of underlying log entries. Measurements have been done in a real-world Air Traffic Control information system through a data analytics framework. Experiments suggest that our entropy-based method represents a valuable complement to security analytics solutions.
The growing computerization of critical infrastructure as well as the pervasiveness of computing in everyday life has led to increased interest in secure application development. We observe a flurry of new security technologies like ARM TrustZone and Intel SGX, but a lack of a corresponding architectural vision. We are convinced that point solutions are not sufficient to address the overall challenge of secure system design. In this paper, we outline our take on a trusted component ecosystem of small individual building blocks with strong isolation. In our view, applications should no longer be designed as massive stacks of vertically layered frameworks, but instead as horizontal aggregates of mutually isolated components that collaborate across machine boundaries to provide a service. Lateral thinking is needed to make secure systems going forward.
This paper describes an experiment carried out to demonstrate robustness and trustworthiness of an orchestrated two-layer network test-bed (PROnet). A Robotic Operating System Industrial (ROS-I) distributed application makes use of end-to-end flow services offered by PROnet. The PROnet Orchestrator is used to provision reliable end-to-end Ethernet flows to support the ROS-I application required data exchange. For maximum reliability, the Orchestrator provisions network resource redundancy at both layers, i.e., Ethernet and optical. Experimental results show that the robotic application is not interrupted by a fiber outage.