Biblio
Poor data quality has become a persistent challenge for organizations as data continues to grow in complexity and size. Existing data cleaning solutions focus on identifying repairs to the data to minimize either a cost function or the number of updates. These techniques, however, fail to consider underlying data privacy requirements that exist in many real data sets containing sensitive and personal information. In this demonstration, we present PARC, a Privacy-AwaRe data Cleaning system that corrects data inconsistencies w.r.t. a set of FDs, and limits the disclosure of sensitive values during the cleaning process. The system core contains modules that evaluate three key metrics during the repair search, and solves a multi-objective optimization problem to identify repairs that balance the privacy vs. utility tradeoff. This demonstration will enable users to understand: (1) the characteristics of a privacy-preserving data repair; (2) how to customize data cleaning and data privacy requirements using two real datasets; and (3) the distinctions among the repair recommendations via visualization summaries.
Recent advances in differential privacy make it possible to guarantee user privacy while preserving the main characteristics of the data. However, most differential privacy mechanisms assume that the underlying dataset is clean. This paper explores the link between data cleaning and differential privacy in a framework we call PrivateClean. PrivateClean includes a technique for creating private datasets of numerical and discrete-valued attributes, a formalism for privacy-preserving data cleaning, and techniques for answering sum, count, and avg queries after cleaning. We show: (1) how the degree of privacy affects subsequent aggregate query accuracy, (2) how privacy potentially amplifies certain types of errors in a dataset, and (3) how this analysis can be used to tune the degree of privacy. The key insight is to maintain a bipartite graph relating dirty values to clean values and use this graph to estimate biases due to the interaction between cleaning and privacy. We validate these results on four datasets with a variety of well-studied cleaning techniques including using functional dependencies, outlier filtering, and resolving inconsistent attributes.
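For intuition, here is a minimal sketch of the bipartite-graph idea (names and values are invented for illustration, not PrivateClean's actual API): each dirty value is linked to the clean value it was repaired to, and edge multiplicities can be used to estimate how cleaning shifts aggregate counts.

```python
from collections import defaultdict

# Hypothetical illustration of a dirty->clean bipartite mapping.
# Each edge records that a dirty value was repaired to a clean value;
# edge multiplicities show how cleaning shifts counts.
class CleaningGraph:
    def __init__(self):
        self.edges = defaultdict(int)  # (dirty_value, clean_value) -> multiplicity

    def record_repair(self, dirty, clean):
        self.edges[(dirty, clean)] += 1

    def count_estimate(self, clean_value):
        """Estimated count of records whose cleaned value equals clean_value."""
        return sum(m for (d, c), m in self.edges.items() if c == clean_value)

g = CleaningGraph()
for dirty, clean in [("NYC", "New York"), ("New York", "New York"),
                     ("N.Y.", "New York"), ("Boston", "Boston")]:
    g.record_repair(dirty, clean)
print(g.count_estimate("New York"))  # 3
```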
Taiwan has become the frontline in an emerging cyberspace battle. Cyberattacks from different countries have been reported constantly over the past decades. Advanced Persistent Threat (APT) incidents are analyzed from the golden-triangle components (people, process, and technology) to ensure the application of digital forensics. This study presents a novel People-Process-Technology-Strategy (PPTS) model that implements a triage investigative step to identify evidence dynamics in digital data and essential information in auditing logs. The result of this study is expected to improve APT investigation. The investigation scenario of the proposed methodology is illustrated by applying it to several APT incidents in Taiwan.
Exhaustive enumeration of an S-select-k problem for hypothesized substation outages can be practically infeasible due to the exponential growth of combinations as both S and k increase. This enumeration of worst-case substation scenarios from the large set can, however, be improved by starting from initial selection sets built from root nodes and segments. In this paper, previous work on the reverse pyramid model (RPM) is enhanced with prioritization of root nodes and defined segmentation of the substation list based on the mean-time-to-compromise (MTTC) value associated with each substation. Root nodes are selected based on threshold values of the substation ranking on MTTC, and the list is segmented accordingly from the root node set. Each segment is then enumerated with the S-select-k module to identify worst-case scenarios (see the sketch below). Substations at the lowest threshold on the list, e.g., those with no MTTC assignment or an extremely low value, are eliminated entirely. Simulations on the IEEE 30-bus system show that this approach yields risk indices similar to exhaustive enumeration across all randomly generated MTTC values.
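As a rough illustration of segmentation-then-enumeration (substation names, MTTC values, the threshold, and the segment size are all invented; this is not the paper's RPM implementation):

```python
from itertools import combinations

# Hypothetical MTTC values per substation (higher = harder to compromise).
mttc = {"S1": 12.0, "S2": 3.5, "S3": 0.0, "S4": 7.2, "S5": 1.1, "S6": 9.8}

# Drop substations below the lowest threshold (e.g., no MTTC assigned).
threshold = 1.0
candidates = {s: v for s, v in mttc.items() if v >= threshold}

# Rank by MTTC and split the ranked list into segments rooted at the
# easiest-to-compromise substations.
ranked = sorted(candidates, key=lambda s: candidates[s])
segment_size = 3
segments = [ranked[i:i + segment_size] for i in range(0, len(ranked), segment_size)]

# Enumerate k-combinations within each segment instead of over all substations.
k = 2
for seg in segments:
    for scenario in combinations(seg, k):
        print(scenario)  # candidate worst-case outage scenario to evaluate
```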
Phishing is an online security attack in which the hacker aims at harvesting sensitive information, such as passwords and credit card details, from users by making them believe that what they see is legitimate. This threat has existed for over a decade, and there have been continuous developments in counter-attacking it. However, statistical studies reveal that phishing is still a big threat to today's world as the online era booms. In this paper, we look into the art of phishing and present a practical analysis of how state-of-the-art anti-phishing systems fail to prevent phishing. Based on the loopholes identified in these systems, we pave a roadmap for the kind of system that will counter this online security threat more effectively.
Language vector space models (VSMs) have recently proven to be effective across a variety of tasks. In VSMs, each word in a corpus is represented as a real-valued vector. These vectors can be used as features in many applications in machine learning and natural language processing. In this paper, we study the effect of vector space representations in cyber security. In particular, we consider a passive traffic analysis attack (website fingerprinting) that threatens users' navigation privacy on the web. Internet users (such as online activists) may use anonymous communication to hide the destinations of the web pages they access, for reasons such as avoiding repressive governments. Traditional website fingerprinting studies collect packets from the users' network and extract features that machine learning techniques use to reveal the destinations of certain web pages. In this work, we propose the packet-to-vector (P2V) approach, in which we model the website fingerprinting attack using word vector representations. We show that the proposed model outperforms previous website fingerprinting work.
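A minimal sketch of the packet-to-vector idea, assuming gensim's Word2Vec and treating "direction+size" packet tokens as words in a traffic trace; the token scheme and all parameters are illustrative, not the authors' exact setup.

```python
import numpy as np
from gensim.models import Word2Vec

# Hypothetical traces: each trace is a sequence of packet "tokens",
# here "<direction><size>" strings standing in for real captured packets.
traces = [
    ["out512", "in1500", "in1500", "out64", "in1024"],
    ["out512", "in1024", "out64", "in1500"],
    ["out128", "in256", "in256", "out128"],
]

# Learn a vector for every packet token (gensim >= 4.0 API).
model = Word2Vec(sentences=traces, vector_size=32, window=3, min_count=1, epochs=50)

# One simple trace embedding: the average of its packet vectors, usable as a
# feature vector for a downstream website-fingerprinting classifier.
trace_vec = np.mean([model.wv[tok] for tok in traces[0]], axis=0)
print(trace_vec.shape)  # (32,)
```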
The Center for Strategic and International Studies estimates the annual cost of cyber crime to be more than \$400 billion. Most notable are the recent digital identity thefts that compromised millions of accounts. These attacks emphasize the security problems of using clonable static information. One possible solution is the use of a physical device known as a Physically Unclonable Function (PUF). PUFs can be used to create encryption keys, generate random numbers, or authenticate devices. While the concept shows promise, current PUF implementations are inherently problematic: they are inconsistent, expensive, susceptible to modeling attacks, and permanent. Therefore, we propose a new solution by which an unclonable, dynamic digital identity is created between two communication endpoints such as mobile devices. This Physically Unclonable Digital ID (PUDID) is created by injecting a data-scrambling PUF device at the data origin point that corresponds to a unique and matching descrambler/hardware authentication at the receiving end. The device is designed using macroscopic, intentional anomalies, making it inexpensive to produce. PUDID is resistant to cryptanalysis due to the separation of the challenge-response pair and a series of hash functions. PUDID is also unique in that, by combining the PUF device identity with a dynamic human identity, we can create true two-factor authentication. We also propose an alternative solution that eliminates the need for a PUF mechanism altogether by combining tamper-resistant capabilities with a series of hash functions. This tamper-resistant device, referred to as a Quasi-PUDID (Q-PUDID), modifies input data, using a black-box mechanism, in an unpredictable way. By mimicking PUF attributes, Q-PUDID is able to avoid traditional PUF challenges, thereby providing high-performing physical identity assurance with or without a low-performing PUF mechanism. Three different application scenarios with mobile devices for PUDID and Q-PUDID have been analyzed to show their unique advantages over traditional PUFs and outline the potential for placement in a host of applications.
We summarize the optimization and performance evaluation of the Nonhydrostatic ICosahedral Atmospheric Model (NICAM) on two different types of supercomputers: the K computer and TSUBAME2.5. First, we evaluated and improved several kernels extracted from the model code on the K computer. We did not significantly change the loop and data ordering, relying instead on features of the K computer such as the hardware-aided thread barrier mechanism and the relatively high memory bandwidth (a 0.5 Byte/FLOP ratio). Loop optimizations and code cleaning to reduce memory transfer contributed to a speed-up of the model execution time. The sustained performance of the main loop of NICAM reached 0.87 PFLOPS with 81,920 nodes on the K computer. For GPU-based calculations, we applied OpenACC to the dynamical core of NICAM. The performance and scalability were evaluated using the TSUBAME2.5 supercomputer. We achieved good performance results, which showed efficient use of the memory throughput of the GPU as well as good weak scalability. A dry dynamical core experiment was carried out using 2560 GPUs, achieving 60 TFLOPS of sustained performance.
With the end of CPU core scaling due to dark silicon limitations, customized accelerators on FPGAs have gained increased attention in modern datacenters due to their low power, high performance, and energy efficiency. Evidenced by Microsoft's FPGA deployment in its Bing search engine and Intel's \$16.7 billion acquisition of Altera, integrating FPGAs into datacenters is considered one of the most promising approaches to sustaining future datacenter growth. However, it is quite challenging for existing big data computing systems, such as Apache Spark and Hadoop, to access the performance and energy benefits of FPGA accelerators. In this paper we design and implement Blaze to provide programming and runtime support for easy and efficient deployment of FPGA accelerators in datacenters. In particular, Blaze abstracts FPGA accelerators as a service (FaaS) and provides a set of clean programming APIs for big data processing applications to easily utilize those accelerators. Our Blaze runtime implements an FaaS framework to efficiently share FPGA accelerators among multiple heterogeneous threads on a single node, and extends Hadoop YARN with accelerator-centric scheduling to efficiently share them among multiple computing tasks in the cluster. Experimental results using four representative big data applications demonstrate that Blaze greatly reduces the programming effort needed to access FPGA accelerators in systems like Apache Spark and YARN, and improves system throughput by 1.7× to 3× (and energy efficiency by 1.5× to 2.7×) compared to a conventional CPU-only cluster.
On Android, advertising libraries are commonly integrated with their host apps. Since the host and advertising components share the application's sandbox, advertisement code inherits all permissions and can access host resources with no further approval needed. Motivated by the privacy risks of advertisement libraries as already shown in the literature, this poster introduces an Android Runtime (ART) based app compartmentalization mechanism to achieve separation between trusted app code and untrusted library code without system modification and application rewriting. With our approach, advertising libraries will be isolated from the host app and the original app will be partitioned into two sub-apps that run independently, with the host app's resources and permissions being protected by Android's app sandboxing mechanism. ARTist [1], a compiler-based Android app instrumentation framework, is utilized here to recreate the communication channels between host and advertisement library. The result is a robust toolchain on device which provides a clean separation of developer-written app code and third-party advertisement code, allowing for finer-grained access control policies and information flow control without OS customization and application rebuilding.
We present new algorithms for Personalized PageRank estimation and Personalized PageRank search. First, for the problem of estimating Personalized PageRank (PPR) from a source distribution to a target node, we present a new bidirectional estimator with simple yet strong guarantees on correctness and performance, and 3x to 8x speedup over existing estimators in experiments on a diverse set of networks. Moreover, it has a clean algebraic structure which enables it to be used as a primitive for the Personalized PageRank Search problem: Given a network like Facebook, a query like "people named John," and a searching user, return the top nodes in the network ranked by PPR from the perspective of the searching user. Previous solutions either score all nodes or score candidate nodes one at a time, which is prohibitively slow for large candidate sets. We develop a new algorithm based on our bidirectional PPR estimator which identifies the most relevant results by sampling candidates based on their PPR; this is the first solution to PPR search that can find the best results without iterating through the set of all candidate results. Finally, by combining PPR sampling with sequential PPR estimation and Monte Carlo, we develop practical algorithms for PPR search, and we show via experiments that our algorithms are efficient on networks with billions of edges.
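For intuition, a minimal Monte Carlo sketch of PPR estimation from a single source (this is only the forward-walk half of a bidirectional scheme; the graph, teleport probability, and walk count are invented):

```python
import random
from collections import Counter

# Hypothetical small directed graph as adjacency lists.
graph = {
    "alice": ["bob", "carol"],
    "bob": ["carol"],
    "carol": ["alice", "dave"],
    "dave": ["alice"],
}

def monte_carlo_ppr(source, alpha=0.2, walks=10000):
    """Estimate PPR from `source`: run random walks that terminate with
    probability alpha at each step; the distribution of walk endpoints
    approximates the Personalized PageRank vector."""
    hits = Counter()
    for _ in range(walks):
        node = source
        while random.random() > alpha:
            out_neighbors = graph.get(node)
            if not out_neighbors:      # dangling node: restart at the source
                node = source
                continue
            node = random.choice(out_neighbors)
        hits[node] += 1
    return {v: c / walks for v, c in hits.items()}

print(monte_carlo_ppr("alice"))
```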
Learning classifier systems (LCSs) are rule-based evolutionary algorithms uniquely suited to classification and data mining in complex, multi-factorial, and heterogeneous problems. LCS rule fitness is commonly based on accuracy, but this metric alone is not ideal for assessing global rule `value' in noisy problem domains, and thus impedes effective knowledge extraction. Multi-objective fitness functions are promising but rely on knowledge of how to weigh objective importance. Such prior knowledge is unavailable in most real-world problems. The Pareto-front concept offers a strategy for multi-objective machine learning that is agnostic to objective importance. We propose a Pareto-inspired multi-objective rule fitness (PIMORF) for LCS, and combine it with a complementary rule-compaction approach (SRC). We implemented these strategies in ExSTraCS, a successful supervised LCS, and evaluated performance over an array of complex simulated noisy and clean problems (i.e. genetic and multiplexer) that each concurrently model pure interaction effects and heterogeneity. While evaluation over multiple performance metrics yielded mixed results, this work represents an important first step towards efficiently learning complex problem spaces without the advantage of prior problem knowledge. Overall the results suggest that PIMORF paired with SRC improved rule set interpretability, particularly with regard to heterogeneous patterns.
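A compact sketch of the Pareto-dominance filtering that underlies such an approach (the rules and the two objectives here are invented; PIMORF's actual fitness definition is given in the paper):

```python
# Each hypothetical rule is scored on two objectives to maximize,
# e.g., (accuracy, correct-cover count).
rules = {
    "r1": (0.92, 15),
    "r2": (0.80, 40),
    "r3": (0.75, 10),   # dominated by r1
    "r4": (0.95, 5),
}

def dominates(a, b):
    """a dominates b if it is >= on every objective and > on at least one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

pareto_front = [
    name for name, score in rules.items()
    if not any(dominates(other, score) for o_name, other in rules.items() if o_name != name)
]
print(pareto_front)  # ['r1', 'r2', 'r4']
```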
Named-Data Networking (NDN) is the most prominent clean-slate proposal for the Future Internet. Nevertheless, NDN routing schemes present scalability concerns due to the required number of stored routes and control messages. In this work, we present a controller-based routing protocol, using a formal method to specify it unambiguously and to validate it in order to prove its correctness. Our proposal encodes signaling information in content names, avoiding control-message overhead, and reduces router memory requirements by storing only the routes for simultaneously consumed prefixes. Additionally, the protocol installs a new route on all routers in a path with a single route request to the controller, avoiding replication of routing information and automating router provisioning. We describe the protocol using the Specification and Description Language and validate it, proving that the behavior of CRoS is free of deadlocks and livelocks. Furthermore, the validation guarantees that the scheme ensures a valid working path from consumer to producer, even though it does not assure the shortest path.
More and more systems use mobile devices to perform sensing tasks, but these increase the risk of leaking personal privacy and data. Data hiding is an important technique for information security. Although many data hiding algorithms aim at providing more hiding capacity or higher PSNR, few can control PSNR effectively while ensuring hiding capacity. In this paper, we first propose PSNR-Controllable Data Hiding (PCDH), a novel encoding scheme for data hiding with controllable PSNR based on LSB substitution. In PCDH, we use a remainder algorithm to compute the hidden information and hide the secret information in the last x LSBs of every pixel. Theoretical proof shows that this method can control the deviation of the stego image from the cover image and can control PSNR by adjusting parameters in the remainder calculation. We then design encoding and decoding algorithms with low computational complexity. Experimental results show that PCDH can keep PSNR within a given range while ensuring high hiding capacity. In addition, it resists some steganalysis techniques well. Compared with other algorithms, PCDH achieves a better tradeoff among PSNR, hiding capacity, and computational complexity.
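A bare-bones LSB-substitution sketch, hiding a bit string in the last x LSBs of each pixel (cover pixels and parameters are invented; PCDH's remainder-based encoding and PSNR control go beyond this simple embedding):

```python
def embed_lsb(pixels, bits, x=2):
    """Hide the bit string `bits` in the last x LSBs of each 8-bit pixel."""
    stego, i = [], 0
    for p in pixels:
        if i < len(bits):
            chunk = bits[i:i + x].ljust(x, "0")       # pad the final chunk
            p = (p & ~((1 << x) - 1)) | int(chunk, 2) # overwrite the x LSBs
            i += x
        stego.append(p)
    return stego

def extract_lsb(stego, n_bits, x=2):
    """Read back n_bits from the last x LSBs of the stego pixels."""
    out = ""
    for p in stego:
        if len(out) >= n_bits:
            break
        out += format(p & ((1 << x) - 1), "0{}b".format(x))
    return out[:n_bits]

cover = [120, 121, 200, 37, 64]          # hypothetical grayscale pixels
secret = "101101"
stego = embed_lsb(cover, secret, x=2)
print(stego, extract_lsb(stego, len(secret), x=2))  # recovers "101101"
```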
In this paper, we study the problem of privacy information leakage in a smart grid. The privacy risk is assumed to be caused by an unauthorized binary hypothesis test of the consumer's behaviour based on the smart meter readings of the energy supplied by the energy provider. Additional energy is supplied by an alternative energy source. A controller equipped with an energy storage device manages the energy inflows to satisfy the energy demand of the consumer. We study the optimal energy control strategy that minimizes the asymptotic exponential decay rate of the minimum Type II error probability in the unauthorized hypothesis test, thereby suppressing the privacy risk. Our study shows that, under the optimal control strategy, the cardinality of the energy supplies drawn from the energy provider is no more than two. This result implies a simple objective for the optimal energy control strategy. When additional side information is available to the adversary, the optimal control strategy and the privacy risk are compared with the case in which only the smart meter readings are leaked to the adversary.
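For context, the textbook quantity behind such formulations is the Chernoff-Stein error exponent; assuming n i.i.d. observations and a fixed bound on the Type I error (a simplification of the paper's setting with storage and control), the minimum Type II error probability \(\beta_n\) obeys:

```latex
% Chernoff-Stein lemma: the adversary tests H_0 (distribution P_0 of readings
% under one consumer behaviour) against H_1 (distribution P_1 under the other).
% With the Type I error held below a fixed level, the best achievable
% Type II error probability \beta_n over n readings satisfies
\lim_{n \to \infty} -\frac{1}{n} \log \beta_n
  = D(P_0 \,\|\, P_1)
  = \sum_{x} P_0(x) \log \frac{P_0(x)}{P_1(x)}
```

Minimizing this decay rate therefore amounts to keeping the observation distributions under the two behaviours close to each other.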
Precise temperature measurement is very important to the security and stability of superconducting magnet operation. A slight fluctuation in the operating temperature may make a superconducting magnet unstable. This paper presents a low-temperature measurement system based on a C8051 microcontroller unit and platinum resistance thermometers. In the data acquisition process, a modified weighted-average algorithm is applied in the digital filter program of the microcontroller. The noise is effectively reduced, the system can measure the temperature at three different locations simultaneously, and there is no interference among the three channels. The designed system can measure temperatures from 400 K down to 4.0 K with a resolution of 1 mK. The system will be applied in a conduction-cooled Nb3Al superconducting magnet. To verify the feasibility of the system, tests were performed on a small NbTi non-insulated superconducting magnet model. The results show that the measurement system is reliable and the measured temperatures are accurate.
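A minimal sketch of one common "modified weighted average" digital filter, which discards the extreme samples before averaging (the readings and weights are invented; the paper's exact algorithm may differ):

```python
def modified_weighted_average(samples, weights):
    """One common variant: discard the largest and smallest samples,
    then take a weighted average of the rest to suppress spikes."""
    pairs = sorted(zip(samples, weights))[1:-1]   # drop min and max readings
    total_w = sum(w for _, w in pairs)
    return sum(s * w for s, w in pairs) / total_w

# Hypothetical raw readings (in ohms) from one platinum-resistance channel.
readings = [108.91, 108.95, 108.87, 109.40, 108.93]   # 109.40 is a noise spike
weights  = [1, 2, 3, 2, 1]
print(round(modified_weighted_average(readings, weights), 3))   # 108.935
```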
Sharing cyber security data across organizational boundaries brings both privacy risks in the exposure of personal information and data, and organizational risk in disclosing internal information. These risks occur as information leaks in network traffic or logs, and also in queries made across organizations. They are also complicated by the trade-offs between privacy preservation and utility that anonymization introduces to manage disclosure. In this paper, we define three principles that guide sharing security information across organizations: Least Disclosure, Qualitative Evaluation, and Forward Progress. We then discuss engineering approaches that apply these principles to a distributed security system. Application of these principles can reduce the risk of data exposure and help manage trust requirements for data sharing, helping to meet our goal of balancing privacy, organizational risk, and the ability to better respond to security threats with shared information.
Spam filtering is an adversarial application in which data can be purposely manipulated by humans to degrade its operation. Statistical spam filters have been shown to be vulnerable to adversarial attacks. Numerous machine learning systems are used to evaluate security issues related to spam filtering. Pattern classification systems are ordinarily used in such adversarial applications, even though they are based on classical theory and design approaches that do not take adversarial settings into account. Pattern classification systems display vulnerabilities (i.e., weaknesses that allow an attacker to reduce assurance in the system's information) to several potential attacks, allowing adversaries to degrade their effectiveness. In this paper, we address the security evaluation of a pattern classifier for spam email under attacks that degrade the performance of the system. Additionally, a model of the adversary is used that allows defining spam attack scenarios.
With the advancement of technology, the world has not only become a better place to live in but has also lost the privacy and security of shared data. Information in any form is never safe from unauthorized individuals. In this paper, we propose an approach for preserving data using visual cryptography. Text rendered on a sixteen-segment display is broken into two shares that do not reveal any information about the original images. With this process we have obtained satisfactory results in statistical and structural tests.
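A minimal two-share sketch in the spirit of visual secret sharing, using XOR over a small binary image rather than the paper's sixteen-segment construction (all values are illustrative):

```python
import random

# Hypothetical 4x8 binary "segment" image (1 = lit pixel).
secret = [
    [0, 1, 1, 1, 1, 1, 1, 0],
    [0, 1, 0, 0, 0, 0, 1, 0],
    [0, 1, 0, 0, 0, 0, 1, 0],
    [0, 1, 1, 1, 1, 1, 1, 0],
]

# Share 1 is pure random noise; share 2 is secret XOR share 1.
share1 = [[random.randint(0, 1) for _ in row] for row in secret]
share2 = [[s ^ r for s, r in zip(srow, rrow)] for srow, rrow in zip(secret, share1)]

# Each share alone is uniformly random, revealing nothing about the secret;
# XOR-ing (stacking) the two shares reconstructs the original image.
recovered = [[a ^ b for a, b in zip(r1, r2)] for r1, r2 in zip(share1, share2)]
assert recovered == secret
```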
Compressive sensing (CS) is a novel technology for sparse signal acquisition with a sub-Nyquist sampling rate but relatively high resolution. Photonics-assisted CS has attracted much attention recently due to the benefit of the wide bandwidth provided by photonics. This paper discusses approaches to realizing photonics-assisted CS.
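For reference, the standard CS acquisition model that such front ends implement (a textbook formulation, not specific to this paper):

```latex
% Sub-Nyquist acquisition: M << N measurements of a signal x \in R^N that is
% K-sparse in a basis \Psi, taken through a measurement matrix \Phi.
y = \Phi x = \Phi \Psi s, \qquad y \in \mathbb{R}^{M}, \quad \|s\|_{0} \le K
% Recovery is commonly posed as an l1-minimization problem:
\hat{s} = \arg\min_{s} \|s\|_{1} \quad \text{subject to} \quad y = \Phi \Psi s
```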
In this paper, we investigate the performance of a multiple-input multiple-output (MIMO) aided coded interleave-division multiple-access (IDMA) system for secure medical image transmission over wireless channels. We realize the MIMO profile using four transmit antennas at the base station and three receive antennas at the mobile station. We achieve bandwidth efficiency using the discrete wavelet transform (DWT). Further, we implement the Arnold's Cat Map (ACM) encryption algorithm for secure medical transmission. We consider the celulas medical image, which is used to differentiate between normal and carcinogenic cells. To accommodate more users' images, we adopt IDMA as the accessing scheme. At the mobile station (MS), we employ a non-linear minimum mean square error (MMSE) detection algorithm to alleviate the effects of unwanted multi-user image information and multi-stream interference (MSI) in the context of downlink transmission. In particular, we investigate the effects of three types of delay-spread distributions pertaining to the Stanford University Interim (SUI) channel models for encrypted image transmission over the MIMO-IDMA system. Our computer simulations reveal that the DWT-based coded MIMO-IDMA system with ACM provides superior picture quality in downlink communication while offering higher spectral efficiency and security.
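A small sketch of the Arnold's Cat Map scrambling step on an N×N image (one common form of the map; the image, size, and iteration count are invented, and the paper combines ACM with DWT and MIMO-IDMA transmission):

```python
import numpy as np

def arnold_cat_map(img, iterations=1):
    """Scramble a square image with the map (x, y) -> (x + y, x + 2y) mod N."""
    n = img.shape[0]
    assert img.shape[0] == img.shape[1], "ACM needs a square image"
    out = img.copy()
    for _ in range(iterations):
        scrambled = np.empty_like(out)
        for x in range(n):
            for y in range(n):
                scrambled[(x + y) % n, (x + 2 * y) % n] = out[x, y]
        out = scrambled
    return out

# Hypothetical 4x4 "image"; the map is periodic, so iterating enough times
# (the period depends on N) returns the original, which is used for decryption.
img = np.arange(16, dtype=np.uint8).reshape(4, 4)
scrambled = arnold_cat_map(img, iterations=3)
print(scrambled)
```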
Turbo codes have been one of the important subjects in coding theory since 1993. These codes achieve a low Bit Error Rate (BER), but decoding complexity and delay are big challenges. On the other hand, considering the complexity and delay of separate blocks for coding and encryption, combining these processes can guarantee both the security and the reliability of the communication system. In this paper, a secure parallel decoding algorithm on General-Purpose Graphics Processing Units (GPGPUs) is proposed. This is the first prototype of a fast and parallel Joint Channel-Security Coding (JCSC) system. Despite the encryption process, the algorithm maintains the desired BER and increases decoding speed. We consider several techniques for parallelism: (1) distributing the decoding load of a code word among multiple cores, (2) decoding several code words simultaneously, and (3) using protection techniques to prevent performance degradation. We also propose two kinds of optimizations to increase the decoding speed: (1) memory access improvement, and (2) the use of new GPU features such as concurrent kernel execution and advanced atomics to compensate for buffering latency.
Today's era of the Internet of Things, cloud computing, and big data centers calls for more fresh graduates with expertise in digital data processing techniques such as compression, encryption, and error-correcting codes. This paper describes a project-based elective that covers these three main digital data processing techniques and can be offered to three different undergraduate majors: electrical engineering, computer engineering, and computer science. The course has been offered successfully for three years. Registration statistics show equal interest from the three majors. Assessment data show that students have successfully completed the different course outcomes. Student feedback shows that students appreciate the knowledge they attain from this elective and suggests that the workload for this course, relative to other courses of equal credit, is as expected.