Biblio
Adaptivity is an important feature of data analysis: the choice of questions to ask about a dataset often depends on previous interactions with the same dataset. However, statistical validity is typically studied in a nonadaptive model, where all questions are specified before the dataset is drawn. Recent work by Dwork et al. (STOC, 2015) and Hardt and Ullman (FOCS, 2014) initiated a general formal study of this problem, and gave the first upper and lower bounds on the achievable generalization error for adaptive data analysis. Specifically, suppose there is an unknown distribution P and a set of n independent samples x is drawn from P. We seek an algorithm that, given x as input, accurately answers a sequence of adaptively chosen "queries" about the unknown distribution P. How many samples n must we draw from the distribution, as a function of the type of queries, the number of queries, and the desired level of accuracy? In this work we make two new contributions towards resolving this question: We give upper bounds on the number of samples n that are needed to answer statistical queries. The bounds improve and simplify the work of Dwork et al. (STOC, 2015), and have been applied in subsequent work by those authors (Science, 2015; NIPS, 2015). We prove the first upper bounds on the number of samples required to answer more general families of queries. These include arbitrary low-sensitivity queries and an important class of optimization queries (alternatively, risk minimization queries). As in Dwork et al., our algorithms are based on a connection with algorithmic stability in the form of differential privacy. We extend their work by giving a quantitatively optimal, more general, and simpler proof of their main theorem that the stability notion guaranteed by differential privacy implies low generalization error. We also show that weaker stability guarantees such as bounded KL divergence and total variation distance lead to correspondingly weaker generalization guarantees.
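A minimal sketch of the stability-based idea behind the statistical-query upper bounds, assuming the standard Laplace mechanism as the differentially private primitive; the function name, per-query epsilon, and noise calibration below are illustrative simplifications, not the authors' exact algorithm or its optimal accounting:

    import numpy as np

    def answer_statistical_queries(samples, queries, epsilon_per_query=0.1):
        """Answer statistical queries q: X -> [0, 1] on a fixed sample, adding
        Laplace noise so each released answer is differentially private.
        In the adaptive setting each query could depend on earlier answers;
        here they are passed as a list for simplicity."""
        n = len(samples)
        answers = []
        for q in queries:
            empirical_mean = np.mean([q(x) for x in samples])
            # The sensitivity of a [0, 1]-valued empirical mean is 1/n.
            noise = np.random.laplace(loc=0.0, scale=1.0 / (n * epsilon_per_query))
            answers.append(empirical_mean + noise)
        return answers

    # Hypothetical usage: estimate P(x > 0) from standard-normal samples.
    samples = np.random.randn(10_000)
    print(answer_statistical_queries(samples, [lambda x: float(x > 0)]))

Because differential privacy bounds how much any single sample can influence the released answers, the noisy answers generalize to the distribution P even when later queries are chosen adaptively; the abstract's main theorem is a quantitatively optimal version of exactly this implication.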
A finite state machine (FSM) is responsible for controlling the overall functionality of most digital systems and, therefore, the security of the whole system can be compromised if there are vulnerabilities in the FSM. These vulnerabilities can be created by improper designs or by the synthesis tool, which introduces additional don't-care states and transitions during the optimization and synthesis process. An attacker can utilize these vulnerabilities to perform fault injection attacks or insert malicious hardware modifications (Trojans) to gain unauthorized access to some specific states. To our knowledge, no systematic approaches have been proposed to analyze these vulnerabilities in FSMs. In this paper, we develop a framework named Analyzing Vulnerabilities in FSM (AVFSM) which extracts the state transition graph (including the don't-care states and transitions) from a gate-level netlist using a novel Automatic Test Pattern Generation (ATPG) based approach and quantifies the vulnerabilities of the design to fault injection and hardware Trojan insertion. We demonstrate the applicability of the AVFSM framework by analyzing the vulnerabilities in the FSMs of AES and RSA encryption modules. We also propose a low-cost mitigation technique to make FSMs more secure against these attacks.
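As a toy illustration of the kind of reachability question such an analysis asks, the sketch below checks whether any don't-care state in a hand-written, hypothetical state transition graph has a path into a protected state; the graph, state names, and the simple reachability check are illustrative assumptions, not AVFSM's actual ATPG-based extraction or vulnerability scoring:

    from collections import deque

    def reachable(graph, start):
        """All states reachable from `start` in a graph given as {state: [successors]}."""
        seen, queue = {start}, deque([start])
        while queue:
            s = queue.popleft()
            for t in graph.get(s, []):
                if t not in seen:
                    seen.add(t)
                    queue.append(t)
        return seen

    # Hypothetical FSM: 'DC0'/'DC1' are don't-care states introduced by synthesis,
    # 'UNLOCKED' is a protected (security-critical) state.
    fsm = {
        'IDLE': ['CHECK'], 'CHECK': ['IDLE', 'UNLOCKED'], 'UNLOCKED': ['IDLE'],
        'DC0': ['UNLOCKED'],   # dangerous shortcut into the protected state
        'DC1': ['IDLE'],
    }
    dont_care, protected = {'DC0', 'DC1'}, {'UNLOCKED'}
    vulnerable = {s for s in dont_care if reachable(fsm, s) & protected}
    print("don't-care states with a path to a protected state:", vulnerable)

A fault injection that flips the state register into 'DC0' would reach the protected state without passing the check, which is the class of vulnerability the framework quantifies.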
Parallel and distributed systems rely on intricate protocols to manage shared resources and synchronize, i.e., to manage how many processes are in a particular state. Effective verification of such systems requires universal quantification to reason about parameterized state and cardinality reasoning to track sets of processes, messages, and failures, in order to adequately capture protocol logic. In this paper we present Tool, an automatic invariant synthesis method that integrates cardinality-based reasoning and universal quantification. The resulting increase in expressiveness allows Tool to verify, for the first time, a representative collection of intricate parameterized protocols.
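As a hypothetical example of an invariant that needs both forms of reasoning (not taken from the paper), a mutual-exclusion-style parameterized protocol requires a universally quantified per-process fact together with a global cardinality bound:

    % Hypothetical invariant combining universal quantification and cardinality:
    % every process in the critical section holds the lock, and at most one
    % process is in the critical section at any time.
    \[
      \big(\forall p.\ \mathit{pc}[p] = \mathsf{crit} \Rightarrow \mathit{holds\_lock}[p]\big)
      \;\wedge\;
      \big|\{\, p \mid \mathit{pc}[p] = \mathsf{crit} \,\}\big| \le 1
    \]

Typically neither conjunct alone is inductive for such protocols, which is why integrating the two styles of reasoning matters for automatic invariant synthesis.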
Due to the evolution of LED lighting and information technology, the application of media facades has expanded rapidly. Despite the positive aspects of media facades, their growth can cause light pollution and add to the visual confusion of the city. This study analyzes the case of Seoul, which implements citywide management of media facades with a master plan. Through this, the study aims to investigate the meaning of citywide management of media facades installed on individual buildings. First, it investigates the conditions of media facades in Seoul City; the identified problems demonstrate the necessity of citywide management of media facades. Second, it analyzes the progress of Seoul City's media facade regulation: the management target has shifted from curbing the indiscreet installation of individual media facades to inducing attractive media facades for Seoul City as a whole. To this end, the 'Seoul Media Facade Management MasterPlan' was drafted by the Seoul government to establish citywide management. Third, it analyzes the MasterPlan: the management tools in the MasterPlan are classified into regional management, elemental management, and specialization plans, each with detailed approaches. Finally, the study discusses the meaning of citywide management in the respects that media facades are a cultural asset of the city, that regional differentiation is adopted, and that continuous maintenance of both hardware and content is important. Media facades utilizing the facades of buildings are recognized as an element of the urban landscape that secures publicness, contributes to the vitalization of the area, and ultimately provides pleasure to the citizens.
Crowdsourcing is a unique and practical approach to obtaining personalized data and content. Its impact is especially significant in providing commentary, reviews, and metadata on a variety of location-based services. In this study, we examine the reliability of the Waze mapping service and its vulnerability to a variety of location-based attacks. Our goals are to understand the severity of the problem, shed light on the general problem of location and device authentication, and explore the efficacy of potential defenses. Our preliminary results already show that a single attacker with limited resources can wreak havoc on Waze, producing "virtual" congestion and accidents, automatically re-routing user traffic, and compromising user privacy by tracking users' precise movements via software while staying undetected. To defend against these attacks, we propose a proximity-based Sybil detection method to filter out malicious devices.
Hardware support for isolated execution (such as Intel SGX) enables development of applications that keep their code and data confidential even while running in a hostile or compromised host. However, automatically verifying that such applications satisfy confidentiality remains challenging. We present a methodology for designing such applications in a way that enables certifying their confidentiality. Our methodology consists of forcing the application to communicate with the external world through a narrow interface, compiling it with runtime checks that aid verification, and linking it with a small runtime that implements the narrow interface. The runtime includes services such as secure communication channels and memory management. We formalize this restriction on the application as Information Release Confinement (IRC), and we show that it allows us to decompose the task of proving confidentiality into (a) one-time, human-assisted functional verification of the runtime to ensure that it does not leak secrets, (b) automatic verification of the application's machine code to ensure that it satisfies IRC and does not directly read or corrupt the runtime's internal state. We present /CONFIDENTIAL: a verifier for IRC that is modular, automatic, and keeps our compiler out of the trusted computing base. Our evaluation suggests that the methodology scales to real-world applications.
The domain name system (DNS) offers an ideal distributed database for big data mining related to different cyber security questions. Beyond infrastructural problems, scalability issues, and security challenges related to the protocol itself, information from DNS is also often required for more nuanced cyber security questions. Against this backdrop, this paper discusses the fundamental characteristics of DNS in relation to cyber security and different research prototypes designed for passive but continuous DNS-based monitoring of domains and addresses. With this discussion, the paper also illustrates a few general software design aspects.
While email plays an increasingly important role on the Internet, we face more severe challenges from compromised email accounts, especially for the administrators of institutional email service providers. Inspired by previous experience with spam filtering and compromised-account detection, we propose several criteria, such as Success Outdegree Proportion, Reverse PageRank, Recipient Clustering Coefficient, and Legitimate Recipient Proportion, for compromised email account detection from the perspective of graph topology. Specifically, several widely used social network analysis metrics are adapted according to the characteristics of mail log analysis. We evaluate our methods on a dataset constructed by mining one month (30 days) of mail logs from a university with 118,617 local users and 11,460,399 mail log entries. The experimental results demonstrate that our methods achieve very positive performance, and we also show that these methods can be efficiently applied to even larger datasets.
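To make the graph-topology perspective concrete, the sketch below computes two of the named criteria in approximate form on a tiny made-up mail graph, using standard networkx routines (reverse PageRank as PageRank on the edge-reversed graph, and a plain clustering coefficient as a stand-in for the Recipient Clustering Coefficient); the edges are hypothetical and the metrics are not the paper's exact definitions:

    import networkx as nx

    # Hypothetical mail log: (sender, recipient) edges extracted from SMTP logs.
    edges = [("alice", "bob"), ("alice", "carol"), ("bob", "carol"),
             ("spammy", "u1"), ("spammy", "u2"), ("spammy", "u3"), ("spammy", "u4")]
    G = nx.DiGraph(edges)

    # Reverse PageRank: PageRank computed on the edge-reversed graph, so that
    # rank flows from recipients back to senders.
    reverse_pr = nx.pagerank(G.reverse())

    # Clustering coefficient on the undirected projection: how densely an
    # account's contacts mail each other.
    clustering = nx.clustering(G.to_undirected())

    for account in sorted(G.nodes):
        print(account, round(reverse_pr[account], 3), round(clustering[account], 3))

In this toy graph the fan-out-only account ('spammy') sends to recipients who never mail each other, so its clustering coefficient is zero, one example of the topological signals the paper combines.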
Media streaming has largely dominated Internet traffic, and the trend will continue in the coming years. To efficiently distribute media content, Information-Centric Networking (ICN) has attracted many researchers. Since end users usually obtain content from indeterminate caches in ICN, the publisher cannot enforce data security and access control through the caches. Hence, self-contained protection is important for the cached contents. Attribute-based encryption (ABE) is considered the preferred solution to achieve this goal. However, existing ABE schemes usually have efficiency problems: the number of exponentiations in key generation and of pairing operations in decryption each grows linearly with the number of attributes involved, which makes them costly. In this paper, we propose an efficient key-policy ABE with fast key generation and decryption (FKP-ABE). In key generation, we eliminate exponentiation and require only multiplications/divisions for each attribute in the access policy. In decryption, we reduce the pairing operations to a constant number, no matter how many attributes are used. The efficiency analysis indicates that our scheme has better performance than existing KP-ABE schemes. Finally, we present an implementation framework that incorporates the proposed FKP-ABE into the ICN architecture.
Information Systems curricula require ongoing and frequent review [2] [11]. Furthermore, such curricula must be flexible because of the fast-paced, dynamic nature of the workplace. Such flexibility can be maintained through modernizing course content, including exchanging hardware or software for newer versions. Alternatively, flexibility can arise from incorporating new information into curricula from other disciplines. One field where the pace of change is extremely high is cybersecurity [3]. Students are left with outdated skills when curricula lag behind the pace of change in industry. For example, cryptography is a required learning objective in the DHS/NSA Center of Academic Excellence (CAE) knowledge criteria [1]. However, the overarching curriculum associated with basic ciphers has gone unchanged for decades. Indeed, a general problem in cybersecurity education is that students lack fundamental knowledge in areas such as ciphers [5]. In response, researchers have developed a variety of interactive classroom visualization tools [5] [8] [9]. Such tools visualize the standard approach to frequency analysis of simple substitution ciphers, which includes a review of the most common single letters in the ciphertext. While fundamental ciphers such as the monoalphabetic substitution cipher have not been updated (these are historical ciphers), collective understanding of how humans interact with language has changed. Updated understanding in both English language pedagogy [10] [12] and automated cryptanalysis of substitution ciphers [4] potentially renders the interactive classroom visualization tools incomplete or outdated. Classroom visualization tools are powerful teaching aids, particularly for abstract concepts. Existing research has established that such tools promote an active learning environment that translates to not only effective learning conditions but also higher student retention rates [7]. However, visualization tools require extensive planning and design when used to actively engage students with detailed, specific knowledge units such as ciphers [7] [8]. Accordingly, we propose a heatmap-based frequency analysis visualization solution that (a) incorporates digraph and trigraph language-processing norms and (b) enhances the active learning pedagogy inherent in visualization tools. Preliminary results indicate that study participants take approximately 15% longer to learn the heatmap-based frequency analysis technique compared to traditional frequency analysis but demonstrate a 50% increase in efficacy when tasked with solving simple substitution ciphers. Further, a heatmap-based solution contributes positively to the field insofar as educators have an additional tool to use in the classroom. As well, the heatmap visualization tool may allow researchers to comparatively examine the efficacy of visualization tools in the cryptanalysis of monoalphabetic substitution ciphers.
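To make the digraph and trigraph idea concrete, the sketch below computes single-letter and digraph frequencies for a ciphertext and renders them as a crude console heatmap; the ciphertext (a Caesar-shifted pangram), function names, and bar rendering are hypothetical stand-ins for the proposed classroom visualization:

    from collections import Counter
    import string

    def frequencies(ciphertext):
        """Single-letter and digraph counts for an A-Z ciphertext (spaces ignored)."""
        text = [c for c in ciphertext.upper() if c in string.ascii_uppercase]
        singles = Counter(text)
        digraphs = Counter(a + b for a, b in zip(text, text[1:]))
        return singles, digraphs

    def text_heatmap(counter, top=10, width=40):
        """Crude console 'heatmap': bar length proportional to relative frequency."""
        total = sum(counter.values())
        for gram, count in counter.most_common(top):
            print(f"{gram:>2} {count:3d} " + "#" * int(width * count / total))

    singles, digraphs = frequencies("WKH TXLFN EURZQ IRA MXPSV RYHU WKH ODCB GRJ")
    text_heatmap(singles)
    text_heatmap(digraphs)

Extending the same counting to trigraphs, and coloring a two-dimensional letter-by-letter grid instead of printing bars, would give a heatmap view closer to what the study proposes.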
It is common practice for data scientists to acquire and integrate disparate data sources to achieve higher quality results. But even with a perfectly cleaned and merged data set, two fundamental questions remain: (1) is the integrated data set complete and (2) what is the impact of any unknown (i.e., unobserved) data on query results? In this work, we develop and analyze techniques to estimate the impact of the unknown data (a.k.a., unknown unknowns) on simple aggregate queries. The key idea is that the overlap between different data sources enables us to estimate the number and values of the missing data items. Our main techniques are parameter-free and do not assume prior knowledge about the distribution. Through a series of experiments, we show that estimating the impact of unknown unknowns is invaluable to better assess the results of aggregate queries over integrated data sources.
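The overlap intuition can be illustrated with a classical capture-recapture (Lincoln-Petersen) estimate over two sources; this is a deliberately simplified, hypothetical stand-in for the paper's parameter-free estimators, with made-up source contents:

    # Hypothetical integrated data: two sources listing businesses and their revenues.
    source_a = {"r1": 40, "r2": 55, "r3": 30, "r5": 70}
    source_b = {"r2": 55, "r3": 30, "r4": 25, "r5": 70, "r6": 45}

    observed = {**source_a, **source_b}
    overlap = len(source_a.keys() & source_b.keys())

    # Lincoln-Petersen estimate of the total population size from the overlap.
    est_total = len(source_a) * len(source_b) / overlap
    est_unknown = max(0.0, est_total - len(observed))

    # Naively extrapolate the impact on a SUM aggregate using the observed mean.
    observed_sum = sum(observed.values())
    mean_value = observed_sum / len(observed)
    print(f"observed SUM = {observed_sum}, "
          f"estimated unknown items = {est_unknown:.1f}, "
          f"estimated true SUM = {observed_sum + est_unknown * mean_value:.1f}")

The larger the overlap between sources, the smaller the estimated number of unknown unknowns, which matches the intuition stated in the abstract.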
Systematic implementation of System-on-Chip (SoC) security policies typically involves smart wrappers extracting local security-critical events of interest from Intellectual Property (IP) blocks, together with a control engine that communicates with the wrappers to analyze the events for policy adherence. However, developing customized wrappers at each IP for security requirements may incur significant overhead in area and hardware resources. In this paper, we address this problem by exploiting the extensive design-for-debug (DfD) instrumentation already available on-chip. In addition to reducing the overall hardware overhead, the approach also adds flexibility to the security architecture itself, e.g., permitting use of on-field DfD instrumentation, survivability, and control hooks to patch the security policy implementation in response to bugs and attacks found at post-silicon or to changing security requirements on-field. We show how to design a scalable interface between the security and debug architectures that provides this flexibility to security policy implementation without interfering with existing debug and survivability use cases and at minimal additional cost in energy and design complexity.
The current trend for large scientific computing problems is to align as much as possible with a Single Program Multiple Data (SPMD) scheme when the application algorithms are conducive to parallelization and vectorization. This reduces the complexity of the code because the processors (or computational nodes) perform the same instructions, which allows for better performance as algorithms work on local data sets instead of continuously transferring data from one locality to another. However, certain applications, such as stencil problems, need to move data to or from remote localities. This involves an additional degree of complexity, as one must know with which localities to exchange data. In order to solve this issue, Fortran has extended its scalar element indexing approach to distributed structures of elements. In this extension, a structure of scalar elements is given a "co-index" and lives in a specific locality. A co-index provides the application with enough information to retrieve the corresponding data reference. In C++, containers present themselves as a "smarter" alternative to Fortran arrays, but there are still no corresponding standardized features similar to the Fortran co-indexing approach. In this paper, we present an implementation of such features in HPX, a general purpose C++ runtime system for applications of any scale. We describe how the combination of the HPX features and the current C++ Standard makes it easy to define a high-performance API similar to Co-Array Fortran.
The face is the most dominant and distinct communication tool of human beings. Automatic analysis of facial behavior allows machines to understand and interpret a human's states and needs for natural interactions. This research focuses on developing advanced computer vision techniques to process and analyze facial images for the recognition of various facial behaviors. Specifically, this research consists of two parts: automatic facial landmark detection and tracking, and facial behavior analysis and recognition using the tracked facial landmark points. In the first part, we develop several facial landmark detection and tracking algorithms for facial images under varying conditions, such as varying facial expressions, head poses, and facial occlusions. First, to handle facial expression and head pose variations, we introduce a hierarchical probabilistic face shape model and a discriminative deep face shape model to capture the spatial relationships among facial landmark points under different facial expressions and face poses and thereby improve facial landmark detection. Second, to handle facial occlusion, we improve upon the effective cascade regression framework and propose a robust cascade regression framework for facial landmark detection, which iteratively predicts the landmark visibility probabilities and landmark locations. The second part of this research applies our facial landmark detection and tracking algorithms to facial behavior analysis, including facial action recognition and head pose estimation. For facial action recognition, we introduce a novel regression framework for joint facial landmark detection and facial action recognition. For head pose estimation, we are working on a robust algorithm that can perform head pose estimation under facial occlusion.
SDSM is a novel approach for providing secure directory-based distributed shared memory to CPUs that are connected via an untrusted medium. SDSM scales efficiently to thousands of CPUs with less than 0.8% performance reduction, compared to a system with no security.
Record linkage refers to the task of finding records that refer to the same entity across different databases. We propose a machine learning based record linkage algorithm for financial entity databases. Record linkage on financial databases is essential for integrating information about a given financial entity, since those databases do not share a common unified identifier. Our algorithm works in two steps to determine whether a pair of records refers to the same entity. First, we check with proposed rules whether the record pair can be exactly matched after cleaning the entity name and address. Second, inspired by earlier work on author name disambiguation, we train a binary Random Forest classifier to decide the linkage. To reduce the computation and scale to large datasets, this process is applied only to candidate pairs selected by a proposed heuristic. Initial evaluation of precision, recall, and F1 measures on two different linking tasks in the Financial Entity Identification and Information Integration (FEIII) Challenge shows promising results.
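A minimal sketch of the second, learned step, assuming hypothetical pair features from a blocking heuristic; the feature set, training pairs, and scores below are placeholders, not the FEIII features or data used in the paper:

    from sklearn.ensemble import RandomForestClassifier

    # Hypothetical features for candidate record pairs produced by a blocking
    # heuristic: [name_similarity, address_similarity, same_zip, same_state].
    X_train = [
        [0.95, 0.90, 1, 1],   # matching pair
        [0.30, 0.20, 0, 1],   # non-matching pair
        [0.88, 0.75, 1, 1],
        [0.45, 0.10, 0, 0],
    ]
    y_train = [1, 0, 1, 0]    # 1 = same entity, 0 = different entities

    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(X_train, y_train)

    # Score a new candidate pair that was not resolved by the exact-match rules.
    candidate_pair = [[0.82, 0.60, 1, 0]]
    print("linked" if clf.predict(candidate_pair)[0] == 1 else "not linked")

In practice the classifier is trained on labeled pairs and invoked only for the candidate pairs that survive the cleaning rules and the blocking heuristic, which keeps the otherwise quadratic pair comparison tractable.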
Open data is publicly available data that can be universally and readily accessed, used, and redistributed. Open data holds particular potential in the health and social sectors but, presently, health and social data are often published in a 'closed' format. There are different tools that make it possible to 'open', clean, structure, and process data in order to build advanced services on top of them but, unfortunately, no single tool can perform all of these tasks. We believe that the availability of Open Data in the health and social fields should be greatly increased and that a way of creating new health and social services should be provided. In this paper, we present a framework that allows users to create health and social Open Data starting from whatever is available on the web and to easily build advanced services based on those data.
In the beginning was the single core ... Then we moved to multicore, before we were fully ready for it! Then GPUs appeared on the scene, giving us very high performance for some types of applications ... What is next? How can we get more performance? The very near future will be the era of heterogeneous computing. We already have a glimpse of it now; you write code for multicore and GPUs together, right? As computer systems become more and more heterogeneous (cores of different capabilities, GPUs, application-specific hardware, ...), writing efficient code for them becomes more and more challenging. What type of heterogeneity are we talking about? Why do we need this heterogeneity? How can we write software that makes the best use of it? ... These are the topics we will discuss in this talk.
Different mobile applications have different QoS requirements, so these requirements must be resolved onto the underlying mesh network to support them. Existing approaches to coordinating application traffic requirements with the underlying network have been applied in wired domains. However, this is more complex in the wireless domain due to the mobility and diversity of mobile applications. Much interest is focused on resolving application QoS and matching requests to mesh network link availability. We propose a testbed architecture which allows dynamic configuration of mesh networks and coordination of each flow to support application-aware QoS. Our prototype testbed shows adaptive changes in mesh network routing configuration depending on application requests.
The cloud has become an established and widespread paradigm. This success is due to the flexibility and savings provided by this technology. However, the main obstacle to full cloud adoption is security. Like many other systems that take advantage of the Internet, the cloud faces threats that compromise data confidentiality and availability. In addition, new cloud-specific attacks have emerged, and current intrusion detection and prevention mechanisms are not enough to protect the complex infrastructure of the cloud from these vulnerabilities. Furthermore, one of the promises of the cloud is Quality of Service (QoS) through continuous delivery, which must be ensured even in the case of intrusion. This work presents an overview of the main cloud vulnerabilities, along with the solutions proposed in the context of the H2020 CLARUS project in terms of monitoring techniques for intrusion detection and prevention, including attack-tolerance mechanisms.
Convolution serves as the basic computational primitive for various associative computing tasks ranging from edge detection to image matching. CMOS implementation of such computations entails significant bottlenecks in area and energy consumption due to the large number of multiplication and addition operations involved. In this paper, we propose an ultra-low power and compact hybrid spintronic-CMOS design for the convolution computing unit. Low-voltage operation of domain-wall-motion-based magneto-metallic "Spin-Memristors" interfaced with CMOS circuits is able to perform the convolution operation with reasonable accuracy. Simulation results of Gabor filtering for edge detection reveal ~2.5× lower energy consumption compared to a baseline 45nm CMOS implementation.
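For reference, the workload such a unit accelerates is ordinary 2D convolution; the sketch below runs a generic Gabor filter over a toy image (the kernel parameters and image are made up and unrelated to the spintronic implementation itself):

    import numpy as np
    from scipy.signal import convolve2d

    def gabor_kernel(size=9, sigma=2.0, theta=0.0, lam=4.0):
        """Real part of a simplified Gabor filter, a standard edge-detection kernel."""
        half = size // 2
        y, x = np.mgrid[-half:half + 1, -half:half + 1]
        x_t = x * np.cos(theta) + y * np.sin(theta)
        y_t = -x * np.sin(theta) + y * np.cos(theta)
        return np.exp(-(x_t**2 + y_t**2) / (2 * sigma**2)) * np.cos(2 * np.pi * x_t / lam)

    # Toy image with a vertical edge; every output pixel is a sum of products,
    # i.e., exactly the multiply-accumulate workload the hybrid unit targets.
    image = np.zeros((32, 32))
    image[:, 16:] = 1.0
    response = convolve2d(image, gabor_kernel(), mode="same")
    print(response.shape, float(np.abs(response).max()))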
Computer systems face the threat of deliberate security intrusions due to malicious attacks that exploit security holes or vulnerabilities. In practice, these security holes or vulnerabilities remain in the system and its applications even if developers carefully execute system testing. Thus it is necessary and important to develop mechanisms to prevent and/or tolerate security intrusions. As a result, computer systems are often evaluated with confidentiality, integrity, and availability (CIA) criteria from the viewpoint of security, and security is treated as a QoS (Quality of Service) attribute on par with other QoS attributes such as capacity and performance. In this paper, we present a method, based on queueing theory, for quantifying a security attribute called mean time to security failure (MTTSF) of a VM-based intrusion-tolerant system.
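As an illustration of what an MTTSF computation can look like, the sketch below computes the mean time to absorption of a small, entirely hypothetical attack-progression Markov chain; the state space, transition rates, and the Markov simplification are illustrative assumptions, not the queueing model of the paper:

    import numpy as np

    # Hypothetical CTMC over system states:
    # 0 = healthy, 1 = vulnerable, 2 = active attack, 3 = security failure (absorbing).
    # rates[i][j] is the transition rate from state i to state j (per hour).
    rates = np.array([
        [0.00, 0.10, 0.00, 0.00],   # healthy -> vulnerability discovered
        [0.05, 0.00, 0.20, 0.00],   # vulnerable -> patched, or exploited
        [0.00, 0.00, 0.00, 0.30],   # active attack -> security failure
        [0.00, 0.00, 0.00, 0.00],   # absorbing failure state
    ])

    # Generator restricted to the transient states; the vector of mean times to
    # absorption m solves (-Q) m = 1, so MTTSF is its first entry.
    Q = rates[:3, :3].copy()
    np.fill_diagonal(Q, -rates[:3].sum(axis=1))
    m = np.linalg.solve(-Q, np.ones(3))
    print(f"MTTSF starting from the healthy state: {m[0]:.1f} hours")

Intrusion-tolerance mechanisms enter such a model as additional recovery transitions out of the attack states, which lengthens the mean time to reach the absorbing failure state.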
Ubiquitous WiFi infrastructure and smart phones offer a great opportunity to study physical activities. In this paper, we present MobiCamp, a large-scale testbed for studying mobility-related activities of residents on a campus. MobiCamp consists of ~2,700 APs, ~95,000 smart phones, and an App with ~2,300 opt-in volunteer users. More specifically, we capture how mobile users interact with different types of buildings, with other users, and with classroom courses, etc. To achieve this goal, we first obtain a relatively complete coverage of the users' mobility traces by utilizing four types of information from SNMP and by relaxing the location granularity to roughly the room level. Then the popular App provides user attributes (grade, gender, etc.) and fine-grained behavior information (phone usage, course timetables, etc.) of the sampled population. These detailed mobile data are then correlated with the mobility traces from SNMP to estimate the entire campus population's physical activities. We use two applications to show the power of MobiCamp.
We summarize the optimization and performance evaluation of the Nonhydrostatic ICosahedral Atmospheric Model (NICAM) on two different types of supercomputers: the K computer and TSUBAME2.5. First, we evaluated and improved several kernels extracted from the model code on the K computer. We did not need to significantly change the loop and data ordering to make sufficient use of the features of the K computer, such as the hardware-aided thread barrier mechanism and the relatively high memory bandwidth, i.e., a 0.5 Byte/FLOP ratio. Loop optimizations and code cleaning for a reduction in memory transfer contributed to a speed-up of the model execution time. The sustained performance of the main loop of NICAM reached 0.87 PFLOPS with 81,920 nodes on the K computer. For GPU-based calculations, we applied OpenACC to the dynamical core of NICAM. The performance and scalability were evaluated using the TSUBAME2.5 supercomputer. We achieved good performance results, showing efficient use of the GPU's memory throughput as well as good weak scalability. A dry dynamical core experiment was carried out using 2560 GPUs, achieving 60 TFLOPS of sustained performance.
It is expected that DRAM memory will be augmented, and perhaps eventually replaced, by one of several up-and-coming memory technologies. These are all non-volatile, in that they retain their contents without power. This allows primary memory to be used as a fast disk replacement. It also enables more aggressive programming models that directly leverage the persistence of primary memory. However, it is challenging to maintain consistency of memory in such an environment. There is no consensus on the right programming model for doing so, and subtle differences can have large, and sometimes surprising, effects on the implementation and its performance. The existing literature describes multiple programming systems that provide point solutions for selective persistence of user data structures. Real progress in this area requires a choice of programming model, which we cannot reasonably make without a real understanding of the design space. Point solutions are insufficient. We systematically explore what we consider to be the most promising part of the space, precisely defining semantics and identifying implementation costs. This allows us to be much more explicit and precise about semantic and implementation trade-offs that were usually glossed over in prior work. It also exposes some promising new design alternatives.