Biblio
It is common practice for data scientists to acquire and integrate disparate data sources to achieve higher quality results. But even with a perfectly cleaned and merged data set, two fundamental questions remain: (1) is the integrated data set complete and (2) what is the impact of any unknown (i.e., unobserved) data on query results? In this work, we develop and analyze techniques to estimate the impact of the unknown data (a.k.a., unknown unknowns) on simple aggregate queries. The key idea is that the overlap between different data sources enables us to estimate the number and values of the missing data items. Our main techniques are parameter-free and do not assume prior knowledge about the distribution. Through a series of experiments, we show that estimating the impact of unknown unknowns is invaluable to better assess the results of aggregate queries over integrated data sources.
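The abstract does not name the estimator, but overlap-based estimation of unseen items is classically done with species estimators such as Chao92. A minimal, hypothetical Python sketch of the idea (not the authors' code):

```python
from collections import Counter

def estimate_missing(observations):
    """Chao92-style estimate of the number of unseen items.
    `observations` pools item ids from all sources; duplicates
    arise from the overlap between sources."""
    counts = Counter(observations)
    c = len(counts)                                  # distinct items seen
    n = sum(counts.values())                         # total sightings
    f1 = sum(1 for k in counts.values() if k == 1)   # seen exactly once
    coverage = 1 - f1 / n                            # Good-Turing coverage
    return c / coverage - c if coverage > 0 else float("inf")

# Two sources observing overlapping items "a" and "b" raise the coverage:
print(round(estimate_missing(["a", "b", "c", "a", "b", "d"]), 2))  # 2.0
# A SUM query could then be corrected by missing_count * mean(observed).
```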
Record linkage refers to the task of finding the same entity across different databases. We propose a machine-learning-based record linkage algorithm for financial entity databases. Record linkage on financial databases is essential for integrating information on a given financial entity, since those databases do not share a common unified identifier. Our algorithm works in two steps to determine whether a pair of records refers to the same entity. First, we check with proposed rules whether the record pair can be exactly matched after cleaning the entity name and address. Second, inspired by earlier work on author name disambiguation, we train a binary Random Forest classifier to decide the linkage. To reduce and scale the computation, this process is done only for candidate pairs selected by a proposed heuristic. Initial evaluation of precision, recall, and F1 measures on two different linking tasks in the Financial Entity Identification and Information Integration (FEIII) Challenge shows promising results.
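As a rough illustration of the two-step pipeline (rule-based exact matching after cleaning, then a Random Forest over heuristically blocked candidate pairs), here is a hypothetical Python sketch; the cleaning rules, blocking key, and pair features are our assumptions, not the paper's:

```python
import re
from sklearn.ensemble import RandomForestClassifier

def clean_name(name):
    """Step 1 normalization before exact matching (illustrative rules)."""
    name = re.sub(r"\b(inc|corp|llc|ltd|co)\b\.?", "", name.lower())
    return re.sub(r"[^a-z0-9 ]", " ", name).split()

def candidate_pairs(records_a, records_b):
    """Blocking heuristic: only compare records sharing a name token."""
    index, by_id = {}, {b["id"]: b for b in records_b}
    for b in records_b:
        for tok in clean_name(b["name"]):
            index.setdefault(tok, set()).add(b["id"])
    for a in records_a:
        matched = set()
        for tok in clean_name(a["name"]):
            matched |= index.get(tok, set())
        for bid in matched:
            yield a, by_id[bid]

def pair_features(a, b):
    """Step 2 features, e.g., token-set Jaccard of names (illustrative)."""
    ta, tb = set(clean_name(a["name"])), set(clean_name(b["name"]))
    return [len(ta & tb) / max(len(ta | tb), 1)]

clf = RandomForestClassifier(n_estimators=100)
# clf.fit([pair_features(a, b) for a, b in labeled_pairs], labels)
```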
Open data is publicly available data that can be universally and readily accessed, used, and redistributed. Open data holds particular potential in the health and social sectors but, presently, health and social data are often published in a 'closed' format. Different tools exist to 'open' data and to clean, structure, and process them in order to build advanced services, but, unfortunately, no single tool can perform all of these tasks. We believe that the availability of Open Data in the health and social fields should be greatly increased and that a way to create new health and social services should be provided. In this paper, we present a framework for creating health and social Open Data starting from whatever is available on the web, and for easily building advanced services based on those data.
We summarize the optimization and performance evaluation of the Nonhydrostatic ICosahedral Atmospheric Model (NICAM) on two different types of supercomputers: the K computer and TSUBAME2.5. First, we evaluated and improved several kernels extracted from the model code on the K computer. We did not significantly change the loop and data ordering, so as to make sufficient use of the features of the K computer, such as the hardware-aided thread barrier mechanism and the relatively high memory bandwidth, i.e., a 0.5 Byte/FLOP ratio. Loop optimizations and code cleaning that reduced memory transfer contributed to a speed-up of the model execution time. The main loop of NICAM reached a sustained performance of 0.87 PFLOPS with 81,920 nodes on the K computer. For GPU-based calculations, we applied OpenACC to the dynamical core of NICAM. The performance and scalability were evaluated using the TSUBAME2.5 supercomputer. We achieved good performance results, which showed efficient use of the memory throughput of the GPU as well as good weak scalability. A dry dynamical core experiment was carried out using 2560 GPUs, achieving 60 TFLOPS of sustained performance.
SDN’s logically centralized control provides an insertion point for programming the network. While it is generally agreed that higher-level abstractions are needed to make that programming easy, there is little consensus on what the “right” abstractions are. Indeed, as SDN moves beyond its initial specialized deployments to broader use cases, it is likely that network control applications will require diverse abstractions that evolve over time. To this end, we champion a perspective that SDN control fundamentally revolves around data representation. We discard any application-specific structure that might be outgrown by new demands. Instead, we adopt a plain data representation of the entire network — network topology, forwarding, and control applications — and seek a universal data language that allows application programmers to transform the primitive representation into any high-level representation presented to applications or network operators. Driven by this insight, we present a system, Ravel, that implements an entire SDN network control infrastructure within a standard SQL database. In Ravel, network abstractions take the form of user-defined SQL views expressed by SQL queries that can be added on the fly. A key challenge in realizing this approach is to orchestrate multiple simultaneous abstractions that collectively affect the same underlying data. To achieve this, Ravel enhances the database with novel data integration mechanisms that merge the multiple views into a coherent forwarding behavior. Moreover, Ravel is exposed to applications through one simple, familiar, and highly interoperable interface: SQL. While this is an ambitious long-term goal, our prototype built on the PostgreSQL database exhibits promising performance even for large-scale networks.
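To make "abstractions as on-the-fly SQL views" concrete, a toy sketch follows (using Python's sqlite3 for self-containedness; Ravel itself runs on PostgreSQL, and the schema below is invented purely for illustration):

```python
import sqlite3

# Hypothetical base tables; Ravel's actual schema differs. The point:
# a network abstraction is just a SQL view created on the fly over the
# primitive representation of topology and forwarding state.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE tp (sw INTEGER, dst INTEGER);                -- topology edges
CREATE TABLE rm (fid INTEGER, src INTEGER, dst INTEGER);  -- flows (reachability)

-- User-defined abstraction: per-switch load, derived from base tables.
CREATE VIEW load AS
  SELECT tp.sw AS sw, COUNT(rm.fid) AS nflows
  FROM tp JOIN rm ON rm.dst = tp.dst
  GROUP BY tp.sw;
""")
db.execute("INSERT INTO tp VALUES (1, 2), (2, 3)")
db.execute("INSERT INTO rm VALUES (10, 1, 3)")
print(db.execute("SELECT * FROM load").fetchall())  # [(2, 1)]
```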
Users receive a multitude of digital- and physical-security advice every day. Indeed, if we implemented all the security advice we received, we would never leave our houses or use the Internet. Instead, users selectively choose some advice to accept and some (most) to reject; however, it is unclear whether they are effectively prioritizing what is most important or most useful. If we can understand from where and why users take security advice, we can develop more effective security interventions.
As a first step, we conducted 25 semi-structured interviews of a demographically broad pool of users. These interviews resulted in several interesting findings: (1) participants evaluated digital-security advice based on the trustworthiness of the advice source, but evaluated physical-security advice based on their intuitive assessment of the advice content; (2) negative security events portrayed in well-crafted fictional narratives with relatable characters (such as those shown on TV or in movies) may be effective teaching tools for both digital- and physical-security behaviors; and (3) participants rejected advice for many reasons, including finding that the advice contains too much marketing material or threatens their privacy.
Smartphones nowadays are customized to help users with their daily tasks, such as storing important data or making transactions over the internet. Given the sensitivity of the data involved, authentication mechanisms such as fixed-text passwords, PINs, or unlock patterns are used to safeguard these data against intruders. However, these mechanisms are at risk from security threats such as cracking or shoulder surfing. To enhance mobile and/or information security, this study aimed to develop a free-form handwriting gesture user authentication for smartphones. It also tried to discover the static and dynamic handwriting features that significantly influence the recognition of a legitimate user. In the experiment, thirty (30) individuals were asked to draw or swipe their desired free-form security pattern with a fingertip ten (10) times. These patterns were then cleaned and processed, and seven (7) static and eleven (11) dynamic handwriting features were extracted. Using the Neural Network classifier of the RapidMiner data mining tool, these features were used to develop, validate, and test a model for user authentication. The model showed a very promising recognition rate of 96.67%. The model was further tested through a prototype, and it still gave a very satisfactory result.
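The abstract does not enumerate the 7 static and 11 dynamic features; as a hedged illustration, dynamic features are typically derived from the timed fingertip trajectory, and a comparable model can be sketched with scikit-learn's MLPClassifier in place of RapidMiner's Neural Network operator:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def dynamic_features(stroke):
    """Illustrative dynamic features from a (t, x, y) fingertip trace;
    the study's exact feature set is not listed in the abstract."""
    t, x, y = stroke[:, 0], stroke[:, 1], stroke[:, 2]
    dt, dx, dy = np.diff(t), np.diff(x), np.diff(y)
    speed = np.hypot(dx, dy) / np.maximum(dt, 1e-9)
    return [speed.mean(), speed.std(), speed.max(),
            (x.max() - x.min()) / (y.max() - y.min() + 1e-9)]

# One binary model per legitimate user, trained on that user's ten
# samples (label 1) against other users' samples (label 0):
clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000)
# clf.fit(X_train, y_train); clf.predict(X_test)
```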
The Internet has become the largest and best-known communication network, serving as social, industrial, and public infrastructure, since its invention in the late 1960s. In a historical retrospect of the Internet's evolution, its architecture has evolved repeatedly in response to various technical challenges; for instance, in the early 1990s the Internet faced a scalability crisis, which was soon overcome through emerging techniques such as CIDR, NAT, and IPv6. This paper emphasizes scalability as the key technical challenge now that the Internet-of-Things era has arrived. First, we describe identifier/locator separation, a scheme that can drive dramatic architectural evolution, from a historical perspective. We then review various kinds of identifier/locator separation schemes, since such separation has recently become a major design pillar for the future Internet architecture, spanning both clean-slate future Internet architectures and evolutionary ones. Finally, we present the results of a tabular analysis for the future Internet of Everything, in which the number of Internet-connected devices is expected to grow to more than 20 billion by 2020.
This paper describes a study that investigates tilt-gesture depth on a Bluetooth handheld music controller for activating and deactivating music loops. Making use of a Wii Remote's 3-axis ADXL330 accelerometer, a Max patch was programmed to receive, handle, and store incoming accelerometer data. Each loop corresponded to the front, back, left, or right tilt-gesture direction, with each gesture motion triggering a loop 'On' or 'Off' depending on its playback status. The study comprised 40 undergraduate students interacting with the prototype controller for 5 minutes each. Each participant performed three full cycles beginning with the front gesture direction and moving clockwise, corresponding to a total of 24 trigger motions per participant. Raw data associated with tilt-gesture motion depth were scaled, analyzed, and graphed. Results show significant differences among the gesture directions in terms of tilt-gesture depth, as well as issues with noise for left/right gesture motions due to dependency on Roll and Yaw values. Front and Left tilt-gesture depths displayed significantly higher threshold levels than the Back and Right axes. Front and Left tilt-gesture thresholds therefore allow the device to easily differentiate between intentional sample triggering and general device handling, while this is more difficult for the Back and Right directions. Future work will include finding an alternative method for evaluating intentional tilt-gesture triggering on the Back and Right axes, as well as utilizing two 2-axis accelerometers to garner clean data from the Left and Right axes.
The Google Identity Platform is a system that allows a user to sign in to applications and other services by using a Google account. Google Sign-In is one such method for providing one’s identity to the Google Identity Platform. Google Sign-In is available for Android and iOS applications, as well as for websites and other devices. Users of Google Sign-In find that it integrates well with the Android platform, but iOS users (iPhone, iPad, etc.) do not have the same experience. The user experience when logging in to a Google account on an iOS application is not only more tedious than on Android; it also conditions users to engage in behaviors that put the information in their Google accounts at risk.
With the end of CPU core scaling due to dark silicon limitations, customized accelerators on FPGAs have gained increased attention in modern datacenters due to their low power, high performance, and energy efficiency. Evidenced by Microsoft's FPGA deployment in its Bing search engine and Intel's $16.7 billion acquisition of Altera, integrating FPGAs into datacenters is considered one of the most promising approaches to sustaining future datacenter growth. However, it is quite challenging for existing big data computing systems—like Apache Spark and Hadoop—to access the performance and energy benefits of FPGA accelerators. In this paper we design and implement Blaze to provide programming and runtime support that enables easy and efficient deployment of FPGA accelerators in datacenters. In particular, Blaze abstracts FPGA accelerators as a service (FaaS) and provides a set of clean programming APIs for big data processing applications to easily utilize those accelerators. Our Blaze runtime implements an FaaS framework to efficiently share FPGA accelerators among multiple heterogeneous threads on a single node, and extends Hadoop YARN with accelerator-centric scheduling to efficiently share them among multiple computing tasks in the cluster. Experimental results using four representative big data applications demonstrate that Blaze greatly reduces the programming effort required to access FPGA accelerators in systems like Apache Spark and YARN, and improves system throughput by 1.7× to 3× (and energy efficiency by 1.5× to 2.7×) compared to a conventional CPU-only cluster.
Defect-prediction techniques can enhance the quality assurance activities for software systems. For instance, they can be used to predict bugs in source files or functions. In the context of a software product line, such techniques could ideally be used to predict defects in features or combinations of features, which would allow developers to focus quality assurance on the error-prone ones. In this preliminary case study, we investigate how defect prediction models can be used to identify defective features using machine-learning techniques. We adapt process metrics and evaluate and compare three classifiers using an open-source product line. Our results show that the technique can be effective. Our best scenario achieves an accuracy of 73% in predicting features as defective or clean, using a Naive Bayes classifier. Based on these results, we discuss directions for future work.
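A minimal sketch of such an evaluation, with invented process-metric values (the paper's adapted metrics and dataset are not given in the abstract):

```python
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

# Hypothetical per-feature process metrics, e.g.:
# [commits touching the feature, distinct authors, churned lines]
X = [[12, 3, 540], [2, 1, 40], [25, 6, 1200], [4, 2, 90]]
y = [1, 0, 1, 0]   # 1 = defective feature, 0 = clean

clf = GaussianNB()
print(cross_val_score(clf, X, y, cv=2).mean())  # accuracy estimate
```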
The United States is losing the cyberwar. We are losing the cyberwar because cyber defenses apply the wrong philosophy to the wrong operating environment. In order to be effective, future cyber defenses must be viewed in the context of an engagement between human adversaries.
Pagination problems deal with questions around transforming a source text stream into a formatted document by dividing it up into individual columns and pages, including adding auxiliary elements that have some relationship to the source stream data but may allow a certain amount of variation in placement (such as figures or footnotes). Traditionally the pagination problem has been approached by separating it into one of micro-typography (e.g., breaking text into paragraphs, also known as h&j) and one of macro-typography (e.g., taking a galley of already formatted paragraphs and breaking them into columns and pages) without much interaction between the two. While early solutions for both problem spaces used simple greedy algorithms, Knuth and Plass introduced in the '80s a global-fit algorithm for line breaking that optimizes the breaks across the whole paragraph [1]. This algorithm was implemented in TeX'82 [2] and has since kept its crown as the best available solution for this space. However, for macro-typography there has been no (successful) attempt to provide globally optimized page layout: all systems to date (including TeX) use greedy algorithms for pagination. Various problems in this area have been researched (e.g., [3,4,5,6]) and the literature documents some prototype development. But none of these prototypes have been made widely available to the research community or ever made it into a generally usable and publicly available system. This paper presents a framework for a global-fit algorithm for page breaking based on the ideas of Knuth/Plass (a toy sketch of the global-fit idea appears after the references below). It is implemented in such a way that it is directly usable, without additional executables, with any modern TeX installation. It can therefore serve as a test bed for future experiments and extensions in this space. At the same time, a cleaned-up version of the current prototype has the potential to become a production tool for the huge number of TeX users worldwide. The paper also discusses two already implemented extensions that increase the flexibility of the pagination process: the ability to automatically exploit existing flexibility in paragraph length (by considering paragraph variations with different numbers of lines [7]) and the concept of running the columns on a double spread a line long or short. It concludes with a discussion of the overall approach, its inherent limitations, and directions for future research.
References:
[1] D. E. Knuth and M. F. Plass. Breaking Paragraphs into Lines. Software-Practice and Experience, 11(11):1119-1184, Nov. 1981.
[2] D. E. Knuth. TeX: The Program, volume B of Computers and Typesetting. Addison-Wesley, Reading, MA, USA, 1986.
[3] A. Brüggemann-Klein, R. Klein, and S. Wohlfeil. On the Pagination of Complex Documents. In Computer Science in Perspective, pages 49-68. Springer-Verlag New York, Inc., New York, NY, USA, 2003.
[4] C. Jacobs, W. Li, and D. H. Salesin. Adaptive document layout via manifold content. In Second International Workshop on Web Document Analysis (WDA2003), Liverpool, UK, 2003.
[5] A. Holkner. Global multiple objective line breaking. Master's thesis, School of Computer Science and Information Technology, RMIT University, Melbourne, Victoria, Australia, 2006.
[6] P. Ciancarini, A. Di Iorio, L. Furini, and F. Vitali. High-quality pagination for publishing. Software-Practice and Experience, 42(6):733-751, June 2012.
[7] T. Hassan and A. Hunter. Knuth-Plass revisited: Flexible line-breaking for automatic document layout. In Proceedings of the 2015 ACM Symposium on Document Engineering, DocEng '15, pages 17-20, New York, NY, USA, 2015.
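For intuition, the global-fit idea applied to page breaking is a dynamic-programming optimization of total badness over the whole document, instead of a page-at-a-time greedy choice. A toy Python sketch under simplifying assumptions (unbreakable fixed-height blocks; squared leftover space as badness; no figures, footnotes, or paragraph variations):

```python
import functools

def optimal_breaks(heights, page_height):
    """Global-fit page breaking in the Knuth/Plass spirit: pick break
    points minimizing total badness (squared leftover space) over the
    whole document rather than greedily filling each page."""
    n = len(heights)

    @functools.lru_cache(maxsize=None)
    def best(i):
        """Best (cost, break tuple) for items i..n-1."""
        if i == n:
            return 0, ()
        filled, best_cost, best_plan = 0, float("inf"), ()
        for j in range(i, n):
            filled += heights[j]
            if filled > page_height:
                break
            rest_cost, rest_plan = best(j + 1)
            cost = (page_height - filled) ** 2 + rest_cost
            if cost < best_cost:
                best_cost, best_plan = cost, (j + 1,) + rest_plan
        return best_cost, best_plan

    return best(0)[1]  # exclusive end index of each page

# Pages become [3, 2], [4], [1, 2]: total badness 0 + 1 + 4 = 5,
# beating the greedy split [3, 2], [4, 1], [2] with badness 0 + 0 + 9.
print(optimal_breaks([3, 2, 4, 1, 2], page_height=5))  # (2, 3, 5)
```

A greedy paginator always fills each page maximally; the global optimizer may leave one page slightly loose to avoid a badly underfull page later, which is exactly the behavior Knuth/Plass obtained for line breaking.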
Maintaining a clean and hygienic civic environment is an indispensable yet formidable task, especially in developing countries. With the aim of engaging citizens to track and report on their neighborhoods, this paper presents a novel smartphone app, called SpotGarbage, which detects and coarsely segments garbage regions in a user-clicked geo-tagged image. The app utilizes the proposed deep architecture of fully convolutional networks for detecting garbage in images. The model has been trained on a newly introduced Garbage In Images (GINI) dataset, achieving a mean accuracy of 87.69%. The paper also proposes optimizations in the network architecture resulting in a reduction of 87.9% in memory usage and 96.8% in prediction time with no loss in accuracy, facilitating its usage in resource-constrained smartphones.
The human factor is often regarded as the weakest link in cybersecurity systems. The investigation of several security breaches reveals an important impact of human errors in exhibiting security vulnerabilities. Although security researchers have long observed the impact of human behavior, few improvements have been made in designing secure systems that are resilient to the uncertainties of the human element.
In this talk, we discuss several psychological theories that attempt to understand and influence the human behavior in the cyber world. Our goal is to use such theories in order to build predictive cyber security models that include the behavior of typical users, as well as system administrators. We then illustrate the importance of our approach by presenting a case study that incorporates models of human users. We analyze our preliminary results and discuss their challenges and our approaches to address them in the future.
Presented at the ITI Joint Trust and Security/Science of Security Seminar, October 20, 2016.
This article describes our recent progress on the development of rigorous analytical metrics for assessing the threat-performance trade-off in control systems. Computing systems that monitor and control physical processes are now pervasive, yet their security is frequently an afterthought rather than a first-order design consideration. We investigate a rational basis for deciding—at the design level—how much investment should be made to secure the system.
Mobile malware attempts to evade detection during app analysis by mimicking security-sensitive behaviors of benign apps that provide similar functionality (e.g., sending SMS messages), and by suppressing its payload to reduce the chance of being observed (e.g., executing the payload only at night). Since current approaches focus their analyses on the types of security-sensitive resources being accessed (e.g., network), these evasive techniques make differentiating between malicious and benign app behaviors a difficult task during app analysis. We propose that the malicious and benign behaviors within apps can be differentiated based on the contexts that trigger security-sensitive behaviors, i.e., the events and conditions that cause the security-sensitive behaviors to occur. In this work, we introduce AppContext, a static program analysis approach that extracts the contexts of security-sensitive behaviors to assist app analysis in differentiating between malicious and benign behaviors. We implement a prototype of AppContext and evaluate it on 202 malicious apps from various malware datasets and 633 benign apps from the Google Play Store. AppContext correctly identifies 192 malicious apps with 87.7% precision and 95% recall. Our evaluation results suggest that the maliciousness of a security-sensitive behavior is more closely related to the intention of the behavior (reflected via contexts) than to the type of the security-sensitive resources that the behavior accesses.
Presentation at Illinois SoS Lablet Bi-Weekly Meeting, January 2015.
The malware detection arms race involves constant change: malware changes to evade detection and labels change as detection mechanisms react. Recognizing that malware changes over time, prior work has enforced temporally consistent samples by requiring that training binaries predate evaluation binaries. We present temporally consistent labels, requiring that training labels also predate evaluation binaries since training labels collected after evaluation binaries constitute label knowledge from the future. Using a dataset containing 1.1 million binaries from over 2.5 years, we show that enforcing temporal label consistency decreases detection from 91% to 72% at a 0.5% false positive rate compared to temporal samples alone.
The impact of temporal labeling demonstrates the potential of improved labels to increase detection results. Hence, we present a detector capable of selecting binaries for submission to an expert labeler for review. At a 0.5% false positive rate, our detector achieves a 72% true positive rate without an expert, which increases to 77% and 89% with 10 and 80 expert queries daily, respectively. Additionally, we detect 42% of malicious binaries initially undetected by all 32 antivirus vendors from VirusTotal used in our evaluation. For evaluation at scale, we simulate the human expert labeler and show that our approach is robust against expert labeling errors. Our novel contributions include a scalable malware detector integrating manual review with machine learning, and an examination of temporal label consistency.
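The temporal-consistency rule itself is easy to state in code; a sketch with illustrative field names (the dataset schema here is ours, not the paper's):

```python
import datetime as dt

def temporally_consistent(train, eval_start):
    """Keep only training examples whose binary AND label predate the
    evaluation window; labels observed later constitute knowledge
    from the future."""
    return [ex for ex in train
            if ex["binary_seen"] < eval_start and ex["label_seen"] < eval_start]

train = [
    {"binary_seen": dt.date(2013, 1, 5), "label_seen": dt.date(2013, 2, 1)},
    {"binary_seen": dt.date(2013, 1, 5), "label_seen": dt.date(2014, 6, 1)},
]
print(len(temporally_consistent(train, dt.date(2014, 1, 1))))  # -> 1
```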
We examine the problem of aggregating the results of multiple anti-virus (AV) vendors' detectors into a single authoritative ground-truth label for every binary. To do so, we adapt a well-known generative Bayesian model that postulates the existence of a hidden ground truth upon which the AV labels depend. We use training based on Expectation Maximization for this fully unsupervised technique. We evaluate our method using 279,327 distinct binaries from VirusTotal, each of which appeared for the first time between January 2012 and June 2014.
Our evaluation shows that our statistical model is consistently more accurate at predicting the future-derived ground truth than all unweighted rules of the form “k out of n” AV detections. In addition, we evaluate the scenario where partial ground truth is available for model building. We train a logistic regression predictor on the partial label information. Our results show that as few as 100 randomly selected training instances with ground truth are enough to achieve an 80% true positive rate for a 0.1% false positive rate. In comparison, the best unweighted threshold rule provides only a 60% true positive rate at the same false positive rate.
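As a hedged illustration of the unsupervised approach, here is a simplified "one-coin" EM variant in which each vendor has a single accuracy parameter (the paper's generative model may be richer):

```python
import numpy as np

def em_aggregate(votes, iters=50):
    """EM over a hidden malware/benign label given 0/1 AV detections.
    votes: (n_binaries, n_vendors) array. Returns P(malicious) per binary."""
    votes = np.asarray(votes, dtype=float)
    n, _ = votes.shape
    post = votes.mean(axis=1)                       # init from vote fraction
    for _ in range(iters):
        # M-step: per-vendor accuracy (Laplace-smoothed) and prevalence
        acc = (post @ votes + (1 - post) @ (1 - votes) + 1.0) / (n + 2.0)
        prev = np.clip(post.mean(), 1e-6, 1 - 1e-6)
        # E-step: posterior that each binary is malicious
        log_mal = np.log(prev) + votes @ np.log(acc) + (1 - votes) @ np.log1p(-acc)
        log_ben = np.log1p(-prev) + votes @ np.log1p(-acc) + (1 - votes) @ np.log(acc)
        post = 1.0 / (1.0 + np.exp(log_ben - log_mal))
    return post

print(em_aggregate([[1, 1, 0], [0, 0, 1], [1, 1, 1]]).round(2))
```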
We present a controller synthesis algorithm for a discrete-time reach-avoid problem in the presence of adversaries. Our model of the adversary captures typical malicious attacks envisioned on cyber-physical systems, such as sensor spoofing, controller corruption, and actuator intrusion. After formulating the problem in a general setting, we present a sound and complete algorithm for the case with linear dynamics and an adversary with a budget on the total L2-norm of its actions. The algorithm relies on a result from linear control theory that enables us to decompose and precisely compute the reachable states of the system in terms of a symbolic simulation of the adversary-free dynamics and the total uncertainty induced by the adversary. With this decomposition, the synthesis problem eliminates the universal quantifier on the adversary's choices, and the symbolic controller actions can be effectively solved for using an SMT solver. The constraints induced by the adversary are computed by solving second-order cone programs. The algorithm is later extended to synthesize state-dependent controllers and to generate attacks for the adversary. We present preliminary experimental results that show the effectiveness of this approach on several example problems.
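The decomposition rests on superposition for linear systems; in assumed notation (not taken verbatim from the paper), with dynamics x_{t+1} = A x_t + B u_t + C a_t and adversary actions a of bounded total L2-norm:

```latex
\[
  x_t \;=\; \underbrace{A^{t}x_0 + \sum_{k=0}^{t-1} A^{t-1-k} B\,u_k}_{\text{adversary-free symbolic simulation}}
      \;+\; \underbrace{\sum_{k=0}^{t-1} A^{t-1-k} C\,a_k}_{\text{uncertainty induced by the adversary}},
  \qquad \|a\|_2 \le b .
\]
```

Under the L2 budget, the adversary-induced term ranges over an ellipsoid determined by b, so the universal quantifier over adversary actions reduces to a computable bound usable in the SMT constraints.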
This paper considers a decentralized switched control problem where exact conditions for controller synthesis are obtained in the form of semidefinite programming (SDP). The formulation involves a discrete-time switched linear plant that has a nested structure, and whose system matrices switch between a finite number of values according to a finite-state automaton. The goal of this paper is to synthesize a commensurately nested switched controller to achieve a desired level of ℓ2-induced norm performance. The nested structures of both plant and controller are characterized by block lower-triangular system matrices. For this setup, exact conditions are provided for the existence of a finite path-dependent synthesis. These include conditions for the completion of scaling matrices obtained through an extended matrix completion lemma. When individual controller dimensions are chosen at least as large as the plant's, these conditions reduce to a set of linear matrix inequalities. The completion lemma also provides an algorithm to complete closed-loop scaling matrices, leading to inequalities for controller synthesis that are solvable either algebraically or numerically through SDP.
Published in IEEE Transactions on Control of Network Systems, volume 2, issue 4, December 2015.
In the distributed optimization and iterative consensus literature, a standard problem is for N agents to minimize a function f = Σᵢ fᵢ over a subset of ℝⁿ, where fᵢ is the cost function of agent i. In this paper, we study the private distributed optimization (PDOP) problem with the additional requirement that the cost functions of the individual agents remain differentially private. The adversary attempts to infer information about the private cost functions from the messages that the agents exchange. Achieving differential privacy requires that any change of an individual's cost function results only in unsubstantial changes in the statistics of the messages. We propose a class of iterative algorithms for solving PDOP that achieve differential privacy and convergence to the optimal value. Our analysis reveals the dependence of the achieved accuracy and privacy levels on the parameters of the algorithm.
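A typical template for this class of algorithms perturbs the consensus/gradient iteration with decaying noise; the update below is illustrative of the class and not necessarily the authors' exact algorithm:

```latex
% Illustrative noise-perturbed distributed iteration (assumed form):
\[
  x_i(t+1) \;=\; \sum_{j=1}^{N} w_{ij}\, x_j(t)
             \;-\; \eta_t\, \nabla f_i\!\big(x_i(t)\big)
             \;+\; \epsilon_i(t),
  \qquad \epsilon_i(t) \sim \mathrm{Lap}(c\,q^{t}),\ 0 < q < 1 .
\]
```

Here the step sizes and the decaying noise scale jointly set the trade-off the abstract describes between the achieved accuracy and the privacy level.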
Many current VM monitoring approaches require guest OS modifications and are also unable to perform application level monitoring, reducing their value in a cloud setting. This paper introduces hprobes, a framework that allows one to dynamically monitor applications and operating systems inside a VM. The hprobe framework does not require any changes to the guest OS, which avoids the tight coupling of monitoring with its target. Furthermore, the monitors can be customized and enabled/disabled while the VM is running. To demonstrate the usefulness of this framework, we present three sample detectors: an emergency detector for a security vulnerability, an application watchdog, and an infinite-loop detector. We test our detectors on real applications and demonstrate that those detectors achieve an acceptable level of performance overhead with a high degree of flexibility.