Biblio
This paper presents a solution that simultaneously addresses both reliability and security (RnS) in a monitoring framework. We identify the commonalities between reliability and security to guide the design of HyperTap, a hypervisor-level framework that efficiently supports both types of monitoring in virtualization environments. In HyperTap, the logging of system events and states is common across monitors and constitutes the core of the framework. The audit phase of each monitor is implemented and operated independently. In addition, HyperTap relies on hardware invariants to provide a strongly isolated root of trust. HyperTap uses active monitoring, which can be adapted to enforce a wide spectrum of RnS policies. We validate Hy- perTap by introducing three example monitors: Guest OS Hang Detection (GOSHD), Hidden RootKit Detection (HRKD), and Privilege Escalation Detection (PED). Our experiments with fault injection and real rootkits/exploits demonstrate that HyperTap provides robust monitoring with low performance overhead.
Winner of the William C. Carter Award for Best Paper based on PhD work and Best Paper Award voted by conference participants.
Reliability and security tend to be treated separately because they appear orthogonal: reliability focuses on accidental failures, security on intentional attacks. Because of the apparent dissimilarity between the two, tools to detect and recover from different classes of failures and attacks are usually designed and implemented differently. So, integrating support for reliability and security in a single framework is a significant challenge.
Here, we discuss how to address this challenge in the context of cloud computing, for which reliability and security are growing concerns. Because cloud deployments usually consist of commodity hardware and software, efficient monitoring is key to achieving resiliency. Although reliability and security monitoring might use different types of analytics, the same sensing infrastructure can provide inputs to monitoring modules.
We split monitoring into two phases: logging and auditing. Logging captures data or events; it constitutes the framework’s core and is common to all monitors. Auditing analyzes data or events; it’s implemented and operated independently by each monitor. To support a range of auditing policies, logging must capture a complete view, including both actions and states of target systems. It must also provide useful, trustworthy information regarding the captured view.
We applied these principles when designing HyperTap, a hypervisor-level monitoring framework for virtual machines (VMs). Unlike most VM-monitoring techniques, HyperTap employs hardware architectural invariants (hardware invariants, for short) to establish the root of trust for logging. Hardware invariants are properties defined and enforced by a hardware platform (for example, the x86 instruction set architecture). Additionally, HyperTap supports continuous, event-driven VM monitoring, which enables both capturing the system state and responding rapidly to actions of interest.
Computing a user-task assignment for a workflow coming with probabilistic user availability provides a measure of completion rate or resiliency. To a workflow designer this indicates a risk of failure, espe- cially useful for workflows which cannot be changed due to rigid security constraints. Furthermore, resiliency can help outline a mitigation strategy which states actions that can be performed to avoid workflow failures. A workflow with choice may have many different resiliency values, one for each of its execution paths. This makes understanding failure risk and mitigation requirements much more complex. We introduce resiliency variance, a new analysis metric for workflows which indicates volatility from the resiliency average. We suggest this metric can help determine the risk taken on by implementing a given workflow with choice. For instance, high average resiliency and low variance would suggest a low risk of workflow failure.
Workflows are complex operational processes that include security constraints restricting which users can perform which tasks. An improper user-task assignment may prevent the completion of the work- flow, and deciding such an assignment at runtime is known to be complex, especially when considering user unavailability (known as the resiliency problem). Therefore, design tools are required that allow fast evaluation of workflow resiliency. In this paper, we propose a methodology for work- flow designers to assess the impact of the security policy on computing the resiliency of a workflow. Our approach relies on encoding a work- flow into the probabilistic model-checker PRISM, allowing its resiliency to be evaluated by solving a Markov Decision Process. We observe and illustrate that adding or removing some constraints has a clear impact on the resiliency computation time, and we compute the set of security constraints that can be artificially added to a security policy in order to reduce the computation time while maintaining the resiliency.
Workflows capture complex operational processes and include security constraints limiting which users can perform which tasks. An improper security policy may prevent cer- tain tasks being assigned and may force a policy violation. Deciding whether a valid user-task assignment exists for a given policy is known to be extremely complex, especially when considering user unavailability (known as the resiliency problem). Therefore tools are required that allow automatic evaluation of workflow resiliency. Modelling well defined workflows is fairly straightforward, however user availabil- ity can be modelled in multiple ways for the same workflow. Correct choice of model is a complex yet necessary concern as it has a major impact on the calculated resiliency. We de- scribe a number of user availability models and their encod- ing in the model checker PRISM, used to evaluate resiliency. We also show how model choice can affect resiliency computation in terms of its value, memory and CPU time.
Best Poster Award, ACM SIGCOMM Conference on Principles of Advanced Discrete Simulation, London, UK, June 10-12, 2015.
The advancement of software-defined networking (SDN) technology is highly dependent on the successful transformations from in-house research ideas to real-life products. To enable such transformations, a testbed offering scalable and high fidelity networking environment for testing and evaluating new/existing designs is extremely valuable. Mininet, the most popular SDN emulator by far, is designed to achieve both accuracy and scalability by running unmodified code of network applications in lightweight Linux Containers. How- ever, Mininet cannot guarantee performance fidelity under high workloads, in particular when the number of concurrent active events is more than the number of parallel cores. In this project, we develop a lightweight virtual time system in Linux container and integrate the system with Mininet, so that all the containers have their own virtual clocks rather than using the physical system clock which reflects the se- rialized execution of multiple containers. With the notion of virtual time, all the containers perceive virtual time as if they run independently and concurrently. As a result, inter- actions between the containers and the physical system are artificially scaled, making a network appear to be ten times faster from the viewpoint of applications within the contain- ers than it actually is. We also design an adaptive virtual time scheduling subsystem in Mininet, which is responsible to balance the experiment speed and fidelity. Experimen- tal results demonstrate that embedding virtual time into Mininet significantly enhances its performance fidelity, and therefore, results in a useful platform for the SDN community to conduct scalable experiments with high fidelity.
Fat-tree topologies have been widely adopted as the communication network in data centers in the past decade. Nowa- days, high-performance computing (HPC) system designers are considering using fat-tree as the interconnection network for the next generation supercomputers. For extreme-scale computing systems like the data centers and supercomput- ers, the performance is highly dependent on the intercon- nection networks. In this paper, we present FatTreeSim, a PDES-based toolkit consisting of a highly scalable fat-tree network model, with the goal of better understanding the de- sign constraints of fat-tree networking architectures in data centers and HPC systems, as well as evaluating the applica- tions running on top of the network. FatTreeSim is designed to model and simulate large-scale fat-tree networks up to millions of nodes with protocol-level fidelity. We have con- ducted extensive experiments to validate and demonstrate the accuracy, scalability and usability of FatTreeSim. On Argonne Leadership Computing Facility’s Blue Gene/Q sys- tem, Mira, FatTreeSim is capable of achieving a peak event rate of 305 M/s for a 524,288-node fat-tree model with a total of 567 billion committed events. The strong scaling experiments use up to 32,768 cores and show a near linear scalability. Comparing with a small-scale physical system in Emulab, FatTreeSim can accurately model the latency in the same fat-tree network with less than 10% error rate for most cases. Finally, we demonstrate FatTreeSim’s usability through a case study in which FatTreeSim serves as the net- work module of the YARNsim system, and the error rates for all test cases are less than 13.7%.
Best Paper Award
Realistic and scalable testing systems are critical to evaluate network applications and protocols to ensure successful real system deployments. Container-based network emula- tion is attractive because of the combination of many desired features of network simulators and physical testbeds . The success of Mininet, a popular software- defined networking (SDN) emulation testbed, demonstrates the value of such approach that we can execute unmodified binary code on a large- scale emulated network with lightweight OS-level vir- tualization techniques. However, an ordinary network em- ulator uses the system clock across all the containers even if a container is not being scheduled to run. This leads to the issue of temporal fidelity, especially with high workloads. Virtual time sheds the light on the issue of preserving tem- poral fidelity for large-scale emulation. The key insight is to trade time with system resources via precisely scaling the time of interactions between containers and physical devices by a factor of n, hence, making an emulated network ap- pear to be n times faster from the viewpoints of applications in the container. In this paper, we develop a lightweight Linux-container-based virtual time system and integrate the system to Mininet for fidelity and scalability enhancement. We also design an adaptive time dilation scheduling mod- ule for balancing speed and accuracy. Experimental results demonstrate that (1) with virtual time, Mininet is able to accurately emulate a network n times larger in scale, where n is the scaling factor, with the system behaviors closely match data obtained from a physical testbed; and (2) with the adaptive time dilation scheduling, we reduce the running time by 46% with little accuracy loss. Finally, we present a case study using the virtual-time-enabled Mininet to evalu- ate the limitations of equal-cost multi-path (ECMP) routing in a data center network.
It is critical to ensure that network policy remains consistent during state transitions. However, existing techniques impose a high cost in update delay, and/or FIB space. We propose the Customizable Consistency Generator (CCG), a fast and generic framework to support customizable consistency policies during network updates. CCG effectively reduces the task of synthesizing an update plan under the constraint of a given consistency policy to a verification problem, by checking whether an update can safely be installed in the network at a particular time, and greedily processing network state transitions to heuristically minimize transition delay. We show a large class of consistency policies are guaranteed by this greedy heuristic alone; in addition, CCG makes judicious use of existing heavier-weight network update mechanisms to provide guarantees when necessary. As such, CCG nearly achieves the “best of both worlds”: the efficiency of simply passing through updates in most cases, with the consistency guarantees of more heavyweight techniques. Mininet and physical testbed evaluations demonstrate CCG’s capability to achieve various types of consistency, such as path and bandwidth properties, with zero switch memory overhead and up to a 3× delay reduction compared to previous solutions.
Today’s quality of life is highly dependent on the successful operation of many large-scale industrial control systems. To enhance their protection against cyber-attacks and operational errors, we develop a simulation-based verification framework with cross-layer verification techniques that allow comprehensive analysis of the entire ICS-specific stack, including application, protocol, and network layers.
Work in progress paper.
In SDN, the underlying infrastructure is usually abstracted for applications that can treat the network as a logical or virtual entity. Commonly, the “mappings” between virtual abstractions and their actual physical implementations are not one-to-one, e.g., a single “big switch” abstract object might be implemented using a distributed set of physical devices. A key question is, what abstractions could be mapped to multiple physical elements while faithfully preserving their native semantics? E.g., can an application developer always expect her abstract “big switch” to act exactly as a physical big switch, despite being implemented using multiple physical switches in reality? We show that the answer to that question is “no” for existing virtual-to-physical mapping techniques: behavior can differ between the virtual “big switch” and the physical network, providing incorrect application-level behavior.
We also show that that those incorrect behaviors occur despite the fact that the most pervasive correctness invariants, such as per-packet consistency, are preserved throughout. These examples demonstrate that for practical notions of correctness, new systems and a new analytical framework are needed. We take the first steps by defining end-to-end correctness, a correctness condition that focuses on applications only, and outline a research vision to obtain virtualization systems with correct virtual to physical mappings.
Won best paper award at HotSDN 2014.
Security subsystems are often designed with flawed assumptions arising from system designers' faulty mental models. Designers tend to assume that users behave according to some textbook ideal, and to consider each potential exposure/interface in isolation. However, fieldwork continually shows that even well-intentioned users often depart from this ideal and circumvent controls in order to perform daily work tasks, and that "incorrect" user behaviors can create unexpected links between otherwise "independent" interfaces. When it comes to security features and parameters, designers try to find the choices that optimize security utility–-except these flawed assumptions give rise to an incorrect curve, and lead to choices that actually make security worse, in practice. We propose that improving this situation requires giving designers more accurate models of real user behavior and how it influences aggregate system security. Agent-based modeling can be a fruitful first step here. In this paper, we study a particular instance of this problem, propose user-centric techniques designed to strengthen the security of systems while simultaneously improving the usability of them, and propose further directions of inquiry.
This paper presents a model for generating personalized passwords (i.e., passwords based on user and service profile). A user's password is generated from a list of personalized words, each word is drawn from a topic relating to a user and the service in use. The proposed model can be applied to: (i) assess the strength of a password (i.e., determine how many guesses are used to crack the password), and (ii) generate secure (i.e., contains digits, special characters, or capitalized characters) yet easy to memorize passwords.
This paper presents a system named SPOT to achieve high accuracy and preemptive detection of attacks. We use security logs of real-incidents that occurred over a six-year period at National Center for Supercomputing Applications (NCSA) to evaluate SPOT. Our data consists of attacks that led directly to the target system being compromised, i.e., not detected in advance, either by the security analysts or by intrusion detection systems. Our approach can detect 75 percent of attacks as early as minutes to tens of hours before attack payloads are executed.
A key question that arises in rigorous analysis of cyberphysical systems under attack involves establishing whether or not the attacked system deviates significantly from the ideal allowed behavior. This is the problem of deciding whether or not the ideal system is an abstraction of the attacked system. A quantitative variation of this question can capture how much the attacked system deviates from the ideal. Thus, algorithms for deciding abstraction relations can help measure the effect of attacks on cyberphysical systems and to develop attack detection strategies. In this paper, we present a decision procedure for proving that one nonlinear dynamical system is a quantitative abstraction of another. Directly computing the reach sets of these nonlinear systems are undecidable in general and reach set over-approximations do not give a direct way for proving abstraction. Our procedure uses (possibly inaccurate) numerical simulations and a model annotation to compute tight approximations of the observable behaviors of the system and then uses these approximations to decide on abstraction. We show that the procedure is sound and that it is guaranteed to terminate under reasonable robustness assumptions.