Data Driven Security Models and Analysis - January 2017
Public Audience
Purpose: To highlight project progress. Information is generally at a higher level which is accessible to the interested public. All information contained in the report (regions 1-3) is a Government Deliverable/CDRL.
PI(s): Ravi Iyer
Co-PI(s): Zbigniew Kalbarczyk and Adam Slagell
Researchers: Phuong Cao and Key-whan Chung
HARD PROBLEM(S) ADDRESSED
This refers to Hard Problems, released November 2012.
- Predictive security metrics - design, development, and validation
- Resilient architectures - in the end we want to use the metrics to achieve a measurable enhancement in system resiliency, i.e., the ability to withstand attacks
- Human behavior - data contain traces of the steps the attacker took, and hence inherently include some aspects of the human behavior (of both users and miscreants)
PUBLICATIONS
Papers published in this quarter as a result of this research. Include title, author(s), venue published/presented, and a short description or abstract. Identify which hard problem(s) the publication addressed. Papers that have not yet been published should be reported in region 2 below.
No new publications this quarter.
ACCOMPLISHMENT HIGHLIGHTS
This quarter we have continued our work on factor graphs for preemptive detection of known and unknown attacks. Specifically we have focused on development of methods for automatic learning of factor functions. The accomplishments include:
1. Approach to model multi-stage attacks for preemptive detection of attacks.
- Step1: Incident data collection. The security logs are preprocessed (using automated scripts) to extract sequences of events corresponding to user and attacker activities (see Figure 1). The set of possible security-related events is finite and determined based on known capabilities of the monitoring tools. A human-written report describing the incident and indicating which users were compromised is available for each incident in our data set. The output from Step1 is the incident database, which contains time-ordered sequences of events corresponding to different attacks.
Figure 1: An architecture for automated (with human in the loop) generation of factor functions
- Step2: Automated extraction of factor functions. We develop procedures to extract three types of factor functions: event frequency counter, event sensitivity evaluator, and longest common subsequence analysis for univariate, bivariate, and multivariate factor functions, respectively.
- Step3: Pruning factor functions. For each combination of inputs (e.g., observable events corresponding to system/user activities) we construct a factor function. To identify high-quality factor functions, we develop procedures for pruning and sorting factor functions. First, we remove factor functions that assume an order of event variables which is not valid, for example, a download of a source code happens before a login to the system. Second, we sort the factor functions by frequency of events, for example, assigning a higher weight to rare and potentially policy violation events. Finally, for multivariate factor function we put more emphasis on how close are two events in time. If two events happen in close time proximity, they are likely to belong to the activities related to the same attack.
2. Classification of factor functions
We classify factor functions by the number of input variables (univariate, bivariate, and multivariate) and the function body. A factor function body can vary from a probability mass function or a likelihood function to an indicator function or a computer procedure (i.e., an actual program) that returns a value.
- Univariate factor functions. Consider a function f(u), which in the simplest case can indicate possible values of the variable u, i.e., u is a user state. In this case, the function f(u) can return a frequency corresponding to each possible value of u, i.e., frequency of being benign, suspicious, or malicious based on past observation data. Such a Boolean univariate factor function, however, does not quantify the impact of the compromise. To further quantify the impact of a past user compromise, we could use a frequency counter univariate factor function. Instead of returning a Boolean value, the function returns a counter value. Finally, a factor function can return a real value, such as suspiciousness of the user based on past login records, which can be expressed as: ratio = [number of successful logins] / {[number of failed logins] + [number of suspicious logins (e.g., from two locations in a short time proximity)]}.
- Bivariate factor functions. A bivariate factor function defines the relations between two variables, where one of the variables is known (an event observed in the logs or an attribute from a user profile) and the other variable is unknown (a user state). Bivariate factor function, in the simplest case, is similar to attack detectors based on fixed rules, which output a Boolean indicator of an attack based on a specific security alert. However, the alert alone may not be a good indicator of an attack. A more robust function could be a histogram counter bivariate factor function, which for example, returns the value specifying the number of times that the two variables have been observed together.
- Multivariate factor functions. A multivariate factor function captures contextual information provided by multiple variables. In a multistage attack, when we put together all the variables (i.e., a sequence of events corresponding to the user activity) can we see the progression of an attack. A multivariate factor function specifies the relation among these events, the user profile, and the user state. A way to define robust multivariate factor functions is to identify common subsequences of events in attacks, i.e., the core events that constitute an attack. Note that these events do not have to be consecutive; other events can occur between the events that constitute the common subsequence. We use dynamic programming to automatically extract the longest common subsequence of events from the past incidents and organize these events into a database of event sequences, represented in a form of a tree. A multivariate factor function, in this case, does not match a specific set of consecutive events; rather, it matches a sequence of nonconsecutive events. For example, a factor function of events and a user state matches the events with a valid path from the root of a tree to the leaves of the tree. The length of a path indicates the progress of an attack and is used as the corresponding return value.