Biblio
With the continuing evolution of sophisticated APT attacks, provenance tracking is becoming an important technique for efficient attack investigation in enterprise networks. Most of existing provenance techniques are operating on system event auditing that discloses dependence relationships by scrutinizing syscall traces. Unfortunately, such auditing-based provenance is not able to track the causality of another important dimension in provenance, the shared libraries. Different from other data-only system entities like files and sockets, dynamic libraries are linked at runtime and may get executed, which poses new challenges in provenance tracking. For example, library provenance cannot be tracked by syscalls and mapping; whether a library function is called and how it is called within an execution context is invisible at syscall level; linking a library does not promise their execution at runtime. Addressing these challenges is critical to tracking sophisticated attacks leveraging libraries. In this paper, to facilitate fine-grained investigation inside the execution of library binaries, we develop Lprov, a novel provenance tracking system which combines library tracing and syscall tracing. Upon a syscall, Lprov identifies the library calls together with the stack which induces it so that the library execution provenance can be accurately revealed. Our evaluation shows that Lprov can precisely identify attack provenance involving libraries, including malicious library attack and library vulnerability exploitation, while syscall-based provenance tools fail to identify. It only incurs 7.0% (in geometric mean) runtime overhead and consumes 3 times less storage space of a state-of-the-art provenance tool.
The Android research community has long focused on building an Android API permission specification, which can be leveraged by app developers to determine the optimum set of permissions necessary for a correct and safe execution of their app. However, while prominent existing efforts provide a good approximation of the permission specification, they suffer from a few shortcomings. Dynamic approaches cannot generate complete results, although accurate for the particular execution. In contrast, static approaches provide better coverage, but produce imprecise mappings due to their lack of path-sensitivity. In fact, in light of Android's access control complexity, the approximations hardly abstract the actual co-relations between enforced protections. To address this, we propose to precisely derive Android protection specification in a path-sensitive fashion, using a novel graph abstraction technique. We further showcase how we can apply the generated maps to tackle security issues through logical satisfiability reasoning. Our constructed maps for 4 Android Open Source Project (AOSP) images highlight the significance of our approach, as \textasciitilde41% of APIs' protections cannot be correctly modeled without our technique.
Despite significant recent advances, the effectiveness of symbolic execution is limited when used to test complex, real-world software. One of the main scalability challenges is related to constraint solving: large applications and long exploration paths lead to complex constraints, often involving big arrays indexed by symbolic expressions. In this paper, we propose a set of semantics-preserving transformations for array operations that take advantage of contextual information collected during symbolic execution. Our transformations lead to simpler encodings and hence better performance in constraint solving. The results we obtain are encouraging: we show, through an extensive experimental analysis, that our transformations help to significantly improve the performance of symbolic execution in the presence of arrays. We also show that our transformations enable the analysis of new code, which would be otherwise out of reach for symbolic execution.
Modern software systems are becoming increasingly complex, relying on a lot of third-party library support. Library behaviors are hence an integral part of software behaviors. Analyzing them is as important as analyzing the software itself. However, analyzing libraries is highly challenging due to the lack of source code, implementation in different languages, and complex optimizations. We observe that many Java library functions provide excellent documentation, which concisely describes the functionalities of the functions. We develop a novel technique that can construct models for Java API functions by analyzing the documentation. These models are simpler implementations in Java compared to the original ones and hence easier to analyze. More importantly, they provide the same functionalities as the original functions. Our technique successfully models 326 functions from 14 widely used Java classes. We also use these models in static taint analysis on Android apps and dynamic slicing for Java programs, demonstrating the effectiveness and efficiency of our models.
Traditional sensitive data disclosure analysis faces two challenges: to identify sensitive data that is not generated by specific API calls, and to report the potential disclosures when the disclosed data is recognized as sensitive only after the sink operations. We address these issues by developing BidText, a novel static technique to detect sensitive data disclosures. BidText formulates the problem as a type system, in which variables are typed with the text labels that they encounter (e.g., during key-value pair operations). The type system features a novel bi-directional propagation technique that propagates the variable label sets through forward and backward data-flow. A data disclosure is reported if a parameter at a sink point is typed with a sensitive text label. BidText is evaluated on 10,000 Android apps. It reports 4,406 apps that have sensitive data disclosures, with 4,263 apps having log based disclosures and 1,688 having disclosures due to other sinks such as HTTP requests. Existing techniques can only report 64.0% of what BidText reports. And manual inspection shows that the false positive rate for BidText is 10%.
Advanced cyber attacks consist of multiple stages aimed at being stealthy and elusive. Such attack patterns leave their footprints spatio-temporally dispersed across many different logs in victim machines. However, existing log-mining intrusion analysis systems typically target only a single type of log to discover evidence of an attack and therefore fail to exploit fundamental inter-log connections. The output of such single-log analysis can hardly reveal the complete attack story for complex, multi-stage attacks. Additionally, some existing approaches require heavyweight system instrumentation, which makes them impractical to deploy in real production environments. To address these problems, we present HERCULE, an automated multi-stage log-based intrusion analysis system. Inspired by graph analytics research in social network analysis, we model multi-stage intrusion analysis as a community discovery problem. HERCULE builds multi-dimensional weighted graphs by correlating log entries across multiple lightweight logs that are readily available on commodity systems. From these, HERCULE discovers any "attack communities" embedded within the graphs. Our evaluation with 15 well known APT attack families demonstrates that HERCULE can reconstruct attack behaviors from a spectrum of cyber attacks that involve multiple stages with high accuracy and low false positive rates.
Causality inference, such as dynamic taint anslysis, has many applications (e.g., information leak detection). It determines whether an event e is causally dependent on a preceding event c during execution. We develop a new causality inference engine LDX. Given an execution, it spawns a slave execution, in which it mutates c and observes whether any change is induced at e. To preclude non-determinism, LDX couples the executions by sharing syscall outcomes. To handle path differences induced by the perturbation, we develop a novel on-the-fly execution alignment scheme that maintains a counter to reflect the progress of execution. The scheme relies on program analysis and compiler transformation. LDX can effectively detect information leak and security attacks with an average overhead of 6.08% while running the master and the slave concurrently on separate CPUs, much lower than existing systems that require instruction level monitoring. Furthermore, it has much better accuracy in causality inference.