Biblio
Concurrency vulnerabilities are extremely harmful and can be frequently exploited to launch severe attacks. Due to the non-determinism of multithreaded executions, it is very difficult to detect them. Recently, data race detectors and techniques based on maximal casual model have been applied to detect concurrency vulnerabilities. However, the former are ineffective and the latter report many false negatives. In this paper, we present CONVUL, an effective tool for concurrency vulnerability detection. CONVUL is based on exchangeable events, and adopts novel algorithms to detect three major kinds of concurrency vulnerabilities. In our experiments, CONVUL detected 9 of 10 known vulnerabilities, while other tools only detected at most 2 out of these 10 vulnerabilities. The 10 vulnerabilities are available at https://github.com/mryancai/ConVul.
Formal security verification of firmware interacting with hardware in modern Systems-on-Chip (SoCs) is a critical research problem. This faces the following challenges: (1) design complexity and heterogeneity, (2) semantics gaps between software and hardware, (3) concurrency between firmware/hardware and between Intellectual Property Blocks (IPs), and (4) expensive bit-precise reasoning. In this paper, we present a co-verification methodology to address these challenges. We model hardware using the Instruction-Level Abstraction (ILA), capturing firmware-visible behavior at the architecture level. This enables integrating hardware behavior with firmware in each IP into a single thread. The co-verification with multiple firmware across IPs is formulated as a multi-threaded program verification problem, for which we leverage software verification techniques. We also propose an optimization using abstraction to prevent expensive bit-precise reasoning. The evaluation of our methodology on an industry SoC Secure Boot design demonstrates its applicability in SoC security verification.
Use-After-Free (UAF) vulnerabilities are caused by the program operating on a dangling pointer and can be exploited to compromise critical software systems. While there have been many tools to mitigate UAF vulnerabilities, UAF remains one of the most common attack vectors. UAF is particularly di cult to detect in concurrent programs, in which a UAF may only occur with rare thread schedules. In this paper, we present a novel technique, UFO, that can precisely predict UAFs based on a single observed execution trace with a provably higher detection capability than existing techniques with no false positives. The key technical advancement of UFO is an extended maximal thread causality model that captures the largest possible set of feasible traces that can be inferred from a given multithreaded execution trace. By formulating UAF detection as a constraint solving problem atop this model, we can explore a much larger thread scheduling space than classical happens-before based techniques. We have evaluated UFO on several real-world large complex C/C++ programs including Chromium and FireFox. UFO scales to real-world systems with hundreds of millions of events in their execution and has detected a large number of real concurrency UAFs.
Tensor decompositions, which are factorizations of multi-dimensional arrays, are becoming increasingly important in large-scale data analytics. A popular tensor decomposition algorithm is Canonical Decomposition/Parallel Factorization using alternating least squares fitting (CP-ALS). Tensors that model real-world applications are often very large and sparse, driving the need for high performance implementations of decomposition algorithms, such as CP-ALS, that can take advantage of many types of compute resources. In this work we present ReFacTo, a heterogeneous distributed tensor decomposition implementation based on DeFacTo, an existing distributed memory approach to CP-ALS. DFacTo reduces the critical routine of CP-ALS to a series of sparse matrix-vector multiplications (SpMVs). ReFacTo leverages GPUs within a cluster via MPI to perform these SpMVs and uses OpenMP threads to parallelize other routines. We evaluate the performance of ReFacTo when using NVIDIA's GPU-based cuSPARSE library and compare it to an alternative implementation that uses Intel's CPU-based Math Kernel Library (MKL) for the SpMV. Furthermore, we provide a discussion of the performance challenges of heterogeneous distributed tensor decompositions based on the results we observed. We find that on up to 32 nodes, the SpMV of ReFacTo when using MKL is up to 6.8× faster than ReFacTo when using cuSPARSE.
Hash functions, such as SHA (secure hash algorithm) and MD (message digest) families that are built upon Merkle-Damgard construction, suffer many attacks due to the iterative nature of block-by-block message processing. Chum and Zhang [4] proposed a new hash function construction that takes advantage of the randomize-then-combine technique, which was used in the incremental hash functions, to the iterative hash function. In this paper, we implement such hash construction in three ways distinguished by their corresponding padding methods. We conduct the experiment in parallel multi-threaded programming settings. The results show that the speed of proposed hash function is no worse than SHA1.
Optimizing memory access is critical for performance and power efficiency. CPU manufacturers have developed sampling-based performance measurement units (PMUs) that report precise costs of memory accesses at specific addresses. However, this data is too low-level to be meaningfully interpreted and contains an excessive amount of irrelevant or uninteresting information. We have developed a method to gather fine-grained memory access performance data for specific data objects and regions of code with low overhead and attribute semantic information to the sampled memory accesses. This information provides the context necessary to more effectively interpret the data. We have developed a tool that performs this sampling and attribution and used the tool to discover and diagnose performance problems in real-world applications. Our techniques provide useful insight into the memory behaviour of applications and allow programmers to understand the performance ramifications of key design decisions: domain decomposition, multi-threading, and data motion within distributed memory systems.