Biblio
The Rowhammer bug is a novel micro-architectural security threat, enabling powerful privilege-escalation attacks on various mainstream platforms. It works by actively flipping bits in Dynamic Random Access Memory (DRAM) cells with unprivileged instructions. In order to set up Rowhammer against binaries in the Linux page cache, the Waylaying algorithm has previously been proposed. The Waylaying method stealthily relocates binaries onto exploitable physical addresses without exhausting system memory. However, the proof-of-concept Waylaying algorithm can be easily detected during page cache eviction because of its high disk I/O overhead and long running time. This paper proposes the more advanced Memway algorithm, which improves on Waylaying in terms of both I/O overhead and speed. Running time and disk I/O overhead are reduced by 90% by utilizing Linux tmpfs and inmemory swapping to manage eviction files. Furthermore, by combining Memway with the unprivileged posix fadvise API, the binary relocation step is made 100 times faster. Equipped with our Memway+fadvise relocation scheme, we demonstrate practical Rowhammer attacks that take only 15-200 minutes to covertly relocate a victim binary, and less than 3 seconds to flip the target instruction bit.
Physical Unclonable Functions (PUFs) are considered as an attractive low-cost security anchor. The unique features of PUFs are dependent on the Nanoscale variations introduced during the manufacturing variations. Most PUFs exhibit an unreliability problem due to aging and inherent sensitivity to the environmental conditions. As a remedy to the reliability issue, helper data algorithms are used in practice. A helper data algorithm generates and stores the helper data in the enrollment phase in a secure environment. The generated helper data are used then for error correction, which can transform the unique feature of PUFs into a reproducible key. The key can be used to encrypt secret data in the security scheme. In contrast, this work shows that the fuzzy PUFs can be used to secret important data directly by an error-tolerant protocol without the enrollment phase and error-correction algorithm. In our proposal, the secret data is locked in a vault leveraging the unique fuzzy pattern of PUF. Although the noise exists, the data can then be released only by this unique PUF. The evaluation was performed on the most prominent intrinsic PUF - DRAM PUF. The test results demonstrate that our proposal can reach an acceptable reconstruction rate in various environment. Finally, the security analysis of the new proposal is discussed.
The evolution of convolutional neural networks (CNNs) into more complex forms of organization, with additional layers, larger convolutions and increasing connections, established the state-of-the-art in terms of accuracy errors for detection and classification challenges in images. Moreover, as they evolved to a point where Gigabytes of memory are required for their operation, we have reached a stage where it becomes fundamental to understand how their inference capabilities can be impaired if data elements somehow become corrupted in memory. This paper introduces fault-injection in these systems by simulating failing bit-cells in hardware memories brought on by relaxing the 100% reliable operation assumption. We analyze the behavior of these networks calculating inference under severe fault-injection rates and apply fault mitigation strategies to improve on the CNNs resilience. For the MNIST dataset, we show that 8x less memory is required for the feature maps memory space, and that in sub-100% reliable operation, fault-injection rates up to 10-1 (with most significant bit protection) can withstand only a 1% error probability degradation. Furthermore, considering the offload of the feature maps memory to an embedded dynamic RAM (eDRAM) system, using technology nodes from 65 down to 28 nm, up to 73 80% improved power efficiency can be obtained.
Byte-addressable non-volatile memory technology is emerging as an alternative for DRAM for main memory. This new Non-Volatile Main Memory (NVMM) allows programmers to store important data in data structures in memory instead of serializing it to the file system, thereby providing a substantial performance boost. However, modern systems reorder memory operations and utilize volatile caches for better performance, making it difficult to ensure a consistent state in NVMM. Intel recently announced a new set of persistence instructions, clflushopt, clwb, and pcommit. These new instructions make it possible to implement fail-safe code on NVMM, but few workloads have been written or characterized using these new instructions. In this work, we describe how these instructions work and how they can be used to implement write-ahead logging based transactions. We implement several common data structures and kernels and evaluate the performance overhead incurred over traditional non-persistent implementations. In particular, we find that persistence instructions occur in clusters along with expensive fence operations, they have long latency, and they add a significant execution time overhead, on average by 20.3% over code with logging but without fence instructions to order persists. To deal with this overhead and alleviate the performance bottleneck, we propose to speculate past long latency persistency operations using checkpoint-based processing. Our speculative persistence architecture reduces the execution time overheads to only 3.6%.
Hash tables form a core component of many algorithms as well as network devices. Because of their large size, they often require a combined memory model, in which some of the elements are stored in a fast memory (for example, cache or on-chip SRAM) while others are stored in much slower memory (namely, the main memory or off-chip DRAM). This makes the implementation of real-life hash tables particularly delicate, as a suboptimal choice of the hashing scheme parameters may result in a higher average query time, and therefore in a lower throughput. In this paper, we focus on multiple-choice hash tables. Given the number of choices, we study the tradeoff between the load of a hash table and its average lookup time. The problem is solved by analyzing an equivalent problem: the expected maximum matching size of a random bipartite graph with a fixed left-side vertex degree. Given two choices, we provide exact results for any finite system, and also deduce asymptotic results as the fast memory size increases. In addition, we further consider other variants of this problem and model the impact of several parameters. Finally, we evaluate the performance of our models on Internet backbone traces, and illustrate the impact of the memories speed difference on the choice of parameters. In particular, we show that the common intuition of entirely avoiding slow memory accesses by using highly efficient schemes (namely, with many fast-memory choices) is not always optimal.
High-speed IP address lookup is essential to achieve wire-speed packet forwarding in Internet routers. Ternary content addressable memory (TCAM) technology has been adopted to solve the IP address lookup problem because of its ability to perform fast parallel matching. However, the applicability of TCAMs presents difficulties due to cost and power dissipation issues. Various algorithms and hardware architectures have been proposed to perform the IP address lookup using ordinary memories such as SRAMs or DRAMs without using TCAMs. Among the algorithms, we focus on two efficient algorithms providing high-speed IP address lookup: parallel multiple-hashing (PMH) algorithm and binary search on level algorithm. This paper shows how effectively an on-chip Bloom filter can improve those algorithms. A performance evaluation using actual backbone routing data with 15,000-220,000 prefixes shows that by adding a Bloom filter, the complicated hardware for parallel access is removed without search performance penalty in parallel-multiple hashing algorithm. Search speed has been improved by 30-40 percent by adding a Bloom filter in binary search on level algorithm.
As recently shown in 2013, Android-driven smartphones and tablet PCs are vulnerable to so-called cold boot attacks. With physical access to an Android device, forensic memory dumps can be acquired with tools like FROST that exploit the remanence effect of DRAM to read out what is left in memory after a short reboot. While FROST can in some configurations be deployed to break full disk encryption, encrypted user partitions are usually wiped during a cold boot attack, such that a post-mortem analysis of main memory remains the only source of digital evidence. Therefore, we provide an in-depth analysis of Android's memory structures for system and application level memory. To leverage FROST in the digital investigation process of Android cases, we provide open-source Volatility plugins to support an automated analysis and extraction of selected Dalvik VM memory structures.
Cloud computing allows users to delegate data and computation to cloud service providers, at the cost of giving up physical control of their computing infrastructure. An attacker (e.g., insider) with physical access to the computing platform can perform various physical attacks, including probing memory buses and cold-boot style attacks. Previous work on secure (co-)processors provides hardware support for memory encryption and prevents direct leakage of sensitive data over the memory bus. However, an adversary snooping on the bus can still infer sensitive information from the memory access traces. Existing work on Oblivious RAM (ORAM) provides a solution for users to put all data in an ORAM; and accesses to an ORAM are obfuscated such that no information leaks through memory access traces. This method, however, incurs significant memory access overhead. This work is the first to leverage programming language techniques to offer efficient memory-trace oblivious program execution, while providing formal security guarantees. We formally define the notion of memory-trace obliviousness, and provide a type system for verifying that a program satisfies this property. We also describe a compiler that transforms a program into a structurally similar one that satisfies memory trace obliviousness. To achieve optimal efficiency, our compiler partitions variables into several small ORAM banks rather than one large one, without risking security. We use several example programs to demonstrate the efficiency gains our compiler achieves in comparison with the naive method of placing all variables in the same ORAM.