Bibliography
Byte-addressable non-volatile memory technology is emerging as an alternative to DRAM for main memory. This new Non-Volatile Main Memory (NVMM) allows programmers to store important data in in-memory data structures instead of serializing it to the file system, thereby providing a substantial performance boost. However, modern systems reorder memory operations and rely on volatile caches for better performance, making it difficult to ensure a consistent state in NVMM. Intel recently announced a new set of persistence instructions: clflushopt, clwb, and pcommit. These new instructions make it possible to implement fail-safe code on NVMM, but few workloads have been written or characterized using them. In this work, we describe how these instructions work and how they can be used to implement write-ahead-logging-based transactions. We implement several common data structures and kernels and evaluate the performance overhead they incur over traditional non-persistent implementations. In particular, we find that persistence instructions occur in clusters along with expensive fence operations, that they have long latency, and that they add significant execution-time overhead: on average 20.3% over code with logging but without fence instructions to order persists. To deal with this overhead and alleviate the performance bottleneck, we propose to speculate past long-latency persistence operations using checkpoint-based processing. Our speculative persistence architecture reduces the execution-time overhead to only 3.6%.
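As a concrete illustration of the write-ahead logging pattern described above, the following minimal C sketch (not taken from the paper) shows an undo-log update that uses clwb to write back cache lines and sfence to order persists; the names persist_range, wal_store, pmem_data, and pmem_log are hypothetical, and the sketch omits pcommit.

/*
 * Minimal undo-log sketch, assuming pmem_data and pmem_log already point
 * into persistent memory (compile with -mclwb on GCC/Clang).
 */
#include <stddef.h>
#include <stdint.h>
#include <immintrin.h>   /* _mm_clwb, _mm_sfence */

#define CACHE_LINE 64

/* Write back every cache line covering [addr, addr+len), then fence so the
 * write-backs are ordered before any later persists. */
static void persist_range(const void *addr, size_t len)
{
    uintptr_t p = (uintptr_t)addr & ~(uintptr_t)(CACHE_LINE - 1);
    for (; p < (uintptr_t)addr + len; p += CACHE_LINE)
        _mm_clwb((void *)p);
    _mm_sfence();
}

/* Undo-log transaction: persist the old value to the log before updating
 * the data in place, then invalidate the log entry. */
void wal_store(uint64_t *pmem_data, uint64_t *pmem_log, uint64_t new_val)
{
    *pmem_log = *pmem_data;                       /* 1. log the old value  */
    persist_range(pmem_log, sizeof *pmem_log);

    *pmem_data = new_val;                         /* 2. update the data    */
    persist_range(pmem_data, sizeof *pmem_data);

    *pmem_log = 0;                                /* 3. commit: clear log  */
    persist_range(pmem_log, sizeof *pmem_log);
}

Each persist_range call ends in a fence, which is exactly the kind of clustered, long-latency persist-plus-fence sequence whose overhead the speculation mechanism described above targets.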
Ensuring the integrity and security of the memory system is critical. Recent studies have shown serious security concerns due to "rowhammer" attacks, in which repeated accesses to a row of memory cause bit flips in adjacent rows. Recent work by Google's Project Zero has shown how to leverage rowhammer-induced bit flips as the basis for security exploits, including malicious code injection and memory privilege escalation. Given the severity of this security concern, industry has attempted to defend against rowhammer attacks. Deployed defenses employ two strategies: (1) doubling the system DRAM refresh rate and (2) restricting access to the CLFLUSH instruction that attackers use to bypass the cache and increase memory access frequency (i.e., the rate of rowhammering). We demonstrate that such defenses are inadequate: we implement rowhammer attacks that both avoid using the CLFLUSH instruction and cause bit flips with a doubled refresh rate. Our next-generation, CLFLUSH-free rowhammer attack bypasses the cache by manipulating cache replacement state to force frequent misses out of the last-level cache to DRAM rows of our choosing. To protect existing systems from more advanced rowhammer attacks, we develop a software-based defense, ANVIL, which thwarts all known rowhammer attacks on existing systems. ANVIL detects rowhammer attacks by tracking the locality of DRAM accesses using existing hardware performance counters. Our detector identifies the rows being frequently accessed (i.e., the aggressors), then selectively refreshes the nearby victim rows to prevent hammering. Experiments on real hardware with the SPEC2006 benchmarks show that ANVIL has a false positive rate of less than 1% and an average slowdown of 1%. ANVIL is low-cost and robust, and our experiments indicate that it is an effective approach for protecting existing and future systems from even advanced rowhammer attacks.
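For context on why the deployed defenses single out CLFLUSH, the following C sketch shows the textbook CLFLUSH-based hammering loop; it is illustrative only, and is neither code from this work nor the CLFLUSH-free attack introduced here. The addresses addr_a and addr_b are assumed to map to two different rows in the same DRAM bank.

#include <emmintrin.h>   /* _mm_clflush */

/* Classic hammering loop: read two same-bank rows repeatedly, flushing them
 * from the cache so every read causes a DRAM row activation. */
static void hammer(volatile char *addr_a, volatile char *addr_b, long iters)
{
    for (long i = 0; i < iters; i++) {
        (void)*addr_a;                       /* activate row A */
        (void)*addr_b;                       /* activate row B */
        _mm_clflush((const void *)addr_a);   /* evict so the next read misses */
        _mm_clflush((const void *)addr_b);
    }
}

Removing the two flush instructions is what the deployed defenses count on; the CLFLUSH-free attack described above instead manipulates cache replacement state so that the reads miss the last-level cache anyway.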
High-speed IP address lookup is essential to achieve wire-speed packet forwarding in Internet routers. Ternary content addressable memory (TCAM) technology has been adopted to solve the IP address lookup problem because of its ability to perform fast parallel matching. However, the applicability of TCAMs is limited by cost and power dissipation issues. Various algorithms and hardware architectures have been proposed to perform IP address lookup using ordinary memories such as SRAMs or DRAMs, without TCAMs. Among these, we focus on two efficient algorithms that provide high-speed IP address lookup: the parallel multiple-hashing (PMH) algorithm and the binary search on level algorithm. This paper shows how effectively an on-chip Bloom filter can improve those algorithms. A performance evaluation using actual backbone routing data with 15,000-220,000 prefixes shows that adding a Bloom filter to the parallel multiple-hashing algorithm removes the complicated hardware for parallel access without any search-performance penalty, while adding a Bloom filter to the binary search on level algorithm improves search speed by 30-40 percent.
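To illustrate how an on-chip Bloom filter can screen out unnecessary probes in such hash-based lookup schemes, here is a minimal software sketch in C (not the paper's hardware design); the filter size and the two hash functions are arbitrary assumptions for illustration.

#include <stdint.h>

#define BLOOM_BITS (1u << 16)                 /* 64 Kbit filter */

static uint64_t bloom[BLOOM_BITS / 64];

/* Two simple multiplicative hashes over the prefix value and its length. */
static uint32_t hash1(uint32_t prefix, uint32_t len)
{
    return (prefix * 2654435761u + len) & (BLOOM_BITS - 1);
}
static uint32_t hash2(uint32_t prefix, uint32_t len)
{
    return ((prefix ^ (len * 40503u)) * 2246822519u) & (BLOOM_BITS - 1);
}

/* Set both bits for a stored prefix. */
static void bloom_insert(uint32_t prefix, uint32_t len)
{
    uint32_t a = hash1(prefix, len), b = hash2(prefix, len);
    bloom[a / 64] |= 1ull << (a % 64);
    bloom[b / 64] |= 1ull << (b % 64);
}

/* Returns 0 only if the prefix is definitely absent; a return of 1 means
 * "maybe present", and only then is the slower hash-table probe performed. */
static int bloom_query(uint32_t prefix, uint32_t len)
{
    uint32_t a = hash1(prefix, len), b = hash2(prefix, len);
    return ((bloom[a / 64] >> (a % 64)) & 1) && ((bloom[b / 64] >> (b % 64)) & 1);
}

Querying such a filter for each candidate prefix length before probing the off-chip hash table skips most lengths that cannot match, which is the source of the reported speedup in the hash-based lookup algorithms.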