Biblio
On ARM processors with TrustZone security extension, asynchronous introspection mechanisms have been developed in the secure world to detect security policy violations in the normal world. These mechanisms provide security protection via passively checking the normal world snapshot. However, since previous secure world checking solutions require to suspend the entire rich OS, asynchronous introspection has not been widely adopted in the real world. Given a multi-core ARM system that can execute the two worlds simultaneously on different cores, secure world introspection can check the rich OS without suspension. However, we identify a new normal-world evasion attack that can defeat the asynchronous introspection by removing the attacking traces in parallel from one core when the security checking is performing on another core. We perform a systematic study on this attack and present its efficiency against existing asynchronous introspection mechanisms. As the countermeasure, we propose a secure and trustworthy asynchronous introspection mechanism called SATIN, which can efficiently detect the evasion attacks by increasing the attackers' evasion time cost and decreasing the defender's execution time under a safe limit. We implement a prototype on an ARM development board and the experimental results show that SATIN can effectively prevent evasion attacks on multi-core systems with a minor system overhead.
The confidentiality of data stored in embedded and handheld devices has become an urgent necessity more than ever before. Encryption of sensitive data is a well-known technique to preserve their confidentiality, however it comes with certain costs that can heavily impact the device processing resources. Utilizing multicore processors, which are equipped with current embedded devices, has brought a new era to enhance data confidentiality while maintaining suitable device performance. Encrypting the complete storage area, also known as Full Disk Encryption (FDE) can still be challenging, especially with newly emerging massive storage systems. Alternatively, since the most user sensitive data are residing inside persisting databases, it will be more efficient to focus on securing SQLite databases, through encryption, where SQLite is the most common RDBMS in handheld and embedded systems. This paper addresses the problem of ensuring data protection in embedded and mobile devices while maintaining suitable device performance by mitigating the impact of encryption. We presented here a proposed design for a parallel database encryption system, called SQLite-XTS. The proposed system encrypts data stored in databases transparently on-the-fly without the need for any user intervention. To maintain a proper device performance, the system takes advantage of the commodity multicore processors available with most embedded and mobile devices.
Cooperation of software and hardware with hybrid architectures, such as Xilinx Zynq SoC combining ARM CPU and FPGA fabric, is a high-performance and low-power platform for accelerating RSA Algorithm. This paper adopts the none-subtraction Montgomery algorithm and the Chinese Remainder Theorem (CRT) to implement high-speed RSA processors, and deploys a 48-node cluster infrastructure based on Zynq SoC to achieve extremely high scalability and throughput of RSA computing. In this design, we use the ARM to implement node-to-node communication with the Message Passing Interface (MPI) while use the FPGA to handle complex calculation. Finally, the experimental results show that the overall performance is linear with the number of nodes. And the cluster achieves 6× 9× speedup against a multi-core desktop (Intel i7-3770) and comparable performance to a many-core server (288-core). In addition, we gain up to 2.5× energy efficiency compared to these two traditional platforms.
Volume of digital data is increasing at a faster rate and the security of the data is at risk while being transit on a network as well as at rest. The execution time of full disk encryption in large servers is significant because of the computational complexity associated with disk encryption. Hence it is necessary to reduce the execution time of full disk encryption from the application point of view. In this work a full disk encryption algorithm namely EME2 AES (Encrypt Mix Encrypt V2 Advanced Encryption Standard) is analyzed. The execution speed of this algorithm is reduced by means of multicore compatible parallel implementation which makes use of available cores. Parallel implementation is executed on a multicore machine with 8 cores and speed up on the multicore implementation is measured. Results show that the multicore implementation of EME2 AES using OpenMP is up to 2.85 times faster than sequential execution for the chosen infrastructure and data range.
Near-sensor data analytics is a promising direction for internet-of-things endpoints, as it minimizes energy spent on communication and reduces network load - but it also poses security concerns, as valuable data are stored or sent over the network at various stages of the analytics pipeline. Using encryption to protect sensitive data at the boundary of the on-chip analytics engine is a way to address data security issues. To cope with the combined workload of analytics and encryption in a tight power envelope, we propose Fulmine, a system-on-chip (SoC) based on a tightly-coupled multi-core cluster augmented with specialized blocks for compute-intensive data processing and encryption functions, supporting software programmability for regular computing tasks. The Fulmine SoC, fabricated in 65-nm technology, consumes less than 20mW on average at 0.8V achieving an efficiency of up to 70pJ/B in encryption, 50pJ/px in convolution, or up to 25MIPS/mW in software. As a strong argument for real-life flexible application of our platform, we show experimental results for three secure analytics use cases: secure autonomous aerial surveillance with a state-of-the-art deep convolutional neural network (CNN) consuming 3.16pJ per equivalent reduced instruction set computer operation, local CNN-based face detection with secured remote recognition in 5.74pJ/op, and seizure detection with encrypted data collection from electroencephalogram within 12.7pJ/op.
The emergence of general-purpose system-on-chip (SoC) architectures has given rise to a number of significant security challenges. The current trend in SoC design is system-level integration of heterogeneous technologies consisting of a large number of processing elements such as programmable RISC cores, memory, DSPs, and accelerator function units/ASIC. These processing elements may come from different providers, and application executable code may have varying levels of trust. Some of the pressing architecture design questions are: (1) how to implement multi-level user-defined security; (2) how to optimally and securely share resources and data among processing elements. In this work, we develop a secure multicore architecture, named Hermes. It represents a new architectural framework that integrates multiple processing elements (called tenants) of secure and non-secure cores into the same chip design while (a) maintaining individual tenant security, (b) preventing data leakage and corruption, and (c) promoting collaboration among the tenants. The Hermes architecture is based on a programmable secure router interface and a trust-aware routing algorithm. With 17% hardware overhead, it enables the implementation of processing-element-oblivious secure multicore systems with a programmable distributed group key management scheme.
In many-core systems, the processing elements are interconnected using Networks-on-Chip. An example of on-chip network is SoCIN, a low-cost interconnect architecture whose original design did not take into account security aspects. This network is vulnerable to eavesdropping and spoofing attacks, what limits its use in systems that require security. This work addresses this issue and aims to ensure the security properties of confidentiality and authenticity of SoCIN-based systems. For this, we propose the use of security mechanisms based on symmetric encryption at the network level using the AES (Advanced Encryption Standard) model. A reference multi-core platform was implemented and prototyped in programmable logic aiming at performing experiments to evaluate the implemented mechanisms. Results demonstrate the effectiveness of the proposed solution in protecting the system against the target attacks. The impact on the network performance is acceptable and the silicon overhead is equivalent to other solutions found in the literature.
Virtualization based memory isolation has been widely used as a security primitive in many security systems. This paper firstly provides an in-depth analysis of its effectiveness in the multicore setting, a first in the literature. Our study reveals that memory isolation by itself is inadequate for security. Due to the fundamental design choices in hardware, it faces several challenging issues including page table maintenance, address mapping validation and thread identification. As demonstrated by our attacks implemented on XMHF and BitVisor, these issues undermine the security of memory isolation. Next, we propose a new isolation approach that is immune to the aforementioned problems. In our design, the hypervisor constructs a fully isolated micro computing environment (FIMCE) that exposes a minimal attack surface to an untrusted OS on a multicore platform. By virtue of its architectural niche, FIMCE offers stronger assurance and greater versatility than memory isolation. We have built a prototype of FIMCE and measured its performance. To show the benefits of using FIMCE as a building block, we have also implemented several practical applications which cannot be securely realized by using memory isolation alone.
Packet classification is a core function in network and security systems; hence, hardware-based solutions, such as packet classification accelerator chips or Ternary Content Addressable Memory (T-CAM), have been widely adopted for high-performance systems. With the rapid improvement of general hardware architectures and growing popularity of multi-core multi-threaded processors, software-based packet classification algorithms are attracting considerable attention, owing to their high flexibility in satisfying various industrial requirements for security and network systems. For high classification speed, these algorithms internally use large tables, whose size increases exponentially with the ruleset size; consequently, they cannot be used with a large rulesets. To overcome this problem, we propose a new software-based packet classification algorithm that simultaneously supports high scalability and fast classification performance by merging partition decision trees in a search table. While most partitioning-based packet classification algorithms show good scalability at the cost of low classification speed, our algorithm shows very high classification speed, irrespective of the number of rules, with small tables and short table building time. Our test results confirm that the proposed algorithm enables network and security systems to support heavy traffic in the most effective manner.
Graphics processing unit (GPU) has been applied successfully in many scientific computing realms due to its superior performances on float-pointing calculation and memory bandwidth, and has great potential in power system applications. The N-1 static security analysis (SSA) appears to be a candidate application in which massive alternating current power flow (ACPF) problems need to be solved. However, when applying existing GPU-accelerated algorithms to solve N-1 SSA problem, the degree of parallelism is limited because existing researches have been devoted to accelerating the solution of a single ACPF. This paper therefore proposes a GPU-accelerated solution that creates an additional layer of parallelism among batch ACPFs and consequently achieves a much higher level of overall parallelism. First, this paper establishes two basic principles for determining well-designed GPU algorithms, through which the limitation of GPU-accelerated sequential-ACPF solution is demonstrated. Next, being the first of its kind, this paper proposes a novel GPU-accelerated batch-QR solver, which packages massive number of QR tasks to formulate a new larger-scale problem and then achieves higher level of parallelism and better coalesced memory accesses. To further improve the efficiency of solving SSA, a GPU-accelerated batch-Jacobian-Matrix generating and contingency screening is developed and carefully optimized. Lastly, the complete process of the proposed GPU-accelerated batch-ACPF solution for SSA is presented. Case studies on an 8503-bus system show dramatic computation time reduction is achieved compared with all reported existing GPU-accelerated methods. In comparison to UMFPACK-library-based single-CPU counterpart using Intel Xeon E5-2620, the proposed GPU-accelerated SSA framework using NVIDIA K20C achieves up to 57.6 times speedup. It can even achieve four times speedup when compared to one of the fastest multi-core CPU parallel computing solution using KLU library. The prop- sed batch-solving method is practically very promising and lays a critical foundation for many other power system applications that need to deal with massive subtasks, such as Monte-Carlo simulation and probabilistic power flow.
Accelerated Processing Unit (APU) is a heterogeneous multicore processor that contains general-purpose CPU cores and a GPU in a single chip. It also supports Heterogeneous System Architecture (HSA) that provides coherent physically-shared memory between the CPU and the GPU. In this paper, we present the design and implementation of a high-performance IPsec gateway using a low-cost commodity embedded APU. The HSA supported by the APUs eliminates the data copy overhead between the CPU and the GPU, which is unavoidable in the previous discrete GPU approaches. The gateway is implemented in OpenCL to exploit the GPU and uses zero-copy packet I/O APIs in DPDK. The IPsec gateway handles the real-world network traffic where each packet has a different workload. The proposed packet scheduling algorithm significantly improves GPU utilization for such traffic. It works not only for APUs but also for discrete GPUs. With three CPU cores and one GPU in the APU, the IPsec gateway achieves a throughput of 10.36 Gbps with an average latency of 2.79 ms to perform AES-CBC+HMAC-SHA1 for incoming packets of 1024 bytes.
Developers and academics are constantly seeking to increase the speed and security of operating systems. Unfortunately, an increase in either one often comes at the cost of the other. In this paper, we present an operating system design that challenges a long-held tenet of multicore operating systems in order to produce an alternative architecture that has the potential to deliver both increased security and faster performance. In particular, we propose decoupling the operating system kernel from user processes by running each on completely separate processor cores instead of at different privilege levels within shared cores. Without using the hardware's privilege modes, virtualization and virtual memory contexts enforce the security policies necessary to maintain process isolation and protection. Our new kernel design paradigm offers the opportunity to simultaneously increase both performance and security; utilizing the hardware facilities for inter-core communication in place of those for privilege mode switching offers the opportunity for increased system call performance, while the hard separation between user processes and the kernel provides several strong security properties.
We introduce a system-level Simulation and Analysis Engine (SAE) framework based on dynamic binary instrumentation for fine-grained and customizable instruction-level introspection of everything that executes on the processor. SAE can instrument the BIOS, kernel, drivers, and user processes. It can also instrument multiple systems simultaneously using a single instrumentation interface, which is essential for studying scale-out applications. SAE is an x86 instruction set simulator designed specifically to enable rapid prototyping, evaluation, and validation of architectural extensions and program analysis tools using its flexible APIs. It is fast enough to execute full platform workloads–-a modern operating system can boot in a few minutes–-thus enabling research, evaluation, and validation of complex functionalities related to multicore configurations, virtualization, security, and more. To reach high speeds, SAE couples tightly with a virtual platform and employs both a just-in-time (JIT) compiler that helps simulate simple instructions efficiently and a fast interpreter for simulating new or complex instructions. We describe SAE's architecture and instrumentation engine design and show the framework's usefulness for single- and multi-system architectural and program analysis studies.
Modern multicore processors feature easily accessible temperature sensors that provide useful information for dynamic thermal management. These sensors were recently shown to be a potential security threat, since otherwise isolated applications can exploit them to establish a thermal covert channel and leak restricted information. Previous research showed experiments that document the feasibility of (low-rate) communication over this channel, but did not further analyze its fundamental characteristics. For this reason, the important questions of quantifying the channel capacity and achievable rates remain unanswered. To address these questions, we devise and exploit a new methodology that leverages both theoretical results from information theory and experimental data to study these thermal covert channels on modern multicores. We use spectral techniques to analyze data from two representative platforms and estimate the capacity of the channels from a source application to temperature sensors on the same or different cores. We estimate the capacity to be in the order of 300 bits per second (bps) for the same-core channel, i.e., when reading the temperature on the same core where the source application runs, and in the order of 50 bps for the 1-hop channel, i.e., when reading the temperature of the core physically next to the one where the source application runs. Moreover, we show a communication scheme that achieves rates of more than 45 bps on the same-core channel and more than 5 bps on the 1-hop channel, with less than 1% error probability. The highest rate shown in previous work was 1.33 bps on the 1-hop channel with 11% error probability.
Computed Tomography (CT) Image Reconstruction is an important technique used in a wide range of applications, ranging from explosive detection, medical imaging to scientific imaging. Among available reconstruction methods, Model Based Iterative Reconstruction (MBIR) produces higher quality images and allows for the use of more general CT scanner geometries than is possible with more commonly used methods. The high computational cost of MBIR, however, often makes it impractical in applications for which it would otherwise be ideal. This paper describes a new MBIR implementation that significantly reduces the computational cost of MBIR while retaining its benefits. It describes a novel organization of the scanner data into super-voxels (SV) that, combined with a super-voxel buffer (SVB), dramatically increase locality and prefetching, enable parallelism across SVs and lead to an average speedup of 187 on 20 cores.