Visible to the public Biblio

Filters: Keyword is Multicore Computing  [Clear All Filters]
2018-02-21
Yan, Mengjia, Gopireddy, Bhargava, Shull, Thomas, Torrellas, Josep.  2017.  Secure Hierarchy-Aware Cache Replacement Policy (SHARP): Defending Against Cache-Based Side Channel Atacks. Proceedings of the 44th Annual International Symposium on Computer Architecture. :347–360.
In cache-based side channel attacks, a spy that shares a cache with a victim probes cache locations to extract information on the victim's access patterns. For example, in evict+reload, the spy repeatedly evicts and then reloads a probe address, checking if the victim has accessed the address in between the two operations. While there are many proposals to combat these cache attacks, they all have limitations: they either hurt performance, require programmer intervention, or can only defend against some types of attacks. This paper makes the following observation for an environment with an inclusive cache hierarchy: when the spy evicts the probe address from the shared cache, the address will also be evicted from the private cache of the victim process, creating an inclusion victim. Consequently, to disable cache attacks, this paper proposes to alter the line replacement algorithm of the shared cache, to prevent a process from creating inclusion victims in the caches of cores running other processes. By enforcing this rule, the spy cannot evict the probe address from the shared cache and, hence, cannot glimpse any information on the victim's access patterns. We call our proposal SHARP (Secure Hierarchy-Aware cache Replacement Policy). SHARP efficiently defends against all existing cross-core shared-cache attacks, needs only minimal hardware modifications, and requires no code modifications. We implement SHARP in a cycle-level full-system simulator. We show that it protects against real-world attacks, and that it introduces negligible average performance degradation.
Bai, Xu, Jiang, Lei, Dai, Qiong, Yang, Jiajia, Tan, Jianlong.  2017.  Acceleration of RSA processes based on hybrid ARM-FPGA cluster. 2017 IEEE Symposium on Computers and Communications (ISCC). :682–688.

Cooperation of software and hardware with hybrid architectures, such as Xilinx Zynq SoC combining ARM CPU and FPGA fabric, is a high-performance and low-power platform for accelerating RSA Algorithm. This paper adopts the none-subtraction Montgomery algorithm and the Chinese Remainder Theorem (CRT) to implement high-speed RSA processors, and deploys a 48-node cluster infrastructure based on Zynq SoC to achieve extremely high scalability and throughput of RSA computing. In this design, we use the ARM to implement node-to-node communication with the Message Passing Interface (MPI) while use the FPGA to handle complex calculation. Finally, the experimental results show that the overall performance is linear with the number of nodes. And the cluster achieves 6× 9× speedup against a multi-core desktop (Intel i7-3770) and comparable performance to a many-core server (288-core). In addition, we gain up to 2.5× energy efficiency compared to these two traditional platforms.

Hadagali, C..  2017.  Multicore implementation of EME2 AES disk encryption algorithm using OpenMP. 2017 8th International Conference on Computing, Communication and Networking Technologies (ICCCNT). :1–6.

Volume of digital data is increasing at a faster rate and the security of the data is at risk while being transit on a network as well as at rest. The execution time of full disk encryption in large servers is significant because of the computational complexity associated with disk encryption. Hence it is necessary to reduce the execution time of full disk encryption from the application point of view. In this work a full disk encryption algorithm namely EME2 AES (Encrypt Mix Encrypt V2 Advanced Encryption Standard) is analyzed. The execution speed of this algorithm is reduced by means of multicore compatible parallel implementation which makes use of available cores. Parallel implementation is executed on a multicore machine with 8 cores and speed up on the multicore implementation is measured. Results show that the multicore implementation of EME2 AES using OpenMP is up to 2.85 times faster than sequential execution for the chosen infrastructure and data range.

Conti, F., Schilling, R., Schiavone, P. D., Pullini, A., Rossi, D., Gürkaynak, F. K., Muehlberghuber, M., Gautschi, M., Loi, I., Haugou, G. et al..  2017.  An IoT Endpoint System-on-Chip for Secure and Energy-Efficient Near-Sensor Analytics. IEEE Transactions on Circuits and Systems I: Regular Papers. 64:2481–2494.

Near-sensor data analytics is a promising direction for internet-of-things endpoints, as it minimizes energy spent on communication and reduces network load - but it also poses security concerns, as valuable data are stored or sent over the network at various stages of the analytics pipeline. Using encryption to protect sensitive data at the boundary of the on-chip analytics engine is a way to address data security issues. To cope with the combined workload of analytics and encryption in a tight power envelope, we propose Fulmine, a system-on-chip (SoC) based on a tightly-coupled multi-core cluster augmented with specialized blocks for compute-intensive data processing and encryption functions, supporting software programmability for regular computing tasks. The Fulmine SoC, fabricated in 65-nm technology, consumes less than 20mW on average at 0.8V achieving an efficiency of up to 70pJ/B in encryption, 50pJ/px in convolution, or up to 25MIPS/mW in software. As a strong argument for real-life flexible application of our platform, we show experimental results for three secure analytics use cases: secure autonomous aerial surveillance with a state-of-the-art deep convolutional neural network (CNN) consuming 3.16pJ per equivalent reduced instruction set computer operation, local CNN-based face detection with secured remote recognition in 5.74pJ/op, and seizure detection with encrypted data collection from electroencephalogram within 12.7pJ/op.

Kinsy, M. A., Khadka, S., Isakov, M., Farrukh, A..  2017.  Hermes: Secure heterogeneous multicore architecture design. 2017 IEEE International Symposium on Hardware Oriented Security and Trust (HOST). :14–20.

The emergence of general-purpose system-on-chip (SoC) architectures has given rise to a number of significant security challenges. The current trend in SoC design is system-level integration of heterogeneous technologies consisting of a large number of processing elements such as programmable RISC cores, memory, DSPs, and accelerator function units/ASIC. These processing elements may come from different providers, and application executable code may have varying levels of trust. Some of the pressing architecture design questions are: (1) how to implement multi-level user-defined security; (2) how to optimally and securely share resources and data among processing elements. In this work, we develop a secure multicore architecture, named Hermes. It represents a new architectural framework that integrates multiple processing elements (called tenants) of secure and non-secure cores into the same chip design while (a) maintaining individual tenant security, (b) preventing data leakage and corruption, and (c) promoting collaboration among the tenants. The Hermes architecture is based on a programmable secure router interface and a trust-aware routing algorithm. With 17% hardware overhead, it enables the implementation of processing-element-oblivious secure multicore systems with a programmable distributed group key management scheme.

Silva, M. R., Zeferino, C. A..  2017.  Confidentiality and Authenticity in a Platform Based on Network-on-Chip. 2017 VII Brazilian Symposium on Computing Systems Engineering (SBESC). :225–230.

In many-core systems, the processing elements are interconnected using Networks-on-Chip. An example of on-chip network is SoCIN, a low-cost interconnect architecture whose original design did not take into account security aspects. This network is vulnerable to eavesdropping and spoofing attacks, what limits its use in systems that require security. This work addresses this issue and aims to ensure the security properties of confidentiality and authenticity of SoCIN-based systems. For this, we propose the use of security mechanisms based on symmetric encryption at the network level using the AES (Advanced Encryption Standard) model. A reference multi-core platform was implemented and prototyped in programmable logic aiming at performing experiments to evaluate the implemented mechanisms. Results demonstrate the effectiveness of the proposed solution in protecting the system against the target attacks. The impact on the network performance is acceptable and the silicon overhead is equivalent to other solutions found in the literature.

Zhao, S., Ding, X..  2017.  On the Effectiveness of Virtualization Based Memory Isolation on Multicore Platforms. 2017 IEEE European Symposium on Security and Privacy (EuroS P). :546–560.

Virtualization based memory isolation has been widely used as a security primitive in many security systems. This paper firstly provides an in-depth analysis of its effectiveness in the multicore setting, a first in the literature. Our study reveals that memory isolation by itself is inadequate for security. Due to the fundamental design choices in hardware, it faces several challenging issues including page table maintenance, address mapping validation and thread identification. As demonstrated by our attacks implemented on XMHF and BitVisor, these issues undermine the security of memory isolation. Next, we propose a new isolation approach that is immune to the aforementioned problems. In our design, the hypervisor constructs a fully isolated micro computing environment (FIMCE) that exposes a minimal attack surface to an untrusted OS on a multicore platform. By virtue of its architectural niche, FIMCE offers stronger assurance and greater versatility than memory isolation. We have built a prototype of FIMCE and measured its performance. To show the benefits of using FIMCE as a building block, we have also implemented several practical applications which cannot be securely realized by using memory isolation alone.

Zheng, H., Zhang, X..  2017.  Optimizing Task Assignment with Minimum Cost on Heterogeneous Embedded Multicore Systems Considering Time Constraint. 2017 ieee 3rd international conference on big data security on cloud (bigdatasecurity), ieee international conference on high performance and smart computing (hpsc), and ieee international conference on intelligent data and security (ids). :225–230.
Time and cost are the most critical performance metrics for computer systems including embedded system, especially for the battery-based embedded systems, such as PC, mainframe computer, and smart phone. Most of the previous work focuses on saving energy in a deterministic way by taking the average or worst scenario into account. However, such deterministic approaches usually are inappropriate in modeling energy consumption because of uncertainties in conditional instructions on processors and time-varying external environments. Through studying the relationship between energy consumption, execution time and completion probability of tasks on heterogeneous multi-core architectures this paper proposes an optimal energy efficiency and system performance model and the OTHAP (Optimizing Task Heterogeneous Assignment with Probability) algorithm to address the Processor and Voltage Assignment with Probability (PVAP) problem of data-dependent aperiodic tasks in real-time embedded systems, ensuring that all the tasks can be done under the time constraint with areal-time embedded systems guaranteed probability. We adopt a task DAG (Directed Acyclic Graph) to model the PVAP problem. We first use a processor scheduling algorithm to map the task DAG onto a set of voltage-variable processors, and then use our dynamic programming algorithm to assign a proper voltage to each task and The experimental results demonstrate our approach outperforms state-of-the-art algorithms in this field (maximum improvement of 24.6%).
Pak, W., Choi, Y. J..  2017.  High Performance and High Scalable Packet Classification Algorithm for Network Security Systems. IEEE Transactions on Dependable and Secure Computing. 14:37–49.

Packet classification is a core function in network and security systems; hence, hardware-based solutions, such as packet classification accelerator chips or Ternary Content Addressable Memory (T-CAM), have been widely adopted for high-performance systems. With the rapid improvement of general hardware architectures and growing popularity of multi-core multi-threaded processors, software-based packet classification algorithms are attracting considerable attention, owing to their high flexibility in satisfying various industrial requirements for security and network systems. For high classification speed, these algorithms internally use large tables, whose size increases exponentially with the ruleset size; consequently, they cannot be used with a large rulesets. To overcome this problem, we propose a new software-based packet classification algorithm that simultaneously supports high scalability and fast classification performance by merging partition decision trees in a search table. While most partitioning-based packet classification algorithms show good scalability at the cost of low classification speed, our algorithm shows very high classification speed, irrespective of the number of rules, with small tables and short table building time. Our test results confirm that the proposed algorithm enables network and security systems to support heavy traffic in the most effective manner.

Zhou, G., Feng, Y., Bo, R., Chien, L., Zhang, X., Lang, Y., Jia, Y., Chen, Z..  2017.  GPU-Accelerated Batch-ACPF Solution for N-1 Static Security Analysis. IEEE Transactions on Smart Grid. 8:1406–1416.

Graphics processing unit (GPU) has been applied successfully in many scientific computing realms due to its superior performances on float-pointing calculation and memory bandwidth, and has great potential in power system applications. The N-1 static security analysis (SSA) appears to be a candidate application in which massive alternating current power flow (ACPF) problems need to be solved. However, when applying existing GPU-accelerated algorithms to solve N-1 SSA problem, the degree of parallelism is limited because existing researches have been devoted to accelerating the solution of a single ACPF. This paper therefore proposes a GPU-accelerated solution that creates an additional layer of parallelism among batch ACPFs and consequently achieves a much higher level of overall parallelism. First, this paper establishes two basic principles for determining well-designed GPU algorithms, through which the limitation of GPU-accelerated sequential-ACPF solution is demonstrated. Next, being the first of its kind, this paper proposes a novel GPU-accelerated batch-QR solver, which packages massive number of QR tasks to formulate a new larger-scale problem and then achieves higher level of parallelism and better coalesced memory accesses. To further improve the efficiency of solving SSA, a GPU-accelerated batch-Jacobian-Matrix generating and contingency screening is developed and carefully optimized. Lastly, the complete process of the proposed GPU-accelerated batch-ACPF solution for SSA is presented. Case studies on an 8503-bus system show dramatic computation time reduction is achieved compared with all reported existing GPU-accelerated methods. In comparison to UMFPACK-library-based single-CPU counterpart using Intel Xeon E5-2620, the proposed GPU-accelerated SSA framework using NVIDIA K20C achieves up to 57.6 times speedup. It can even achieve four times speedup when compared to one of the fastest multi-core CPU parallel computing solution using KLU library. The prop- sed batch-solving method is practically very promising and lays a critical foundation for many other power system applications that need to deal with massive subtasks, such as Monte-Carlo simulation and probabilistic power flow.