Biblio

Filters: Keyword is multicore computing security
2020-02-10
Wan, Shengye, Sun, Jianhua, Sun, Kun, Zhang, Ning, Li, Qi.  2019.  SATIN: A Secure and Trustworthy Asynchronous Introspection on Multi-Core ARM Processors. 2019 49th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN). :289–301.

On ARM processors with the TrustZone security extension, asynchronous introspection mechanisms have been developed in the secure world to detect security policy violations in the normal world. These mechanisms provide security protection by passively checking a snapshot of the normal world. However, since previous secure world checking solutions require suspending the entire rich OS, asynchronous introspection has not been widely adopted in the real world. Given a multi-core ARM system that can execute the two worlds simultaneously on different cores, secure world introspection can check the rich OS without suspension. However, we identify a new normal-world evasion attack that can defeat asynchronous introspection by removing the attack traces in parallel from one core while the security check is being performed on another core. We perform a systematic study of this attack and present its efficiency against existing asynchronous introspection mechanisms. As a countermeasure, we propose a secure and trustworthy asynchronous introspection mechanism called SATIN, which can efficiently detect evasion attacks by increasing the attacker's evasion time cost and keeping the defender's execution time under a safe limit. We implement a prototype on an ARM development board, and the experimental results show that SATIN can effectively prevent evasion attacks on multi-core systems with minor system overhead.

Zhang, Jiemin, Mao, Jian, Liu, Jinming, Tang, Zhi, Gu, Zhiling, Liu, Yongmei.  2019.  Cloud-based Multi-core Architecture against DNS Attacks. 2019 14th International Conference on Computer Science & Education (ICCSE). :391–393.
The domain name resolution system, a basic service of the Internet, supports website visits. The increase in DNS attacks has brought many challenges to network security. This paper studies key defense technologies against DNS attacks, based on the principles of DNS. A multi-core architecture customized for DNS is adopted to analyze the hardware kernel, while AI algorithms for malicious-traffic cleaning and intelligent routing run on a cloud system established specifically for DNS. The designed DNS intelligent cloud system can provide high-efficiency domain name resolution in practice while ensuring network security.
2019-11-04
Alomari, Mohammad Ahmed, Hafiz Yusoff, M., Samsudin, Khairulmizam, Ahmad, R. Badlishah.  2019.  Light Database Encryption Design Utilizing Multicore Processors for Mobile Devices. 2019 IEEE 15th International Colloquium on Signal Processing & Its Applications (CSPA). :254–259.

The confidentiality of data stored in embedded and handheld devices has become more urgent than ever before. Encryption of sensitive data is a well-known technique to preserve its confidentiality; however, it comes with certain costs that can heavily impact the device's processing resources. Utilizing the multicore processors with which current embedded devices are equipped opens a new way to enhance data confidentiality while maintaining suitable device performance. Encrypting the complete storage area, also known as Full Disk Encryption (FDE), can still be challenging, especially with newly emerging massive storage systems. Alternatively, since most sensitive user data resides inside persistent databases, it is more efficient to focus on securing SQLite databases through encryption, as SQLite is the most common RDBMS in handheld and embedded systems. This paper addresses the problem of ensuring data protection in embedded and mobile devices while maintaining suitable device performance by mitigating the impact of encryption. We present a design for a parallel database encryption system called SQLite-XTS. The proposed system encrypts data stored in databases transparently and on the fly, without the need for any user intervention. To maintain proper device performance, the system takes advantage of the commodity multicore processors available in most embedded and mobile devices.

2018-02-21
Yan, Mengjia, Gopireddy, Bhargava, Shull, Thomas, Torrellas, Josep.  2017.  Secure Hierarchy-Aware Cache Replacement Policy (SHARP): Defending Against Cache-Based Side Channel Attacks. Proceedings of the 44th Annual International Symposium on Computer Architecture. :347–360.
In cache-based side channel attacks, a spy that shares a cache with a victim probes cache locations to extract information on the victim's access patterns. For example, in evict+reload, the spy repeatedly evicts and then reloads a probe address, checking if the victim has accessed the address in between the two operations. While there are many proposals to combat these cache attacks, they all have limitations: they either hurt performance, require programmer intervention, or can only defend against some types of attacks. This paper makes the following observation for an environment with an inclusive cache hierarchy: when the spy evicts the probe address from the shared cache, the address will also be evicted from the private cache of the victim process, creating an inclusion victim. Consequently, to disable cache attacks, this paper proposes to alter the line replacement algorithm of the shared cache, to prevent a process from creating inclusion victims in the caches of cores running other processes. By enforcing this rule, the spy cannot evict the probe address from the shared cache and, hence, cannot glimpse any information on the victim's access patterns. We call our proposal SHARP (Secure Hierarchy-Aware cache Replacement Policy). SHARP efficiently defends against all existing cross-core shared-cache attacks, needs only minimal hardware modifications, and requires no code modifications. We implement SHARP in a cycle-level full-system simulator. We show that it protects against real-world attacks, and that it introduces negligible average performance degradation.
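The replacement rule described in this abstract can be illustrated with a minimal sketch. It assumes each shared-cache line records which cores hold a private copy; the names `Line`, `sharers`, and `choose_victim`, and the random fallback used when every line would create an inclusion victim, are hypothetical and are not taken from the paper.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Line:
    tag: int
    # Hypothetical bookkeeping: cores whose private caches hold this line.
    sharers: set = field(default_factory=set)


def choose_victim(cache_set, requester, rng=random):
    """Pick a line to evict from one shared-cache set on behalf of `requester`."""
    # Prefer a line held in no private cache: evicting it creates
    # no inclusion victim at all.
    for line in cache_set:
        if not line.sharers:
            return line
    # Otherwise prefer a line held only by the requesting core: the
    # inclusion victim then lands in the requester's own private cache.
    for line in cache_set:
        if line.sharers == {requester}:
            return line
    # Assumption: when neither case applies, fall back to a random line.
    return rng.choice(cache_set)
```

With this ordering, a spy running on core 0 cannot force out a line that only the victim's core holds as long as the set contains any line that is unshared or private to core 0.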
Bai, Xu, Jiang, Lei, Dai, Qiong, Yang, Jiajia, Tan, Jianlong.  2017.  Acceleration of RSA processes based on hybrid ARM-FPGA cluster. 2017 IEEE Symposium on Computers and Communications (ISCC). :682–688.

Cooperation of software and hardware on hybrid architectures, such as the Xilinx Zynq SoC, which combines an ARM CPU and FPGA fabric, provides a high-performance and low-power platform for accelerating the RSA algorithm. This paper adopts the none-subtraction Montgomery algorithm and the Chinese Remainder Theorem (CRT) to implement high-speed RSA processors, and deploys a 48-node cluster infrastructure based on the Zynq SoC to achieve extremely high scalability and throughput of RSA computing. In this design, we use the ARM to implement node-to-node communication with the Message Passing Interface (MPI), while using the FPGA to handle complex calculation. The experimental results show that the overall performance scales linearly with the number of nodes. The cluster achieves a 6× to 9× speedup against a multi-core desktop (Intel i7-3770) and comparable performance to a many-core server (288-core). In addition, we gain up to 2.5× energy efficiency compared to these two traditional platforms.
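As a rough illustration of why CRT speeds up RSA private-key operations, the sketch below splits one exponentiation modulo n into two smaller ones modulo p and q and recombines them with Garner's formula. It relies on Python's built-in `pow` rather than the Montgomery multiplier the abstract mentions, and the function name `rsa_crt_decrypt` and the toy key are illustrative assumptions, not the paper's design.

```python
def rsa_crt_decrypt(c, p, q, d):
    """Compute c^d mod (p*q) using the Chinese Remainder Theorem."""
    dp = d % (p - 1)          # exponent reduced modulo p - 1
    dq = d % (q - 1)          # exponent reduced modulo q - 1
    q_inv = pow(q, -1, p)     # modular inverse of q mod p (Python 3.8+)

    m1 = pow(c, dp, p)        # half-size exponentiation modulo p
    m2 = pow(c, dq, q)        # half-size exponentiation modulo q

    # Garner recombination: the result is congruent to m1 mod p and m2 mod q.
    h = (q_inv * (m1 - m2)) % p
    return m2 + h * q


# Toy key for illustration only; real keys use primes of 1024 bits or more.
p, q, e, d = 61, 53, 17, 2753
n = p * q
ciphertext = pow(65, e, n)
plaintext = rsa_crt_decrypt(ciphertext, p, q, d)
```

The two half-size exponentiations are independent of each other, which is what makes the CRT form attractive for parallel hardware.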

Hadagali, C..  2017.  Multicore implementation of EME2 AES disk encryption algorithm using OpenMP. 2017 8th International Conference on Computing, Communication and Networking Technologies (ICCCNT). :1–6.

The volume of digital data is increasing rapidly, and the security of the data is at risk both in transit on a network and at rest. The execution time of full disk encryption on large servers is significant because of the computational complexity associated with disk encryption. Hence it is necessary to reduce the execution time of full disk encryption from the application point of view. In this work, a full disk encryption algorithm, namely EME2 AES (Encrypt Mix Encrypt V2 Advanced Encryption Standard), is analyzed. The execution time of this algorithm is reduced by means of a multicore-compatible parallel implementation that makes use of the available cores. The parallel implementation is executed on a multicore machine with 8 cores, and the speedup of the multicore implementation is measured. Results show that the multicore implementation of EME2 AES using OpenMP is up to 2.85 times faster than sequential execution for the chosen infrastructure and data range.
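To put the reported figure in context, the short calculation below derives the parallel efficiency for a 2.85× speedup on 8 cores and, assuming Amdahl's law is an adequate model (an assumption of this sketch, not a claim from the paper), the parallel fraction of the workload that such a speedup would imply. The function names are hypothetical.

```python
def parallel_efficiency(speedup, cores):
    """Fraction of ideal linear speedup actually achieved."""
    return speedup / cores


def amdahl_parallel_fraction(speedup, cores):
    """Solve Amdahl's law, S = 1 / ((1 - f) + f / N), for the fraction f."""
    return (1 - 1 / speedup) / (1 - 1 / cores)


efficiency = parallel_efficiency(2.85, 8)
fraction = amdahl_parallel_fraction(2.85, 8)
print(f"efficiency = {efficiency:.3f}, implied parallel fraction = {fraction:.3f}")
```

Under these assumptions the efficiency is about 0.36 and the implied parallel fraction is about 0.74, which would suggest that serial portions or synchronization overhead account for a noticeable share of the run time.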

Conti, F., Schilling, R., Schiavone, P. D., Pullini, A., Rossi, D., Gürkaynak, F. K., Muehlberghuber, M., Gautschi, M., Loi, I., Haugou, G. et al..  2017.  An IoT Endpoint System-on-Chip for Secure and Energy-Efficient Near-Sensor Analytics. IEEE Transactions on Circuits and Systems I: Regular Papers. 64:2481–2494.

Near-sensor data analytics is a promising direction for internet-of-things endpoints, as it minimizes energy spent on communication and reduces network load - but it also poses security concerns, as valuable data are stored or sent over the network at various stages of the analytics pipeline. Using encryption to protect sensitive data at the boundary of the on-chip analytics engine is a way to address data security issues. To cope with the combined workload of analytics and encryption in a tight power envelope, we propose Fulmine, a system-on-chip (SoC) based on a tightly-coupled multi-core cluster augmented with specialized blocks for compute-intensive data processing and encryption functions, supporting software programmability for regular computing tasks. The Fulmine SoC, fabricated in 65-nm technology, consumes less than 20mW on average at 0.8V achieving an efficiency of up to 70pJ/B in encryption, 50pJ/px in convolution, or up to 25MIPS/mW in software. As a strong argument for real-life flexible application of our platform, we show experimental results for three secure analytics use cases: secure autonomous aerial surveillance with a state-of-the-art deep convolutional neural network (CNN) consuming 3.16pJ per equivalent reduced instruction set computer operation, local CNN-based face detection with secured remote recognition in 5.74pJ/op, and seizure detection with encrypted data collection from electroencephalogram within 12.7pJ/op.

Kinsy, M. A., Khadka, S., Isakov, M., Farrukh, A..  2017.  Hermes: Secure heterogeneous multicore architecture design. 2017 IEEE International Symposium on Hardware Oriented Security and Trust (HOST). :14–20.

The emergence of general-purpose system-on-chip (SoC) architectures has given rise to a number of significant security challenges. The current trend in SoC design is system-level integration of heterogeneous technologies consisting of a large number of processing elements such as programmable RISC cores, memory, DSPs, and accelerator function units/ASIC. These processing elements may come from different providers, and application executable code may have varying levels of trust. Some of the pressing architecture design questions are: (1) how to implement multi-level user-defined security; (2) how to optimally and securely share resources and data among processing elements. In this work, we develop a secure multicore architecture, named Hermes. It represents a new architectural framework that integrates multiple processing elements (called tenants) of secure and non-secure cores into the same chip design while (a) maintaining individual tenant security, (b) preventing data leakage and corruption, and (c) promoting collaboration among the tenants. The Hermes architecture is based on a programmable secure router interface and a trust-aware routing algorithm. With 17% hardware overhead, it enables the implementation of processing-element-oblivious secure multicore systems with a programmable distributed group key management scheme.

Silva, M. R., Zeferino, C. A..  2017.  Confidentiality and Authenticity in a Platform Based on Network-on-Chip. 2017 VII Brazilian Symposium on Computing Systems Engineering (SBESC). :225–230.

In many-core systems, the processing elements are interconnected using Networks-on-Chip. An example of an on-chip network is SoCIN, a low-cost interconnect architecture whose original design did not take security aspects into account. This network is vulnerable to eavesdropping and spoofing attacks, which limits its use in systems that require security. This work addresses this issue and aims to ensure the security properties of confidentiality and authenticity in SoCIN-based systems. For this, we propose the use of security mechanisms based on symmetric encryption at the network level using the AES (Advanced Encryption Standard) model. A reference multi-core platform was implemented and prototyped in programmable logic in order to perform experiments to evaluate the implemented mechanisms. Results demonstrate the effectiveness of the proposed solution in protecting the system against the target attacks. The impact on the network performance is acceptable and the silicon overhead is equivalent to other solutions found in the literature.

Zhao, S., Ding, X..  2017.  On the Effectiveness of Virtualization Based Memory Isolation on Multicore Platforms. 2017 IEEE European Symposium on Security and Privacy (EuroS&P). :546–560.

Virtualization based memory isolation has been widely used as a security primitive in many security systems. This paper firstly provides an in-depth analysis of its effectiveness in the multicore setting, a first in the literature. Our study reveals that memory isolation by itself is inadequate for security. Due to the fundamental design choices in hardware, it faces several challenging issues including page table maintenance, address mapping validation and thread identification. As demonstrated by our attacks implemented on XMHF and BitVisor, these issues undermine the security of memory isolation. Next, we propose a new isolation approach that is immune to the aforementioned problems. In our design, the hypervisor constructs a fully isolated micro computing environment (FIMCE) that exposes a minimal attack surface to an untrusted OS on a multicore platform. By virtue of its architectural niche, FIMCE offers stronger assurance and greater versatility than memory isolation. We have built a prototype of FIMCE and measured its performance. To show the benefits of using FIMCE as a building block, we have also implemented several practical applications which cannot be securely realized by using memory isolation alone.

Zheng, H., Zhang, X..  2017.  Optimizing Task Assignment with Minimum Cost on Heterogeneous Embedded Multicore Systems Considering Time Constraint. 2017 IEEE 3rd International Conference on Big Data Security on Cloud (BigDataSecurity), IEEE International Conference on High Performance and Smart Computing (HPSC), and IEEE International Conference on Intelligent Data and Security (IDS). :225–230.
Time and cost are the most critical performance metrics for computer systems including embedded systems, especially battery-based embedded systems, such as PCs, mainframe computers, and smart phones. Most previous work focuses on saving energy in a deterministic way by taking the average or worst-case scenario into account. However, such deterministic approaches are usually inappropriate for modeling energy consumption because of uncertainties in conditional instructions on processors and time-varying external environments. By studying the relationship between energy consumption, execution time, and completion probability of tasks on heterogeneous multi-core architectures, this paper proposes an optimal energy efficiency and system performance model and the OTHAP (Optimizing Task Heterogeneous Assignment with Probability) algorithm to address the Processor and Voltage Assignment with Probability (PVAP) problem of data-dependent aperiodic tasks in real-time embedded systems, ensuring that all the tasks can be done under the time constraint with a guaranteed probability. We adopt a task DAG (Directed Acyclic Graph) to model the PVAP problem. We first use a processor scheduling algorithm to map the task DAG onto a set of voltage-variable processors, and then use our dynamic programming algorithm to assign a proper voltage to each task. The experimental results demonstrate that our approach outperforms state-of-the-art algorithms in this field (maximum improvement of 24.6%).
Pak, W., Choi, Y. J..  2017.  High Performance and High Scalable Packet Classification Algorithm for Network Security Systems. IEEE Transactions on Dependable and Secure Computing. 14:37–49.

Packet classification is a core function in network and security systems; hence, hardware-based solutions, such as packet classification accelerator chips or Ternary Content Addressable Memory (T-CAM), have been widely adopted for high-performance systems. With the rapid improvement of general hardware architectures and the growing popularity of multi-core multi-threaded processors, software-based packet classification algorithms are attracting considerable attention, owing to their high flexibility in satisfying various industrial requirements for security and network systems. For high classification speed, these algorithms internally use large tables, whose size increases exponentially with the ruleset size; consequently, they cannot be used with large rulesets. To overcome this problem, we propose a new software-based packet classification algorithm that simultaneously supports high scalability and fast classification performance by merging partition decision trees in a search table. While most partitioning-based packet classification algorithms show good scalability at the cost of low classification speed, our algorithm shows very high classification speed, irrespective of the number of rules, with small tables and short table building time. Our test results confirm that the proposed algorithm enables network and security systems to support heavy traffic in the most effective manner.

Zhou, G., Feng, Y., Bo, R., Chien, L., Zhang, X., Lang, Y., Jia, Y., Chen, Z..  2017.  GPU-Accelerated Batch-ACPF Solution for N-1 Static Security Analysis. IEEE Transactions on Smart Grid. 8:1406–1416.

The graphics processing unit (GPU) has been applied successfully in many scientific computing realms due to its superior performance in floating-point calculation and memory bandwidth, and has great potential in power system applications. The N-1 static security analysis (SSA) appears to be a candidate application in which massive alternating current power flow (ACPF) problems need to be solved. However, when applying existing GPU-accelerated algorithms to solve the N-1 SSA problem, the degree of parallelism is limited because existing research has been devoted to accelerating the solution of a single ACPF. This paper therefore proposes a GPU-accelerated solution that creates an additional layer of parallelism among batch ACPFs and consequently achieves a much higher level of overall parallelism. First, this paper establishes two basic principles for determining well-designed GPU algorithms, through which the limitation of the GPU-accelerated sequential-ACPF solution is demonstrated. Next, being the first of its kind, this paper proposes a novel GPU-accelerated batch-QR solver, which packages a massive number of QR tasks to formulate a new larger-scale problem and thereby achieves a higher level of parallelism and better coalesced memory accesses. To further improve the efficiency of solving SSA, GPU-accelerated batch Jacobian-matrix generation and contingency screening are developed and carefully optimized. Lastly, the complete process of the proposed GPU-accelerated batch-ACPF solution for SSA is presented. Case studies on an 8503-bus system show that a dramatic computation time reduction is achieved compared with all reported existing GPU-accelerated methods. In comparison to a UMFPACK-library-based single-CPU counterpart using an Intel Xeon E5-2620, the proposed GPU-accelerated SSA framework using an NVIDIA K20C achieves up to 57.6 times speedup. It can even achieve four times speedup when compared to one of the fastest multi-core CPU parallel computing solutions using the KLU library. The proposed batch-solving method is practically very promising and lays a critical foundation for many other power system applications that need to deal with massive subtasks, such as Monte-Carlo simulation and probabilistic power flow.

2017-05-18
Park, Jungho, Jung, Wookeun, Jo, Gangwon, Lee, Ilkoo, Lee, Jaejin.  2016.  PIPSEA: A Practical IPsec Gateway on Embedded APUs. Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security. :1255–1267.

Accelerated Processing Unit (APU) is a heterogeneous multicore processor that contains general-purpose CPU cores and a GPU in a single chip. It also supports Heterogeneous System Architecture (HSA) that provides coherent physically-shared memory between the CPU and the GPU. In this paper, we present the design and implementation of a high-performance IPsec gateway using a low-cost commodity embedded APU. The HSA supported by the APUs eliminates the data copy overhead between the CPU and the GPU, which is unavoidable in the previous discrete GPU approaches. The gateway is implemented in OpenCL to exploit the GPU and uses zero-copy packet I/O APIs in DPDK. The IPsec gateway handles the real-world network traffic where each packet has a different workload. The proposed packet scheduling algorithm significantly improves GPU utilization for such traffic. It works not only for APUs but also for discrete GPUs. With three CPU cores and one GPU in the APU, the IPsec gateway achieves a throughput of 10.36 Gbps with an average latency of 2.79 ms to perform AES-CBC+HMAC-SHA1 for incoming packets of 1024 bytes.

Brookes, Scott, Taylor, Stephen.  2016.  Rethinking Operating System Design: Asymmetric Multiprocessing for Security and Performance. Proceedings of the 2016 New Security Paradigms Workshop. :68–79.

Developers and academics are constantly seeking to increase the speed and security of operating systems. Unfortunately, an increase in either one often comes at the cost of the other. In this paper, we present an operating system design that challenges a long-held tenet of multicore operating systems in order to produce an alternative architecture that has the potential to deliver both increased security and faster performance. In particular, we propose decoupling the operating system kernel from user processes by running each on completely separate processor cores instead of at different privilege levels within shared cores. Without using the hardware's privilege modes, virtualization and virtual memory contexts enforce the security policies necessary to maintain process isolation and protection. Our new kernel design paradigm offers the opportunity to simultaneously increase both performance and security; utilizing the hardware facilities for inter-core communication in place of those for privilege mode switching offers the opportunity for increased system call performance, while the hard separation between user processes and the kernel provides several strong security properties.

Chachmon, Nadav, Richins, Daniel, Cohn, Robert, Christensson, Magnus, Cui, Wenzhi, Reddi, Vijay Janapa.  2016.  Simulation and Analysis Engine for Scale-Out Workloads. Proceedings of the 2016 International Conference on Supercomputing. :22:1–22:13.

We introduce a system-level Simulation and Analysis Engine (SAE) framework based on dynamic binary instrumentation for fine-grained and customizable instruction-level introspection of everything that executes on the processor. SAE can instrument the BIOS, kernel, drivers, and user processes. It can also instrument multiple systems simultaneously using a single instrumentation interface, which is essential for studying scale-out applications. SAE is an x86 instruction set simulator designed specifically to enable rapid prototyping, evaluation, and validation of architectural extensions and program analysis tools using its flexible APIs. It is fast enough to execute full platform workloads (a modern operating system can boot in a few minutes), thus enabling research, evaluation, and validation of complex functionalities related to multicore configurations, virtualization, security, and more. To reach high speeds, SAE couples tightly with a virtual platform and employs both a just-in-time (JIT) compiler that helps simulate simple instructions efficiently and a fast interpreter for simulating new or complex instructions. We describe SAE's architecture and instrumentation engine design and show the framework's usefulness for single- and multi-system architectural and program analysis studies.

Bartolini, Davide B., Miedl, Philipp, Thiele, Lothar.  2016.  On the Capacity of Thermal Covert Channels in Multicores. Proceedings of the Eleventh European Conference on Computer Systems. :24:1–24:16.

Modern multicore processors feature easily accessible temperature sensors that provide useful information for dynamic thermal management. These sensors were recently shown to be a potential security threat, since otherwise isolated applications can exploit them to establish a thermal covert channel and leak restricted information. Previous research showed experiments that document the feasibility of (low-rate) communication over this channel, but did not further analyze its fundamental characteristics. For this reason, the important questions of quantifying the channel capacity and achievable rates remain unanswered. To address these questions, we devise and exploit a new methodology that leverages both theoretical results from information theory and experimental data to study these thermal covert channels on modern multicores. We use spectral techniques to analyze data from two representative platforms and estimate the capacity of the channels from a source application to temperature sensors on the same or different cores. We estimate the capacity to be in the order of 300 bits per second (bps) for the same-core channel, i.e., when reading the temperature on the same core where the source application runs, and in the order of 50 bps for the 1-hop channel, i.e., when reading the temperature of the core physically next to the one where the source application runs. Moreover, we show a communication scheme that achieves rates of more than 45 bps on the same-core channel and more than 5 bps on the 1-hop channel, with less than 1% error probability. The highest rate shown in previous work was 1.33 bps on the 1-hop channel with 11% error probability.

Wang, Xiao, Sabne, Amit, Kisner, Sherman, Raghunathan, Anand, Bouman, Charles, Midkiff, Samuel.  2016.  High Performance Model Based Image Reconstruction. Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. :2:1–2:12.

Computed Tomography (CT) Image Reconstruction is an important technique used in a wide range of applications, from explosive detection and medical imaging to scientific imaging. Among available reconstruction methods, Model Based Iterative Reconstruction (MBIR) produces higher quality images and allows for the use of more general CT scanner geometries than is possible with more commonly used methods. The high computational cost of MBIR, however, often makes it impractical in applications for which it would otherwise be ideal. This paper describes a new MBIR implementation that significantly reduces the computational cost of MBIR while retaining its benefits. It describes a novel organization of the scanner data into super-voxels (SV) that, combined with a super-voxel buffer (SVB), dramatically increases locality and prefetching, enables parallelism across SVs, and leads to an average speedup of 187 on 20 cores.