Multicore Computing and Security, 2014

Submitted by BrandonB on Mon, 05/18/2015 - 5:32pm


As high-performance computing has evolved toward larger and faster systems, new approaches to security have been identified. The articles cited here address security issues in multicore environments. Topics include a new secure processor that obfuscates its memory access trace, proactive dynamic load balancing on multicore systems, and an experimental OS tailored to multicore processors of interest in signal processing. These materials were published in 2014.

Krishnan, S.P.T.; Veeravalli, B., "Performance Characterization and Evaluation of HPC Algorithms on Dissimilar Multicore Architectures," High Performance Computing and Communications, 2014 IEEE 6th Intl Symp on Cyberspace Safety and Security, 2014 IEEE 11th Intl Conf on Embedded Software and Syst (HPCC,CSS,ICESS), 2014 IEEE Intl Conf on, pp. 1288, 1295, 20-22 Aug. 2014. doi: 10.1109/HPCC.2014.219
Abstract: In this paper, we share our experiences in using two important yet different High Performance Computing (HPC) architectures for evaluating two HPC algorithms. The first architecture is an Intel x64 ISA based homogenous multicore with Uniform Memory Access (UMA) type shared-memory based Symmetric Multi-Processing system. The second architecture is an IBM Power ISA based heterogenous multicore with Non-Uniform Memory Access (NUMA) based distributed-memory Asymmetric Multi-Processing system. The two HPC algorithms are for predicting biological molecular structures, specifically the RNA secondary structures. The first algorithm that we created is a parallelized version of a popular serial RNA secondary structure prediction algorithm called PKNOTS. The second algorithm is a new parallel-by-design algorithm that we have developed called MARSs. Using real Ribo-Nucleic Acid (RNA) sequences, we conducted large-scale experiments involving hundreds of sequences using the above two algorithms. Based on thousands of data points that we collected as an outcome of our experiments, we report on the observed performance metrics for both the algorithms on the two architectures. Through our experiments, we infer that architectures with specialized coprocessors for number-crunching along with high-speed memory bus and dedicated bus controllers generally perform better than general-purpose multi-processor architectures. In addition, we observed that algorithms that are intrinsically parallelized by design are able to scale & perform better by taking advantage of the underlying parallel architecture. We further share best practices on handling scalability aspects with regards to workload size. We believe our results are applicable to other HPC applications on similar HPC architectures.
Keywords: parallel architectures; shared memory systems; HPC algorithms; HPC architectures; IBM Power ISA; Intel x64 ISA; MARSs; NUMA; PKNOTS; RNA secondary structures; RNA sequences; UMA type shared-memory; biological molecular structure prediction; dedicated bus controllers; dissimilar multicore architectures; distributed-memory asymmetric multiprocessing system; heterogenous multicore; high performance computing architectures; high-speed memory bus; homogenous multicore; nonuniform memory access; number-crunching; parallel architecture; parallel-by-design algorithm; parallelized version; performance characterization; ribo-nucleic acid sequences; serial RNA secondary structure prediction algorithm; specialized coprocessors; uniform memory access type shared-memory; Algorithm design and analysis; Measurement; Multicore processing; Prediction algorithms; Program processors; RNA (ID#: 15-5299)
URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7056909&isnumber=7056577

Venkatesan, V.; Wei Qingsong; Tay, Y.C., "Ex-Tmem: Extending Transcendent Memory with Non-volatile Memory for Virtual Machines," High Performance Computing and Communications, 2014 IEEE 6th Intl Symp on Cyberspace Safety and Security, 2014 IEEE 11th Intl Conf on Embedded Software and Syst (HPCC,CSS,ICESS), 2014 IEEE Intl Conf on, pp. 966, 973, 20-22 Aug. 2014. doi: 10.1109/HPCC.2014.160
Abstract: Virtualization and multicore technology now make it possible to consolidate heterogeneous workloads on one physical machine. Such consolidation helps reduce the amount of idle resources. In particular, transcendent memory is a recent idea to gather idle memory into a pool that is shared by virtual machines (VMs). Transcendent memory, or Tmem, is a new approach to optimize RAM utilization in a virtual environment where underutilized RAM from each guest VM and RAM unassigned to any guest (fallow memory) are collected into a central pool at the hypervisor (or VMM) that is shared by VMs. It can be viewed as a new level in the memory hierarchy for VMs, between main memory and disks. However, the size of transcendent memory is unstable and frequently fluctuates with changing workloads, and contention among VMs over transcendent memory can cause increased cache misses. In this paper, we propose a mechanism to extend transcendent memory (called Ex-Tmem) by using emerging non-volatile memory. Ex-Tmem stores clean pages in a two-level buffering hierarchy with locality-aware data placement and replacement. In addition, Ex-Tmem enables memory-to-memory swapping by using non-volatile memory and eliminates expensive I/O caused by swapping. Extensive experiments on an implemented prototype indicate that Ex-Tmem improves performance by up to 50% and reduces disk I/O by up to 37%, compared to existing Tmem.
Keywords: random-access storage; virtual machines; Ex-Tmem; RAM unassigned resources; VMM; cache misses; central pool gather idle memory; consolidate heterogeneous workloads; extending transcendent memory; guest VM; hyper visor; locality aware data placement; memory hierarchy; memory-to-memory swapping; multicore technology; nonvolatile memory; physical optimize RAM utilization; replacement; two-level buffering hierarchy; underutilized RAM; virtual environment; virtual machines; virtualization; Kernel; Nonvolatile memory; Phase change materials; Random access memory; Servers; Virtual machine monitors; Virtual machining; Non-volatile Memory; Transcendent Memory; Virtual Machines (VMs) (ID#: 15-5300)
URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7056862&isnumber=7056577

Shafaei, M.; Yunsi Fei, "HiTS: A High Throughput Memory Scheduling Scheme to Mitigate Denial-of-Service Attacks in Multi-core Systems," Computer Architecture and High Performance Computing (SBAC-PAD), 2014 IEEE 26th International Symposium on, pp. 206, 213, 22-24 Oct. 2014. doi: 10.1109/SBAC-PAD.2014.36
Abstract: Sharing DRAM memory by multiple cores in a computer system potentially exposes the running threads on cores to denial-of-service (DoS) attacks. This issue is usually addressed by memory scheduling schemes that rotate the memory service among threads according to a certain ranking mechanism. These ranking-based schemes, however, often incur many memory banks' row-buffer conflicts which reduce the throughput of DRAM and the entire system. This paper proposes a new ranking-based memory scheduling scheme, called HiTS, to mitigate DoS attacks in multicore systems with the lowest performance degradation. HiTS achieves these by ranking threads according to each thread's memory usage/requirement. HiTS then enforces the ranking in a way that minimum performance overhead would occur and fairness is also balanced. The effectiveness of HiTS is evaluated by simulations with 18 different workloads running on 8- and 16-core machines. The simulation results show up to 15.8% improvements in terms of unfairness reduction and 24.1% in system throughput compared with the best existing scheduling scheme.
Keywords: DRAM chips; computer network security; multiprocessing systems; DRAM memory; DoS attacks; HiTS; denial-of-service attacks; high throughput memory scheduling scheme; multicore systems; ranking-based memory scheduling scheme; Benchmark testing; Computer crime; Instruction sets; Message systems; Random access memory; Switches; Throughput; DRAM memory; denial-of-service attack; memory scheduling scheme; multi-core systems (ID#: 15-5301)
URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6970666&isnumber=6970630

Moreira, J.; Teixeira, L.; Borin, E.; Rigo, S., "Leveraging Optimization Methods for Dynamically Assisted Control-Flow Integrity Mechanisms," Computer Architecture and High Performance Computing (SBAC-PAD), 2014 IEEE 26th International Symposium on, pp. 49, 56, 22-24 Oct. 2014. doi: 10.1109/SBAC-PAD.2014.35
Abstract: Dynamic Binary Modification (DBM) tools are useful for cross-platform execution of binaries and are powerful run time environments that allow execution optimizations, instrumentation and profiling. These tools have also been used as enablers for control-flow integrity verification, a process that consists of observing and analyzing a program's execution path to detect anomalies, such as those arising from flow corruption based software attacks. Even though this class of tools helps us in identifying a myriad of attacks, it is typically expensive at run time and introduces significant overhead to the program execution. Considering their inherent high cost, further expanding the capabilities of such tools for detection of program flow anomalies can slow down the analysis to the point that it is infeasible to run it in real world workflows. In this paper we present a mechanism for including program flow verification in DBMs that uses asynchronous analysis and applies different parallel-programming techniques that leverage current multi-core systems to control the overhead of our analysis. Our mechanism was tested against synthetic program flow corruption use cases and correctly detected all detours. With our new optimizations, we show that our system achieves a slowdown of only 1.46x, while a naively implemented verification system faces 4.22x of overhead.
Keywords: multiprocessing systems; parallel programming; program control structures; program diagnostics; program verification; security of data; software tools; DBM tools; asynchronous analysis; control-flow integrity verification; cross-platform execution; dynamic binary modification tools; dynamically assisted control-flow integrity mechanisms; flow corruption based software attacks; multicore systems; optimization methods; parallel-programming techniques; program execution path; program flow anomaly detection; program flow verification; run time environments; synthetic program flow corruption; verification system; Benchmark testing; Computer architecture; Instruments; Monitoring; Optimization; Security; Software (ID#: 15-5302)
URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6970646&isnumber=6970630

March, J.L.; Petit, S.; Sahuquillo, J.; Hassan, H.; Duato, J., "Dynamic WCET Estimation for Real-Time Multicore Embedded Systems Supporting DVFS," High Performance Computing and Communications, 2014 IEEE 6th Intl Symp on Cyberspace Safety and Security, 2014 IEEE 11th Intl Conf on Embedded Software and Syst (HPCC,CSS,ICESS), 2014 IEEE Intl Conf on, pp. 27, 33, 20-22 Aug. 2014. doi: 10.1109/HPCC.2014.11
Abstract: A key issue in reducing the number of deadline misses and improving energy savings in embedded real-time systems is to accurately estimate the execution time of tasks as a function of the processor frequency. Existing execution time models, however, tend to rely on off-line analysis or on the assumption that the memory access time (quantified in processor cycles) is constant, ignoring that memory system components are not affected by the processor clock. In this paper, we propose the Processor-Memory (Proc-Mem) model, which dynamically predicts the execution time of the applications running in a multicore processor when varying the processor frequency. The Proc-Mem approach is compared with a typical Constant Memory Access Time model, namely CMAT. Results show that the deviation of Proc-Mem is always lower than 6% with respect to the measured execution time, while the deviation of the CMAT model always exceeds 30%. These results translate into important energy savings for a similar number of deadline misses. Energy savings are 22.9% on average, and up to 47.8%, in the studied mixes.
Keywords: embedded systems; energy conservation; multiprocessing systems; power aware computing; CMAT model; DVFS; constant memory access time model; deadline misses; dynamic WCET estimation; energy savings; execution time estimation; execution time models; memory system components; multicore processor; off-line analysis; proc-mem model; processor clock; processor cycles; processor frequency; processor-memory model; real-time multicore embedded systems; worst case execution time; Benchmark testing; Estimation; Frequency estimation; Mathematical model; Multicore processing; Real-time systems; Time-frequency analysis (ID#: 15-5303)
URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7056713&isnumber=7056577
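
Editor's illustration: the abstract above contrasts models that treat memory access time as a constant number of processor cycles with a model that accounts for memory components not following the processor clock. The sketch below shows why that distinction matters when the frequency is lowered. The two-term split, the function names, and all numbers are illustrative assumptions; this is not the authors' Proc-Mem formulation.

```python
def cmat_estimate(total_cycles_at_ref: float, f_hz: float) -> float:
    """Constant-memory-access-time style estimate: every cycle scales with f."""
    return total_cycles_at_ref / f_hz


def split_estimate(cpu_cycles: float, mem_time_s: float, f_hz: float) -> float:
    """Only compute cycles scale with f; memory time stays fixed in seconds."""
    return cpu_cycles / f_hz + mem_time_s


# Hypothetical task profiled at 1 GHz: 6e8 compute cycles plus 0.4 s of memory time.
f_ref = 1.0e9
cpu_cycles = 6.0e8
mem_time_s = 0.4
# A constant-cycle model would count the memory time as cycles at the reference clock.
total_cycles_at_ref = cpu_cycles + mem_time_s * f_ref

f_low = 0.5e9  # halve the clock, as a DVFS governor might
t_cmat = cmat_estimate(total_cycles_at_ref, f_low)
t_split = split_estimate(cpu_cycles, mem_time_s, f_low)
print(f"constant-cycle estimate: {t_cmat:.2f} s, split estimate: {t_split:.2f} s")
```

Both estimates agree at the reference frequency and diverge as the clock moves away from it, because the constant-cycle model also stretches the memory portion.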

Park, S.; Park, Y.B., "A Multi-Core Architectural Pattern Selection Method for the Transition from Single-Core to Multi-Core Architecture," IT Convergence and Security (ICITCS), 2014 International Conference on, pp. 1, 5, 28-30 Oct. 2014. doi: 10.1109/ICITCS.2014.7021712
Abstract: Along with rapid advancements of convergent devices, increased software complexity paired with contrastingly shortened software product lifecycles has introduced new challenges, from which the need to transform legacy single-core based systems to multi-core systems has emerged. Unfortunately, existing software development processes are late in providing adequate support for multi-core parallelization, failing to keep up with the speed of advancements in multi-core based hardware systems. To address this gap, in our previous work we have proposed a software development process to support the transition of an existing single-core based software to a multi-core equivalent. We have also introduced a tool, the Architectural Decision Supporter (ADS), to assist in the selection of appropriate multi-core architectural patterns and in the search for proper construction components. In this paper, we introduce a selection method for choosing the most desirable candidate among various multi-core architectural patterns implemented using ADS. The proposed method provides the means to combine the contextual knowledge of domain applications and the technical knowledge of individual architectural patterns for multi-core processing.
Keywords: multiprocessing systems; software architecture; software maintenance; ADS; Architectural Decision Supporter; domain applications contextual knowledge; individual architectural pattern technical knowledge; multicore architectural pattern selection method; multicore processing; single-core architecture; Concurrent computing; Decoding; Educational institutions; Hardware; Multicore processing; Software (ID#: 15-5304)
URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7021712&isnumber=7021698

Joonho Kong; Koushanfar, F., "Processor-Based Strong Physical Unclonable Functions With Aging-Based Response Tuning," Emerging Topics in Computing, IEEE Transactions on, vol. 2, no. 1, pp. 16, 29, March 2014. doi: 10.1109/TETC.2013.2289385
Abstract: A strong physically unclonable function (PUF) is a circuit structure that extracts an exponential number of unique chip signatures from a bounded number of circuit components. The strong PUF unique signatures can enable a variety of low-overhead security and intellectual property protection protocols applicable to several computing platforms. This paper proposes a novel lightweight (low overhead) strong PUF based on the timings of a classic processor architecture. A small amount of circuitry is added to the processor for on-the-fly extraction of the unique timing signatures. To achieve desirable strong PUF properties, we develop an algorithm that leverages intentional post-silicon aging to tune the inter- and intra-chip signatures variation. Our evaluation results show that the new PUF meets the desirable inter- and intra-chip strong PUF characteristics, whereas its overhead is much lower than the existing strong PUFs. For the processors implemented in 45 nm technology, the average inter-chip Hamming distance for 32-bit responses is increased by 16.1% after applying our post-silicon tuning method; the aging algorithm also decreases the average intra-chip Hamming distance by 98.1% (for 32-bit responses).
Keywords: computer architecture; cryptographic protocols; digital signatures; microprocessor chips; Hamming distance; PUF; aging based response tuning; circuit components; circuit structure; computing platforms; exponential number; intellectual property protection protocols; processor architecture; processor based strong physical unclonable functions; unique chip signatures; Aging; Circuit optimization; Delays; Logic gates; Microprocessors; Multicore processing; Network security; Silicon; Temperature measurement; Circuit aging; Multi-core processor; Negative bias temperature instability; Physically unclonable function; Post-silicon tuning; Secure computing platform; circuit aging; multi-core processor; negative bias temperature instability; postsilicon tuning; secure computing platform (ID#: 15-5305)
URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6656920&isnumber=6824880
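
Editor's illustration: the abstract above reports results as average inter-chip and intra-chip Hamming distances over 32-bit responses. The sketch below shows how such a normalized average can be computed from a list of responses. The helper names and the sample responses are hypothetical and are not taken from the paper; the same function serves for either metric depending on whether the list holds responses from different chips or repeated responses from one chip.

```python
from itertools import combinations


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two responses differ."""
    return bin(a ^ b).count("1")


def mean_pairwise_hd(responses, bits=32):
    """Average Hamming distance over all response pairs, as a fraction of `bits`."""
    pairs = list(combinations(responses, 2))
    return sum(hamming(a, b) for a, b in pairs) / (len(pairs) * bits)


# Made-up 32-bit responses to one challenge from three hypothetical chips.
inter_chip = mean_pairwise_hd([0x00000000, 0xFFFFFFFF, 0x0000FFFF])
# Made-up repeated readings from a single hypothetical chip.
intra_chip = mean_pairwise_hd([0xDEADBEEF, 0xDEADBEEF, 0xDEADBEEE])
print(f"inter-chip: {inter_chip:.3f}, intra-chip: {intra_chip:.3f}")
```

For a PUF, an inter-chip value near 0.5 (responses look unrelated across chips) and an intra-chip value near 0 (a chip reproduces its own response) are the usual targets.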

Yuheng Yuan; Zhenzhong He; Zheng Gong; Weidong Qiu, "Acceleration of AES Encryption with OpenCL," Information Security (ASIA JCIS), 2014 Ninth Asia Joint Conference on, pp. 64, 70, 3-5 Sept. 2014. doi: 10.1109/AsiaJCIS.2014.19
Abstract: The emergence of multi-core processors has made parallel techniques popular. OpenCL, which enables access to the computing power of multiple platforms and takes advantage of the parallel features of computing devices, is gradually gaining researchers' favor. However, when using parallel techniques, the choice of computation granularity and memory allocation strategy troubles developers the most. To solve this problem, many researchers have conducted experiments on Nvidia GPUs and found the best solution for using CUDA. When it comes to using OpenCL on AMD GPUs, to the best of our knowledge, fewer solutions have been proposed in the literature. Therefore, we conduct several experiments to demonstrate the relation between computation granularity and memory allocation methods of the input data when using OpenCL for AES encoding. At a granularity of 16 bytes/thread, the encryption throughput in our experiment reaches 5 Gbps. Compared with previous works, the ratio between GPU price and performance in our experiment is promising.
Keywords: cryptography; graphics processing units; multiprocessing systems; parallel processing; storage allocation; AES encoding; AES encryption; AMD GPU; CUDA; Nvidia GPU; OpenCL; computation granularity; computing device; encryption throughput; memory allocation method; memory allocation strategy; multicore processor; parallel technique; Computational modeling; Encryption; Graphics processing units; Instruction sets; Parallel processing; Resource management; Throughput; AES; Cryptography algorithm; Fast parallel implementation; GPU; OpenCL (ID#: 15-5306)
URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7023241&isnumber=7022266

Martinsen, J.K.; Grahn, H.; Isberg, A.; Sundstrom, H., "Reducing Memory in Software-Based Thread-Level Speculation for JavaScript Virtual Machine Execution of Web Applications," High Performance Computing and Communications, 2014 IEEE 6th Intl Symp on Cyberspace Safety and Security, 2014 IEEE 11th Intl Conf on Embedded Software and Syst (HPCC,CSS,ICESS), 2014 IEEE Intl Conf on, pp. 181, 184, 20-22 Aug. 2014. doi: 10.1109/HPCC.2014.34
Abstract: Thread-Level Speculation has been used to take advantage of multicore processors in virtual execution environments for the sequential JavaScript scripting language. While the results are promising the memory overhead is high. Here we propose to reduce the memory usage by limiting the checkpoint depth based on an in-depth study of the memory and execution time effects. We also propose an adaptive heuristic to dynamically adjust the checkpoints. We evaluate this using 15 web applications on an 8-core computer. The results show that the memory overhead is reduced for Thread Level Speculation by over 90% as compared to storing all checkpoints. Further, the performance is often better than when storing all the checkpoints and at worst 4% slower.
Keywords: Internet; Java; authoring languages; checkpointing; multiprocessing systems; virtual machines; JavaScript scripting language; Javascript virtual machine execution; Web applications; checkpoint depth; memory overhead; memory usage; multicore processor; software-based thread-level speculation; virtual execution environment; Electronic publishing; Encyclopedias; Instruction sets; Internet; Limiting; Memory management; multicore; thread-level speculation; web applications (ID#: 15-5307)
URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7056737&isnumber=7056577

Gray, A.; Stratford, K., "targetDP: an Abstraction of Lattice Based Parallelism with Portable Performance," High Performance Computing and Communications, 2014 IEEE 6th Intl Symp on Cyberspace Safety and Security, 2014 IEEE 11th Intl Conf on Embedded Software and Syst (HPCC,CSS,ICESS), 2014 IEEE Intl Conf on, pp. 312, 315, 20-22 Aug. 2014. doi: 10.1109/HPCC.2014.212
Abstract: To achieve high performance on modern computers, it is vital to map algorithmic parallelism to that inherent in the hardware. From an application developer's perspective, it is also important that code can be maintained in a portable manner across a range of hardware. Here we present targetDP (target Data Parallel), a lightweight programming layer that allows the abstraction of data parallelism for applications that employ structured grids. A single source code may be used to target both thread level parallelism (TLP) and instruction level parallelism (ILP) on either SIMD multi-core CPUs or GPU accelerated platforms. targetDP is implemented via standard C preprocessor macros and library functions, can be added to existing applications incrementally, and can be combined with higher-level paradigms such as MPI. We present CPU and GPU performance results for a benchmark taken from the lattice Boltzmann application that motivated this work. These demonstrate not only performance portability, but also the optimization resulting from the intelligent exposure of ILP.
Keywords: C language; application program interfaces; graphics processing units; lattice Boltzmann methods; macros; message passing; multi-threading; multiprocessing systems; programming environments; software libraries; source code (software); ILP; MPI; SIMD multicore CPU accelerated platform; SIMD multicore GPU-accelerated platform; TLP; algorithmic parallelism; data parallelism abstraction; higher-level paradigms; instruction level parallelism; lattice Boltzmann application; lattice based parallelism abstraction; library functions; lightweight programming layer; performance portability; portable performance; source code; standard C preprocessor macros; structured grids; target data parallel; targetDP; thread level parallelism; Computer architecture; Graphics processing units; Hardware; Lattices; Libraries; Parallel processing; Vectors (ID#: 15-5308)
URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7056758&isnumber=7056577

Lindsay, A.; Ravindran, B., "On Cache-Aware Task Partitioning for Multicore Embedded Real-Time Systems," High Performance Computing and Communications, 2014 IEEE 6th Intl Symp on Cyberspace Safety and Security, 2014 IEEE 11th Intl Conf on Embedded Software and Syst (HPCC,CSS,ICESS), 2014 IEEE Intl Conf on, pp. 677, 684, 20-22 Aug. 2014. doi: 10.1109/HPCC.2014.105
Abstract: One approach for real-time scheduling on multicore platforms involves task partitioning, which statically assigns tasks to cores, enabling subsequent core-local scheduling. No past partitioning schemes explicitly consider cache effects. We present a partitioning scheme called LWFG, which minimizes cache misses by partitioning tasks that share memory onto the same core and by evenly distributing the total working set size across cores. Our implementation reveals that LWFG improves execution efficiency and reduces mean maximum tardiness over past works by as much as 15% and 60%, respectively.
Keywords: cache storage; multiprocessing systems; real-time systems; scheduling; shared memory systems; LWFG; cache aware task partitioning; core local scheduling; multicore embedded real-time systems; multicore platforms; real-time scheduling; share memory; Job shop scheduling; Linux; Multicore processing; Partitioning algorithms; Processor scheduling; Real-time systems; Schedules; WSS; cache; multicore; real-time; scheduling (ID#: 15-5309)
URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7056816&isnumber=7056577
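
Editor's illustration: the abstract above describes placing tasks that share memory on the same core while spreading total working set size (WSS) evenly across cores. The sketch below is one simple way to combine those two goals: merge sharing tasks into groups, then assign the largest group first to the least-loaded core. It is a generic greedy heuristic written for illustration, not the LWFG algorithm itself; the task names and WSS values are made up, and real-time schedulability checks are deliberately omitted.

```python
from collections import defaultdict


def partition(tasks, shares, num_cores):
    """tasks: {name: wss}; shares: (a, b) pairs of tasks that share memory."""
    parent = {t: t for t in tasks}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Merge tasks that share memory so each group lands on a single core.
    for a, b in shares:
        parent[find(a)] = find(b)

    groups = defaultdict(list)
    for t in tasks:
        groups[find(t)].append(t)

    # Largest group first, each onto the core with the smallest WSS so far.
    ordered = sorted(groups.values(),
                     key=lambda g: sum(tasks[t] for t in g), reverse=True)
    cores = [[] for _ in range(num_cores)]
    load = [0] * num_cores
    for g in ordered:
        i = load.index(min(load))
        cores[i].extend(g)
        load[i] += sum(tasks[t] for t in g)
    return cores, load


# Hypothetical task set: WSS in KiB, with two pairs of tasks sharing memory.
tasks = {"a": 64, "b": 32, "c": 48, "d": 16, "e": 40}
cores, load = partition(tasks, [("a", "b"), ("c", "d")], num_cores=2)
print(cores, load)
```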

Berger, M.; Erlacher, F.; Sommer, C.; Dressler, F., "Adaptive Load Allocation For Combining Anomaly Detectors Using Controlled Skips," Computing, Networking and Communications (ICNC), 2014 International Conference on, pp. 792, 796, 3-6 Feb. 2014. doi: 10.1109/ICCNC.2014.6785438
Abstract: Traditional Intrusion Detection Systems (IDS) can be complemented by an Anomaly Detection Algorithm (ADA) to also identify unknown attacks. We argue that, as each ADA has its own strengths and weaknesses, it might be beneficial to rely on multiple ADAs to obtain deeper insights. ADAs are very resource intensive; thus, real-time detection with multiple algorithms is even more challenging in high-speed networks. To handle such high data rates, we developed a controlled load allocation scheme that adaptively allocates multiple ADAs on a multi-core system. The key idea of this concept is to utilize as many algorithms as possible without causing random packet drops, which is the typical system behavior in overload situations. We developed a proof of concept anomaly detection framework with a sample set of ADAs. Our experiments confirm that the detection performance can substantially benefit from using multiple algorithms and that the developed framework is also able to cope with high packet rates.
Keywords: multiprocessing systems; real-time systems; resource allocation; security of data; ADA; IDS; adaptive load allocation; anomaly detection algorithm; controlled load allocation; controlled skips; high-speed networks; intrusion detection systems; multicore system; multiple algorithms; real-time detection; resource intensive; unknown attacks; High-speed networks; Intrusion detection; Probabilistic logic; Reliability; Uplink; World Wide Web (ID#: 15-5310)
URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6785438&isnumber=6785290

Ye, J.; Songyuan Li; Tianzhou Chen; Minghui Wu; Li Liu, "Core Affinity Code Block Schedule to Reduce Inter-core Data Synchronization of SpMT," High Performance Computing and Communications, 2014 IEEE 6th Intl Symp on Cyberspace Safety and Security, 2014 IEEE 11th Intl Conf on Embedded Software and Syst (HPCC,CSS,ICESS), 2014 IEEE Intl Conf on, pp. 1002, 1007, 20-22 Aug. 2014. doi: 10.1109/HPCC.2014.175
Abstract: Extracting parallelism from programs is growing in importance as the number of processor cores increases. Parallelization usually involves splitting a sequential thread and scheduling the split code to run on multiple cores. For example, some previous Speculative Multi-Threading research used code block reordering to automatically parallelize a sequential thread on multi-core processors. Although the parallelized code blocks can run on different cores, there may still be some data dependences among them. Therefore such parallelization will introduce data dependences among the cores where the code blocks run, which should be resolved alongside the execution by cross-core data sync. Cross-core data sync is usually expensive. This paper proposes to minimize the cross-core data sync with core affinity aware code block scheduling. Our work is based on a Speculative Multi-Threading (SpMT) approach with code block reordering. We improve it by implementing an affinity aware block scheduling algorithm. We built a simulator to model the SpMT architecture, and conducted experiments with SPEC2006 benchmarks. The data shows that a substantial amount of cross-core data sync can be eliminated (e.g. up to 28.7% for gromacs) by the affinity aware block scheduling. For an inter-core register sync delay of 5 cycles, this may suggest a 3.73% increase in performance.
Keywords: multi-threading; multiprocessing systems; scheduling; synchronisation; SpMT; affinity aware block scheduling algorithm; code parallelization; core affinity code block schedule; intercore data synchronization; speculative multithreading; Educational institutions; Instruction sets; Multicore processing; Parallel processing; Registers; Schedules; Synchronization; data sync; multi-core; parallelization; speculative multithreading (ID#: 15-5311)
URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7056867&isnumber=7056577

Kishore, N.; Kapoor, B., "An Efficient Parallel Algorithm For Hash Computation In Security And Forensics Applications," Advance Computing Conference (IACC), 2014 IEEE International, pp. 873, 877, 21-22 Feb. 2014. doi: 10.1109/IAdCC.2014.6779437
Abstract: Hashing algorithms are used extensively in information security and digital forensics applications. This paper presents an efficient parallel algorithm for hash computation. It is a modification of the SHA-1 algorithm for faster parallel implementation in applications such as digital signatures and data preservation in digital forensics. The algorithm implements recursive hash to break the chain dependencies of the standard hash function. We discuss the theoretical foundation for the work including the collision probability and the performance implications. The algorithm is implemented using the OpenMP API and experiments performed using machines with multicore processors. The results show a performance gain by more than a factor of 3 when running on the 8-core configuration of the machine.
Keywords: application program interfaces; cryptography; digital forensics; digital signatures; file organisation; parallel algorithms; probability; OpenMP API; SHA-1 algorithm; collision probability; data preservation; digital forensics; digital signature; hash computation; hashing algorithms; information security; parallel algorithm; standard hash function; Algorithm design and analysis; Conferences; Cryptography; Multicore processing; Program processors; Standards; Cryptographic Hash Function; Digital Forensics; Digital Signature; MD5; Multicore Processors; OpenMP; SHA-1 (ID#: 15-5312)
URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6779437&isnumber=6779283
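
Editor's illustration: the abstract above describes breaking the chain dependency of a standard hash so that parts of the input can be hashed in parallel. The sketch below shows the general idea with a two-level construction: hash fixed-size chunks independently, then hash the concatenated chunk digests. The construction, the function name, and the parameters are assumptions made for illustration; this is not the paper's algorithm, it uses Python threads rather than OpenMP, and its output intentionally differs from plain SHA-1 of the whole input.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def _sha1(chunk: bytes) -> bytes:
    """Digest of one chunk; independent of every other chunk."""
    return hashlib.sha1(chunk).digest()


def two_level_sha1(data: bytes, chunk_size: int = 1 << 20, workers: int = 4) -> str:
    """Hash chunks in parallel, then hash the ordered list of chunk digests."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)] or [b""]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so the result does not depend on timing.
        digests = list(pool.map(_sha1, chunks))
    return hashlib.sha1(b"".join(digests)).hexdigest()


sample = bytes(range(256)) * 64  # 16 KiB of sample data
print(two_level_sha1(sample, chunk_size=1024))
```

Because the chunk size changes the result, any verifier must use the same chunk size as the party that produced the digest.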

Tian Xu; Cockshott, P.; Oehler, S., "Acceleration of Stereo-Matching on Multi-core CPU and GPU," High Performance Computing and Communications, 2014 IEEE 6th Intl Symp on Cyberspace Safety and Security, 2014 IEEE 11th Intl Conf on Embedded Software and Syst (HPCC,CSS,ICESS), 2014 IEEE Intl Conf on, pp. 108, 115, 20-22 Aug. 2014. doi: 10.1109/HPCC.2014.22
Abstract: This paper presents an accelerated version of a dense stereo-correspondence algorithm for two different parallelism enabled architectures, multi-core CPU and GPU. The algorithm is part of the vision system developed for a binocular robot-head in the context of the CloPeMa research project. This research project focuses on the conception of a new clothes folding robot with real-time and high resolution requirements for the vision system. The performance analysis shows that the parallelised stereo-matching algorithm has been significantly accelerated, maintaining 12× and 176× speed-up respectively for multi-core CPU and GPU, compared with SISD (Single Instruction, Single Data) single-thread CPU. To analyse the origin of the speed-up and gain deeper understanding about the choice of the optimal hardware, the algorithm was broken into key sub-tasks and the performance was tested for four different hardware architectures.
Keywords: graphics processing units; image matching; image resolution; multiprocessing systems; parallel architectures; robot vision; service robots; stereo image processing; CloPeMa research project; GPU; SISD single-thread CPU; binocular robot-head; clothes folding robot; dense stereo-correspondence algorithm; hardware architectures; high resolution requirements; multicore CPU; parallelised stereo-matching algorithm; single instruction single data; stereo-matching acceleration; vision system; Acceleration; Algorithm design and analysis; Graphics processing units; Image resolution; Instruction sets; Robots; Acceleration; Dense-correspondences; Multi-core CPU; Robotic vision; Stereo matching (ID#: 15-5313)
URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7056725&isnumber=7056577

Hongwei Zhou; Rangyu Deng; Zefu Dai; Xiaobo Yan; Ying Zhang; Caixia Sun, "The Virtual Open Page Buffer for Multi-core and Multi-thread Processors," High Performance Computing and Communications, 2014 IEEE 6th Intl Symp on Cyberspace Safety and Security, 2014 IEEE 11th Intl Conf on Embedded Software and Syst (HPCC,CSS,ICESS), 2014 IEEE Intl Conf on, pp. 290, 297, 20-22 Aug. 2014. doi: 10.1109/HPCC.2014.52
Abstract: The performance of off-chip DDRx SDRAM has been greatly restricted by the single physical page that can be activated for each DRAM bank at any time. To alleviate this problem, an on-chip virtual open page buffer (VOPB) for multi-core multi-thread processor is proposed. The VOPB maintains a number of virtual active pages for each bank of off-chip memory, which effectively increases the maximum number of active pages and reduces page conflicts in the off-chip memory. Adopted by the FT-1500 processor, the VOPB along with optimized address mapping techniques greatly enhances the bandwidth, latency and energy efficiency of off-chip memory, especially for stream applications. Experimental results show that the VOPB improves off-chip memory bandwidth by 16.87% for Stream OpenMP and 6% for NPB-MPI on average.
Keywords: DRAM chips; message passing; multi-threading; multiprocessing systems; paged storage; DRAM bank; FT-1500 processor; NPB-MPI; Stream OpenMP; VOPB; active pages; address mapping technique optimization; bandwidth enhancement; energy efficiency enhancement; latency enhancement; multicore-multithread processor; off-chip DDRx SDRAM; off-chip memory; off-chip memory bandwidth improvement; on-chip virtual open page buffer; page conflict reduction; physical page; stream applications; virtual active pages; Arrays; Bandwidth; Prefetching; Random access memory; System-on-chip; memory bandwidth; multi-thread; virtual open page (ID#: 15-5314)
URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7056755&isnumber=7056577

Fengwei Zhang; Jiang Wang; Kun Sun; Stavrou, A., "HyperCheck: A Hardware-Assisted Integrity Monitor," Dependable and Secure Computing, IEEE Transactions on, vol. 11, no. 4, pp. 332, 344, July-Aug. 2014. doi: 10.1109/TDSC.2013.53
Abstract: The advent of cloud computing and inexpensive multi-core desktop architectures has led to the widespread adoption of virtualization technologies. Furthermore, security researchers embraced virtual machine monitors (VMMs) as a new mechanism to guarantee deep isolation of untrusted software components, which, coupled with their popularity, promoted VMMs as a prime target for exploitation. In this paper, we present HyperCheck, a hardware-assisted tampering detection framework designed to protect the integrity of hypervisors and operating systems. Our approach leverages System Management Mode (SMM), a CPU mode in x86 architecture, to transparently and securely acquire and transmit the full state of a protected machine to a remote server. We have implemented two prototypes based on our framework design: HyperCheck-I and HyperCheck-II, that vary in their security assumptions and OS code dependence. In our experiments, we are able to identify rootkits that target the integrity of both hypervisors and operating systems. We show that HyperCheck can defend against attacks that attempt to evade our system. In terms of performance, we measured that HyperCheck can communicate the entire static code of Xen hypervisor and CPU register states in less than 90 million CPU cycles, or 90 ms on a 1 GHz CPU.
Keywords: security of data; virtual machines; virtualisation; CPU register; HyperCheck-I; HyperCheck-II; OS code dependence; SMM; VMM; Xen hypervisor; cloud computing; hardware-assisted integrity monitor; hardware-assisted tampering detection framework; multicore desktop architectures; operating systems; security assumptions; system management mode; untrusted software components; virtual machine monitors; Biomedical monitoring; Hardware; Kernel; Monitoring; Registers; Security; Virtual machine monitors; Coreboot; Hypervisor; kernel; system management mode (ID#: 15-5315)
URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6682894&isnumber=6851971
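
The cycle-to-time conversion quoted in the abstract above (90 million cycles corresponding to 90 ms on a 1 GHz CPU) can be checked with a short Python sketch. The helper name `cycles_to_ms` is an assumption introduced for this example and does not come from the paper.

```python
def cycles_to_ms(cycles: int, clock_hz: int) -> float:
    """Convert a CPU cycle count to milliseconds at a given clock frequency."""
    # time [s] = cycles / clock_hz; multiply by 1000 first to get milliseconds.
    return cycles * 1000 / clock_hz


# 90 million cycles at 1 GHz, the figures stated in the abstract.
elapsed_ms = cycles_to_ms(90_000_000, 1_000_000_000)
```

At a faster clock the same cycle budget takes proportionally less wall-clock time, for example half as long at 2 GHz.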

Chi Liu; Ping Song; Yi Liu; Qinfen Hao, "Efficient Work-Stealing with Blocking Deques," High Performance Computing and Communications, 2014 IEEE 6th Intl Symp on Cyberspace Safety and Security, 2014 IEEE 11th Intl Conf on Embedded Software and Syst (HPCC,CSS,ICESS), 2014 IEEE Intl Conf on, pp. 149, 152, 20-22 Aug. 2014. doi: 10.1109/HPCC.2014.28
Abstract: Work stealing is a popular and effective approach to implement load balancing in modern multi-/many-core systems, where each parallel thread has its local deque to maintain its own work-set of tasks and performs load balancing by stealing tasks from other deques. Unfortunately, the existing concurrent deques have two limitations. Firstly, these algorithms require memory fences in the owner's critical path operations to ensure correctness, which is expensive in modern weak-memory architectures. Secondly, the concurrent deques are difficult to extend to support various flexible forms of task distribution strategies, which can be more effective in optimizing computation in some special applications, such as the steal-half strategy in solving large, irregular graph problems. This paper proposes a blocking work-stealing deque. We optimize work stealing task deques through effective ways of accessing the deques to decrease the synchronization overhead. These ways can reduce the frequency of races when different threads need to operate on the same deque, especially when using massive numbers of threads. We present an implementation of the algorithm as a C++ library, and the experimental results show that it performs well compared to Cilk Plus on a series of benchmarks. Since our approach relies on blocking deques, it is easy to extend to support flexible task creation and distribution strategies and also reduces the impact of memory fences on performance.
Keywords: C++ language; multi-threading; resource allocation; software libraries; C++ library; Cilk plus; blocking work-stealing deque; load balancing; many-core systems; memory fences impact reduction; multicore systems; parallel thread; synchronization overhead; Algorithm design and analysis; Benchmark testing; Containers; Instruction sets; Processor scheduling; Synchronization; deque; load balancing; scheduling strategies; work stealing (ID#: 15-5316)
URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7056731&isnumber=7056577
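
To make the deque roles described in the abstract above concrete, here is a minimal Python sketch of a lock-based (blocking) work-stealing deque with a steal-half operation. It is an illustration under assumptions, not the paper's C++ library: the class name `BlockingWorkDeque`, its method names, and the use of a single mutex per deque are all choices made for this example.

```python
import threading
from collections import deque


class BlockingWorkDeque:
    """Hypothetical lock-protected task deque for one worker thread."""

    def __init__(self):
        self._tasks = deque()
        self._lock = threading.Lock()

    def push(self, task):
        """Owner adds a task at the bottom (right end)."""
        with self._lock:
            self._tasks.append(task)

    def pop(self):
        """Owner takes the most recently pushed task, or None if empty."""
        with self._lock:
            return self._tasks.pop() if self._tasks else None

    def steal(self):
        """Thief takes the oldest task from the top (left end), or None if empty."""
        with self._lock:
            return self._tasks.popleft() if self._tasks else None

    def steal_half(self):
        """Thief takes the older half of the tasks (rounded up) in one locked step."""
        with self._lock:
            count = (len(self._tasks) + 1) // 2
            return [self._tasks.popleft() for _ in range(count)]
```

Because every operation holds the lock, adding a bulk operation such as `steal_half` is straightforward, which is the kind of extensibility the abstract attributes to blocking deques. The trade-off in this sketch is that the owner also pays for the lock on every `push` and `pop`.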

Castellanos, A.; Moreno, A.; Sorribes, J.; Margalef, T., "Predicting Performance of Hybrid Master/Worker Applications Using Model-Based Regression Trees," High Performance Computing and Communications, 2014 IEEE 6th Intl Symp on Cyberspace Safety and Security, 2014 IEEE 11th Intl Conf on Embedded Software and Syst (HPCC,CSS,ICESS), 2014 IEEE Intl Conf on, pp. 355, 362, 20-22 Aug. 2014. doi: 10.1109/HPCC.2014.61
Abstract: Nowadays, there are several features related to node architecture, network topology and programming model that significantly affect the performance of applications. Therefore, the task of adjusting the values of parameters of hybrid parallel applications to achieve the best performance requires a high degree of expertise and a huge effort. Determining a performance model that considers all the system and application features is a very complex task that in most cases produces poor results. In order to simplify this goal and improve the results, we introduce a model-based regression tree technique to improve the accuracy of performance prediction for parallel Master/Worker applications on homogeneous multicore systems. The technique has been used to model the iteration time of the general expression for performance prediction. This approach significantly reduces the effort in getting an accurate prediction model, although it requires a relatively large training data set. The proposed model determines the configuration of the appropriate number of workers and threads of the hybrid application to achieve the best possible performance.
Keywords: iterative methods; multiprocessing systems; parallel processing; performance evaluation; regression analysis; trees (mathematics); homogeneous multicore systems; hybrid master-worker applications; hybrid parallel applications; iteration time; large training data set; model-based regression tree technique; network topology; node architecture; performance prediction; programming model; Computational modeling; Message systems; Multicore processing; Predictive models; Regression tree analysis; Training; Training data; Hybrid applications; Master/Worker; Multicore; Performance model; Regression tree (ID#: 15-5317)
URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7056765&isnumber=7056577

Note:

Articles listed on these pages have been found on publicly available internet pages and are cited with links to those pages. Some of the information included herein has been reprinted with permission from the authors or data repositories. Direct any requests via Email to news@scienceofsecurity.net for removal of the links or modifications to specific citations. Please include the ID# of the specific citation in your correspondence.
