Biblio

Filters: Keyword is Heterogeneous computing
2021-04-27
Agirre, I., Onaindia, P., Poggi, T., Yarza, I., Cazorla, F. J., Kosmidis, L., Grüttner, K., Abuteir, M., Loewe, J., Orbegozo, J. M. et al.  2020.  UP2DATE: Safe and secure over-the-air software updates on high-performance mixed-criticality systems. 2020 23rd Euromicro Conference on Digital System Design (DSD). :344–351.
Following the trend in consumer electronics, safety-critical industries are starting to adopt Over-The-Air Software Updates (OTASU) on their embedded systems. The motivation behind this trend is twofold. On the one hand, OTASU offer several benefits to product makers and users by improving or adding new functionality and services to the product without a complete redesign. On the other hand, the increasing connectivity trend makes OTASU a crucial cyber-security demand to download the latest security patches. However, the application of OTASU in the safety-critical domain is not free of challenges, especially when considering the dramatic increase of software complexity and the resulting high computing performance demands. Addressing these challenges is the mission of UP2DATE, a recently launched project funded within the European H2020 programme and focused on new software update architectures for heterogeneous high-performance mixed-criticality systems. This paper gives an overview of UP2DATE and its foundations; the project seeks to improve existing OTASU solutions by considering safety, security and availability from the ground up in an architecture that builds around composability and modularity.
2019-11-25
Liang, Tyng-Yeu, Yeh, Li-Wei, Wu, Chi-Hong.  2018.  A Visual MapReduce Program Development Environment for Heterogeneous Computing on Clouds. Proceedings of the 2018 International Conference on Computing and Data Engineering. :83–87.
This paper proposes a visual MapReduce program development environment called VMR for heterogeneous computing on clouds. The environment has three main advantages. First, it lets users drag and drop graphical blocks instead of typing text to edit programs, which saves time and effort in MapReduce programming, especially when they analyze data on clouds through mobile devices. Second, it automatically translates the blocks of a user's MapReduce program into source code in three versions (Java, C, and CUDA) and selects one of them for execution according to the processor architecture of the allocated resources. Consequently, users can transparently and effectively exploit heterogeneous cloud resources to run their MapReduce programs without having to write a separate program for each processor architecture. Third, it enables clouds to outsource the computation tasks of MapReduce programs to mobile devices in order to increase job throughput or program performance.
2018-08-23
Yu, Chenhan D., Levitt, James, Reiz, Severin, Biros, George.  2017.  Geometry-oblivious FMM for Compressing Dense SPD Matrices. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. :53:1–53:14.
We present GOFMM (geometry-oblivious FMM), a novel method that creates a hierarchical low-rank approximation, or "compression," of an arbitrary dense symmetric positive definite (SPD) matrix. For many applications, GOFMM enables an approximate matrix-vector multiplication in N log N or even N time, where N is the matrix size. Compression requires N log N storage and work. In general, our scheme belongs to the family of hierarchical matrix approximation methods. In particular, it generalizes the fast multipole method (FMM) to a purely algebraic setting by only requiring the ability to sample matrix entries. Neither geometric information (i.e., point coordinates) nor knowledge of how the matrix entries have been generated is required, thus the term "geometry-oblivious." Also, we introduce a shared-memory parallel scheme for hierarchical matrix computations that reduces synchronization barriers. We present results on the Intel Knights Landing and Haswell architectures, and on the NVIDIA Pascal architecture for a variety of matrices.
2018-06-20
Searles, R., Xu, L., Killian, W., Vanderbruggen, T., Forren, T., Howe, J., Pearson, Z., Shannon, C., Simmons, J., Cavazos, J.  2017.  Parallelization of Machine Learning Applied to Call Graphs of Binaries for Malware Detection. 2017 25th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP). :69–77.

Malicious applications have become increasingly numerous. This demands adaptive, learning-based techniques for constructing malware detection engines, instead of the traditional manual strategies. Prior work in learning-based malware detection engines primarily focuses on dynamic trace analysis and byte-level n-grams. Our approach in this paper differs in that we use compiler intermediate representations, i.e., the call graph representation of binaries. Using graph-based program representations for learning exposes the structure of the program, which can be used to learn more advanced patterns. We use the Shortest Path Graph Kernel (SPGK) to identify similarities between call graphs extracted from binaries. The output similarity matrix is fed into a Support Vector Machine (SVM) algorithm to construct highly accurate models that predict whether a binary is malicious. However, SPGK is computationally expensive due to the size of the input graphs. Therefore, we evaluate different parallelization methods for CPUs and GPUs to speed up this kernel, allowing us to continuously construct up-to-date models in a timely manner. Our hybrid implementation, which leverages both CPU and GPU, yields the best performance, achieving up to a 14.2x improvement over our already optimized OpenMP version. We compared our generated graph-based models to the previous state-of-the-art feature vector 2-gram and 3-gram models on a dataset consisting of over 22,000 binaries. We show that our classification accuracy using graphs is over 19% higher than either n-gram model and gives a false positive rate (FPR) of less than 0.1%. We are also able to consider large call graphs and dataset sizes because of the reduced execution time of our parallelized SPGK implementation.
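
As a rough illustration of the kernel step described in this abstract, the sketch below computes a simplified, sequential shortest-path graph kernel between two small directed graphs. It is not the authors' implementation: the function names (`shortest_path_lengths`, `path_histogram`, `spgk`), the use of function names as node labels, and the exact-match comparison of labels and path lengths are all assumptions made for illustration. The parallel CPU/GPU variants the paper evaluates are not shown.

```python
from collections import Counter

INF = float("inf")


def shortest_path_lengths(n, edges):
    """All-pairs shortest path lengths (Floyd-Warshall) for a directed, unweighted graph."""
    dist = [[0 if i == j else INF for j in range(n)] for i in range(n)]
    for u, v in edges:
        if u != v:  # ignore self-loops (e.g. directly recursive calls)
            dist[u][v] = 1
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist


def path_histogram(labels, edges):
    """Count reachable node pairs by (source label, target label, path length)."""
    n = len(labels)
    dist = shortest_path_lengths(n, edges)
    hist = Counter()
    for i in range(n):
        for j in range(n):
            if i != j and dist[i][j] < INF:
                hist[(labels[i], labels[j], dist[i][j])] += 1
    return hist


def spgk(g1, g2):
    """Simplified shortest-path kernel: number of matching shortest-path pairs.

    Assumption: two paths match only if both endpoint labels and the
    path length are identical (a delta kernel on each component).
    """
    h1, h2 = path_histogram(*g1), path_histogram(*g2)
    return sum(count * h2[key] for key, count in h1.items())


# Hypothetical call graphs: (node labels, directed call edges).
g_a = (["main", "parse", "send"], [(0, 1), (1, 2)])
g_b = (["main", "parse", "send"], [(0, 1), (0, 2)])

print(spgk(g_a, g_a))  # 3
print(spgk(g_a, g_b))  # 1
```

Evaluating `spgk` for every pair of graphs gives a similarity matrix of the kind the abstract describes. One plausible way to consume it, again an assumption rather than the paper's stated setup, is an SVM configured for a precomputed kernel, such as scikit-learn's `SVC(kernel="precomputed")`. The cubic Floyd-Warshall step and the all-pairs comparison are where the cost concentrates, which is consistent with the abstract's motivation for parallelizing the kernel.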

2017-05-18
Park, Jungho, Jung, Wookeun, Jo, Gangwon, Lee, Ilkoo, Lee, Jaejin.  2016.  PIPSEA: A Practical IPsec Gateway on Embedded APUs. Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security. :1255–1267.

An Accelerated Processing Unit (APU) is a heterogeneous multicore processor that contains general-purpose CPU cores and a GPU in a single chip. It also supports Heterogeneous System Architecture (HSA), which provides coherent physically-shared memory between the CPU and the GPU. In this paper, we present the design and implementation of a high-performance IPsec gateway using a low-cost commodity embedded APU. The HSA supported by the APUs eliminates the data copy overhead between the CPU and the GPU, which is unavoidable in previous discrete-GPU approaches. The gateway is implemented in OpenCL to exploit the GPU and uses zero-copy packet I/O APIs in DPDK. The IPsec gateway handles real-world network traffic where each packet has a different workload. The proposed packet scheduling algorithm significantly improves GPU utilization for such traffic. It works not only for APUs but also for discrete GPUs. With three CPU cores and one GPU in the APU, the IPsec gateway achieves a throughput of 10.36 Gbps with an average latency of 2.79 ms to perform AES-CBC+HMAC-SHA1 for incoming packets of 1024 bytes.

2017-05-16
Najafi, Ali, Rudell, Jacques C., Sathe, Visvesh.  2016.  Regenerative Breaking: Recovering Stored Energy from Inactive Voltage Domains for Energy-efficient Systems-on-Chip. Proceedings of the 2016 International Symposium on Low Power Electronics and Design. :94–99.

Modern Systems-on-Chip (SoCs) frequently power off individual voltage domains to save leakage power across a variety of applications, from large-scale heterogeneous computing to ultra-low power systems in IoT applications. However, the considerable energy stored within the capacitance of the powered-off domain is lost through leakage. In this paper, we present an approach to leverage existing voltage regulators to recover this energy from the disabled voltage domain back into the supply using a low-overhead all-digital runtime control system. Simulation experiments conducted in an industrial 65 nm CMOS process indicate that over 90% of the stored energy can be recovered across a range of operating system voltages from 0.4 V to 1 V.