Biblio

Filters: Keyword is GPU
2022-05-19
Zhang, Feng, Pan, Zaifeng, Zhou, Yanliang, Zhai, Jidong, Shen, Xipeng, Mutlu, Onur, Du, Xiaoyong.  2021.  G-TADOC: Enabling Efficient GPU-Based Text Analytics without Decompression. 2021 IEEE 37th International Conference on Data Engineering (ICDE). :1679–1690.
Text analytics directly on compression (TADOC) has proven to be a promising technology for big data analytics. GPUs are extremely popular accelerators for data analytics systems. Unfortunately, no work so far shows how to utilize GPUs to accelerate TADOC. We describe G-TADOC, the first framework that provides GPU-based text analytics directly on compression, effectively enabling efficient text analytics on GPUs without decompressing the input data. G-TADOC solves three major challenges. First, TADOC involves a large number of dependencies, which makes it difficult to exploit massive parallelism on a GPU. We develop a novel fine-grained thread-level workload scheduling strategy for GPU threads, which partitions heavily-dependent loads adaptively in a fine-grained manner. Second, in developing G-TADOC, having thousands of GPU threads write to the same result buffer leads to inconsistency, while directly using locks and atomic operations leads to large synchronization overheads. We develop a memory pool with thread-safe data structures on GPUs to handle such difficulties. Third, maintaining the sequence information among words is essential for lossless compression. We design a sequence-support strategy, which maintains high GPU parallelism while ensuring sequence information. Our experimental evaluations show that G-TADOC provides a 31.1× average speedup over state-of-the-art TADOC.
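To illustrate the kind of lock-free result buffering the second challenge calls for, the following is a minimal CUDA sketch of a bump-allocated memory pool: each thread reserves a private slice of a preallocated buffer with a single atomicAdd, so it can write results without locks. This is a generic illustration of the technique, not G-TADOC's actual data structure; all names and sizes are placeholders.

```cuda
// Minimal sketch of a lock-free GPU memory pool: each thread reserves a
// private slice of a preallocated buffer with one atomicAdd, avoiding locks.
// Generic illustration only, not G-TADOC's design.
#include <cstdio>
#include <cuda_runtime.h>

struct Pool {
    int *data;      // preallocated result buffer
    int *cursor;    // device-side bump pointer (next free slot)
    int  capacity;
};

__device__ int *pool_alloc(Pool p, int n) {
    int off = atomicAdd(p.cursor, n);                 // reserve n slots atomically
    return (off + n <= p.capacity) ? p.data + off : nullptr;
}

__global__ void produce(Pool p) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int count = 1 + tid % 4;                          // each thread emits 1..4 results
    int *out = pool_alloc(p, count);
    if (out)
        for (int i = 0; i < count; ++i) out[i] = tid; // private region: no races
}

int main() {
    Pool p;
    p.capacity = 1 << 20;
    cudaMalloc(&p.data, p.capacity * sizeof(int));
    cudaMalloc(&p.cursor, sizeof(int));
    cudaMemset(p.cursor, 0, sizeof(int));
    produce<<<256, 256>>>(p);
    cudaDeviceSynchronize();
    int used;
    cudaMemcpy(&used, p.cursor, sizeof(int), cudaMemcpyDeviceToHost);
    printf("slots used: %d\n", used);
    return 0;
}
```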
2021-01-11
Awad, M. A., Ashkiani, S., Porumbescu, S. D., Owens, J. D..  2020.  Dynamic Graphs on the GPU. 2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS). :739–748.
We present a fast dynamic graph data structure for the GPU. Our dynamic graph structure uses one hash table per vertex to store adjacency lists and achieves insertion rates 3.4–14.8x faster than the state of the art across a diverse set of large datasets, as well as deletion speedups of up to 7.8x. The data structure supports queries and dynamic updates through both edge and vertex insertion and deletion. In addition, we define a comprehensive evaluation strategy based on operations, workloads, and applications that we believe better characterizes and evaluates dynamic graph data structures.
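The per-vertex hash tables described in the abstract can be sketched, in greatly simplified form, as fixed-size open-addressing tables updated concurrently with atomicCAS. The paper's actual structure is more elaborate; the sketch below only shows how concurrent edge insertion avoids locks. The table size and hash function are arbitrary placeholders.

```cuda
// Simplified sketch of concurrent edge insertion into per-vertex hash tables
// (open addressing with atomicCAS). Illustration of the core idea only.
#include <cstdio>
#include <cuda_runtime.h>

#define SLOTS_PER_VERTEX 64          // fixed-size table per vertex (assumption)
#define EMPTY            (-1)

__device__ bool insert_edge(int *tables, int src, int dst) {
    int *table = tables + src * SLOTS_PER_VERTEX;
    unsigned h = ((unsigned)dst * 2654435761u) % SLOTS_PER_VERTEX;
    for (int probe = 0; probe < SLOTS_PER_VERTEX; ++probe) {
        int slot = (h + probe) % SLOTS_PER_VERTEX;
        int prev = atomicCAS(&table[slot], EMPTY, dst);   // claim an empty slot
        if (prev == EMPTY || prev == dst) return true;    // inserted, or duplicate edge
    }
    return false;                                         // table full
}

__global__ void insert_edges(int *tables, const int *srcs, const int *dsts, int m) {
    int e = blockIdx.x * blockDim.x + threadIdx.x;
    if (e < m) insert_edge(tables, srcs[e], dsts[e]);
}

int main() {
    const int n = 4, m = 6;
    int h_src[m] = {0, 0, 1, 2, 3, 3}, h_dst[m] = {1, 2, 3, 0, 1, 2};
    int *tables, *srcs, *dsts;
    cudaMalloc(&tables, n * SLOTS_PER_VERTEX * sizeof(int));
    cudaMemset(tables, 0xFF, n * SLOTS_PER_VERTEX * sizeof(int));  // all slots = EMPTY
    cudaMalloc(&srcs, m * sizeof(int));
    cudaMalloc(&dsts, m * sizeof(int));
    cudaMemcpy(srcs, h_src, m * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dsts, h_dst, m * sizeof(int), cudaMemcpyHostToDevice);
    insert_edges<<<1, 128>>>(tables, srcs, dsts, m);
    cudaDeviceSynchronize();
    printf("edges inserted\n");
    return 0;
}
```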
2020-09-04
Gillela, Maruthi, Prenosil, Vaclav, Ginjala, Venkat Reddy.  2019.  Parallelization of Brute-Force Attack on MD5 Hash Algorithm on FPGA. 2019 32nd International Conference on VLSI Design and 2019 18th International Conference on Embedded Systems (VLSID). :88–93.
FPGA implementation of the MD5 hash algorithm is faster than its software counterpart, but a pre-image brute-force attack on an MD5 hash still needs 2^128 iterations theoretically. This work attempts to improve the speed of the brute-force attack on the MD5 algorithm using a hardware implementation. A full 64-stage pipelining is done for MD5 hash generation, and three architectures are presented for guess-password generation. A 32/34/26-instance parallelization of the MD5 hash generator and password generator pair is done to search for a password that was hashed using the MD5 algorithm. A total performance of about 6G trials/second has been achieved using a single Virtex-7 FPGA device.
2020-06-26
Raviraja Holla, M., Suma, D..  2019.  Memory Efficient High-Performance Rotational Image Encryption. 2019 International Conference on Communication and Electronics Systems (ICCES). :60–64.

Image encryption is an essential part of visual cryptography. Existing traditional sequential encryption techniques are infeasible for real-time applications. High-performance reformulations of such methods have grown increasingly common over the last decade and have demonstrated better performance than their sequential counterparts. A rotational encryption scheme encrypts images in such a way that decryption is possible with the rotated encrypted images. A parallel rotational encryption technique makes use of a high-performance device but does not fully leverage the optimizations such a device offers. We propose a rotational image encryption technique that makes use of the memory coalescing provided by the Compute Unified Device Architecture (CUDA). The proposed scheme achieves improved global memory utilization and increased efficiency.
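Memory coalescing of the kind the abstract refers to is commonly obtained by staging image tiles in shared memory so that both the global reads and the global writes touch consecutive addresses. The sketch below applies that pattern to a 90-degree image rotation in CUDA; it illustrates the general coalescing technique only and is not the authors' encryption kernel. The tile size and float pixel type are assumptions.

```cuda
// Coalesced 90-degree clockwise image rotation via shared-memory tiling.
// Illustrates the memory-coalescing technique only; not the authors' kernel.
#include <cstdio>
#include <cuda_runtime.h>

#define TILE 32

// in:  H rows x W cols, row-major.  out: W rows x H cols, row-major.
// Rotation: out[c][H-1-r] = in[r][c]
__global__ void rotate90_cw(const float *in, float *out, int W, int H) {
    __shared__ float tile[TILE][TILE + 1];               // +1 pad avoids bank conflicts
    int c = blockIdx.x * TILE + threadIdx.x;             // input column
    int r = blockIdx.y * TILE + threadIdx.y;             // input row
    if (c < W && r < H)
        tile[threadIdx.y][threadIdx.x] = in[r * W + c];  // coalesced global read
    __syncthreads();

    int orow = blockIdx.x * TILE + threadIdx.y;          // output row (= input column)
    int ir   = blockIdx.y * TILE + threadIdx.x;          // input row handled by this thread
    if (orow < W && ir < H)
        out[orow * H + (H - 1 - ir)] = tile[threadIdx.x][threadIdx.y];  // contiguous writes
}

int main() {
    const int W = 64, H = 48;
    float *in, *out;
    cudaMallocManaged(&in, W * H * sizeof(float));
    cudaMallocManaged(&out, W * H * sizeof(float));
    for (int i = 0; i < W * H; ++i) in[i] = (float)i;
    dim3 block(TILE, TILE), grid((W + TILE - 1) / TILE, (H + TILE - 1) / TILE);
    rotate90_cw<<<grid, block>>>(in, out, W, H);
    cudaDeviceSynchronize();
    // Element in[0][0] should land at out[0][H-1].
    printf("out[0][H-1] = %g (expected %g)\n", out[H - 1], in[0]);
    return 0;
}
```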

2020-03-30
Kim, Sejin, Oh, Jisun, Kim, Yoonhee.  2019.  Data Provenance for Experiment Management of Scientific Applications on GPU. 2019 20th Asia-Pacific Network Operations and Management Symposium (APNOMS). :1–4.
Graphics Processing Units (GPUs) are increasingly utilized by many kinds of applications to exploit highly parallel computation. Because memory virtualization in GPU nodes is not provided efficiently enough to deal with the diverse memory usage patterns of these applications, the success of their execution depends on exclusive and limited use of physical memory in GPU environments. It is therefore important to predict how an application's GPU memory usage pattern changes during runtime execution. Data provenance, extracted from application characteristics, GPU runtime environments, inputs, and execution patterns obtained from runtime monitoring, is defined to support application management: setting runtime configurations, predicting experimental results, and sharing resources with co-located applications. In this paper, we define the data provenance of an application on GPUs and manage the data by profiling the execution of CUDA scientific applications. Data provenance management helps to predict the execution patterns of other similar experiments and to plan efficient resource configurations.
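One signal that such runtime monitoring can record is the device's memory occupancy over the course of a run. The sketch below samples it with the CUDA runtime call cudaMemGetInfo around a placeholder kernel; the provenance schema and prediction models described in the paper are not reproduced here.

```cuda
// Minimal sketch of runtime monitoring of GPU memory usage, one signal that
// provenance records could capture. The paper's provenance schema is not shown.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void dummy_workload(float *buf, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] = i * 0.5f;
}

static void log_memory(const char *tag) {
    size_t free_b = 0, total_b = 0;
    cudaMemGetInfo(&free_b, &total_b);            // instantaneous device memory state
    printf("%-12s used = %zu MiB of %zu MiB\n", tag,
           (total_b - free_b) >> 20, total_b >> 20);
}

int main() {
    const int n = 1 << 24;
    float *buf;
    log_memory("before");
    cudaMalloc(&buf, n * sizeof(float));          // allocation changes the usage pattern
    log_memory("after alloc");
    dummy_workload<<<(n + 255) / 256, 256>>>(buf, n);
    cudaDeviceSynchronize();
    log_memory("after kernel");
    cudaFree(buf);
    log_memory("after free");
    return 0;
}
```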
2020-01-21
Luo, Chao, Fei, Yunsi, Kaeli, David.  2019.  Side-Channel Timing Attack of RSA on a GPU. ACM Transactions on Architecture and Code Optimization (TACO). 16:32:1–32:18.
To increase computation throughput, general-purpose Graphics Processing Units (GPUs) have been leveraged to accelerate computationally intensive workloads. GPUs have been used as cryptographic engines, improving encryption/decryption throughput and leveraging the GPU's Single Instruction Multiple Thread (SIMT) model. RSA is a widely used public-key cipher and has been ported onto GPUs for signing and decrypting large files. Although performance has been significantly improved, the security of RSA on GPUs is vulnerable to side-channel timing attacks, an exposure overlooked in previous studies. GPUs tend to be naturally resilient to side-channel attacks because they execute thousands of concurrent threads, performing many RSA operations on different data in parallel; this degree of parallel execution introduces a significant amount of noise into the timing channel. In this work, we build a timing model to capture the parallel characteristics of an RSA public-key cipher implemented on a GPU. We consider optimizations that include using Montgomery multiplication and sliding-window exponentiation to implement cryptographic operations. Our timing model considers the challenges of parallel execution, complications that do not occur in single-threaded computing platforms. Based on our timing model, we launch successful timing attacks on RSA running on a GPU, extracting the private key of RSA. We also present an effective error detection and correction mechanism. Our results demonstrate that GPU acceleration of RSA is vulnerable to side-channel timing attacks. We propose several countermeasures to defend against this class of attacks.
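The timing variation such attacks typically exploit originates in the data-dependent "extra reduction" step of Montgomery multiplication. The following host-side sketch of Montgomery multiplication for a modulus below 2^31 marks where that conditional subtraction occurs; it is an illustration of the primitive, not the paper's GPU implementation or timing model, and the modulus and operands are arbitrary examples.

```cuda
// Host-side sketch of Montgomery multiplication for an odd modulus below 2^31,
// highlighting the data-dependent "extra reduction" whose timing variation
// RSA timing attacks typically exploit. Illustration only.
#include <cstdint>
#include <cstdio>

// Returns -n^{-1} mod 2^32 (n must be odd), via Newton iteration.
static uint32_t neg_inv32(uint32_t n) {
    uint32_t inv = n;                                  // inverse of n modulo 8
    for (int i = 0; i < 4; ++i) inv *= 2u - n * inv;   // precision doubles each step
    return ~inv + 1u;
}

// Returns a*b*R^-1 mod n with R = 2^32, for a, b < n < 2^31.
static uint32_t mont_mul(uint32_t a, uint32_t b, uint32_t n, uint32_t n_prime) {
    uint64_t t = (uint64_t)a * b;
    uint32_t m = (uint32_t)t * n_prime;                // m = (t mod R) * n' mod R
    uint64_t u = (t + (uint64_t)m * n) >> 32;          // (t + m*n) is divisible by R
    if (u >= n) u -= n;                                // extra reduction: data dependent
    return (uint32_t)u;
}

int main() {
    uint32_t n  = 2147483629u;                         // odd example modulus < 2^31
    uint32_t np = neg_inv32(n);
    uint64_t r  = (1ull << 32) % n;                    // R mod n
    uint32_t R2 = (uint32_t)((r * r) % n);             // R^2 mod n (enter Montgomery form)
    uint32_t a = 123456789u, b = 987654321u;
    uint32_t aR  = mont_mul(a, R2, n, np);             // a*R mod n
    uint32_t bR  = mont_mul(b, R2, n, np);             // b*R mod n
    uint32_t abR = mont_mul(aR, bR, n, np);            // a*b*R mod n
    uint32_t ab  = mont_mul(abR, 1u, n, np);           // leave Montgomery form
    printf("a*b mod n = %u (check: %u)\n", ab, (uint32_t)(((uint64_t)a * b) % n));
    return 0;
}
```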
2019-12-30
Razaque, Abdul, Jinrui, Wang, Zancheng, Wang, Hani, Qassim Bani, Khaskheli, Murad Ali, Bhutto, Waseem Ahmed.  2018.  Integration of CPU and GPU to Accelerate RSA Modular Exponentiation Operation. 2018 IEEE Long Island Systems, Applications and Technology Conference (LISAT). :1–6.

Nowadays, data security is becoming more and more important: people store much personal information on their phones, and that stored information requires security and privacy. Encryption algorithms have become the main means of maintaining data security, so algorithmic complexity and encryption efficiency have become the main measures of whether an encryption algorithm is suitable. With the development of hardware, we now have many tools with which to improve such algorithms. Because modular exponentiation in the RSA algorithm can be divided into several parts mathematically, this paper introduces an approach that divides the encryption process and maps part of the model onto the graphics processing unit (GPU). By using the GPU's capacity for parallel computing, the core of RSA can be accelerated jointly by the central processing unit (CPU) and the GPU. Compute Unified Device Architecture (CUDA) is a platform that combines the CPU and GPU to enable GPU parallel programming, and it is the tool we use to perform our experiments on accelerating the RSA algorithm. This paper also builds a mathematical model to help explain the mechanism of the RSA encryption algorithm.
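As a rough illustration of how the GPU side of such a CPU+GPU split can look, the sketch below launches one CUDA thread per message, each performing an independent square-and-multiply modular exponentiation with a textbook toy key (n = 3233, e = 17). Real RSA requires multi-precision arithmetic and the paper's particular mathematical decomposition, neither of which is shown here.

```cuda
// Minimal sketch of offloading RSA's core operation to the GPU: one thread per
// message, each running square-and-multiply modular exponentiation with a
// 32-bit toy modulus. Illustrates the CPU+GPU division of labor only.
#include <cstdint>
#include <cstdio>
#include <cuda_runtime.h>

__device__ uint32_t modpow(uint32_t base, uint32_t exp, uint32_t mod) {
    uint64_t result = 1, b = base % mod;
    while (exp) {
        if (exp & 1) result = (result * b) % mod;   // multiply step
        b = (b * b) % mod;                          // square step
        exp >>= 1;
    }
    return (uint32_t)result;
}

__global__ void rsa_core(const uint32_t *msg, uint32_t *out, int n,
                         uint32_t e, uint32_t mod) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = modpow(msg[i], e, mod);     // one exponentiation per thread
}

int main() {
    const int n = 1 << 16;
    const uint32_t e = 17u, mod = 3233u;            // textbook toy key (p = 61, q = 53)
    uint32_t *msg, *out;
    cudaMallocManaged(&msg, n * sizeof(uint32_t));
    cudaMallocManaged(&out, n * sizeof(uint32_t));
    for (int i = 0; i < n; ++i) msg[i] = i % mod;   // CPU prepares the batch
    rsa_core<<<(n + 255) / 256, 256>>>(msg, out, n, e, mod);
    cudaDeviceSynchronize();                        // CPU gathers the results
    printf("ciphertext[42] = %u\n", out[42]);
    cudaFree(msg); cudaFree(out);
    return 0;
}
```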

2018-06-11
Peterson, Brad, Humphrey, Alan, Schmidt, John, Berzins, Martin.  2017.  Addressing Global Data Dependencies in Heterogeneous Asynchronous Runtime Systems on GPUs. Proceedings of the Third International Workshop on Extreme Scale Programming Models and Middleware. :1:1–1:8.
Large-scale parallel applications with complex global data dependencies beyond those of reductions pose significant scalability challenges in an asynchronous runtime system. Internodal challenges include identifying the all-to-all communication of data dependencies among the nodes. Intranodal challenges include gathering together these data dependencies into usable data objects while avoiding data duplication. This paper addresses these challenges within the context of a large-scale, industrial coal boiler simulation using the Uintah asynchronous many-task runtime system on GPU architectures. We show a significant reduction in the time spent analyzing data dependencies through refinements in our dependency search algorithm. Multiple task graphs are used to eliminate subsequent analysis when task graphs change in predictable and repeatable ways. A combined data store and task scheduler redesign reduces data dependency duplication, ensuring that problems fit within host and GPU memory. These modifications did not require any changes to application code or sweeping changes to the Uintah runtime system. We report results running on the DOE Titan system on 119K CPU cores and 7.5K GPUs simultaneously. Our solutions can be generalized to other task dependency problems with global dependencies among thousands of nodes that must be processed efficiently at large scale.
2018-04-04
Montella, Raffaele, Di Luccio, Diana, Marcellino, Livia, Galletti, Ardelio, Kosta, Sokol, Brizius, Alison, Foster, Ian.  2017.  Processing of Crowd-sourced Data from an Internet of Floating Things. Proceedings of the 12th Workshop on Workflows in Support of Large-Scale Science. :8:1–8:11.
Sensors incorporated into mobile devices provide unique opportunities to capture detailed environmental information that cannot be readily collected in other ways. We show here how data from networked navigational sensors on leisure vessels can be used to construct unique new datasets, using the example of underwater topography (bathymetry) to demonstrate the approach. Specifically, we describe an end-to-end workflow that involves the collection of large numbers of timestamped (position, depth) measurements from "internet of floating things" devices on leisure vessels; the communication of data to cloud resources, via a specialized protocol capable of dealing with delayed, intermittent, or even disconnected networks; the integration of measurement data into cloud storage; the efficient correction and interpolation of measurements on a cloud computing platform; and the creation of a continuously updated bathymetric database. Our prototype implementation of this workflow leverages the FACE-IT Galaxy workflow engine to integrate network communication and database components with a CUDA-enabled algorithm running in a virtualized cloud environment.
2018-03-26
Zahilah, R., Tahir, F., Zainal, A., Abdullah, A. H., Ismail, A. S..  2017.  Unified Approach for Operating System Comparisons with Windows OS Case Study. 2017 IEEE Conference on Application, Information and Network Security (AINS). :91–96.

The advancement of technology has changed how people work and what software and hardware they use. From the conventional personal computer to the GPU, hardware technology and capability have improved dramatically, and so have the operating systems that accompany them. Unfortunately, current industry practice is to compare operating systems from a single perspective: either benchmarking hardware-level performance or performing penetration testing to check the security features of an OS. This rigid method of benchmarking does not reflect the true performance of an OS, as the performance analysis is neither comprehensive nor conclusive. To illustrate this deficiency, the study performed hardware-level and operational-level benchmarking on Windows XP, Windows 7, and Windows 8, and the results indicate that there are instances where Windows XP excels over its newer counterparts. Overall, the research shows Windows 8 to be a superior OS in comparison to its predecessors running on the same hardware. Furthermore, the findings also show that automated benchmarking tools prove less effective at benchmarking systems that run Windows XP and older OSs, as those systems do not support DirectX 11 and other advanced features that the hardware supports. There lies the need for a unified benchmarking approach that compares other aspects of an OS, such as user-oriented tasks and security parameters, to provide a complete comparison. Therefore, this paper proposes a unified approach for Operating System (OS) comparison with the help of a Windows OS case study. This unified approach includes comparison of an OS from three aspects: hardware-level performance, operational-level performance, and security tests.

2017-12-28
Panetta, J., Filho, P. R. P. S., Laranjeira, L. A. F., Teixeira, C. A..  2017.  Scalability of CPU and GPU Solutions of the Prime Elliptic Curve Discrete Logarithm Problem. 2017 29th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD). :33–40.

Elliptic curve asymmetric cryptography has achieved increased popularity due to its capability of providing comparable levels of security to other existing cryptographic systems while requiring less computational work. Pollard Rho and Parallel Collision Search, the fastest known sequential and parallel algorithms for breaking this cryptographic system, have been successfully applied over time to break ever-increasing bit-length system instances using implementations heavily optimized for the available hardware. This work presents portable, general implementations of a Parallel Collision Search based solution for prime elliptic curve asymmetric cryptographic systems that use publicly available big integer libraries and make no assumptions about prime curve properties. It investigates which bit-length keys can be broken in reasonable time by a user that has access to state-of-the-art, public HPC equipment with CPUs and GPUs. The final implementation breaks a 79-bit system in about two hours using 80 GPUs and a 94-bit system in about 15 hours using 256 GPUs. Extensive experimentation investigates the scalability of CPU, GPU, and CPU+GPU runs. The discussed results indicate that speed-up is not a good metric for parallel scalability. This paper proposes and evaluates a new metric that is better suited for this task.

Rolinger, T. B., Simon, T. A., Krieger, C. D..  2017.  Performance challenges for heterogeneous distributed tensor decompositions. 2017 IEEE High Performance Extreme Computing Conference (HPEC). :1–7.

Tensor decompositions, which are factorizations of multi-dimensional arrays, are becoming increasingly important in large-scale data analytics. A popular tensor decomposition algorithm is Canonical Decomposition/Parallel Factorization using alternating least squares fitting (CP-ALS). Tensors that model real-world applications are often very large and sparse, driving the need for high-performance implementations of decomposition algorithms, such as CP-ALS, that can take advantage of many types of compute resources. In this work we present ReFacTo, a heterogeneous distributed tensor decomposition implementation based on DFacTo, an existing distributed-memory approach to CP-ALS. DFacTo reduces the critical routine of CP-ALS to a series of sparse matrix-vector multiplications (SpMVs). ReFacTo leverages GPUs within a cluster via MPI to perform these SpMVs and uses OpenMP threads to parallelize other routines. We evaluate the performance of ReFacTo when using NVIDIA's GPU-based cuSPARSE library and compare it to an alternative implementation that uses Intel's CPU-based Math Kernel Library (MKL) for the SpMV. Furthermore, we provide a discussion of the performance challenges of heterogeneous distributed tensor decompositions based on the results we observed. We find that on up to 32 nodes, the SpMV of ReFacTo when using MKL is up to 6.8× faster than ReFacTo when using cuSPARSE.
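Since DFacTo reduces CP-ALS to a series of SpMVs, the operation at the heart of ReFacTo's comparison can be illustrated with a minimal row-per-thread CSR SpMV kernel. The actual implementations call cuSPARSE or MKL; the kernel and the tiny 3x3 example below are only for illustration.

```cuda
// Minimal CSR sparse matrix-vector multiply (one thread per row), the kind of
// SpMV that dominates DFacTo-style CP-ALS. Illustration only; the paper's
// implementations use cuSPARSE or MKL.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void spmv_csr(int rows, const int *row_ptr, const int *col_idx,
                         const float *vals, const float *x, float *y) {
    int r = blockIdx.x * blockDim.x + threadIdx.x;
    if (r < rows) {
        float sum = 0.0f;
        for (int j = row_ptr[r]; j < row_ptr[r + 1]; ++j)
            sum += vals[j] * x[col_idx[j]];         // accumulate row r
        y[r] = sum;
    }
}

int main() {
    // 3x3 example:  [10 0 2; 0 3 0; 1 0 4]  multiplied by x = [1, 2, 3]
    int   h_ptr[] = {0, 2, 3, 5};
    int   h_col[] = {0, 2, 1, 0, 2};
    float h_val[] = {10, 2, 3, 1, 4};
    float h_x[]   = {1, 2, 3};
    int *ptr, *col; float *val, *x, *y;
    cudaMalloc(&ptr, sizeof(h_ptr)); cudaMalloc(&col, sizeof(h_col));
    cudaMalloc(&val, sizeof(h_val)); cudaMalloc(&x, sizeof(h_x));
    cudaMalloc(&y, 3 * sizeof(float));
    cudaMemcpy(ptr, h_ptr, sizeof(h_ptr), cudaMemcpyHostToDevice);
    cudaMemcpy(col, h_col, sizeof(h_col), cudaMemcpyHostToDevice);
    cudaMemcpy(val, h_val, sizeof(h_val), cudaMemcpyHostToDevice);
    cudaMemcpy(x, h_x, sizeof(h_x), cudaMemcpyHostToDevice);
    spmv_csr<<<1, 32>>>(3, ptr, col, val, x, y);
    float h_y[3];
    cudaMemcpy(h_y, y, sizeof(h_y), cudaMemcpyDeviceToHost);
    printf("y = [%g, %g, %g]\n", h_y[0], h_y[1], h_y[2]);  // expect [16, 6, 13]
    return 0;
}
```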

2017-12-04
Johnston, B., Lee, B., Angove, L., Rendell, A..  2017.  Embedded Accelerators for Scientific High-Performance Computing: An Energy Study of OpenCL Gaussian Elimination Workloads. 2017 46th International Conference on Parallel Processing Workshops (ICPPW). :59–68.

Energy efficient High-Performance Computing (HPC) is becoming increasingly important. Recent ventures into this space have introduced an unlikely candidate for achieving exascale scientific computing with a small energy footprint. ARM processors and embedded GPU accelerators originally developed for energy efficiency in mobile devices, where battery life is critical, are being repurposed and deployed in the next generation of supercomputers. Unfortunately, the performance of executing scientific workloads on many of these devices is largely unknown, yet the bulk of computation required in high-performance supercomputers is scientific. We present an analysis of one such scientific code, in the form of Gaussian Elimination, and evaluate both execution time and energy used on a range of embedded accelerator SoCs. These include three ARM CPUs and two mobile GPUs. Understanding how these low-power devices perform on scientific workloads will be critical in the selection of appropriate hardware for these supercomputers, for how can we estimate the performance of tens of thousands of these chips if the performance of one is largely unknown?
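The benchmarked workload, Gaussian elimination, is dominated by a per-pivot update of the trailing submatrix, which maps naturally onto one GPU thread per element. The sketch below shows that step in CUDA (the paper's kernels are OpenCL), without pivoting and with a small demo matrix; it illustrates the workload's structure, not the authors' implementation.

```cuda
// Minimal sketch of Gaussian elimination on a GPU: per pivot k, first compute
// the row multipliers, then update the trailing submatrix one thread per
// element. CUDA illustration of the workload; the paper uses OpenCL.
#include <cstdio>
#include <cuda_runtime.h>

// Step 1: multipliers l[row] = A[row][k] / A[k][k] for rows below the pivot.
__global__ void compute_multipliers(const float *A, float *l, int n, int k) {
    int row = blockIdx.x * blockDim.x + threadIdx.x + k + 1;
    if (row < n) l[row] = A[row * n + k] / A[k * n + k];
}

// Step 2: rank-1 update of the trailing submatrix, one thread per element.
__global__ void update_trailing(float *A, const float *l, int n, int k) {
    int row = blockIdx.y * blockDim.y + threadIdx.y + k + 1;
    int col = blockIdx.x * blockDim.x + threadIdx.x + k;
    if (row < n && col < n) A[row * n + col] -= l[row] * A[k * n + col];
}

// Host-side driver (no pivoting; for illustration only).
void gaussian_eliminate(float *dA, float *dl, int n) {
    dim3 block(16, 16);
    for (int k = 0; k < n - 1; ++k) {
        compute_multipliers<<<(n + 255) / 256, 256>>>(dA, dl, n, k);
        dim3 grid((n - k + block.x - 1) / block.x, (n - k - 1 + block.y - 1) / block.y);
        update_trailing<<<grid, block>>>(dA, dl, n, k);
    }
    cudaDeviceSynchronize();
}

int main() {
    const int n = 3;
    float h_A[n * n] = {2, 1, 1,  4, 3, 3,  8, 7, 9};   // small demo system
    float *dA, *dl;
    cudaMalloc(&dA, sizeof(h_A));
    cudaMalloc(&dl, n * sizeof(float));
    cudaMemcpy(dA, h_A, sizeof(h_A), cudaMemcpyHostToDevice);
    gaussian_eliminate(dA, dl, n);
    cudaMemcpy(h_A, dA, sizeof(h_A), cudaMemcpyDeviceToHost);
    printf("U row 2: %g %g %g\n", h_A[6], h_A[7], h_A[8]);  // expect 0 0 2
    return 0;
}
```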

2017-11-01
Elsobky, Alaa Mahmoud, Farag, Abdelalim Kamal, Keshk, Arabi.  2016.  Efficient Implementation of McEliece Cryptosystem on Graphic Processing Unit. Proceedings of the 10th International Conference on Informatics and Systems. :247–253.
McEliece is a public-key cryptosystem based on error-correcting codes. It has the ability to resist quantum-computer attacks that can break various modern public-key cryptosystems such as RSA. Furthermore, its encryption and decryption are very fast and have good characteristics for data-parallel processing. Nowadays, modern graphics processing units (GPUs) are available on almost all hardware platforms. GPUs comprise many compute cores that can process huge amounts of data in parallel. In this paper, different implementations of the McEliece cryptosystem are explored on an NVIDIA GTX780 GPU using the OpenCL framework. Our implementation results show that the GPU is 331x faster than the CPU when local memory and vector data types are used to encrypt 2^16 messages.
2017-10-04
Lee, Won-Jong, Hwang, Seok Joong, Shin, Youngsam, Ryu, Soojung, Ihm, Insung.  2016.  Adaptive Multi-rate Ray Sampling on Mobile Ray Tracing GPU. SIGGRAPH ASIA 2016 Mobile Graphics and Interactive Applications. :3:1–3:6.
We present an adaptive multi-rate ray sampling algorithm targeting mobile ray-tracing GPUs. We efficiently combine two existing algorithms, adaptive supersampling and undersampling, into a single framework targeting ray-tracing GPUs and extend it to a new multi-rate sampling scheme by utilizing tile-based rendering and frame-to-frame coherency. The experimental results show that our implementation is a versatile solution for future ray-tracing GPUs as it provides up to 2.98 times better efficiency in terms of performance per Watt by reducing the number of rays to be fed into the dedicated hardware and minimizing the memory operations.
2017-05-18
Park, Jungho, Jung, Wookeun, Jo, Gangwon, Lee, Ilkoo, Lee, Jaejin.  2016.  PIPSEA: A Practical IPsec Gateway on Embedded APUs. Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security. :1255–1267.

Accelerated Processing Unit (APU) is a heterogeneous multicore processor that contains general-purpose CPU cores and a GPU in a single chip. It also supports Heterogeneous System Architecture (HSA) that provides coherent physically-shared memory between the CPU and the GPU. In this paper, we present the design and implementation of a high-performance IPsec gateway using a low-cost commodity embedded APU. The HSA supported by the APUs eliminates the data copy overhead between the CPU and the GPU, which is unavoidable in the previous discrete GPU approaches. The gateway is implemented in OpenCL to exploit the GPU and uses zero-copy packet I/O APIs in DPDK. The IPsec gateway handles the real-world network traffic where each packet has a different workload. The proposed packet scheduling algorithm significantly improves GPU utilization for such traffic. It works not only for APUs but also for discrete GPUs. With three CPU cores and one GPU in the APU, the IPsec gateway achieves a throughput of 10.36 Gbps with an average latency of 2.79 ms to perform AES-CBC+HMAC-SHA1 for incoming packets of 1024 bytes.

2017-05-16
Calix, Ricardo A., Cabrera, Armando, Iqbal, Irshad.  2016.  Analysis of Parallel Architectures for Network Intrusion Detection. Proceedings of the 5th Annual Conference on Research in Information Technology. :7–12.

Intrusion detection systems need to be both accurate and fast. Speed is important especially when operating at the network level. Additionally, many intrusion detection systems rely on signature based detection approaches. However, machine learning can also be helpful for intrusion detection. One key challenge when using machine learning, aside from the detection accuracy, is using machine learning algorithms that are fast. In this paper, several processing architectures are considered for use in machine learning based intrusion detection systems. These architectures include standard CPUs, GPUs, and cognitive processors. Results of their processing speeds are compared and discussed.

Anh, Pham Nguyen Quang, Fan, Rui, Wen, Yonggang.  2016.  Balanced Hashing and Efficient GPU Sparse General Matrix-Matrix Multiplication. Proceedings of the 2016 International Conference on Supercomputing. :36:1–36:12.

General sparse matrix-matrix multiplication (SpGEMM) is a core component of many algorithms. A number of recent works have used high throughput graphics processing units (GPUs) to accelerate SpGEMM. However, exploiting the power of GPUs for SpGEMM requires addressing a number of challenges, including highly imbalanced workloads and large numbers of inefficient random global memory accesses. This paper presents a SpGEMM algorithm which uses several novel techniques to overcome these problems. We first propose two low cost methods to achieve perfect load balancing during the most expensive step in SpGEMM. Next, we show how to eliminate nearly all random global memory accesses using shared memory based hash tables. To optimize the performance of the hash tables, we propose a lightweight method to estimate the number of nonzeros in the output matrix. We compared our algorithm to the CUSP, CUSPARSE and the state-of-the-art BHSPARSE GPU SpGEMM algorithms, and show that it performs 5.6x, 2.4x and 1.5x better on average, and up to 11.8x, 9.5x and 2.5x better in the best case, respectively. Furthermore, we show that our algorithm performs especially well on highly imbalanced and unstructured matrices.
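The paper's central idea of accumulating partial products in shared-memory hash tables, rather than scattering them to global memory, can be sketched as follows: each block owns one row of C, hashes output column indices into a small shared table with atomicCAS, and accumulates values with atomicAdd. The load balancing and nonzero estimation described in the abstract are not shown, and the fixed table size is an assumption that bounds the nonzeros per output row.

```cuda
// Sketch of accumulating SpGEMM partial products for C = A*B in a per-block
// shared-memory hash table keyed by output column index. Load balancing and
// nnz estimation from the paper are not shown; c_len must be zero-initialized,
// and TABLE_SIZE bounds the nonzeros per output row (an assumption).
#include <cuda_runtime.h>

#define TABLE_SIZE 256                       // per-block table (power of two)
#define EMPTY_KEY  (-1)

__device__ void hash_accumulate(int *keys, float *vals, int col, float v) {
    unsigned slot = ((unsigned)col * 2654435761u) & (TABLE_SIZE - 1);
    while (true) {
        int prev = atomicCAS(&keys[slot], EMPTY_KEY, col);   // claim or match slot
        if (prev == EMPTY_KEY || prev == col) {
            atomicAdd(&vals[slot], v);                       // accumulate partial product
            return;
        }
        slot = (slot + 1) & (TABLE_SIZE - 1);                // linear probing
    }
}

// One block per row of C: threads walk A's nonzeros in that row, multiply by the
// matching row of B, and accumulate into the shared table (CSR inputs assumed).
__global__ void spgemm_row(const int *a_ptr, const int *a_col, const float *a_val,
                           const int *b_ptr, const int *b_col, const float *b_val,
                           int *c_col, float *c_val, int *c_len) {
    __shared__ int   keys[TABLE_SIZE];
    __shared__ float vals[TABLE_SIZE];
    for (int i = threadIdx.x; i < TABLE_SIZE; i += blockDim.x) {
        keys[i] = EMPTY_KEY; vals[i] = 0.0f;
    }
    __syncthreads();

    int row = blockIdx.x;
    for (int j = a_ptr[row] + threadIdx.x; j < a_ptr[row + 1]; j += blockDim.x) {
        int   k = a_col[j];
        float a = a_val[j];
        for (int t = b_ptr[k]; t < b_ptr[k + 1]; ++t)
            hash_accumulate(keys, vals, b_col[t], a * b_val[t]);
    }
    __syncthreads();

    // Compact the table into a (column, value) list for this row of C.
    for (int i = threadIdx.x; i < TABLE_SIZE; i += blockDim.x) {
        if (keys[i] != EMPTY_KEY) {
            int pos = atomicAdd(&c_len[row], 1);
            c_col[row * TABLE_SIZE + pos] = keys[i];
            c_val[row * TABLE_SIZE + pos] = vals[i];
        }
    }
}
```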

2017-04-20
Najjar-Ghabel, S., Yousefi, S., Lighvan, M. Z..  2016.  A high speed implementation counter mode cryptography using hardware parallelism. 2016 Eighth International Conference on Information and Knowledge Technology (IKT). :55–60.
Nowadays, cryptography is one of the most common security mechanisms. Cryptography algorithms are used to secure data transmission over unsecured networks. Vital applications require techniques that encrypt/decrypt big data in a timely manner, because the data to be encrypted/decrypted vary in size and are usually large. In this paper, to meet these requirements, the counter mode cryptography (CTR) algorithm with a Data Encryption Standard (DES) core is parallelized using a Graphics Processing Unit (GPU). As a secondary part of our work, this parallel CTR algorithm is applied to a special network-on-chip (NoC) architecture designed with the Heracles toolkit. The results of a numerical comparison show that the GPU-based implementation achieves better runtime than the CPU-based one. Furthermore, our final implementations show that parallel CTR-mode cryptography achieves better runtime on the special NoC, implemented on an FPGA board, than on the GPU-based and CPU-based versions.
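CTR mode parallelizes naturally because every counter block is encrypted independently, so each GPU thread can produce one keystream block and XOR it into its own data block. The sketch below shows that structure with a trivial placeholder mixing function standing in for the paper's DES core; it is not DES and not secure, only an illustration of why CTR maps well onto thousands of threads.

```cuda
// Minimal sketch of CTR-mode parallelism on a GPU: each thread encrypts its own
// counter value and XORs the keystream into one data block. The placeholder
// mixing function below is NOT DES and NOT secure -- illustration only.
#include <cstdint>
#include <cstdio>
#include <cuda_runtime.h>

__device__ uint64_t toy_block_cipher(uint64_t block, uint64_t key) {
    uint64_t x = block ^ key;
    for (int r = 0; r < 4; ++r) {
        x *= 0x9E3779B97F4A7C15ull;          // mix
        x ^= x >> 29;
    }
    return x;
}

__global__ void ctr_encrypt(const uint64_t *in, uint64_t *out, int nblocks,
                            uint64_t key, uint64_t nonce) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < nblocks) {
        uint64_t keystream = toy_block_cipher(nonce + (uint64_t)i, key);  // E_k(counter_i)
        out[i] = in[i] ^ keystream;          // blocks are independent -> fully parallel
    }
}

int main() {
    const int nblocks = 1 << 20;
    uint64_t *buf;
    cudaMallocManaged(&buf, nblocks * sizeof(uint64_t));
    for (int i = 0; i < nblocks; ++i) buf[i] = i;
    ctr_encrypt<<<(nblocks + 255) / 256, 256>>>(buf, buf, nblocks,
                                                0x0123456789ABCDEFull, 42ull);
    cudaDeviceSynchronize();
    printf("ciphertext[0] = %llx\n", (unsigned long long)buf[0]);
    // Decryption is the same operation: XOR with the same keystream.
    return 0;
}
```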
2017-03-08
Degenbaeva, C., Klusch, M..  2015.  Critical Node Detection Problem Solving on GPU and in the Cloud. 2015 IEEE 17th International Conference on High Performance Computing and Communications, 2015 IEEE 7th International Symposium on Cyberspace Safety and Security, and 2015 IEEE 12th International Conference on Embedded S. :52–57.

The Critical Node Detection Problem (CNDP) is a well-known NP-complete, graph-theoretical problem with many real-world applications in various fields such as social network analysis, supply-chain network analysis, transport engineering, network immunization, and military strategic planning. We present the first parallel algorithms for CNDP solving in general, and for fast, approximated CND on GPU and in the cloud in particular. Finally, we discuss results of our experimental performance analysis of these solutions.

2017-02-14
Motamedi, A., Najafi, M., Erami, N..  2015.  Parallel secure turbo code for security enhancement in physical layer. 2015 Signal Processing and Intelligent Systems Conference (SPIS). :179–184.

Turbo codes have been an important subject in coding theory since 1993. These codes have a low Bit Error Rate (BER), but decoding complexity and delay are big challenges. On the other hand, considering the complexity and delay of separate blocks for coding and encryption, if these processes are combined, the security and reliability of the communication system can both be guaranteed. In this paper, a secure decoding algorithm running in parallel on General-Purpose Graphics Processing Units (GPGPU) is proposed. This is the first prototype of a fast and parallel Joint Channel-Security Coding (JCSC) system. Despite the added encryption process, the algorithm maintains the desired BER and increases decoding speed. We considered several techniques for parallelism: (1) distributing the decoding load of a code word across multiple cores, (2) decoding several code words simultaneously, and (3) using protection techniques to prevent performance degradation. We also propose two kinds of optimization to increase decoding speed: (1) improving memory access, and (2) using new GPU features such as concurrent kernel execution and advanced atomics to compensate for buffering latency.