Visible to the public Biblio

Filters: Keyword is graphics processing units  [Clear All Filters]
2018-02-21
Zhou, G., Feng, Y., Bo, R., Chien, L., Zhang, X., Lang, Y., Jia, Y., Chen, Z..  2017.  GPU-Accelerated Batch-ACPF Solution for N-1 Static Security Analysis. IEEE Transactions on Smart Grid. 8:1406–1416.

Graphics processing unit (GPU) has been applied successfully in many scientific computing realms due to its superior performances on float-pointing calculation and memory bandwidth, and has great potential in power system applications. The N-1 static security analysis (SSA) appears to be a candidate application in which massive alternating current power flow (ACPF) problems need to be solved. However, when applying existing GPU-accelerated algorithms to solve N-1 SSA problem, the degree of parallelism is limited because existing researches have been devoted to accelerating the solution of a single ACPF. This paper therefore proposes a GPU-accelerated solution that creates an additional layer of parallelism among batch ACPFs and consequently achieves a much higher level of overall parallelism. First, this paper establishes two basic principles for determining well-designed GPU algorithms, through which the limitation of GPU-accelerated sequential-ACPF solution is demonstrated. Next, being the first of its kind, this paper proposes a novel GPU-accelerated batch-QR solver, which packages massive number of QR tasks to formulate a new larger-scale problem and then achieves higher level of parallelism and better coalesced memory accesses. To further improve the efficiency of solving SSA, a GPU-accelerated batch-Jacobian-Matrix generating and contingency screening is developed and carefully optimized. Lastly, the complete process of the proposed GPU-accelerated batch-ACPF solution for SSA is presented. Case studies on an 8503-bus system show dramatic computation time reduction is achieved compared with all reported existing GPU-accelerated methods. In comparison to UMFPACK-library-based single-CPU counterpart using Intel Xeon E5-2620, the proposed GPU-accelerated SSA framework using NVIDIA K20C achieves up to 57.6 times speedup. It can even achieve four times speedup when compared to one of the fastest multi-core CPU parallel computing solution using KLU library. The prop- sed batch-solving method is practically very promising and lays a critical foundation for many other power system applications that need to deal with massive subtasks, such as Monte-Carlo simulation and probabilistic power flow.

2018-02-06
Badii, A., Faulkner, R., Raval, R., Glackin, C., Chollet, G..  2017.  Accelerated Encryption Algorithms for Secure Storage and Processing in the Cloud. 2017 International Conference on Advanced Technologies for Signal and Image Processing (ATSIP). :1–6.

The objective of this paper is to outline the design specification, implementation and evaluation of a proposed accelerated encryption framework which deploys both homomorphic and symmetric-key encryptions to serve the privacy preserving processing; in particular, as a sub-system within the Privacy Preserving Speech Processing framework architecture as part of the PPSP-in-Cloud Platform. Following a preliminary study of GPU efficiency gains optimisations benchmarked for AES implementation we have addressed and resolved the Big Integer processing challenges in parallel implementation of bilinear pairing thus enabling the creation of partially homomorphic encryption schemes which facilitates applications such as speech processing in the encrypted domain on the cloud. This novel implementation has been validated in laboratory tests using a standard speech corpus and can be used for other application domains to support secure computation and privacy preserving big data storage/processing in the cloud.

2018-01-10
Ouali, C., Dumouchel, P., Gupta, V..  2017.  Robust video fingerprints using positions of salient regions. 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). :3041–3045.
This paper describes a video fingerprinting system that is highly robust to audio and video transformations. The proposed system adapts a robust audio fingerprint extraction approach to video fingerprinting. The audio fingerprinting system converts the spectrogram into binary images, and then encodes the positions of salient regions selected from each binary image. Visual features are extracted in a similar way from the video images. We propose two visual fingerprint generation methods where fingerprints encode the positions of salient regions of greyscale video images. Salient regions of the first method are selected based on the intensity values of the image, while the second method identifies the regions that represent the highest variations between two successive images. The similarity between two fingerprints is defined as the intersection between their elements. The search algorithm is speeded up by an efficient implementation on a Graphics Processing Unit (GPU). We evaluate the performance of the proposed video system on TRECVID 2009 and 2010 datasets, and we show that this system achieves promising results and outperforms other state-of-the-art video copy detection methods for queries that do not includes geometric transformations. In addition, we show the effectiveness of this system for a challenging audio+video copy detection task.
2017-12-28
Panetta, J., Filho, P. R. P. S., Laranjeira, L. A. F., Teixeira, C. A..  2017.  Scalability of CPU and GPU Solutions of the Prime Elliptic Curve Discrete Logarithm Problem. 2017 29th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD). :33–40.

Elliptic curve asymmetric cryptography has achieved increased popularity due to its capability of providing comparable levels of security as other existing cryptographic systems while requiring less computational work. Pollard Rho and Parallel Collision Search, the fastest known sequential and parallel algorithms for breaking this cryptographic system, have been successfully applied over time to break ever-increasing bit-length system instances using implementations heavily optimized for the available hardware. This work presents portable, general implementations of a Parallel Collision Search based solution for prime elliptic curve asymmetric cryptographic systems that use publicly available big integer libraries and make no assumption on prime curve properties. It investigates which bit-length keys can be broken in reasonable time by a user that has access to a state of the art, public HPC equipment with CPUs and GPUs. The final implementation breaks a 79-bit system in about two hours using 80 GPUs and 94-bits system in about 15 hours using 256 GPUs. Extensive experimentation investigates scalability of CPU, GPU and CPU+GPU runs. The discussed results indicate that speed-up is not a good metric for parallel scalability. This paper proposes and evaluates a new metric that is better suited for this task.

Rolinger, T. B., Simon, T. A., Krieger, C. D..  2017.  Performance challenges for heterogeneous distributed tensor decompositions. 2017 IEEE High Performance Extreme Computing Conference (HPEC). :1–7.

Tensor decompositions, which are factorizations of multi-dimensional arrays, are becoming increasingly important in large-scale data analytics. A popular tensor decomposition algorithm is Canonical Decomposition/Parallel Factorization using alternating least squares fitting (CP-ALS). Tensors that model real-world applications are often very large and sparse, driving the need for high performance implementations of decomposition algorithms, such as CP-ALS, that can take advantage of many types of compute resources. In this work we present ReFacTo, a heterogeneous distributed tensor decomposition implementation based on DeFacTo, an existing distributed memory approach to CP-ALS. DFacTo reduces the critical routine of CP-ALS to a series of sparse matrix-vector multiplications (SpMVs). ReFacTo leverages GPUs within a cluster via MPI to perform these SpMVs and uses OpenMP threads to parallelize other routines. We evaluate the performance of ReFacTo when using NVIDIA's GPU-based cuSPARSE library and compare it to an alternative implementation that uses Intel's CPU-based Math Kernel Library (MKL) for the SpMV. Furthermore, we provide a discussion of the performance challenges of heterogeneous distributed tensor decompositions based on the results we observed. We find that on up to 32 nodes, the SpMV of ReFacTo when using MKL is up to 6.8× faster than ReFacTo when using cuSPARSE.

2017-12-04
Johnston, B., Lee, B., Angove, L., Rendell, A..  2017.  Embedded Accelerators for Scientific High-Performance Computing: An Energy Study of OpenCL Gaussian Elimination Workloads. 2017 46th International Conference on Parallel Processing Workshops (ICPPW). :59–68.

Energy efficient High-Performance Computing (HPC) is becoming increasingly important. Recent ventures into this space have introduced an unlikely candidate to achieve exascale scientific computing hardware with a small energy footprint. ARM processors and embedded GPU accelerators originally developed for energy efficiency in mobile devices, where battery life is critical, are being repurposed and deployed in the next generation of supercomputers. Unfortunately, the performance of executing scientific workloads on many of these devices is largely unknown, yet the bulk of computation required in high-performance supercomputers is scientific. We present an analysis of one such scientific code, in the form of Gaussian Elimination, and evaluate both execution time and energy used on a range of embedded accelerator SoCs. These include three ARM CPUs and two mobile GPUs. Understanding how these low power devices perform on scientific workloads will be critical in the selection of appropriate hardware for these supercomputers, for how can we estimate the performance of tens of thousands of these chips if the performance of one is largely unknown?

2017-04-20
Najjar-Ghabel, S., Yousefi, S., Lighvan, M. Z..  2016.  A high speed implementation counter mode cryptography using hardware parallelism. 2016 Eighth International Conference on Information and Knowledge Technology (IKT). :55–60.
Nowadays, cryptography is one of the common security mechanisms. Cryptography algorithms are used to make secure data transmission over unsecured networks. Vital applications are required to techniques that encrypt/decrypt big data at the appropriate time, because the data should be encrypted/decrypted are variable size and usually the size of them is large. In this paper, for the mentioned requirements, the counter mode cryptography (CTR) algorithm with Data Encryption Standard (DES) core is paralleled by using Graphics Processing Unit (GPU). A secondary part of our work, this parallel CTR algorithm is applied on special network on chip (NoC) architecture that designed by Heracles toolkit. The results of numerical comparison show that GPU-based implementation can be achieved better runtime in comparison to the CPU-based one. Furthermore, our final implementations show that parallel CTR mode cryptography is achieved better runtime by using special NoC that applied on FPGA board in comparison to GPU-based and CPU ones.
2017-03-08
Degenbaeva, C., Klusch, M..  2015.  Critical Node Detection Problem Solving on GPU and in the Cloud. 2015 IEEE 17th International Conference on High Performance Computing and Communications, 2015 IEEE 7th International Symposium on Cyberspace Safety and Security, and 2015 IEEE 12th International Conference on Embedded S. :52–57.

The Critical Node Detection Problem (CNDP) is a well-known NP-complete, graph-theoretical problem with many real-world applications in various fields such as social network analysis, supply-chain network analysis, transport engineering, network immunization, and military strategic planning. We present the first parallel algorithms for CNDP solving in general, and for fast, approximated CND on GPU and in the cloud in particular. Finally, we discuss results of our experimental performance analysis of these solutions.

2017-02-14
A. Motamedi, M. Najafi, N. Erami.  2015.  "Parallel secure turbo code for security enhancement in physical layer". 2015 Signal Processing and Intelligent Systems Conference (SPIS). :179-184.

Turbo code has been one of the important subjects in coding theory since 1993. This code has low Bit Error Rate (BER) but decoding complexity and delay are big challenges. On the other hand, considering the complexity and delay of separate blocks for coding and encryption, if these processes are combined, the security and reliability of communication system are guaranteed. In this paper a secure decoding algorithm in parallel on General-Purpose Graphics Processing Units (GPGPU) is proposed. This is the first prototype of a fast and parallel Joint Channel-Security Coding (JCSC) system. Despite of encryption process, this algorithm maintains desired BER and increases decoding speed. We considered several techniques for parallelism: (1) distribute decoding load of a code word between multiple cores, (2) simultaneous decoding of several code words, (3) using protection techniques to prevent performance degradation. We also propose two kinds of optimizations to increase the decoding speed: (1) memory access improvement, (2) the use of new GPU properties such as concurrent kernel execution and advanced atomics to compensate buffering latency.