Biblio
Image encryption is an essential part of a Visual Cryptography. Existing traditional sequential encryption techniques are infeasible to real-time applications. High-performance reformulations of such methods are increasingly growing over the last decade. These reformulations proved better performances over their sequential counterparts. A rotational encryption scheme encrypts the images in such a way that the decryption is possible with the rotated encrypted images. A parallel rotational encryption technique makes use of a high-performance device. But it less-leverages the optimizations offered by them. We propose a rotational image encryption technique which makes use of memory coalescing provided by the Compute Unified Device Architecture (CUDA). The proposed scheme achieves improved global memory utilization and increased efficiency.
Now-a-days, the security of data becomes more and more important, people store many personal information in their phones. However, stored information require security and maintain privacy. Encryption algorithm has become the main force of maintaining the security of data. Thus, the algorithm complexity and encryption efficiency have become the main measurement of whether the encryption algorithm is save or not. With the development of hardware, we have many tools to improve the algorithm at present. Because modular exponentiation in RSA algorithm can be divided into several parts mathematically. In this paper, we introduce a conception by dividing the process of encryption and add the model into graphics process unit (GPU). By using GPU's capacity in parallel computing, the core of RSA can be accelerated by using central process unit (CPU) and GPU. Compute unified device architecture (CUDA) is a platform which can combine CPU and GPU together to realize GPU parallel programming and this is the tool we use to perform experience of accelerating RSA algorithm. This paper will also build up a mathematical model to help understand the mechanism of RSA encryption algorithm.
The advancement in technology has changed how people work and what software and hardware people use. From conventional personal computer to GPU, hardware technology and capability have dramatically improved so does the operating systems that come along. Unfortunately, current industry practice to compare OS is performed with single perspective. It is either benchmark the hardware level performance or performs penetration testing to check the security features of an OS. This rigid method of benchmarking does not really reflect the true performance of an OS as the performance analysis is not comprehensive and conclusive. To illustrate this deficiency, the study performed hardware level and operational level benchmarking on Windows XP, Windows 7 and Windows 8 and the results indicate that there are instances where Windows XP excels over its newer counterparts. Overall, the research shows Windows 8 is a superior OS in comparison to its predecessors running on the same hardware. Furthermore, the findings also show that the automated benchmarking tools are proved less efficient benchmark systems that run on Windows XP and older OS as they do not support DirectX 11 and other advanced features that the hardware supports. There lies the need to have a unified benchmarking approach to compare other aspects of OS such as user oriented tasks and security parameters to provide a complete comparison. Therefore, this paper is proposing a unified approach for Operating System (OS) comparisons with the help of a Windows OS case study. This unified approach includes comparison of OS from three aspects which are; hardware level, operational level performance and security tests.
Elliptic curve asymmetric cryptography has achieved increased popularity due to its capability of providing comparable levels of security as other existing cryptographic systems while requiring less computational work. Pollard Rho and Parallel Collision Search, the fastest known sequential and parallel algorithms for breaking this cryptographic system, have been successfully applied over time to break ever-increasing bit-length system instances using implementations heavily optimized for the available hardware. This work presents portable, general implementations of a Parallel Collision Search based solution for prime elliptic curve asymmetric cryptographic systems that use publicly available big integer libraries and make no assumption on prime curve properties. It investigates which bit-length keys can be broken in reasonable time by a user that has access to a state of the art, public HPC equipment with CPUs and GPUs. The final implementation breaks a 79-bit system in about two hours using 80 GPUs and 94-bits system in about 15 hours using 256 GPUs. Extensive experimentation investigates scalability of CPU, GPU and CPU+GPU runs. The discussed results indicate that speed-up is not a good metric for parallel scalability. This paper proposes and evaluates a new metric that is better suited for this task.
Tensor decompositions, which are factorizations of multi-dimensional arrays, are becoming increasingly important in large-scale data analytics. A popular tensor decomposition algorithm is Canonical Decomposition/Parallel Factorization using alternating least squares fitting (CP-ALS). Tensors that model real-world applications are often very large and sparse, driving the need for high performance implementations of decomposition algorithms, such as CP-ALS, that can take advantage of many types of compute resources. In this work we present ReFacTo, a heterogeneous distributed tensor decomposition implementation based on DeFacTo, an existing distributed memory approach to CP-ALS. DFacTo reduces the critical routine of CP-ALS to a series of sparse matrix-vector multiplications (SpMVs). ReFacTo leverages GPUs within a cluster via MPI to perform these SpMVs and uses OpenMP threads to parallelize other routines. We evaluate the performance of ReFacTo when using NVIDIA's GPU-based cuSPARSE library and compare it to an alternative implementation that uses Intel's CPU-based Math Kernel Library (MKL) for the SpMV. Furthermore, we provide a discussion of the performance challenges of heterogeneous distributed tensor decompositions based on the results we observed. We find that on up to 32 nodes, the SpMV of ReFacTo when using MKL is up to 6.8× faster than ReFacTo when using cuSPARSE.
Energy efficient High-Performance Computing (HPC) is becoming increasingly important. Recent ventures into this space have introduced an unlikely candidate to achieve exascale scientific computing hardware with a small energy footprint. ARM processors and embedded GPU accelerators originally developed for energy efficiency in mobile devices, where battery life is critical, are being repurposed and deployed in the next generation of supercomputers. Unfortunately, the performance of executing scientific workloads on many of these devices is largely unknown, yet the bulk of computation required in high-performance supercomputers is scientific. We present an analysis of one such scientific code, in the form of Gaussian Elimination, and evaluate both execution time and energy used on a range of embedded accelerator SoCs. These include three ARM CPUs and two mobile GPUs. Understanding how these low power devices perform on scientific workloads will be critical in the selection of appropriate hardware for these supercomputers, for how can we estimate the performance of tens of thousands of these chips if the performance of one is largely unknown?
Accelerated Processing Unit (APU) is a heterogeneous multicore processor that contains general-purpose CPU cores and a GPU in a single chip. It also supports Heterogeneous System Architecture (HSA) that provides coherent physically-shared memory between the CPU and the GPU. In this paper, we present the design and implementation of a high-performance IPsec gateway using a low-cost commodity embedded APU. The HSA supported by the APUs eliminates the data copy overhead between the CPU and the GPU, which is unavoidable in the previous discrete GPU approaches. The gateway is implemented in OpenCL to exploit the GPU and uses zero-copy packet I/O APIs in DPDK. The IPsec gateway handles the real-world network traffic where each packet has a different workload. The proposed packet scheduling algorithm significantly improves GPU utilization for such traffic. It works not only for APUs but also for discrete GPUs. With three CPU cores and one GPU in the APU, the IPsec gateway achieves a throughput of 10.36 Gbps with an average latency of 2.79 ms to perform AES-CBC+HMAC-SHA1 for incoming packets of 1024 bytes.
Intrusion detection systems need to be both accurate and fast. Speed is important especially when operating at the network level. Additionally, many intrusion detection systems rely on signature based detection approaches. However, machine learning can also be helpful for intrusion detection. One key challenge when using machine learning, aside from the detection accuracy, is using machine learning algorithms that are fast. In this paper, several processing architectures are considered for use in machine learning based intrusion detection systems. These architectures include standard CPUs, GPUs, and cognitive processors. Results of their processing speeds are compared and discussed.
General sparse matrix-matrix multiplication (SpGEMM) is a core component of many algorithms. A number of recent works have used high throughput graphics processing units (GPUs) to accelerate SpGEMM. However, exploiting the power of GPUs for SpGEMM requires addressing a number of challenges, including highly imbalanced workloads and large numbers of inefficient random global memory accesses. This paper presents a SpGEMM algorithm which uses several novel techniques to overcome these problems. We first propose two low cost methods to achieve perfect load balancing during the most expensive step in SpGEMM. Next, we show how to eliminate nearly all random global memory accesses using shared memory based hash tables. To optimize the performance of the hash tables, we propose a lightweight method to estimate the number of nonzeros in the output matrix. We compared our algorithm to the CUSP, CUSPARSE and the state-of-the-art BHSPARSE GPU SpGEMM algorithms, and show that it performs 5.6x, 2.4x and 1.5x better on average, and up to 11.8x, 9.5x and 2.5x better in the best case, respectively. Furthermore, we show that our algorithm performs especially well on highly imbalanced and unstructured matrices.
The Critical Node Detection Problem (CNDP) is a well-known NP-complete, graph-theoretical problem with many real-world applications in various fields such as social network analysis, supply-chain network analysis, transport engineering, network immunization, and military strategic planning. We present the first parallel algorithms for CNDP solving in general, and for fast, approximated CND on GPU and in the cloud in particular. Finally, we discuss results of our experimental performance analysis of these solutions.
Turbo code has been one of the important subjects in coding theory since 1993. This code has low Bit Error Rate (BER) but decoding complexity and delay are big challenges. On the other hand, considering the complexity and delay of separate blocks for coding and encryption, if these processes are combined, the security and reliability of communication system are guaranteed. In this paper a secure decoding algorithm in parallel on General-Purpose Graphics Processing Units (GPGPU) is proposed. This is the first prototype of a fast and parallel Joint Channel-Security Coding (JCSC) system. Despite of encryption process, this algorithm maintains desired BER and increases decoding speed. We considered several techniques for parallelism: (1) distribute decoding load of a code word between multiple cores, (2) simultaneous decoding of several code words, (3) using protection techniques to prevent performance degradation. We also propose two kinds of optimizations to increase the decoding speed: (1) memory access improvement, (2) the use of new GPU properties such as concurrent kernel execution and advanced atomics to compensate buffering latency.