Title | G-TADOC: Enabling Efficient GPU-Based Text Analytics without Decompression |
Publication Type | Conference Paper |
Year of Publication | 2021 |
Authors | Zhang, Feng; Pan, Zaifeng; Zhou, Yanliang; Zhai, Jidong; Shen, Xipeng; Mutlu, Onur; Du, Xiaoyong |
Conference Name | 2021 IEEE 37th International Conference on Data Engineering (ICDE) |
Keywords | analytics on compressed data, composability, Data analysis, data structures, GPU, graphics processing units, Human Behavior, Instruction sets, Metrics, parallel processing, parallelism, pubcrawl, Scalability, Sensitivity, TADOC, text analytics, Writing |
Abstract | Text analytics directly on compression (TADOC) has proven to be a promising technology for big data analytics. GPUs are extremely popular accelerators for data analytics systems. Unfortunately, no work so far shows how to utilize GPUs to accelerate TADOC. We describe G-TADOC, the first framework that provides GPU-based text analytics directly on compression, effectively enabling efficient text analytics on GPUs without decompressing the input data. G-TADOC solves three major challenges. First, TADOC involves a large number of dependencies, which makes it difficult to exploit massive parallelism on a GPU. We develop a novel fine-grained thread-level workload scheduling strategy for GPU threads, which partitions heavily-dependent loads adaptively in a fine-grained manner. Second, in developing G-TADOC, thousands of GPU threads writing to the same result buffer leads to inconsistency, while directly using locks and atomic operations leads to large synchronization overheads. We develop a memory pool with thread-safe data structures on GPUs to handle such difficulties. Third, maintaining the sequence information among words is essential for lossless compression. We design a sequence-support strategy, which maintains high GPU parallelism while ensuring sequence information. Our experimental evaluations show that G-TADOC provides 31.1x average speedup compared to state-of-the-art TADOC. |
DOI | 10.1109/ICDE51399.2021.00148 |
Citation Key | zhang_g-tadoc_2021 |