G-TADOC: Enabling Efficient GPU-Based Text Analytics without Decompression

Title: G-TADOC: Enabling Efficient GPU-Based Text Analytics without Decompression
Publication Type: Conference Paper
Year of Publication: 2021
Authors: Zhang, Feng; Pan, Zaifeng; Zhou, Yanliang; Zhai, Jidong; Shen, Xipeng; Mutlu, Onur; Du, Xiaoyong
Conference Name: 2021 IEEE 37th International Conference on Data Engineering (ICDE)
Keywords: analytics on compressed data, composability, Data analysis, data structures, GPU, graphics processing units, Human Behavior, Instruction sets, Metrics, parallel processing, parallelism, pubcrawl, Scalability, Sensitivity, TADOC, text analytics, Writing
Abstract: Text analytics directly on compression (TADOC) has proven to be a promising technology for big data analytics. GPUs are extremely popular accelerators for data analytics systems. Unfortunately, no work so far shows how to utilize GPUs to accelerate TADOC. We describe G-TADOC, the first framework that provides GPU-based text analytics directly on compression, effectively enabling efficient text analytics on GPUs without decompressing the input data. G-TADOC solves three major challenges. First, TADOC involves a large number of dependencies, which makes it difficult to exploit massive parallelism on a GPU. We develop a novel fine-grained thread-level workload scheduling strategy for GPU threads, which adaptively partitions heavily dependent loads in a fine-grained manner. Second, in developing G-TADOC, thousands of GPU threads writing to the same result buffer leads to inconsistency, while directly using locks and atomic operations leads to large synchronization overheads. We develop a memory pool with thread-safe data structures on GPUs to handle such difficulties. Third, maintaining the sequence information among words is essential for lossless compression. We design a sequence-support strategy, which maintains high GPU parallelism while preserving sequence information. Our experimental evaluations show that G-TADOC provides a 31.1x average speedup compared to state-of-the-art TADOC.
DOI: 10.1109/ICDE51399.2021.00148
Citation Key: zhang_g-tadoc_2021
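
The abstract does not spell out the compressed format, so the following is only a minimal sketch of what "analytics directly on compression" can look like. It assumes a grammar-style compressed text in which each rule is a list of words and references to other rules, forming a DAG. The names `RULES`, `topo_order`, `word_counts`, and `expand` are hypothetical and chosen for illustration. The code is a sequential CPU word count, not G-TADOC's GPU algorithm. It does show the rule-to-rule dependencies the abstract mentions: a rule's occurrence count is known only after all rules that reference it have been processed.

```python
from collections import Counter

# Hypothetical grammar-compressed text: integers reference other rules,
# strings are words. Rule 0 is the root (the whole document).
RULES = {
    0: [1, 1, 2, "e"],
    1: [2, "c", 2, "d"],
    2: ["a", "b"],
}


def topo_order(rules, root=0):
    """Return rule ids so that every rule appears before the rules it references."""
    seen, post = set(), []

    def visit(rule_id):
        if rule_id in seen:
            return
        seen.add(rule_id)
        for sym in rules[rule_id]:
            if isinstance(sym, int):
                visit(sym)
        post.append(rule_id)

    visit(root)
    return post[::-1]  # reverse postorder: parents before children


def word_counts(rules, root=0):
    """Count words without expanding the grammar."""
    # weight[r] = number of times rule r occurs in the fully expanded text.
    weight = Counter({root: 1})
    for rule_id in topo_order(rules, root):
        for sym in rules[rule_id]:
            if isinstance(sym, int):
                weight[sym] += weight[rule_id]

    # Each word in a rule body contributes once per occurrence of that rule.
    counts = Counter()
    for rule_id, w in weight.items():
        for sym in rules[rule_id]:
            if isinstance(sym, str):
                counts[sym] += w
    return counts


def expand(rules, root=0):
    """Fully decompress the text; used here only to cross-check the counts."""
    out = []
    for sym in rules[root]:
        out.extend(expand(rules, sym) if isinstance(sym, int) else [sym])
    return out
```

With these rules, `word_counts(RULES)` visits each rule body once and yields the same counts as `Counter(expand(RULES))`, which first has to materialize all 15 words. The weight propagation step is the dependency chain that a parallel version would need to schedule carefully.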