Cross-Layer Aggregation with Transformers for Multi-Label Image Classification

Submitted by grigby1 on Tue, 05/30/2023 - 3:18pm

Title	Cross-Layer Aggregation with Transformers for Multi-Label Image Classification
Publication Type	Conference Paper
Year of Publication	2022
Authors	Zhang, Weibo; Zhu, Fuqing; Han, Jizhong; Guo, Tao; Hu, Songlin
Conference Name	ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Keywords	Benchmark testing, composability, compositionality, Computational efficiency, Costs, Cross Layer Security, Cross-layer aggregation, feature extraction, multi-label image classification, pubcrawl, resilience, Resiliency, Task Analysis, Training, Transformers
Abstract	The multi-label image classification task aims to predict multiple object labels in a given image and faces the challenge of variable-sized objects. Limited by the size of CNN convolution kernels, existing CNN-based methods have difficulty capturing global dependencies and effectively fusing features from multiple layers, which are critical for this task. Recently, transformers have used multi-head attention to extract features with long-range dependencies. Inspired by this, this paper proposes a Cross-layer Aggregation with Transformers (CAT) framework, which leverages transformers to capture the long-range dependencies of CNN-based features with a Long Range Dependencies module and to aggregate the features layer by layer with a Cross-Layer Fusion module. To make the framework efficient, a multi-head pre-max attention is designed to reduce the computation cost when fusing the high-resolution features of lower layers. On two widely used benchmarks (i.e., VOC2007 and MS-COCO), CAT provides a stable improvement over the baseline and achieves competitive performance.
DOI	10.1109/ICASSP43922.2022.9747696
Citation Key	zhang_cross-layer_2022
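
Illustrative note: the abstract says that pre-max attention reduces the cost of fusing high-resolution lower-layer features, but it does not describe the mechanism. The sketch below shows one generic way such a saving could arise, assuming the lower-layer tokens are max-pooled before they serve as keys and values. This is an assumption for illustration, not the paper's module; the function names, token sizes, and pooling window are hypothetical, and a single attention head stands in for the multi-head case.

```python
import math

def max_pool_tokens(tokens, window):
    """Element-wise max over consecutive groups of `window` tokens."""
    pooled = []
    for start in range(0, len(tokens), window):
        group = tokens[start:start + window]
        pooled.append([max(column) for column in zip(*group)])
    return pooled

def attention(queries, keys, values):
    """Single-head scaled dot-product attention over lists of vectors."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        outputs.append([
            sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(len(values[0]))
        ])
    return outputs

# Hypothetical sizes: 4 higher-layer tokens attend to 16 lower-layer tokens.
high = [[math.sin(i + d) for d in range(8)] for i in range(4)]
low = [[math.cos(i * d + 1) for d in range(8)] for i in range(16)]

# Assumed step: pool the high-resolution tokens before attention.
pooled = max_pool_tokens(low, window=4)
fused = attention(high, pooled, pooled)

# Query-key pairs scored without and with pooling.
full_pairs = len(high) * len(low)
pooled_pairs = len(high) * len(pooled)
```

With these toy sizes, pooling by a window of 4 cuts the number of query-key pairs scored from 64 to 16, while the output keeps one fused vector per higher-layer token.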

Groups:

Science of Security VO