TY - JOUR
T1 - RTA: A Reconfigurable Transformer Accelerator Exploiting Sparsity via Low-Bit-Width Prediction
AU - Chen, Yujie
AU - Yang, Chen
AU - Xia, Yuheng
AU - Meng, Yishuo
AU - Wang, Jianfei
AU - Fu, Qiang
AU - Geng, Li
N1 - Publisher Copyright:
© 2025 IEEE.
PY - 2025
Y1 - 2025
AB - Transformer models have received widespread attention in recent years. They have gradually replaced recurrent neural networks (RNNs) in natural language processing (NLP) and are widely used in tasks such as machine translation, text generation, and language understanding. Transformers have likewise shown impressive results in computer vision (CV). However, their attention mechanism places high demands on the computational and storage resources of the hardware, and deploying transformers on edge computing platforms is challenging due to their complex data flow, intensive matrix computations, and the need for high-precision nonlinear functions. To address these challenges, we propose the reconfigurable transformer accelerator (RTA), a transformer hardware accelerator that uses low-bit-width prediction to achieve dynamic sparsity. RTA reduces resource consumption by performing sparse matrix multiplications with low-bit-width operations, while its reconfigurable design allows the sparse module to be reused for high-precision, large-bit-width matrix multiplications. We also optimize the RTA computing pipeline to reduce resource usage and improve computational efficiency, and we incorporate feature sharing to further improve the hardware accelerator's resource utilization. Experimental results on the Transformer-base model show that RTA achieves an average performance of 994 GOPS and a digital signal processor (DSP) efficiency of 1412. Compared to state-of-the-art transformer accelerators, RTA achieves (Formula presented) efficiency.
KW - Hardware acceleration
KW - hardware reconfigurability
KW - low-bit-width prediction
KW - transformer
UR - https://www.scopus.com/pages/publications/105008907320
U2 - 10.1109/TVLSI.2025.3578092
DO - 10.1109/TVLSI.2025.3578092
M3 - Article
AN - SCOPUS:105008907320
SN - 1063-8210
VL - 33
SP - 2702
EP - 2714
JO - IEEE Transactions on Very Large Scale Integration (VLSI) Systems
JF - IEEE Transactions on Very Large Scale Integration (VLSI) Systems
IS - 10
ER -