
FP2: A 2-bit Floating-Point Format for Edge-AI Inference and Fine-Tuning

  • Qiwei Dang
  • Chengyu Ma
  • Haiduo Huang
  • Gelin Fu
  • Zhiwang Huo
  • Guoming Yang
  • Pengchen Zong
  • Tian Xia
  • Wenzhe Zhao
  • Pengju Ren
  • Xi'an Jiaotong University

Research output: Contribution to journal › Article › peer-review

1 Scopus citation

Abstract

The increasing scale of Deep Neural Networks (DNNs) has made 2-bit quantization crucial for mitigating memory bottlenecks on edge devices. Low-bitwidth floating-point formats, which offer larger dynamic ranges and avoid quantization steps, have emerged as promising alternatives to fixed-point quantization. However, constructing viable floating-point representations with fewer than 3 bits remains challenging, because conventional formats require at least one sign bit, one exponent bit, and one mantissa bit. We address this challenge by introducing a novel data compression method that uses a 4-bit encoding space to represent two floating-point values, achieving an effective storage density of 2 bits per value. Depending on the bit widths of the exponent and mantissa, we propose two 2-bit floating-point encodings: fp2-e1m0 and fp2-e0m1. Based on fp2, we introduce two computing architectures that simplify floating-point multiply-accumulate (MAC) operations into bitwise addition and logic operations, reducing floating-point computation by factors of 2 and 4. As a result, fp2 offers a practical solution for efficient inference using floating-point arithmetic on resource-constrained edge devices. Moreover, we analyze the error characteristics of the fp2 data format from three perspectives. To validate the effectiveness of the fp2 format, we conduct experiments on ResNet18/50 and ConvNeXt-Tiny using the CIFAR-10 and ImageNet-1K datasets. Compared to fp4, our approach reduces model size by 47% with an accuracy loss of less than 2 percentage points. Notably, on CIFAR-10, some results are close to those of fp32. In addition, when evaluated under 2-bit GPTQ, fp2 demonstrates significant advantages over the baseline method on the LLAMA model. For hardware evaluation, we implement our design at the RTL level and evaluate it on both FPGA and ASIC platforms. Compared to computation architectures based on fp4, our fp4 × fp2 processing element (PE) array reduces area by 15% and power consumption by 8%. Furthermore, our fp2 × fp2 PE array achieves a 78% reduction in both area and power consumption.
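
The storage-density claim in the abstract (one 4-bit code representing two values, so 2 bits per value) can be illustrated with a minimal Python sketch. The abstract does not specify the actual fp2-e1m0 or fp2-e0m1 bit layouts, so everything concrete here is an assumption made for illustration: the value set `LEVELS`, the lookup table `CODEBOOK`, and the helper names `pack` and `unpack` are hypothetical and are not the authors' encoding.

```python
from itertools import product

# Hypothetical 4-level value set. The abstract does not give the real fp2
# levels; these numbers are placeholders chosen for illustration only.
LEVELS = (-1.0, -0.5, 0.5, 1.0)

# Hypothetical codebook: each of the 16 possible 4-bit codes maps to one
# ordered pair of values, so a single code stores two values.
CODEBOOK = list(product(LEVELS, repeat=2))
PAIR_TO_CODE = {pair: code for code, pair in enumerate(CODEBOOK)}


def pack(values):
    """Pack values into bytes: 2 values per 4-bit code, 2 codes per byte."""
    if len(values) % 4:
        raise ValueError("length must be a multiple of 4")
    # Look up one 4-bit code for every consecutive pair of values.
    codes = [
        PAIR_TO_CODE[(values[i], values[i + 1])]
        for i in range(0, len(values), 2)
    ]
    # Store two 4-bit codes in each byte (high nibble first).
    return bytes(
        (codes[i] << 4) | codes[i + 1] for i in range(0, len(codes), 2)
    )


def unpack(data):
    """Recover the value list from bytes produced by pack()."""
    values = []
    for byte in data:
        for code in (byte >> 4, byte & 0x0F):
            values.extend(CODEBOOK[code])
    return values


weights = [1.0, -0.5, 0.5, 0.5, -1.0, -1.0, 1.0, 1.0]
packed = pack(weights)
bits_per_value = len(packed) * 8 / len(weights)
```

In this sketch eight values occupy two bytes, which is the 2-bits-per-value density the abstract describes. A codebook that is a plain Cartesian product, as assumed here, is equivalent to two independent 2-bit fields; the paper's joint encoding may assign the 16 codes differently.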

Keywords

  • Deep neural network
  • hardware accelerator
  • low-precision floating-point number
