- Date of publishing: Oct 24, 2023
- 10 Pages
Neural network quantization is one of the most effective ways to improve the efficiency of neural networks. Quantization allows weights, activations, and gradients to be represented in low bit-width formats. This leads to a reduction in data movement and enables the use of low bit-width computations, resulting in faster inference and lower energy consumption.
Integer and floating-point formats are the two primary ways to represent numbers in computing systems. The key distinction between them lies in how their representable values are distributed. The integer format has a uniform distribution across its representable range, with a step of 1 between consecutive values. The floating-point format has a non-uniform distribution, providing higher precision for small-magnitude numbers and lower precision for large ones.
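The difference in value distribution can be made concrete by enumerating every representable value of a small format. The sketch below is illustrative only: it assumes a two's-complement INT4 and a generic 4-bit floating-point layout with 1 sign bit, 2 exponent bits, 1 mantissa bit, an exponent bias of 1, subnormals, and no infinities or NaNs. This layout is an assumption chosen for the example and is not the TAI format described in this paper; the helper names `int_values`, `fp_values`, and `gaps` are hypothetical.

```python
def int_values(bits):
    """All values of a two's-complement integer with the given bit-width."""
    return list(range(-(2 ** (bits - 1)), 2 ** (bits - 1)))


def fp_values(exp_bits, man_bits, bias):
    """All values of a small sign/exponent/mantissa format.

    Assumed layout (illustrative, not TAI): exponent field 0 encodes
    subnormals, and no encodings are reserved for infinity or NaN.
    """
    values = set()
    for e in range(2 ** exp_bits):
        for m in range(2 ** man_bits):
            frac = m / 2 ** man_bits
            if e == 0:
                mag = frac * 2.0 ** (1 - bias)        # subnormal: no implicit 1
            else:
                mag = (1.0 + frac) * 2.0 ** (e - bias)  # normal: implicit leading 1
            values.update((mag, -mag))
    return sorted(values)


def gaps(values):
    """Differences between consecutive representable values."""
    return [b - a for a, b in zip(values, values[1:])]


int4 = int_values(4)
fp4 = fp_values(exp_bits=2, man_bits=1, bias=1)

print("INT4 gaps:", sorted(set(gaps(int4))))      # [1]
print("FP4 non-negative values:", [v for v in fp4 if v >= 0])
# [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
print("FP4 gaps:", sorted(set(gaps(fp4))))        # [0.5, 1.0, 2.0]
```

Under these assumptions the integer grid has a constant step of 1, while the floating-point grid uses a step of 0.5 near zero and a step of 2 at the top of its range.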
Low-bit floating-point formats have recently emerged as promising alternatives for DNN quantization. Leading hardware vendors already support such formats; NVIDIA, for example, supports FP8 on the H100.
The optimal format depends on several factors, including whether the tensor is static or dynamic and the quantization bit-width. Other research indicates that, for weight tensors with static distributions, INT quantization outperforms FP quantization at 8 bits, but this advantage diminishes at 4 bits. For activation tensors with dynamic distributions, FP quantization surpasses INT quantization because FP can represent large values with lower precision and small values with higher precision.
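One way such a format comparison could be set up is to round a tensor onto each format's grid of representable values and measure the resulting error. The following is a minimal sketch under stated assumptions: per-tensor absmax scaling, round-to-nearest, a symmetric INT4 grid of [-7, 7], and the same illustrative 4-bit floating-point grid (2 exponent bits, 1 mantissa bit) rather than the TAI format. The function names and the Gaussian test tensor are hypothetical, and the printed errors say nothing about the results reported in the research cited above, since the outcome depends on the tensor distribution.

```python
import random

# Assumed grids (illustrative only, not TAI): symmetric INT4 and a 4-bit FP layout.
INT4_GRID = [float(v) for v in range(-7, 8)]
FP4_GRID = sorted({s * v for s in (-1.0, 1.0)
                   for v in (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)})


def quantize(tensor, grid):
    """Scale by absmax onto the grid, round to the nearest grid point, scale back."""
    absmax = max(abs(x) for x in tensor)
    if absmax == 0.0:
        return list(tensor)
    scale = absmax / max(grid)  # maps the largest magnitude to the grid maximum
    return [scale * min(grid, key=lambda g: abs(x / scale - g)) for x in tensor]


def mse(a, b):
    """Mean squared error between two equally long sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


random.seed(0)
sample = [random.gauss(0.0, 1.0) for _ in range(1000)]  # hypothetical test tensor
for name, grid in (("INT4", INT4_GRID), ("FP4", FP4_GRID)):
    print(name, "MSE:", mse(sample, quantize(sample, grid)))
```

Swapping `sample` for a tensor with a different distribution (for example, one with a few large outliers) shows how the ranking of the two grids can change with the data.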
In this paper we describe the Prodigy 4-bit Tachyum AI (TAI) format, including the effectively 2-bit-per-weight (TAI2) format, end-to-end training techniques with TAI, and inference using TAI, and we provide experimental results.