- Published: Aug 22, 2023
- 15 pages
The current trend in deep learning is to increase the number of model parameters so that models can handle difficult tasks, such as modeling the complex nonlinear systems found in natural language processing. This trend has a downside: the hardware required to train such large models is available to only a few labs in the world. This paper focuses on the benefits of compressing models both during and after training. We show that a model can be quantized to 8-bit precision during training and then block pruned, leaving only 37% of its parameters. This method significantly speeds up training while also reducing the model's memory footprint after training. Moreover, if the hardware running the model efficiently supports 8-bit operations and sparse matrices, inference is accelerated further.
Keywords: 8-bit Floating Point, Quantization-Aware Training, Post-training Quantization, Block Pruning, Model Compression
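The two compression steps described in the abstract can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the symmetric per-tensor fake quantization, the 4x4 block size, the L1 tile-scoring rule, and the function names are all assumptions chosen only to show the general shape of "quantize during training, then prune whole blocks so that roughly 37% of parameters remain".

```python
import numpy as np

def fake_quantize(w, num_bits=8):
    # Simulate low-precision weights in the forward pass (symmetric,
    # per-tensor). In real QAT the backward pass would use a
    # straight-through estimator so gradients ignore the rounding.
    qmax = 2 ** (num_bits - 1) - 1                  # 127 for signed 8-bit
    scale = np.max(np.abs(w)) / qmax or 1.0         # avoid division by zero
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale                                # dequantized view of w

def block_prune(w, block=4, keep_ratio=0.37):
    # Zero out whole (block x block) tiles with the smallest L1 norm,
    # keeping roughly `keep_ratio` of the parameters.
    rows, cols = w.shape
    assert rows % block == 0 and cols % block == 0
    tiles = w.reshape(rows // block, block, cols // block, block)
    norms = np.abs(tiles).sum(axis=(1, 3))          # one score per tile
    k = max(1, int(round(norms.size * keep_ratio))) # number of tiles to keep
    thresh = np.sort(norms.ravel())[-k]             # k-th largest score
    mask = (norms >= thresh)[:, None, :, None]      # broadcast over tiles
    return (tiles * mask).reshape(rows, cols)

rng = np.random.default_rng(0)
w = rng.standard_normal((16, 16)).astype(np.float32)
w_compressed = block_prune(fake_quantize(w), block=4, keep_ratio=0.37)
density = np.count_nonzero(w_compressed) / w_compressed.size
print(f"nonzero fraction after pruning: {density:.2f}")
```

Because pruning removes entire contiguous tiles rather than scattered weights, the surviving matrix keeps a regular structure that sparse-matrix kernels on supporting hardware can exploit, which is the acceleration the abstract refers to.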