Mainstreaming Large Language Models With 2-bit TAI Weights

Introduction

Generative Pre-trained Transformer (GPT) models set themselves apart through breakthrough performance across complex language modelling tasks, including text generation, few-shot learning, reasoning, and protein sequence modelling, but also through their extremely high computational and storage costs. Specifically, due to their massive size, even inference for large, highly accurate GPT models may require multiple high-performance GPUs, which limits the usability of such models.

The capacity of language models has grown by more than 1,000 times within a few years, from BERT's 340 million parameters to Megatron-Turing's 530 billion dense parameters and the Switch Transformer's 1.6 trillion sparse parameters, accompanied by a move to lower precision (bfloat16). Scaling up language models has been remarkably successful: it significantly improves a model's performance on language and vision tasks, and the resulting models demonstrate strong few-shot capabilities comparable to those of humans.
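To make the storage cost concrete, the following is a rough back-of-the-envelope sketch rather than a measurement. It counts weight storage only (no activations, no key-value cache, no quantization metadata such as scales) and uses the parameter counts quoted above. The helper names `weight_bytes` and `min_devices` and the 80 GB per-device memory figure are assumptions made for illustration.

```python
import math

def weight_bytes(num_params: int, bits_per_weight: float) -> float:
    """Bytes needed to store the weights alone (no activations, no KV cache)."""
    return num_params * bits_per_weight / 8

def min_devices(num_params: int, bits_per_weight: float, device_gb: float = 80.0) -> int:
    """Lower bound on devices needed just to hold the weights.

    device_gb is an assumed per-device memory size, not a figure from the text.
    """
    return math.ceil(weight_bytes(num_params, bits_per_weight) / (device_gb * 1e9))

# Parameter counts as quoted in the paragraph above.
MODELS = {"BERT": 340_000_000, "Megatron-Turing": 530_000_000_000}

for name, n in MODELS.items():
    for bits in (16, 8, 4, 2):
        gb = weight_bytes(n, bits) / 1e9
        print(f"{name:16s} {bits:2d}-bit: {gb:9.2f} GB, >= {min_devices(n, bits)} device(s)")
```

Under these assumptions, a 530-billion-parameter model stored at 16 bits per weight needs about 1,060 GB for its weights, so at least 14 such devices, whereas at 2 bits per weight the same weights occupy about 132.5 GB.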

Efficient deployment of large language models (LLMs) necessitates low-bit quantization to minimize model size and inference cost. While low-bit integer formats (e.g., INT8/INT4) have been the conventional choice, emerging low-bit floating-point formats (e.g., FP8 and 4-bit TAI) offer a compelling alternative. Balancing computational efficiency against model quality, however, remains a formidable challenge.
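The trade-off between bit width and accuracy can be illustrated with a minimal sketch of the conventional integer case. The code below applies symmetric round-to-nearest quantization with a single scale per tensor, which is one common scheme and not necessarily the method developed in this work; the floating-point formats mentioned above are not modelled. The function names and the synthetic Gaussian weights are assumptions chosen for illustration.

```python
import random

def quantize_symmetric(weights, bits):
    """Round-to-nearest symmetric quantization with one scale per tensor."""
    qmax = 2 ** (bits - 1) - 1            # 127 for INT8, 7 for INT4
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Map each weight to the nearest integer code and clamp to [-qmax, qmax].
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale

def dequantize(codes, scale):
    """Recover approximate real-valued weights from integer codes."""
    return [c * scale for c in codes]

def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

# Synthetic weights (assumed Gaussian, for illustration only).
random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(4096)]

for bits in (8, 4):
    codes, scale = quantize_symmetric(weights, bits)
    err = mean_abs_error(weights, dequantize(codes, scale))
    print(f"INT{bits}: scale={scale:.6f}, mean abs error={err:.6f}")
```

Halving the bit width from 8 to 4 makes the quantization step roughly 18 times coarser (127 levels per side versus 7), so the reconstruction error grows accordingly. This is the quality cost that any low-bit format has to keep in check.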