White Papers on Tachyum

Tachyum Prodigy Solutions for Post-Quantum Cryptography

Tue, 21 Jan 2025 00:00:00 +0000

Quantum computing is quickly emerging as an exciting new leading-edge technology that is moving from theory to practical applications. It is being driven by applications that classical computers do not address well, including problems that involve exponential scaling. Two examples of this type of problem are optimization and chemistry.

An example optimization problem taken from daily life is choosing a seating configuration around a table for a meeting in a conference room. If the meeting has four participants, the number of possible seating configurations is 4!, or 24. With eight participants the number increases to 40,320, and with 10 people seated around the conference room table it grows to 3,628,800. This is an example of a problem that scales factorially, which is even faster than exponential growth, making it impractical for a classical computer to solve exhaustively as the number of participants becomes very large. Classical computers typically handle such problems by approximation.
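The seating counts quoted above can be reproduced with a short Python sketch. The function name `seating_arrangements` is introduced here purely for illustration, and the sketch assumes every seat is distinct, so that n participants yield n! arrangements.

```python
from math import factorial


def seating_arrangements(people: int) -> int:
    """Count the ways to assign `people` participants to `people` distinct seats."""
    return factorial(people)


# Print the growth for the meeting sizes mentioned in the text, plus a larger one.
for n in (4, 8, 10, 20):
    print(f"{n:>2} participants -> {seating_arrangements(n):,} arrangements")
```

Even modest meeting sizes produce counts that are impractical to enumerate one by one, which is why exhaustive search is avoided in practice.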

Chemistry is another area where quantum computers are showing a lot of promise. Simulating molecular clusters is very challenging for classical computers since the simulations need to account for every electron-electron repulsion and every attraction of each electron to the nuclei, so currently even the largest supercomputers can only simulate very small molecular clusters.

There are fundamental differences between classical computing and quantum computing that allow quantum computers to address the above types of problems much more efficiently than classical computers. First of all, classical computers use bits that are either in the “0” or “1” state, so everything is a series of 0s and 1s, and the capacity of a classical computer increases linearly with the number of bits. Quantum computers use qubits instead of bits, which have unique characteristics that don’t exist in classical computers.

A qubit can represent a 0 or 1 like a classical bit, but it also obeys quantum rules, which give it some very interesting properties. One property is superposition. Quantum superposition is the ability of a quantum system to act as if it is in multiple states at the same time until it is measured. One qubit can represent not just 0 or 1, but a superposition of 0 and 1. In addition, complex superpositions can exist, so two qubits can be in a superposition of four states, three qubits in a superposition of eight states, and so on. This means that the number of states a quantum computer can work with at once grows exponentially with the number of qubits.
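One way to get a feel for this exponential growth is to count the basis states of an n-qubit register and estimate the memory a classical machine would need to store one complex amplitude per state. The sketch below is illustrative only; the function names and the 16-bytes-per-amplitude figure (one double-precision complex number) are assumptions, not something stated in the paper.

```python
def num_basis_states(qubits: int) -> int:
    """An n-qubit register spans 2**n basis states."""
    return 2 ** qubits


def statevector_bytes(qubits: int, bytes_per_amplitude: int = 16) -> int:
    """Classical memory needed to hold one amplitude per basis state.

    Assumption: each amplitude is a double-precision complex number (16 bytes).
    """
    return num_basis_states(qubits) * bytes_per_amplitude


for n in (2, 3, 30, 50):
    gib = statevector_bytes(n) / 2**30
    print(f"{n:>2} qubits -> {num_basis_states(n):,} states, {gib:,.6f} GiB")
```

Each added qubit doubles the classical storage required, which is the flip side of the exponential state space described above.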

Another property is entanglement. When qubits are entangled, measuring one of them immediately fixes the correlated outcome that will be observed for the other, so the states of entangled qubits cannot be described independently of each other. Quantum computers utilize superposition and entanglement to solve problems that are difficult if not impossible for classical computers, such as the problems described above.

Quantum computing is still in the early stages, but it is growing and maturing quickly. In 2019 Google announced the largest quantum computer at that time, with 54 qubits, and as of late 2024 IBM’s Condor was considered to be the largest, with 1,121 qubits. Besides increasing the number of qubits, another critical area of research is developing reliable error correction (ECC), as the qubits in current systems are quite noisy. Incorporating reliable, scalable ECC will be a key area of focus as quantum computers continue to mature. At this point quantum computers remain at the academic level, suited to limited applications.

Tachyum Prodigy RAS Features

Tue, 26 Mar 2024 00:00:00 +0000

Introduction

Tachyum Prodigy, the world’s first Universal Processor which unifies the functionality of CPU, GPGPU, and TPU, was designed from the ground up to provide leading-edge cloud, AI, and HPC performance across a wide range of applications and workloads without the need for costly and power-hungry hardware accelerators.

Prodigy’s revolutionary new architecture employs state-of-the-art technology which requires that the design be complemented by high reliability, availability, and serviceability, or RAS, to ensure that customer platforms are not only high performance but reliable and easy to service in the field.

RAS is becoming increasingly important across the industry as devices become more complex, denser, and more power-hungry while data centers grow in size. In its recent GTC 24 keynote, Nvidia stressed the importance of RAS, spending valuable keynote time on RAS features as part of its new product overview. The point was underscored when Nvidia showed a large potential data center deployment that would provide 645 EF of AI performance.

This paper provides an overview of Prodigy’s RAS strategy and delves into the key RAS features in each major area that enable Prodigy to provide a complete solution for the high performance and reliability demands of today’s data centers. The paper complements an earlier Tachyum white paper which showcased a large Prodigy lead customer data center designed to run 8 ZF of AI performance where RAS will clearly be a critical component.

Tachyum’s Prodigy Evolves to Next Level: 1U/2U Solutions for Data Centers

Sun, 25 Feb 2024 00:00:00 +0000

Introduction

Tachyum Prodigy, the world’s first Universal Processor, was designed from the ground up to provide leading-edge AI features to address the emerging demand for AI across a wide range of applications and workloads. Prodigy’s revolutionary new architecture unifies the functionality of CPU, GPGPU, and TPU to address today’s ever-increasing demands for AI, HPC, and cloud without costly and power-hungry accelerators.

The high-performance processing that Prodigy offers requires a system-level strategy to ensure that customers and partners can quickly and easily test, benchmark, and deploy Prodigy solutions across the broad array of applications and workloads that Prodigy supports. This paper presents Prodigy’s data center strategy to deliver 1U and 2U platform solutions to the market.

Tachyum’s Prodigy ATX Platform - Democratizing AI for Everyone

Thu, 08 Feb 2024 00:00:00 +0000

Introduction

Tachyum Prodigy, the world’s first Universal Processor, was designed from the ground up to provide leading-edge AI features to address the emerging demand for AI across a wide range of applications and workloads. Prodigy’s revolutionary new architecture unifies the functionality of CPU, GPGPU, and TPU to address a wide range of workloads, including today’s ever-increasing AI demands without costly and power-hungry accelerators.

In addition to its unified architecture, Prodigy’s AI subsystem incorporates groundbreaking features that deliver high performance and efficiency for AI applications, including the 4-bit TAI exponential data type and multiple levels of sparse matrix processing, which enable Prodigy to process large language models (LLMs) with 2-bit effective weights, providing never-before-seen efficiency. In addition, Prodigy integrates up to 16 DDR5 memory controllers to provide unprecedented memory bandwidth and capacity. Prodigy’s powerful AI capabilities enable LLMs to run much more easily and cost-effectively than on existing CPU + GPGPU based systems. A single 96-core Prodigy with 1 TB of memory can run a ChatGPT4 model with 1.7 trillion parameters, whereas running the same model requires 52 Nvidia H100 GPUs at significantly higher cost and power consumption.
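A back-of-the-envelope calculation helps show why the bit width per weight matters for fitting a 1.7-trillion-parameter model into 1 TB of memory. The sketch below is an illustration under simplifying assumptions: it counts weight storage only (no activations, caches, or format metadata), uses decimal gigabytes, and the helper name `weight_memory_gb` is hypothetical.

```python
def weight_memory_gb(parameters: float, bits_per_weight: float) -> float:
    """Approximate weight storage in decimal gigabytes (weights only)."""
    return parameters * bits_per_weight / 8 / 1e9


PARAMS = 1.7e12  # parameter count quoted in the text

for label, bits in (("16-bit", 16), ("8-bit", 8), ("4-bit", 4), ("2-bit effective", 2)):
    print(f"{label:>16}: {weight_memory_gb(PARAMS, bits):,.0f} GB")
```

Under these assumptions, only the lowest bit widths bring the weights of such a model below the 1 TB mark.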

This paper presents the Prodigy ATX Platform, focusing on the hardware architecture, target applications, and how it will democratize AI for those who wouldn’t normally have access to sophisticated AI models. The Prodigy ATX Platform allows everyone to develop and run cutting edge AI models for as low as $5,000 in an entry-level platform SKU configuration featuring a 48-core Prodigy and 256 GB of DDR5 memory.

Tachyum Prodigy Universal Processor Enabling 50 EF / 8 AI ZF Supercomputers in 2025

Tue, 12 Dec 2023 00:00:00 +0000

Introduction

Supercomputing reached a crucial milestone in 2022 with the deployment of Frontier, the first exascale supercomputer, at Oak Ridge National Laboratory (ORNL). Frontier runs at 1.2 exaflops (EF) of FP64 performance. 2023 promises more of the same with the rollout of Aurora at Argonne National Laboratory (ANL), which is estimated to provide more than 2 EF of performance when it’s fully deployed. Additional exaflop clusters will follow, driven by the ever-expanding needs of compute-intensive workloads such as quantum mechanics, weather forecasting, oil and gas exploration, molecular modeling, aerodynamics, nuclear fusion research, and cryptanalysis.

Hyperscalers are offering large HPC systems in the cloud, and in 2023 a system named Eagle, installed in the Microsoft Azure Cloud, took the No. 3 spot on the TOP500 list with 561.2 PF of performance. This is the highest rank a cloud system has ever achieved in the TOP500. Two years ago, a previous Azure system was the first cloud system ever to enter the TOP10, in the number 10 spot.

Looking ahead to next-generation supercomputers, the US Dept. of Energy (DOE), which runs several of the world’s premier supercomputing labs including ORNL and ANL, issued an RFI (request for information) in June of 2022 outlining its supercomputer requirements for the 2025 – 2030 timeframe. Key supercomputer requirements outlined in the RFI for 2025 – 2030 include the following:

HPC Performance (FP64): 10 – 20+ EF
AI Performance: 8 – 16x FP64 performance: 80 – 320+ EF
Supercomputer area: 4,000 – 6,000 square feet

Earlier this year, Tachyum announced that it had accepted a purchase order from a US company to build a large supercomputer that far exceeds not only the performance of existing supercomputers, but also the target performance for next-generation systems. The new supercomputer will be built with Tachyum’s Prodigy Universal Processor, delivering the unprecedented performance of 50 EF of FP64, and 8 ZF of AI training for large language models. Prodigy’s revolutionary new architecture coupled with advanced system scalability enables this extraordinary performance.

This paper provides an overview of the Tachyum Prodigy Universal Processor, summarizes how the Prodigy Family differs from existing processor architectures, discusses the Prodigy roadmap, and presents details for the new HPC/AI supercomputer data center design developed by Tachyum’s world-class systems, solutions, and software engineering teams.

Mainstreaming Large Language Models With 2-bit TAI Weights

Tue, 14 Nov 2023 00:00:00 +0000

Introduction

Generative Pre-trained Transformer (GPT) models set themselves apart through breakthrough performance across complex language modelling tasks, such as text generation, few-shot learning, reasoning, and protein sequence modelling, but also through their extremely high computational and storage costs. Specifically, due to their massive size, even inference for large, highly accurate GPT models may require multiple high-performance GPUs, which limits the usability of such models.

The capacity of language models has increased dramatically, by more than 1,000 times within a few years, from BERT with 340 million parameters to Megatron-Turing with 530 billion dense parameters and to the Switch Transformer with 1.6 trillion sparse parameters at lower precision (bfloat16). Scaling up language models has been incredibly successful. It significantly improves a model’s performance on language and vision tasks, and the models demonstrate amazing few-shot capabilities similar to those of human beings.

Efficient deployment of large language models (LLMs) necessitates low-bit quantization to minimize model size and inference cost. While low-bit integer formats (e.g., INT8/INT4) have been the conventional choice, emerging low-bit floating-point formats (e.g., FP8/4-bit TAI) offer a compelling alternative. Striking a balance between computational efficiency and model quality is a formidable challenge.

Image AI Processing at the Next Level With 4b TAI & 2b Effective Weights

Tue, 24 Oct 2023 00:00:00 +0000

Introduction

Neural network quantization is one of the most effective ways to improve the efficiency of neural networks. Quantization allows weights, activations, and gradients to be represented in low bit-width formats. This leads to a reduction in data movement and enables the use of low bit-width computations, resulting in faster inference and lower energy consumption.
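As a concrete illustration of representing weights in a low bit-width format, the sketch below applies generic symmetric 8-bit integer quantization to a short list of values. It is a textbook-style example rather than the scheme used in the paper, and the function names `quantize_int8` and `dequantize` are hypothetical.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization to the integer range [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    # One scale factor for the whole tensor; fall back to 1.0 for all-zero input.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Map the stored integers back to approximate real values."""
    return [q * scale for q in quantized]


weights = [0.5, -1.0, 0.25, 0.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
print("int8 codes:", q)
print("max error :", max(abs(a - b) for a, b in zip(weights, restored)))
```

Each weight is stored in one byte plus a shared scale, and the reconstruction error is bounded by half of one quantization step.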

Integer and floating-point formats are two primary ways to represent numbers in computing systems. The key distinction between them lies in the value distribution. Integer format has a uniform distribution across the representable range with a difference of 1 between two consecutive numbers. The floating-point format exhibits a non-uniform distribution, thus providing higher precision for smaller numbers and lower precision for larger ones.
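The difference in value distribution can be seen by enumerating every value of a 4-bit integer format alongside a 4-bit floating-point format. The sketch below assumes a generic layout with 1 sign bit, 2 exponent bits, 1 mantissa bit, and an exponent bias of 1; this layout is chosen only for illustration and is not the TAI format, whose encoding is not described here.

```python
def int4_values():
    """All values of a signed 4-bit two's-complement integer."""
    return list(range(-8, 8))


def fp4_values(bias: int = 1):
    """All values of an assumed 1-sign / 2-exponent / 1-mantissa 4-bit float."""
    values = set()
    for sign in (1, -1):
        for exp in range(4):
            for man in range(2):
                if exp == 0:
                    magnitude = (man / 2) * 2 ** (1 - bias)  # subnormal range
                else:
                    magnitude = (1 + man / 2) * 2 ** (exp - bias)
                values.add(sign * magnitude)
    return sorted(values)


def gaps(values):
    """Spacing between consecutive representable values."""
    return [b - a for a, b in zip(values, values[1:])]


positive_fp4 = [v for v in fp4_values() if v >= 0]
print("INT4 gaps:", sorted(set(gaps(int4_values()))))
print("FP4 values (non-negative):", positive_fp4)
print("FP4 gaps  (non-negative):", gaps(positive_fp4))
```

The integer gaps are all equal, while the floating-point gaps widen as the magnitude grows, which is the non-uniform distribution described above.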

Low-bit floating-point formats have recently emerged as promising alternatives for DNN quantization. Leading hardware vendors such as NVIDIA already support FP8 in H100.

The optimal format is influenced by a combination of factors, including the static or dynamic nature of tensors and the quantization bit-width. As evidenced by other research, for weight tensors with static distribution, INT quantization outperforms FP quantization at 8-bit, but this advantage diminishes at 4-bit. For activation tensors with dynamic distribution, FP quantization surpasses INT quantization because FP can represent large values with lower precision and small values with higher precision.

In this paper we describe the Prodigy 4-bit Tachyum AI (TAI) format, including the effectively 2-bit-per-weight (TAI2) format, end-to-end training techniques with TAI, and inference using TAI, and we provide experimental results.

Unprecedented Scale and Efficiency in Generative AI with FP8 8:3 Super-Sparsity

Tue, 22 Aug 2023 00:00:00 +0000

Abstract

The current trend in the field of deep learning is to increase the number of model parameters so that models can handle difficult tasks and model complex nonlinear systems, as in natural language processing. However, this trend has a downside: the complex system requirements for training such large models make them the prerogative of just a few labs in the world. This paper focuses on the benefits of compression for models both during and after training. Our work shows that it is possible to jointly quantize the model to 8-bit precision during training and then perform block pruning, which leaves only 37% of the parameters. This method significantly increases the speed of training and at the same time reduces the memory footprint of the model after training. Moreover, if the hardware on which the model runs effectively supports 8-bit operations and sparse matrices, this further accelerates the model.
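To make the idea of block pruning concrete, the sketch below keeps the three largest-magnitude weights in every block of eight and zeroes the rest, leaving 3/8 (37.5%) of the parameters. This is an assumed reading of the "8:3" pattern in the title, written as a generic magnitude-based example; it is not the authors' algorithm, and the function name `prune_n_of_m` is hypothetical.

```python
def prune_n_of_m(weights, keep: int = 3, block: int = 8):
    """Zero all but the `keep` largest-magnitude weights in each block."""
    pruned = []
    for start in range(0, len(weights), block):
        chunk = weights[start:start + block]
        # Rank positions in this block by absolute value, largest first.
        ranked = sorted(range(len(chunk)), key=lambda i: abs(chunk[i]), reverse=True)
        keep_idx = set(ranked[:keep])
        pruned.extend(w if i in keep_idx else 0.0 for i, w in enumerate(chunk))
    return pruned


row = [0.1, -0.9, 0.3, 0.05, 0.7, -0.2, 0.0, 0.4]
sparse_row = prune_n_of_m(row)
density = sum(1 for w in sparse_row if w != 0.0) / len(sparse_row)
print(sparse_row)
print(f"density: {density:.3f}")
```

Because the kept positions follow a fixed count per block, hardware with sparse matrix support can store and skip the zeros in a regular way.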

Key words

8-bit Floating Point, Quantization Aware Training, Post-training Quantization, Block Pruning, Model Compression

Tachyum Prodigy - Solution for Data Centers, That Are Hungry for Energy

Tue, 25 Jul 2023 00:00:00 +0000

The growing number of internet-connected devices, growing demand for compute-intensive applications like AI, the remote work culture, and the race to digitize business processes will require a massive expansion of data center capacity, which will massively increase electricity consumption.

Credit Unions, Blockchain, CBDC, FinTech, and Tachyum Prodigy

Tue, 18 Jul 2023 00:00:00 +0000

Since the Federal Credit Union Act of 1934 became law, U.S. Credit Unions have consistently been the early adopters of new product trends and technologies in banking.

The Federal Credit Union Act established the National Credit Union Administration (NCUA) as a regulatory body that oversees the unique needs of cooperative banking, apart from the retail and commercial banking sector, regulated by the Federal Deposit Insurance Corporation (FDIC).

Part of the motivation for the Credit Union Act of 1934 was to counter an ongoing power grab by Industrialists, and to give banking cooperatives, savings and loans, and mutuals, a way to exist without excessive regulation and control – particularly regulation and control dictated by central banks.

Tachyum Prodigy Artificial Intelligence Second Edition

Wed, 17 May 2023 00:00:00 +0000

Introduction

Tachyum Prodigy is the industry’s first universal processor, unifying the functionality of CPU, GPGPU, and TPU into a single monolithic chip. Prodigy’s revolutionary new architecture provides 6x more raw performance on AI training and inference workloads than the industry’s highest performing GPU, and up to 10x performance at the same power. Prodigy’s features include 128 high performance 64-bit processor cores running up to 5.7GHz with each core integrating a cutting-edge AI subsystem that includes a 4096-bit matrix processor supporting 16x16, 8x8, and 4x4 matrix operations. In addition, Prodigy’s memory subsystem integrates 16 DDR5 memory controllers that run up to DDR5-7200, providing the memory bandwidth and capacity to enable the highest performance processing of the most complex AI models.

This paper presents details of Prodigy’s AI subsystem and architecture, providing a deep dive into Prodigy’s AI features and how they deliver the highest performance for today’s demanding applications and workloads. The topics covered include trends in the industry driving the need for higher performance, mixed precision training, FP8 quantization, and sparsity, including Tachyum-invented super-sparsity.

Tachyum Prodigy Universal Processor Enabling Next Generation 20+ EF Supercomputers

Wed, 12 Apr 2023 00:00:00 +0000

Introduction

Supercomputing reached a crucial milestone in 2022 with the deployment of Frontier, the first exascale supercomputer, at Oak Ridge National Laboratory (ORNL). Frontier runs at 1.2 exaflops (EF) of FP64 performance while consuming 21 megawatts. 2023 promises more of the same with the rollout of Aurora at Argonne National Laboratory (ANL), which is estimated to provide more than 2 EF of performance. Additional exaflop clusters will follow, driven by the ever-expanding needs of compute-intensive workloads such as quantum mechanics, weather forecasting, oil and gas exploration, molecular modeling, aerodynamics, nuclear fusion research, and cryptanalysis.

HPC Performance (FP64): 10 – 20+ EF
AI Performance: 8 – 16x FP64 performance: 80 – 320+ EF
Power Consumption: 20 – 60 MW
Supercomputer area: 4,000 – 6,000 square feet
- 4,000 sq ft is area consumed by ORNL Frontier
- 6,000 sq ft adds a 1.5x multiplier to Frontier for future supercomputers

This paper provides an overview of the Tachyum Prodigy Universal Processor Family, summarizes how the Prodigy Family differs from existing processor architectures, discusses the Prodigy roadmap, and presents HPC/AI supercomputer data center designs developed by Tachyum’s world-class systems, solutions, and software engineering teams for the first two generations of the Prodigy Family, Prodigy and Prodigy 2. The system designs presented assume the requirements outlined above from the recent DOE RFI as a baseline for the 2025 – 2030 timeframe.

Tachyum HPC/AI Software Stack for Prodigy Data Center Deployments

Thu, 16 Mar 2023 00:00:00 +0000

Introduction

Tachyum is best known for its flagship product, Prodigy, the world’s first Universal Processor that unifies the functionality of CPU, GPGPU, and TPU into a single monolithic device. The Prodigy Universal Processor delivers industry leading performance and efficiency for next generation supercomputers.

For customers in the HPC/AI area, the software environment is crucial. This document presents the standard software stack for HPC/AI. All of these software packages will be offered and maintained by the Tachyum team. In several areas, alternatives for the same purpose are presented, and it is up to the customer to choose the best fit. The creation of the Tachyum software stack is driven by Tachyum’s early customers and partners. Open source software has been the preferred choice for the Tachyum HPC/AI software stack.

Air Dominance Powered by Prodigy

Tue, 07 Feb 2023 00:00:00 +0000

Background

Air superiority is a prerequisite for successful joint combat operations. Without control of the airspace in a contested theatre of operations, military objectives on the ground can never be fully realized.

Today, the United States Air Force is operating a fighter fleet perched on the brink of disaster. Most of the service’s air superiority aircraft were designed at the conclusion of the Vietnam War, produced in the 1980s, and are ill-suited to meet future threats. With only 186 F-22 air superiority fighters (versus 750 planned) and about 200 F-35 multi-role aircraft to complement its aging 4th generation fighters, America has too few fighters to defend its territory, and the territory of its allies in the face of 21st century threats. The National Defense Strategy Commission recently concluded that America’s hard military power has eroded “to a dangerous degree.” America’s ability to defend its allies, partners, and its own vital interests against a modern combat threat is increasingly in doubt, the commission stated, and if the US does not act promptly, the consequences will be “grave and lasting.”

The modern threat environment is defined by 5th generation (soon to be AI-enabled 6th gen) adversaries in the air and on the ground, which require exponential improvements in combat capabilities over 4th generation fighters, even though 4th generation fighters comprise over 85% of our combat aircraft.

Tachyum Storage Strategy and Direction for Prodigy Data Center Deployments

Fri, 20 Jan 2023 00:00:00 +0000

Introduction

This paper begins with an overview of the different types of storage deployed in data centers and how they are used. It continues with the strategy that Tachyum has developed to provide the optimal storage infrastructure to support Prodigy compute nodes in upcoming supercomputer deployments and with the storage strategy implemented by a leading HPC lab. The paper concludes with a glimpse into Tachyum’s future direction based on the trends in high performance storage technologies.

Tachyum Networking Strategy and Direction for Prodigy Data Center Deployments

Wed, 26 Oct 2022 00:00:00 +0000

Introduction

Tachyum is best known for its flagship product, Prodigy, the world’s first Universal Processor that unifies the functionality of CPU, GPGPU, and TPU into a single monolithic device.

The Prodigy Universal Processor delivers industry leading performance and efficiency for next generation supercomputers and requires cutting-edge networking infrastructure to complement the computing power for the supercomputer clusters. This paper begins with an overview of modern-day networking and how it has evolved since the invention of Ethernet and continues with the strategy that Tachyum has developed to provide the optimal networking infrastructure to support Prodigy compute nodes in upcoming supercomputer deployments.

The paper concludes with a glimpse into Tachyum’s future direction based on the trends in high performance networking technologies.

Tachyum Prodigy Architectural Overview

Tue, 04 Oct 2022 00:00:00 +0000

Introduction

Tachyum’s Prodigy Universal Processor addresses the major challenges and pain points facing today’s data centers, including increasingly high power consumption, low server utilization, and the processor performance plateau that has occurred over the past two decades.

Prodigy, the world’s first Universal Processor, delivers a revolutionary new architecture that unifies the functionality of CPU, GPGPU, and TPU into a single monolithic device, enabling Prodigy processors to address the high demands of cloud and HPC/AI workloads, without expensive and power-hungry accelerators, by using a simple homogeneous software model that is aligned with software composability and dynamic reallocation of server resources to maximize utilization.

Processor performance plateauing has its root cause in increasing wire delays on the processor silicon. As the silicon process shrinks, the transistors speed up, but the wires slow down, and we are now at the point where performance is being throttled by wire delays. Since the resistance of a wire per unit length is inversely proportional to its cross-sectional area, the resistance increases with the square of the process shrink, so a 10x decrease in process geometry results in a 100x increase in resistance, which is proportional to wire delay. The industry’s conversion from aluminum to copper interconnects and the use of low-k dielectrics have helped, but wire delays have still become the dominant factor limiting processor performance increases from generation to generation.
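The square-law scaling described above follows from the standard relation that a wire's resistance per unit length equals the material resistivity divided by its cross-sectional area. The sketch below is a simplified illustration: it assumes the wire's width and height both shrink by the same factor and that the material resistivity stays constant, and the function names are hypothetical.

```python
def resistance_per_length(resistivity: float, width: float, height: float) -> float:
    """Resistance per unit length of a rectangular wire: rho / (width * height)."""
    return resistivity / (width * height)


def relative_increase(shrink: float) -> float:
    """Factor by which resistance per unit length grows for a given linear shrink.

    Assumption: width and height both scale by 1/shrink; resistivity is unchanged.
    """
    baseline = resistance_per_length(1.0, 1.0, 1.0)
    scaled = resistance_per_length(1.0, 1.0 / shrink, 1.0 / shrink)
    return scaled / baseline


for shrink in (2, 5, 10):
    print(f"{shrink:>2}x shrink -> {relative_increase(shrink):.0f}x resistance per unit length")
```

Under these assumptions the growth is quadratic in the shrink factor, matching the 10x to 100x example in the text.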

Prodigy addresses the processor performance plateau caused by slow wires with architectural innovations that minimize data transmission over the slow wires. It does this by keeping a CPU calculation and the input data required for that calculation local, thus circumventing the core problem limiting CPU performance: slow wires.