Tachyum Prodigy RAS Features

Introduction

Tachyum Prodigy, the world’s first Universal Processor, unifies the functionality of CPU, GPGPU, and TPU. It was designed from the ground up to provide leading-edge cloud, AI, and HPC performance across a wide range of applications and workloads without the need for costly and power-hungry hardware accelerators.

Prodigy’s new architecture employs state-of-the-art technology, so the design must be complemented by strong reliability, availability, and serviceability (RAS) to ensure that customer platforms are not only high-performance but also reliable and easy to service in the field.

RAS is becoming increasingly important across the industry as devices grow more complex, denser, and more power-hungry, and as data centers grow in size. In its GTC 24 keynote, Nvidia devoted time to the RAS features of its latest products and underscored the importance of RAS while showing a large potential data center deployment that would provide 645 EF of AI performance.

This paper provides an overview of Prodigy’s RAS strategy and describes the key RAS features in each major area that enable Prodigy to meet the high performance and reliability demands of today’s data centers. It complements an earlier Tachyum white paper that showcased a large Prodigy lead customer data center designed to deliver 8 ZF of AI performance, a scale at which RAS will clearly be a critical component.