Google has introduced ‘Ironwood,’ the First Google Tensor Processing Unit for the Age of Inference
Yesterday at Google Cloud Next 25, Google introduced Ironwood, its seventh-generation Tensor Processing Unit (TPU), the company's most performant and scalable custom AI accelerator to date and the first designed specifically for inference. Artificial intelligence (AI) inference is the ability of trained AI models to recognize patterns and draw conclusions from information they haven't seen before.
For more than a decade, TPUs have powered Google's most demanding AI training and serving workloads, and have enabled its Cloud customers to do the same. Ironwood is Google's most powerful, capable and energy-efficient TPU yet, and it is purpose-built to power thinking, inferential AI models at scale.
Ironwood represents a significant shift in the development of AI and the infrastructure that powers its progress: a move from responsive AI models that provide real-time information for people to interpret, to models that proactively generate insights and interpretation. This is what Google calls the "age of inference," in which AI agents will proactively retrieve and generate data to collaboratively deliver insights and answers, not just data.
Ironwood is built to support this next phase of generative AI and its tremendous computational and communication requirements. It scales up to 9,216 liquid-cooled chips linked with breakthrough Inter-Chip Interconnect (ICI) networking, with the full configuration drawing nearly 10 MW. It is one of several new components of Google Cloud's AI Hypercomputer architecture, which optimizes hardware and software together for the most demanding AI workloads. With Ironwood, developers can also leverage Google's own Pathways software stack to reliably and easily harness the combined computing power of tens of thousands of Ironwood TPUs.
Powering the age of inference with Ironwood
Ironwood is designed to gracefully manage the complex computation and communication demands of "thinking models," which encompass Large Language Models (LLMs), Mixture of Experts (MoEs) and advanced reasoning tasks. These models require massive parallel processing and efficient memory access. In particular, Ironwood is designed to minimize data movement and latency on chip while carrying out massive tensor manipulations. At the frontier, the computation demands of thinking models extend well beyond the capacity of any single chip. Google designed Ironwood TPUs with a low-latency, high-bandwidth ICI network to support coordinated, synchronous communication at full TPU pod scale.
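To make the idea of a synchronous, pod-scale collective concrete, here is a minimal sketch using the open-source JAX API rather than Google's internal stack; the function and axis names are illustrative assumptions, not part of Ironwood or its software. Each device computes on its own shard and then joins a blocking all-reduce, which is the kind of coordinated communication the ICI network is built to carry.

```python
import functools

import jax
import jax.numpy as jnp

# Parallelize over every accelerator device JAX can see (TPU cores on a TPU VM,
# otherwise GPUs or CPU devices); "devices" names the collective axis.
@functools.partial(jax.pmap, axis_name="devices")
def sharded_sum_of_squares(x):
    local = jnp.sum(x ** 2)                          # independent per-device compute
    return jax.lax.psum(local, axis_name="devices")  # synchronous all-reduce across devices

n_dev = jax.local_device_count()
shards = jnp.arange(n_dev * 4, dtype=jnp.float32).reshape(n_dev, 4)  # one shard per device
print(sharded_sum_of_squares(shards))  # the same reduced value, replicated on every device
```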
For Google Cloud customers, Ironwood comes in two sizes based on AI workload demands: a 256-chip configuration and a 9,216-chip configuration.
- When scaled to 9,216 chips per pod, Ironwood delivers a total of 42.5 Exaflops, more than 24x the compute power of the world's largest supercomputer, El Capitan, which offers 1.7 Exaflops. Ironwood provides the massive parallel processing power necessary for the most demanding AI workloads, such as very large dense LLM or MoE models with thinking capabilities, for both training and inference. Each individual chip boasts peak compute of 4,614 TFLOPs (9,216 chips × 4,614 TFLOPs per chip ≈ 42.5 Exaflops per pod), a monumental leap in AI capability. Ironwood's memory and network architecture ensures that the right data is always available to support peak performance at this massive scale.
- Ironwood also features an enhanced SparseCore, a specialized accelerator for processing the ultra-large embeddings common in advanced ranking and recommendation workloads. Expanded SparseCore support in Ironwood allows a wider range of workloads to be accelerated, moving beyond the traditional AI domain into financial and scientific domains.
- Pathways, Google’s own ML runtime developed by Google DeepMind, enables efficient distributed computing across multiple TPU chips. Pathways on Google Cloud makes moving beyond a single Ironwood Pod straightforward, enabling hundreds of thousands of Ironwood chips to be composed together to rapidly advance the frontiers of gen AI computation.
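To give a flavor of what programming at this scale looks like, here is a minimal sketch using the open-source JAX sharding API (jax.sharding) rather than the internal Pathways runtime itself; the mesh axis names, array shapes and device layout are illustrative assumptions, not details of Ironwood or Pathways. A large matrix multiply is laid out across a logical 2D mesh of devices, and the compiler inserts the inter-chip communication needed to keep every device busy on its own tile of the work.

```python
import jax
import jax.numpy as jnp
import numpy as np
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# Build a logical 2D mesh from whatever accelerator devices are visible
# (TPU cores on a TPU VM; the axis names and sizes here are illustrative).
devices = np.array(jax.devices()).reshape(-1, 1)
mesh = Mesh(devices, axis_names=("data", "model"))

# Shard activations along the 'data' axis and weights along the 'model' axis.
x = jax.device_put(jnp.ones((8192, 4096)), NamedSharding(mesh, P("data", None)))
w = jax.device_put(jnp.ones((4096, 2048)), NamedSharding(mesh, P(None, "model")))

@jax.jit
def forward(x, w):
    # With sharded inputs, jit compiles a partitioned program: each device
    # computes its tile and the runtime handles the cross-device traffic.
    return x @ w

y = forward(x, w)
print(y.sharding)  # the result is itself laid out across the ('data', 'model') mesh
```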