Maia 200 is Microsoft’s second-generation AI accelerator, optimized specifically for high-volume AI inference and token generation in Azure. It is fabricated on TSMC’s 3 nm process.

Each accelerator has:[1][2]

  • 140B transistors on TSMC 3 nm process

  • Matrix engine: FP8/FP6/FP4
  • Vector engine: BF16/FP16/FP32
  • 216 GB HBM3e
    • 7 TB/s
    • 6× 36 GB 12-high stacks[3] (see the quick arithmetic check after this list)
  • 272 MB on-die SRAM scratchpads (Cluster SRAM (CSRAM) + Tile SRAM (TSRAM))
  • On-die NIC, Ethernet-based
    • 1.4+1.4 TB/s bandwidth per accelerator
    • Split between scale-up within node and scale-out across nodes
    • Collectives supported up to 6,144 accelerators
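
As a quick sanity check on the memory figures above (pure arithmetic on the listed numbers; the per-stack bandwidth split is an inference, not a published figure):

```python
# Sanity check of the HBM figures listed above (arithmetic only, no new data).
stacks = 6
gb_per_stack = 36                     # 12-high HBM3e stacks
print(f"HBM capacity: {stacks * gb_per_stack} GB")        # 216 GB, matching the spec

hbm_bw_tbps = 7.0
print(f"Implied bandwidth per stack: {hbm_bw_tbps / stacks:.2f} TB/s")  # ~1.17 TB/s
```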

The Maia 200 architecture is organized as follows (a software analogy of the tile-level execution flow appears after this list):[2]

  • Tile (smallest autonomous unit)
    • Tile Tensor Unit (TTU): matrix multiply / convolution optimized for FP8/FP6/FP4; supports mixed precision such as FP8 activations × FP4 weights
    • Tile Vector Processor (TVP): programmable SIMD engine; supports FP8 plus BF16/FP16/FP32
    • Tile SRAM (TSRAM): multi-banked local SRAM feeding the tile execution engines
    • Tile DMA: moves data into and out of TSRAM without stalling the compute pipeline
    • Tile Control Processor (TCP): orchestrates TTU and DMA work issuance; hardware semaphores for fine-grained synchronization
  • Cluster (second tier of locality)
    • Multiple tiles per cluster
    • Cluster SRAM (CSRAM): large, multi-banked SRAM shared across tiles in a cluster
    • Cluster DMA: stages traffic between CSRAM and co-packaged HBM
    • Cluster core: control and synchronization for coordinated multi-tile execution
    • Redundancy schemes for tiles and SRAM to improve yield while preserving the hierarchical execution model
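
The TCP/DMA/semaphore description above reads like a classic double-buffering pattern: the control processor keeps the DMA engine filling one TSRAM buffer while the tensor unit drains the other, with semaphores signalling buffer hand-off. The sketch below is only a software analogy of that reading; the names, buffer count, and threading model are hypothetical, not a Maia programming API.

```python
# Software analogy of a tile's double-buffered loop: a "Tile DMA" thread fills
# TSRAM buffers while a "TTU" thread consumes them, with semaphores doing the
# fine-grained hand-off. All names are hypothetical; this is not a Maia API.
import threading

NUM_CHUNKS = 8
tsram = [None, None]                                          # two TSRAM buffers
filled = [threading.Semaphore(0), threading.Semaphore(0)]     # buffer holds fresh data
empty = [threading.Semaphore(1), threading.Semaphore(1)]      # buffer free for DMA reuse

def tile_dma():
    """Stage chunks into TSRAM without waiting for the previous compute to finish."""
    for i in range(NUM_CHUNKS):
        buf = i % 2
        empty[buf].acquire()          # wait until the TTU has drained this buffer
        tsram[buf] = f"chunk-{i}"     # stand-in for a DMA transfer from CSRAM/HBM
        filled[buf].release()         # signal: buffer ready for compute

def tile_ttu():
    """Consume staged chunks; compute on one buffer overlaps the next DMA."""
    for i in range(NUM_CHUNKS):
        buf = i % 2
        filled[buf].acquire()         # wait for the DMA-completion signal
        print(f"TTU processes {tsram[buf]}")   # stand-in for a matmul on the chunk
        empty[buf].release()          # hand the buffer back to the DMA engine

dma = threading.Thread(target=tile_dma)
ttu = threading.Thread(target=tile_ttu)
dma.start(); ttu.start()
dma.join(); ttu.join()
```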

On-chip network:

  • Logical planes separating bulk tensor traffic (data plane) from latency-sensitive control/synchronization (control plane)
  • QoS mechanisms to prioritize critical low-latency traffic
  • Hierarchical broadcast and localized cluster traffic to reduce redundant HBM reads (worked example after this list)
  • Layered DMA hierarchy (Tile DMAs, Cluster DMAs, Network DMAs) to overlap movement with compute
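
To make the redundant-read point concrete: if a weight block has to reach every tile, a flat broadcast pulls it from HBM once per tile, whereas staging it through CSRAM (or a single on-die broadcast) pulls it far fewer times. The cluster and tile counts below are made-up illustration values, not Maia 200’s actual configuration.

```python
# HBM-read savings from hierarchical broadcast (illustrative counts, not Maia's).
clusters = 8
tiles_per_cluster = 4
weight_block_mb = 64

flat = clusters * tiles_per_cluster * weight_block_mb   # every tile reads the block from HBM
per_cluster = clusters * weight_block_mb                # one HBM read per cluster, fan out via CSRAM
single = weight_block_mb                                # one HBM read, broadcast on-chip

print(f"Flat broadcast:        {flat} MB pulled from HBM")
print(f"Per-cluster staging:   {per_cluster} MB pulled from HBM")
print(f"Single-read broadcast: {single} MB pulled from HBM")
```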

Intra-node topology uses Fully Connected Quads (FCQ) between accelerator packages (sounds like their answer to NVLink)
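
A Fully Connected Quad is just the complete graph on the four accelerators in a node: each package links directly to the other three, so any pair is one hop apart with no switch in the path. A trivial enumeration of the links:

```python
# Enumerate the direct links in a Fully Connected Quad (4 accelerators per node).
from itertools import combinations

links = list(combinations(range(4), 2))
print(f"{len(links)} direct links: {links}")   # 6 links; every accelerator has 3 one-hop peers
```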

Interconnect:

  • Ethernet-based with custom Microsoft AI Transport Layer (ATL) protocol
  • Transport-layer features such as packet spraying, multipath routing, and congestion-resistant flow control (sounds like MRC); spraying vs. per-flow hashing is sketched below
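
Packet spraying sends successive packets of the same flow down different equal-cost paths, instead of pinning the whole flow to one path via a 5-tuple hash the way classic ECMP does. The path count and hashing below are illustrative only; they are not ATL’s actual mechanism.

```python
# Per-flow ECMP hashing vs. per-packet spraying across equal-cost paths.
# Path count and hashing are illustrative; this is not ATL's actual mechanism.
NUM_PATHS = 8

def ecmp_path(flow_id: int) -> int:
    """Classic ECMP: every packet of a flow is pinned to the same path."""
    return hash(flow_id) % NUM_PATHS

def sprayed_path(flow_id: int, packet_seq: int) -> int:
    """Packet spraying: successive packets of one flow rotate across all paths."""
    return (hash(flow_id) + packet_seq) % NUM_PATHS

flow = 42
print("ECMP:   ", [ecmp_path(flow) for _ in range(8)])            # one path, can hot-spot
print("Sprayed:", [sprayed_path(flow, seq) for seq in range(8)])  # load spread over all paths
```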

Performance

Maia 200 is designed for low-precision inference throughput within a 750 W TDP.[1][2] A few derived figures follow the table.

| Metric             | Value                                   |
|--------------------|-----------------------------------------|
| Peak FP4           | >10 PFLOPS (10.1 PetaOPS cited)         |
| Peak FP8           | >5 PFLOPS                               |
| HBM capacity       | 216 GB HBM3e                            |
| HBM bandwidth      | 7 TB/s                                  |
| On-die SRAM        | 272 MB (CSRAM + TSRAM)                  |
| Network bandwidth  | 2.8 TB/s bidirectional per accelerator  |
| Power              | 750 W SoC TDP                           |
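
A few derived figures from the table (straight arithmetic on the listed values; nothing here is a published number):

```python
# Derived figures from the table above (arithmetic on listed values only).
fp4_peak_pops = 10.1     # PetaOPS (cited)
hbm_bw_tbps = 7.0        # TB/s
tdp_w = 750.0
sram_mb, hbm_gb = 272, 216

print(f"FP4 ops per HBM byte: {fp4_peak_pops * 1e15 / (hbm_bw_tbps * 1e12):.0f}")  # ~1443
print(f"FP4 TOPS per watt:    {fp4_peak_pops * 1e15 / tdp_w / 1e12:.1f}")          # ~13.5
print(f"On-die SRAM vs HBM:   {sram_mb / (hbm_gb * 1024):.2%} of HBM capacity")    # ~0.12%
```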

System architecture

  • A node/tray is four accelerators
    • One FCQ
    • Direct, non-switched links[1]
  • The network scales out to 6,144 accelerators in a two-tier topology.[1][2]
    • 1,536 nodes/trays
    • This suggests 128-port switches, or 2× 400G × 64-port switches with reliance on packet spraying (see the radix check after this list)
    • The same ATL protocol is used intra-rack and inter-rack,[1] so packet spraying is supported
    • This further suggests an 8-plane multi-plane fat tree
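
The switch-radix guess above can be checked against the standard two-tier Clos bound: with radix-k switches and ports split evenly between downlinks and uplinks, a single non-blocking plane tops out at k²/2 endpoints, so reaching 6,144 endpoints in two tiers wants roughly a 128-port radix, while 64-port switches fall short. A quick enumeration under that (assumed) non-blocking split:

```python
# Two-tier (leaf/spine) fat-tree capacity check for the 6,144-accelerator scale-out.
# Assumes switch ports split evenly between downlinks and uplinks (non-blocking);
# the radix choices are speculation consistent with the text above.
TARGET_ENDPOINTS = 6_144

def max_endpoints_two_tier(radix: int) -> int:
    """Max endpoints in a non-blocking two-tier Clos: radix leaves x radix/2 hosts each."""
    return radix * (radix // 2)

for radix in (64, 128):
    cap = max_endpoints_two_tier(radix)
    verdict = "fits" if cap >= TARGET_ENDPOINTS else "needs more tiers or planes"
    print(f"radix {radix:3d}: up to {cap:5d} endpoints per plane -> {verdict}")

print(f"Nodes at 4 accelerators each: {TARGET_ENDPOINTS // 4}")   # 1,536 nodes/trays
```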

Footnotes

  1. https://blogs.microsoft.com/blog/2026/01/26/maia-200-the-ai-accelerator-built-for-inference/

  2. https://techcommunity.microsoft.com/blog/azureinfrastructureblog/deep-dive-into-the-maia-200-architecture/4489312

  3. Can infer this from the HBM3e packaging and the die photos showing six stacks