Maia 200 is Microsoft’s second-generation AI accelerator, optimized for high-volume AI inference and token generation in Azure. It is fabricated on TSMC’s 3 nm process.
- 140B transistors on TSMC 3 nm process
- Matrix engine: FP8/FP6/FP4
- Vector engine: BF16/FP16/FP32
- 216 GB HBM3e
  - 7 TB/s HBM bandwidth
  - 6x 36 GB 12-high stacks[^3] (sanity-checked in the sketch after this list)
- 272 MB on-die SRAM scratchpads (Cluster SRAM (CSRAM) + Tile SRAM (TSRAM))
- On-die NIC, Ethernet-based
  - 1.4+1.4 TB/s bandwidth per accelerator
  - Split between scale-up within the node and scale-out across nodes
  - Collectives supported up to 6,144 accelerators
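A quick sanity check on the six-stack inference in footnote 3: splitting the quoted 7 TB/s across six stacks lands right in HBM3e's per-stack range. A minimal sketch (the 1024-bit per-stack interface width is standard HBM, not something the cited posts state):

```python
# Sanity check: split the quoted 7 TB/s across six HBM3e stacks and back out
# the implied per-pin signaling rate.
HBM_TOTAL_TBPS = 7.0      # aggregate HBM bandwidth from the spec list
NUM_STACKS = 6            # 6 x 36 GB = 216 GB total capacity
BITS_PER_STACK_IF = 1024  # standard HBM interface width per stack (assumption)

per_stack_tbps = HBM_TOTAL_TBPS / NUM_STACKS                   # ~1.17 TB/s
pin_rate_gbps = per_stack_tbps * 1e12 * 8 / BITS_PER_STACK_IF / 1e9

print(f"capacity check: {NUM_STACKS * 36} GB")                 # 216 GB
print(f"per-stack bandwidth: {per_stack_tbps:.2f} TB/s")
print(f"implied per-pin rate: {pin_rate_gbps:.1f} Gb/s")       # ~9.1 Gb/s, within HBM3e's range
```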
The Maia 200 architecture is organized hierarchically (a sketch of the tile-level execution flow follows the list):[^2]
- Tile (smallest autonomous unit)
  - Tile Tensor Unit (TTU): matrix multiply / convolution optimized for FP8/FP6/FP4; supports mixed precision such as FP8 activations × FP4 weights
  - Tile Vector Processor (TVP): programmable SIMD engine; supports FP8 plus BF16/FP16/FP32
  - Tile SRAM (TSRAM): multi-banked local SRAM feeding the tile execution engines
  - Tile DMA: moves data into and out of TSRAM without stalling the compute pipeline
  - Tile Control Processor (TCP): orchestrates TTU and DMA work issuance; hardware semaphores for fine-grained synchronization
- Cluster (second tier of locality)
  - Multiple tiles per cluster
  - Cluster SRAM (CSRAM): large, multi-banked SRAM shared across tiles in a cluster
  - Cluster DMA: stages traffic between CSRAM and co-packaged HBM
  - Cluster core: control and synchronization for coordinated multi-tile execution
- Redundancy schemes for tiles and SRAM to improve yield while preserving the hierarchical execution model
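To make the tile-level flow concrete, here is a minimal, runnable simulation of the double-buffering pattern that the TCP/semaphore description implies: a stand-in Tile DMA fills one TSRAM slot while a stand-in TTU drains the other. Every name below is invented for illustration; the cited posts don't describe Maia's actual programming interface.

```python
import threading

# Illustrative only: two TSRAM staging slots, with semaphores providing the
# fine-grained synchronization the TCP is described as orchestrating.
NUM_TILES = 8
bufs = [None, None]                                   # two TSRAM staging slots
sem_loaded = [threading.Semaphore(0), threading.Semaphore(0)]   # data resident
sem_freed = [threading.Semaphore(1), threading.Semaphore(1)]    # slot reusable

def tile_dma():                                       # stand-in for the Tile DMA engine
    for i in range(NUM_TILES):
        b = i % 2
        sem_freed[b].acquire()                        # wait until the TTU drained this slot
        bufs[b] = f"weight/activation tile {i}"       # CSRAM -> TSRAM copy
        sem_loaded[b].release()                       # signal data is resident

def ttu():                                            # stand-in for the Tile Tensor Unit
    for i in range(NUM_TILES):
        b = i % 2
        sem_loaded[b].acquire()                       # wait for the DMA to land
        print(f"TTU matmul on {bufs[b]}")             # low-precision matmul would run here
        sem_freed[b].release()                        # slot can be refilled

threads = [threading.Thread(target=tile_dma), threading.Thread(target=ttu)]
for t in threads: t.start()
for t in threads: t.join()
```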
On-chip network:
- Logical planes separating bulk tensor traffic (data plane) from latency-sensitive control/synchronization (control plane)
- QoS mechanisms to prioritize critical low-latency traffic
- Hierarchical broadcast and localized cluster traffic to reduce redundant HBM reads
- Layered DMA hierarchy (Tile DMAs, Cluster DMAs, Network DMAs) to overlap data movement with compute (see the toy model after this list)
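A toy model of why the hierarchy matters for HBM traffic: with cluster-level staging and broadcast, a weight tile shared by the tiles in a cluster is read from HBM once per cluster rather than once per tile. The tile and cluster counts below are made up for illustration; the posts don't give them.

```python
# Toy accounting of redundant HBM reads avoided by hierarchical broadcast.
TILES_PER_CLUSTER = 16     # illustrative, not from the cited posts
CLUSTERS = 8               # illustrative, not from the cited posts
SHARED_WEIGHT_TILES = 1000

hbm_reads_flat = SHARED_WEIGHT_TILES * CLUSTERS * TILES_PER_CLUSTER  # every tile pulls its own copy
hbm_reads_hier = SHARED_WEIGHT_TILES * CLUSTERS                      # one Cluster DMA read into CSRAM,
                                                                     # then Tile DMAs fan out on-chip
print(f"HBM reads, per-tile fetch:      {hbm_reads_flat}")
print(f"HBM reads, cluster + broadcast: {hbm_reads_hier}  ({TILES_PER_CLUSTER}x fewer)")
```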
Intra-node topology uses Fully Connected Quads (FCQ) between accelerator packages (sounds like their answer to NVLink)
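Back-of-the-envelope, under my own assumption about how the budget divides: if the 1.4 TB/s scale-up share is spread evenly over an accelerator's three direct links to its quad peers, each link carries roughly 470 GB/s.

```python
# Per-link bandwidth inside a Fully Connected Quad, assuming the 1.4 TB/s
# scale-up share is split evenly across the three direct peer links.
SCALE_UP_TBPS = 1.4
PEER_LINKS = 4 - 1        # fully connected quad: 3 direct links per accelerator

print(f"~{SCALE_UP_TBPS * 1000 / PEER_LINKS:.0f} GB/s per direct link")  # ~467 GB/s
```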
Interconnect:
- Ethernet-based with custom Microsoft AI Transport Layer (ATL) protocol
- Transport-layer features such as packet spraying, multipath routing, and congestion-resistant flow control (sounds like MRC)
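For readers who haven't met packet spraying: instead of hashing a whole flow onto one path (ECMP-style), successive packets of the same flow are scattered across all equal-cost paths, which evens out link utilization at the cost of in-order delivery. A toy illustration of the difference, not ATL itself (whose wire behavior isn't documented in these posts):

```python
import itertools

# ECMP pins a flow to one path via a hash; spraying round-robins its packets
# across every equal-cost path and relies on the transport to tolerate reordering.
paths = ["plane0", "plane1", "plane2", "plane3"]
flow_id, packets = 42, [f"pkt{i}" for i in range(8)]

hashed_path = paths[hash(flow_id) % len(paths)]        # ECMP: all packets on one path
sprayed = list(zip(itertools.cycle(paths), packets))   # spraying: packets spread over paths

print("ECMP:   ", [(hashed_path, p) for p in packets])
print("sprayed:", sprayed)
```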
Performance
Maia 200 is designed for low-precision inference throughput within a 750 W TDP.[^1][^2]
| Metric | Value |
|---|---|
| Peak FP4 | >10 PFLOPS (10.1 PetaOPS cited) |
| Peak FP8 | >5 PFLOPS |
| HBM capacity | 216 GB HBM3e |
| HBM bandwidth | 7 TB/s |
| On-die SRAM | 272 MB (CSRAM + TSRAM) |
| Network bandwidth | 2.8 TB/s bidirectional per accelerator |
| Power | 750 W SoC TDP |
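One way to read this table: at ~10 PFLOPS of FP4 against 7 TB/s of HBM, the chip is only compute-bound above roughly 1,400 FP4 ops per byte pulled from HBM, which is presumably why the design leans on 272 MB of on-die SRAM and hierarchical reuse. A quick roofline-style calculation (the 100 GB weight footprint in the last line is purely illustrative):

```python
# Break-even arithmetic intensity implied by the table above, plus a
# bandwidth-bound decode ceiling for a hypothetical weight footprint.
PEAK_FP4 = 10.1e15        # ops/s (cited FP4 peak)
PEAK_FP8 = 5.0e15         # ops/s (>5 PFLOPS FP8)
HBM_BW = 7.0e12           # bytes/s

print(f"FP4 break-even intensity: {PEAK_FP4 / HBM_BW:.0f} ops per HBM byte")  # ~1443
print(f"FP8 break-even intensity: {PEAK_FP8 / HBM_BW:.0f} ops per HBM byte")  # ~714

weights_bytes = 100e9     # hypothetical model resident in the 216 GB of HBM
print(f"bandwidth-bound decode ceiling: ~{HBM_BW / weights_bytes:.0f} tokens/s "
      f"(weights read once per token, batch of 1)")
```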
System architecture
- A node/tray is four accelerators
  - One FCQ
  - Direct, non-switched links[^1]
- Network scales out to 6,144 accelerators in a two-tier topology.[^1][^2]
  - 1,536 nodes/trays
  - This suggests 128-port switches, or 64-port switches with 2x400G ports and reliance on packet spraying (see the sketch after this list)
- Same ATL protocol is used intra-rack and inter-rack,[^1] so packet spraying is supported
  - This further suggests an 8-plane multi-plane fat tree
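The 128-port guess follows from standard two-tier Clos math (radix-k switches support k²/2 endpoints); the per-plane note at the end is my reading of how 64-port switches could still work in a multi-plane design, which the posts don't state explicitly.

```python
import math

# Two-tier fat-tree sizing behind the 128-port guess.
ENDPOINTS = 6144                       # accelerators in the scale-out domain
NODES = ENDPOINTS // 4                 # 4 accelerators per tray -> 1,536

min_k = math.ceil(math.sqrt(2 * ENDPOINTS))
print(f"nodes/trays: {NODES}")
print(f"minimum radix for two tiers over all accelerators: {min_k} -> 128-port switches")
print(f"k=128 supports {128**2 // 2} endpoints; k=64 only {64**2 // 2}")
# If each plane only needs to reach the 1,536 trays (one port per tray per plane),
# 64-port switches do suffice per plane: 64^2/2 = 2,048 >= 1,536, which is
# consistent with the multi-plane, packet-spraying reading above.
```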
Footnotes
[^1]: https://blogs.microsoft.com/blog/2026/01/26/maia-200-the-ai-accelerator-built-for-inference/
[^2]: https://techcommunity.microsoft.com/blog/azureinfrastructureblog/deep-dive-into-the-maia-200-architecture/4489312
[^3]: Can infer this from the HBM3e packaging and the die photos showing six stacks.