Maia 200 is Microsoft’s second-generation AI accelerator, optimized specifically for high-volume AI inference and token generation in Azure. It is fabricated on TSMC’s 3 nm process.

Each accelerator has:[1][2]

  • 140B transistors on TSMC 3 nm process

  • Matrix engine: FP8/FP6/FP4
  • Vector engine: BF16/FP16/FP32
  • 216 GB HBM3e
    • 7 TB/s
    • 6× 36 GB 12-high stacks[3] (see the quick arithmetic check after this list)
  • 272 MB on-die SRAM scratchpads (Cluster SRAM (CSRAM) + Tile SRAM (TSRAM))
  • On-die NIC, Ethernet-based
    • 1.4+1.4 TB/s bandwidth per accelerator
    • Split between scale-up within node and scale-out across nodes
    • Collectives supported up to 6,144 accelerators
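
As a quick sanity check on the memory figures above (pure arithmetic on the listed numbers; the per-stack bandwidth split is an inference, not a published figure):

```python
# Sanity check of the HBM figures listed above (arithmetic only, no new data).
stacks = 6
gb_per_stack = 36                     # 12-high HBM3e stacks
print(f"HBM capacity: {stacks * gb_per_stack} GB")        # 216 GB, matching the spec

hbm_bw_tbps = 7.0
print(f"Implied bandwidth per stack: {hbm_bw_tbps / stacks:.2f} TB/s")  # ~1.17 TB/s
```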

The Maia 200 architecture is organized as follows (a software analogy of the tile-level execution flow appears after this list):[2]

  • Tile (smallest autonomous unit)
    • Tile Tensor Unit (TTU): matrix multiply / convolution optimized for FP8/FP6/FP4; supports mixed precision such as FP8 activations × FP4 weights
    • Tile Vector Processor (TVP): programmable SIMD engine; supports FP8 plus BF16/FP16/FP32
    • Tile SRAM (TSRAM): multi-banked local SRAM feeding the tile execution engines
    • Tile DMA: moves data into and out of TSRAM without stalling the compute pipeline
    • Tile Control Processor (TCP): orchestrates TTU and DMA work issuance; hardware semaphores for fine-grained synchronization
  • Cluster (second tier of locality)
    • Multiple tiles per cluster
    • Cluster SRAM (CSRAM): large, multi-banked SRAM shared across tiles in a cluster
    • Cluster DMA: stages traffic between CSRAM and co-packaged HBM
    • Cluster core: control and synchronization for coordinated multi-tile execution
    • Redundancy schemes for tiles and SRAM to improve yield while preserving the hierarchical execution model
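
The TCP/DMA/semaphore description above reads like a classic double-buffering pattern: the control processor keeps the DMA engine filling one TSRAM buffer while the tensor unit drains the other, with semaphores signalling buffer hand-off. The sketch below is only a software analogy of that reading; the names, buffer count, and threading model are hypothetical, not a Maia programming API.

```python
# Software analogy of a tile's double-buffered loop: a "Tile DMA" thread fills
# TSRAM buffers while a "TTU" thread consumes them, with semaphores doing the
# fine-grained hand-off. All names are hypothetical; this is not a Maia API.
import threading

NUM_CHUNKS = 8
tsram = [None, None]                                          # two TSRAM buffers
filled = [threading.Semaphore(0), threading.Semaphore(0)]     # buffer holds fresh data
empty = [threading.Semaphore(1), threading.Semaphore(1)]      # buffer free for DMA reuse

def tile_dma():
    """Stage chunks into TSRAM without waiting for the previous compute to finish."""
    for i in range(NUM_CHUNKS):
        buf = i % 2
        empty[buf].acquire()          # wait until the TTU has drained this buffer
        tsram[buf] = f"chunk-{i}"     # stand-in for a DMA transfer from CSRAM/HBM
        filled[buf].release()         # signal: buffer ready for compute

def tile_ttu():
    """Consume staged chunks; compute on one buffer overlaps the next DMA."""
    for i in range(NUM_CHUNKS):
        buf = i % 2
        filled[buf].acquire()         # wait for the DMA-completion signal
        print(f"TTU processes {tsram[buf]}")   # stand-in for a matmul on the chunk
        empty[buf].release()          # hand the buffer back to the DMA engine

dma = threading.Thread(target=tile_dma)
ttu = threading.Thread(target=tile_ttu)
dma.start(); ttu.start()
dma.join(); ttu.join()
```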

On-chip network:

  • Logical planes separating bulk tensor traffic (data plane) from latency-sensitive control/synchronization (control plane)
  • QoS mechanisms to prioritize critical low-latency traffic
  • Hierarchical broadcast and localized cluster traffic to reduce redundant HBM reads (worked example after this list)
  • Layered DMA hierarchy (Tile DMAs, Cluster DMAs, Network DMAs) to overlap movement with compute
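
To make the redundant-read point concrete: if a weight block has to reach every tile, a flat broadcast pulls it from HBM once per tile, whereas staging it through CSRAM (or a single on-die broadcast) pulls it far fewer times. The cluster and tile counts below are made-up illustration values, not Maia 200’s actual configuration.

```python
# HBM-read savings from hierarchical broadcast (illustrative counts, not Maia's).
clusters = 8
tiles_per_cluster = 4
weight_block_mb = 64

flat = clusters * tiles_per_cluster * weight_block_mb   # every tile reads the block from HBM
per_cluster = clusters * weight_block_mb                # one HBM read per cluster, fan out via CSRAM
single = weight_block_mb                                # one HBM read, broadcast on-chip

print(f"Flat broadcast:        {flat} MB pulled from HBM")
print(f"Per-cluster staging:   {per_cluster} MB pulled from HBM")
print(f"Single-read broadcast: {single} MB pulled from HBM")
```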

Intra-node topology uses Fully Connected Quads (FCQ) between accelerator packages (sounds like their answer to NVLink)
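
A Fully Connected Quad is just the complete graph on the four accelerators in a node: each package links directly to the other three, so any pair is one hop apart with no switch in the path. A trivial enumeration of the links:

```python
# Enumerate the direct links in a Fully Connected Quad (4 accelerators per node).
from itertools import combinations

links = list(combinations(range(4), 2))
print(f"{len(links)} direct links: {links}")   # 6 links; every accelerator has 3 one-hop peers
```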

Interconnect:

  • Ethernet-based with custom Microsoft AI Transport Layer (ATL) protocol
  • Transport-layer features such as packet spraying, multipath routing, and congestion-resistant flow control (sounds like MRC); spraying vs. per-flow hashing is sketched below
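
Packet spraying sends successive packets of the same flow down different equal-cost paths, instead of pinning the whole flow to one path via a 5-tuple hash the way classic ECMP does. The path count and hashing below are illustrative only; they are not ATL’s actual mechanism.

```python
# Per-flow ECMP hashing vs. per-packet spraying across equal-cost paths.
# Path count and hashing are illustrative; this is not ATL's actual mechanism.
NUM_PATHS = 8

def ecmp_path(flow_id: int) -> int:
    """Classic ECMP: every packet of a flow is pinned to the same path."""
    return hash(flow_id) % NUM_PATHS

def sprayed_path(flow_id: int, packet_seq: int) -> int:
    """Packet spraying: successive packets of one flow rotate across all paths."""
    return (hash(flow_id) + packet_seq) % NUM_PATHS

flow = 42
print("ECMP:   ", [ecmp_path(flow) for _ in range(8)])            # one path, can hot-spot
print("Sprayed:", [sprayed_path(flow, seq) for seq in range(8)])  # load spread over all paths
```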

Performance

Maia 200 is designed for low-precision inference throughput within a 750 W TDP.[1][2] A few derived figures follow the table.

| Metric             | Value                                   |
|--------------------|-----------------------------------------|
| Peak FP4           | >10 PFLOPS (10.1 PetaOPS cited)         |
| Peak FP8           | >5 PFLOPS                               |
| HBM capacity       | 216 GB HBM3e                            |
| HBM bandwidth      | 7 TB/s                                  |
| On-die SRAM        | 272 MB (CSRAM + TSRAM)                  |
| Network bandwidth  | 2.8 TB/s bidirectional per accelerator  |
| Power              | 750 W SoC TDP                           |
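
A few derived figures from the table (straight arithmetic on the listed values; nothing here is a published number):

```python
# Derived figures from the table above (arithmetic on listed values only).
fp4_peak_pops = 10.1     # PetaOPS (cited)
hbm_bw_tbps = 7.0        # TB/s
tdp_w = 750.0
sram_mb, hbm_gb = 272, 216

print(f"FP4 ops per HBM byte: {fp4_peak_pops * 1e15 / (hbm_bw_tbps * 1e12):.0f}")  # ~1443
print(f"FP4 TOPS per watt:    {fp4_peak_pops * 1e15 / tdp_w / 1e12:.1f}")          # ~13.5
print(f"On-die SRAM vs HBM:   {sram_mb / (hbm_gb * 1024):.2%} of HBM capacity")    # ~0.12%
```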

System architecture

  • A node/tray is four accelerators
    • One FCQ
    • Direct, non-switched links[1]
  • The network scales out to 6,144 accelerators in a two-tier topology.[1][2]
    • 1,536 nodes/trays
    • This suggests 128-port switches, or 2× 400G × 64-port switches with reliance on packet spraying (see the radix check after this list)
    • The same ATL protocol is used intra-rack and inter-rack,[1] so packet spraying is supported
    • This further suggests an 8-plane multi-plane fat tree
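
The switch-radix guess above can be checked against the standard two-tier Clos bound: with radix-k switches and ports split evenly between downlinks and uplinks, a single non-blocking plane tops out at k²/2 endpoints, so reaching 6,144 endpoints in two tiers wants roughly a 128-port radix, while 64-port switches fall short. A quick enumeration under that (assumed) non-blocking split:

```python
# Two-tier (leaf/spine) fat-tree capacity check for the 6,144-accelerator scale-out.
# Assumes switch ports split evenly between downlinks and uplinks (non-blocking);
# the radix choices are speculation consistent with the text above.
TARGET_ENDPOINTS = 6_144

def max_endpoints_two_tier(radix: int) -> int:
    """Max endpoints in a non-blocking two-tier Clos: radix leaves x radix/2 hosts each."""
    return radix * (radix // 2)

for radix in (64, 128):
    cap = max_endpoints_two_tier(radix)
    verdict = "fits" if cap >= TARGET_ENDPOINTS else "needs more tiers or planes"
    print(f"radix {radix:3d}: up to {cap:5d} endpoints per plane -> {verdict}")

print(f"Nodes at 4 accelerators each: {TARGET_ENDPOINTS // 4}")   # 1,536 nodes/trays
```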

Footnotes

  1. https://blogs.microsoft.com/blog/2026/01/26/maia-200-the-ai-accelerator-built-for-inference/

  2. https://techcommunity.microsoft.com/blog/azureinfrastructureblog/deep-dive-into-the-maia-200-architecture/4489312

  3. Can infer this from the HBM3e packaging and the die photos showing six stacks