MAI-1 was a mixture of experts model developed by Microsoft AI that was pre- and post-trained on “~15,000 NVIDIA H100 GPUs.”1
MAI-Thinking-1 is a mixture of experts model with one trillion total parameters and 35 billion active parameters2 (3.5% active).
It was fine-tuned from MAI-Base-1, which was trained on 8,192 GB200 NVL72 GPUs on a single InfiniBand domain (probably on Microsoft AI’s Arizona supercomputer).3
It has:
- 35 billion active parameters, 1 trillion total parameters
- 78 layers
- Hidden dimension of 6,652
- Full FFN dimension of 13,312
- Expert dimension of 10,240
- Down projection dimension of 3,072
- 512 experts, 8 active
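As a quick check on the sparsity figures in the list above, the following sketch computes the fraction of parameters and the fraction of experts that are active per token. It uses only the numbers listed; the variable names are illustrative.

```python
# Figures taken from the architecture list above.
total_params = 1e12      # 1 trillion total parameters
active_params = 35e9     # 35 billion active parameters
num_experts = 512
active_experts = 8

# Share of parameters and share of experts used for each token.
active_param_fraction = active_params / total_params
active_expert_fraction = active_experts / num_experts

print(f"active parameters: {active_param_fraction:.1%}")
print(f"active experts:    {active_expert_fraction:.1%}")
```

The parameter fraction (3.5%) is higher than the expert fraction (about 1.6%). That would be expected if attention, embedding, and any dense layers are always active, but the figures above do not break this down.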
Training MAI-Base-1
MAI followed a five-step process of training to generate MAI-Base-1:
- Pre-training: 30T tokens of next-token prediction on a broad corpus. 16K context.
- Mid-training 1: 3.4T tokens of next-token prediction, with the data mixture shifted towards STEM/math/code. 64K context.
- Mid-training 2: 150B tokens of next-token prediction with even more STEM/math/code. 262K context. Used fewer GPUs.
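To put the three stages above in proportion, this small sketch totals the token counts and prints each stage's share. The token counts come from the list; everything else is plain arithmetic.

```python
# Token counts for each stage, taken from the list above.
stage_tokens = {
    "Pre-training": 30e12,
    "Mid-training 1": 3.4e12,
    "Mid-training 2": 150e9,
}

total_tokens = sum(stage_tokens.values())

# Print each stage's token count and its share of the total.
for name, tokens in stage_tokens.items():
    share = tokens / total_tokens
    print(f"{name}: {tokens / 1e12:.2f}T tokens ({share:.1%} of total)")

print(f"Total: {total_tokens / 1e12:.2f}T tokens")
```

Pre-training accounts for roughly nine-tenths of the 33.55T tokens listed, with the long-context Mid-training 2 stage contributing well under 1%.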
The MAI-Base-1 model was trained on 8,192 GB200 NVL72 GPUs in one “logical cluster at one site”3 (probably their Arizona supercomputer), using NVL64 domains. Mid-training 2 halved the GPU count to 4,096.
From the MAI-Thinking-1 technical report,3
MAI-Base-1 pre-training run reached 90.0% goodput at 8K GPUs, despite being larger than earlier pre-training runs. Total overhead dropped to 51 hours. Recomputation, the time spent reproducing previously computed steps after falling back to a checkpoint, fell to 6.5 hours, only 15% of overhead. Non-stepping time dropped to 14 hours, or 27% indicating that the system become much better at staying alive, avoiding repeated rework, and recovering without long manual intervention. However, the final run also showed the next bottleneck clearly. MFU drop overhead became the largest single remaining category, at 18 hours and 35% of overhead. This was driven by checkpointing, network degradation, memory pressure, and hardware health transitions. The failure trends also improved but did not disappear.
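From this, we can infer that the entire pre-training run took about 510 hours of wall-clock time: at 90.0% goodput, the 51 hours of overhead make up the remaining 10% of the total. Of those 51 hours,
- 18 hours was lost to MFU drops from checkpointing, network degradation, memory pressure, and hardware health transitions
- 14 hours was lost to non-stepping time, when the job simply wasn’t computing (restarting)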
- 6.5 hours was lost to recomputation after a restart-from-checkpoint
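The inference above can be reproduced with a few lines of arithmetic. This sketch assumes that goodput means productive time divided by total wall-clock time and that the 51 hours of overhead is all of the non-productive time; both are assumptions about the report's definitions, not something the quoted passage states.

```python
# Figures from the quoted passage.
goodput = 0.90            # fraction of wall-clock time spent on useful training
overhead_hours = 51.0     # total overhead

# ASSUMPTION: overhead is exactly the (1 - goodput) share of wall-clock time.
total_hours = overhead_hours / (1.0 - goodput)
productive_hours = total_hours - overhead_hours

# Overhead categories named in the quoted passage, in hours.
categories = {
    "MFU drop": 18.0,
    "non-stepping": 14.0,
    "recomputation": 6.5,
}
uncategorized_hours = overhead_hours - sum(categories.values())

print(f"total wall-clock: {total_hours:.0f} h, productive: {productive_hours:.0f} h")
for name, hours in categories.items():
    print(f"{name}: {hours} h ({hours / overhead_hours:.1%} of overhead)")
print(f"not itemized: {uncategorized_hours} h")
```

The three named categories sum to 38.5 hours, so 12.5 hours of overhead are not itemized in the quoted passage. The recomputation share also comes out near 13% here rather than the quoted 15%, which may reflect rounding or a different denominator in the report.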
Subtracting the 51 hours of overhead leaves 459 hours of productive pre-training, which processed 30T tokens over 8,192 B200 GPUs. This amounts to 3.76 million GPU hours at a rate of about 2,200 tokens per GPU per second. This is consistent with the 20% MFU cited for the v5 pre-training run.
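The throughput and MFU figures can be sanity-checked as follows. This is a rough sketch under stated assumptions: training FLOPs per token are approximated as 6 times the active parameter count (ignoring attention FLOPs), training is assumed to run in BF16, and the per-GPU peak of 2.5 PFLOP/s dense BF16 is an assumed figure for a GB200-class GPU, not a number from the report.

```python
# Figures from the text above.
productive_hours = 459
num_gpus = 8192
tokens = 30e12
active_params = 35e9

# ASSUMPTION: dense BF16 peak per GPU; not stated in the source.
peak_flops_per_gpu = 2.5e15

gpu_hours = productive_hours * num_gpus
tokens_per_gpu_per_s = tokens / (gpu_hours * 3600)

# ASSUMPTION: ~6 FLOPs per active parameter per token (forward + backward).
achieved_flops_per_gpu = 6 * active_params * tokens_per_gpu_per_s
mfu = achieved_flops_per_gpu / peak_flops_per_gpu

print(f"GPU hours: {gpu_hours / 1e6:.2f} million")
print(f"tokens per GPU per second: {tokens_per_gpu_per_s:,.0f}")
print(f"estimated MFU: {mfu:.1%}")
```

Under these assumptions the per-GPU rate is on the order of 2,200 tokens per second, and the estimated MFU lands close to 20%. A different assumed peak (for example, a lower BF16 figure or an FP8 peak) would shift the estimate accordingly.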
Fine-tuning MAI-Thinking-1
The largest reinforcement learning job used:
- 4,096 GPUs for rollout generation (inference) using SGLang
- 768 GPUs for learning (updating model weights) using Microsoft AI’s in-house “YOLO” framework
The reward model itself was a fine-tuned version of MAI-Base-1.