Trinity is a family of open-source MoE transformers developed by Arcee. The largest, Trinity Large, has 400B total parameters across 256 routed experts, with 1 shared and 3 routed experts active per token (roughly 13B active parameters per token). It was trained on 17T tokens.
Most of this information is from Trinity Large.
Architecture
From the Arcee Trinity Large Technical Report:1
| | Trinity Nano | Trinity Mini | Trinity Large |
|---|---|---|---|
| Transformer layers | 56 | 32 | 60 |
| Initial dense layers | 2 | 2 | 6 |
| Model dim (dmodel) | 1024 | 2048 | 3072 |
| FFN intermediate dim | 3072 | 6144 | 12288 |
| Attention heads (hq) | 8 | 32 | 48 |
| Per-head dim (dh) | 128 | 128 | 128 |
| KV heads (hkv) | 2 | 4 | 8 |
| Local window size | 2048 | 2048 | 4096 |
| Pre-training seq len | 4096 | 4096 | 8192 |
| MoE shared experts | 1 | 1 | 1 |
| MoE routed experts | 128 | 128 | 256 |
| Activated experts / token | 8 | 8 | 4 |
| Route scale | 2.826 | 2.826 | 2.448 |
| Expert size | 256 | 1024 | 3072 |
| Initialization (trunc normal σ) | 0.016 | 0.011 | 0.009 |
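The table gives enough to sanity-check the 400B total parameter count. The sketch below assumes a gated three-matrix FFN per expert (3 × d_model × expert_size parameters); the exact FFN form is an assumption, not stated in the excerpt above. Attention, the dense FFN layers, the shared experts, and embeddings make up the remainder.

```python
# Back-of-envelope on the routed-expert parameters from the Large column.
# Assumes a gated FFN with 3 weight matrices per expert (an assumption).
d_model = 3072
expert_size = 3072
routed_experts = 256
moe_layers = 60 - 6  # 60 transformer layers minus 6 initial dense layers

params_per_expert = 3 * d_model * expert_size
total_routed = params_per_expert * routed_experts * moe_layers
print(f"{total_routed / 1e9:.0f}B routed-expert params")  # ~391B
```

Landing within ~2% of the 400B headline from table values alone suggests the routed experts dominate the parameter budget.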
- 400B-parameter MoE, ~13B active per token
- 256 routed experts, 4 active per token: 1 shared, 3 routed
- ~1.56% routing fraction (4 of 256); this is very high sparsity
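The quoted routing fraction checks out directly:

```python
# The ~1.56% routing fraction quoted above: 4 activated experts
# (1 shared + 3 routed) over the 256-expert routed pool.
activated = 1 + 3
routed_pool = 256
fraction = activated / routed_pool
print(f"{fraction:.2%}")  # 1.56%
```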
MoE routing stability mechanics:
- 6 dense layers to stabilize routing at this sparsity level: of the 60 transformer layers, 54 are MoE and 6 are dense
- They explicitly state routing stability was a challenge requiring architectural adjustment mid-design
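The routing scheme described above (softmax over a top-3 routed selection, an always-on shared expert, and a route-scale multiplier) can be sketched as follows. This is illustrative only, not Arcee's implementation: each "expert" here is just an elementwise gain vector, and dimensions are shrunk for readability.

```python
import math
import random

# Minimal sketch of shared-expert + top-k routing with a route scale,
# mirroring the Large config's 1 shared + 3 routed experts per token.
# Expert internals and the softmax placement are assumptions.
random.seed(0)
d_model, n_experts, top_k, route_scale = 8, 16, 3, 2.448

def rand_vec(n, sigma=0.02):
    return [random.gauss(0.0, sigma) for _ in range(n)]

router = [rand_vec(d_model) for _ in range(n_experts)]   # router weights
experts = [rand_vec(d_model) for _ in range(n_experts)]  # routed experts
shared = rand_vec(d_model)                               # always-on expert

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def moe_layer(x):
    logits = [dot(x, w) for w in router]
    top = sorted(range(n_experts), key=logits.__getitem__)[-top_k:]
    m = max(logits[i] for i in top)
    w = [math.exp(logits[i] - m) for i in top]
    s = sum(w)
    w = [wi / s for wi in w]  # softmax over the selected experts only
    out = []
    for j in range(d_model):
        routed = sum(wi * experts[i][j] * x[j] for wi, i in zip(w, top))
        out.append(x[j] + route_scale * routed + shared[j] * x[j])
    return out

y = moe_layer(rand_vec(d_model))
print(len(y))  # 8
```

At this sparsity most experts see very few tokens per batch, which is exactly where load-balancing and routing stability become hard; the 6 dense layers give early activations a stable path before any routing happens.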
Data Pipeline
Trained on 17T tokens.
- Training split into 3 phases (10T/4T/3T tokens)
- Curated by DatologyAI
- 8T tokens of synthetic data (web, code, math, reasoning, multilingual—14 non-English languages)
- “State-of-the-art rephrasing approaches” (no specifics)
- Data mix evolved specifically for Trinity Large vs smaller Trinity models
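The three-phase budget can be written as a simple schedule; only the token counts come from the post, and the phase names below are placeholders.

```python
# Sketch of the three-phase token budget (10T / 4T / 3T = 17T total).
# Phase names are hypothetical; Arcee does not name them here.
phases = [("phase1", 10e12), ("phase2", 4e12), ("phase3", 3e12)]

def phase_at(tokens_seen):
    """Return the active phase for a cumulative token count."""
    cum = 0.0
    for name, budget in phases:
        cum += budget
        if tokens_seen < cum:
            return name
    return phases[-1][0]

total = sum(budget for _, budget in phases)
print(total / 1e12)     # 17.0
print(phase_at(12e12))  # phase2
```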
This is heavy on synthetic data relative to typical ratios. The fact that they call out “curation advancements” suggests they learned something between smaller Trinity models and this one, but they don’t say what broke.
Training dynamics worth noting:
- Smooth loss curve with “clear phase transitions, no spikes”
- They frame this as success after achieving “stability dialed in”
- Muon optimizer mentioned as enabling larger batch sizes (vs AdamW)
- They reference MiniMax-01 paper for batch-size scaling justification
This suggests they hit instability early and had to tune heavily to get a clean run. The architectural changes (increasing the dense layers from 3 to 6) and routing tweaks likely came out of failed runs.
Training Infrastructure & Scale
- 2048 B300 GPUs for 33 days pretraining (claimed as “largest publicly stated” B300 run)
- Total cost: $20M all-in for 4 models over 6 months (compute, salaries, data, storage, ops)
- Throughput optimized via HSDP with expert parallelism of 8, across 2,048 data-parallel ranks
- Batch size increased after 5T tokens (justified by high sparsity + Muon optimizer’s larger critical batch size tolerance)
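The parallelism arithmetic above is worth writing down. With 2,048 GPUs and expert parallelism of 8, the expert-parallel dimension presumably overlaps the HSDP data-parallel dimension (each GPU serves as a data-parallel rank while experts are sharded over groups of 8 ranks); how exactly the two dimensions overlap is my assumption, not spelled out in the post.

```python
# Hedged sketch of the stated parallelism arithmetic: 2,048 GPUs,
# expert parallelism = 8, 256 routed experts. The overlap of the
# expert-parallel and data-parallel dimensions is an assumption.
gpus = 2048
expert_parallel = 8
routed_experts = 256

ep_groups = gpus // expert_parallel                   # 256 groups of 8 ranks
experts_per_rank = routed_experts // expert_parallel  # 32 experts per rank
print(ep_groups, experts_per_rank)  # 256 32
```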
Three Checkpoint Release Strategy
There are three variants:
- Preview: Light post-training, instruct-style (non-reasoning), optimized for creative tasks and agentic workflows
- Base: Full 17T pretraining checkpoint
- TrueBase: 10T checkpoint with zero instruct data, no LR annealing. Explicitly marketed as “real baseline” for researchers
Inference Context & Hosting
- Native 512K context support
- Preview API running at 128K with 8-bit quantization while they tune infrastructure
- The launch was framed as a "preview of the hosting platform" as much as a model launch
Claims vs. substance
What they say:
- “Frontier-class foundation model”
- Matches/exceeds open-base peers across benchmarks
- 2-3x inference throughput advantage
What’s missing:
- No detailed hardware utilization metrics (MFU, throughput/GPU)
- No ablations on sparsity vs performance trade-off
- Vague on what routing instability they hit and how they fixed it