Trinity is a family of open-source MoE transformers developed by Arcee. The largest, Trinity Large, has 400B total parameters and 256 routed experts, with 4 experts active per token (1 shared, 3 routed), for 13B active parameters per token. It was trained on 17T tokens.

Most of the information below is about Trinity Large.

Architecture

From Arcee Trinity Large Technical Report:1

| | Trinity Nano | Trinity Mini | Trinity Large |
| --- | --- | --- | --- |
| Transformer layers | 56 | 32 | 60 |
| Initial dense layers | 2 | 2 | 6 |
| Model dim (d_model) | 1024 | 2048 | 3072 |
| FFN intermediate dim | 3072 | 6144 | 12288 |
| Attention heads (h_q) | 8 | 32 | 48 |
| Per-head dim (d_h) | 128 | 128 | 128 |
| KV heads (h_kv) | 2 | 4 | 8 |
| Local window size | 2048 | 2048 | 4096 |
| Pre-training seq len | 4096 | 4096 | 8192 |
| MoE shared experts | 1 | 1 | 1 |
| MoE routed experts | 128 | 128 | 256 |
| Activated experts / token | 8 | 8 | 4 |
| Route scale | 2.826 | 2.826 | 2.448 |
| Expert size | 256 | 1024 | 3072 |
| Initialization (trunc normal σ) | 0.016 | 0.011 | 0.009 |
  • 400B-parameter MoE, 13B active per token
  • 256 routed experts plus 1 shared expert; 4 experts active per token (1 shared, 3 routed)
  • 4/256 ≈ 1.56% routing fraction; this is very high sparsity
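The sparsity figures in these bullets follow directly from the table above; a quick arithmetic check:

```python
# Sanity check of Trinity Large's sparsity, using numbers from the table above.
total_params = 400e9
active_params = 13e9
routed_experts = 256
active_per_token = 4  # 1 shared + 3 routed

active_fraction = active_params / total_params        # 13/400 = 3.25% of weights per token
routing_fraction = active_per_token / routed_experts  # 4/256 = 1.5625%
print(f"{active_fraction:.2%} of parameters active, {routing_fraction:.2%} routing fraction")
```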

MoE routing stability mechanics:

  • 6 initial dense layers to stabilize routing at this sparsity level, so 54 of the 60 layers are MoE and 6 are dense transformer layers
  • They explicitly state routing stability was a challenge requiring architectural adjustment mid-design
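To make the routing setup concrete, here is a minimal sketch of shared-plus-routed top-k routing with the table's numbers (256 routed experts, 3 routed active, route scale 2.448). This is a generic illustration, not Arcee's implementation; the function name and the softmax-then-scale ordering are assumptions.

```python
import math
import random

def route(logits, num_routed_active=3, route_scale=2.448):
    # Pick the top-k routed experts by router logit. The shared expert is
    # always active and bypasses this selection entirely.
    top = sorted(range(len(logits)), key=lambda i: logits[i])[-num_routed_active:]
    m = max(logits[i] for i in top)
    exps = [math.exp(logits[i] - m) for i in top]
    z = sum(exps)
    # Softmax over the selected experts only, then apply the route scale
    # (assumed ordering; the report does not give the exact formula).
    weights = [route_scale * e / z for e in exps]
    return top, weights

random.seed(0)
logits = [random.gauss(0, 1) for _ in range(256)]  # one logit per routed expert
idx, w = route(logits)
print(idx, sum(w))  # 3 expert indices; combination weights sum to route_scale
```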

Data Pipeline

Trained on 17T tokens.

  • Training split into 3 phases (10T / 4T / 3T tokens)
  • Curated by DatologyAI
  • 8T tokens of synthetic data (web, code, math, reasoning, multilingual—14 non-English languages)
  • “State-of-the-art rephrasing approaches” (no specifics)
  • Data mix evolved specifically for Trinity Large vs smaller Trinity models
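The phase and synthetic-data fractions implied by these bullets can be checked directly:

```python
# Token accounting from the bullets above: 10T + 4T + 3T = 17T total.
phases = {"phase 1": 10e12, "phase 2": 4e12, "phase 3": 3e12}
total = sum(phases.values())
synthetic = 8e12
print(f"synthetic share: {synthetic / total:.1%}")  # roughly 47% of all tokens
for name, tokens in phases.items():
    print(f"{name}: {tokens / total:.1%} of training")
```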

This is heavy on synthetic data relative to typical ratios. The fact that they call out “curation advancements” suggests they learned something between smaller Trinity models and this one, but they don’t say what broke.

Training dynamics worth noting:

  • Smooth loss curve with “clear phase transitions, no spikes”
  • They frame this as success after achieving “stability dialed in”
  • Muon optimizer mentioned as enabling larger batch sizes (vs AdamW)
  • They reference MiniMax-01 paper for batch-size scaling justification

This suggests they had instability issues early and had to tune heavily to get clean training. The architectural changes (3 to 6 dense layers) and routing tweaks likely came from failed runs.

Training Infrastructure & Scale

  • 2048 B300 GPUs for 33 days pretraining (claimed as “largest publicly stated” B300 run)
  • Total cost: $20M all-in for 4 models over 6 months (compute, salaries, data, storage, ops)
  • Training throughput optimized via HSDP with expert parallelism of 8 across 2,048 data-parallel ranks
  • Batch size increased after 5T tokens (justified by high sparsity + Muon optimizer’s larger critical batch size tolerance)
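A rough sketch of the rank layout these numbers imply, assuming expert-parallel groups are carved out of the data-parallel mesh so that every GPU remains a data-parallel rank for non-expert parameters. The group arithmetic is an assumption for illustration, not from the report.

```python
# Parallel-layout arithmetic (assumed decomposition, not from the report).
gpus = 2048
ep = 8                         # expert-parallel degree
ep_groups = gpus // ep         # 256 groups, each sharding the experts 8 ways
routed_experts = 256
experts_per_rank = routed_experts // ep  # 32 routed experts hosted per GPU
print(f"{ep_groups} EP groups, {experts_per_rank} routed experts per rank")
```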

Three Checkpoint Release Strategy

There are three variants:

  1. Preview: Light post-training, instruct-style (non-reasoning), optimized for creative tasks and agentic workflows
  2. Base: Full 17T pretraining checkpoint
  3. TrueBase: 10T checkpoint with zero instruct data, no LR annealing. Explicitly marketed as “real baseline” for researchers

Inference Context & Hosting

  • Native 512K context support
  • Preview API running at 128K with 8-bit quantization while they tune infrastructure
  • The launch was framed as a “preview of the hosting platform” as much as a model launch
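The context figures have direct KV-cache implications. A rough per-sequence estimate from the architecture table (60 layers, 8 KV heads, 128-dim heads), assuming the cache itself is held in 8 bits (an assumption; the 8-bit figure above may refer only to weights):

```python
# Rough KV-cache sizing from the architecture table (illustrative arithmetic).
layers, kv_heads, head_dim = 60, 8, 128
bytes_per_elem = 1                      # assumed 8-bit KV cache
per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem  # K and V
for ctx in (128 * 1024, 512 * 1024):
    gib = per_token * ctx / 2**30
    print(f"{ctx // 1024}K context: {gib:.1f} GiB per sequence")  # 15.0 and 60.0 GiB
```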

Claims vs. substance

What they say:

  • “Frontier-class foundation model”
  • Matches/exceeds open-base peers across benchmarks
  • 2-3x inference throughput advantage

What’s missing:

  • No detailed hardware utilization metrics (MFU, throughput/GPU)
  • No ablations on sparsity vs performance trade-off
  • Vague on what routing instability they hit and how they fixed it

Footnotes

  1. Arcee Trinity Large Technical Report