Trinity is a family of open-source MoE transformers developed by Arcee. The largest, Trinity Large, has 400B total parameters and 256 routed experts, with 4 experts active per token (1 shared, 3 routed), for 13B active parameters per token. It was trained on 17T tokens.

Most of the information below is about Trinity Large.

Architecture

From Arcee Trinity Large Technical Report:1

| | Trinity Nano | Trinity Mini | Trinity Large |
| --- | --- | --- | --- |
| Transformer layers | 56 | 32 | 60 |
| Initial dense layers | 2 | 2 | 6 |
| Model dim (d_model) | 1024 | 2048 | 3072 |
| FFN intermediate dim | 3072 | 6144 | 12288 |
| Attention heads (h_q) | 8 | 32 | 48 |
| Per-head dim (d_h) | 128 | 128 | 128 |
| KV heads (h_kv) | 2 | 4 | 8 |
| Local window size | 2048 | 2048 | 4096 |
| Pre-training seq len | 4096 | 4096 | 8192 |
| MoE shared experts | 1 | 1 | 1 |
| MoE routed experts | 128 | 128 | 256 |
| Activated experts / token | 8 | 8 | 4 |
| Route scale | 2.826 | 2.826 | 2.448 |
| Expert size | 256 | 1024 | 3072 |
| Initialization (trunc normal σ) | 0.016 | 0.011 | 0.009 |
  • 400B-parameter MoE, 13B active per token
  • 256 routed experts plus 1 shared expert; 4 experts active per token (1 shared, 3 routed)
  • 4/256 ≈ 1.56% routing fraction; this is very high sparsity
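The sparsity figures in these bullets follow directly from the table above; a quick arithmetic check:

```python
# Sanity check of Trinity Large's sparsity, using numbers from the table above.
total_params = 400e9
active_params = 13e9
routed_experts = 256
active_per_token = 4  # 1 shared + 3 routed

active_fraction = active_params / total_params        # 13/400 = 3.25% of weights per token
routing_fraction = active_per_token / routed_experts  # 4/256 = 1.5625%
print(f"{active_fraction:.2%} of parameters active, {routing_fraction:.2%} routing fraction")
```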

MoE routing stability mechanics:

  • 6 initial dense layers to stabilize routing at this sparsity level, so 54 of the 60 layers are MoE and 6 are dense transformer layers
  • They explicitly state routing stability was a challenge requiring architectural adjustment mid-design
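To make the routing setup concrete, here is a minimal sketch of shared-plus-routed top-k routing with the table's numbers (256 routed experts, 3 routed active, route scale 2.448). This is a generic illustration, not Arcee's implementation; the function name and the softmax-then-scale ordering are assumptions.

```python
import math
import random

def route(logits, num_routed_active=3, route_scale=2.448):
    # Pick the top-k routed experts by router logit. The shared expert is
    # always active and bypasses this selection entirely.
    top = sorted(range(len(logits)), key=lambda i: logits[i])[-num_routed_active:]
    m = max(logits[i] for i in top)
    exps = [math.exp(logits[i] - m) for i in top]
    z = sum(exps)
    # Softmax over the selected experts only, then apply the route scale
    # (assumed ordering; the report does not give the exact formula).
    weights = [route_scale * e / z for e in exps]
    return top, weights

random.seed(0)
logits = [random.gauss(0, 1) for _ in range(256)]  # one logit per routed expert
idx, w = route(logits)
print(idx, sum(w))  # 3 expert indices; combination weights sum to route_scale
```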

Data Pipeline

Trained on 17T tokens.

  • Training split into 3 phases (10T / 4T / 3T tokens)
  • Curated by DatologyAI
  • 8T tokens of synthetic data (web, code, math, reasoning, multilingual—14 non-English languages)
  • “State-of-the-art rephrasing approaches” (no specifics)
  • Data mix evolved specifically for Trinity Large vs smaller Trinity models
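The phase and synthetic-data fractions implied by these bullets can be checked directly:

```python
# Token accounting from the bullets above: 10T + 4T + 3T = 17T total.
phases = {"phase 1": 10e12, "phase 2": 4e12, "phase 3": 3e12}
total = sum(phases.values())
synthetic = 8e12
print(f"synthetic share: {synthetic / total:.1%}")  # roughly 47% of all tokens
for name, tokens in phases.items():
    print(f"{name}: {tokens / total:.1%} of training")
```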

This is heavy on synthetic data relative to typical ratios. The fact that they call out “curation advancements” suggests they learned something between smaller Trinity models and this one, but they don’t say what broke.

Training dynamics worth noting:

  • Smooth loss curve with “clear phase transitions, no spikes”
  • They frame this as success after achieving “stability dialed in”
  • Muon optimizer mentioned as enabling larger batch sizes (vs AdamW)
  • They reference MiniMax-01 paper for batch-size scaling justification

This suggests they had instability issues early and had to tune heavily to get clean training. The architectural changes (3 to 6 dense layers) and routing tweaks likely came from failed runs.

Training Infrastructure & Scale

  • 2048 B300 GPUs for 33 days pretraining (claimed as “largest publicly stated” B300 run)
  • Total cost: $20M all-in for 4 models over 6 months (compute, salaries, data, storage, ops)
  • Training throughput optimized via HSDP with expert parallelism of 8 across 2,048 data-parallel ranks
  • Batch size increased after 5T tokens (justified by high sparsity + Muon optimizer’s larger critical batch size tolerance)
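A rough sketch of the rank layout these numbers imply, assuming expert-parallel groups are carved out of the data-parallel mesh so that every GPU remains a data-parallel rank for non-expert parameters. The group arithmetic is an assumption for illustration, not from the report.

```python
# Parallel-layout arithmetic (assumed decomposition, not from the report).
gpus = 2048
ep = 8                         # expert-parallel degree
ep_groups = gpus // ep         # 256 groups, each sharding the experts 8 ways
routed_experts = 256
experts_per_rank = routed_experts // ep  # 32 routed experts hosted per GPU
print(f"{ep_groups} EP groups, {experts_per_rank} routed experts per rank")
```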

Three Checkpoint Release Strategy

There are three variants:

  1. Preview: Light post-training, instruct-style (non-reasoning), optimized for creative tasks and agentic workflows
  2. Base: Full 17T pretraining checkpoint
  3. TrueBase: 10T checkpoint with zero instruct data, no LR annealing. Explicitly marketed as “real baseline” for researchers

Inference Context & Hosting

  • Native 512K context support
  • Preview API running at 128K with 8-bit quantization while they tune infrastructure
  • The launch was framed as a “preview of the hosting platform” as much as a model launch
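The context figures have direct KV-cache implications. A rough per-sequence estimate from the architecture table (60 layers, 8 KV heads, 128-dim heads), assuming the cache itself is held in 8 bits (an assumption; the 8-bit figure above may refer only to weights):

```python
# Rough KV-cache sizing from the architecture table (illustrative arithmetic).
layers, kv_heads, head_dim = 60, 8, 128
bytes_per_elem = 1                      # assumed 8-bit KV cache
per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem  # K and V
for ctx in (128 * 1024, 512 * 1024):
    gib = per_token * ctx / 2**30
    print(f"{ctx // 1024}K context: {gib:.1f} GiB per sequence")  # 15.0 and 60.0 GiB
```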

Claims vs. substance

What they say:

  • “Frontier-class foundation model”
  • Matches/exceeds open-base peers across benchmarks
  • 2-3x inference throughput advantage

What’s missing:

  • No detailed hardware utilization metrics (MFU, throughput/GPU)
  • No ablations on sparsity vs performance trade-off
  • Vague on what routing instability they hit and how they fixed it

Footnotes

  1. Arcee Trinity Large Technical Report