Rubin CPX is a GPU announced at the 2025 AI Infrastructure Summit that is “specifically optimized for context processing.” It swaps out HBM for GDDR7 and increases the NVFP4 performance over Rubin by 20%. From the announcement, it has:

  • 30 PF NVFP4 (assume sparse)
  • “3x Exponent Operations” - which are “attention acceleration cores” - compared to GB300 NVL72
  • 128 GB GDDR7 instead of HBM - because prefill is compute-limited, not memory-bandwidth-limited like decode (see the back-of-envelope sketch after this list)
  • 4 NVENC encoders and 4 NVDEC decoders for processing and generating AI video
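
A rough back-of-envelope illustrates the compute-vs-bandwidth point. Every number below (model size, precision, prompt length) is a hypothetical placeholder, not a CPX specification:

```python
# Back-of-envelope arithmetic intensity (FLOPs per byte of weights read)
# for prefill vs. decode. All numbers are hypothetical placeholders.

params = 70e9          # assume a 70B-parameter dense model
bytes_per_param = 0.5  # NVFP4 weights: 4 bits per parameter
weight_bytes = params * bytes_per_param

def flops_per_step(tokens):
    # ~2 FLOPs per parameter per token for a forward pass (ignoring attention)
    return 2 * params * tokens

prefill_tokens = 128_000  # whole prompt processed in parallel
decode_tokens = 1         # one token generated per step

for name, tokens in [("prefill", prefill_tokens), ("decode", decode_tokens)]:
    intensity = flops_per_step(tokens) / weight_bytes
    print(f"{name}: ~{intensity:,.0f} FLOPs per byte of weights")

# prefill: ~512,000 FLOPs per weight byte -> bound by compute
# decode:  ~4 FLOPs per weight byte       -> bound by memory bandwidth
```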

It is to be released at the end of 2026 as a fast follow-on to the R200 launch.

Platforms

The following slide from Ian Buck summarizes the two ways in which NVIDIA will ship CPX:1

“Vera Rubin NVL144 CPX”

There will be a new VR200 NVL144 tray (“VR NVL 144”) which incorporates 8x CPX GPUs in addition to the 8x Rubin GPUs in each tray:

Feature             VR144-only      VR144 with CPX
NVFP4 FLOPS         3.6 EF          8.0 EF
Memory Bandwidth    1.4 PB/s        1.7 PB/s
“Fast memory”       75 TB           100 TB
Network             8x ConnectX-9   8x ConnectX-9

In one of these VR NVL144 CPX racks, that comes to 8 EF of NVFP4 compute, 1.7 PB/s of memory bandwidth, and 100 TB of “fast memory.”

“Vera Rubin CPX Dual Rack”

NVIDIA will also make a CPX-only tray (VR-CPX) with 8x CPX. From the slide above, it doesn’t look like the CPX trays have scale-up NVLink connectivity. This implies that the non-CPX nodes will have to fetch KV caches from CPX nodes over InfiniBand.
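
To get a feel for what that implies, here is a rough estimate of the KV-cache size for a long prompt and its transfer time over a single network link. The model shape and link speed are assumptions for illustration, not published figures:

```python
# Rough KV-cache size and transfer-time estimate for a 1M-token prompt.
# Model dimensions and link speed are assumptions, not published CPX figures.

layers = 80              # hypothetical decoder layers
kv_heads = 8             # grouped-query-attention KV heads (assumption)
head_dim = 128
bytes_per_elem = 2       # FP16/BF16 KV cache
prompt_tokens = 1_000_000

# 2x for keys and values
kv_bytes = 2 * layers * kv_heads * head_dim * bytes_per_elem * prompt_tokens
print(f"KV cache: ~{kv_bytes / 1e9:.0f} GB")

link_gbps = 800          # assumed per-NIC link speed in Gb/s
transfer_s = kv_bytes * 8 / (link_gbps * 1e9)
print(f"Transfer over one {link_gbps} Gb/s link: ~{transfer_s:.1f} s")
```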

Target applications

Ian suggested the use case would be a disaggregated prefill/decode flow (sketched in code after this list):

  1. Perform prefill on CPX nodes in one part of a datacenter.
  2. As soon as the first token is ready, ship all keys and values for the prompt to a non-CPX node (presumably via InfiniBand, as these CPX nodes are not on the same NVLink domain as the HBM nodes)
  3. HBM GPUs begin decode using the computed key and value vectors.
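
A minimal sketch of that flow, using toy stand-in classes rather than any NVIDIA or framework API:

```python
# Minimal sketch of disaggregated prefill/decode serving. The classes below
# are toy stand-ins, not NVIDIA or framework APIs; the point is how the KV
# cache flows from a CPX node to an HBM node.

class CPXNode:
    """Compute-dense prefill node (GDDR7)."""
    def prefill(self, prompt_tokens):
        # Process the whole prompt in parallel: build one KV entry per token
        # and emit the first output token (placeholder "model" below).
        kv_cache = [("kv", t) for t in prompt_tokens]
        first_token = sum(prompt_tokens) % 100
        return kv_cache, first_token

class HBMNode:
    """Bandwidth-rich decode node (HBM)."""
    def __init__(self):
        self.kv_cache = None
    def receive_kv_cache(self, kv_cache):
        # In the real system this transfer happens over InfiniBand/RDMA.
        self.kv_cache = list(kv_cache)
    def decode_step(self, token):
        # One token per step: append to the KV cache, emit the next token.
        self.kv_cache.append(("kv", token))
        return (token + 7) % 100

def serve_request(prompt_tokens, cpx_node, hbm_node, max_new_tokens=8):
    kv_cache, token = cpx_node.prefill(prompt_tokens)  # 1. prefill on CPX
    hbm_node.receive_kv_cache(kv_cache)                # 2. ship KV cache
    output = [token]
    for _ in range(max_new_tokens - 1):                # 3. decode on HBM
        token = hbm_node.decode_step(token)
        output.append(token)
    return output

print(serve_request([3, 1, 4, 1, 5], CPXNode(), HBMNode()))
```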

Industries

NVIDIA touted a few AI companies as launch partners:

  • Code generation: Cursor, Magic
  • Inferencing-as-a-Service platforms: Fireworks AI and together.ai are trial customers, citing the need for huge context windows (1M-100M tokens) to ingest entire codebases for code-generation applications.
  • Creative/media generation: Runway

Pricing?

I’m not sure what pricing to expect for this GPU. It doesn’t have HBM, which is a significant cost and complexity savings, but prefill is often the most expensive part of LLM inferencing. Cutting the time spent in prefill (and thereby increasing overall inferencing throughput) seems like a high-value proposition that reduces the number of expensive Rubin HBM GPUs required.

That said, prefill only dominates cost for very long prompts, which may limit CPX’s overall value.
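
A rough model of where that crossover might sit; every number below (model size, sustained compute, memory bandwidth, output length) is an assumption for illustration:

```python
# Rough comparison of prefill vs. decode share of GPU time as prompts grow.
# All numbers (model size, hardware rates, output length) are assumptions.

params = 70e9                 # hypothetical 70B dense model
flops_per_token = 2 * params  # forward-pass FLOPs per token (ignores attention)

compute_flops = 1e16          # 10 PFLOP/s sustained prefill compute (assumption)
mem_bw = 8e12                 # 8 TB/s HBM bandwidth for decode (assumption)
weight_bytes = params * 0.5   # NVFP4 weights: 4 bits per parameter

output_tokens = 1_000         # fixed-length response (assumption)
decode_s = output_tokens * weight_bytes / mem_bw   # bandwidth-bound decode time

for prompt_tokens in (1_000, 10_000, 100_000, 1_000_000):
    prefill_s = prompt_tokens * flops_per_token / compute_flops
    share = prefill_s / (prefill_s + decode_s)
    print(f"{prompt_tokens:>9,} prompt tokens: prefill is ~{share:5.1%} of GPU time")
```

Under these assumptions, prefill is a rounding error at a few thousand prompt tokens but the majority of GPU time at a million.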

Footnotes

  1. https://www.nvidia.com/en-us/events/ai-infra-summit/