
Gemma-4 v23 on the DGX Spark

New vLLM v0.23.0 runs for Gemma-4 on the DGX Spark: BF16, NVFP4 and MTP compared across decode, TTFT, tails and practical local-agent limits.

Written by Django de Vreng

NVFP4 is still the practical default for Gemma-4 on the DGX Spark, but MTP is now the interesting middle position. In the new vLLM v0.23.0 runs, NVFP4 still leads on chat and multi-turn, while MTP clearly moves past the BF16 run without switching to NVIDIA’s re-quant.

I reran the same Gemma-4-26B-A4B family on the DGX Spark, now with vllm/vllm-openai:v0.23.0-aarch64-cu129-ubuntu2404. The raw data lives in the benchmark repo at commit 605faab6a599. The Arena now has three new entries: BF16 v23, MTP v23 and NVFP4 v23.

The earlier Gemma post was mostly about the price of context in BF16. This run answers a different question: what changes when the same machine, the same model family and the same workloads run on vLLM v0.23.0, with three serving profiles side by side?

The setup that stayed the same

All three runs use the same machine and benchmark shape:

| Component | Value |
| --- | --- |
| Hardware | DGX Spark NVIDIA GB10, 128 GB unified memory |
| vLLM image | vllm/vllm-openai:v0.23.0-aarch64-cu129-ubuntu2404 |
| KV-cache | fp8 |
| Prefix caching | off |
| Max model length | 131072 |
| Benchmark commit | 605faab6a599 |

The three profiles:

| Profile | Model | Served name | Generated |
| --- | --- | --- | --- |
| BF16 v23 | google/gemma-4-26B-A4B-it | gemma-4-26b-a4b | 2026-06-22T23:16:36+02:00 |
| MTP v23 | google/gemma-4-26B-A4B-it | gemma-4-26b-a4b-mtp | 2026-06-23T03:29:52+02:00 |
| NVFP4 v23 | nvidia/Gemma-4-26B-A4B-NVFP4 | gemma-4-26b-a4b-nvfp4 | 2026-06-23T01:35:33+02:00 |

MTP uses the same Google model path as BF16 but is served with the MTP profile. NVFP4 uses the NVIDIA re-quant. That distinction matters, because otherwise you quietly compare two things at once: engine behavior and model artifact.
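The shared settings from the setup table map onto vLLM server options roughly as follows. This is a minimal sketch, not the launch command used for these runs: the helper name `serve_args` is hypothetical, the real scripts live in the benchmark repo, and the extra speculative-decoding settings that the MTP profile needs are not listed in this post, so they are left out.

```python
def serve_args(model: str, served_name: str) -> list[str]:
    """Build vLLM server arguments for the shared settings in the setup table.

    Hypothetical helper for illustration only. MTP-specific options are
    intentionally omitted because the post does not state them.
    """
    return [
        "--model", model,
        "--served-model-name", served_name,
        "--kv-cache-dtype", "fp8",       # KV-cache: fp8
        "--no-enable-prefix-caching",    # Prefix caching: off
        "--max-model-len", "131072",     # Max model length: 131072
    ]


# Model path and served name taken from the profile table.
nvfp4_args = serve_args("nvidia/Gemma-4-26B-A4B-NVFP4", "gemma-4-26b-a4b-nvfp4")
print(" ".join(nvfp4_args))
```

Keeping the shared options in one place makes it harder for the three profiles to drift apart in anything other than the model artifact and the serving profile.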

Chat: NVFP4 leads, MTP catches BF16

The first useful comparison is Run C: 1024 prompt tokens, 1024 output tokens, ten concurrent requests. That is a clean chat shape: not trivially short, not a context monster either.

| Profile | TTFT c10 | Decode/user c10 | Total decode c10 |
| --- | --- | --- | --- |
| BF16 v23 | 1342.98 ± 449.90 ms | 11.47 ± 0.45 tok/s | 90.83 ± 7.87 tok/s |
| MTP v23 | 1400.13 ± 142.07 ms | 17.79 ± 1.55 tok/s | 138.97 ± 6.68 tok/s |
| NVFP4 v23 | 1138.26 ± 385.15 ms | 21.59 ± 0.98 tok/s | 151.22 ± 15.96 tok/s |

This is the core. MTP gives roughly 55 percent more per-user decode than BF16 on this chat run. NVFP4 is still above that, but the gap between MTP and NVFP4 is much smaller than the gap between BF16 and MTP.
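The percentages follow directly from the per-user decode column. A small sketch of the arithmetic, using only the Run C values from the table above; the function name `gain_percent` is just an illustrative choice.

```python
# Per-user decode at c10 for Run C, in tok/s, copied from the table above.
decode_per_user = {"BF16 v23": 11.47, "MTP v23": 17.79, "NVFP4 v23": 21.59}


def gain_percent(new: float, base: float) -> float:
    """Relative gain of `new` over `base`, in percent."""
    return (new / base - 1.0) * 100.0


mtp_vs_bf16 = gain_percent(decode_per_user["MTP v23"], decode_per_user["BF16 v23"])
nvfp4_vs_mtp = gain_percent(decode_per_user["NVFP4 v23"], decode_per_user["MTP v23"])

print(f"MTP over BF16:  {mtp_vs_bf16:.1f}%")   # about 55%
print(f"NVFP4 over MTP: {nvfp4_vs_mtp:.1f}%")  # about 21%
```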

The latency to first token stays in the same range. NVFP4 is fastest here, and MTP is not faster in TTFT than BF16. That fits the pattern: these profiles mostly affect decode throughput. Prefill is still work.

Multi-turn is where NVFP4 opens up

Run E is the most production-shaped closed-loop test for me: five turns per conversation, ten conversations in parallel, 2048 starting tokens and 512 output tokens per turn.

| Profile | TTFT c10 | Decode/user c10 | Total decode c10 |
| --- | --- | --- | --- |
| BF16 v23 | 2154.60 ± 858.63 ms | 10.69 ± 0.25 tok/s | 98.35 ± 3.95 tok/s |
| MTP v23 | 2368.00 ± 789.47 ms | 16.57 ± 1.32 tok/s | 143.47 ± 4.67 tok/s |
| NVFP4 v23 | 1966.10 ± 735.30 ms | 20.01 ± 0.80 tok/s | 182.90 ± 6.67 tok/s |

This is where NVFP4 feels right. 182.90 tok/s total for ten multi-turn conversations on a Spark is not a demo number, it is a usable local inference profile.

MTP stays useful. Not as the winner, but as an answer to: what if I want to keep serving the Google BF16 model artifact and still get more decode? Then 16.57 tok/s per user is a big difference from 10.69.

Long output: more tokens, not automatically more pain

For agents and code generation, Run G matters: 256 prompt tokens, 4096 output tokens, ten concurrent requests. This shape tells you whether long generations make the machine collapse.

| Profile | TTFT c10 | Decode/user c10 | Total decode c10 |
| --- | --- | --- | --- |
| BF16 v23 | 490.95 ± 4.88 ms | 12.47 ± 0.94 tok/s | 87.16 ± 3.88 tok/s |
| MTP v23 | 564.16 ± 14.86 ms | 17.67 ± 1.92 tok/s | 127.52 ± 9.05 tok/s |
| NVFP4 v23 | 368.83 ± 54.97 ms | 23.69 ± 1.65 tok/s | 120.96 ± 50.17 tok/s |

Notice the odd shape: NVFP4 has the highest per-user decode, but its total decode is lower than MTP's and has far more spread (± 50.17 versus ± 9.05 tok/s). MTP is lower per user, but more stable in this specific run. So I would not only look at the tallest bar here. For agents you also want predictability, especially when multiple runs keep streaming for a long time.
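One way to make "predictability" comparable across profiles is to express the spread as a share of the mean. A minimal sketch using the Run G total-decode column; it assumes the ± value is a standard deviation, which this post does not state explicitly.

```python
# Total decode at c10 for Run G as (mean, spread) in tok/s, from the table above.
total_decode = {
    "BF16 v23": (87.16, 3.88),
    "MTP v23": (127.52, 9.05),
    "NVFP4 v23": (120.96, 50.17),
}


def relative_spread(mean: float, spread: float) -> float:
    """Spread as a percentage of the mean (coefficient of variation if the
    spread is a standard deviation, which is an assumption here)."""
    return spread / mean * 100.0


spread_pct = {name: relative_spread(m, s) for name, (m, s) in total_decode.items()}
for name, pct in spread_pct.items():
    print(f"{name}: {pct:.1f}% relative spread")
```

On these numbers BF16 and MTP stay in the single digits, while NVFP4 lands above 40 percent, which is the instability the table hints at.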

25k context is still the wall

Quantization and MTP do not change the fact that large context is mostly prefill. At 25k prompt tokens and c10, it looks like this:

| Profile | TTFT c10 | Decode/user c10 | Total decode c10 |
| --- | --- | --- | --- |
| BF16 v23 | 39281.43 ± 20075.74 ms | 5.28 ± 2.13 tok/s | 28.49 ± 0.62 tok/s |
| MTP v23 | 45640.37 ± 23247.85 ms | 6.05 ± 3.24 tok/s | 27.62 ± 0.27 tok/s |
| NVFP4 v23 | 38575.15 ± 19624.30 ms | 7.40 ± 4.24 tok/s | 33.54 ± 0.03 tok/s |

This is no longer chat. At ten concurrent 25k prompts, you wait around 39 to 46 seconds on average for the first token. NVFP4 helps decode a little, but the user mostly feels an empty window before the stream starts.

That is the same lesson as in the earlier Gemma-4 benchmark post, now with vLLM v0.23.0 added: context is not a free input box. If you make a local agent carry 25k tokens around, you pay for it in TTFT.
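To put a number on that price, the mean TTFT at c10 can be compared between the 1024-token chat run (Run C) and the 25k run. A rough sketch using only values from the tables in this post; the two runs may differ in more than prompt length, so the ratio is an indication, not a clean prefill measurement.

```python
# Mean TTFT at c10 in milliseconds, copied from the Run C and 25k tables.
ttft_ms = {
    "BF16 v23": {"run_c_1024": 1342.98, "ctx_25k": 39281.43},
    "MTP v23": {"run_c_1024": 1400.13, "ctx_25k": 45640.37},
    "NVFP4 v23": {"run_c_1024": 1138.26, "ctx_25k": 38575.15},
}

# How many times longer the first token takes at 25k than at 1024 prompt tokens.
ttft_ratio = {
    name: values["ctx_25k"] / values["run_c_1024"]
    for name, values in ttft_ms.items()
}

for name, ratio in ttft_ratio.items():
    seconds = ttft_ms[name]["ctx_25k"] / 1000.0
    print(f"{name}: {seconds:.1f} s to first token, {ratio:.1f}x the chat run")
```

All three profiles end up at roughly 29 to 34 times the chat-run TTFT, which is why the serving profile alone does not rescue the 25k case.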

Open-loop: the office shape remains usable

The open-loop tests matter more for feel than the closed-loop tables. They dispatch requests according to an arrival pattern instead of starting everything at once.
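An arrival pattern of this kind can be sketched with gamma-distributed gaps between requests. This is an assumption about the mechanics, not a copy of the benchmark code: it follows the scheme vLLM's serving benchmark uses for a request rate plus a burstiness factor, it treats the rate as requests per second, and the function name `arrival_times` is hypothetical.

```python
import random


def arrival_times(n: int, rate: float, burstiness: float, seed: int = 0) -> list[float]:
    """Return n dispatch times in seconds, starting at 0.

    Gaps are gamma-distributed with shape = burstiness and
    scale = 1 / (rate * burstiness), so the mean gap is always 1 / rate.
    burstiness = 1 gives Poisson arrivals; lower values give burstier traffic.
    """
    rng = random.Random(seed)
    scale = 1.0 / (rate * burstiness)
    now = 0.0
    times = []
    for _ in range(n):
        times.append(now)
        now += rng.gammavariate(burstiness, scale)
    return times


# Example parameters matching the office baseline: 200 prompts, rate 0.3, burstiness 0.7.
times = arrival_times(200, rate=0.3, burstiness=0.7)
mean_gap = times[-1] / (len(times) - 1)
print(f"last request dispatched after {times[-1]:.0f} s, mean gap {mean_gap:.2f} s")
```

The point of the open-loop shape is visible in the code: requests are sent on the clock, whether or not earlier ones have finished, so a slow server builds a queue instead of throttling its own load.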

H: office baseline

200 random prompts, request rate 0.3, burstiness 0.7.

| Profile | OK | Output tok/s | P95 TTFT | P95 TPOT |
| --- | --- | --- | --- | --- |
| BF16 v23 | 200/200 | 129.92 | 2835.43 ms | 197.57 ms |
| MTP v23 | 200/200 | 132.35 | 3394.53 ms | 178.77 ms |
| NVFP4 v23 | 200/200 | 139.05 | 2393.78 ms | 77.98 ms |

NVFP4 is clearly nicer here, but not because of much higher output throughput: 139.05 versus 129.92 tok/s is not a revolution. The difference is TPOT, the time per output token: 77.98 ms p95 versus 197.57 ms for BF16. The stream feels much faster once it starts.
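Because TPOT is the time per output token, its inverse is the streaming speed a single user sees. A small conversion using the p95 values from table H; since these are p95 figures, the result is the speed that 95 percent of requests meet or exceed, not the typical one.

```python
# P95 TPOT in milliseconds for run H, copied from the table above.
p95_tpot_ms = {"BF16 v23": 197.57, "MTP v23": 178.77, "NVFP4 v23": 77.98}


def stream_tokens_per_second(tpot_ms: float) -> float:
    """Convert time per output token (ms) into tokens per second for one stream."""
    return 1000.0 / tpot_ms


stream_speed = {name: stream_tokens_per_second(ms) for name, ms in p95_tpot_ms.items()}
for name, speed in stream_speed.items():
    print(f"{name}: at least {speed:.1f} tok/s for 95% of requests")
```

That works out to about 5 tok/s for BF16 against almost 13 tok/s for NVFP4 at the tail, which is the difference a reader notices while text is streaming.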

I: ShareGPT replay

250 real conversations, same request rate.

| Profile | OK | Output tok/s | P95 TTFT | P95 TPOT |
| --- | --- | --- | --- | --- |
| BF16 v23 | 250/250 | 60.93 | 456.10 ms | 115.31 ms |
| MTP v23 | 250/250 | 61.47 | 576.82 ms | 77.32 ms |
| NVFP4 v23 | 250/250 | 61.99 | 225.09 ms | 45.30 ms |

This is the best proxy for normal chat. Short, real conversations. NVFP4 gives p95 TTFT of 225.09 ms and p95 TPOT of 45.30 ms. Locally, that does not feel like a compromise.

J: Monday morning peak

300 random prompts, target 1.5 rps, max concurrency 25.

| Profile | OK | Output tok/s | P95 TTFT | P95 TPOT |
| --- | --- | --- | --- | --- |
| BF16 v23 | 300/300 | 132.04 | 3006.73 ms | 199.23 ms |
| MTP v23 | 300/300 | 172.32 | 3870.47 ms | 235.91 ms |
| NVFP4 v23 | 300/300 | 218.90 | 2390.17 ms | 124.58 ms |

Under overload, NVFP4 also stays the most usable. Every request succeeds, but the queue decides who feels the pain. BF16 and MTP produce less friendly tails here. MTP has more output throughput than BF16, but worse p95 TTFT and p95 TPOT. That is exactly why I want percentiles, not only tokens per second.
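The MTP trade-off in run J is easiest to see when all three metrics are expressed relative to BF16. A short sketch using only the table values above; the helper name `change_vs_bf16` is illustrative.

```python
# Run J values from the table above: (output tok/s, p95 TTFT ms, p95 TPOT ms).
run_j = {
    "BF16 v23": (132.04, 3006.73, 199.23),
    "MTP v23": (172.32, 3870.47, 235.91),
    "NVFP4 v23": (218.90, 2390.17, 124.58),
}
metrics = ("output tok/s", "p95 TTFT", "p95 TPOT")


def change_vs_bf16(profile: str) -> dict[str, float]:
    """Percent change of each metric relative to the BF16 v23 row."""
    base = run_j["BF16 v23"]
    return {
        metric: (value / ref - 1.0) * 100.0
        for metric, value, ref in zip(metrics, run_j[profile], base)
    }


mtp_change = change_vs_bf16("MTP v23")
nvfp4_change = change_vs_bf16("NVFP4 v23")
for metric in metrics:
    print(f"{metric}: MTP {mtp_change[metric]:+.1f}%, NVFP4 {nvfp4_change[metric]:+.1f}%")
```

For throughput a positive change is good; for the two latency percentiles it is bad. MTP is positive on all three, so it gains throughput and loses on both tails, while NVFP4 gains throughput and lowers both tails.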

What I put into the Arena

I added three new Arena entries instead of overwriting the old Gemma-4 entries. The old v0.20.1 runs remain useful as historical comparison points. These new entries are explicitly labeled v23: BF16 v23, MTP v23 and NVFP4 v23.

The short ranking for my own use:

  1. NVFP4 v23 for local chat, agents and office load.
  2. MTP v23 if you want to keep the Google model artifact but BF16 decode is too slow.
  3. BF16 v23 as a control line and for comparisons where precision matters more than serving speed.

For 25k context, none of the three solves the real problem. There you work on prompt budget, retrieval, memory compaction and agent architecture. Not on hoping a serving profile makes the wait disappear.
