Gemma-4 v23 on the DGX Spark

NVFP4 is still the practical default for Gemma-4 on the DGX Spark, but MTP is now the interesting middle position. In the new vLLM v0.23.0 runs, NVFP4 still leads on chat and multi-turn, while MTP clearly moves past the BF16 run without switching to NVIDIA’s re-quant.

I reran the same Gemma-4-26B-A4B family on the DGX Spark, now with vllm/vllm-openai:v0.23.0-aarch64-cu129-ubuntu2404. The raw data lives in the benchmark repo at commit 605faab6a599. The Arena now has three new entries: BF16 v23, MTP v23 and NVFP4 v23.

The earlier Gemma post was mostly about the price of context in BF16. This run answers a different question: what changes when the same machine, the same model family and the same workloads run on vLLM v0.23.0, with three serving profiles side by side?

The setup that stayed the same

All three runs use the same machine and benchmark shape:

Component	Value
Hardware	DGX Spark NVIDIA GB10, 128 GB unified memory
vLLM image	`vllm/vllm-openai:v0.23.0-aarch64-cu129-ubuntu2404`
KV-cache	`fp8`
Prefix caching	off
Max model length	131072
Benchmark commit	`605faab6a599`

The three profiles:

Profile	Model	Served name	Generated
BF16 v23	`google/gemma-4-26B-A4B-it`	`gemma-4-26b-a4b`	2026-06-22T23:16:36+02:00
MTP v23	`google/gemma-4-26B-A4B-it`	`gemma-4-26b-a4b-mtp`	2026-06-23T03:29:52+02:00
NVFP4 v23	`nvidia/Gemma-4-26B-A4B-NVFP4`	`gemma-4-26b-a4b-nvfp4`	2026-06-23T01:35:33+02:00

MTP uses the same Google model path as BF16, but served with the MTP profile. NVFP4 uses the NVIDIA re-quant. That distinction matters, because otherwise you quietly compare two things at once: engine behavior and model artifact.
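To make the "two things at once" point concrete, here is a minimal sketch that writes the three profiles down as data and renders the shared settings from the setup table as server arguments. It assumes `vllm serve` as the entry point and the standard vLLM flags `--served-model-name`, `--kv-cache-dtype`, `--max-model-len` and `--no-enable-prefix-caching`. The names `Profile`, `serve_args` and `same_artifact` are made up for illustration, and the MTP-specific speculative decoding settings are left out because the post does not list them.

```python
from dataclasses import dataclass

# Shared settings, taken from the setup table above.
SHARED_ARGS = [
    "--kv-cache-dtype", "fp8",
    "--max-model-len", "131072",
    "--no-enable-prefix-caching",
]


@dataclass(frozen=True)
class Profile:
    label: str
    model: str
    served_name: str


PROFILES = [
    Profile("BF16 v23", "google/gemma-4-26B-A4B-it", "gemma-4-26b-a4b"),
    Profile("MTP v23", "google/gemma-4-26B-A4B-it", "gemma-4-26b-a4b-mtp"),
    Profile("NVFP4 v23", "nvidia/Gemma-4-26B-A4B-NVFP4", "gemma-4-26b-a4b-nvfp4"),
]


def serve_args(profile):
    """Build an argument list for `vllm serve` (assumed entry point).

    MTP-specific flags are intentionally omitted: the post does not state them.
    """
    return [
        "vllm", "serve", profile.model,
        "--served-model-name", profile.served_name,
        *SHARED_ARGS,
    ]


def same_artifact(a, b):
    """True when two profiles serve the same model weights."""
    return a.model == b.model


if __name__ == "__main__":
    for p in PROFILES:
        print(p.label, " ".join(serve_args(p)))
```

With this layout, `same_artifact(PROFILES[0], PROFILES[1])` is true (BF16 versus MTP isolates the serving change), while any comparison against NVFP4 also swaps the model artifact.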

Chat: NVFP4 leads, MTP catches BF16

The first useful comparison is Run C: 1024 prompt tokens, 1024 output tokens, ten concurrent requests. That is a clean chat shape: not trivially short, not a context monster either.

Profile	TTFT c10	Decode/user c10	Total decode c10
BF16 v23	1342.98 ± 449.90 ms	11.47 ± 0.45 tok/s	90.83 ± 7.87 tok/s
MTP v23	1400.13 ± 142.07 ms	17.79 ± 1.55 tok/s	138.97 ± 6.68 tok/s
NVFP4 v23	1138.26 ± 385.15 ms	21.59 ± 0.98 tok/s	151.22 ± 15.96 tok/s

This is the core. MTP gives roughly 55 percent more per-user decode than BF16 on this chat run (17.79 versus 11.47 tok/s). NVFP4 is still above that at 21.59 tok/s, but the 3.80 tok/s gap between MTP and NVFP4 is smaller than the 6.32 tok/s gap between BF16 and MTP.
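The percentages follow directly from the Run C table. A small sketch of the arithmetic, using only the mean per-user decode values from that table (the helper name `gain_percent` is mine):

```python
# Mean per-user decode at c10 from the Run C table, in tok/s.
DECODE_PER_USER = {
    "BF16 v23": 11.47,
    "MTP v23": 17.79,
    "NVFP4 v23": 21.59,
}


def gain_percent(new, base):
    """Relative gain of `new` over `base`, in percent."""
    return (new / base - 1.0) * 100.0


mtp_over_bf16 = gain_percent(DECODE_PER_USER["MTP v23"], DECODE_PER_USER["BF16 v23"])
nvfp4_over_mtp = gain_percent(DECODE_PER_USER["NVFP4 v23"], DECODE_PER_USER["MTP v23"])
nvfp4_over_bf16 = gain_percent(DECODE_PER_USER["NVFP4 v23"], DECODE_PER_USER["BF16 v23"])

if __name__ == "__main__":
    print(f"MTP over BF16:   {mtp_over_bf16:.1f}%")    # 55.1%
    print(f"NVFP4 over MTP:  {nvfp4_over_mtp:.1f}%")   # 21.4%
    print(f"NVFP4 over BF16: {nvfp4_over_bf16:.1f}%")  # 88.2%
```

These are ratios of means only. They ignore the ± spread in the table, so treat them as rough sizes, not precise speedups.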

Time to first token stays in the same range for all three. NVFP4 has the lowest mean here and MTP is no faster than BF16, but the spread in the table is large next to those differences. That fits the pattern: these profiles mostly affect decode throughput. Prefill is still work.

Multi-turn is where NVFP4 opens up

Run E is the most production-shaped closed-loop test for me: five turns per conversation, ten conversations in parallel, 2048 starting tokens and 512 output tokens per turn.

Profile	TTFT c10	Decode/user c10	Total decode c10
BF16 v23	2154.60 ± 858.63 ms	10.69 ± 0.25 tok/s	98.35 ± 3.95 tok/s
MTP v23	2368.00 ± 789.47 ms	16.57 ± 1.32 tok/s	143.47 ± 4.67 tok/s
NVFP4 v23	1966.10 ± 735.30 ms	20.01 ± 0.80 tok/s	182.90 ± 6.67 tok/s

This is where NVFP4 feels right. 182.90 tok/s total for ten multi-turn conversations on a Spark is not a demo number, it is a usable local inference profile.

MTP stays useful. Not as the winner, but as an answer to: what if I want to keep serving the Google BF16 model artifact and still get more decode? Then 16.57 tok/s per user is a big difference from 10.69.
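To get a feel for what those per-user rates mean for a single turn, here is a rough back-of-the-envelope estimate. It assumes a turn takes mean TTFT plus 512 output tokens at the mean per-user decode rate from the Run E table. That is a simplification: it ignores spread and any rate change during the stream, and the function name `turn_seconds` is hypothetical.

```python
# (mean TTFT in ms, mean per-user decode in tok/s) from the Run E table.
RUN_E = {
    "BF16 v23": (2154.60, 10.69),
    "MTP v23": (2368.00, 16.57),
    "NVFP4 v23": (1966.10, 20.01),
}

OUTPUT_TOKENS_PER_TURN = 512  # Run E output length per turn


def turn_seconds(ttft_ms, decode_tok_s, output_tokens=OUTPUT_TOKENS_PER_TURN):
    """Estimated wall time for one turn: wait for first token, then stream."""
    return ttft_ms / 1000.0 + output_tokens / decode_tok_s


ESTIMATES = {name: turn_seconds(ttft, rate) for name, (ttft, rate) in RUN_E.items()}

if __name__ == "__main__":
    for name, seconds in ESTIMATES.items():
        print(f"{name}: ~{seconds:.1f} s per turn")
```

Under these assumptions a turn takes roughly 50 seconds on BF16, about 33 seconds on MTP and under 28 seconds on NVFP4. Almost all of that is decode time, which is why TTFT differences of a few hundred milliseconds barely register here.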

Long output: more tokens, not automatically more pain

For agents and code generation, Run G matters: 256 prompt tokens, 4096 output tokens, ten concurrent requests. This shape tells you whether long generations make the machine collapse.

Profile	TTFT c10	Decode/user c10	Total decode c10
BF16 v23	490.95 ± 4.88 ms	12.47 ± 0.94 tok/s	87.16 ± 3.88 tok/s
MTP v23	564.16 ± 14.86 ms	17.67 ± 1.92 tok/s	127.52 ± 9.05 tok/s
NVFP4 v23	368.83 ± 54.97 ms	23.69 ± 1.65 tok/s	120.96 ± 50.17 tok/s

Notice the odd shape: NVFP4 has the highest per-user decode, but its total decode has far more spread (± 50.17 tok/s, against ± 9.05 for MTP), and its mean total lands slightly below MTP's. MTP is lower per user, but more stable in this specific run. So I would not only look at the tallest bar here. For agents you also want predictability, especially when multiple runs keep streaming for a long time.
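One way to put a number on "spread" is the coefficient of variation: the ± value divided by the mean. The sketch below applies it to the total decode column of the Run G table. I am assuming the ± values are standard deviations; the post does not say so explicitly.

```python
# (mean, spread) of total decode at c10 from the Run G table, in tok/s.
RUN_G_TOTAL_DECODE = {
    "BF16 v23": (87.16, 3.88),
    "MTP v23": (127.52, 9.05),
    "NVFP4 v23": (120.96, 50.17),
}


def coefficient_of_variation(mean, spread):
    """Relative spread: how large the ± is compared to the mean."""
    return spread / mean


CV = {
    name: coefficient_of_variation(mean, spread)
    for name, (mean, spread) in RUN_G_TOTAL_DECODE.items()
}

if __name__ == "__main__":
    for name, cv in CV.items():
        print(f"{name}: {cv * 100:.1f}% relative spread")
```

NVFP4 comes out above 40 percent relative spread on this run, while BF16 and MTP stay in single digits. That is the predictability gap in one figure.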

25k context is still the wall

Quantization and MTP do not change the fact that large context is mostly prefill. At 25k prompt tokens and c10, it looks like this:

Profile	TTFT c10	Decode/user c10	Total decode c10
BF16 v23	39281.43 ± 20075.74 ms	5.28 ± 2.13 tok/s	28.49 ± 0.62 tok/s
MTP v23	45640.37 ± 23247.85 ms	6.05 ± 3.24 tok/s	27.62 ± 0.27 tok/s
NVFP4 v23	38575.15 ± 19624.30 ms	7.40 ± 4.24 tok/s	33.54 ± 0.03 tok/s

This is no longer chat. At ten concurrent 25k prompts, you wait around 39 to 46 seconds on average for the first token. NVFP4 helps decode a little, but the user mostly feels an empty window before the stream starts.

That is the same lesson as in the earlier Gemma-4 benchmark post, now with vLLM v0.23.0 added: context is not a free input box. If you make a local agent carry 25k tokens around, you pay for it in TTFT.

Open-loop: the office shape remains usable

The open-loop tests matter more for feel than the closed-loop tables. They dispatch requests according to an arrival pattern instead of starting everything at once.
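A minimal sketch of what such an arrival pattern can look like, assuming gaps between requests are drawn from a gamma distribution with shape equal to the burstiness and a mean gap of one over the request rate. This is a common way to model it and, as far as I know, how vLLM's serving benchmark treats these two parameters. The post does not say which generator produced these runs, so take the function `arrival_times` as an illustration, not the benchmark's code.

```python
import random


def arrival_times(n_requests, request_rate, burstiness, seed=0):
    """Return dispatch times in seconds for an open-loop run.

    Gaps are gamma distributed with shape `burstiness` and scale
    1 / (request_rate * burstiness), so the mean gap is 1 / request_rate.
    burstiness == 1 gives Poisson arrivals; values below 1 are burstier.
    """
    rng = random.Random(seed)
    scale = 1.0 / (request_rate * burstiness)
    times = []
    now = 0.0
    for _ in range(n_requests):
        times.append(now)
        now += rng.gammavariate(burstiness, scale)
    return times


if __name__ == "__main__":
    # Parameters as in Run H: 200 prompts, request rate 0.3, burstiness 0.7.
    schedule = arrival_times(200, request_rate=0.3, burstiness=0.7)
    print(f"first five dispatch times: {[round(t, 2) for t in schedule[:5]]}")
    print(f"last dispatch at ~{schedule[-1]:.0f} s")
```

The point of the shape parameter is that the average load stays the same while requests clump together, which is closer to how people in an office actually hit a shared endpoint.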

H: office baseline

200 random prompts, request rate 0.3, burstiness 0.7.

Profile	OK	Output tok/s	P95 TTFT	P95 TPOT
BF16 v23	200/200	129.92	2835.43 ms	197.57 ms
MTP v23	200/200	132.35	3394.53 ms	178.77 ms
NVFP4 v23	200/200	139.05	2393.78 ms	77.98 ms

NVFP4 is clearly nicer here, though not because of output throughput: 139.05 versus 129.92 tok/s is about 7 percent, not a revolution. The difference is TPOT: 77.98 ms p95 versus 197.57 ms for BF16. The stream feels much faster once it starts.
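TPOT is time per output token, so its reciprocal is the streaming speed a user sees. A quick conversion of the p95 values from the Run H table (the helper name `tpot_to_tok_s` is mine):

```python
# P95 TPOT from the Run H table, in milliseconds per output token.
P95_TPOT_MS = {
    "BF16 v23": 197.57,
    "MTP v23": 178.77,
    "NVFP4 v23": 77.98,
}


def tpot_to_tok_s(tpot_ms):
    """Convert time per output token (ms) into tokens per second."""
    return 1000.0 / tpot_ms


P95_STREAM_SPEED = {name: tpot_to_tok_s(ms) for name, ms in P95_TPOT_MS.items()}

if __name__ == "__main__":
    for name, speed in P95_STREAM_SPEED.items():
        print(f"{name}: ~{speed:.1f} tok/s in the slow tail")
```

So even in the slowest 5 percent of requests, NVFP4 streams at close to 13 tok/s, while BF16 drops to about 5 tok/s. Aggregate output throughput hides that difference.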

I: ShareGPT replay

250 real conversations, same request rate.

Profile	OK	Output tok/s	P95 TTFT	P95 TPOT
BF16 v23	250/250	60.93	456.10 ms	115.31 ms
MTP v23	250/250	61.47	576.82 ms	77.32 ms
NVFP4 v23	250/250	61.99	225.09 ms	45.30 ms

This is the best proxy for normal chat. Short, real conversations. NVFP4 gives p95 TTFT of 225.09 ms and p95 TPOT of 45.30 ms. Locally, that does not feel like a compromise.

J: Monday morning peak

300 random prompts, target 1.5 rps, max concurrency 25.

Profile	OK	Output tok/s	P95 TTFT	P95 TPOT
BF16 v23	300/300	132.04	3006.73 ms	199.23 ms
MTP v23	300/300	172.32	3870.47 ms	235.91 ms
NVFP4 v23	300/300	218.90	2390.17 ms	124.58 ms

Under overload, NVFP4 also stays the most usable. Every request succeeds, but the queue decides who feels the pain. BF16 and MTP produce less friendly tails here. MTP has more output throughput than BF16, but worse p95 TTFT and p95 TPOT. That is exactly why I want percentiles, not only tokens per second.
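To illustrate why the mean alone can mislead, here is a small nearest-rank percentile sketch. The two sample lists are invented for the example and are not benchmark data: both have the same mean TTFT, but only one has a slow tail.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% at or below it."""
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def mean(values):
    return sum(values) / len(values)


# Hypothetical TTFT samples in ms, made up for illustration only.
STEADY = [1000.0] * 20
TAILED = [800.0] * 18 + [2800.0] * 2

if __name__ == "__main__":
    for name, samples in (("steady", STEADY), ("tailed", TAILED)):
        print(f"{name}: mean {mean(samples):.0f} ms, p95 {percentile(samples, 95):.0f} ms")
    # steady: mean 1000 ms, p95 1000 ms
    # tailed: mean 1000 ms, p95 2800 ms
```

Same average, very different experience for the unlucky requests. That is the effect the p95 columns make visible.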

What I put into the Arena

I added three new Arena entries instead of overwriting the old Gemma-4 entries. The old v0.20.1 runs remain useful as historical comparison points. These new entries are explicitly labeled v23: BF16 v23, MTP v23 and NVFP4 v23.

The short ranking for my own use:

NVFP4 v23 for local chat, agents and office load.
MTP v23 if you want to keep the Google model artifact but BF16 decode is too slow.
BF16 v23 as a control line and for comparisons where precision matters more than serving speed.

For 25k context, none of the three solves the real problem. There you work on prompt budget, retrieval, memory compaction and agent architecture. Not on hoping a serving profile makes the wait disappear.
