Gemma-4 v23 on the DGX Spark
New vLLM v0.23.0 runs for Gemma-4 on the DGX Spark: BF16, NVFP4 and MTP compared across decode, TTFT, tails and practical local-agent limits.
NVFP4 is still the practical default for Gemma-4 on the DGX Spark, but MTP is now the interesting middle position. In the new vLLM v0.23.0 runs, NVFP4 still leads on chat and multi-turn, while MTP clearly moves past the BF16 run without switching to NVIDIA’s re-quant.
I reran the same Gemma-4-26B-A4B family on the DGX Spark, now with vllm/vllm-openai:v0.23.0-aarch64-cu129-ubuntu2404. The raw data lives in the benchmark repo at commit 605faab6a599. The Arena now has three new entries: BF16 v23, MTP v23 and NVFP4 v23.
The earlier Gemma post was mostly about the price of context in BF16. This run answers a different question: what changes when the same machine, the same model family and the same workloads run on vLLM v0.23.0, with three serving profiles side by side?
The setup that stayed the same
All three runs use the same machine and benchmark shape:
| Component | Value |
|---|---|
| Hardware | DGX Spark NVIDIA GB10, 128 GB unified memory |
| vLLM image | vllm/vllm-openai:v0.23.0-aarch64-cu129-ubuntu2404 |
| KV-cache | fp8 |
| Prefix caching | off |
| Max model length | 131072 |
| Benchmark commit | 605faab6a599 |
The three profiles:
| Profile | Model | Served name | Generated |
|---|---|---|---|
| BF16 v23 | google/gemma-4-26B-A4B-it | gemma-4-26b-a4b | 2026-06-22T23:16:36+02:00 |
| MTP v23 | google/gemma-4-26B-A4B-it | gemma-4-26b-a4b-mtp | 2026-06-23T03:29:52+02:00 |
| NVFP4 v23 | nvidia/Gemma-4-26B-A4B-NVFP4 | gemma-4-26b-a4b-nvfp4 | 2026-06-23T01:35:33+02:00 |
MTP uses the same Google model path as BF16, but served with the MTP profile. NVFP4 uses the NVIDIA re-quant. That distinction matters, because otherwise you quietly compare two things at once: engine behavior and model artifact.
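To make the shared settings and the per-profile differences explicit, here is a minimal sketch that builds server arguments from the two tables above. The flag names are my assumption of how the table rows map onto the vLLM CLI, and `Profile`, `SHARED_ARGS` and `serve_args` are hypothetical helpers, not the benchmark repo's code. The post does not give the speculative-decoding options of the MTP profile, so they are deliberately left out and the MTP argument list is incomplete.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    name: str
    model: str
    served_name: str


# Settings shared by all three runs (from the setup table).
# Assumption: these are the vLLM CLI flags behind those table rows.
SHARED_ARGS = [
    "--kv-cache-dtype", "fp8",
    "--max-model-len", "131072",
    "--no-enable-prefix-caching",
]

PROFILES = [
    Profile("BF16 v23", "google/gemma-4-26B-A4B-it", "gemma-4-26b-a4b"),
    # MTP needs extra speculative-decoding options that the post does not list.
    Profile("MTP v23", "google/gemma-4-26B-A4B-it", "gemma-4-26b-a4b-mtp"),
    Profile("NVFP4 v23", "nvidia/Gemma-4-26B-A4B-NVFP4", "gemma-4-26b-a4b-nvfp4"),
]


def serve_args(profile: Profile) -> list[str]:
    """Return the argument list for one profile: model identity plus shared settings."""
    return [
        "--model", profile.model,
        "--served-model-name", profile.served_name,
        *SHARED_ARGS,
    ]


for p in PROFILES:
    print(p.name, " ".join(serve_args(p)))
```

Only the model path and served name vary between the lists, which is what keeps the BF16 versus MTP comparison about engine behavior and the NVFP4 comparison about the model artifact.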
Chat: NVFP4 leads, MTP catches BF16
The first useful comparison is Run C: 1024 prompt tokens, 1024 output tokens, ten concurrent requests. That is a clean chat shape: not trivially short, not a context monster either.
| Profile | TTFT c10 | Decode/user c10 | Total decode c10 |
|---|---|---|---|
| BF16 v23 | 1342.98 ± 449.90 ms | 11.47 ± 0.45 tok/s | 90.83 ± 7.87 tok/s |
| MTP v23 | 1400.13 ± 142.07 ms | 17.79 ± 1.55 tok/s | 138.97 ± 6.68 tok/s |
| NVFP4 v23 | 1138.26 ± 385.15 ms | 21.59 ± 0.98 tok/s | 151.22 ± 15.96 tok/s |
This is the core. MTP gives roughly 55 percent more per-user decode than BF16 on this chat run. NVFP4 is still above that, but the gap between MTP and NVFP4 is much smaller than the gap between BF16 and MTP.
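The percentages can be checked directly from the table. A small sketch, using only the per-user decode means from Run C; `relative_gain` is a hypothetical helper name.

```python
# Per-user decode at c10 from the Run C table, in tok/s (means only).
decode_per_user = {"BF16 v23": 11.47, "MTP v23": 17.79, "NVFP4 v23": 21.59}


def relative_gain(new: float, base: float) -> float:
    """Fractional gain of `new` over `base`; 0.25 means 25 percent more."""
    return new / base - 1.0


mtp_vs_bf16 = relative_gain(decode_per_user["MTP v23"], decode_per_user["BF16 v23"])
nvfp4_vs_mtp = relative_gain(decode_per_user["NVFP4 v23"], decode_per_user["MTP v23"])

print(f"MTP over BF16:  {mtp_vs_bf16:.1%}")
print(f"NVFP4 over MTP: {nvfp4_vs_mtp:.1%}")
```

This compares means only and ignores the ± spread, so treat the result as a rough size of the effect, not a significance test.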
Time to first token stays in the same range. NVFP4 is fastest here, and MTP is slightly slower than BF16 (1400.13 ms versus 1342.98 ms). That fits the pattern: these profiles mostly affect decode throughput. Prefill is still work.
Multi-turn is where NVFP4 opens up
Run E is the most production-shaped closed-loop test for me: five turns per conversation, ten conversations in parallel, 2048 starting tokens and 512 output tokens per turn.
| Profile | TTFT c10 | Decode/user c10 | Total decode c10 |
|---|---|---|---|
| BF16 v23 | 2154.60 ± 858.63 ms | 10.69 ± 0.25 tok/s | 98.35 ± 3.95 tok/s |
| MTP v23 | 2368.00 ± 789.47 ms | 16.57 ± 1.32 tok/s | 143.47 ± 4.67 tok/s |
| NVFP4 v23 | 1966.10 ± 735.30 ms | 20.01 ± 0.80 tok/s | 182.90 ± 6.67 tok/s |
This is where NVFP4 feels right. 182.90 tok/s total for ten multi-turn conversations on a Spark is not a demo number; it is a usable local inference profile.
MTP stays useful. Not as the winner, but as an answer to: what if I want to keep serving the Google BF16 model artifact and still get more decode? Then 16.57 tok/s per user is a big difference from 10.69.
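To translate those rates into waiting time, here is a rough per-turn estimate: TTFT plus the time to stream 512 output tokens at the per-user decode rate. It assumes every turn emits the full 512 tokens, that the decode rate is steady over the turn, and that the table means are representative. `turn_seconds` is a hypothetical helper.

```python
OUTPUT_TOKENS_PER_TURN = 512  # from the Run E description

# (mean TTFT in ms, mean per-user decode in tok/s) from the Run E table.
run_e = {
    "BF16 v23": (2154.60, 10.69),
    "MTP v23": (2368.00, 16.57),
    "NVFP4 v23": (1966.10, 20.01),
}


def turn_seconds(ttft_ms: float, decode_tok_s: float,
                 output_tokens: int = OUTPUT_TOKENS_PER_TURN) -> float:
    """Estimated wall time for one turn: wait for first token, then stream the rest."""
    return ttft_ms / 1000.0 + output_tokens / decode_tok_s


estimates = {name: turn_seconds(ttft, rate) for name, (ttft, rate) in run_e.items()}
for name, seconds in estimates.items():
    print(f"{name}: ~{seconds:.0f} s per turn")
```

Under these assumptions a turn takes roughly 50 seconds on BF16, 33 on MTP and 28 on NVFP4. The slightly higher TTFT of MTP is small compared to the decode time it saves.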
Long output: more tokens, not automatically more pain
For agents and code generation, Run G matters: 256 prompt tokens, 4096 output tokens, ten concurrent requests. This shape tells you whether long generations make the machine collapse.
| Profile | TTFT c10 | Decode/user c10 | Total decode c10 |
|---|---|---|---|
| BF16 v23 | 490.95 ± 4.88 ms | 12.47 ± 0.94 tok/s | 87.16 ± 3.88 tok/s |
| MTP v23 | 564.16 ± 14.86 ms | 17.67 ± 1.92 tok/s | 127.52 ± 9.05 tok/s |
| NVFP4 v23 | 368.83 ± 54.97 ms | 23.69 ± 1.65 tok/s | 120.96 ± 50.17 tok/s |
Notice the odd shape: NVFP4 has the highest per-user decode, but its total decode has far more spread (± 50.17 tok/s against ± 9.05 for MTP). MTP is lower per user, but more stable in this specific run. So I would not only look at the tallest bar here. For agents you also want predictability, especially when multiple runs keep streaming for a long time.
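One way to put a number on "spread" is the ratio of the ± value to the mean. A minimal sketch, assuming the ± figures in the table are standard deviations (the post does not say which statistic they are); `relative_spread` is a hypothetical helper.

```python
# (mean, ± value) for total decode at c10 from the Run G table, in tok/s.
run_g_total = {
    "BF16 v23": (87.16, 3.88),
    "MTP v23": (127.52, 9.05),
    "NVFP4 v23": (120.96, 50.17),
}


def relative_spread(mean: float, spread: float) -> float:
    """Spread as a fraction of the mean (coefficient of variation if spread is a std dev)."""
    return spread / mean


spreads = {name: relative_spread(m, s) for name, (m, s) in run_g_total.items()}
for name, value in spreads.items():
    print(f"{name}: {value:.0%} of the mean")
```

On these numbers the NVFP4 spread is around 41 percent of its mean, against about 7 percent for MTP and 4 percent for BF16.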
25k context is still the wall
Quantization and MTP do not change the fact that large context is mostly prefill. At 25k prompt tokens and c10, it looks like this:
| Profile | TTFT c10 | Decode/user c10 | Total decode c10 |
|---|---|---|---|
| BF16 v23 | 39281.43 ± 20075.74 ms | 5.28 ± 2.13 tok/s | 28.49 ± 0.62 tok/s |
| MTP v23 | 45640.37 ± 23247.85 ms | 6.05 ± 3.24 tok/s | 27.62 ± 0.27 tok/s |
| NVFP4 v23 | 38575.15 ± 19624.30 ms | 7.40 ± 4.24 tok/s | 33.54 ± 0.03 tok/s |
This is no longer chat. At ten concurrent 25k prompts, you wait around 39 to 46 seconds on average for the first token. NVFP4 helps decode a little, but the user mostly feels an empty window before the stream starts.
That is the same lesson as in the earlier Gemma-4 benchmark post, now with vLLM v0.23.0 added: context is not a free input box. If you make a local agent carry 25k tokens around, you pay for it in TTFT.
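A quick sanity check on how TTFT grows with prompt length, comparing the 25k run against the 1024-token chat run at the same concurrency. This sketch assumes "25k" means 25,000 prompt tokens, which the post does not spell out, and it only compares means.

```python
CHAT_PROMPT_TOKENS = 1024
LONG_PROMPT_TOKENS = 25_000  # assumption: "25k" taken as exactly 25,000

# Mean TTFT at c10 in ms: (chat run, 25k run), from the two tables.
ttft_ms = {
    "BF16 v23": (1342.98, 39281.43),
    "MTP v23": (1400.13, 45640.37),
    "NVFP4 v23": (1138.26, 38575.15),
}

prompt_ratio = LONG_PROMPT_TOKENS / CHAT_PROMPT_TOKENS
ttft_ratio = {name: long / chat for name, (chat, long) in ttft_ms.items()}

print(f"prompt grows {prompt_ratio:.1f}x")
for name, ratio in ttft_ratio.items():
    print(f"{name}: TTFT grows {ratio:.1f}x")
```

The prompt grows about 24 times while mean TTFT grows about 29 to 34 times in every profile. No serving profile bends that curve, which is the point of this section.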
Open-loop: the office shape remains usable
The open-loop tests matter more for feel than the closed-loop tables. They dispatch requests according to an arrival pattern instead of starting everything at once.
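For readers who have not seen an open-loop load generator, here is a minimal sketch of an arrival schedule driven by a request rate and a burstiness value. It assumes the harness follows the convention of vLLM's own serving benchmark, where inter-arrival gaps are drawn from a gamma distribution with shape equal to the burstiness (1.0 gives Poisson arrivals, lower values are burstier), and that the rate is in requests per second. `arrival_times` is a hypothetical helper, not the benchmark repo's code.

```python
import random


def arrival_times(n_requests: int, request_rate: float, burstiness: float,
                  seed: int = 0) -> list[float]:
    """Return cumulative send times in seconds for an open-loop run.

    Gaps are gamma distributed with shape=burstiness and a scale chosen so
    the mean gap is always 1 / request_rate, regardless of burstiness.
    """
    rng = random.Random(seed)
    scale = 1.0 / (request_rate * burstiness)
    now = 0.0
    times = []
    for _ in range(n_requests):
        now += rng.gammavariate(burstiness, scale)
        times.append(now)
    return times


# Example with a rate of 0.3 and burstiness of 0.7.
schedule = arrival_times(200, request_rate=0.3, burstiness=0.7)
print(f"last request sent after ~{schedule[-1]:.0f} s")
```

The important property is that requests are sent on this schedule whether or not earlier ones have finished, so a slow server builds a queue instead of slowing the load down.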
H: office baseline
200 random prompts, request rate 0.3, burstiness 0.7.
| Profile | OK | Output tok/s | P95 TTFT | P95 TPOT |
|---|---|---|---|---|
| BF16 v23 | 200/200 | 129.92 | 2835.43 ms | 197.57 ms |
| MTP v23 | 200/200 | 132.35 | 3394.53 ms | 178.77 ms |
| NVFP4 v23 | 200/200 | 139.05 | 2393.78 ms | 77.98 ms |
NVFP4 is clearly nicer here, and not because of output throughput: 139.05 versus 129.92 tok/s is only about 7 percent more. The difference is TPOT: 77.98 ms p95 versus 197.57 ms for BF16. The stream feels much faster once it starts.
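TPOT is time per output token, so its inverse is the streaming speed a single user sees. A small sketch converting the p95 TPOT values from the table into the tok/s that the slowest 5 percent of requests fall at or below; `tpot_to_tok_s` is a hypothetical helper.

```python
# P95 TPOT in ms from the office baseline table.
p95_tpot_ms = {"BF16 v23": 197.57, "MTP v23": 178.77, "NVFP4 v23": 77.98}


def tpot_to_tok_s(tpot_ms: float) -> float:
    """Convert time per output token (ms) into tokens per second for one stream."""
    return 1000.0 / tpot_ms


tail_speed = {name: tpot_to_tok_s(ms) for name, ms in p95_tpot_ms.items()}
for name, speed in tail_speed.items():
    print(f"{name}: ~{speed:.1f} tok/s at the p95 tail")
```

That works out to about 5 tok/s at the tail for BF16 and close to 13 tok/s for NVFP4, which is the difference between watching words appear and reading at normal speed.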
I: ShareGPT replay
250 real conversations, same request rate.
| Profile | OK | Output tok/s | P95 TTFT | P95 TPOT |
|---|---|---|---|---|
| BF16 v23 | 250/250 | 60.93 | 456.10 ms | 115.31 ms |
| MTP v23 | 250/250 | 61.47 | 576.82 ms | 77.32 ms |
| NVFP4 v23 | 250/250 | 61.99 | 225.09 ms | 45.30 ms |
This is the best proxy for normal chat. Short, real conversations. NVFP4 gives p95 TTFT of 225.09 ms and p95 TPOT of 45.30 ms. Locally, that does not feel like a compromise.
J: Monday morning peak
300 random prompts, target 1.5 rps, max concurrency 25.
| Profile | OK | Output tok/s | P95 TTFT | P95 TPOT |
|---|---|---|---|---|
| BF16 v23 | 300/300 | 132.04 | 3006.73 ms | 199.23 ms |
| MTP v23 | 300/300 | 172.32 | 3870.47 ms | 235.91 ms |
| NVFP4 v23 | 300/300 | 218.90 | 2390.17 ms | 124.58 ms |
Under overload, NVFP4 also stays the most usable. Every request succeeds, but the queue decides who feels the pain. BF16 and MTP produce less friendly tails here. MTP has more output throughput than BF16, but worse p95 TTFT and p95 TPOT. That is exactly why I want percentiles, not only tokens per second.
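A tiny illustration of why a mean can hide what p95 shows. The latency samples below are made up for the example and are not from the benchmark; `percentile` is a hypothetical nearest-rank helper.

```python
def percentile(values: list[float], p: int) -> float:
    """Nearest-rank percentile for an integer p between 1 and 100."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = (p * len(ordered) + 99) // 100  # ceil(p * n / 100) in integer math
    return ordered[rank - 1]


# Hypothetical TTFT samples in ms: most requests are fast, two are stuck in a queue.
ttft_ms = [100] * 18 + [4000] * 2

mean_ttft = sum(ttft_ms) / len(ttft_ms)
p50_ttft = percentile(ttft_ms, 50)
p95_ttft = percentile(ttft_ms, 95)

print(f"mean={mean_ttft:.0f} ms  p50={p50_ttft} ms  p95={p95_ttft} ms")
```

The mean lands somewhere no request actually experienced, while p50 and p95 show the two groups of users separately. Throughput figures average in the same way, which is why they can rise while the tail gets worse.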
What I put into the Arena
I added three new Arena entries instead of overwriting the old Gemma-4 entries. The old v0.20.1 runs remain useful as historical comparison points. The new entries are explicitly labeled v23: BF16 v23, MTP v23 and NVFP4 v23.
The short ranking for my own use:
- NVFP4 v23 for local chat, agents and office load.
- MTP v23 if you want to keep the Google model artifact but BF16 decode is too slow.
- BF16 v23 as a control line and for comparisons where precision matters more than serving speed.
For 25k context, none of the three solves the real problem. There you work on prompt budget, retrieval, memory compaction and agent architecture. Not on hoping a serving profile makes the wait disappear.