Django de Vreng

Gemma-4 v23 on the DGX Spark

2026-06-23T00:00:00.000Z

NVFP4 is still the practical default for Gemma-4 on the DGX Spark, but MTP is now the interesting middle position. In the new vLLM v0.23.0 runs, NVFP4 still leads on chat and multi-turn, while MTP clearly moves past the BF16 run without switching to NVIDIA's re-quant.

I reran the same Gemma-4-26B-A4B family on the DGX Spark, now with vllm/vllm-openai:v0.23.0-aarch64-cu129-ubuntu2404. The raw data lives in the benchmark repo at commit 605faab6a599. The Arena now has three new entries: BF16 v23, MTP v23 and NVFP4 v23.

The earlier Gemma post was mostly about the price of context in BF16. This run answers a different question: what changes when the same machine, the same model family and the same workloads run on vLLM v0.23.0, with three serving profiles side by side?

The setup that stayed the same

All three runs use the same machine and benchmark shape:

Component	Value
Hardware	DGX Spark NVIDIA GB10, 128 GB unified memory
vLLM image	`vllm/vllm-openai:v0.23.0-aarch64-cu129-ubuntu2404`
KV-cache	`fp8`
Prefix caching	off
Max model length	131072
Benchmark commit	`605faab6a599`

The three profiles:

Profile	Model	Served name	Generated
BF16 v23	`google/gemma-4-26B-A4B-it`	`gemma-4-26b-a4b`	2026-06-22T23:16:36+02:00
MTP v23	`google/gemma-4-26B-A4B-it`	`gemma-4-26b-a4b-mtp`	2026-06-23T03:29:52+02:00
NVFP4 v23	`nvidia/Gemma-4-26B-A4B-NVFP4`	`gemma-4-26b-a4b-nvfp4`	2026-06-23T01:35:33+02:00

MTP uses the same Google model path as BF16, but served with the MTP profile. NVFP4 uses the NVIDIA re-quant. That distinction matters, because otherwise you quietly compare two things at once: engine behavior and model artifact.

Chat: NVFP4 leads, MTP catches BF16

The first useful comparison is Run C: 1024 prompt tokens, 1024 output tokens, ten concurrent requests. That is a clean chat shape: not trivially short, not a context monster either.

Profile	TTFT c10	Decode/user c10	Total decode c10
BF16 v23	1342.98 ± 449.90 ms	11.47 ± 0.45 tok/s	90.83 ± 7.87 tok/s
MTP v23	1400.13 ± 142.07 ms	17.79 ± 1.55 tok/s	138.97 ± 6.68 tok/s
NVFP4 v23	1138.26 ± 385.15 ms	21.59 ± 0.98 tok/s	151.22 ± 15.96 tok/s

This is the core. MTP gives roughly 55 percent more per-user decode than BF16 on this chat run. NVFP4 is still above that, but the gap between MTP and NVFP4 is much smaller than the gap between BF16 and MTP.
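The percentages follow directly from the Run C table. A minimal sketch of that arithmetic in Python; the helper name `relative_gain` is illustrative and not part of the benchmark repo:

```python
def relative_gain(new: float, baseline: float) -> float:
    """Return how much larger `new` is than `baseline`, as a percentage."""
    return (new / baseline - 1.0) * 100.0


# Per-user decode at c10 from the Run C table, in tok/s.
decode_per_user = {"BF16 v23": 11.47, "MTP v23": 17.79, "NVFP4 v23": 21.59}

mtp_vs_bf16 = relative_gain(decode_per_user["MTP v23"], decode_per_user["BF16 v23"])
nvfp4_vs_mtp = relative_gain(decode_per_user["NVFP4 v23"], decode_per_user["MTP v23"])
nvfp4_vs_bf16 = relative_gain(decode_per_user["NVFP4 v23"], decode_per_user["BF16 v23"])

print(f"MTP over BF16:   {mtp_vs_bf16:.0f}%")
print(f"NVFP4 over MTP:  {nvfp4_vs_mtp:.0f}%")
print(f"NVFP4 over BF16: {nvfp4_vs_bf16:.0f}%")
```

The same helper works for any pair of rows in the other tables, as long as both numbers come from the same run and the same concurrency level.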

Time to first token stays in the same range. NVFP4 is fastest here, and MTP is no faster than BF16 on TTFT. That fits the pattern: these profiles mostly affect decode throughput. Prefill is still work.

Multi-turn is where NVFP4 opens up

Run E is the most production-shaped closed-loop test for me: five turns per conversation, ten conversations in parallel, 2048 starting tokens and 512 output tokens per turn.

Profile	TTFT c10	Decode/user c10	Total decode c10
BF16 v23	2154.60 ± 858.63 ms	10.69 ± 0.25 tok/s	98.35 ± 3.95 tok/s
MTP v23	2368.00 ± 789.47 ms	16.57 ± 1.32 tok/s	143.47 ± 4.67 tok/s
NVFP4 v23	1966.10 ± 735.30 ms	20.01 ± 0.80 tok/s	182.90 ± 6.67 tok/s

This is where NVFP4 feels right. 182.90 tok/s total for ten multi-turn conversations on a Spark is not a demo number, it is a usable local inference profile.

MTP stays useful. Not as the winner, but as an answer to: what if I want to keep serving the Google BF16 model artifact and still get more decode? Then 16.57 tok/s per user is a big difference from 10.69.

Long output: more tokens, not automatically more pain

For agents and code generation, Run G matters: 256 prompt tokens, 4096 output tokens, ten concurrent requests. This shape tells you whether long generations make the machine collapse.

Profile	TTFT c10	Decode/user c10	Total decode c10
BF16 v23	490.95 ± 4.88 ms	12.47 ± 0.94 tok/s	87.16 ± 3.88 tok/s
MTP v23	564.16 ± 14.86 ms	17.67 ± 1.92 tok/s	127.52 ± 9.05 tok/s
NVFP4 v23	368.83 ± 54.97 ms	23.69 ± 1.65 tok/s	120.96 ± 50.17 tok/s

Notice the odd shape: NVFP4 has the highest per-user decode, but total decode has much more spread. MTP is lower per user, but stabler in this specific run. So I would not only look at the tallest bar here. For agents you also want predictability, especially when multiple runs keep streaming for a long time.
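One way to put a number on "stabler" is to express the spread as a fraction of the mean. The sketch below does that for the total decode column of Run G. It assumes the ± values in the table are standard deviations, which the table itself does not state; the helper name is illustrative.

```python
# (mean, spread) of total decode at c10 from the Run G table, in tok/s.
# Assumption: the spread after "±" is a standard deviation.
total_decode = {
    "BF16 v23": (87.16, 3.88),
    "MTP v23": (127.52, 9.05),
    "NVFP4 v23": (120.96, 50.17),
}


def relative_spread(mean: float, spread: float) -> float:
    """Spread as a percentage of the mean (coefficient of variation)."""
    return spread / mean * 100.0


spread_pct = {name: relative_spread(m, s) for name, (m, s) in total_decode.items()}

for name, pct in spread_pct.items():
    print(f"{name}: ±{pct:.1f}% of the mean")
```

Under that assumption the NVFP4 spread is a large fraction of its mean, while BF16 and MTP stay in single digits.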

25k context is still the wall

Quantization and MTP do not change the fact that large context is mostly prefill. At 25k prompt tokens and c10, it looks like this:

Profile	TTFT c10	Decode/user c10	Total decode c10
BF16 v23	39281.43 ± 20075.74 ms	5.28 ± 2.13 tok/s	28.49 ± 0.62 tok/s
MTP v23	45640.37 ± 23247.85 ms	6.05 ± 3.24 tok/s	27.62 ± 0.27 tok/s
NVFP4 v23	38575.15 ± 19624.30 ms	7.40 ± 4.24 tok/s	33.54 ± 0.03 tok/s

This is no longer chat. At ten concurrent 25k prompts, you wait around 39 to 46 seconds on average for the first token. NVFP4 helps decode a little, but the user mostly feels an empty window before the stream starts.

That is the same lesson as in the earlier Gemma-4 benchmark post, now with vLLM v0.23.0 added: context is not a free input box. If you make a local agent carry 25k tokens around, you pay for it in TTFT.

Open-loop: the office shape remains usable

The open-loop tests matter more for feel than the closed-loop tables. They dispatch requests according to an arrival pattern instead of starting everything at once.
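To make "arrival pattern" concrete, here is a small sketch that generates arrival times from a request rate and a burstiness value. It assumes gamma-distributed gaps between requests, the scheme vLLM's serving benchmark uses for these two settings, where burstiness 1.0 gives Poisson arrivals and lower values give burstier traffic. Whether the benchmark repo uses exactly this scheme is an assumption, and the function name and seed are illustrative.

```python
import random


def arrival_times(n_requests: int, request_rate: float, burstiness: float, seed: int = 0) -> list[float]:
    """Return cumulative arrival times in seconds for an open-loop run."""
    rng = random.Random(seed)
    # Gamma gaps with shape = burstiness and mean = 1 / request_rate.
    scale = 1.0 / (request_rate * burstiness)
    now = 0.0
    times = []
    for _ in range(n_requests):
        now += rng.gammavariate(burstiness, scale)
        times.append(now)
    return times


# The office-baseline shape: 200 requests, rate 0.3, burstiness 0.7.
times = arrival_times(200, request_rate=0.3, burstiness=0.7)
print(f"last request dispatched after {times[-1]:.0f} s")
```

The point of the open-loop shape is that dispatch does not wait for earlier requests to finish, so a slow server builds a queue instead of slowing the arrivals.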

H: office baseline

200 random prompts, request rate 0.3, burstiness 0.7.

Profile	OK	Output tok/s	P95 TTFT	P95 TPOT
BF16 v23	200/200	129.92	2835.43 ms	197.57 ms
MTP v23	200/200	132.35	3394.53 ms	178.77 ms
NVFP4 v23	200/200	139.05	2393.78 ms	77.98 ms

NVFP4 is clearly nicer here. Not because of much higher output throughput, because 139.05 versus 129.92 tok/s is not a revolution. The difference is TPOT: 77.98 ms p95 versus 197.57 ms for BF16. The stream feels much faster once it starts.
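TPOT is easier to feel when converted to a streaming rate. A short sketch using the p95 TPOT column from the table above; the helper name is illustrative:

```python
def tokens_per_second(tpot_ms: float) -> float:
    """Convert time per output token (ms) to a streaming rate (tok/s)."""
    return 1000.0 / tpot_ms


# p95 TPOT from the Run H table, in milliseconds.
p95_tpot_ms = {"BF16 v23": 197.57, "MTP v23": 178.77, "NVFP4 v23": 77.98}

stream_rate = {name: tokens_per_second(ms) for name, ms in p95_tpot_ms.items()}

for name, rate in stream_rate.items():
    print(f"{name}: {rate:.1f} tok/s at the p95 tail")
```

These are tail values: 95 percent of requests stream at this rate or faster.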

I: ShareGPT replay

250 real conversations, same request rate.

Profile	OK	Output tok/s	P95 TTFT	P95 TPOT
BF16 v23	250/250	60.93	456.10 ms	115.31 ms
MTP v23	250/250	61.47	576.82 ms	77.32 ms
NVFP4 v23	250/250	61.99	225.09 ms	45.30 ms

This is the best proxy for normal chat. Short, real conversations. NVFP4 gives p95 TTFT of 225.09 ms and p95 TPOT of 45.30 ms. Locally, that does not feel like a compromise.

J: Monday morning peak

300 random prompts, target 1.5 rps, max concurrency 25.

Profile	OK	Output tok/s	P95 TTFT	P95 TPOT
BF16 v23	300/300	132.04	3006.73 ms	199.23 ms
MTP v23	300/300	172.32	3870.47 ms	235.91 ms
NVFP4 v23	300/300	218.90	2390.17 ms	124.58 ms

Under overload, NVFP4 also stays the most usable. Every request succeeds, but the queue decides who feels the pain. BF16 and MTP produce less friendly tails here. MTP has more output throughput than BF16, but worse p95 TTFT and p95 TPOT. That is exactly why I want percentiles, not only tokens per second.
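A toy example shows why the mean hides what the tail feels. The latencies below are hypothetical, not measured; the percentile helper uses the nearest-rank method, which may differ from how the benchmark tooling computes its percentiles.

```python
import math
import statistics


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = math.ceil(pct / 100.0 * len(ordered))
    return ordered[max(rank, 1) - 1]


# Hypothetical TTFT samples in ms: 18 quick requests, 2 stuck in the queue.
ttft_ms = [400.0] * 18 + [6000.0] * 2

mean_ms = statistics.mean(ttft_ms)
p50_ms = percentile(ttft_ms, 50)
p95_ms = percentile(ttft_ms, 95)

print(f"mean {mean_ms:.0f} ms, p50 {p50_ms:.0f} ms, p95 {p95_ms:.0f} ms")
```

The mean lands somewhere no user actually experienced, while p50 and p95 describe the typical request and the unlucky one.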

What I put into the Arena

I added three new Arena entries instead of overwriting the old Gemma-4 entries. The old v0.20.1 runs remain useful as historical comparison points. The new entries are explicitly labeled v23: BF16 v23, MTP v23 and NVFP4 v23.

The short ranking for my own use:

NVFP4 v23 for local chat, agents and office load.
MTP v23 if you want to keep the Google model artifact but BF16 decode is too slow.
BF16 v23 as a control line and for comparisons where precision matters more than serving speed.

For 25k context, none of the three solves the real problem. There you work on prompt budget, retrieval, memory compaction and agent architecture. Not on hoping a serving profile makes the wait disappear.

The three numbers behind a fast DGX Spark

2026-05-22T00:00:00.000Z

Can you seriously run large language models locally on a DGX Spark? Yes. That is the boring answer, and it is also the answer every review hands you: a model name, a number, tokens per second, done.

The useful answer is harder. A model that handles one demo prompt nicely tells you nothing about a Monday morning with ten people, big context, agent flows and someone pasting half a novel into a ticket. That is where it starts to chafe, or it doesn't. And that does not depend on the Spark, it depends on your workload.

I have a Spark sitting in the lab and ran a stack of models on it, in BF16, FP8 and NVFP4. Nine workloads, two measurement methods, and a few runs redone because the first ones looked suspiciously good. What was left after all that measuring is not a scoreboard. It is one way of looking at it that held up every time, and it is below. The hard numbers per model are in the separate posts, and the complete guide with the setup, the cost and who it works for is at Running LLMs on the DGX Spark. This piece is about that one lens.

What the thing actually is

The DGX Spark is NVIDIA's smallest Blackwell machine. A GB10 superchip, 128 GB unified memory, small enough for a server rack. No separate graphics card with its own memory pool, but one memory that the CPU and the GPU share together. Remember that number, 128 GB. It is your entire budget, and everything that follows is a division sum inside that 128.

One thing you need to know up front, because it explains half the numbers later. The Spark runs on desktop Blackwell, SM12.1, and that chip cannot compute natively in 4-bit. The big datacenter Blackwell, the B200, can. The result: from 4-bit quantization you get the full memory gain on the Spark, but not the full compute gain. vLLM works around this by pulling 4-bit weights back up to higher precision during compute.

That works fine. But it is exactly why you should not blindly stick the pretty FP4 numbers from a B200 onto your own Spark.

What fits in 128 GB

Short version: the weights go in first, the rest is KV-cache for all users together. Precision is therefore a design choice up front, not a knob afterward, and I wrote a separate post about it. The question is never whether a model fits, but what is left when it does. The full division sum is in the guide.

How fast it really is

This is where most DGX Spark reviews go wrong. They grab one prompt, measure tokens per second, and call that "the speed". But speed on this machine is not a number. It is three things, they feel different and they behave differently. Pull them apart and the whole Spark falls into place.

Decode is nearly free

Decode is the text that comes in once the model is actually generating. On the Spark that is boringly stable, and boring is a compliment here. One user on a 26B model gets between 23 and 24 tokens per second in BF16, whether you feed it 4k or 25k context. Ten users at once: about 9 to 12 each, and that is where it sticks. Decode therefore hangs on how many people are busy at the same time, not on how long their prompt is.

And quantization lifts that whole line up. NVFP4 won on decode in all nine tests, by 22 to 92 percent depending on the workload. On a lighter MoE model like Nemotron-3, single-user decode even brushes up against 60 t/s. Decode, in short, is not the problem.

Prefill is the bill

Prefill is. Prefill is the silence before the first token, and that is what a user experiences as "slow", not the tokens after it.

Prefill scales with your prompt size, and that hurts. A short prompt is processed within half a second, even with ten people at once. Throw 25k context at it with those same ten users and you wait 35 seconds for the first character. Same machine, same concurrency, just a longer prompt. Double the prompt, roughly double the wait.
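The BF16 TTFT numbers at ten users from Run A in the NVFP4 vs BF16 post make that scaling concrete. A minimal sketch; the variable names are illustrative and the context sizes are taken as the round "k" values used in that table:

```python
# BF16 TTFT in seconds at ten concurrent users, keyed by context in k tokens (Run A).
ttft_s = {4: 4.46, 8: 7.99, 16: 18.92, 25: 35.67}

# Seconds of waiting per 1k prompt tokens at each context size.
per_k = {ctx: wait / ctx for ctx, wait in ttft_s.items()}

# How much the wait grows between consecutive context sizes.
contexts = sorted(ttft_s)
growth = {
    (small, large): ttft_s[large] / ttft_s[small]
    for small, large in zip(contexts, contexts[1:])
}

for ctx, seconds in per_k.items():
    print(f"{ctx}k: {seconds:.2f} s per 1k tokens")
for (small, large), ratio in growth.items():
    print(f"{small}k -> {large}k: wait grows {ratio:.2f}x")
```

The doubling rule is a rule of thumb: in these numbers the wait grows a little less than 2x from 4k to 8k and a little more than 2x from 8k to 16k.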

And quantization? Barely helps here. Prefill is compute, and compute is exactly where that SM12.1 handicap sits. NVFP4 makes your decode faster. Your prefill stays prefill.

Under pressure it queues, it doesn't crash

That leaves the question: what does it do when you simply throw too much at it? The answer is reassuringly boring. It does not fall over. It gets in line.

In the heaviest test I wanted to push 1.5 requests per second through the machine. It managed almost six times less than that. And yet not a single one of the 300 requests failed. The slowdown also did not go to everyone, it went to the tail: the average user noticed little, the unlucky one percent waited six seconds for their first token.
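The arithmetic behind "almost six times less", using the BF16 achieved rate from Run J in the NVFP4 vs BF16 post. The sketch assumes achieved RPS means completed requests divided by wall-clock duration; the variable names are illustrative.

```python
target_rps = 1.5     # configured arrival rate
achieved_rps = 0.26  # BF16, Run J
n_requests = 300

# How far the machine fell short of the target rate.
shortfall = target_rps / achieved_rps

# Time to get through all requests at the target rate versus the achieved rate.
planned_duration_s = n_requests / target_rps
actual_duration_s = n_requests / achieved_rps

print(f"shortfall: {shortfall:.1f}x")
print(f"planned: {planned_duration_s:.0f} s, actual: {actual_duration_s:.0f} s")
```

Under that assumption the run simply takes longer to drain, which is what a queue looks like from the outside.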

For on-prem that is the best outcome you can hope for. A crash is a phone call. A queue is a bit of patience. An office lives with the second, not the first.

That is the whole model. Decode is nearly free, prefill is the bill, queueing is your safety net. The numbers underneath it, nine workloads per model and two measurement methods, are in the arena and in the separate posts: the BF16 baseline, NVFP4 against BF16 and Nemotron-3 in three precisions.

The rest is in the guide

Which engine I run (vLLM), what a Spark costs, and who this does or does not work for: that is the complete picture, and it belongs in the guide, not in this one lens story. The short version of "who for": local only gets interesting once the data is not allowed to leave the building. If you don't have that requirement and you just want the fastest, cheapest tokens, then a cloud API is the more honest answer.

Running local is not a principle. It is a division: what has to stay in, and what is allowed to go out.

Do it yourself

Everything underneath is open. The models are on Hugging Face, vLLM is open source, and the raw benchmark output plus the scripts are on GitHub. The methodology explains which nine workloads I run and why.

If you have a Spark yourself, you should be able to walk the same route and get roughly the same numbers. If that doesn't work out, that is exactly what I want to know. Feel free to email.

Why this blog and arena exist

2026-05-05T00:00:00.000Z

For clients of Kamoo I set up AI systems that sometimes have to stay close to home. Accountants, administrative offices, firms with personal data and financial documents. Exactly the kind of data that does not make your auditor any calmer when you say: "we'll just send it off to America".

That is why we have a DGX Spark standing here. 128 GB unified memory, small enough for a server cabinet, big enough to run serious local models through vLLM. What practically fits on it, I collect on the overview page about local models on the DGX Spark.

Then the practical question started.

Which model do you use for what on this machine? Which precision do you pick? How much context still fits? Where does concurrency fall over? What happens on an ordinary Monday with ten people who are not all running a benchmark at the same time, but just doing their work?

I went looking for numbers on exactly those questions. Not a general leaderboard with a score that mostly looks good in a screenshot. Just: this chip, these models, these engines, these workloads, these limits.

I did not find them.

So I am building them myself.

The arena is the measuring bench

Right now there are ten benchmark profiles in the arena, with runs for things like context scaling, concurrency, output throughput, RAG-like workloads and a Monday-morning peak.

That arena has to do one thing well: show what you can practically expect on a DGX Spark. Not which model is "the best" in some abstract sense, but which model stays usable on this hardware under the workloads I run into in client work.

For a few runs I already wrote down what went wrong and what I took away from it. For instance where Gemma-4 starts to grind on the Spark, what NVFP4 wins over BF16 once the bugs are gone, and how three precisions of Nemotron-3 compare.

The raw output is public on GitHub: djangodevreng/dgx-spark-benchmarks. That is on purpose. If you have a Spark yourself, you should be able to walk the same route and get roughly the same numbers. If that does not work out, that is interesting data too.

So the arena is not a static little list. It is a workbench. New models added, other precisions next to them, workloads tightened up, odd results run again. Boring enough to actually be useful.

The blog is the context around it

Numbers are handy, but they do not tell the whole story.

A benchmark can say that NVFP4 is faster than BF16. The blog can tell you that the first runs fell apart on vLLM bugs, that a parameter was set wrong, that a model only became usable after the context length went down, or that the tail latency felt worse than the average let on.

That is the layer I missed myself when I started. Not just "here is a score", but: this is what I tried, this broke, this is what I changed, and this is what I would do differently next time.

That is why the blog and arena sit side by side. The arena gives the measuring points. The blog gives the reasoning, the mistakes and the practical choices behind them.

Why local

Privacy is usually the polite explanation. It is also true. The more practical reason: some clients have no choice.

An accountancy firm cannot treat client data as if it were sample text in a demo. Municipalities have rules. Financial documents have rules. Personal data has rules. In practice it all comes down to the same question: can you set this up without legal, compliance and audit immediately slamming the door shut?

Then you have two options. AI does not fit there, or you make it local.

We choose local where it is needed. The Spark suddenly makes that less exotic. It is not cheap, but it is manageable for an SME office that wants to do something serious without immediately building its own data center.

That is where the interesting work is for me: running models, measuring latency, testing prompts, pulling documents through a pipeline, and watching where it breaks.

Usually it breaks somewhere boring. Those are the best spots.

What I want to be able to answer

The arena ultimately has to answer questions that keep coming back in projects.

Which model is fast enough for internal document questions? Which precision gives enough room for several users at the same time? When is NVFP4 fine, when do you want FP8, and when is BF16 mostly an expensive default? How much context can you give before latency gets annoying? Which engine fits which workload better: vLLM, TensorRT-LLM or SGLang?

These are not academic questions. They decide how you design an on-prem setup. How much hardware you need. Which data stays local. Which steps you might send off to a hosted model. And where you draw the line between "works in a demo" and "holds up on Monday morning".

That last line is the whole reason this site exists.

Why I write this in public

Everything I use for this is open or public: vLLM, models on Hugging Face, benchmark scripts, loose JSON, the site itself. The secret is not access to some magic dashboard. It is in hours of trying, measuring, running again, hunting bugs and then measuring once more because your first run was suspiciously good.

That has cost me dozens of hours by now: getting models running, repeating runs and figuring out odd results.

If someone else walks the same route, they do not have to trip over all the same paving stones again. And if someone contradicts my numbers with better runs: great. Then the arena gets better.

There is a second reason under it too. This site is itself part of the experiment. The blog, the arena, the flow from benchmark output to structured JSON to pages: that was largely built in a couple of weeks with agents that write and build along. I described the small version of that earlier in the OpenClaw setup on a Raspberry Pi.

That workflow is part of the work by now. I dump raw findings in Slack, let an agent read the repo and the writing guide, get a branch with a proposal back, run checks and review the diff myself. It does not save me any thinking. It does move a lot of preparation to a layer that just keeps working.

Writing about that process forces me to make it less messy than my terminal history. That helps. Not always fun, but necessary.

What I want to build next

First, more benchmarks. vLLM was the starting point, because it works fast and is widely used. TensorRT-LLM is already on the bench for Nemotron-3. SGLang is what I want to put next to the same workloads after that. Only with multiple engines do you see whether your model is slow, your engine is fighting you, or you just did something dumb.

After that I want to make bench-spark public: the benchmark runner the way I use it now. Not a perfect framework. But something with which someone on the same hardware can ask the same questions without first rebuilding my mistakes.

I also want to make a Dutch eval suite for local LLMs. Not another English reasoning benchmark, but office work: accountancy jargon, legal texts, financial documents, documents with odd formatting. Exactly the things local AI gets judged on in the Netherlands.

And there is more work coming around local RAG on large document sets. No platform pitch. Just figuring out how to get more than a million documents through an on-prem setup without storage, retrieval or OCR slowly starting to hate you.

What I skip

No daily AI newsletter. There are enough places for that already, some of them on purpose.

No general-purpose "we do everything with AI" story. Too broad, and usually it means nothing.

No thought-leader act. I would rather build something that creaks than an opinion that sounds smooth.

No building a platform like OpenClaw either. I use it, I write about it, I build flows with it. But that layer itself I leave to the people who live in it every day.

What this should become

For clients this has to show what local AI practically costs: hardware, latency, precision, maintenance, odd edge cases. For me it is the place where I pin down my own assumptions before the next benchmark knocks them out.

I am trying to keep the rhythm. No promise per week. If there is nothing to report, nothing goes here. If there are bugs, runs and odd graphs, there is probably too much here.

Gemma-4 on the DGX Spark: NVFP4 vs BF16

2026-05-03T00:00:00.000Z

import BenchCard from "../../../components/post/BenchCard.astro"; import BenchCardRow from "../../../components/post/BenchCardRow.astro"; import Note from "../../../components/post/Note.astro";

In the BF16 baseline of Gemma-4 on the DGX Spark I ran nine benchmarks with Gemma-4-26B-A4B in BF16. Decode speed held up just fine, prefill decided when the wall came, and the system queued neatly under pressure instead of crashing. That story seemed done, until NVIDIA released an NVFP4-quantized version of that same model.

Same architecture and fine-tune, same server config, only the precision changes. From BF16 (16 bits per parameter) to NVFP4 (4 bits per parameter, NVIDIA's take on FP4). Four times smaller per weight, and if the Blackwell kernels cooperate, also a lot faster on compute-heavy tasks.

On paper, nice. In practice: the official vLLM v0.20.1 release recognizes this checkpoint without any fuss, and the numbers were faster across the board than the BF16 baseline. Both tests fall under the guide running LLMs on the DGX Spark.

Why look into this at all

For an office with a local AI machine, memory budget is the most limiting thing after compute. A 26B model in BF16 takes ~48 GB of GPU memory for weights alone. On a Spark with 128 GB of unified memory, that leaves about 65 GB for KV-cache. Enough for the office scenario from the first blog, but not much room to run, say, 30+ users with large context side by side.

NVFP4 reduces that to ~18 GB for weights. Not four times smaller than BF16 (the vision encoder stays BF16, and scale factors cost space too), but about 2.7× smaller. That gives you toward 95 GB of KV-cache headroom, which in theory should support much higher concurrency. On top of that, less memory traffic is needed per forward pass, so by definition less bandwidth pressure, and that was already the bottleneck in BF16 under multi-user load. So the question was simple: how much of that theoretical gain survives in practice?
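The memory numbers can be sanity-checked with a few lines of arithmetic. The sketch assumes exactly 26 billion parameters and reads "~48 GB" as binary gigabytes (GiB); both are assumptions, and the helper name is illustrative.

```python
GIB = 2**30  # bytes per GiB


def weight_gib(params: float, bits_per_param: float) -> float:
    """Ideal weight footprint in GiB, ignoring scale factors and unquantized parts."""
    return params * bits_per_param / 8 / GIB


bf16_ideal = weight_gib(26e9, 16)  # all weights at 16 bits
fp4_ideal = weight_gib(26e9, 4)    # all weights at 4 bits, no overhead

# Figures quoted in the text above, in GB.
bf16_measured, nvfp4_measured = 48, 18
shrink = bf16_measured / nvfp4_measured
freed = bf16_measured - nvfp4_measured

print(f"ideal BF16 {bf16_ideal:.1f} GiB, ideal 4-bit {fp4_ideal:.1f} GiB")
print(f"measured shrink {shrink:.1f}x, freed {freed} GB")
```

The ideal 4-bit figure sits well below the ~18 GB actually used, which is the gap the text attributes to the BF16 vision encoder and the scale factors.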

What NVFP4 actually is

NVFP4 is NVIDIA's take on FP4: floating-point numbers with 4 bits per value. Four bits, not four bytes, so a factor of 4 less per parameter than BF16. By storing a scaling factor per group of weights, accuracy stays reasonably intact.

For Blackwell it works like this. NVIDIA's datacenter cards (B100, B200, SM10.0) have tensor cores that can compute natively with 4-bit values, and that is much faster than the same calculation in FP16 or BF16. The DGX Spark, on the other hand, is desktop Blackwell (GB10, SM12.1) and that architecture has no native FP4 compute.<Note>On a datacenter B200 (SM10.0) you'd expect another 2 to 3× on top of this thanks to native FP4 tensor cores. The Spark lacks that hardware path, so all the gain comes from memory bandwidth, not from compute.</Note> What you get in that case is "weight-only" FP4: the weights are physically stored as 4-bit (hence the memory gain), but during compute they get decoded on the fly to FP16 for the matrix multiplications. A vLLM warning makes that explicit:

```
Your GPU does not have native support for FP4 computation but FP4 quantization
is being used. Weight-only FP4 compression will be used leveraging the Marlin kernel.
This may degrade performance for compute-heavy workloads.
```

So you get the memory gain in full, the compute gain only partially. The Marlin INT4 GEMM kernel is optimized, but not as fast as native FP4 on SM10.0 would be. Worth factoring in when you look at the numbers further down.
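To illustrate what "decoded on the fly" means, here is a simplified sketch of weight-only 4-bit dequantization. It uses the E2M1 value set of 4-bit floats (one sign bit, two exponent bits, one mantissa bit) and one scale per group. The nibble order, the plain Python float for the scale and the function names are assumptions for illustration; this is not Marlin's actual memory layout or kernel.

```python
# Magnitudes representable by a 4-bit E2M1 float, indexed by the low three bits.
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def decode_e2m1(code: int) -> float:
    """Decode one 4-bit code: the top bit is the sign, the low three bits pick the magnitude."""
    sign = -1.0 if code & 0b1000 else 1.0
    return sign * E2M1_MAGNITUDES[code & 0b0111]


def dequantize_group(packed: bytes, scale: float) -> list[float]:
    """Unpack two 4-bit weights per byte and multiply by the group's scale factor.

    Assumption: the low nibble holds the first weight of each pair.
    """
    weights = []
    for byte in packed:
        weights.append(decode_e2m1(byte & 0x0F) * scale)
        weights.append(decode_e2m1(byte >> 4) * scale)
    return weights


# Two bytes hold four weights; the group shares one scale.
group = dequantize_group(bytes([0x21, 0xF7]), scale=0.5)
print(group)
```

The stored form is half a byte per weight plus the shared scale; the higher-precision values only exist transiently, right before the matrix multiplication.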

The test setup

Server config identical to the first blog, only the model swaps:

```shell
docker run -d --name vllm-bench \
  --gpus all --ipc=host \
  -v appliance_hf-cache:/root/.cache/huggingface \
  -p 8000:8000 \
  vllm/vllm-openai:v0.20.1 \
  --model nvidia/Gemma-4-26B-A4B-NVFP4 \
  --served-model-name gemma-4-26b-a4b-nvfp4 \
  --max-model-len 131072 \
  --gpu-memory-utilization 0.95 \
  --kv-cache-dtype fp8 \
  --limit-mm-per-prompt '{"image":0,"audio":0}' \
  --async-scheduling \
  --no-enable-prefix-caching \
  --host 0.0.0.0 \
  --port 8000
```

Tests are one-to-one identical to the first blog: same commands, same concurrency levels, same datasets for the open-loop tests, same seed. That is on purpose, because if you want to measure the effect of an isolated variable (in this case the precision), everything around it has to stay the same. Exactly how I measure those concurrency levels, seeds and open-loop arrivals is described in the Arena measurement method.

Comparison	BF16	NVFP4
Model	google/gemma-4-26B-A4B-it	nvidia/Gemma-4-26B-A4B-NVFP4
Active params	4B	4B
Total params	26B	26B
Model memory	~48 GB	~18 GB
KV-cache headroom	~65 GB	~95 GB
MoE backend	(default)	MARLIN (forced)

An interactive version of all the numbers is on the Arena page for Gemma-4-26B-A4B-NVFP4, including commands and TTFT percentiles for all 9 tests.

<details> <summary>Run A: context scaling from 4k to 25k</summary>

Decode per user as context grows, c=1/5/10:

Context	Users	BF16 d/u	NVFP4 d/u	Gain
4k	1	24.08	29.80	+24%
4k	5	12.55	22.01	+75%
4k	10	9.48	16.94	+79%
8k	1	23.69	29.31	+24%
8k	5	11.48	19.28	+68%
8k	10	8.52	14.35	+68%
16k	1	23.34	28.55	+22%
16k	5	10.05	15.67	+56%
16k	10	6.79	10.06	+48%
25k	1	22.75	27.70	+22%
25k	5	8.46	12.46	+47%
25k	10	5.40	7.55	+40%

At c=1 the gain is stable around +22-24% across all contexts. Memory bandwidth barely matters for single-user, so the gain here sits in the compute path itself. Marlin's INT4 decode plus FP16 matmul is slightly faster than the direct BF16 matmul, even though it takes two steps.

At c=10 the difference scales much more strongly with workload type, from +40% at 25k context to +79% at 4k. That's because under multi-user the memory bandwidth becomes the bottleneck, and NVFP4 reads fewer bytes per forward pass. The more concurrent, the more that counts, until you hit the KV-cache memory limits again (25k context with multiple users) and the gain flattens out.

TTFT (time to first token) improves too, but only slightly:

Context	Users	BF16 TTFT	NVFP4 TTFT
4k	10	4.46s	4.20s
8k	10	7.99s	7.84s
16k	10	18.92s	18.69s
25k	10	35.67s	35.65s

On TTFT the gain is small. That makes sense: prefill is compute-heavy, and on SM12.1 without native FP4 tensor cores Marlin has to decode the weights on the fly for the matmul. That gives back some of what the memory bandwidth gained. For decode, bandwidth counts more than compute; for prefill, the other way around.

</details>

<details> <summary>Run B: 25k context, concurrency up to 20</summary>

The stress test from part one:

Users	BF16 d/u	NVFP4 d/u	BF16 TTFT	NVFP4 TTFT
5	8.51 t/s	12.43 t/s	19.86s	19.72s
10	5.37 t/s	7.56 t/s	35.44s	35.51s
20	3.16 t/s	4.26 t/s	67.37s	67.40s

The aggregate decode plateau shifts from 32 t/s to 36 t/s at c=20: a 12% higher ceiling at 25k context under maximum pressure. TTFT is practically identical between BF16 and NVFP4 because prefill is the wall here and that doesn't get much faster on SM12.1. Decode per user is clearly better though: at twenty parallel 25k prompts you get 4.26 instead of 3.16 t/s, +35%. Still not chat speed, but a noticeable difference once the tokens start flowing.

</details>

<details> <summary>Run C: 1k prompt, 1k output</summary>

The short-prompt + long-answer workload, close to agent flows and code generation:

Users	BF16 d/u	NVFP4 d/u	Gain
1	23.86	29.45	+23%
5	13.59	24.69	+82%
10	10.92	20.88	+91%

At c=10 per-user decode sits at well over 20 t/s, above reading speed and close to a comfortable streaming UI. Aggregate decode at c=10 hits 209 t/s instead of 86 t/s in BF16, about 2.4× as much.

</details>

<details> <summary>Run E: multi-turn (depth 4)</summary>

Five consecutive turns per conversation, ten conversations in parallel: the most realistic office shape.

Users	BF16 d/u	NVFP4 d/u	BF16 TTFT	NVFP4 TTFT
1	23.97	29.61	0.53s	0.33s
5	13.07	23.98	1.32s	1.11s
10	10.43	19.51	2.13s	1.94s

For ten parallel 5-turn conversations: 1.94 seconds to first token, 19.51 t/s per user. That fits comfortably within what a reader experiences as chat, and is 87% faster per token than BF16 in the same test.

</details>

<details> <summary>Run F: RAG mix (8k prompt)</summary>

Users	BF16 d/u	NVFP4 d/u	BF16 TTFT	NVFP4 TTFT
5	12.11	20.91	4.32s	4.28s
10	9.31	15.96	7.99s	8.00s
20	6.05	10.57	14.61s	14.45s

8k context is roughly what a RAG flow with four chunks of 2k tokens takes in. At ten users you wait 8 seconds to first token (almost the same as BF16, because of the compute bottleneck), then get 16 t/s of streaming. For "ask something about your documents" flows that's plenty workable, and it shows where the gain sits: in decode speed, not in TTFT.

</details>

<details> <summary>Run G: short instruction, 4096 output tokens</summary>

The agent / code-generation shape:

Users	BF16 d/u	NVFP4 d/u	BF16 TTFT	NVFP4 TTFT
1	24.17	29.59	0.24s	0.11s
5	14.32	25.79	0.38s	0.23s
10	11.75	22.54	0.48s	0.37s

A TTFT of 110 milliseconds at single-user is very low, lower than most hosted APIs manage over the network. And 22.54 t/s per user at c=10 is plenty for agent streams. Aggregate decode at c=10 in this test comes out at 225 t/s versus 84 t/s in BF16, almost 2.7× as much. For a team running ten concurrent agents that each produce long structured output, this is the most important number.

</details>

<details> <summary>Run H: open-loop, random 4k workload</summary>

The synthetic office baseline with Poisson arrivals:

Metric	BF16	NVFP4
Achieved RPS	0.27	0.29
Peak concurrent	36	16
TTFT P50	1286 ms	1006 ms
TTFT P99	3316 ms	2893 ms
TPOT P50	182 ms	64 ms
Total tok/s	1215	1302

What stands out is that peak concurrent drops from 36 to 16 at an identical arrival rate (0.3 rps) and identical prompts. Because NVFP4 handles each request faster, the queue stays shorter, and that's an important insight for capacity planning: NVFP4 gives you not only lower latency per request, but also less queue pressure at the same arrival rate. At the same time TPOT P50 drops from 182 ms to 64 ms, so median inter-token latency is almost three times lower. For a chat UI that shows token streaming, that's the difference between visibly waiting for an answer and just reading along.
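The link between faster requests and a shorter queue is Little's law: the average number of requests in the system equals the arrival rate times the average time a request spends there. A minimal sketch with hypothetical latencies; the law describes averages, so it explains the direction of the effect rather than the peak values in the table.

```python
def average_in_flight(arrival_rate_rps: float, mean_latency_s: float) -> float:
    """Little's law: average requests in the system = arrival rate x mean time in system."""
    return arrival_rate_rps * mean_latency_s


arrival_rate = 0.3  # rps, as in Run H

# Hypothetical end-to-end latencies per request, in seconds (not measured values).
slow_profile = average_in_flight(arrival_rate, 60.0)
fast_profile = average_in_flight(arrival_rate, 30.0)

print(f"slow profile: {slow_profile:.0f} requests in flight on average")
print(f"fast profile: {fast_profile:.0f} requests in flight on average")
```

Halving the time per request halves the average number in flight at the same arrival rate, which is the capacity-planning effect described above.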

</details>

<details> <summary>Run I: ShareGPT replay (real conversations)</summary>

Real multi-turn conversation data:

Metric	BF16	NVFP4
Peak concurrent	17	10
TTFT P50	353 ms	152 ms
TTFT P99	637 ms	265 ms
TPOT P50	95 ms	39 ms

P99 TTFT is 265 milliseconds, so 99 percent of requests see their first token within that time. A TPOT of 39 ms works out to 25.6 t/s per user. You can safely call that realtime chat for 25 employees with realistic ShareGPT-style prompts.
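
The step from TPOT to tokens per second is a single division. A small sketch, with a helper name of my own choosing, using the two TPOT P50 values from the table above:

```python
def tokens_per_second(tpot_ms):
    """Convert time per output token (milliseconds) into a per-user decode rate."""
    return 1000.0 / tpot_ms


for label, tpot_ms in [("BF16", 95), ("NVFP4", 39)]:
    print(f"{label}: {tokens_per_second(tpot_ms):.1f} tok/s")
```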

</details>

<details> <summary>Run J: Monday-morning peak</summary>

The heaviest scenario from part one: overloaded server, 1.5 rps target with max 25 concurrent requests.

Metric	BF16	NVFP4
Configured RPS	1.50	1.50
Achieved RPS	0.26	0.44
TTFT P50	1132 ms	920 ms
TTFT P99	6157 ms	6054 ms
TPOT P50	187 ms	108 ms
Total tok/s	1173	1984

The clearest number of the whole day is that achieved RPS goes from 0.26 to 0.44. Same target, same concurrency cap, same Poisson arrivals, and NVFP4 processes 69% more requests per second before the queue clogs up.

P99 TTFT shifts only marginally (6.16s to 6.05s). That fits the pattern: prefill is compute-bound on SM12.1, and NVFP4 isn't much faster there. But TPOT P50 drops from 187ms to 108ms, and aggregate token throughput grows from 1173 to 1984 t/s. For a 25-person office at peak hours, that's the difference between enough and a squeeze: more requests per second processed, with faster streaming for whoever's up next.

</details>

What this means for on-prem AI

If you have a Spark and run Gemma-4-26B, NVFP4 is the upgrade. In all 9 tests NVFP4 is the winner, and it frees up 30 GB of memory for other purposes like more KV-cache, a second small model alongside it, or batch jobs. At Kamoo this NVFP4 config now sits next to the BF16 baseline in bench-spark/, and one command switches between the two.

For a 25-person office with realistic ShareGPT-like prompts you notice it right away. TPOT P50 drops from 95 ms to 39 ms, P99 TTFT from 637 ms to 265 ms. And when peak load comes, the system delivers 69% more requests per second before it fills up. For agent flows and code generation (Run G shape) the Spark in NVFP4 is at its strongest: ten parallel agents, each 4096 tokens of output, 22.5 t/s per user with TTFT under 400 ms.

The 25k context stress test (Run B) remains the wall. NVFP4 barely lowers it (TTFT differs by less than a second), because prefill stays prefill, and ten parallel 25k prompts wait 35 seconds for the first token. Quantization changes nothing about that on this hardware. It does change decode speed: 7.56 t/s/user instead of 5.37, so once the tokens come, they run faster.

What this run doesn't say

This is not NVFP4 on SM10.0 (datacenter Blackwell). There native FP4 compute would make the difference much bigger, with an expectation of a further 2-3× speedup on top of what we see here. On an H100 or B200 these numbers are therefore not representative; the Spark has a specific SM12.1 handicap (no native FP4) that doesn't exist in the cloud.

This is also not a comparison with dense Gemma-4-31B in NVFP4. Dense goes through a different code path in vLLM's loader. For a follow-up blog, dense NVFP4 with the same test suite would give a third data point.

And this is not a long-term accuracy comparison. NVFP4 quantization can have small accuracy effects. For the typical tasks in an office (summarization, ticket classification, RAG) they are rarely noticeable; for edge cases they may be.

What NVIDIA did publish is in the NVFP4 model card: on MMLU-Pro, GPQA-Diamond and LiveCodeBench, NVFP4 sits within 0.2 to 0.7 points of their own BF16 baseline.<Note>NVIDIA's own BF16 baseline itself deviates from Google's official Gemma-4 card numbers. Eval harnesses differ more than precision itself, so cross-comparing between vendors without an identical harness is shaky.</Note> That falls within run-to-run variance, no real degradation. What's curious about that same table is that NVIDIA's BF16 baseline in turn deviates from what Google publishes in the official Gemma-4 card: MMLU-Pro 85.0 vs 82.6, GPQA 80.3 vs 82.3, LiveCodeBench 80.5 vs 77.1. Not because quantization gets better than the original, but because the eval harness apparently matters more than the precision itself. Different prompts, different temperature, different stop criteria. Cross-comparisons between vendors are therefore hard to pin down without the same harness.

What sticks

Decode sells the benchmark, prefill decides the experience. That held in part one and it still holds. What NVFP4 adds is that decode gets faster in every workload, and most where it matters: at larger context and more users at once. TTFT stays roughly the same on SM12.1 because prefill is compute-bound and the Spark has no native FP4 tensor cores. For what the user feels once the tokens start flowing, NVFP4 on this hardware is a lot better than BF16, and it costs nothing in setup pain: one official vLLM image, one model flag, and it runs.

Nemotron-3 on the DGX Spark: BF16 vs FP8 vs NVFP4

2026-05-03T00:00:00.000Z

In the previous posts I ran Gemma-4 on the DGX Spark. First just BF16 as a baseline, then NVFP4 vs BF16 across the same test suite. That gave one model in two precisions. Useful, but not yet a real picture of the choice you have to make in production.

For this piece I run three variants of the same model side by side: BF16, FP8 and NVFP4 of Nemotron-3-Nano-Omni-30B-A3B-Reasoning. Same Spark. Same vLLM version. Same prompts. Same benchmark suite. As close to a fair quantization comparison as I can get on this machine.

The short version: NVFP4 wins on speed and throughput, FP8 wins more often on tail-latency, BF16 is mostly still useful as a baseline. That is less tidy than "4 bit is always better". Lucky for us, otherwise this post would have been short. Part of the guide running LLMs on the DGX Spark.

Why this experiment

The Gemma post mostly showed that NVFP4 works on the Spark. With some pain. Five vLLM bugs, a nightly build and enough flags to make a command line look like a small confession.

But Gemma did not answer the question I need for clients: what do you pick if you want to run a local model on a Spark today? BF16 because those are the original weights? FP8 because Blackwell is natively good at it? Or NVFP4 because you fit much more model and KV-cache in the same memory?

So here is this run. One model in three precisions. No leaderboard score, but workloads that resemble office work: chat, RAG, longer answers, multiple users at once, and a Monday morning where everyone suddenly decides AI is handy after all.

What BF16, FP8 and NVFP4 mean here

BF16 is the baseline: 16 bits per parameter, roughly 2 bytes. For this model that means about 61.5 GB of checkpoint size. That fits on the Spark, but it eats a lot of your 128 GB unified memory before a single user has any context in the KV-cache.

FP8 roughly halves that weight. The checkpoint is 32.8 GB. On Blackwell, FP8 is a logical choice: less memory, native support, and usually little hassle in vLLM.

NVFP4 goes further. The checkpoint is 20.9 GB. Not four times smaller than BF16, because the vision and audio encoders stay in BF16, but small enough to make the Spark feel different. More room for KV-cache, more batching, more concurrency.

The nuance: the DGX Spark runs on desktop Blackwell SM12.1. There NVFP4 is not the same party as on datacenter Blackwell. vLLM uses Marlin to decode FP4 weights toward FP16 during compute. You get the memory win fully. The compute win is less pure.

For this post that is exactly what makes it interesting. This is not a theoretical quantization post. This is: what happens on this machine, with this stack, when you actually run the three options?

Precision	Model size	Memory budget left of 128 GB
BF16	61.5 GB	~66 GB
FP8	32.8 GB	~95 GB
NVFP4	20.9 GB	~107 GB
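
The right-hand column is plain subtraction from the 128 GB of unified memory. A minimal sketch of that arithmetic, assuming nothing else claims memory; in practice runtime overhead and the --gpu-memory-utilization cap shave more off. The helper name is mine.

```python
TOTAL_MEMORY_GB = 128.0  # DGX Spark unified memory

# Checkpoint sizes as listed in the table above.
checkpoints_gb = {"BF16": 61.5, "FP8": 32.8, "NVFP4": 20.9}


def memory_left(checkpoint_gb, total_gb=TOTAL_MEMORY_GB):
    """Memory remaining once the weights are loaded, before KV-cache and overhead."""
    return total_gb - checkpoint_gb


for name, size in checkpoints_gb.items():
    ratio = checkpoints_gb["BF16"] / size
    print(f"{name}: {memory_left(size):.1f} GB left, {ratio:.2f}x smaller than BF16")
```

The ratio column makes the "not four times smaller" point concrete: NVFP4 lands a little under 3x, because part of the checkpoint stays in BF16.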

The test setup

All runs go through Docker on the DGX Spark with vllm/vllm-openai:v0.20.0. Official release, no patches.

docker run -d --name vllm-bench \
  --gpus all --ipc=host \
  -v appliance_hf-cache:/root/.cache/huggingface \
  -p 8000:8000 \
  -e HF_TOKEN="***" \
  vllm/vllm-openai:v0.20.0 \
  vllm serve nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-NVFP4 \
  --max-model-len 131072 \
  --gpu-memory-utilization 0.95 \
  --max-num-seqs 256 \
  --max-num-batched-tokens 8192 \
  --trust-remote-code \
  --video-pruning-rate 0.5 \
  --reasoning-parser nemotron_v3 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --limit-mm-per-prompt '{"image":0,"audio":0}'

For FP8 I use the same profile with --kv-cache-dtype fp8. BF16 runs without that KV-cache flag. Apart from the model checkpoint, everything else stays equal.

The benchmark suite is described in the arena methodology. In short: closed-loop tests for decode and TTFT per user, plus open-loop tests with Poisson arrivals to see how the server behaves when requests do not neatly wait for each other.

Setup

I started wrong with nvcr.io/nvidia/vllm:26.02-py3, NVIDIA's own vLLM container. It had vLLM 0.15.1 and did not yet know the NemotronH_Nano_Omni_Reasoning_V3 architecture.

The fix was more boring: vllm/vllm-openai:v0.20.0. Official release, correct flashinfer versions, first run working.

Our own bench-spark CLI still needed two small fixes: bypass the NVIDIA entrypoint with --entrypoint vllm, and pass HF_TOKEN to the container automatically. After that the suite ran.

Lesson: start with the stable release that supports the architecture.

<details> <summary>Run A: context-scaling</summary>

This run is the foundation: what happens when the prompt gets longer, while the number of users climbs from one to ten? That touches office work directly. A short chat is easy. A RAG question with 25k context and several people at once is where the Spark shows how much room is really left.

Here I look at two things. First decode per user: how fast does text come back once generation is running? Then TTFT: how long do you wait for the first token? With long context TTFT is often the pain users feel first. They see no tokens, so it feels like the system is stuck.

Single-user is mostly a pure speed measurement. There NVFP4 roughly doubles BF16. At ten users it gets more interesting: the smaller weights give vLLM more room to batch, and then BF16 just gets heavy.

Decode/user (tg256), c=1

Context	BF16	FP8	NVFP4	NVFP4 vs BF16
4k	29.23	51.68	60.30	+106%
8k	28.59	49.82	55.72	+95%
16k	28.24	47.52	55.24	+96%
25k	28.24	48.85	54.98	+95%

BF16 stays neatly flat around 28-29 tokens per second. That is stable, but not fast. FP8 answers with about 50 t/s. NVFP4 sits around 55-60 t/s. For a single user that is the difference between "fine" and "this feels local but not local-slow".

Decode/user (tg256), c=10

Context	BF16	FP8	NVFP4	NVFP4 vs BF16
4k	7.76	13.45	19.69	+154%
8k	7.13	11.14	17.90	+151%
16k	6.30	10.73	14.99	+138%
25k	5.56	8.59	12.99	+134%

At ten users NVFP4 is not "a bit faster". It is a different class. At 25k context BF16 does 5.56 tok/s/user. NVFP4 does 12.99. That is still no cloud-GPU cluster, but the difference in feel is large: BF16 becomes waiting, NVFP4 keeps working.
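
The "NVFP4 vs BF16" column in these tables is a plain relative gain. A short sketch that recomputes it for the c=10 rows; the function name is my own:

```python
def gain_percent(baseline, candidate):
    """Relative speedup of candidate over baseline, in percent."""
    return (candidate / baseline - 1.0) * 100.0


# (context, BF16 tok/s, NVFP4 tok/s) at c=10, copied from the table above.
rows_c10 = [
    ("4k", 7.76, 19.69),
    ("8k", 7.13, 17.90),
    ("16k", 6.30, 14.99),
    ("25k", 5.56, 12.99),
]
for context, bf16, nvfp4 in rows_c10:
    print(f"{context}: +{gain_percent(bf16, nvfp4):.0f}%")
```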

TTFT (first token), c=10

Context	BF16	FP8	NVFP4
4k	3.90s	2.91s	2.45s
8k	6.49s	5.93s	4.03s
16k	12.63s	10.55s	8.01s
25k	19.82s	16.89s	12.71s

This is the table I take most seriously for real users. At 25k context and ten users you wait almost 20 seconds for the first token with BF16. With NVFP4 that is 12.7 seconds. Still long, but not the same kind of long.

</details>

<details> <summary>Run B: 25k context, concurrency up to 20</summary>

Run A shows how context length scales. Run B keeps the context heavy and only raises the concurrency. This is the "everyone asks a big question at the same time" test.

In practice this does not happen every hour. Ten to twenty people rarely click send at exactly the same moment with 25k context. But if you put a local AI machine in front of a team, you want to know how it fails. Calmly getting slower is acceptable. A queue that feels dead is not.

NVFP4 keeps the most air here. Not because the model gets smarter, but because the server with smaller weights has more room for batching and KV-cache.

Users	BF16 d/u	FP8 d/u	NVFP4 d/u	NVFP4 vs BF16
5	9.06	15.33	20.75	+129%
10	5.65	9.18	12.99	+130%
20	3.70	5.97	7.79	+110%

Users	BF16 TTFT	FP8 TTFT	NVFP4 TTFT
5	11.01s	8.89s	7.21s
10	19.75s	15.82s	12.74s
20	37.88s	29.91s	24.08s

Twenty users with 25k context is deliberately unkind. Still, it is useful. BF16 sits at 37.88 seconds TTFT. That feels broken. NVFP4 sits at 24.08 seconds. Also not cozy, but still almost fourteen seconds faster.

Aggregate decode shows the same picture:

Users	BF16	FP8	NVFP4
5	34 t/s	53 t/s	71 t/s
10	38 t/s	59 t/s	77 t/s
20	44 t/s	66 t/s	84 t/s

The ceiling shifts from 44 t/s to 84 t/s. For a single user that is abstract. For a team it means the queue drains faster.

</details>

<details> <summary>Run C: short prompt, long output</summary>

This is the workload for agents, code generation and longer answers: little input, a lot of output. The prompt is only 1024 tokens, so prefill is not the problem here. The question is mostly how fast the model keeps ticking once the output gets long.

So here I look at decode per user. TTFT has to stay low, but the real difference you feel only after a few hundred tokens. A model that starts fast but then hangs at 8 tok/s still feels slow.

NVFP4 clearly wins here. At ten parallel users the model stays at 22.90 tok/s/user. BF16 drops to 7.84. That is still readable, but for an agent flow it feels like someone is typing along by hand.

Users	BF16 d/u	FP8 d/u	NVFP4 d/u
1	28.65	49.85	55.55
5	12.19	21.32	30.97
10	7.84	15.26	22.90

For this workload NVFP4 is the logical default. FP8 is fine, but here you mostly give up speed, and tail latency is not the deciding factor in this run.

</details>

<details> <summary>Run E: multi-turn, depth 4</summary>

Multi-turn is closer to real use than one isolated prompt. Five turns per conversation, several conversations in parallel. That resembles an employee who does not ask one question, but keeps asking, corrects, and carries context along.

Here I do not just want to see high throughput. I mostly want the server not to feel like it comes out of a cold start every turn. With ten conversations at once that becomes relevant: the context grows per conversation, the scheduler has to keep sharing, and the user expects the chat to keep running.

This is the most important office run for me. Not because it is perfectly real, but because it comes closest to "25 people use this spread across the day".

Users	BF16 d/u	FP8 d/u	NVFP4 d/u	NVFP4 TTFT
1	28.69	49.72	56.18	596 ms
5	11.50	20.87	30.55	1032 ms
10	7.68	14.88	21.58	1359 ms

At ten parallel conversations NVFP4 sits at 21.58 tok/s/user. FP8 sits at 14.88. BF16 at 7.68. That last one works technically, but it no longer feels like a snappy chat. NVFP4 stays well above the line where you experience an answer as fluent.

</details>

<details> <summary>Run F: RAG mix with 8k prompt</summary>

RAG is usually not 25k context, but not a short chat either. This run uses an 8k prompt and 512 output tokens. Think four chunks of about 2k tokens, plus question and instruction.

With RAG prefill counts more than in Run C. You push a sizeable slab of context into the model each time before anything comes back. After that you want enough decode left to make the answer usefully fast.

So the question is: does quantization keep helping when the prompt gets heavier? Yes. NVFP4 stays clearly ahead, even at twenty users.

Users	BF16 d/u	FP8 d/u	NVFP4 d/u
5	12.50	21.02	27.77
10	8.11	14.37	19.65
20	5.51	9.82	14.09

At twenty users NVFP4 delivers 14.09 tok/s/user. BF16 sits at 5.51. For batch processing that can still work. For real-time RAG in an office BF16 feels tight, certainly when documents are messy and prompts get longer than you had hoped. They always do.

</details>

<details> <summary>Run G: short instruction, 4096 output tokens</summary>

Run G resembles Run C, but pulls the output much further: 4096 tokens. This is the shape of agents that write out plans, generate code, make long analyses or summarize multiple files.

For this kind of workload the first token is almost a side issue. If the answer is long, decode speed determines the experience. Ten seconds of difference at the start is annoying. Waiting on output for minutes is worse.

NVFP4 stays strongest here. More important: it also stays above 25 tok/s/user at ten users. For local hardware on a desk machine that is simply usable.

Users	BF16 d/u	FP8 d/u	NVFP4 d/u	NVFP4 TTFT
1	28.68	49.75	55.44	179 ms
5	14.32	25.56	34.63	427 ms
10	9.51	18.40	25.18	363 ms

For agent flows the verdict is fairly harsh: BF16 is not broken, but you pay for every long output twice. First in memory, then in waiting time.

</details>

<details> <summary>Run H: open-loop office baseline</summary>

From here the interpretation changes. The previous runs push controlled batches through the model. Run H uses open-loop traffic: requests come in according to a Poisson distribution. So the server has to deal with arrivals that do not neatly wait for the previous one to finish.

This resembles an office more. Not perfect, but better than everyone at once or fully sequential. The metrics are different too. TPOT tells how fast tokens come once it is your turn. TTFT P50 tells the normal experience. TTFT P99 tells what the unlucky one notices.
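
For readers who want to reproduce these metrics from their own logs, here is a minimal sketch of how TTFT and TPOT can be derived from per-request timestamps. It assumes the common definition where TPOT excludes the first token; the `RequestTrace` class and its field names are hypothetical and not part of any benchmark tool. The example values are made up.

```python
from dataclasses import dataclass


@dataclass
class RequestTrace:
    sent_at: float         # seconds, when the request left the client
    first_token_at: float  # seconds, when the first streamed token arrived
    finished_at: float     # seconds, when the last token arrived
    output_tokens: int     # number of generated tokens


def ttft_ms(trace):
    """Time to first token: covers queueing plus prefill."""
    return (trace.first_token_at - trace.sent_at) * 1000.0


def tpot_ms(trace):
    """Time per output token after the first one, so prefill is excluded."""
    if trace.output_tokens < 2:
        raise ValueError("TPOT needs at least two output tokens")
    decode_s = trace.finished_at - trace.first_token_at
    return decode_s / (trace.output_tokens - 1) * 1000.0


# Made-up request: first token after 0.618 s, then 255 more tokens at 39 ms each.
trace = RequestTrace(sent_at=0.0, first_token_at=0.618,
                     finished_at=0.618 + 255 * 0.039, output_tokens=256)
print(f"TTFT {ttft_ms(trace):.0f} ms, TPOT {tpot_ms(trace):.1f} ms")
```

P50 and P99 are then percentiles over these per-request values across the whole run.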

Here FP8 gets interesting. NVFP4 wins the median and TPOT, but FP8 wins the tail. That is exactly why I do not want to end with "NVFP4 is always better".

Metric	BF16	FP8	NVFP4
Achieved RPS	0.26	0.28	0.29
Peak concurrent	42	18	15
TTFT P50	1229 ms	732 ms	618 ms
TTFT P99	2996 ms	2008 ms	3235 ms
TPOT P50	203 ms	74 ms	39 ms
Aggregate tok/s	1203	1297	1329

BF16's peak concurrent of 42 looks good on paper, but it is not. The queue grows because BF16 drains it less quickly. NVFP4 processes faster, so fewer requests are open at the same time. That is not lower capacity, that is less of a line.

The real choice is between NVFP4 and FP8. If you want the best median and fastest output, pick NVFP4. If you want the cleanest P99 on this workload, pick FP8.

</details>

<details> <summary>Run I: ShareGPT replay</summary>

ShareGPT replay is messier and therefore useful. Real conversations have varying lengths, follow-up questions, short answers, long answers and prompts that have not been neatly smoothed out by a benchmark author.

This is the run I trust most for chat feel. Not for company documents, but for the question: how does this feel when several people hold conversations throughout the day?

The pattern from Run H holds. NVFP4 is fastest for the average user. FP8 has the better P99.

Metric	BF16	FP8	NVFP4
Peak concurrent	17	12	10
TTFT P50	433 ms	220 ms	157 ms
TTFT P99	713 ms	422 ms	1361 ms
TPOT P50	118 ms	38 ms	26 ms

NVFP4 feels instant for most users: 157 ms TTFT P50 and 26 ms TPOT P50. But the P99 is 1361 ms, where FP8 stays at 422 ms. That is a hefty difference.

For an internal chat where a single slower request is no disaster, I pick NVFP4. For a product UI with a hard latency promise I would take FP8 more seriously.

</details>

<details> <summary>Run J: Monday morning peak</summary>

Run J is the oversubscription test. The target is 1.5 requests per second with a concurrency cap of 25. This is not the normal workday. This is the test for what happens when demand is bigger than the server can neatly keep up with.

Under oversubscription I look at achieved RPS first. Not at configured RPS, because that is the same for everyone. The question is how many requests the server actually processes while it is under pressure.

There NVFP4 wins clearly. FP8 keeps the tail cleaner, but NVFP4 gets much more work through the machine.

Metric	BF16	FP8	NVFP4
Configured RPS	1.50	1.50	1.50
Achieved RPS	0.25	0.43	0.58
Peak concurrent	28	28	28
TTFT P50	1130 ms	757 ms	687 ms
TTFT P99	5184 ms	3388 ms	4462 ms
TPOT P50	197 ms	112 ms	82 ms
Aggregate tok/s	1118	1951	2622

Concretely: NVFP4 processes about 35 requests per minute. BF16 about 15. That is the difference between a queue that slowly drains and a queue that makes users wonder whether they should click again. Do not click. That second click never helps.
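
The per-minute figures follow directly from the achieved RPS row. A tiny sketch of the conversion, with a helper name of my own:

```python
# Achieved RPS from the Run J table above.
achieved_rps = {"BF16": 0.25, "FP8": 0.43, "NVFP4": 0.58}


def requests_per_minute(rps):
    """Scale a per-second request rate to a per-minute rate."""
    return rps * 60.0


for profile, rps in achieved_rps.items():
    print(f"{profile}: {requests_per_minute(rps):.0f} requests/minute")
```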

</details>

The three precisions side by side

If I have to pick one realistic chat run, I take ShareGPT replay. There you see the distinction cleanest: NVFP4 wins the normal experience, FP8 wins the tail, BF16 takes part but convinces nowhere.

Metric	BF16	FP8	NVFP4	Best choice
TPOT P50	118 ms	38 ms	26 ms	NVFP4
TTFT P50	433 ms	220 ms	157 ms	NVFP4
TTFT P99	713 ms	422 ms	1361 ms	FP8
Peak concurrent	17	12	10	NVFP4
Achieved RPS	0.30	0.30	0.30	tie

Under oversubscription the difference gets starker:

Metric	BF16	FP8	NVFP4	Best choice
Achieved RPS	0.25	0.43	0.58	NVFP4
TTFT P50	1130 ms	757 ms	687 ms	NVFP4
TTFT P99	5184 ms	3388 ms	4462 ms	FP8
TPOT P50	197 ms	112 ms	82 ms	NVFP4
Aggregate tok/s	1118	1951	2622	NVFP4

That makes the choice more practical than I thought beforehand. NVFP4 is the default if you want throughput and normal user experience. FP8 is the choice if you find P99 more important than median. BF16 is the baseline you use to check whether quantization wrecks your accuracy.

Why FP8 wins the P99

My hypothesis: NVFP4 gives vLLM more memory room and therefore more batching room. That raises throughput and lowers TPOT, but individual requests can sometimes wait longer before they fall neatly into a batch.

FP8 has less headroom than NVFP4, but still enough for this workload. That makes the scheduler seem more predictable. Less aggressive, less fast in median, better in the tail.

BF16 has the worst of both worlds: large weights, less KV-cache headroom and lower decode. The queue gets fuller, but not because the server can handle so much at once. It just gets through it less quickly.

I want to dig into this further with scheduler settings and prefix caching. The raw numbers and the test definitions are in the arena so I can hold future runs against the same bar.

Comparison with Gemma-4-26B-A4B

For a single user, Nemotron-NVFP4 is almost twice as fast as Gemma-NVFP4. With multiple users the difference gets smaller, but it usually stays positive.

Workload	Gemma-NVFP4 d/u	Nemotron-NVFP4 d/u	Ratio
pp4096 c=1	30.01	60.30	2.0×
pp8192 c=1	29.35	55.72	1.9×
pp25000 c=1	28.00	54.98	2.0×
pp4096 c=10	17.05	19.69	1.2×
pp25000 c=10	7.61	12.99	1.7×

That pattern matches what the model is. Nemotron has 3B active params, Gemma 4B active params. At single-user that helps a lot. At multi-user the bottleneck shifts toward memory bandwidth and scheduling, and then the difference gets smaller.

What this means for on-prem AI

My default choice for this Spark is NVFP4. Not because 4 bit is principally nicer, but because the numbers on these workloads carry it: highest throughput, fastest median, lowest TPOT, smallest footprint.

I pick FP8 when tail-latency matters more than median. Think of a UI where you want to be able to say that 99 percent of requests start within a certain bound. In Runs H, I and J, FP8 consistently wins on P99 TTFT.

I pick BF16 only as a baseline or for accuracy-critical validation. Not as a production default. For that it is too expensive on the Spark: roughly three times as much memory as NVFP4 and roughly half the speed.

For a 25-person office with chat and RAG-like workload I would run NVFP4, with a custom eval suite alongside it. For an external chatbot with a tight latency promise I would test FP8. For BF16 I would mostly keep a short run to see what quantization changes in substance.

What these runs do not say

No accuracy tests. FP8 and NVFP4 can differ in substance from BF16. For production you have to measure that on your own documents, your own prompts and your own error tolerance.

No multimodal benchmarks. Nemotron-3-Nano-Omni is multimodal-aware, but these runs are text-only. Vision and audio stay out of frame here.

No comparison with dense models. This is an MoE model. Dense models feel different, especially in output speed and how vLLM handles them.

No definitive scheduler conclusion. The FP8-vs-NVFP4 tail is interesting enough to test separately with other batching and scheduling settings.

Where I land

The precision choice is not a detail. On the Spark it determines whether the same machine feels like a local experiment or like something you can hand to colleagues without explaining it every five minutes.

NVFP4 in many runs doubles the usable experience compared to BF16. FP8 is less spectacular, but more predictable in the tail. BF16 stays useful as a reference point, not as an end station.

The practical lesson from these three posts together: follow the vendor recipes, run the stable image and measure your own workload. Do not tinker yourself unless you have a good reason for it. With Gemma I had a reason. In hindsight it was mediocre.

Gemma-4 on the DGX Spark: the price of context

2026-05-01T00:00:00.000Z

I wanted to know how well a DGX Spark holds up as a local AI machine for an office environment.

Not in theory. Just: load Gemma-4-26B-A4B-it into vLLM, throw llama-benchy at it, make context windows bigger, output longer, concurrency higher, add multi-turn, and watch where it stays pleasant and where the wait starts to hurt. And once that story started taking shape, a second question came up: what if I stop testing in lockstep and let requests arrive organically, the way they would in a real office? For that I pulled in vLLM's own benchmark suite, which does what llama-benchy does not: Poisson arrivals, percentiles, real conversation data. How I measure all of this is in the methodology.

The short version: for normal office use this looks good. Short to medium prompts, longer outputs, and even conversations across multiple turns keep feeling fast, even with ten users at once. With large context windows the problem is not tokens per second, but how long someone stares at an empty chat window before the first token arrives. And if you really overload the machine, it does not scale, it queues.

That makes this not a "can the DGX Spark do it or not" story. It makes it a workload story. Nine tests, two methods, one machine. It is one of the build logs under the guide running LLMs on the DGX Spark.

Why this test

With on-prem AI you quickly end up talking about privacy, keeping data closer, and being less dependent on hosted models. That is all true, but eventually a flatter question follows.

Can the machine handle it?

A local model that neatly answers one demo prompt is nice. But production rarely looks like that. There you have multiple users, larger context, agent flows, tool-calls, retries, and sometimes someone who pastes half a novel into a ticket.

So I did not want to measure only tokens per second on one prompt. I wanted to see what happens when you load the machine from different angles: from "ten users, short prompts, long answers" to "ten users, five-turn conversations, growing memory" to "requests that arrive organically like in a real office, not all at once and not all the same size".

For these benchmarks I tested one model:

google/gemma-4-26B-A4B-it
BF16
DGX Spark, NVIDIA GB10, 128 GB unified memory
vLLM as OpenAI-compatible endpoint

Dense comes later. MoE vs dense too. This piece is only about Gemma-4-26B-A4B-it on the DGX Spark. This run is on BF16; what happens to the same Gemma-4 when you quantize to NVFP4 is a separate story.

What I expected up front

My expectation was simple: MoE would stay reasonably good under concurrent requests, but I thought the DGX Spark would hit its limits faster once the context grew large.

Especially at 25k context.

Context is expensive. You pay not only for the prompt coming in, but also for the KV-cache that vLLM has to keep around. Multiply that by multiple users and it suddenly becomes a memory problem and a queueing problem.
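
To make "multiply that by multiple users" concrete, here is a rough sketch of the usual KV-cache size formula. The architecture numbers are hypothetical placeholders, NOT taken from the Gemma-4 config, and the formula assumes full attention in every layer; sliding-window layers or other tricks would lower the real figure. Function names are mine.

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_value):
    """Keys and values (the factor 2) stored for every layer and KV head."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


def kv_cache_gb(tokens, users, per_token_bytes):
    """KV-cache for `users` concurrent requests of `tokens` tokens each, in GB."""
    return tokens * users * per_token_bytes / 1e9


# Hypothetical architecture, for illustration only. bytes_per_value=1 models an fp8 cache.
per_token = kv_bytes_per_token(layers=30, kv_heads=8, head_dim=128, bytes_per_value=1)
for users in (1, 10):
    print(f"{users} user(s) at 25k context: {kv_cache_gb(25_000, users, per_token):.2f} GB")
```

The point is the shape, not the exact numbers: the cache grows linearly with both context length and the number of concurrent users.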

I was curious about five things:

does decode stay usable as context grows?
how much does prefill add to the time to first token?
what happens when the prompt is short but the output long?
how does it behave with multi-turn conversations, where context thickens per turn?
and (added only later) what does all of this look like when requests do not come in lockstep, but organically?

That last question turned out to be half the story.

The test setup

The server ran in Docker with the official vLLM image:

docker run -d --name vllm-bench \
  --gpus all --ipc=host \
  -v appliance_hf-cache:/root/.cache/huggingface \
  -p 8000:8000 \
  vllm/vllm-openai:v0.20.1 \
  --model google/gemma-4-26B-A4B-it \
  --served-model-name gemma-4-26b-a4b-bf16 \
  --max-model-len 131072 \
  --gpu-memory-utilization 0.95 \
  --kv-cache-dtype fp8 \
  --limit-mm-per-prompt '{"image":0,"audio":0}' \
  --async-scheduling \
  --no-enable-prefix-caching \
  --host 0.0.0.0 \
  --port 8000

A few details matter.

Prefix caching is deliberately off. I wanted to see the raw prefill cost first, not a benchmark that looks nicer because the prompts resemble each other.

The KV-cache runs on fp8. Without it, 128k context with multiple concurrent requests quickly becomes a memory exercise that gets you nowhere.

All nine tests below use exactly this server config. No restart, no mid-run change. What varies is the workload: prompt size, output size, concurrency, depth, and for the open-loop tests also arrival rate and burstiness.

What the Spark makes of this:

Component	Value
Model weights (BF16)	~48 GB
KV-cache headroom (fp8)	~65 GB
Theoretical parallel @ 128k	~4 requests
Theoretical parallel @ 8k	~50 requests

At full context per request, memory is tight. In practice no test uses 128k at once per user, so the bottleneck shifts to prefill compute and scheduler batching. We see that below.

Run A: making the context bigger

The first run grew the context from 4k to 25k. Concurrency went along from 1 to 5 and 10. Closed-loop, so N users in lockstep.

uvx llama-benchy \
  --base-url http://localhost:8000/v1 \
  --model gemma-4-26b-a4b-bf16 \
  --pp 4096 8192 16384 25000 \
  --tg 256 \
  --depth 0 \
  --concurrency 1 5 10 \
  --runs 3 \
  --latency-mode generation \
  --format md

pp is prefill, that is how many prompt tokens go in. tg is decode, that is how many tokens the model generates afterwards. llama-benchy reports mean ± stddev. No p95. That is important to remember, because with latency you otherwise quickly fool yourself into optimism.
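
Why the missing p95 matters can be shown with a handful of made-up numbers. This sketch uses invented TTFT samples, not benchmark data, and a simple nearest-rank percentile; the helper name is mine.

```python
import math
import statistics


def percentile(values, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(len(ordered) * pct / 100))
    return ordered[rank - 1]


# Invented TTFT samples in seconds: most requests are fast, two sit behind a long prefill.
ttft_s = [1.1, 1.2, 1.2, 1.3, 1.3, 1.4, 1.4, 1.5, 9.0, 12.0]

mean = statistics.mean(ttft_s)
stddev = statistics.stdev(ttft_s)
print(f"mean ± stddev: {mean:.2f} ± {stddev:.2f} s")
print(f"p50: {percentile(ttft_s, 50):.2f} s, p95: {percentile(ttft_s, 95):.2f} s")
```

The mean sits far from both the typical request and the unlucky one. A large stddev hints at a tail, but only a percentile tells you how long the unlucky user actually waited.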

This is the summary from Run A:

Context	Users	Prefill total	Decode/user	Decode total	TTFT
4k	1	3677.85 ± 1259.27 tok/s	24.08 ± 0.02 tok/s	24.08 ± 0.02 tok/s	1.37 ± 0.52s
4k	5	5722.96 ± 94.70 tok/s	12.55 ± 0.49 tok/s	57.07 ± 2.64 tok/s	2.29 ± 0.82s
4k	10	5475.53 ± 888.14 tok/s	9.48 ± 0.73 tok/s	84.40 ± 3.08 tok/s	4.46 ± 2.38s
8k	1	6121.87 ± 62.31 tok/s	23.69 ± 0.02 tok/s	23.69 ± 0.02 tok/s	1.39 ± 0.01s
8k	5	5444.57 ± 12.82 tok/s	11.48 ± 0.92 tok/s	49.42 ± 1.60 tok/s	4.34 ± 1.91s
8k	10	5478.98 ± 11.48 tok/s	8.52 ± 1.10 tok/s	67.72 ± 0.91 tok/s	7.99 ± 4.03s
16k	1	4607.64 ± 23.05 tok/s	23.34 ± 0.05 tok/s	23.34 ± 0.05 tok/s	3.42 ± 0.00s
16k	5	4466.35 ± 27.19 tok/s	10.05 ± 1.75 tok/s	38.41 ± 0.12 tok/s	10.43 ± 4.69s
16k	10	4453.92 ± 18.19 tok/s	6.79 ± 1.62 tok/s	45.76 ± 0.43 tok/s	18.92 ± 9.43s
25k	1	3621.25 ± 18.50 tok/s	22.75 ± 0.08 tok/s	22.75 ± 0.08 tok/s	6.39 ± 0.05s
25k	5	3561.78 ± 9.23 tok/s	8.46 ± 2.36 tok/s	27.93 ± 0.08 tok/s	19.63 ± 8.87s
25k	10	3565.35 ± 8.21 tok/s	5.40 ± 2.00 tok/s	30.73 ± 0.12 tok/s	35.67 ± 18.00s

<figure class="breakout-wide"> <img src="/blog/gemma-4-dgx-spark/run-a-ttfr.webp" width="1425" height="878" loading="lazy" decoding="async" alt="Run A: TTFT vs context, one line per concurrent users (1, 5, 10). TTFT climbs from ~1.4 seconds at 4k to 36 seconds at 25k context with 10 users." /> <figcaption>Run A: Wait time for the first token, per concurrent users. Double the prompt and you double the wait.</figcaption> </figure>

<figure class="breakout-wide"> <img src="/blog/gemma-4-dgx-spark/run-a-decode.webp" width="1425" height="878" loading="lazy" decoding="async" alt="Run A: Decode speed per user vs context. At c=1 decode stays between 22.7 and 24.1 tokens per second, at c=10 it drops from 9.5 to 5.4 tokens per second." /> <figcaption>Run A: Decode per user. With one user it stays almost flat; only with multiple users and large context does it collapse.</figcaption> </figure>

Run B: holding 25k context, concurrency up

After that I ran the same 25k context harder. No longer varying the context, only adding users.

uvx llama-benchy \
  --base-url http://localhost:8000/v1 \
  --model gemma-4-26b-a4b-bf16 \
  --pp 25000 \
  --tg 256 \
  --depth 0 \
  --concurrency 5 10 20 \
  --runs 3 \
  --latency-mode generation \
  --exit-on-first-fail \
  --format md

No OOM. No crash. The DGX Spark survived 20 concurrent requests at 25k context.

Users	Prefill total	Decode/user	Decode total	TTFT
5	3559.17 ± 6.72 tok/s	8.51 ± 2.40 tok/s	27.88 ± 0.05 tok/s	19.86 ± 9.00s
10	3569.77 ± 2.99 tok/s	5.37 ± 1.99 tok/s	30.68 ± 0.09 tok/s	35.44 ± 17.95s
20	3563.64 ± 8.78 tok/s	3.16 ± 1.41 tok/s	32.26 ± 0.10 tok/s	67.37 ± 36.44s

<figure class="breakout-wide"> <img src="/blog/gemma-4-dgx-spark/run-b-prefill-wall.webp" width="1522" height="843" loading="lazy" decoding="async" alt="Run B: TTFT grows linearly with concurrency: 19.9s at 5 users, 35.4s at 10, 67.4s at 20. Aggregate decode sticks around 30 tok/s." /> <figcaption>Run B: Aggregate decode sticks at ~30 tok/s; all the extra wait goes into TTFT.</figcaption> </figure>

This is the stress edge of the benchmark. Aggregate decode sticks around 30 tok/s, regardless of whether you put 5, 10 or 20 users on it. Per user it drops from 8.51 to 3.16 tok/s. But the real problem is TTFT: at 20 users the average request waits 67 seconds before the first token arrives. The server is not broken then. The workload just no longer fits a realtime chat expectation.

Run C: short prompt, long output

Run C flipped the shape. Not 25k context with short output, but 1024 prompt tokens and 1024 output tokens.

Users	Prefill total	Decode/user	Decode total	TTFT
1	4627.12 ± 374.91 tok/s	23.86 ± 0.03 tok/s	23.86 ± 0.03 tok/s	0.31 ± 0.02s
5	5701.55 ± 561.36 tok/s	13.59 ± 1.05 tok/s	54.67 ± 4.90 tok/s	0.76 ± 0.11s
10	6346.87 ± 64.52 tok/s	10.92 ± 0.73 tok/s	86.46 ± 1.74 tok/s	1.26 ± 0.40s

<figure class="breakout-wide"> <img src="/blog/gemma-4-dgx-spark/run-c-grouped.webp" width="1227" height="777" loading="lazy" decoding="async" alt="Run C: per-user decode drops from 23.9 (c=1) to 10.9 (c=10), aggregate decode climbs to 86.5 tok/s." /> <figcaption>Run C: short prompt, long output. Aggregate decode scales neatly to 86 tok/s, per-user stays comfortably readable.</figcaption> </figure>

At ten users at once, TTFT stays at 1.3 seconds. That feels like chat.

Run G: even longer output

Runs A, B and C showed enough to make the "decode is stable, prefill decides the wait" story plausible. But one scenario stayed open: what if the output is much longer still? An agent generating code. A tool-call with structured output. A long summary.

Users	Prefill total	Decode/user	Decode total	TTFT
1	1993.94 ± 262.05 tok/s	24.17 ± 0.02 tok/s	24.17 ± 0.02 tok/s	0.24 ± 0.01s
5	3048.28 ± 496.15 tok/s	14.32 ± 2.18 tok/s	46.11 ± 11.57 tok/s	0.38 ± 0.07s
10	4800.80 ± 50.75 tok/s	11.75 ± 0.68 tok/s	83.77 ± 4.04 tok/s	0.48 ± 0.01s

<figure class="breakout-wide"> <img src="/blog/gemma-4-dgx-spark/run-g-grouped.webp" width="1227" height="777" loading="lazy" decoding="async" alt="Run G: per-user decode 24.2 (c=1), 14.3 (c=5), 11.8 (c=10); aggregate 24.2, 46.1, 83.8 tok/s." /> <figcaption>Run G: 4k output: long generations are only longer, not slower. Per-user sits close to Run C.</figcaption> </figure>

Decode/user over 4096 tokens barely drops compared to C's 1024 tokens. At c=1 it is 24.17 (G) vs 23.86 (C). At c=10 it is 11.75 (G) vs 10.92 (C). Long generations do not compound, they just take proportionally longer. And TTFT is lowest here: under half a second at ten users at once.
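One rough way to see "longer, not slower" is to estimate wall-clock time per request as TTFT plus output tokens divided by per-user decode speed. The sketch below plugs in the c=10 means from the Run C and Run G tables. The function name is my own and the estimate ignores variance between requests, so treat it as back-of-the-envelope arithmetic rather than a measurement.

```python
def response_seconds(ttft_s: float, output_tokens: int, decode_tok_s: float) -> float:
    """Estimate end-to-end time: wait for the first token, then stream the rest."""
    return ttft_s + output_tokens / decode_tok_s


# Mean values at 10 concurrent users, taken from the Run C and Run G tables.
run_c = response_seconds(ttft_s=1.26, output_tokens=1024, decode_tok_s=10.92)
run_g = response_seconds(ttft_s=0.48, output_tokens=4096, decode_tok_s=11.75)

print(f"Run C: {run_c:.0f} s, Run G: {run_g:.0f} s, ratio {run_g / run_c:.1f}x")
```

Four times the output costs a little under four times the time, which is what "proportionally longer" means here.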

Run F: medium context, more users

Between Run C (1k context) and Run B (25k context) sat a gap that is closer to reality. A typical RAG flow with four chunks of ~2k tokens comes out around 8k.

Users	Prefill total	Decode/user	Decode total	TTFT
5	5439.51 ± 32.60 tok/s	12.11 ± 0.51 tok/s	55.21 ± 1.49 tok/s	4.32 ± 1.90s
10	5466.71 ± 15.65 tok/s	9.31 ± 0.77 tok/s	78.36 ± 1.61 tok/s	7.99 ± 4.02s
20	5532.74 ± 5.39 tok/s	6.05 ± 0.62 tok/s	97.35 ± 3.50 tok/s	14.61 ± 7.72s

<figure class="breakout-wide"> <img src="/blog/gemma-4-dgx-spark/run-f-ttfr.webp" width="1522" height="843" loading="lazy" decoding="async" alt="Run F: 8k context. TTFT climbs from 4.3s (c=5) to 8.0s (c=10) to 14.6s (c=20); aggregate decode reaches 97.4 tok/s." /> <figcaption>Run F: 8k context. TTFT grows linearly with concurrency, aggregate decode keeps scaling to almost 100 tok/s.</figcaption> </figure>

Three observations.

Prefill throughput sits at a flat 5.5k tok/s, regardless of whether it is 5, 10 or 20 users. At 8k context the machine is already saturated at the prefill level.

Aggregate decode keeps scaling: in Run B (25k) this plateaued at ~30 tok/s, here it runs up to 97.4 tok/s.

And most importantly: TTFT at 8k context is roughly a quarter of what it is at 25k. Same concurrency, same machine, different prompt size.

Run E: multi-turn as realistic office work

--depth 4 means: five turns in a row per request (initial + four follow-ups). Concurrency at 10 means: ten such conversations in parallel.

Users	Prefill total	Decode/user	Decode total	TTFT
1	4716.21 ± 542.88 tok/s	23.97 ± 0.10 tok/s	23.97 ± 0.10 tok/s	0.53 ± 0.06s
5	5693.39 ± 128.08 tok/s	13.07 ± 0.16 tok/s	59.48 ± 2.26 tok/s	1.32 ± 0.39s
10	6096.81 ± 56.92 tok/s	10.43 ± 0.35 tok/s	92.42 ± 3.33 tok/s	2.13 ± 0.83s

<figure class="breakout-wide"> <img src="/blog/gemma-4-dgx-spark/run-e-multiturn.webp" width="1242" height="777" loading="lazy" decoding="async" alt="Run E: multi-turn. Per-user 24.0/13.1/10.4 tok/s, aggregate 24.0/59.5/92.4 tok/s, highest aggregate of all closed-loop runs at c=10." /> <figcaption>Run E: multi-turn (depth = 4) at 2k starting context. Aggregate of 92 tok/s is the highest number across all six closed-loop runs at 10 concurrent users.</figcaption> </figure>

Three things stood out that I had not expected up front.

Per-user decode with multi-turn is practically identical to single-turn: 10.43 tok/s at c=10, against 10.92 tok/s in Run C. Multi-turn does not make the tokens slower, only the number of prefills goes up.

Aggregate decode at c=10 is 92.42 tok/s, the highest of any closed-loop run at that concurrency. My reading is that with multi-turn, vLLM gets a denser stream of dependent requests fed to it, and can batch those more efficiently than ten separate single-shot prompts.

And TTFT at c=10 averages 2.13 seconds across all five turns. Under three seconds still feels like chat.

What the six closed-loop runs show together

One table that puts everything at c=10 side by side:

Run	Prompt	Output	Depth	TTFT (c=10)	Decode/user (c=10)	Aggregate decode (c=10)
G	256	4096	0	0.48s	11.75 t/s	83.8 t/s
C	1024	1024	0	1.26s	10.92 t/s	86.5 t/s
E	2048	512	4	2.13s	10.43 t/s	92.4 t/s
F	8192	512	0	7.99s	9.31 t/s	78.4 t/s
A	16384	256	0	18.92s	6.79 t/s	45.8 t/s
A/B	25000	256	0	35.67s	5.40 t/s	30.7 t/s

<figure class="breakout-wide"> <img src="/blog/gemma-4-dgx-spark/summary-c10.webp" width="1569" height="944" loading="lazy" decoding="async" alt="Scatter of all six closed-loop runs at c=10. Y-axis decode/user (5 to 12 tok/s), X-axis TTFT logarithmic (0.5s to 49s). G and C top left, A-25k bottom right." /> <figcaption>All six closed-loop runs at 10 concurrent users. Decode per user barely moves up to 8k context. TTFT moves everywhere.</figcaption> </figure>

Two patterns jump out.

Decode/user barely moves up to 8k context. Between Run G and Run F there is a factor of 32 in prompt size and a factor of 8 in output size. Yet decode/user there sits between 9.3 and 11.8 tok/s. Only at 16k+ does that band collapse.

TTFT moves everywhere and is almost a function of prompt size alone. Double the prompt and the TTFT roughly doubles along with it. Output size and depth matter almost nothing for TTFT.

That is the closed-loop conclusion. It holds, and it tells a real part of the story. But there is a gap in it.

But these are synthetic tests

The six runs above test capacity. Ceilings. All in the same shape: N users in lockstep, all the same prompt size, all hitting send buttons at the same time. That is a fine way to measure where it breaks. It is a bad way to measure how a real office feels.

Because a real office has 25 employees, of whom on average a few are doing something at the same time. One colleague asks a short question. Another is mid-RAG with 8k context. The third is in turn 4 of a conversation. And requests do not arrive in lockstep. They arrive as a Poisson process with the occasional burst, because someone just finished an email and three colleagues all want coffee at once.

That is what vLLM's own vllm bench serve can do and llama-benchy cannot:

Open-loop with arrival rate. Dispatch requests according to a Poisson or Gamma distribution, instead of lockstep.
Percentiles. P50, P90, P95, P99 on TTFT, TPOT (time per output token), ITL (inter-token latency) and E2E. No more mean ± stddev.
Realistic datasets. ShareGPT replay of 94k+ real conversations with naturally varying prompt lengths and multi-turn structure.
Mixed workloads. Sample prompts from a distribution instead of testing one fixed shape.
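The first item, open-loop arrivals, can be sketched in a few lines. This is a minimal illustration, assuming inter-arrival gaps are drawn from a Gamma distribution whose shape equals the burstiness value and whose mean is 1 / rate. The function name and defaults are mine, not part of any benchmark tool.

```python
import random


def arrival_gaps(rate_rps: float, burstiness: float, n: int, seed: int = 42) -> list[float]:
    """Draw n inter-arrival gaps (seconds) with mean 1 / rate_rps.

    Assumption: gaps follow a Gamma distribution with shape = burstiness and
    scale = 1 / (rate_rps * burstiness). burstiness = 1.0 reduces to the
    exponential gaps of a Poisson process; values below 1.0 are clumpier.
    """
    rng = random.Random(seed)
    scale = 1.0 / (rate_rps * burstiness)
    return [rng.gammavariate(burstiness, scale) for _ in range(n)]


gaps = arrival_gaps(rate_rps=0.3, burstiness=0.7, n=20_000)
mean_gap = sum(gaps) / len(gaps)
short_gaps = sum(g < 1.0 for g in gaps) / len(gaps)

print(f"mean gap {mean_gap:.2f} s, share of gaps under 1 s: {short_gaps:.0%}")
```

The mean gap stays near 3.3 seconds, but a noticeable share of requests arrive less than a second after the previous one. Lockstep concurrency never produces that clumping.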

Three tests below, same server (no restart), but with those other glasses on.

Test H: realistic office baseline

The scenario: 25 people active on average, each sends a prompt roughly once per 1-2 minutes, prompts vary widely in length. Arrivals are slightly clumpy.

docker exec vllm-bench vllm bench serve \
  --backend openai-chat \
  --base-url http://localhost:8000 \
  --endpoint /v1/chat/completions \
  --model google/gemma-4-26B-A4B-it \
  --tokenizer google/gemma-4-26B-A4B-it \
  --served-model-name gemma-4-26b-a4b-bf16 \
  --dataset-name random \
  --random-input-len 4000 \
  --random-output-len 500 \
  --random-range-ratio 0.9 \
  --num-prompts 200 \
  --request-rate 0.3 \
  --burstiness 0.7 \
  --percentile-metrics ttft,tpot,itl,e2el \
  --metric-percentiles 50,90,95,99 \
  --seed 42

With --random-range-ratio 0.9, input lengths vary from 399 to 7600 tokens, outputs from 49 to 950. --burstiness 0.7 is slightly clumpier than pure Poisson. People often hit enter in little bursts, not like a metronome. Target rate of 0.3 req/s = ~18 prompts/min across 25 users.
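The arithmetic behind those numbers, as a small sketch. It assumes the sampled length range is floor(mean × (1 − ratio)) to ceil(mean × (1 + ratio)); that rule is my assumption about how the bounds are derived, and the helper name is hypothetical.

```python
import math


def sampled_range(mean_len: int, range_ratio: float) -> tuple[int, int]:
    """Bounds for a uniform length draw around mean_len.

    Assumption: low = floor(mean * (1 - ratio)), high = ceil(mean * (1 + ratio)).
    """
    low = math.floor(mean_len * (1 - range_ratio))
    high = math.ceil(mean_len * (1 + range_ratio))
    return low, high


input_bounds = sampled_range(4000, 0.9)
output_bounds = sampled_range(500, 0.9)

# 0.3 req/s spread over 25 users: prompts per minute overall,
# and seconds between prompts for a single user.
prompts_per_minute = 0.3 * 60
seconds_per_user_prompt = 25 / 0.3

print(input_bounds, output_bounds, prompts_per_minute, round(seconds_per_user_prompt))
```

Under that assumption the lower bounds come out at 399 and 49 rather than 400 and 50, because 1 − 0.9 is slightly below 0.1 in floating point and the floor rounds down. One prompt per user roughly every 83 seconds matches the "once per 1-2 minutes" scenario.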

Metric	Value
Successful requests	200 / 200
Achieved RPS	0.27 (target 0.30)
Peak concurrent requests	36
Total token throughput	1215 tok/s

	Mean	P50	P90	P95	P99
TTFT (ms)	1395	1286	2284	2644	3316
TPOT (ms)	177	182	193	202	214
E2E (ms)	85921	85306	150192	162375	171351

The median user gets the first token in 1.29s. Still feels like chat. The tail stays within bounds: P99 waits 3.3 seconds, a bit more than twice the average.

And look at peak concurrent: 36. At a target rate of just 0.3 req/s. No closed-loop run came near that. The Poisson burstiness alone, combined with an average response time of ~86 seconds, produces peaks heavier than any Run B stress test had. That is the thing closed-loop literally cannot show.
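Little's law gives a quick sanity check on that queue depth: average requests in flight equals arrival rate times average time in the system. A minimal sketch with the Test H numbers from the tables above; the function name is mine.

```python
def mean_in_flight(achieved_rps: float, mean_e2e_s: float) -> float:
    """Little's law: average requests in the system = arrival rate x time in system."""
    return achieved_rps * mean_e2e_s


# Achieved RPS and mean E2E latency from the Test H tables above.
test_h = mean_in_flight(achieved_rps=0.27, mean_e2e_s=85.921)

print(f"Test H: about {test_h:.0f} requests in flight on average, 36 at the peak")
```

So even the average sits above the 20 users of the heaviest closed-loop run, and the bursts push the peak well past it.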

Test I: real conversations (ShareGPT replay)

Identical arrival pattern to Test H, but now with 250 real multi-turn conversations from ShareGPT V3 as prompts. Some are 1 turn of 200 tokens, others are 15 turns with ever-growing context.

docker exec vllm-bench vllm bench serve \
  ... \
  --dataset-name sharegpt \
  --dataset-path /tmp/ShareGPT_V3.json \
  --num-prompts 250 \
  --request-rate 0.3 \
  --burstiness 0.7

Metric	Value
Successful requests	250 / 250
Achieved RPS	0.30 (target 0.30)
Peak concurrent requests	17
Total token throughput	133 tok/s

	Mean	P50	P90	P95	P99
TTFT (ms)	376	353	469	509	637
TPOT (ms)	93	95	117	123	135
E2E (ms)	19600	10923	49525	63036	82596

This is a different universe than Test H. TTFT P99 = 637 ms. 99% of users see the first token within 650 milliseconds. That is genuine chat speed.

Identical arrival pattern to Test H, completely different experience. The difference is entirely in prompt size: ShareGPT conversations average 228 tokens, not 4000. Short prompt = cheap prefill = no queue pressure = sub-second TTFT.

Metric	Test H (random 4k)	Test I (ShareGPT)
Achieved RPS	0.27	0.30
Peak concurrent	36	17
TTFT P50	1286 ms	353 ms
TTFT P99	3316 ms	637 ms
TPOT P50	182 ms	95 ms

This is also a warning: the synthetic workload of Test H overstates how heavy an average office prompt is. Real-world conversations are lighter than our 4k random baseline, so the real-world numbers probably sit closer to Test I than to Test H.

Test J: Monday morning peak

What if everyone comes in at the same time and starts hitting send buttons? Fivefold load, max 25 concurrent requests to model a real office.

docker exec vllm-bench vllm bench serve \
  ... \
  --dataset-name random \
  --random-input-len 4000 \
  --random-output-len 500 \
  --random-range-ratio 0.9 \
  --num-prompts 300 \
  --request-rate 1.5 \
  --burstiness 1.0 \
  --max-concurrency 25

Metric	Value
Successful requests	300 / 300
Configured RPS	1.50
Achieved RPS	0.26
Peak concurrent requests	27
Total token throughput	1173 tok/s

	Mean	P50	P90	P95	P99
TTFT (ms)	1370	1132	1932	2961	6157
TPOT (ms)	185	187	195	199	221
E2E (ms)	92752	91099	165179	172073	179139

This is the key number: achieved rate 0.26 at target 1.5. The system is throttled almost 6x. Not because it crashes (all 300 requests succeed, no failures), but because the queue fills up to 25 and holds requests there until there is room.
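The achieved rate follows almost directly from the cap. With at most 25 requests in flight and each request taking about 93 seconds end to end, throughput cannot exceed cap divided by latency. A sketch under that assumption, with a hypothetical helper name:

```python
def max_sustained_rps(max_concurrency: int, mean_e2e_s: float) -> float:
    """With at most max_concurrency requests in flight, throughput is capped at cap / latency."""
    return max_concurrency / mean_e2e_s


# Cap from --max-concurrency and mean E2E latency from the Test J table.
ceiling = max_sustained_rps(max_concurrency=25, mean_e2e_s=92.752)
throttle = 1.5 / 0.26  # configured rate divided by achieved rate

print(f"ceiling {ceiling:.2f} req/s, throttle factor {throttle:.1f}x")
```

The ceiling of roughly 0.27 req/s sits right next to the achieved 0.26, which is why raising the target rate changes nothing except how long requests wait.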

Compare Test H (target 0.3) and Test J (target 1.5):

Metric	Test H (0.3 rps)	Test J (1.5 rps)
Achieved RPS	0.27	0.26
TTFT P50	1286 ms	1132 ms
TTFT P95	2644 ms	2961 ms
TTFT P99	3316 ms	6157 ms
TPOT P50	182 ms	187 ms

The median experience is even slightly better in Test J than in Test H (1.13s vs 1.29s). The cap creates a smoother stream. But the tail is dramatically worse: P99 nearly doubles from 3.3s to 6.2s.

<figure class="breakout-wide"> <img src="/blog/gemma-4-dgx-spark/open-loop-ttft.webp" width="1425" height="882" loading="lazy" decoding="async" alt="Open-loop TTFT percentiles for H (random 4k 0.3 rps), I (ShareGPT 0.3 rps) and J (random 4k 1.5 rps). I stays sub-second everywhere; H climbs to 3.3s P99; J shoots up to 6.2s P99." /> <figcaption>Open-loop TTFT percentiles. The median says little; the tail tells you where overload hurts.</figcaption> </figure>

The Spark does not scale under oversubscription, it queues. That is good news: graceful degradation instead of crashes. For on-prem AI that is really the best failure mode.

What closed-loop hides, what open-loop overstates

The two methods each tell a different part of the story. Both true, both incomplete.

Closed-loop underestimates queue depth.

In Run F I tested c=10 as "ten users at once". That sounds like a reasonably busy office situation. But Test H shows that an organic 0.3 req/s arrival rate is already enough to produce peaks of 36 concurrent requests. So the closed-loop "10 users" claim is more optimistic than practice shows.

Open-loop with synthetic overstates the real load.

At the same time: Test H uses random 4k prompts. A real office does not send 18 prompts per minute that average 4k tokens each. ShareGPT (Test I) is a much better proxy for "what people type", averaging 228 tokens. With that workload shape, peak concurrent is 17 instead of 36, and P99 TTFT 637ms instead of 3.3s.

So reality sits between Run F and Test I:

Source	TTFT	Peak concurrent
Run F (closed-loop, 10 users, 8k)	7.99 s mean	10
Test H (open-loop, 0.3 rps, 4k random)	1.29 s P50 / 3.3 s P99	36
Test I (open-loop, 0.3 rps, ShareGPT)	0.35 s P50 / 0.64 s P99	17
Test J (open-loop, 1.5 rps, 4k random, cap 25)	1.13 s P50 / 6.2 s P99	27

For an office with realistic prompts and a realistic arrival pattern, Test I is closest to what people feel. For capacity planning ("what if everyone asks an 8k RAG question at once?"), Run F is closest to what the machine can chew through.

The tail tells what the average hides

llama-benchy gave only mean ± stddev. That sounds like a lot of information, but it hides the part that matters most to your users: the tail.

Test I's mean TTFT is 376ms. Sounds fine. But what does that say about the 1% of users who arrive just as the queue spikes? Nothing. For that you need P99, and that sits at 637ms. Here it is no problem (both are sub-second), but the principle holds.

Test H's mean TTFT is 1395ms. P99 is 3316ms. Comfortably more than twice as bad as the average for the unlucky 1%.

Test J's mean TTFT is 1370ms. P99 is 6157ms. Comfortably four times the average.

For SLA decisions ("our system answers within 3 seconds for 95% of requests") you need these percentiles. Mean ± stddev can suggest an SLA you do not hit at the moments that matter most, namely when it is busy.
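To make that concrete, here is a small sketch that checks the example SLA (first token within 3 seconds) against the reported percentiles. The numbers are copied from the Test H, I and J tables; the helper and the dictionary layout are my own.

```python
# Reported TTFT figures in milliseconds, copied from the Test H, I and J tables.
TTFT_MS = {
    "H": {"mean": 1395, "p50": 1286, "p90": 2284, "p95": 2644, "p99": 3316},
    "I": {"mean": 376, "p50": 353, "p90": 469, "p95": 509, "p99": 637},
    "J": {"mean": 1370, "p50": 1132, "p90": 1932, "p95": 2961, "p99": 6157},
}


def meets_sla(stats: dict[str, int], percentile: str, limit_ms: int) -> bool:
    """True if the reported percentile stays within the latency limit."""
    return stats[percentile] <= limit_ms


for name, stats in TTFT_MS.items():
    tail_ratio = stats["p99"] / stats["mean"]
    print(
        f"Test {name}: P95 within 3 s: {meets_sla(stats, 'p95', 3000)}, "
        f"P99 within 3 s: {meets_sla(stats, 'p99', 3000)}, "
        f"P99 is {tail_ratio:.1f}x the mean"
    )
```

Tests H and J both pass a 95% SLA at 3 seconds and both fail a 99% one, while their means of about 1.4 seconds would suggest plenty of headroom.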

That is why the blog cannot land on llama-benchy alone. Testing capacity is one thing. Reporting tail latency is another.

Decode is not the problem

With one user, decode stays almost flat.

4k context gets 24.08 tok/s per user. 25k context gets 22.75 tok/s. 4096 output tokens (Run G, c=1) gets 24.17 tok/s. Multi-turn with depth 4 (Run E, c=1) gets 23.97 tok/s. Four different workloads, all within 6 percent of each other.

At ten users at once something similar happens, only on a lower line. Run G: 11.75 tok/s/user. Run C: 10.92. Run E: 10.43. Run F: 9.31. And in the open-loop tests: Test I gives TPOT P50 = 95ms = ~10.5 tok/s/user. Test H and J give TPOT P50 = ~185ms = ~5.4 tok/s/user (because peaks there hit 25+ concurrent).
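The conversion between the two units is just a reciprocal. A short sketch using the TPOT P50 values from the open-loop tables; the function name is mine.

```python
def tokens_per_second(tpot_ms: float) -> float:
    """Convert time per output token (ms) into per-user decode speed (tok/s)."""
    return 1000.0 / tpot_ms


# TPOT P50 values from the open-loop tables.
for test, tpot_ms in {"H": 182, "I": 95, "J": 187}.items():
    print(f"Test {test}: {tpot_ms} ms/token = {tokens_per_second(tpot_ms):.1f} tok/s per user")
```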

In short: per-token decode speed is mostly a function of average concurrent load, not of prompt length, output length, multi-turn, or arrival pattern. It only really drops below 7 t/s/user at 16k+ context combined with multiple users (Run A), or when peaks run past 25 concurrent requests (Tests H and J).

Concurrency on its own is not the problem. Long output isn't either. Multi-turn isn't either. Only large context together with multiple users eats decode.

Prefill is the wall

What you feel first is waiting.

With one user at 25k context it takes a good 6 seconds before the first response comes. At five users that becomes 19.9 seconds. At ten it becomes 35.4 seconds. At twenty it becomes 67.4 seconds.

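Run F shows that the wait scales roughly linearly with concurrency and at least linearly with context. 8k context at 20 users gives 14.6 seconds, between a fifth and a quarter of the 67.4 seconds at 25k context for the same concurrency. The prompt is a third of the size, and prefill throughput is also higher at 8k (about 5.5k tok/s against 3.6k tok/s at 25k). Halve the prompt and you at least halve the wait.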
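A back-of-the-envelope model reproduces these waits fairly well. The sketch assumes prompts are prefilled strictly one after another, so the k-th request waits for k prompts and the mean wait is (users + 1) / 2 prompt-prefill times. That is my simplification, not a description of how vLLM actually schedules, and the function name is hypothetical. The inputs are the measured prefill throughput and mean TTFT from the Run B and Run F tables.

```python
def mean_ttft_estimate(users: int, prompt_tokens: int, prefill_tok_s: float) -> float:
    """Mean TTFT if prompts are prefilled one after another.

    Assumption: request k (1-based) waits for k prompts to be prefilled, so the
    mean over all users is (users + 1) / 2 prompt-prefill times. Decode time for
    the first token is ignored.
    """
    per_prompt_s = prompt_tokens / prefill_tok_s
    return (users + 1) / 2 * per_prompt_s


# (users, prompt tokens, measured prefill tok/s, measured mean TTFT in s) from Runs B and F.
measured = [
    (5, 25000, 3559.17, 19.86),
    (10, 25000, 3569.77, 35.44),
    (20, 25000, 3563.64, 67.37),
    (5, 8192, 5439.51, 4.32),
    (10, 8192, 5466.71, 7.99),
    (20, 8192, 5532.74, 14.61),
]

for users, prompt, prefill, ttft in measured:
    estimate = mean_ttft_estimate(users, prompt, prefill)
    print(f"{users:>2} users, {prompt:>5} tokens: estimated {estimate:5.1f} s, measured {ttft:5.2f} s")
```

The estimates land within about 10 percent of the measured means, all slightly on the high side. A queue like this would also explain the large standard deviations: the first request in line waits for one prompt, the last one for all of them.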

And Test J shows: as soon as you push the system past its throughput ceiling, all that extra wait goes into the tail. Median TTFT stays stable around 1.1-1.3s, but P99 shoots to 6 seconds. The pain of overload falls on a small group, not on everyone.

That is where the real limit sits.

Not: can the DGX Spark generate tokens? Yes.

Not: can the KV-cache handle 20 × 25k? Also yes.

Not: does it stop under overload? No, it queues along nicely.

But: does this still feel like chat? Not for 25k. For 8k it is already borderline. For 2k with multi-turn just fine. For ShareGPT-realistic prompts with 25 users spread organically: a crystal-clear yes.

Where this does fit

These benchmarks make the on-prem choice more concrete.

Yes for an office environment where 10 to 25 people use local AI spread across the day. Test I is the proof: 250 real ShareGPT conversations, 0.3 req/s arrival rate, P99 TTFT of 637ms. The median user sees the first token in 353 milliseconds. That is exactly the office scenario, and this is what it feels like.

Yes for RAG flows with medium context. Run F gave the numbers up front: 8k prompt, 10 users, 8s TTFT, 9.3 tok/s streaming. Test H confirms the open-loop variant is still workable: P99 TTFT 3.3s. Not realtime, but within waitable bounds.

Yes for agents and code generation. Run G is the confirmation: short instruction, 4k+ tokens output, ten parallel tasks. TTFT under half a second, 11.75 tok/s/user.

Yes for multi-turn conversations. Run E gives 2.1s TTFT at 10 parallel 5-turn conversations. Decode the same as single-turn.

Careful with 5+ users at 25k context at once. 19.9 seconds TTFT is no longer chat, but workable for analysis.

Careful with SLA claims based on averages. Test H's mean TTFT of 1.4s could sound acceptable, but P99 sits at 3.3s. Base decisions on percentiles, not on the mean.

No for support chat where ten to twenty users send 25k context per session at once and all expect a realtime answer. Or: support chat under Test J-like load (1.5 rps of 4k prompts). That can technically run (no failures), but P99 TTFT of 6 seconds is a borderline case for chat.

What these tests do not say

This is not a MoE-vs-dense comparison. I want to test that separately, and then not only with throughput. If you compare MoE and dense, you also have to test prompts: summarizing, code questions, tool choice, ticket classification, a long context piece with follow-up steps. Otherwise you only measure how hard the engine spins, not whether it is driving the right way.

This is also not a test with prefix caching on. That is deliberate. I wanted to see the raw prefill cost, not a benchmark that looks nicer because the prompts resemble each other. A next piece will add it: those same 8k and 25k context runs and the open-loop tests with --enable-prefix-caching. My hunch: Test H and J benefit modestly (random data, little overlap), Test I benefits considerably (real conversations have overlapping system prompts and context), and Run F gets substantially faster. But that needs measuring.

Where I land

My expectation up front was that the DGX Spark with this MoE model would fill up sooner at large context windows. That happened, but differently than I thought.

Memory was not the showstopper. Run B managed 20 users at 25k context without OOM. Test J survived 1.5 req/s without a single failed request. The practical limit always sat in prefill latency, not in capacity.

And after nine tests it turns out: that is really the only limit you feel.

Decode/user is almost a constant for this machine. Between 9 and 12 tokens per second at ten concurrent users, across six different closed-loop workloads. In open-loop with realistic ShareGPT prompts: 10.5 t/s/user. Only at 16k context or at synthetic peaks of 25+ concurrent does that drop below 7 t/s.

What varies is how long someone waits before the text begins. At 256 prompt tokens that is half a second, even with ten users. At 2048 prompt tokens with five turns an average of 2.1 seconds. At 8192 prompt tokens with ten users eight seconds. At 25k with ten users 35 seconds. Under realistic 0.3 rps ShareGPT load: 353 milliseconds for the median, 637 milliseconds for the unlucky 1%.

And as soon as you push the system above its capacity, it does not scale, it queues. Test J showed that a 1.5 req/s target gets throttled to 0.26 achieved, with the pain entirely in the tail (P99 6.2s) while the median stays stable. For on-prem AI that is the best failure mode you can hope for: nobody crashes, some wait longer.

That is not a "can this machine do it or not". That is "pick the workload that fits what the user expects, and accept that 1% of requests has an unpleasant wait at peak moments".

For one to three users with large context it is usable. For ten users with medium context it is fine. For ten users with multi-turn conversations it is actually at its best. For a 25-person office with realistic prompts and an organic arrival pattern it is astonishingly good: sub-second TTFT for 99% of requests, measured on real conversation data.

For agent flows with long outputs it is strong. For twenty concurrent 25k prompts or for 1.5 rps of oversubscription it is no longer realtime chat. There you have to queue, turn on prefix caching, or route that kind of work differently.

Two methods measure two things. Closed-loop benchmarks show what the machine can do. Open-loop replay shows what the user feels. The DGX Spark is a strong local AI machine for office work, as long as you know which knob decides what you feel.

Decode sells the benchmark. Prefill decides the experience. And as soon as you go past the limit, the Spark queues instead of breaking, and that is the third number an on-prem choice has to be able to read.

I put a 24/7 assistant on a Raspberry Pi

2026-05-01T00:00:00.000Z

I didn't want a better chatbot. I wanted an agent that picks up work on its own: hop on the internet, read tickets, dive into a repo, draft a first proposal for code changes and then report back where my team already works anyway.

The entry point had to be Slack. That's where the questions, threads, files and half-finished ideas live. The agent had to be able to use tools, read files, stage branches and keep running when my laptop closes.

So now there's a Raspberry Pi 5 with 4 GB RAM in my network. It runs OpenClaw. Slack in front, GPT-5.5 behind it, Tailscale as the gateway when I'm not home.

That sounds bigger than it is. The Pi doesn't run a local language model. OpenClaw uses the Pi as an always-on Gateway: the layer that receives Slack messages, manages sessions and workspace context, starts an agent-run, makes tools available and sends the answer back to the same thread. In this setup the model runs through OpenAI.

That distinction matters. For fully local inference I use the DGX Spark, and I wrote about that earlier in the quantization post. This Pi is the agent layer next to it: always on, reachable in Slack, close to my files and workflows.

The thing I was missing

I already use plenty of AI tools. Claude Code for building. ChatGPT for one-off questions. For client projects I work with model APIs or local models, depending on what the data and infrastructure allow.

The missing layer sat in between those tools: an agent that sees work come in and gets started. In Slack you can start small. I type a messy instruction, the agent reads the repo, pulls in the right tone-of-voice rules and comes back with something I can review.

Publishing stays manual. So does trust. The first bit of groundwork is allowed to happen automatically.

The direction is bigger than writing drafts. Eventually I want to point at a ticket and say: figure out what's needed here. The agent reads the context, checks documentation, looks in the codebase, proposes an approach and maybe stages a branch already.

That kind of work often gets left behind because it doesn't fit anywhere neatly. Too small for a sprint. Too big to do "on the side". Before you know it, that ticket is still open a week later with the same three vague comments under it.

What runs on the Pi

The base is small:

Raspberry Pi 5, 4 GB RAM
OpenClaw Gateway locally on the Pi
OpenAI GPT-5.5 as the model in this setup
Slack as the interface
Tailscale for remote access

The Pi is mostly just available here. That's its talent.

OpenClaw ties the layers together: channel, session, agent-runtime, model-provider and tools. A Slack message comes in through the channel layer. OpenClaw stages an agent-turn from it, with the right context and tools. The runtime runs that turn with the chosen model. Then OpenClaw delivers the answer back through Slack.

That way the same agent can read files, run shell commands, fetch web pages, check git status or prepare a PR, depending on which tools you allow. So the Pi isn't a mini GPU. It's the local control layer.

Tailscale keeps it practical. I can reach the Pi when I'm out. Opening a public port for a build-log would be a bit too much of an honour.

Slack as the shop floor

Slack was the easiest choice because I'm in it all day already. My companies have workspaces, channels, threads, files and notifications. An extra dashboard would mostly collect extra tab dust.

For me this is the core: the agent has to be available where the team works. If it figures something out based on a ticket, I want the answer back in the same flow. The analysis belongs next to the question, in the same thread.

OpenClaw supports more entry points than Slack. It also works through, among others, Telegram, Microsoft Teams, Google Chat, WhatsApp, Discord and iMessage. Slack is my entry point. The broader idea is agents on existing communication channels, with tools and memory behind them.

The install was less exciting than I'd hoped

The install was less dramatic than I'd expected. That's nice for me and bad for the genre "build-log with fireworks".

Most of the time went into reading. OpenClaw has a lot of documentation, and you have to work out which part fits your setup. Slack, Gateway, agents, runtimes, channels, tools: they're separate layers that eventually form one assistant together.

Setting up Slack took attention too. You decide which users may DM the bot, in which channels it may talk and whether it reacts to every message in group channels or only on an @mention. Those aren't details for later. You have to pick those rules up front and share them with your team, otherwise nobody gets when the agent does or doesn't join in.

After about two hours it worked. I typed in Slack, the Pi caught the message, OpenClaw started a run, GPT-5.5 thought along and the answer came back in the same thread.

A lot of plumbing for a text message. Except that text message can now use tools.

First test: this site

The first place I use this for is djangodevreng.nl.

The content has to come from real work: what we built, what broke, which choices stuck, where a tool looked nice until it started to sweat under load. The agent gets to help with form and execution.

Once that raw input is there, it can do a lot. Structure a dump. Make a first outline. Rewrite a draft in my tone. Strip out marketing language. Check whether a post sounds like it fell out of a generic LinkedIn carousel.

The workflow for this site usually starts messy. I dump in Slack what I want to say: a few observations, a half idea, sometimes just feedback on an existing post. The agent then finds the right repo, reads the relevant files and grabs the writing guide from the workspace.

Then I ask it for a concrete change: "rewrite the intro", "strip out the marketing language" or "make this technical explanation more precise". The sharper the instruction, the more usable the diff. It edits the markdown on a branch, runs the checks and pushes the change to a PR.

That's where my part starts again. I read the diff, give feedback in Slack and let it process the next round. Only when the post is right do I merge it myself. The agent does the prep work. I stay responsible for what goes live.

An agent that publishes without me looking isn't a workflow. That's a slot machine with commit rights.

Why this feels different from chat

A lot of AI tools feel like you have to bring your work to a chat window. You copy context, paste logs, explain for the third time where the repo path is and hope the model acts like it was there.

This setup runs closer to the context. The agent can start on its own because it sees the workspace, knows the branch, can read the rules for the site and knows which checks need to run.

That still doesn't make it an autonomous developer. It mostly pushes the first boring bit forward.

For me that's the interesting agent layer: reading along ahead of time, making a first version, pointing out where it chafes. A junior colleague with infinite patience, no agenda and sometimes a worrying confidence in its own sentences.

I'm going to write a separate post about this, because OpenClaw really deserves more explanation than fits in this build-log. Which channels it supports. Which tools you hang on it. And above all: why this gets interesting.

We're slowly shifting from AI as a sparring partner to AI as an executing layer. The past few years we mostly talked to models: brainstorming, summarizing, rewriting, thinking along. That stays useful, but the real difference is in agents that can carry out work in existing systems.

Agents don't take over people's work one-to-one. It's not that simple, luckily. The shift is in workflows: figuring out tickets, gathering context, preparing drafts, proposing code changes, running checks, reporting back. Work you'd normally ask someone for because it takes time, while it needs little deep human judgement.

Next step: tickets and MCP

The next step is MCP. I want to hang tools onto this workflow neatly, starting with Linear.

The scenario is simple: a ticket comes in, the agent reads the relevant repo context, finds the likely files, writes a short analysis and comes back with a proposal or a list of questions.

Autonomous merging I'm skipping. First I want to know where the line is between useful preparation and dangerous eagerness to act.

After that come GitHub, repo context and maybe a local knowledge base. Some context should just be available, without me pasting it into a prompt every time.

Workflow by workflow

This Pi setup is small. That's exactly why I like it.

Small enough to understand. Real enough to learn something from. Cheap enough to leave on all the time. Local enough to sit close to my work, without pretending the model itself runs locally.

For production AI at clients this is at most one layer in the architecture. For my own workflow it already works fine: Slack as the entry point, OpenClaw as the Gateway, OpenAI as the model provider, GitHub as the place where work ends up staged.

For the time being I'm going to happily tinker with this. First this site. Then tickets. Then MCP tools. Then probably something I currently still think is too specific to automate.

That's the interesting route: going workflow by workflow, handing each one to an agent that does the groundwork, gathers context and stages proposals. Step by step I'm building out my OpenClaw setup, simply as a practical assistant that takes a bit more work off my hands each time.

And if it breaks down, it's close enough to pull the plug.

What quantization turned out to be

2026-05-01T00:00:00.000Z

This was the first blog post I put live on this site. When I wrote it, I had just gotten two models running on the DGX Spark: Gemma-4-26B-A4B-it, a MoE model, and a 31B dense model. Both local, both through vLLM.

At that point, quantization was still mostly a question for me. I knew the term, I roughly understood what it was about, but I had too few measurements of my own to say anything firm about it.

By now we're a few benchmark rounds further along. First Gemma-4 on the DGX Spark. Then NVFP4 vs BF16 on that same model. And after that Nemotron-3 in BF16, FP8 and NVFP4. Together they make up the guide to running LLMs on the DGX Spark.

That changed this post, really. It's less about "what is quantization?" and more about what happens when quantization stops being a model-card term and becomes an architecture choice.

The first question was simply: does it fit?

With hosted models you often start at quality. Which one is smarter, which follows instructions better, which writes better code?

Locally you start blunter: does it fit?

That sounds almost too simple, but on your own hardware that's the first wall. A model name and a model card are paperwork; the weights have to actually fit in memory. After that you still want room for context, you want to handle several requests at once, and ideally see something come back within seconds.

On the DGX Spark you feel that immediately. You watch vLLM at work: downloading, loading, reserving memory, warming up. Only then does the discussion about throughput, latency and usability begin.

That's a different feeling from an API call to Claude or GPT-5.5. There the infrastructure mostly exists as an abstraction. You send text in and get text back. Running locally, you see the back end. Sometimes that's fun. Sometimes it mostly takes a while.

That's exactly where quantization comes in.

My first picture was too narrow

My first working definition was tidy enough: quantization stores model weights more compactly. FP16 or BF16 uses more space than 8-bit or 4-bit. Fewer bits means less memory. Less memory means a model fits more easily, loads faster, or leaves room for more context and more requests.
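As a back-of-the-envelope sketch of that arithmetic: the function below estimates weight memory from a parameter count and a bit width. The 26 billion parameter count is read off the "26B" in the model name, and the bit widths are nominal. The estimate ignores quantization scale factors, KV-cache and runtime overhead, so treat the results as rough lower bounds, not measurements.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GB (10^9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


# Assumption: ~26 billion parameters, taken from the "26B" in the model name.
N_PARAMS = 26e9

# Nominal bit widths; real quantized checkpoints carry extra scale factors.
NOMINAL_BITS = {"BF16": 16, "FP8": 8, "4-bit": 4}

for name, bits in NOMINAL_BITS.items():
    print(f"{name:>5}: ~{weight_memory_gb(N_PARAMS, bits):.0f} GB of weights")
```

Even in a MoE model where only part of the network is active per token, all weights have to be resident, so the total parameter count is the one that matters for "does it fit?".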

That's correct, but it's too small.

After the benchmarks I look at it differently. The question "does this model fit on this machine?" is only the start. After that comes the question of what you can do with that machine once the model fits.

Running one request is the demo. Running multiple requests is the workflow.

That's where the difference sits for me. A local model that answers one prompt neatly is nice. A local model that can handle several users, agents or tasks at once without latency collapsing becomes useful.

So quantization decides how much room you have left to move.

vLLM makes it concrete

I use vLLM because one request at a time isn't the situation I'm heading for. Starting a local chatbot is fine for testing, but the moment you talk about agents you get different traffic.

An agent fetches context, calls tools, splits up work, sometimes asks for things in parallel and waits for results in between. Meanwhile, you don't want a second request to wait until the first is completely done.

That's where serving matters.

vLLM is the layer that makes this concrete: batching, scheduling, using memory more efficiently and handling multiple concurrent requests. It also makes visible that running locally is a system. The model, the precision, the context length, the number of simultaneous requests and the scheduler all pull on the same hardware.
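From the client side, "multiple concurrent requests" can be sketched like this. `fake_request` is a hypothetical stand-in that only sleeps; in a real test you would swap it for a call to your own serving endpoint. The sketch shows the measurement shape (per-request latency next to total wall time) and says nothing about how vLLM batches or schedules internally.

```python
import asyncio
import time


async def fake_request(i: int) -> str:
    """Hypothetical stand-in for one model call; sleeps instead of calling a server."""
    await asyncio.sleep(0.05)
    return f"response {i}"


async def timed(request, i: int) -> float:
    """Run one request and return its latency in seconds."""
    start = time.perf_counter()
    await request(i)
    return time.perf_counter() - start


async def run_concurrent(request, concurrency: int):
    """Fire `concurrency` requests at once; return (latencies, wall_time)."""
    start = time.perf_counter()
    latencies = await asyncio.gather(
        *(timed(request, i) for i in range(concurrency))
    )
    return list(latencies), time.perf_counter() - start


if __name__ == "__main__":
    latencies, wall = asyncio.run(run_concurrent(fake_request, concurrency=10))
    print(f"{len(latencies)} requests, wall {wall:.2f}s, slowest {max(latencies):.2f}s")
```

If the wall time stays close to the slowest single request instead of the sum of all of them, the requests really did overlap.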

That was the first real lesson for me. Quantization isn't a separate trick at the bottom of the stack. It influences how the whole stack behaves.

BF16 felt like the safe choice at first

If you haven't measured yet, higher precision quickly feels safer. BF16 sounds solid. More detail, less quality risk, less chance the model starts behaving oddly.

That was my first reflex too. If the hardware can handle it, why go lower?

The measurements made that less obvious. On the DGX Spark, BF16 often turned out to be the least practical choice in the later runs. BF16 isn't "bad"; it's just that the hardware and workload weigh more heavily than the tidy feeling of higher precision.

If a lower precision gives much more room for concurrency, context or throughput, then in practice that can be better. Certainly for workloads where speed and concurrency count for more than the last bit of model quality.

That's the twist I found interesting. The highest precision intuitively feels like the serious choice. On this machine it was often mostly the most expensive one.

NVFP4 changed the Spark

The biggest shift came with NVFP4. In the benchmark posts and the Arena you can see that NVFP4 nearly doubles the DGX Spark's throughput for many workloads. That's not a small optimization anymore. It changes what you dare to try on the same machine.

For on-prem AI that's exactly the point. You buy hardware for a workflow, not for one pretty prompt. You want to know how much real work you can fit on that box.

If NVFP4 means you can run more requests at once, keep more headroom and bump into memory limits less quickly, then that's not a detail in a table. Then your architecture changes.

You can divide tasks differently. You can keep more local. You can experiment faster with agent steps that would otherwise go straight to a hosted model.

That made quantization more practical for me than I'd expected beforehand. It stopped being about a smaller model and became about enabling a different workflow.

FP8 had a different kind of upside

FP8 didn't simply sit "between BF16 and NVFP4". In the Nemotron-3 runs, tail latency was the interesting part. That draws less attention than a big throughput jump, but in use it matters at least as much.

Averages don't necessarily lie, but they reassure you at the wrong moments. A workflow feels slow because of the few requests that keep hanging.

That's why tail latency is so practical. If an agent workflow has multiple steps, delays stack. One slow step is annoying. Three slow steps in a row feel like the system is reconsidering its life choices.
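The stacking effect is easy to put numbers on. A small sketch, assuming the steps are independent and that "slow" means landing in the slowest 5% of a single step. Both are simplifying assumptions, not figures from the Nemotron-3 runs.

```python
def p_at_least_one_slow(steps: int, slow_fraction: float = 0.05) -> float:
    """Probability that at least one of `steps` independent calls lands in the slow tail."""
    return 1 - (1 - slow_fraction) ** steps


for steps in (1, 3, 10):
    chance = p_at_least_one_slow(steps)
    print(f"{steps:>2} steps: {chance:.0%} chance of hitting the tail")
```

Under those assumptions a three-step workflow hits the tail about 14% of the time and a ten-step workflow about 40% of the time, which is why a tail that looks rare per request shows up so often per workflow.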

FP8 looks useful in that corner: less extreme than NVFP4, but interesting when predictability matters more than running as much as possible at once.

That's the nuance I didn't have yet in the first version. Precision isn't a ladder where lower is always faster and worse. It's a set of choices with different trade-offs.

Quality stays the open question

The benchmarks answer memory, throughput and latency. They say less about behaviour.

That stays the hard side of quantization. You don't always see quality loss neatly in one metric. Sometimes an answer gets flatter. Sometimes code goes wrong a bit more often. Sometimes an agent picks the wrong tool. Sometimes you notice nothing, until your task is just different from your test set.

For simple tasks that can be perfectly fine. Think classification, routing, first summaries, embeddings or a light pass over internal documents. Those don't always need the heaviest model.

For code generation and agent workflows it's more sensitive. Small errors stack. One mediocre piece of reasoning is annoying. A wrong tool call is a different kind of problem.

That's why I don't want to benchmark quantized models on speed alone. I want to know where I dare to deploy them.

That's a different question. And honestly, the only one that counts.

The split gets clearer

My expectation is still that the best on-prem setup becomes a mix. "Everything local" sounds tough, but usually it's also needlessly strict.

The logical split looks more like:

embeddings local
sensitive documents local
routing and classification local
simple agent steps local
heavy reasoning to Claude or GPT-5.5 when needed

Quantization decides how big that local part can get. The more tasks run reliably and fast enough locally, the less you have to send out.
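Written down as a sketch, that split is a small routing table. The task names, the `local` and `hosted` labels and the fallback rule are all hypothetical. The point is only that the policy becomes explicit, and that sensitive data overrides everything else.

```python
from dataclasses import dataclass

# Hypothetical policy table mirroring the split described above.
ROUTES = {
    "embeddings": "local",
    "routing": "local",
    "classification": "local",
    "simple_agent_step": "local",
    "heavy_reasoning": "hosted",
}


@dataclass(frozen=True)
class Task:
    kind: str
    sensitive: bool = False


def route(task: Task) -> str:
    """Sensitive data always stays local; unknown task kinds fall back to hosted."""
    if task.sensitive:
        return "local"
    return ROUTES.get(task.kind, "hosted")


print(route(Task("classification")))                   # local
print(route(Task("heavy_reasoning")))                  # hosted
print(route(Task("heavy_reasoning", sensitive=True)))  # local
```

Sending unknown task kinds to the hosted model is one possible default. A stricter setup could refuse them instead.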

That matters for client work. Not because every token has to stay within four walls, but because some data does belong there. And because latency, cost and control simply count in production.

An on-prem setup isn't a belief. It's a division of work.

What I'd measure differently now

In the first version of this post I mostly had a list of questions. How long does downloading take? How long does loading take? How much VRAM is left? How many concurrent requests can I send before latency gets annoying?

Those questions stay useful, but they're the start. Exactly how I set up those measurements on the Spark is described in the Arena methodology.

Now I'd put three things side by side per precision:

system behaviour: loading, memory, throughput, latency and tail latency
model behaviour: Dutch output, code questions, longer context, tool use
workflow fit: which tasks do I dare run locally with this
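One way to keep those three views together is a single record per precision. This is a sketch with hypothetical field names. The values stay empty here on purpose, because they should come from real runs.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class PrecisionReport:
    """One record per precision; all field names are hypothetical."""
    precision: str
    # System behaviour: filled in from benchmark runs.
    load_seconds: Optional[float] = None
    decode_tok_s: Optional[float] = None
    p99_latency_ms: Optional[float] = None
    # Model behaviour: free-form notes per probe (Dutch output, code, tool use).
    behaviour_notes: Dict[str, str] = field(default_factory=dict)
    # Workflow fit: tasks you would trust to this precision locally.
    trusted_tasks: List[str] = field(default_factory=list)


reports = [PrecisionReport(p) for p in ("BF16", "FP8", "NVFP4")]
for r in reports:
    print(r.precision, r.trusted_tasks)
```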

That last one is easy to miss if you only look at benchmark tables. A model can technically run and still be awkward. Or it can score less impressively and still be exactly good enough for routing or summarizing.

For production that makes the difference. Nobody buys "tokens per second" alone. You buy room in a workflow.

What I understand now

My working definition has shifted.

Quantization makes a model smaller, but that's only the entrance. It changes how much work you get out of the same hardware, which latency you accept and which tasks you dare to keep local.

On the DGX Spark, the highest precision rarely seems to be the best choice by default. NVFP4 makes the machine much more usable for many workloads. FP8 is interesting when tail latency starts to matter. BF16 stays useful as a reference point, but on this hardware it rarely feels like the practical default.

That's exactly why I wanted to do these measurements. A universal ranking helps little; better architecture choices do.

The question isn't: which quantization level wins?

The question is: which task is allowed on which precision, on which machine, with how much risk?

That's where on-prem AI starts getting interesting for me: at the division of work.