Nemotron-3 on the DGX Spark: BF16 vs FP8 vs NVFP4
One model, three precisions, the same Spark. What memory budget, decode speed and tail-latency do when you go from 16 bit to 8 bit to 4 bit.
In the previous posts I ran Gemma-4 on the DGX Spark. First just BF16 as a baseline, then NVFP4 vs BF16 across the same test suite. That gave one model in two precisions. Useful, but not yet a real picture of the choice you have to make in production.
For this piece I run three variants of the same model side by side: BF16, FP8 and NVFP4 of Nemotron-3-Nano-Omni-30B-A3B-Reasoning. Same Spark. Same vLLM version. Same prompts. Same benchmark suite. As close to a fair quantization comparison as I can get on this machine.
The short version: NVFP4 wins on speed and throughput, FP8 wins more often on tail latency, and BF16 is mostly still useful as a baseline. That is less tidy than “4 bit is always better”. Lucky for us, otherwise this post would have been short. This post is part of the guide to running LLMs on the DGX Spark.
Why this experiment
The Gemma post mostly showed that NVFP4 works on the Spark. With some pain. Five vLLM bugs, a nightly build and enough flags to make a command line look like a small confession.
But Gemma did not answer the question I need for clients: what do you pick if you want to run a local model on a Spark today? BF16 because those are the original weights? FP8 because Blackwell is natively good at it? Or NVFP4 because you fit much more model and KV-cache in the same memory?
So here is this run. One model in three precisions. No leaderboard score, but workloads that resemble office work: chat, RAG, longer answers, multiple users at once, and a Monday morning where everyone suddenly decides AI is handy after all.
What BF16, FP8 and NVFP4 mean here
BF16 is the baseline: 16 bits per parameter, which is 2 bytes. For this model that means a checkpoint of about 61.5 GB. That fits on the Spark, but it eats a lot of your 128 GB unified memory before a single user has any context in the KV-cache.
FP8 roughly halves that weight. The checkpoint is 32.8 GB. On Blackwell, FP8 is a logical choice: less memory, native support, and usually little hassle in vLLM.
NVFP4 goes further. The checkpoint is 20.9 GB. Not four times smaller than BF16, because the vision and audio encoders stay in BF16, but small enough to make the Spark feel different. More room for KV-cache, more batching, more concurrency.
The nuance: the DGX Spark runs on desktop Blackwell SM12.1. There NVFP4 is not the same story as on datacenter Blackwell. vLLM uses Marlin kernels to dequantize the FP4 weights to FP16 during compute. You get the memory win in full. The compute win is less pure.
For this post that is exactly what makes it interesting. This is not a theoretical quantization post. This is: what happens on this machine, with this stack, when you actually run the three options?
| Precision | Model size | Memory budget left of 128 GB |
|---|---|---|
| BF16 | 61.5 GB | ~66 GB |
| FP8 | 32.8 GB | ~95 GB |
| NVFP4 | 20.9 GB | ~107 GB |
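The budget column is plain subtraction, and a rough sketch makes the arithmetic explicit. It assumes the resident weights take about as much memory as the checkpoint on disk, which is a simplification, and it also shows what remains under the `--gpu-memory-utilization 0.95` cap from the serve command in the test setup. The names `CHECKPOINT_GB` and `headroom_gb` are illustrative only.

```python
# Checkpoint sizes in GB, taken from the table above.
CHECKPOINT_GB = {"BF16": 61.5, "FP8": 32.8, "NVFP4": 20.9}
TOTAL_GB = 128.0  # unified memory on the DGX Spark


def headroom_gb(checkpoint_gb: float, utilization: float = 1.0) -> float:
    """Memory left after loading weights, given the fraction vLLM may claim.

    Assumption: resident weights are roughly the checkpoint size. Activations,
    CUDA graphs and the OS are ignored, so real headroom is lower.
    """
    return TOTAL_GB * utilization - checkpoint_gb


for name, size in CHECKPOINT_GB.items():
    naive = headroom_gb(size)
    capped = headroom_gb(size, utilization=0.95)
    print(f"{name:6s} naive ~{naive:.1f} GB, at 0.95 utilization ~{capped:.1f} GB")
```

Either way the ordering is the same: the smaller the weights, the more of the machine is left for KV-cache.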
The test setup
All runs go through Docker on the DGX Spark with vllm/vllm-openai:v0.20.0. Official release, no patches.
docker run -d --name vllm-bench \
--gpus all --ipc=host \
-v appliance_hf-cache:/root/.cache/huggingface \
-p 8000:8000 \
-e HF_TOKEN="***" \
vllm/vllm-openai:v0.20.0 \
vllm serve nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-NVFP4 \
--max-model-len 131072 \
--gpu-memory-utilization 0.95 \
--max-num-seqs 256 \
--max-num-batched-tokens 8192 \
--trust-remote-code \
--video-pruning-rate 0.5 \
--reasoning-parser nemotron_v3 \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder \
--limit-mm-per-prompt '{"image":0,"audio":0}'
For FP8 I use the same profile with --kv-cache-dtype fp8. BF16 runs without that KV-cache flag. Everything else stays equal.
The benchmark suite is described in the arena methodology. In short: closed-loop tests for decode and TTFT per user, plus open-loop tests with Poisson arrivals to see how the server behaves when requests do not neatly wait for each other.
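To make the open-loop idea concrete: Poisson arrivals can be generated by drawing exponentially distributed gaps between requests. This is a minimal sketch of that idea, not the arena's actual load generator. The function name `poisson_arrival_times`, the seed and the example rate are hypothetical placeholders.

```python
import random


def poisson_arrival_times(rate_rps: float, duration_s: float, seed: int = 0) -> list[float]:
    """Return request send times (seconds) for a Poisson process.

    Gaps between arrivals are exponential with mean 1 / rate_rps. In an
    open-loop test each request is sent at its scheduled time, whether or
    not earlier requests have finished.
    """
    rng = random.Random(seed)
    times = []
    t = rng.expovariate(rate_rps)
    while t < duration_s:
        times.append(t)
        t += rng.expovariate(rate_rps)
    return times


# Hypothetical example: 1.5 requests per second for one minute.
arrivals = poisson_arrival_times(rate_rps=1.5, duration_s=60.0)
print(f"{len(arrivals)} requests scheduled, first at {arrivals[0]:.2f}s")
```

A closed-loop test would instead send the next request only after the previous one returns, which hides queueing effects.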
Setup
I started with the wrong image: nvcr.io/nvidia/vllm:26.02-py3, NVIDIA’s own vLLM container. It had vLLM 0.15.1 and did not yet know the NemotronH_Nano_Omni_Reasoning_V3 architecture.
The fix was boring: vllm/vllm-openai:v0.20.0. Official release, correct flashinfer versions, working on the first run.
Our own bench-spark CLI still needed two small fixes: bypass the NVIDIA entrypoint with --entrypoint vllm, and pass HF_TOKEN to the container automatically. After that the suite ran.
Lesson: start with the stable release that supports the architecture.
Run A: context-scaling
This run is the foundation: what happens when the prompt gets longer, while the number of users climbs from one to ten? That touches office work directly. A short chat is easy. A RAG question with 25k context and several people at once is where the Spark shows how much room is really left.
Here I look at two things. First decode per user: how fast does text come back once generation is running? Then TTFT: how long do you wait for the first token? With long context TTFT is often the pain users feel first. They see no tokens, so it feels like the system is stuck.
Single-user is mostly a pure speed measurement. There NVFP4 nearly doubles BF16. At ten users it gets more interesting: the smaller weights give vLLM more room to batch, and then BF16 just gets heavy.
Decode/user in tok/s (tg256), c=1
| Context | BF16 | FP8 | NVFP4 | NVFP4 vs BF16 |
|---|---|---|---|---|
| 4k | 29.23 | 51.68 | 60.30 | +106% |
| 8k | 28.59 | 49.82 | 55.72 | +95% |
| 16k | 28.24 | 47.52 | 55.24 | +96% |
| 25k | 28.24 | 48.85 | 54.98 | +95% |
BF16 stays neatly flat around 28-29 tokens per second. That is stable, but not fast. FP8 puts about 50 t/s against it. NVFP4 sits around 55-60 t/s. For a single user that is the difference between “fine” and “this feels local but not local-slow”.
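The “NVFP4 vs BF16” column is a relative gain, where twice as fast reads as +100%. A small sketch that recomputes it from the table values, with `speedup_pct` as an illustrative helper name:

```python
def speedup_pct(baseline_tps: float, candidate_tps: float) -> float:
    """Percentage gain of candidate over baseline, e.g. 2x faster is +100%."""
    return (candidate_tps / baseline_tps - 1.0) * 100.0


# (context, BF16 tok/s, NVFP4 tok/s) from the c=1 table above.
rows = [("4k", 29.23, 60.30), ("8k", 28.59, 55.72),
        ("16k", 28.24, 55.24), ("25k", 28.24, 54.98)]

for context, bf16, nvfp4 in rows:
    print(f"{context:>3}: {speedup_pct(bf16, nvfp4):+.0f}%")
```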
Decode/user in tok/s (tg256), c=10
| Context | BF16 | FP8 | NVFP4 | NVFP4 vs BF16 |
|---|---|---|---|---|
| 4k | 7.76 | 13.45 | 19.69 | +154% |
| 8k | 7.13 | 11.14 | 17.90 | +151% |
| 16k | 6.30 | 10.73 | 14.99 | +138% |
| 25k | 5.56 | 8.59 | 12.99 | +134% |
At ten users NVFP4 is not “a bit faster”. It is a different class. At 25k context BF16 does 5.56 tok/s/user. NVFP4 does 12.99. That is still no cloud-GPU cluster, but the difference in feel is large: BF16 becomes waiting, NVFP4 keeps working.
TTFT (first token), c=10
| Context | BF16 | FP8 | NVFP4 |
|---|---|---|---|
| 4k | 3.90s | 2.91s | 2.45s |
| 8k | 6.49s | 5.93s | 4.03s |
| 16k | 12.63s | 10.55s | 8.01s |
| 25k | 19.82s | 16.89s | 12.71s |
This is the table I take most seriously for real users. At 25k context and ten users you wait almost 20 seconds for the first token with BF16. With NVFP4 that is 12.7 seconds. Still long, but not the same kind of long.
Run B: 25k context, concurrency up to 20
Run A shows how context length scales. Run B keeps the context heavy and only raises the concurrency. This is the “everyone asks a big question at the same time” test.
In practice this does not happen every hour. Ten to twenty people rarely click send at exactly the same moment with 25k context. But if you put a local AI machine in front of a team, you want to know how it fails. Calmly getting slower is acceptable. A queue that feels dead is not.
NVFP4 keeps the most air here. Not because the model gets smarter, but because the server with smaller weights has more room for batching and KV-cache.
Decode per user (d/u) in tok/s, 25k context
| Users | BF16 d/u | FP8 d/u | NVFP4 d/u | NVFP4 vs BF16 |
|---|---|---|---|---|
| 5 | 9.06 | 15.33 | 20.75 | +129% |
| 10 | 5.65 | 9.18 | 12.99 | +130% |
| 20 | 3.70 | 5.97 | 7.79 | +110% |
TTFT (first token), 25k context
| Users | BF16 TTFT | FP8 TTFT | NVFP4 TTFT |
|---|---|---|---|
| 5 | 11.01s | 8.89s | 7.21s |
| 10 | 19.75s | 15.82s | 12.74s |
| 20 | 37.88s | 29.91s | 24.08s |
Twenty users with 25k context is deliberately unkind. Still, it is useful. BF16 sits at 37.88 seconds TTFT. That feels broken. NVFP4 sits at 24.08 seconds. Also not cozy, but still a good thirteen seconds faster.
Aggregate decode shows the same picture:
| Users | BF16 | FP8 | NVFP4 |
|---|---|---|---|
| 5 | 34 t/s | 53 t/s | 71 t/s |
| 10 | 38 t/s | 59 t/s | 77 t/s |
| 20 | 44 t/s | 66 t/s | 84 t/s |
The ceiling shifts from 44 t/s to 84 t/s. For a single user that is abstract. For a team it means the queue drains faster.
Run C: short prompt, long output
This is the workload for agents, code generation and longer answers: little input, a lot of output. The prompt is only 1024 tokens, so prefill is not the problem here. The question is mostly how fast the model keeps ticking once the output gets long.
So here I look at decode per user. TTFT has to stay low, but you only feel the real difference after a few hundred tokens. A model that starts fast but then hangs at 8 tok/s still feels slow.
NVFP4 clearly wins here. At ten parallel users the model stays at 22.90 tok/s/user. BF16 drops to 7.84. That is still readable, but for an agent flow it feels like someone is typing along by hand.
| Users | BF16 d/u (tok/s) | FP8 d/u (tok/s) | NVFP4 d/u (tok/s) |
|---|---|---|---|
| 1 | 28.65 | 49.85 | 55.55 |
| 5 | 12.19 | 21.32 | 30.97 |
| 10 | 7.84 | 15.26 | 22.90 |
For this workload NVFP4 is the logical default. FP8 is fine, but here you mostly give up speed, and tail latency is not the deciding factor.
Run E: multi-turn, depth 4
Multi-turn is closer to real use than one isolated prompt. Five turns per conversation, several conversations in parallel. That resembles an employee who does not ask one question, but keeps asking, corrects, and carries context along.
Here I do not just want to see high throughput. I mostly want the server not to feel like it comes out of a cold start every turn. With ten conversations at once that becomes relevant: the context grows per conversation, the scheduler has to keep sharing, and the user expects the chat to keep running.
This is the most important office run for me. Not because it is perfectly real, but because it comes closest to “25 people use this spread across the day”.
| Users | BF16 d/u (tok/s) | FP8 d/u (tok/s) | NVFP4 d/u (tok/s) | NVFP4 TTFT |
|---|---|---|---|---|
| 1 | 28.69 | 49.72 | 56.18 | 596 ms |
| 5 | 11.50 | 20.87 | 30.55 | 1032 ms |
| 10 | 7.68 | 14.88 | 21.58 | 1359 ms |
At ten parallel conversations NVFP4 sits at 21.58 tok/s/user. FP8 sits at 14.88. BF16 at 7.68. That last one works technically, but it no longer feels like a snappy chat. NVFP4 stays well above the line where you experience an answer as fluent.
Run F: RAG mix with 8k prompt
RAG is usually not 25k context, but not a short chat either. This run uses an 8k prompt and 512 output tokens. Think four chunks of about 2k tokens, plus question and instruction.
With RAG prefill counts more than in Run C. You push a sizeable slab of context into the model each time before anything comes back. After that you want enough decode left to make the answer usefully fast.
So the question is: does quantization keep helping when the prompt gets heavier? Yes. NVFP4 stays clearly ahead, even at twenty users.
| Users | BF16 d/u (tok/s) | FP8 d/u (tok/s) | NVFP4 d/u (tok/s) |
|---|---|---|---|
| 5 | 12.50 | 21.02 | 27.77 |
| 10 | 8.11 | 14.37 | 19.65 |
| 20 | 5.51 | 9.82 | 14.09 |
At twenty users NVFP4 delivers 14.09 tok/s/user. BF16 sits at 5.51. For batch processing that can still work. For real-time RAG in an office BF16 feels tight, certainly when documents are messy and prompts get longer than you had hoped. They always do.
Run G: short instruction, 4096 output tokens
Run G resembles Run C, but pulls the output much further: 4096 tokens. This is the shape of agents that write out plans, generate code, make long analyses or summarize multiple files.
For this kind of workload the first token is almost a side issue. If the answer is long, decode speed determines the experience. Ten seconds of difference at the start is annoying. Waiting on output for minutes is worse.
NVFP4 stays strongest here. More important: it also stays above 25 tok/s/user at ten users. For local hardware on a desk machine that is simply usable.
| Users | BF16 d/u (tok/s) | FP8 d/u (tok/s) | NVFP4 d/u (tok/s) | NVFP4 TTFT |
|---|---|---|---|---|
| 1 | 28.68 | 49.75 | 55.44 | 179 ms |
| 5 | 14.32 | 25.56 | 34.63 | 427 ms |
| 10 | 9.51 | 18.40 | 25.18 | 363 ms |
For agent flows the verdict is fairly harsh: BF16 is not broken, but you pay for every long output twice. First in memory, then in waiting time.
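To put the waiting time in minutes: a back-of-the-envelope sketch that divides the 4096 output tokens by the per-user decode rate at ten users. It assumes the rate stays constant over the whole answer and ignores TTFT, so treat the result as an estimate. `decode_seconds` is an illustrative helper name.

```python
def decode_seconds(output_tokens: int, tok_per_s: float) -> float:
    """Seconds spent generating output at a steady per-user decode rate."""
    return output_tokens / tok_per_s


# Per-user decode rates at ten users, from the Run G table.
rates = {"BF16": 9.51, "FP8": 18.40, "NVFP4": 25.18}

for name, rate in rates.items():
    minutes = decode_seconds(4096, rate) / 60
    print(f"{name:6s} ~{minutes:.1f} min for 4096 tokens")
```

Under those assumptions a full answer takes roughly seven minutes on BF16 and under three on NVFP4.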
Run H: open-loop office baseline
From here the interpretation changes. The previous runs push controlled batches through the model. Run H uses open-loop traffic: requests come in according to a Poisson distribution. So the server has to deal with arrivals that do not neatly wait for the previous one to finish.
This resembles an office more closely. Not perfectly, but better than everyone at once or fully sequential. The metrics are different too. TPOT (time per output token) tells how fast tokens come once it is your turn. TTFT P50 tells the normal experience. TTFT P99 tells what the unlucky one notices.
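For readers less used to percentiles, here is a minimal sketch of how P50 and P99 fall out of a list of TTFT samples, using the nearest-rank method. Which percentile method the benchmark suite uses is not stated here, so that choice is an assumption, and the sample values are made up for illustration.

```python
import math


def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical TTFT samples in ms: 98 quick requests and two slow ones.
ttft_ms = [600.0] * 98 + [3000.0, 3200.0]

print("P50:", percentile(ttft_ms, 50))
print("P99:", percentile(ttft_ms, 99))
```

With these made-up samples the median stays at 600 ms while the P99 jumps to 3000 ms: two slow requests out of a hundred are enough to set the tail.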
Here FP8 gets interesting. NVFP4 wins the median and TPOT, but FP8 wins the tail. That is exactly why I do not want to end with “NVFP4 is always better”.
| Metric | BF16 | FP8 | NVFP4 |
|---|---|---|---|
| Achieved RPS | 0.26 | 0.28 | 0.29 |
| Peak concurrent | 42 | 18 | 15 |
| TTFT P50 | 1229 ms | 732 ms | 618 ms |
| TTFT P99 | 2996 ms | 2008 ms | 3235 ms |
| TPOT P50 | 203 ms | 74 ms | 39 ms |
| Aggregate tok/s | 1203 | 1297 | 1329 |
That peak concurrent of BF16 looks good on paper, but it is not. The queue grows because BF16 drains it less quickly. NVFP4 processes faster, so fewer requests are open at the same time. That is not lower capacity, that is less of a line.
The real choice is between NVFP4 and FP8. If you want the best median and the fastest output, pick NVFP4. If you want the cleanest P99 on this workload, pick FP8.
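TPOT is easier to compare with the decode tables once it is converted to tokens per second for a single stream, which is simply the reciprocal. A small sketch using the TPOT P50 values from the Run H table, with `tpot_to_tok_per_s` as an illustrative helper name:

```python
def tpot_to_tok_per_s(tpot_ms: float) -> float:
    """Convert time per output token (ms) into tokens per second for one stream."""
    return 1000.0 / tpot_ms


# TPOT P50 values from the Run H table.
tpot_p50_ms = {"BF16": 203, "FP8": 74, "NVFP4": 39}

for name, tpot in tpot_p50_ms.items():
    print(f"{name:6s} {tpot} ms/token ~ {tpot_to_tok_per_s(tpot):.1f} tok/s")
```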
Run I: ShareGPT replay
ShareGPT replay is messier and therefore useful. Real conversations have varying lengths, follow-up questions, short answers, long answers and prompts that have not been neatly smoothed out by a benchmark author.
This is the run I trust most for chat feel. Not for company documents, but for the question: how does this feel when several people hold conversations throughout the day?
The pattern from Run H holds. NVFP4 is fastest for the average user. FP8 has the better P99.
| Metric | BF16 | FP8 | NVFP4 |
|---|---|---|---|
| Peak concurrent | 17 | 12 | 10 |
| TTFT P50 | 433 ms | 220 ms | 157 ms |
| TTFT P99 | 713 ms | 422 ms | 1361 ms |
| TPOT P50 | 118 ms | 38 ms | 26 ms |
NVFP4 feels instant for most users: 157 ms TTFT P50 and 26 ms TPOT P50. But the P99 is 1361 ms, where FP8 stays at 422 ms. That is a hefty difference.
For an internal chat where a single slower request is no disaster, I pick NVFP4. For a product UI with a hard latency promise I would take FP8 more seriously.
Run J: Monday morning peak
Run J is the oversubscription test. The target is 1.5 requests per second with a concurrency cap of 25. This is not the normal workday. This is the test for what happens when demand is bigger than the server can neatly keep up with.
Under oversubscription I look at achieved RPS first. Not at configured RPS, because that is the same for everyone. The question is how many requests the server actually processes while it is under pressure.
There NVFP4 wins clearly. FP8 keeps the tail cleaner, but NVFP4 gets much more work through the machine.
| Metric | BF16 | FP8 | NVFP4 |
|---|---|---|---|
| Configured RPS | 1.50 | 1.50 | 1.50 |
| Achieved RPS | 0.25 | 0.43 | 0.58 |
| Peak concurrent | 28 | 28 | 28 |
| TTFT P50 | 1130 ms | 757 ms | 687 ms |
| TTFT P99 | 5184 ms | 3388 ms | 4462 ms |
| TPOT P50 | 197 ms | 112 ms | 82 ms |
| Aggregate tok/s | 1118 | 1951 | 2622 |
Concretely: NVFP4 processes about 35 requests per minute. BF16 about 15. That is the difference between a queue that slowly drains and a queue that makes users wonder whether they should click again. Do not click. That second click never helps.
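The per-minute figures follow directly from achieved RPS. A short sketch that also shows which share of the configured 1.5 RPS each precision actually served, with `per_minute` as an illustrative helper name:

```python
CONFIGURED_RPS = 1.5

# Achieved requests per second from the Run J table.
achieved_rps = {"BF16": 0.25, "FP8": 0.43, "NVFP4": 0.58}


def per_minute(rps: float) -> float:
    """Requests completed per minute at a given requests-per-second rate."""
    return rps * 60.0


for name, rps in achieved_rps.items():
    share = rps / CONFIGURED_RPS * 100.0
    print(f"{name:6s} ~{per_minute(rps):.0f} req/min, {share:.0f}% of offered load")
```

None of the three keeps up with the offered load, which is the point of the run. The difference is how far behind each one falls.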
The three precisions side by side
If I have to pick one realistic chat run, I take ShareGPT replay. There you see the distinction most cleanly: NVFP4 wins the normal experience, FP8 wins the tail, BF16 takes part but convinces nowhere.
| Metric | BF16 | FP8 | NVFP4 | Best choice |
|---|---|---|---|---|
| TPOT P50 | 118 ms | 38 ms | 26 ms | NVFP4 |
| TTFT P50 | 433 ms | 220 ms | 157 ms | NVFP4 |
| TTFT P99 | 713 ms | 422 ms | 1361 ms | FP8 |
| Peak concurrent | 17 | 12 | 10 | NVFP4 |
| Achieved RPS | 0.30 | 0.30 | 0.30 | tie |
Under oversubscription the differences get starker:
| Metric | BF16 | FP8 | NVFP4 | Best choice |
|---|---|---|---|---|
| Achieved RPS | 0.25 | 0.43 | 0.58 | NVFP4 |
| TTFT P50 | 1130 ms | 757 ms | 687 ms | NVFP4 |
| TTFT P99 | 5184 ms | 3388 ms | 4462 ms | FP8 |
| TPOT P50 | 197 ms | 112 ms | 82 ms | NVFP4 |
| Aggregate tok/s | 1118 | 1951 | 2622 | NVFP4 |
That makes the choice more practical than I thought beforehand. NVFP4 is the default if you want throughput and normal user experience. FP8 is the choice if you find P99 more important than median. BF16 is the baseline you use to check whether quantization wrecks your accuracy.
Why FP8 wins the P99
My hypothesis: NVFP4 gives vLLM more memory room and therefore more batching room. That raises throughput and lowers TPOT, but individual requests can sometimes wait longer before they fall neatly into a batch.
FP8 has less headroom than NVFP4, but still enough for this workload. That makes the scheduler seem more predictable. Less aggressive, less fast in median, better in the tail.
BF16 has the worst of both worlds: large weights, less KV-cache headroom and lower decode. The queue gets fuller, but not because the server can handle so much at once. It just gets through it less quickly.
I want to dig into this further with scheduler settings and prefix caching. The raw numbers and the test definitions are in the arena so I can hold future runs against the same bar.
Comparison with Gemma-4-26B-A4B
In single-user runs, Nemotron-NVFP4 is almost twice as fast as Gemma-NVFP4. In multi-user runs the difference gets smaller, but it usually stays positive.
| Workload | Gemma-NVFP4 d/u | Nemotron-NVFP4 d/u | Ratio |
|---|---|---|---|
| pp4096 c=1 | 30.01 | 60.30 | 2.0× |
| pp8192 c=1 | 29.35 | 55.72 | 1.9× |
| pp25000 c=1 | 28.00 | 54.98 | 2.0× |
| pp4096 c=10 | 17.05 | 19.69 | 1.2× |
| pp25000 c=10 | 7.61 | 12.99 | 1.7× |
That pattern matches what the model is. Nemotron has 3B active params, Gemma 4B active params. At single-user that helps a lot. At multi-user the bottleneck shifts toward memory bandwidth and scheduling, and then the difference gets smaller.
What this means for on-prem AI
My default choice for this Spark is NVFP4. Not because 4 bit is inherently nicer, but because the numbers on these workloads carry it: highest throughput, fastest median, lowest TPOT, smallest footprint.
I pick FP8 when tail latency matters more than median. Think of a UI where you want to be able to say that 99 percent of requests start within a certain bound. In Runs H, I and J, FP8 consistently wins on P99 TTFT.
I pick BF16 only as a baseline or for accuracy-critical validation. Not as a production default. For that it is too expensive on the Spark: roughly three times as much memory as NVFP4 and roughly half the speed.
For a 25-person office with chat and RAG-like workload I would run NVFP4, with a custom eval suite alongside it. For an external chatbot with a tight latency promise I would test FP8. For BF16 I would mostly keep a short run to see what quantization changes in substance.
What these runs do not say
No accuracy tests. FP8 and NVFP4 can differ in substance from BF16. For production you have to measure that on your own documents, your own prompts and your own error tolerance.
No multimodal benchmarks. Nemotron-3-Nano-Omni is multimodal-aware, but these runs are text-only. Vision and audio stay out of frame here.
No comparison with dense models. This is an MoE model. Dense models feel different, especially in output speed and how vLLM handles them.
No definitive scheduler conclusion. The FP8-vs-NVFP4 tail is interesting enough to test separately with other batching and scheduling settings.
Where I land
The precision choice is not a detail. On the Spark it determines whether the same machine feels like a local experiment or like something you can hand to colleagues without explaining it every five minutes.
In many runs NVFP4 roughly doubles per-user decode speed compared to BF16, and does better than that under load. FP8 is less spectacular, but more predictable in the tail. BF16 stays useful as a reference point, not as an end station.
The practical lesson from these three posts together: follow the vendor recipes, run the stable image and measure your own workload. Do not tinker yourself unless you have a good reason for it. With Gemma I had a reason. In hindsight it was mediocre.