
Nemotron-3 on the DGX Spark: BF16 vs FP8 vs NVFP4

One model, three precisions, the same Spark. What happens to memory budget, decode speed and tail latency when you go from 16 bit to 8 bit to 4 bit.

Written by Django de Vreng

In the previous posts I ran Gemma-4 on the DGX Spark. First just BF16 as a baseline, then NVFP4 vs BF16 across the same test suite. That gave one model in two precisions. Useful, but not yet a real picture of the choice you have to make in production.

For this piece I run three variants of the same model side by side: BF16, FP8 and NVFP4 of Nemotron-3-Nano-Omni-30B-A3B-Reasoning. Same Spark. Same vLLM version. Same prompts. Same benchmark suite. As close to a fair quantization comparison as I can get on this machine.

The short version: NVFP4 wins on speed and throughput, FP8 wins more often on tail latency, and BF16 is mostly still useful as a baseline. That is less tidy than “4 bit is always better”. Lucky for us, otherwise this post would have been short.

This post is part of the guide on running LLMs on the DGX Spark.

Why this experiment

The Gemma post mostly showed that NVFP4 works on the Spark. With some pain. Five vLLM bugs, a nightly build and enough flags to make a command line look like a small confession.

But Gemma did not answer the question I need for clients: what do you pick if you want to run a local model on a Spark today? BF16 because those are the original weights? FP8 because Blackwell is natively good at it? Or NVFP4 because you fit much more model and KV-cache in the same memory?

So here is this run. One model in three precisions. No leaderboard score, but workloads that resemble office work: chat, RAG, longer answers, multiple users at once, and a Monday morning where everyone suddenly decides AI is handy after all.

What BF16, FP8 and NVFP4 mean here

BF16 is the baseline: 16 bits per parameter, so 2 bytes. For this model that means a checkpoint of about 61.5 GB. That fits on the Spark, but it eats a lot of your 128 GB unified memory before a single user has any context in the KV-cache.

FP8 roughly halves that weight. The checkpoint is 32.8 GB. On Blackwell, FP8 is a logical choice: less memory, native support, and usually little hassle in vLLM.

NVFP4 goes further. The checkpoint is 20.9 GB. Not four times smaller than BF16, because the vision and audio encoders stay in BF16, but small enough to make the Spark feel different. More room for KV-cache, more batching, more concurrency.

The nuance: the DGX Spark runs on desktop Blackwell SM12.1. On that architecture NVFP4 does not behave the way it does on datacenter Blackwell. vLLM uses Marlin to dequantize the FP4 weights to FP16 during compute. You get the memory win fully. The compute win is less pure.

For this post that is exactly what makes it interesting. This is not a theoretical quantization post. This is: what happens on this machine, with this stack, when you actually run the three options?

| Precision | Model size | Memory budget left of 128 GB |
|---|---|---|
| BF16 | 61.5 GB | ~66 GB |
| FP8 | 32.8 GB | ~95 GB |
| NVFP4 | 20.9 GB | ~107 GB |
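
The headroom column is plain subtraction, and the same numbers show why NVFP4 lands at roughly a third of BF16 rather than a quarter. A minimal sketch, assuming the checkpoint sizes from the table and ignoring runtime overhead such as activations and CUDA context (the helper name `headroom_gb` is only illustrative):

```python
# Checkpoint sizes in GB as reported above; 128 GB is the Spark's unified memory.
TOTAL_GB = 128.0
CHECKPOINT_GB = {"BF16": 61.5, "FP8": 32.8, "NVFP4": 20.9}


def headroom_gb(checkpoint_gb: float, total_gb: float = TOTAL_GB) -> float:
    """Memory left after loading the weights (runtime overhead not included)."""
    return total_gb - checkpoint_gb


for name, size in CHECKPOINT_GB.items():
    ratio = CHECKPOINT_GB["BF16"] / size  # how many times smaller than BF16
    print(f"{name}: {headroom_gb(size):.1f} GB left, {ratio:.2f}x smaller than BF16")
```

The real KV-cache budget will be lower than these figures, because vLLM only claims the share of memory set by --gpu-memory-utilization and needs part of it for activations.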

The test setup

All runs go through Docker on the DGX Spark with vllm/vllm-openai:v0.20.0. Official release, no patches.

```shell
# Serve the NVFP4 checkpoint through vLLM's OpenAI-compatible server on port 8000
docker run -d --name vllm-bench \
  --gpus all --ipc=host \
  -v appliance_hf-cache:/root/.cache/huggingface \
  -p 8000:8000 \
  -e HF_TOKEN="***" \
  vllm/vllm-openai:v0.20.0 \
  vllm serve nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-NVFP4 \
  --max-model-len 131072 \
  --gpu-memory-utilization 0.95 \
  --max-num-seqs 256 \
  --max-num-batched-tokens 8192 \
  --trust-remote-code \
  --video-pruning-rate 0.5 \
  --reasoning-parser nemotron_v3 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --limit-mm-per-prompt '{"image":0,"audio":0}'
```

For FP8 I use the same profile with --kv-cache-dtype fp8. BF16 runs without that KV-cache flag. Everything else stays equal.

The benchmark suite is described in the arena methodology. In short: closed-loop tests for decode and TTFT per user, plus open-loop tests with Poisson arrivals to see how the server behaves when requests do not neatly wait for each other.
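
To make the open-loop part concrete: with Poisson arrivals the gaps between requests are exponentially distributed, so requests sometimes bunch up and sometimes leave the server idle. A minimal sketch of such an arrival schedule, not the actual bench-spark implementation. The function name and the fixed seed are assumptions, and the 1.5 requests per second is the target rate used later in Run J:

```python
import random


def poisson_arrival_times(rate_rps: float, duration_s: float, seed: int = 0) -> list[float]:
    """Arrival timestamps in seconds for a Poisson process with the given mean rate."""
    rng = random.Random(seed)  # fixed seed so every precision sees the same schedule
    t = 0.0
    arrivals: list[float] = []
    while True:
        t += rng.expovariate(rate_rps)  # exponential gap with mean 1 / rate_rps
        if t >= duration_s:
            return arrivals
        arrivals.append(t)


arrivals = poisson_arrival_times(rate_rps=1.5, duration_s=60.0)
print(len(arrivals), "requests scheduled in 60 s")
```

A closed-loop test would instead send the next request only after the previous one returns, which hides queueing. That is why the two kinds of run are read differently.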

Setup

I started wrong with nvcr.io/nvidia/vllm:26.02-py3, NVIDIA’s own vLLM container. It had vLLM 0.15.1 and did not yet know the NemotronH_Nano_Omni_Reasoning_V3 architecture.

The fix was more boring: vllm/vllm-openai:v0.20.0. Official release, correct flashinfer versions, first run working.

Our own bench-spark CLI still needed two small fixes: bypass the NVIDIA entrypoint with --entrypoint vllm, and pass HF_TOKEN to the container automatically. After that the suite ran.

Lesson: start with the stable release that supports the architecture.

Run A: context-scaling

This run is the foundation: what happens when the prompt gets longer, while the number of users climbs from one to ten? That touches office work directly. A short chat is easy. A RAG question with 25k context and several people at once is where the Spark shows how much room is really left.

Here I look at two things. First decode per user: how fast does text come back once generation is running? Then TTFT: how long do you wait for the first token? With long context TTFT is often the pain users feel first. They see no tokens, so it feels like the system is stuck.
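
Both numbers can be derived from the arrival times of streamed tokens on the client side. A minimal sketch under the assumption that decode speed is measured from the first token onward, so prefill is excluded. The benchmark suite may define it slightly differently, and `ttft_and_decode_rate` is a hypothetical helper:

```python
def ttft_and_decode_rate(sent_at: float, token_times: list[float]) -> tuple[float, float]:
    """Return (TTFT in seconds, decode tokens per second) for one request."""
    if len(token_times) < 2:
        raise ValueError("need at least two tokens to measure a decode rate")
    ttft = token_times[0] - sent_at
    # Count only the gaps after the first token, so prefill time is not included.
    decode_rate = (len(token_times) - 1) / (token_times[-1] - token_times[0])
    return ttft, decode_rate


# Synthetic timestamps: first token after 2.0 s, then one token every 50 ms.
times = [2.0 + 0.05 * i for i in range(257)]
ttft, rate = ttft_and_decode_rate(0.0, times)
print(f"TTFT {ttft:.2f} s, decode {rate:.1f} tok/s")
```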

Single-user is mostly a pure speed measurement. There NVFP4 roughly doubles BF16's decode speed. At ten users it gets more interesting: the smaller weights give vLLM more room to batch, and then BF16 just gets heavy.

Decode/user (tg256), c=1

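| Context | BF16 (tok/s) | FP8 (tok/s) | NVFP4 (tok/s) | NVFP4 vs BF16 |
|---|---|---|---|---|
| 4k | 29.23 | 51.68 | 60.30 | +106% |
| 8k | 28.59 | 49.82 | 55.72 | +95% |
| 16k | 28.24 | 47.52 | 55.24 | +96% |
| 25k | 28.24 | 48.85 | 54.98 | +95% |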

BF16 stays neatly flat around 28-29 tokens per second. That is stable, but not fast. FP8 answers with about 50 t/s. NVFP4 sits around 55-60 t/s. For a single user that is the difference between “fine” and “this feels local but not local-slow”.

Decode/user (tg256), c=10

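| Context | BF16 (tok/s) | FP8 (tok/s) | NVFP4 (tok/s) | NVFP4 vs BF16 |
|---|---|---|---|---|
| 4k | 7.76 | 13.45 | 19.69 | +154% |
| 8k | 7.13 | 11.14 | 17.90 | +151% |
| 16k | 6.30 | 10.73 | 14.99 | +138% |
| 25k | 5.56 | 8.59 | 12.99 | +134% |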

At ten users NVFP4 is not “a bit faster”. It is a different class. At 25k context BF16 does 5.56 tok/s/user. NVFP4 does 12.99. That is still no cloud-GPU cluster, but the difference in feel is large: BF16 becomes waiting, NVFP4 keeps working.
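
The "NVFP4 vs BF16" column is the relative gain over the baseline. A small sketch that recomputes it from the c=10 decode numbers above (the helper name `speedup_pct` is only illustrative):

```python
def speedup_pct(baseline: float, candidate: float) -> int:
    """Relative gain of candidate over baseline, as a rounded percentage."""
    return round((candidate / baseline - 1.0) * 100)


# Decode tok/s per user at c=10 from the table: (BF16, NVFP4)
c10 = {
    "4k": (7.76, 19.69),
    "8k": (7.13, 17.90),
    "16k": (6.30, 14.99),
    "25k": (5.56, 12.99),
}
for ctx, (bf16, nvfp4) in c10.items():
    print(f"{ctx}: +{speedup_pct(bf16, nvfp4)}%")
```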

TTFT (first token), c=10

| Context | BF16 | FP8 | NVFP4 |
|---|---|---|---|
| 4k | 3.90s | 2.91s | 2.45s |
| 8k | 6.49s | 5.93s | 4.03s |
| 16k | 12.63s | 10.55s | 8.01s |
| 25k | 19.82s | 16.89s | 12.71s |

This is the table I take most seriously for real users. At 25k context and ten users you wait almost 20 seconds for the first token with BF16. With NVFP4 that is 12.7 seconds. Still long, but not the same kind of long.

Run B: 25k context, concurrency up to 20

Run A shows how context length scales. Run B keeps the context heavy and only raises the concurrency. This is the “everyone asks a big question at the same time” test.

In practice this does not happen every hour. Ten to twenty people rarely click send at exactly the same moment with 25k context. But if you put a local AI machine in front of a team, you want to know how it fails. Calmly getting slower is acceptable. A queue that feels dead is not.

NVFP4 keeps the most headroom here. Not because the model gets smarter, but because a server with smaller weights has more room for batching and KV-cache.

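| Users | BF16 decode/user (tok/s) | FP8 decode/user (tok/s) | NVFP4 decode/user (tok/s) | NVFP4 vs BF16 |
|---|---|---|---|---|
| 5 | 9.06 | 15.33 | 20.75 | +129% |
| 10 | 5.65 | 9.18 | 12.99 | +130% |
| 20 | 3.70 | 5.97 | 7.79 | +110% |

| Users | BF16 TTFT | FP8 TTFT | NVFP4 TTFT |
|---|---|---|---|
| 5 | 11.01s | 8.89s | 7.21s |
| 10 | 19.75s | 15.82s | 12.74s |
| 20 | 37.88s | 29.91s | 24.08s |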

Twenty users with 25k context is deliberately unkind. Still, it is useful. BF16 sits at 37.88 seconds TTFT. That feels broken. NVFP4 sits at 24.08 seconds. Also not cozy, but still almost fourteen seconds faster.

Aggregate decode shows the same picture:

| Users | BF16 | FP8 | NVFP4 |
|---|---|---|---|
| 5 | 34 t/s | 53 t/s | 71 t/s |
| 10 | 38 t/s | 59 t/s | 77 t/s |
| 20 | 44 t/s | 66 t/s | 84 t/s |

The ceiling shifts from 44 t/s to 84 t/s. For a single user that is abstract. For a team it means the queue drains faster.

Run C: short prompt, long output

This is the workload for agents, code generation and longer answers: little input, a lot of output. The prompt is only 1024 tokens, so prefill is not the problem here. The question is mostly how fast the model keeps ticking once the output gets long.

So here I look at decode per user. TTFT has to stay low, but you only feel the real difference after a few hundred tokens. A model that starts fast but then hangs at 8 tok/s still feels slow.

NVFP4 clearly wins here. At ten parallel users the model stays at 22.90 tok/s/user. BF16 drops to 7.84. That is still readable, but for an agent flow it feels like someone is typing along by hand.

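| Users | BF16 decode/user (tok/s) | FP8 decode/user (tok/s) | NVFP4 decode/user (tok/s) |
|---|---|---|---|
| 1 | 28.65 | 49.85 | 55.55 |
| 5 | 12.19 | 21.32 | 30.97 |
| 10 | 7.84 | 15.26 | 22.90 |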

For this workload NVFP4 is the logical default. FP8 is fine, but here it mostly gives up speed, and tail latency is not the deciding factor.

Run E: multi-turn, depth 4

Multi-turn is closer to real use than one isolated prompt. Five turns per conversation, several conversations in parallel. That resembles an employee who does not ask one question, but keeps asking, corrects, and carries context along.

Here I do not just want to see high throughput. I mostly want the server not to feel like it comes out of a cold start every turn. With ten conversations at once that becomes relevant: the context grows per conversation, the scheduler has to keep sharing, and the user expects the chat to keep running.

This is the most important office run for me. Not because it is perfectly real, but because it comes closest to “25 people use this spread across the day”.

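| Users | BF16 decode/user (tok/s) | FP8 decode/user (tok/s) | NVFP4 decode/user (tok/s) | NVFP4 TTFT |
|---|---|---|---|---|
| 1 | 28.69 | 49.72 | 56.18 | 596 ms |
| 5 | 11.50 | 20.87 | 30.55 | 1032 ms |
| 10 | 7.68 | 14.88 | 21.58 | 1359 ms |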

At ten parallel conversations NVFP4 sits at 21.58 tok/s/user. FP8 sits at 14.88. BF16 at 7.68. That last one works technically, but it no longer feels like a snappy chat. NVFP4 stays well above the line where you experience an answer as fluent.

Run F: RAG mix with 8k prompt

RAG is usually not 25k context, but not a short chat either. This run uses an 8k prompt and 512 output tokens. Think four chunks of about 2k tokens, plus question and instruction.

With RAG prefill counts more than in Run C. You push a sizeable slab of context into the model each time before anything comes back. After that you want enough decode left to make the answer usefully fast.

So the question is: does quantization keep helping when the prompt gets heavier? Yes. NVFP4 stays clearly ahead, even at twenty users.

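| Users | BF16 decode/user (tok/s) | FP8 decode/user (tok/s) | NVFP4 decode/user (tok/s) |
|---|---|---|---|
| 5 | 12.50 | 21.02 | 27.77 |
| 10 | 8.11 | 14.37 | 19.65 |
| 20 | 5.51 | 9.82 | 14.09 |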

At twenty users NVFP4 delivers 14.09 tok/s/user. BF16 sits at 5.51. For batch processing that can still work. For real-time RAG in an office BF16 feels tight, certainly when documents are messy and prompts get longer than you had hoped. They always do.

Run G: short instruction, 4096 output tokens

Run G resembles Run C, but pulls the output much further: 4096 tokens. This is the shape of agents that write out plans, generate code, make long analyses or summarize multiple files.

For this kind of workload the first token is almost a side issue. If the answer is long, decode speed determines the experience. Ten seconds of difference at the start is annoying. Waiting on output for minutes is worse.

NVFP4 stays strongest here. More important: it also stays above 25 tok/s/user at ten users. For a local machine on a desk that is simply usable.

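| Users | BF16 decode/user (tok/s) | FP8 decode/user (tok/s) | NVFP4 decode/user (tok/s) | NVFP4 TTFT |
|---|---|---|---|---|
| 1 | 28.68 | 49.75 | 55.44 | 179 ms |
| 5 | 14.32 | 25.56 | 34.63 | 427 ms |
| 10 | 9.51 | 18.40 | 25.18 | 363 ms |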

For agent flows the conclusion is fairly blunt: BF16 is not broken, but you pay for every long output twice. First in memory, then in waiting time.
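
What that waiting time looks like for a full 4096-token answer follows directly from the decode rates at ten users in the Run G table. A rough sketch that assumes the rate stays constant over the whole output and leaves TTFT at zero by default, since it is small next to the decode time here:

```python
def generation_seconds(output_tokens: int, decode_tok_s: float, ttft_s: float = 0.0) -> float:
    """Wall-clock time to stream a full answer at a constant decode rate."""
    return ttft_s + output_tokens / decode_tok_s


# Decode tok/s per user at ten users, taken from the Run G table.
rates_c10 = {"BF16": 9.51, "FP8": 18.40, "NVFP4": 25.18}
for name, rate in rates_c10.items():
    minutes = generation_seconds(4096, rate) / 60
    print(f"{name}: {minutes:.1f} min for 4096 tokens")
```

That works out to roughly 7.2 minutes for BF16, 3.7 for FP8 and 2.7 for NVFP4 per answer.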

Run H: open-loop office baseline

From here the interpretation changes. The previous runs push controlled batches through the model. Run H uses open-loop traffic: requests come in according to a Poisson distribution. So the server has to deal with arrivals that do not neatly wait for the previous one to finish.

This looks more like an office. Not perfect, but better than everyone at once or fully sequential. The metrics are different too. TPOT (time per output token) tells you how fast tokens come once it is your turn. TTFT P50 tells you the normal experience. TTFT P99 tells you what the unlucky one notices.
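
For readers who want to reproduce P50 and P99 from raw per-request timings, here is a minimal nearest-rank sketch. The samples are synthetic, and benchmark tools often interpolate between samples instead, so their numbers can differ slightly:

```python
import math


def percentile(values: list[float], p: float) -> float:
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Synthetic TTFT samples in ms: 98 fast requests and two slow outliers.
ttft_ms = [200.0] * 98 + [1500.0, 3000.0]
print("P50:", percentile(ttft_ms, 50), "P99:", percentile(ttft_ms, 99))
```

The median ignores the two outliers entirely, while P99 lands on one of them. That is why a precision can win P50 and still lose P99.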

Here FP8 gets interesting. NVFP4 wins the median and TPOT, but FP8 wins the tail. That is exactly why I do not want to end with “NVFP4 is always better”.

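| Metric | BF16 | FP8 | NVFP4 |
|---|---|---|---|
| Achieved RPS | 0.26 | 0.28 | 0.29 |
| Peak concurrent | 42 | 18 | 15 |
| TTFT P50 | 1229 ms | 732 ms | 618 ms |
| TTFT P99 | 2996 ms | 2008 ms | 3235 ms |
| TPOT P50 | 203 ms | 74 ms | 39 ms |
| Aggregate tok/s | 1203 | 1297 | 1329 |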

BF16's peak concurrency of 42 looks good on paper, but it is not. The queue grows because BF16 drains it less quickly. NVFP4 processes faster, so fewer requests are open at the same time. That is not lower capacity, that is a shorter queue.

The real choice is between NVFP4 and FP8. If you want the best median and the fastest output, pick NVFP4. If you want the cleanest P99 on this workload, pick FP8.

Run I: ShareGPT replay

ShareGPT replay is messier and therefore useful. Real conversations have varying lengths, follow-up questions, short answers, long answers and prompts that have not been neatly smoothed out by a benchmark author.

This is the run I trust most for chat feel. Not for company documents, but for the question: how does this feel when several people hold conversations throughout the day?

The pattern from Run H holds. NVFP4 is fastest for the average user. FP8 has the better P99.

| Metric | BF16 | FP8 | NVFP4 |
|---|---|---|---|
| Peak concurrent | 17 | 12 | 10 |
| TTFT P50 | 433 ms | 220 ms | 157 ms |
| TTFT P99 | 713 ms | 422 ms | 1361 ms |
| TPOT P50 | 118 ms | 38 ms | 26 ms |

NVFP4 feels instant for most users: 157 ms TTFT P50 and 26 ms TPOT P50. But the P99 is 1361 ms, where FP8 stays at 422 ms. That is a hefty difference.

For an internal chat where a single slower request is no disaster, I pick NVFP4. For a product UI with a hard latency promise I would take FP8 more seriously.

Run J: Monday morning peak

Run J is the oversubscription test. The target is 1.5 requests per second with a concurrency cap of 25. This is not the normal workday. This is the test for what happens when demand is bigger than the server can neatly keep up with.

Under oversubscription I look at achieved RPS first, not at configured RPS, because that is the same for every precision. The question is how many requests the server actually processes while it is under pressure.

There NVFP4 wins clearly. FP8 keeps the tail cleaner, but NVFP4 gets much more work through the machine.

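| Metric | BF16 | FP8 | NVFP4 |
|---|---|---|---|
| Configured RPS | 1.50 | 1.50 | 1.50 |
| Achieved RPS | 0.25 | 0.43 | 0.58 |
| Peak concurrent | 28 | 28 | 28 |
| TTFT P50 | 1130 ms | 757 ms | 687 ms |
| TTFT P99 | 5184 ms | 3388 ms | 4462 ms |
| TPOT P50 | 197 ms | 112 ms | 82 ms |
| Aggregate tok/s | 1118 | 1951 | 2622 |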

Concretely: NVFP4 processes about 35 requests per minute. BF16 about 15. That is the difference between a queue that slowly drains and a queue that makes users wonder whether they should click again. Do not click. That second click never helps.
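
The per-minute figures come straight from the achieved RPS in the table. A small sketch that also shows how far each precision stays below the configured 1.5 requests per second:

```python
def requests_per_minute(achieved_rps: float) -> float:
    """Convert achieved requests per second into requests per minute."""
    return achieved_rps * 60


CONFIGURED_RPS = 1.50
achieved = {"BF16": 0.25, "FP8": 0.43, "NVFP4": 0.58}  # from the Run J table
for name, rps in achieved.items():
    share = rps / CONFIGURED_RPS  # fraction of the offered load actually served
    print(f"{name}: {requests_per_minute(rps):.0f} req/min, {share:.0%} of target")
```

None of the three keeps up with the offered load, which is the point of the run. The difference is how much of it gets served.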

The three precisions side by side

If I have to pick one realistic chat run, I take ShareGPT replay. There you see the distinction most clearly: NVFP4 wins the normal experience, FP8 wins the tail, BF16 takes part but does not convince anywhere.

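| Metric | BF16 | FP8 | NVFP4 | Best choice |
|---|---|---|---|---|
| TPOT P50 | 118 ms | 38 ms | 26 ms | NVFP4 |
| TTFT P50 | 433 ms | 220 ms | 157 ms | NVFP4 |
| TTFT P99 | 713 ms | 422 ms | 1361 ms | FP8 |
| Peak concurrent | 17 | 12 | 10 | NVFP4 |
| Achieved RPS | 0.30 | 0.30 | 0.30 | tie |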

Under oversubscription the difference gets sharper:

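| Metric | BF16 | FP8 | NVFP4 | Best choice |
|---|---|---|---|---|
| Achieved RPS | 0.25 | 0.43 | 0.58 | NVFP4 |
| TTFT P50 | 1130 ms | 757 ms | 687 ms | NVFP4 |
| TTFT P99 | 5184 ms | 3388 ms | 4462 ms | FP8 |
| TPOT P50 | 197 ms | 112 ms | 82 ms | NVFP4 |
| Aggregate tok/s | 1118 | 1951 | 2622 | NVFP4 |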

That makes the choice more practical than I thought beforehand. NVFP4 is the default if you want throughput and normal user experience. FP8 is the choice if you find P99 more important than median. BF16 is the baseline you use to check whether quantization wrecks your accuracy.

Why FP8 wins the P99

My hypothesis: NVFP4 gives vLLM more memory room and therefore more batching room. That raises throughput and lowers TPOT, but individual requests can sometimes wait longer before they fall neatly into a batch.

FP8 has less headroom than NVFP4, but still enough for this workload. That makes the scheduler seem more predictable. Less aggressive, less fast in median, better in the tail.

BF16 has the worst of both worlds: large weights, less KV-cache headroom and lower decode. The queue gets fuller, but not because the server can handle so much at once. It just gets through it less quickly.

I want to dig into this further with scheduler settings and prefix caching. The raw numbers and the test definitions are in the arena so I can hold future runs against the same bar.

Comparison with Gemma-4-26B-A4B

At single-user, Nemotron-NVFP4 is about twice as fast as Gemma-NVFP4 (1.9× to 2.0×). At ten users the gap narrows to 1.2× to 1.7×, but Nemotron stays ahead.

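| Workload | Gemma-NVFP4 decode/user (tok/s) | Nemotron-NVFP4 decode/user (tok/s) | Ratio |
|---|---|---|---|
| pp4096 c=1 | 30.01 | 60.30 | 2.0× |
| pp8192 c=1 | 29.35 | 55.72 | 1.9× |
| pp25000 c=1 | 28.00 | 54.98 | 2.0× |
| pp4096 c=10 | 17.05 | 19.69 | 1.2× |
| pp25000 c=10 | 7.61 | 12.99 | 1.7× |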

That pattern matches what the model is. Nemotron has 3B active params, Gemma 4B active params. At single-user that helps a lot. At multi-user the bottleneck shifts toward memory bandwidth and scheduling, and then the difference gets smaller.

What this means for on-prem AI

My default choice for this Spark is NVFP4. Not because 4 bit is inherently nicer, but because the numbers on these workloads carry it: highest throughput, fastest median, lowest TPOT, smallest footprint.

I pick FP8 when tail latency matters more than median. Think of a UI where you want to be able to say that 99 percent of requests start within a certain bound. In Runs H, I and J, FP8 consistently wins on P99 TTFT.

I pick BF16 only as a baseline or for accuracy-critical validation. Not as a production default. For that it is too expensive on the Spark: roughly three times as much memory as NVFP4 and roughly half the speed.

For a 25-person office with chat and RAG-like workload I would run NVFP4, with a custom eval suite alongside it. For an external chatbot with a tight latency promise I would test FP8. For BF16 I would mostly keep a short run to see what quantization changes in substance.
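
Written down as a rule of thumb, the recommendation looks like this. It is only a sketch of the reasoning in this section, valid for this model on this Spark with these workloads. The function name and its two flags are illustrative, not part of any tool:

```python
def pick_precision(tail_latency_critical: bool, accuracy_validation: bool = False) -> str:
    """Rule of thumb from these runs; re-measure before applying it to another setup."""
    if accuracy_validation:
        return "BF16"  # reference run to see what quantization changes in substance
    if tail_latency_critical:
        return "FP8"  # best P99 TTFT in the open-loop runs
    return "NVFP4"  # best median, TPOT, throughput and footprint


print(pick_precision(tail_latency_critical=False))  # internal office chat and RAG
print(pick_precision(tail_latency_critical=True))   # external chatbot with a latency promise
```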

What these runs do not say

No accuracy tests. FP8 and NVFP4 can differ in substance from BF16. For production you have to measure that on your own documents, your own prompts and your own error tolerance.

No multimodal benchmarks. Nemotron-3-Nano-Omni is multimodal-aware, but these runs are text-only. Vision and audio stay out of frame here.

No comparison with dense models. This is an MoE model. Dense models feel different, especially in output speed and how vLLM handles them.

No definitive scheduler conclusion. The FP8-vs-NVFP4 tail is interesting enough to test separately with other batching and scheduling settings.
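
For the first caveat, the accuracy gap, a very crude starting point is to send the same prompts to the BF16 and the quantized server and count how often the answers agree. A minimal sketch with made-up sample answers. Exact match after normalization is an assumption that only suits short, factual outputs, and a real eval suite needs task-specific scoring:

```python
def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences do not count."""
    return " ".join(text.lower().split())


def agreement_rate(baseline: list[str], quantized: list[str]) -> float:
    """Share of prompts where the quantized answer equals the baseline answer."""
    if not baseline or len(baseline) != len(quantized):
        raise ValueError("need two non-empty answer lists of equal length")
    matches = sum(normalize(a) == normalize(b) for a, b in zip(baseline, quantized))
    return matches / len(baseline)


# Hypothetical answers to the same three prompts, for illustration only.
bf16_answers = ["Invoice total: 1,250 EUR", "Contract ends 2026-03-01", "No"]
nvfp4_answers = ["invoice total:  1,250 EUR", "Contract ends 2026-03-31", "No"]
print(f"agreement: {agreement_rate(bf16_answers, nvfp4_answers):.0%}")
```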

Where I land

The precision choice is not a detail. On the Spark it determines whether the same machine feels like a local experiment or like something you can hand to colleagues without explaining it every five minutes.

In many runs NVFP4 roughly doubles the usable speed compared to BF16. FP8 is less spectacular, but more predictable in the tail. BF16 stays useful as a reference point, not as a destination.

The practical lesson from these three posts together: follow the vendor recipes, run the stable image and measure your own workload. Do not tinker yourself unless you have a good reason for it. With Gemma I had a reason. In hindsight it was mediocre.
