NVIDIA 30B-A3B params NVFP4 MoE

Nemotron-3-Nano-Omni-30B-A3B-Reasoning

By far the fastest of the three precisions. On chat it does 23 tokens per second per user, with TTFT just under a second (950 ms); the open-loop office runs with 4k-token prompts land at 618 to 687 ms. Long chain-of-thought, where BF16 still leaves you waiting for the first answer, just keeps going here. The setup pain is real, though: the same string of vLLM patches the Gemma version needed. Once it is running, this is the variant you want on the Spark.

64.9

Arena score

126

Throughput tok/s

21 GB

VRAM

8/9

Benchmarks measured

Hugging Face → vLLM 0.20.0 · DGX Spark, NVIDIA GB10, 128 GB unified memory · Last measured 3 May 2026

02 Quality · MMLU · GPQA · HumanEval artificial analysis ↗

The quality component of the Arena score. Not measured here; the numbers come from the vendor's official model cards. For cross-model comparison on a consistent eval harness, Artificial Analysis is a useful third party. The average of the three benchmarks goes straight into the Score formula (weighted more heavily in Aggregate/Agent, more lightly in Batch).

70.9

Avg

77.3

MMLU-Pro

72.2

GPQA-Diamond

63.2

HumanEval
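
The Avg figure can be reproduced from the three vendor-reported scores. A minimal sketch, assuming an unweighted mean rounded to one decimal; the function name is illustrative and not part of the arena's own tooling:

```python
# Vendor-reported scores from the quality block above.
SCORES = {"MMLU-Pro": 77.3, "GPQA-Diamond": 72.2, "HumanEval": 63.2}


def quality_average(scores: dict[str, float]) -> float:
    """Unweighted mean of the benchmark scores, rounded to one decimal."""
    return round(sum(scores.values()) / len(scores), 1)


if __name__ == "__main__":
    print(quality_average(SCORES))
```

How this average is then weighted inside the Score formula is not stated here, so the sketch stops at the mean.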

03 Performance · NVFP4 vs BF16-sibling

Decode throughput · total t/s · c=10

Context    NVFP4      BF16-sibling
1k ctx     202 t/s    76.0 t/s
8k ctx     167 t/s    69.0 t/s
4k+turn    193 t/s    73.0 t/s
25k ctx    78.0 t/s   39.0 t/s
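
To read the comparison as speedup factors, the totals can be divided pairwise. A small sketch, taking the first value of each pair as NVFP4 and the second as the BF16 sibling (an assumption based on the legend order):

```python
# Total decode throughput at c=10 in t/s: (NVFP4, BF16 sibling).
TOTAL_TPS = {
    "1k ctx": (202.0, 76.0),
    "8k ctx": (167.0, 69.0),
    "4k+turn": (193.0, 73.0),
    "25k ctx": (78.0, 39.0),
}


def speedup(nvfp4_tps: float, bf16_tps: float) -> float:
    """How many times faster NVFP4 is than BF16 for one scenario."""
    return nvfp4_tps / bf16_tps


SPEEDUPS = {name: round(speedup(*pair), 2) for name, pair in TOTAL_TPS.items()}

if __name__ == "__main__":
    for name, factor in SPEEDUPS.items():
        print(f"{name:8s} {factor:.2f}x")
```

On these numbers the gap is widest at short context and narrows to about 2x at 25k, where prefill takes up more of the run.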

04 Test suite · 9 benchmarks methodology →

5 closed-loop tests with llama-benchy and 4 open-loop tests with vllm bench serve. Each benchmark reports tokens/sec (decode throughput) and TTFT p50. TTFT translates directly into how responsive it feels, t/s into capacity. The exact command is listed under each benchmark.

01 · llama-benchy closed-loop

Chat

Short prompt, long answer. The shape that has to feel like normal chat; TTFT decides whether it feels snappy.

pp (prompt) 1024 · tg (gen) 1024 · depth 0 · concurrency 10 · runs 3

tokens/sec

22.9 t/s

TTFT · p50

950ms

3 runs · seed 42

uvx llama-benchy \
  --base-url http://localhost:8000/v1 \
  --model nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-NVFP4 \
  --pp 1024 \
  --tg 1024 \
  --depth 0 \
  --concurrency 10 \
  --runs 3 \
  --latency-mode generation \
  --format md

02 · llama-benchy closed-loop

RAG · 8k context

Medium-sized context: a few document chunks and an answer of normal length. Shows the prefill cost without hitting the wall.

pp (prompt) 8192 · tg (gen) 512 · depth 0 · concurrency 10 · runs 3

tokens/sec

19.6 t/s

TTFT · p50

4.01s

3 runs · seed 42

uvx llama-benchy \
  --base-url http://localhost:8000/v1 \
  --model nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-NVFP4 \
  --pp 8192 \
  --tg 512 \
  --depth 0 \
  --concurrency 10 \
  --runs 3 \
  --latency-mode generation \
  --format md

03 · llama-benchy closed-loop

Long output / agents

Short instruction, lots of output. Code generation, reports or structured agent output. Stress test for decode throughput.

pp (prompt) 256 · tg (gen) 4096 · depth 0 · concurrency 10 · runs 3

tokens/sec

25.2 t/s

TTFT · p50

360ms

3 runs · seed 42

uvx llama-benchy \
  --base-url http://localhost:8000/v1 \
  --model nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-NVFP4 \
  --pp 256 \
  --tg 4096 \
  --depth 0 \
  --concurrency 10 \
  --runs 3 \
  --latency-mode generation \
  --format md

04 · llama-benchy closed-loop

Large context · 25k

Stress test with large prompts. Not necessarily chat material, but exactly where the prefill wall becomes visible and TTFT collapses.

pp (prompt) 25000 · tg (gen) 256 · depth 0 · concurrency 10 · runs 3

tokens/sec

13.0 t/s

TTFT · p50

12.71s

3 runs · seed 42

uvx llama-benchy \
  --base-url http://localhost:8000/v1 \
  --model nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-NVFP4 \
  --pp 25000 \
  --tg 256 \
  --depth 0 \
  --concurrency 10 \
  --runs 3 \
  --latency-mode generation \
  --format md

05 · llama-benchy closed-loop

Multi-turn · office work

Five turns per conversation, ten conversations in parallel. Close to how a team actually uses this, with context growing each turn.

pp (prompt) 2048 · tg (gen) 512 · depth 4 · concurrency 10 · runs 3

tokens/sec

21.6 t/s

TTFT · p50

1.36s

3 runs · seed 42

uvx llama-benchy \
  --base-url http://localhost:8000/v1 \
  --model nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-NVFP4 \
  --pp 2048 \
  --tg 512 \
  --depth 4 \
  --concurrency 10 \
  --runs 3 \
  --latency-mode generation \
  --format md
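
For the five closed-loop runs above, TTFT and decode rate can be combined into a rough time-to-full-answer per user. A minimal sketch, assuming the reported tokens/sec is a per-user rate that stays constant over the whole response; the class and function names are illustrative:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClosedLoopRun:
    name: str
    gen_tokens: int      # value passed as --tg
    tps_per_user: float  # reported tokens/sec (assumed per user)
    ttft_s: float        # reported TTFT p50 in seconds


RUNS = [
    ClosedLoopRun("Chat", 1024, 22.9, 0.95),
    ClosedLoopRun("RAG 8k", 512, 19.6, 4.01),
    ClosedLoopRun("Long output", 4096, 25.2, 0.36),
    ClosedLoopRun("Large context 25k", 256, 13.0, 12.71),
    ClosedLoopRun("Multi-turn", 512, 21.6, 1.36),
]


def estimated_response_seconds(run: ClosedLoopRun) -> float:
    """Wait for the first token, then stream the rest at the decode rate."""
    return run.ttft_s + run.gen_tokens / run.tps_per_user


if __name__ == "__main__":
    for run in RUNS:
        print(f"{run.name:18s} ~{estimated_response_seconds(run):6.1f} s")
```

The estimate shows why the two metrics matter differently: on chat almost all of the wait is decode time, while at 25k context the prefill before the first token is a large share of the total.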

06 · vllm bench serve open-loop

Realistic office baseline

Random dataset · 4000 tokens in, 500 tokens out · request rate 0.3, burstiness 0.7. A quiet office.

dataset random · rate (req/s) 0.30 · burstiness 0.7 · prompts 200

tokens/sec

88.6 t/s

TTFT · p50

618ms

200 prompts · seed 42

docker exec vllm-bench vllm bench serve \
  --backend openai-chat \
  --base-url http://localhost:8000 \
  --endpoint /v1/chat/completions \
  --model nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-NVFP4 \
  --tokenizer nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-NVFP4 \
  --served-model-name Nemotron-3-Nano-Omni-30B-A3B-Reasoning-NVFP4 \
  --dataset-name random \
  --random-input-len 4000 \
  --random-output-len 500 \
  --random-range-ratio 0.9 \
  --num-prompts 200 \
  --request-rate 0.30 \
  --burstiness 0.7 \
  --percentile-metrics ttft,tpot,itl,e2el \
  --metric-percentiles 50,90,95,99 \
  --seed 42
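
What request rate and burstiness mean for an open-loop test can be illustrated with a small arrival generator. This is a sketch under an assumption: that gaps between requests follow a gamma distribution with shape equal to the burstiness and mean gap 1/rate, which is how vLLM's benchmark client is understood to behave. The function below is hypothetical and not vLLM code:

```python
import random


def arrival_times(rate: float, burstiness: float, n: int, seed: int = 42) -> list[float]:
    """Return n request send times in seconds for an open-loop load.

    Gaps are gamma distributed with shape=burstiness and scale chosen so the
    mean gap is 1/rate. burstiness=1.0 gives a Poisson process; values below
    1.0 give more clustered (burstier) arrivals.
    """
    rng = random.Random(seed)
    scale = 1.0 / (rate * burstiness)
    now = 0.0
    times = []
    for _ in range(n):
        now += rng.gammavariate(burstiness, scale)
        times.append(now)
    return times


if __name__ == "__main__":
    quiet_office = arrival_times(rate=0.30, burstiness=0.7, n=200)
    print(f"200 requests sent over roughly {quiet_office[-1]:.0f} s")
```

Unlike the closed-loop runs, the sender here does not wait for responses, so a server that cannot keep up shows it as rising TTFT.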

07 · vllm bench serve open-loop

Real conversations · ShareGPT

ShareGPT V3 · 228 tokens per turn on average · varying naturally per conversation. What real users do, not a synthetic random distribution.

dataset sharegpt v3 · rate (req/s) 0.30 · burstiness 0.7 · prompts 250

tokens/sec

13.1 t/s

TTFT · p50

157ms

250 prompts · seed 42

docker exec vllm-bench vllm bench serve \
  --backend openai-chat \
  --base-url http://localhost:8000 \
  --endpoint /v1/chat/completions \
  --model nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-NVFP4 \
  --tokenizer nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-NVFP4 \
  --served-model-name Nemotron-3-Nano-Omni-30B-A3B-Reasoning-NVFP4 \
  --dataset-name sharegpt \
  --dataset-path /tmp/ShareGPT_V3.json \
  --num-prompts 250 \
  --request-rate 0.30 \
  --burstiness 0.7 \
  --percentile-metrics ttft,tpot,itl,e2el \
  --metric-percentiles 50,90,95,99 \
  --seed 42

08 · vllm bench serve open-loop

Monday-morning peak

Random · 4000 in / 500 out · request rate 1.5 req/s, burstiness 1.0, max 25 parallel. When everyone logs in at once, do we see the queue grow?

dataset random · rate (req/s) 1.50 · burstiness 1.0 · prompts 300 · max parallel 25

tokens/sec

93.6 t/s

TTFT · p50

687ms

300 prompts · seed 42

docker exec vllm-bench vllm bench serve \
  --backend openai-chat \
  --base-url http://localhost:8000 \
  --endpoint /v1/chat/completions \
  --model nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-NVFP4 \
  --tokenizer nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-NVFP4 \
  --served-model-name Nemotron-3-Nano-Omni-30B-A3B-Reasoning-NVFP4 \
  --dataset-name random \
  --random-input-len 4000 \
  --random-output-len 500 \
  --random-range-ratio 0.9 \
  --num-prompts 300 \
  --request-rate 1.50 \
  --burstiness 1.0 \
  --max-concurrency 25 \
  --percentile-metrics ttft,tpot,itl,e2el \
  --metric-percentiles 50,90,95,99 \
  --seed 42

09 · vllm bench serve open-loop

Reasoning workload

Long chain-of-thought outputs · 1k in / 4k out · slow rate (0.2 req/s) because every request eats a lot of decode budget. Tests whether TTFT stays stable.

dataset random · rate (req/s) 0.20 · burstiness 1.0 · prompts 50

tokens/sec

— t/s

TTFT · p50

—

50 prompts · seed 42

docker exec vllm-bench vllm bench serve \
  --backend openai-chat \
  --base-url http://localhost:8000 \
  --endpoint /v1/chat/completions \
  --model nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-NVFP4 \
  --tokenizer nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-NVFP4 \
  --served-model-name Nemotron-3-Nano-Omni-30B-A3B-Reasoning-NVFP4 \
  --dataset-name random \
  --random-input-len 1000 \
  --random-output-len 4000 \
  --random-range-ratio 0.9 \
  --num-prompts 50 \
  --request-rate 0.20 \
  --burstiness 1.0 \
  --percentile-metrics ttft,tpot,itl,e2el \
  --metric-percentiles 50,90,95,99 \
  --seed 42

05 What I made of it

What works

Drop-in upgrade from FP8 to NVFP4

Same recipe, different precision flag, and a 1.5x speedup comes out (23 vs 15 t/s/user on chat). The checkpoint drops from 33 to 21 GB, leaving well over 100 GB of KV-cache headroom.
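
The headroom figure follows from the Spark's 128 GB of unified memory minus the checkpoint size. A rough sketch of that budget; it ignores the OS, the vLLM runtime and activation memory, so treat the result as an upper bound rather than what the KV cache actually gets:

```python
UNIFIED_MEMORY_GB = 128                      # DGX Spark, from the header of this card
CHECKPOINT_GB = {"FP8": 33, "NVFP4": 21}     # checkpoint sizes quoted above
CHAT_TPS_PER_USER = {"FP8": 15, "NVFP4": 23}


def headroom_gb(precision: str) -> int:
    """Memory left after loading the weights (upper bound, see caveats)."""
    return UNIFIED_MEMORY_GB - CHECKPOINT_GB[precision]


def chat_speedup() -> float:
    """NVFP4 over FP8 on the chat benchmark, per user."""
    return round(CHAT_TPS_PER_USER["NVFP4"] / CHAT_TPS_PER_USER["FP8"], 2)


if __name__ == "__main__":
    print(headroom_gb("FP8"), headroom_gb("NVFP4"), chat_speedup())
```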

What broke

Second round of NVFP4 setup pain

Needs the same patches as Gemma-4-NVFP4: vLLM dev154+, a flashinfer version-check bypass, and a sampler fallback. Not a push-button install.

What disappointed

No native FP4 on SM12.1

Kernel emulation via Marlin. On datacenter Blackwell the NVFP4 gain would be larger still. What we see now is the bandwidth half of the gain without the compute half.
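
The "bandwidth half" can be made concrete with a back-of-the-envelope model: if decode is limited by reading the active weights once per token, the speedup ceiling is the ratio of bits per weight. A sketch under stated assumptions: NVFP4 is taken as 4-bit values plus one 8-bit scale per 16-weight block (4.5 bits per weight), and KV-cache reads, activations, unquantized layers and dequantization overhead are all ignored:

```python
# Approximate storage cost per weight, in bits (assumptions, see lead-in).
BITS_PER_WEIGHT = {
    "BF16": 16.0,
    "FP8": 8.0,
    "NVFP4": 4.0 + 8.0 / 16.0,  # 4-bit value + 8-bit scale shared by 16 weights
}

ACTIVE_PARAMS = 3e9  # the "A3B" in the model name: about 3B active parameters


def weight_bytes_per_token(fmt: str, active_params: float = ACTIVE_PARAMS) -> float:
    """Bytes of weights read to decode one token, if each is read once."""
    return active_params * BITS_PER_WEIGHT[fmt] / 8.0


def bandwidth_bound_speedup(baseline: str, target: str) -> float:
    """Ceiling on decode speedup when memory bandwidth is the only limit."""
    return BITS_PER_WEIGHT[baseline] / BITS_PER_WEIGHT[target]


if __name__ == "__main__":
    print(f"FP8  -> NVFP4 ceiling: {bandwidth_bound_speedup('FP8', 'NVFP4'):.2f}x")
    print(f"BF16 -> NVFP4 ceiling: {bandwidth_bound_speedup('BF16', 'NVFP4'):.2f}x")
```

Under this model the ceilings come out near 1.8x over FP8 and 3.6x over BF16. The measured 1.5x over FP8 on chat sits below that, which fits a gain that comes from moving fewer bytes rather than from faster math.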

What surprised

Reasoning workload suddenly becomes feasible

Long chain-of-thought still holds 25 t/s/user at c=10 (Run G). On BF16 or FP8 the same scenario on the Spark is more of a waiting game.

More numbers?
Read the full article.

Post: Nemotron-3 in three precisions · Weights: Hugging Face · Back: to the arena
