NVIDIA · 30B-A3B params · FP8 · MoE

Nemotron-3-Nano-Omni-30B-A3B-Reasoning

This is where it gets serious. FP8 roughly doubles BF16's decode speed with no quality loss you would notice. Decode runs at around 15 tokens/s per user on chat, and TTFT drops below one second. Under the Monday-morning peak it does almost twice as well as BF16. For most workloads this should be your default choice: a small sacrifice for a sizeable gain.

62.5

Arena score

Throughput tok/s

33 GB

VRAM

8/9

Benches measured

Hugging Face → vLLM 0.20.0 · DGX Spark, NVIDIA GB10, 128 GB unified memory · Last measured 3 May 2026
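A rough sanity check on the 33 GB figure: a minimal sketch that estimates weight memory from parameter count and bytes per parameter. It assumes the nominal 30B from the model name and that every weight is stored at the stated precision, which is a simplification; KV cache, activations, any layers kept at higher precision, and runtime overhead are not modeled. The function name is illustrative.

```python
def weight_footprint_gb(n_params: float, bytes_per_param: float) -> float:
    """Approximate weight memory in decimal GB (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9

# Nominal "30B" taken from the model name; the exact count is an assumption.
N_PARAMS = 30e9

fp8_gb = weight_footprint_gb(N_PARAMS, 1)   # FP8: 1 byte per parameter
bf16_gb = weight_footprint_gb(N_PARAMS, 2)  # BF16: 2 bytes per parameter
print(f"FP8 ~{fp8_gb:.0f} GB, BF16 ~{bf16_gb:.0f} GB")
```

Under these assumptions the FP8 weights alone account for most of the reported 33 GB, and a BF16 copy would need roughly twice that.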

02 Quality · MMLU · GPQA · HumanEval artificial analysis ↗

The quality component of the Arena score. I did not measure these myself; they come from the vendor's official model cards. For cross-model comparison with a consistent eval harness, Artificial Analysis is a useful third party. The average of the three benchmarks feeds one-to-one into the Score formula (weighted heavier in Aggregate/Agent, lighter in Batch).
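As a worked check, the average can be reproduced as a plain unweighted mean of the three vendor-reported scores listed in this section. The sketch only covers the averaging step; the Score formula and its per-profile weights are not shown on this page and are not modeled here.

```python
# Vendor-reported scores quoted in this section.
scores = {"MMLU-Pro": 77.3, "GPQA-Diamond": 72.2, "HumanEval": 63.2}

def quality_average(values):
    """Unweighted arithmetic mean of the benchmark scores."""
    values = list(values)
    return sum(values) / len(values)

avg = quality_average(scores.values())
print(round(avg, 1))  # 70.9
```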

70.9

Avg

77.3

MMLU-Pro

72.2

GPQA-Diamond

63.2

HumanEval

03 Performance · FP8 vs BF16 sibling

Decode throughput · total t/s · c=10

Context    FP8        BF16 sibling
1k ctx     138 t/s    76.0 t/s
8k ctx     119 t/s    69.0 t/s
4k+turn    134 t/s    73.0 t/s
25k ctx    55.0 t/s   39.0 t/s
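To make the comparison concrete, the sketch below turns the decode throughput figures above into FP8-over-BF16 speedup ratios. The numbers are copied from this section; the dictionary layout is just an illustration.

```python
# (FP8, BF16) total decode throughput in t/s at concurrency 10, from the figures above.
throughput = {
    "1k ctx": (138.0, 76.0),
    "8k ctx": (119.0, 69.0),
    "4k+turn": (134.0, 73.0),
    "25k ctx": (55.0, 39.0),
}

# Ratio > 1 means FP8 is faster than its BF16 sibling.
speedup = {ctx: fp8 / bf16 for ctx, (fp8, bf16) in throughput.items()}

for ctx, ratio in speedup.items():
    print(f"{ctx}: {ratio:.2f}x")
```

The ratio sits around 1.7x to 1.8x up to 8k context and narrows to about 1.4x at 25k.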

04 Test suite · 9 benchmarks methodology →

The suite consists of 5 closed-loop tests with llama-benchy and 4 open-loop tests with vllm bench serve. Each benchmark reports tokens/sec (decode throughput) and TTFT p50. TTFT translates directly into how responsive the model feels, tokens/sec into capacity. The exact command is listed with each test.
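How the two metrics combine into what a user experiences can be sketched as a simple estimate: time to a complete answer is roughly TTFT plus output length divided by the per-user decode rate. This is a simplification that assumes a constant decode rate and ignores queueing; the function name is illustrative, and the example values are the Chat test figures from this page (960 ms TTFT, 15.3 t/s, 1024 generated tokens).

```python
def estimated_response_time(ttft_s: float, output_tokens: int, decode_tps: float) -> float:
    """Seconds until the last token arrives: first-token wait plus decode time."""
    return ttft_s + output_tokens / decode_tps

# Chat test: TTFT p50 of 0.96 s, 1024 output tokens, 15.3 t/s per user.
chat_seconds = estimated_response_time(0.96, 1024, 15.3)
print(f"{chat_seconds:.1f} s")
```

TTFT sets the wait before anything appears, while the decode rate dominates the total time for long answers.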

01 · llama-benchy closed-loop

Chat

Short prompt, long answer. This is the shape that should feel like normal chat; TTFT decides whether it feels "snappy".

pp (prompt) 1024 · tg (gen) 1024 · depth 0 · concurrency 10 · runs 3

tokens/sec

15.3 t/s

TTFT · p50

960ms

3 runs · seed 42

uvx llama-benchy \
  --base-url http://localhost:8000/v1 \
  --model nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-FP8 \
  --pp 1024 \
  --tg 1024 \
  --depth 0 \
  --concurrency 10 \
  --runs 3 \
  --latency-mode generation \
  --format md

02 · llama-benchy closed-loop

RAG · 8k context

Medium-sized context: a few document chunks with an answer of normal length. Shows the prefill cost without hitting the wall.

pp (prompt) 8192 · tg (gen) 512 · depth 0 · concurrency 10 · runs 3

tokens/sec

14.4 t/s

TTFT · p50

5.11s

3 runs · seed 42

uvx llama-benchy \
  --base-url http://localhost:8000/v1 \
  --model nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-FP8 \
  --pp 8192 \
  --tg 512 \
  --depth 0 \
  --concurrency 10 \
  --runs 3 \
  --latency-mode generation \
  --format md

03 · llama-benchy closed-loop

Long output / agents

Short instruction, lots of output: code generation, reports, or structured agent output. A stress test for decode throughput.

pp (prompt) 256 · tg (gen) 4096 · depth 0 · concurrency 10 · runs 3

tokens/sec

18.4 t/s

TTFT · p50

450ms

3 runs · seed 42

uvx llama-benchy \
  --base-url http://localhost:8000/v1 \
  --model nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-FP8 \
  --pp 256 \
  --tg 4096 \
  --depth 0 \
  --concurrency 10 \
  --runs 3 \
  --latency-mode generation \
  --format md

04 · llama-benchy closed-loop

Large context · 25k

Stress test with large prompts. Not necessarily chat material, but exactly where the prefill wall becomes visible and TTFT collapses.

pp (prompt) 25000 · tg (gen) 256 · depth 0 · concurrency 10 · runs 3

tokens/sec

8.6 t/s

TTFT · p50

16.89s

3 runs · seed 42

uvx llama-benchy \
  --base-url http://localhost:8000/v1 \
  --model nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-FP8 \
  --pp 25000 \
  --tg 256 \
  --depth 0 \
  --concurrency 10 \
  --runs 3 \
  --latency-mode generation \
  --format md
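To put the prefill wall in perspective, the sketch below divides TTFT p50 by prompt length for the four single-turn closed-loop tests on this page. It is a crude indicator only: at concurrency 10, TTFT also includes waiting behind other requests' prefill and fixed per-request overhead, so the result should not be read as a pure per-token prefill cost.

```python
# (prompt tokens, TTFT p50 in seconds) from the four single-turn closed-loop tests.
tests = {
    "Chat": (1024, 0.96),
    "RAG 8k": (8192, 5.11),
    "Long output": (256, 0.45),
    "Large context 25k": (25000, 16.89),
}

# Milliseconds of TTFT per prompt token; includes queueing and fixed overhead.
ms_per_prompt_token = {
    name: ttft_s / prompt_tokens * 1000
    for name, (prompt_tokens, ttft_s) in tests.items()
}

for name, ms in ms_per_prompt_token.items():
    print(f"{name}: {ms:.2f} ms per prompt token")
```

By this measure the 8k and 25k tests land at a similar per-token figure, so the jump in absolute TTFT tracks prompt length closely.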

05 · llama-benchy closed-loop

Multi-turn · office work

Five turns per conversation, ten conversations in parallel. Close to how a team actually uses this, with context growing each turn.

pp (prompt) 2048 · tg (gen) 512 · depth 4 · concurrency 10 · runs 3

tokens/sec

14.9 t/s

TTFT · p50

1.55s

3 runs · seed 42

uvx llama-benchy \
  --base-url http://localhost:8000/v1 \
  --model nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-FP8 \
  --pp 2048 \
  --tg 512 \
  --depth 4 \
  --concurrency 10 \
  --runs 3 \
  --latency-mode generation \
  --format md

06 · vllm bench serve open-loop

Realistic office baseline

Random dataset · 4000 tokens in, 500 tokens out · request rate 0.3, burstiness 0.7. A quiet office.

dataset random · rate 0.30 req/s · burstiness 0.7 · prompts 200

tokens/sec

72.1 t/s

TTFT · p50

732ms

200 prompts · seed 42

docker exec vllm-bench vllm bench serve \
  --backend openai-chat \
  --base-url http://localhost:8000 \
  --endpoint /v1/chat/completions \
  --model nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-FP8 \
  --tokenizer nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-FP8 \
  --served-model-name Nemotron-3-Nano-Omni-30B-A3B-Reasoning-FP8 \
  --dataset-name random \
  --random-input-len 4000 \
  --random-output-len 500 \
  --random-range-ratio 0.9 \
  --num-prompts 200 \
  --request-rate 0.30 \
  --burstiness 0.7 \
  --percentile-metrics ttft,tpot,itl,e2el \
  --metric-percentiles 50,90,95,99 \
  --seed 42
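What request rate and burstiness mean for an open-loop test can be sketched with a small arrival-time generator. It assumes inter-arrival gaps are drawn from a gamma distribution with shape equal to the burstiness and scale 1 / (rate × burstiness); to my understanding that mirrors how vllm bench serve schedules requests, but treat it as an assumption about the tool. The function name and the stdlib-only implementation are illustrative.

```python
import random

def arrival_times(request_rate: float, burstiness: float, n: int, seed: int = 42):
    """Return n cumulative arrival times (seconds) with gamma-distributed gaps.

    shape = burstiness, scale = 1 / (request_rate * burstiness), so the mean
    gap stays 1 / request_rate regardless of burstiness.
    """
    rng = random.Random(seed)
    scale = 1.0 / (request_rate * burstiness)
    t, times = 0.0, []
    for _ in range(n):
        t += rng.gammavariate(burstiness, scale)  # alpha = shape, beta = scale
        times.append(t)
    return times

times = arrival_times(request_rate=0.3, burstiness=0.7, n=20000)
mean_gap = times[-1] / len(times)
print(f"mean gap: {mean_gap:.2f} s (target {1 / 0.3:.2f} s)")
```

With burstiness 1.0 the gaps are exponential, i.e. a Poisson arrival process; values below 1.0 keep the same average rate but cluster arrivals more. Requests are sent on this schedule whether or not earlier ones have finished, which is what makes the test open-loop.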

07 · vllm bench serve open-loop

Real conversations · ShareGPT

ShareGPT V3 · 228 tokens per turn on average · varying naturally per conversation. What real users do, rather than a synthetic random distribution.

dataset sharegpt v3 · rate 0.30 req/s · burstiness 0.7 · prompts 250

tokens/sec

11.0 t/s

TTFT · p50

220ms

250 prompts · seed 42

docker exec vllm-bench vllm bench serve \
  --backend openai-chat \
  --base-url http://localhost:8000 \
  --endpoint /v1/chat/completions \
  --model nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-FP8 \
  --tokenizer nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-FP8 \
  --served-model-name Nemotron-3-Nano-Omni-30B-A3B-Reasoning-FP8 \
  --dataset-name sharegpt \
  --dataset-path /tmp/ShareGPT_V3.json \
  --num-prompts 250 \
  --request-rate 0.30 \
  --burstiness 0.7 \
  --percentile-metrics ttft,tpot,itl,e2el \
  --metric-percentiles 50,90,95,99 \
  --seed 42

08 · vllm bench serve open-loop

Monday-morning peak

Random · 4000 in / 500 out · request rate 1.5 req/s, burstiness 1.0, max 25 parallel. When everyone logs in at once, do we see the queue grow?

dataset random · rate 1.50 req/s · burstiness 1.0 · prompts 300 · max parallel 25

tokens/sec

69.7 t/s

TTFT · p50

757ms

300 prompts · seed 42

docker exec vllm-bench vllm bench serve \
  --backend openai-chat \
  --base-url http://localhost:8000 \
  --endpoint /v1/chat/completions \
  --model nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-FP8 \
  --tokenizer nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-FP8 \
  --served-model-name Nemotron-3-Nano-Omni-30B-A3B-Reasoning-FP8 \
  --dataset-name random \
  --random-input-len 4000 \
  --random-output-len 500 \
  --random-range-ratio 0.9 \
  --num-prompts 300 \
  --request-rate 1.50 \
  --burstiness 1.0 \
  --max-concurrency 25 \
  --percentile-metrics ttft,tpot,itl,e2el \
  --metric-percentiles 50,90,95,99 \
  --seed 42

09 · vllm bench serve open-loop

Reasoning workload

Long chain-of-thought outputs · 1k in / 4k out · slow rate (0.2 req/s) because every request costs a lot of decode budget. Tests whether TTFT stays stable.

dataset random · rate 0.20 req/s · burstiness 1.0 · prompts 50

tokens/sec

— t/s

TTFT · p50

—

50 prompts · seed 42

docker exec vllm-bench vllm bench serve \
  --backend openai-chat \
  --base-url http://localhost:8000 \
  --endpoint /v1/chat/completions \
  --model nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-FP8 \
  --tokenizer nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-FP8 \
  --served-model-name Nemotron-3-Nano-Omni-30B-A3B-Reasoning-FP8 \
  --dataset-name random \
  --random-input-len 1000 \
  --random-output-len 4000 \
  --random-range-ratio 0.9 \
  --num-prompts 50 \
  --request-rate 0.20 \
  --burstiness 1.0 \
  --percentile-metrics ttft,tpot,itl,e2el \
  --metric-percentiles 50,90,95,99 \
  --seed 42

05 What I thought of it

What works

FP8 native on Blackwell, full compute gain

Tensor cores run directly on 8-bit, with no kernel emulation as with FP4. Decode doubles vs BF16 (15 vs 7.8 t/s per user on chat), and TTFT drops from 1.33 to 0.96 seconds.
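The two gains are of different size, which a short calculation makes visible. The sketch uses the chat figures from this page (15.3 vs 7.8 t/s per user, 0.96 vs 1.33 s TTFT); the variable names are illustrative.

```python
# Chat test figures for FP8 and its BF16 sibling, as quoted on this page.
fp8 = {"decode_tps_per_user": 15.3, "ttft_s": 0.96}
bf16 = {"decode_tps_per_user": 7.8, "ttft_s": 1.33}

# Speedup: how many times faster FP8 decodes.
decode_speedup = fp8["decode_tps_per_user"] / bf16["decode_tps_per_user"]
# Relative reduction: fraction of BF16's TTFT that FP8 removes.
ttft_reduction = 1 - fp8["ttft_s"] / bf16["ttft_s"]

print(f"decode: {decode_speedup:.2f}x, TTFT: {ttft_reduction:.1%} lower")
```

Decode comes out at just under 2x, while TTFT improves by a little over a quarter.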

What broke

Config keys drift between vLLM versions

The canonical name for W8A8 weights/activations changes between dev builds. Adjusting the profile after every vLLM upgrade is simply part of the job.

What disappointed

Quality drop not measured myself

NVIDIA's Table 14 shows a mean of -0.37 across multimodal evals vs BF16. Not tested on text-only tasks; the assumption is parity with BF16 within measurement uncertainty.

What surprised

Tail latency improves more than the decode mean

Not only the average improves: P95 and P99 TTFT on H/I/J drop proportionally or more. For how chat feels, that helps far more than a few extra t/s.
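For readers less familiar with tail metrics, the sketch below shows how P50, P95, and P99 can be computed from a list of TTFT samples using linear interpolation between ranks. The samples are hypothetical and chosen only to show a long tail; they are not measurements from this page, and the benchmark tool may use a slightly different interpolation rule.

```python
def percentile(samples, p):
    """Linear-interpolation percentile (0 <= p <= 100) of a non-empty sample list."""
    ordered = sorted(samples)
    rank = (len(ordered) - 1) * p / 100
    lo = int(rank)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)

# Hypothetical TTFT samples in seconds, for illustration only (not measured data).
ttft = [0.70, 0.72, 0.75, 0.76, 0.78, 0.80, 0.85, 0.90, 1.40, 2.60]

p50, p95, p99 = (percentile(ttft, p) for p in (50, 95, 99))
print(f"P50 {p50:.2f} s, P95 {p95:.2f} s, P99 {p99:.2f} s")
```

A couple of slow requests barely move P50 but dominate P95 and P99, which is why a drop in the tail is felt more than a small gain in the mean.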

More numbers?
Read the full article.

Post: Nemotron-3 in three precisions · Weights: Hugging Face · Back to the arena
