RedHatAI · 24B params · NVFP4 · Dense

Mistral-Small-3.2-24B-Instruct-2506

Mistral Small 3.2 NVFP4 is not an easy win on the Spark. ShareGPT is usable, but the 4k office baseline climbs to a p95 TTFT of 155.8 s. For on-prem chat I would pick Gemma-4 NVFP4 here instead.

61.9

Arena score

Throughput tok/s

16 GB

VRAM

8/9

Benchmarks measured

Hugging Face → vLLM v0.23.0 · DGX Spark, NVIDIA GB10, 128 GB unified memory · Last measured 23 June 2026

02 Quality · MMLU · GPQA · HumanEval artificial analysis ↗

The quality component of the Arena score. Not measured in-house; taken from the vendor's official model cards. For cross-model comparison with a consistent eval harness, Artificial Analysis is a useful third party. The average of the three benchmarks feeds directly into the Score formula (weighted more heavily in Aggregate/Agent, more lightly in Batch).

73.2

Avg

80.5

MMLU-Pro

46.1

GPQA-Diamond

92.9

HumanEval
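The Avg tile can be reproduced from the three scores above. A minimal sketch in Python; the variable names are illustrative, and the weighting inside the Score formula is left out because this page does not spell it out.

```python
# Vendor-reported scores as listed on this page.
quality_scores = {"MMLU-Pro": 80.5, "GPQA-Diamond": 46.1, "HumanEval": 92.9}

# Unweighted mean, rounded to one decimal like the "Avg" tile.
quality_avg = round(sum(quality_scores.values()) / len(quality_scores), 1)
print(quality_avg)
```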

03 Performance · NVFP4

Decode throughput · total t/s · c=10

NVFP4

1k ctx 63.0 t/s

8k ctx 17.0 t/s

4k+turn 94.0 t/s

25k ctx 5.0 t/s

04 Test suite · 9 benchmarks · methodology →

Five closed-loop tests with llama-benchy and four open-loop tests with vllm bench serve. Each benchmark reports tokens/sec (decode throughput) and TTFT p50. TTFT translates directly into how responsive the model feels; tokens/sec translates into capacity. The exact command is listed under each benchmark.
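To see how the two numbers combine, a rough estimate of the total wait for one answer is TTFT plus output length divided by the decode rate. The sketch below assumes the tokens/sec figure behaves like a steady per-request decode rate, which this page does not state explicitly; the function name is illustrative.

```python
def estimated_response_seconds(ttft_s: float, output_tokens: int, decode_tps: float) -> float:
    """Rough end-to-end wait: time to first token plus steady decode time."""
    if decode_tps <= 0:
        raise ValueError("decode_tps must be positive")
    return ttft_s + output_tokens / decode_tps

# Chat benchmark from this suite: TTFT p50 7.01 s, 1024 generated tokens, 12.6 t/s.
chat_wait = estimated_response_seconds(7.01, 1024, 12.6)
print(f"{chat_wait:.1f} s")
```

Under that assumption a full 1024-token chat answer takes well over a minute to complete, even though the first token arrives after about seven seconds.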

01 · llama-benchy closed-loop

Chat

Short prompt, long answer. The shape that should feel like normal chat; TTFT decides whether it feels "snappy".

pp (prompt) 1024 · tg (gen) 1024 · depth 0 · concurrency 10 · runs 3

tokens/sec

12.6 t/s

TTFT · p50

7.01 s

3 runs · seed 42

uvx llama-benchy \
  --base-url http://localhost:8000/v1 \
  --model RedHatAI/Mistral-Small-3.2-24B-Instruct-2506-NVFP4 \
  --pp 1024 \
  --tg 1024 \
  --depth 0 \
  --concurrency 10 \
  --runs 3 \
  --latency-mode generation \
  --format md

02 · llama-benchy closed-loop

RAG · 8k context

Medium-sized context: a few document chunks with an answer of normal length. Shows the prefill cost without hitting the wall.

pp (prompt) 8192 · tg (gen) 512 · depth 0 · concurrency 10 · runs 3

tokens/sec

4.6 t/s

TTFT · p50

39.89 s

3 runs · seed 42

uvx llama-benchy \
  --base-url http://localhost:8000/v1 \
  --model RedHatAI/Mistral-Small-3.2-24B-Instruct-2506-NVFP4 \
  --pp 8192 \
  --tg 512 \
  --depth 0 \
  --concurrency 10 \
  --runs 3 \
  --latency-mode generation \
  --format md

03 · llama-benchy closed-loop

Long output / agents

Short instruction, lots of output. Code generation, reports or structured agent output. A stress test for decode throughput.

pp (prompt) 256 · tg (gen) 4096 · depth 0 · concurrency 10 · runs 3

tokens/sec

15.3 t/s

TTFT · p50

5.20 s

3 runs · seed 42

uvx llama-benchy \
  --base-url http://localhost:8000/v1 \
  --model RedHatAI/Mistral-Small-3.2-24B-Instruct-2506-NVFP4 \
  --pp 256 \
  --tg 4096 \
  --depth 0 \
  --concurrency 10 \
  --runs 3 \
  --latency-mode generation \
  --format md

04 · llama-benchy closed-loop

Large context · 25k

Stress test with large prompts. Not necessarily chat material, but exactly where the prefill wall becomes visible and TTFT collapses.

pp (prompt) 25000 · tg (gen) 256 · depth 0 · concurrency 10 · runs 3

tokens/sec

2.0 t/s

TTFT · p50

127.73 s

3 runs · seed 42

uvx llama-benchy \
  --base-url http://localhost:8000/v1 \
  --model RedHatAI/Mistral-Small-3.2-24B-Instruct-2506-NVFP4 \
  --pp 25000 \
  --tg 256 \
  --depth 0 \
  --concurrency 10 \
  --runs 3 \
  --latency-mode generation \
  --format md

05 · llama-benchy closed-loop

Multi-turn · office work

Five turns per conversation, ten conversations in parallel. Close to how a team actually uses this, with context growing every turn.

pp (prompt) 2048 · tg (gen) 512 · depth 4 · concurrency 10 · runs 3

tokens/sec

12.7 t/s

TTFT · p50

12.38 s

3 runs · seed 42

uvx llama-benchy \
  --base-url http://localhost:8000/v1 \
  --model RedHatAI/Mistral-Small-3.2-24B-Instruct-2506-NVFP4 \
  --pp 2048 \
  --tg 512 \
  --depth 4 \
  --concurrency 10 \
  --runs 3 \
  --latency-mode generation \
  --format md

06 · vllm bench serve open-loop

Realistic office baseline

Random dataset · 4000 tokens in, 500 tokens out · request rate 0.3, burstiness 0.7. A quiet office.

dataset random · rate (req/s) 0.30 · burstiness 0.7 · prompts 200

tokens/sec

4.7 t/s

TTFT · p50

86.74 s

200 prompts · seed 42

docker exec vllm-bench vllm bench serve \
  --backend openai-chat \
  --base-url http://localhost:8000 \
  --endpoint /v1/chat/completions \
  --model RedHatAI/Mistral-Small-3.2-24B-Instruct-2506-NVFP4 \
  --tokenizer RedHatAI/Mistral-Small-3.2-24B-Instruct-2506-NVFP4 \
  --served-model-name Mistral-Small-3.2-24B-Instruct-2506-NVFP4 \
  --dataset-name random \
  --random-input-len 4000 \
  --random-output-len 500 \
  --random-range-ratio 0.9 \
  --num-prompts 200 \
  --request-rate 0.30 \
  --burstiness 0.7 \
  --percentile-metrics ttft,tpot,itl,e2el \
  --metric-percentiles 50,90,95,99 \
  --seed 42
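One way to read this result is to compare the offered load with what the server sustained elsewhere on this page. A minimal sketch, assuming the configured input and output lengths are roughly the mean request sizes (the --random-range-ratio setting makes individual requests vary); the function name is illustrative.

```python
def offered_load(request_rate: float, input_len: int, output_len: int) -> dict:
    """Tokens per second the open-loop client asks the server to absorb."""
    return {
        "prefill_tok_s": request_rate * input_len,
        "decode_tok_s": request_rate * output_len,
    }

# Office baseline: 0.30 req/s, 4000 tokens in, 500 tokens out.
load = offered_load(0.30, 4000, 500)
print(load)
```

At 0.30 req/s the client asks for about 150 output tokens per second on top of about 1200 prompt tokens per second, while the closed-loop runs on this page peak at 94.0 t/s total. If those figures are comparable, requests arrive faster than they complete and the queue keeps growing, which would be consistent with the TTFT measured here.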

07 · vllm bench serve open-loop

Real conversations · ShareGPT

ShareGPT V3 · 228 tokens per turn on average · varying naturally per conversation. What real users do, not a synthetic random distribution.

dataset sharegpt v3 · rate (req/s) 0.30 · burstiness 0.7 · prompts 250

tokens/sec

21.9 t/s

TTFT · p50

1.00 s

250 prompts · seed 42

docker exec vllm-bench vllm bench serve \
  --backend openai-chat \
  --base-url http://localhost:8000 \
  --endpoint /v1/chat/completions \
  --model RedHatAI/Mistral-Small-3.2-24B-Instruct-2506-NVFP4 \
  --tokenizer RedHatAI/Mistral-Small-3.2-24B-Instruct-2506-NVFP4 \
  --served-model-name Mistral-Small-3.2-24B-Instruct-2506-NVFP4 \
  --dataset-name sharegpt \
  --dataset-path /tmp/ShareGPT_V3.json \
  --num-prompts 250 \
  --request-rate 0.30 \
  --burstiness 0.7 \
  --percentile-metrics ttft,tpot,itl,e2el \
  --metric-percentiles 50,90,95,99 \
  --seed 42
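For intuition on the rate and burstiness settings: to my understanding, vllm bench serve draws the gaps between requests from a gamma distribution whose shape is the burstiness value, so 1.0 gives Poisson arrivals and values below 1.0 give burstier traffic. That is an assumption about the tool's internals, not something this page states. The sketch uses only Python's standard library, and the function name is illustrative.

```python
import random

def arrival_gaps(request_rate: float, burstiness: float, n: int, seed: int = 42) -> list:
    """Sample n inter-arrival gaps (seconds) with mean 1 / request_rate."""
    rng = random.Random(seed)
    # Scale chosen so that shape * scale == 1 / request_rate.
    scale = 1.0 / (request_rate * burstiness)
    return [rng.gammavariate(burstiness, scale) for _ in range(n)]

gaps = arrival_gaps(request_rate=0.30, burstiness=0.7, n=20000)
mean_gap = sum(gaps) / len(gaps)
print(f"mean gap {mean_gap:.2f} s, target {1 / 0.30:.2f} s")
```

The mean gap stays near 1 / rate regardless of burstiness; only the spread changes, so a lower burstiness clusters requests together and leaves longer quiet stretches in between.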

08 · vllm bench serve open-loop

Monday-morning peak

Random · 4000 in / 500 out · request rate 1.5 req/s, burstiness 1.0, at most 25 in parallel. When everyone logs in at once, does the queue grow?

dataset random · rate (req/s) 1.50 · burstiness 1.0 · prompts 300 · max parallel 25

tokens/sec

30.5 t/s

TTFT · p50

4.84 s

300 prompts · seed 42

docker exec vllm-bench vllm bench serve \
  --backend openai-chat \
  --base-url http://localhost:8000 \
  --endpoint /v1/chat/completions \
  --model RedHatAI/Mistral-Small-3.2-24B-Instruct-2506-NVFP4 \
  --tokenizer RedHatAI/Mistral-Small-3.2-24B-Instruct-2506-NVFP4 \
  --served-model-name Mistral-Small-3.2-24B-Instruct-2506-NVFP4 \
  --dataset-name random \
  --random-input-len 4000 \
  --random-output-len 500 \
  --random-range-ratio 0.9 \
  --num-prompts 300 \
  --request-rate 1.50 \
  --burstiness 1.0 \
  --max-concurrency 25 \
  --percentile-metrics ttft,tpot,itl,e2el \
  --metric-percentiles 50,90,95,99 \
  --seed 42
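The concurrency cap can be related to the arrival rate with Little's law (requests in flight = arrival rate × mean time in system). A minimal sketch; the function name is illustrative and the result is a steady-state rule of thumb, not a measurement from this run.

```python
def max_mean_latency_s(max_concurrency: int, request_rate: float) -> float:
    """Mean request latency above which the concurrency cap becomes the limit."""
    if request_rate <= 0:
        raise ValueError("request_rate must be positive")
    return max_concurrency / request_rate

# Monday-morning peak: at most 25 requests in flight, 1.5 req/s offered.
limit = max_mean_latency_s(25, 1.5)
print(f"{limit:.1f} s")
```

If requests take longer than this on average, the cap of 25 holds new requests back on the client side and the effective request rate falls below the nominal 1.5 req/s.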

09 · vllm bench serve open-loop

Reasoning workload

Long chain-of-thought outputs · 1k in / 4k out · slow rate (0.2 req/s) because every request consumes a lot of decode budget. Tests whether TTFT stays stable.

dataset random · rate (req/s) 0.20 · burstiness 1.0 · prompts 50

tokens/sec

— t/s

TTFT · p50

—

50 prompts · seed 42

docker exec vllm-bench vllm bench serve \
  --backend openai-chat \
  --base-url http://localhost:8000 \
  --endpoint /v1/chat/completions \
  --model RedHatAI/Mistral-Small-3.2-24B-Instruct-2506-NVFP4 \
  --tokenizer RedHatAI/Mistral-Small-3.2-24B-Instruct-2506-NVFP4 \
  --served-model-name Mistral-Small-3.2-24B-Instruct-2506-NVFP4 \
  --dataset-name random \
  --random-input-len 1000 \
  --random-output-len 4000 \
  --random-range-ratio 0.9 \
  --num-prompts 50 \
  --request-rate 0.20 \
  --burstiness 1.0 \
  --percentile-metrics ttft,tpot,itl,e2el \
  --metric-percentiles 50,90,95,99 \
  --seed 42

05 What I thought of it

What works

ShareGPT stays usable

On real, short conversations Mistral Small NVFP4 reaches 57.9 output tok/s with a p95 TTFT of 2546 ms. Not fast, but usable as long as chat is not under constant load.

What broke

The 4k office baseline breaks

Test H is tough: 200/200 requests succeed, but p95 TTFT is 155844.87 ms and p95 TPOT is 4429.32 ms. That does not feel like chat.

What disappointed

Decode is low for NVFP4

Run C stays at 63.13 tok/s total and 12.63 tok/s per request. For a 24B NVFP4 model on this machine that is not a convincing score.

What surprised

Mistral is workload-sensitive

Multi-turn reaches 94.20 tok/s, better than the simple chat run. Apparently the scheduler gets a workload shape there that suits it better.

More numbers?
Read the full article.
