Mistral AI · 3B params · BF16 · Dense

Ministral-3 3B

First Mistral model in the Arena. Tight decode (49 t/s per user on chat at c=10, 53 t/s on long-output) and strong quality for its size: GPQA Diamond 53.4 and LiveCodeBench 54.8 are numbers you normally need the 8B class for. ShareGPT TTFT p50 is under 70 ms, nearly identical to the 0.8B Qwen and faster than almost everything above it. Under the Monday-morning peak it completes 0.78 RPS of the 1.5 scheduled: reasonable, but no outlier. On large context (25k×10) it breaks like other dense BF16 models: 9.6 t/s per user. There you want a quantized model or an MoE.

Arena score: 59.6

Throughput: 183 tok/s

VRAM: 8 GB

Benchmarks measured: 9/9

Hugging Face → vLLM v0.20.1 · DGX Spark, NVIDIA GB10, 128 GB unified memory · Last measured 7 May 2026

02 Quality · MMLU · GPQA · HumanEval

The quality component of the Arena score. Not measured by me; taken from the vendor's official model cards. For a cross-model comparison with a consistent eval harness, Artificial Analysis is a useful third party. The average of the three benchmarks feeds one-to-one into the Score formula (weighted more heavily in Aggregate/Agent, more lightly in Batch).

Avg: 59.6

MMLU-Pro: 70.7

GPQA-Diamond: 53.4

HumanEval: 54.8
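The Avg figure can be reproduced as a plain mean of the three listed scores. A minimal sketch: the function name is mine, and the assumption that the average is unweighted is inferred from the numbers, not stated anywhere on the page.

```python
# Quality scores as listed above (vendor model-card numbers, not my own measurements).
scores = [70.7, 53.4, 54.8]

def quality_average(values):
    """Unweighted mean, rounded to one decimal like the Avg figure."""
    return round(sum(values) / len(values), 1)

avg = quality_average(scores)
print(avg)
```

How this average is then weighted per Arena profile (Aggregate/Agent versus Batch) is not covered by this sketch.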

03 Performance · BF16

Decode throughput · total t/s · c=10

BF16

1k ctx 339 t/s

8k ctx 219 t/s

4k+turn 401 t/s

25k ctx 61.0 t/s

04 Test suite · 9 benchmarks

5 closed-loop tests with llama-benchy and 4 open-loop tests with vllm bench serve. For each benchmark: tokens/sec (decode throughput) and TTFT p50. TTFT translates directly into perceived UX, tps into capacity. The exact command is listed with each test.
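To make the two metrics concrete: a rough estimate of how long one streamed answer takes is TTFT plus output length divided by decode rate. A sketch under the assumption that the tokens/sec figure is a per-user decode rate; the function name is hypothetical and the inputs are the Chat test figures (490 ms, 1024 generated tokens, 48.7 t/s).

```python
def response_time_s(ttft_s, output_tokens, decode_tps):
    """Approximate wall time for one answer: wait for the first token, then decode."""
    if decode_tps <= 0:
        raise ValueError("decode_tps must be positive")
    return ttft_s + output_tokens / decode_tps

# Chat test: TTFT p50 490 ms, 1024 generated tokens, 48.7 t/s.
chat_s = response_time_s(ttft_s=0.490, output_tokens=1024, decode_tps=48.7)
print(f"{chat_s:.1f} s for a full 1024-token answer")
```

The estimate makes the split visible: TTFT decides how quickly the answer starts, the decode rate decides how long the rest takes.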

01 · llama-benchy closed-loop

Chat

Short prompt, long answer. The shape that should feel like normal chat; TTFT determines whether it feels "snappy".

pp (prompt) 1024 · tg (gen) 1024 · depth 0 · concurrency 10 · runs 3

tokens/sec: 48.7 t/s

TTFT · p50: 490 ms

3 runs · seed 42

uvx llama-benchy \
  --base-url http://localhost:8000/v1 \
  --model mistralai/Ministral-3-3B-Instruct-2512 \
  --pp 1024 \
  --tg 1024 \
  --depth 0 \
  --concurrency 10 \
  --runs 3 \
  --latency-mode generation \
  --format md

02 · llama-benchy closed-loop

RAG · 8k context

Medium-sized context: a few document chunks with a normal-length answer. Shows prefill cost without hitting the wall.

pp (prompt) 8192 · tg (gen) 512 · depth 0 · concurrency 10 · runs 3

tokens/sec: 25.9 t/s

TTFT · p50: 3.34 s

3 runs · seed 42

uvx llama-benchy \
  --base-url http://localhost:8000/v1 \
  --model mistralai/Ministral-3-3B-Instruct-2512 \
  --pp 8192 \
  --tg 512 \
  --depth 0 \
  --concurrency 10 \
  --runs 3 \
  --latency-mode generation \
  --format md

03 · llama-benchy closed-loop

Long output / agents

Short instruction, lots of output. Code generation, reports, or structured agent output. Stress test for decode throughput.

pp (prompt) 256 · tg (gen) 4096 · depth 0 · concurrency 10 · runs 3

tokens/sec: 52.5 t/s

TTFT · p50: 190 ms

3 runs · seed 42

uvx llama-benchy \
  --base-url http://localhost:8000/v1 \
  --model mistralai/Ministral-3-3B-Instruct-2512 \
  --pp 256 \
  --tg 4096 \
  --depth 0 \
  --concurrency 10 \
  --runs 3 \
  --latency-mode generation \
  --format md

04 · llama-benchy closed-loop

Large context · 25k

Stress test with large prompts. Not necessarily chat material, but exactly where the prefill wall becomes visible and TTFT collapses.

pp (prompt) 25000 · tg (gen) 256 · depth 0 · concurrency 10 · runs 3

tokens/sec: 9.6 t/s

TTFT · p50: 15.60 s

3 runs · seed 42

uvx llama-benchy \
  --base-url http://localhost:8000/v1 \
  --model mistralai/Ministral-3-3B-Instruct-2512 \
  --pp 25000 \
  --tg 256 \
  --depth 0 \
  --concurrency 10 \
  --runs 3 \
  --latency-mode generation \
  --format md

05 · llama-benchy closed-loop

Multi-turn · office work

Five turns per conversation, ten conversations in parallel. Close to how a team actually uses this, with context growing every turn.

pp (prompt) 2048 · tg (gen) 512 · depth 4 · concurrency 10 · runs 3

tokens/sec: 43.2 t/s

TTFT · p50: 820 ms

3 runs · seed 42

uvx llama-benchy \
  --base-url http://localhost:8000/v1 \
  --model mistralai/Ministral-3-3B-Instruct-2512 \
  --pp 2048 \
  --tg 512 \
  --depth 4 \
  --concurrency 10 \
  --runs 3 \
  --latency-mode generation \
  --format md

06 · vllm bench serve open-loop

Realistic office baseline

Random dataset · 4000 tokens in, 500 tokens out · request-rate 0.3, burstiness 0.7. A quiet office.

dataset random · rate (req/s) 0.30 · burstiness 0.7 · prompts 200

tokens/sec: 122 t/s

TTFT · p50: 407 ms

200 prompts · seed 42

docker exec vllm-bench vllm bench serve \
  --backend openai-chat \
  --base-url http://localhost:8000 \
  --endpoint /v1/chat/completions \
  --model mistralai/Ministral-3-3B-Instruct-2512 \
  --tokenizer mistralai/Ministral-3-3B-Instruct-2512 \
  --served-model-name Ministral-3-3B-Instruct-2512 \
  --dataset-name random \
  --random-input-len 4000 \
  --random-output-len 500 \
  --random-range-ratio 0.9 \
  --num-prompts 200 \
  --request-rate 0.30 \
  --burstiness 0.7 \
  --percentile-metrics ttft,tpot,itl,e2el \
  --metric-percentiles 50,90,95,99 \
  --seed 42
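What request-rate and burstiness mean for an open-loop run can be sketched as an arrival process. This assumes the load generator draws the gaps between requests from a gamma distribution with shape equal to the burstiness and mean 1/request-rate, which is how I understand vLLM's benchmark to work; it is an illustration with a hypothetical function name, not vLLM's own code. With burstiness 1.0 the gaps are exponential (a Poisson process); below 1.0 the traffic gets burstier.

```python
import random

def arrival_times(request_rate, burstiness, num_requests, seed=42):
    """Cumulative send times in seconds, with gamma-distributed gaps of mean 1/request_rate."""
    rng = random.Random(seed)
    scale = 1.0 / (request_rate * burstiness)  # shape * scale = 1 / request_rate
    t, times = 0.0, []
    for _ in range(num_requests):
        t += rng.gammavariate(burstiness, scale)
        times.append(t)
    return times

# Office baseline: 200 prompts at 0.30 req/s, burstiness 0.7.
times = arrival_times(request_rate=0.30, burstiness=0.7, num_requests=200)
print(f"last request sent after ~{times[-1]:.0f} s")
```

Open-loop means these send times do not wait for earlier responses, so a slow server shows up as a growing queue rather than as a lower request rate.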

07 · vllm bench serve open-loop

Real conversations · ShareGPT

ShareGPT V3 · 228 tokens per turn on average · varying naturally per conversation. What real users do, not a synthetic random distribution.

dataset sharegpt v3 · rate (req/s) 0.30 · burstiness 0.7 · prompts 250

tokens/sec: 18.1 t/s

TTFT · p50: 69 ms

250 prompts · seed 42

docker exec vllm-bench vllm bench serve \
  --backend openai-chat \
  --base-url http://localhost:8000 \
  --endpoint /v1/chat/completions \
  --model mistralai/Ministral-3-3B-Instruct-2512 \
  --tokenizer mistralai/Ministral-3-3B-Instruct-2512 \
  --served-model-name Ministral-3-3B-Instruct-2512 \
  --dataset-name sharegpt \
  --dataset-path /tmp/ShareGPT_V3.json \
  --num-prompts 250 \
  --request-rate 0.30 \
  --burstiness 0.7 \
  --percentile-metrics ttft,tpot,itl,e2el \
  --metric-percentiles 50,90,95,99 \
  --seed 42

08 · vllm bench serve open-loop

Monday-morning peak

Random · 4000 in / 500 out · request-rate 1.5 req/s, burstiness 1.0, max 25 parallel. When everyone logs in at once, do we see the queue grow?

dataset random · rate (req/s) 1.50 · burstiness 1.0 · prompts 300 · max parallel 25

tokens/sec: 126 t/s

TTFT · p50: 514 ms

300 prompts · seed 42

docker exec vllm-bench vllm bench serve \
  --backend openai-chat \
  --base-url http://localhost:8000 \
  --endpoint /v1/chat/completions \
  --model mistralai/Ministral-3-3B-Instruct-2512 \
  --tokenizer mistralai/Ministral-3-3B-Instruct-2512 \
  --served-model-name Ministral-3-3B-Instruct-2512 \
  --dataset-name random \
  --random-input-len 4000 \
  --random-output-len 500 \
  --random-range-ratio 0.9 \
  --num-prompts 300 \
  --request-rate 1.50 \
  --burstiness 1.0 \
  --max-concurrency 25 \
  --percentile-metrics ttft,tpot,itl,e2el \
  --metric-percentiles 50,90,95,99 \
  --seed 42
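The summary at the top reports 0.78 RPS completed out of 1.5 scheduled for this peak. A rough back-of-the-envelope sketch of what that gap means; the function and key names are hypothetical, and it ignores how the max-concurrency cap delays sending, so treat the times as orders of magnitude.

```python
def peak_summary(num_prompts, offered_rps, completed_rps):
    """Compare how fast requests are scheduled with how fast they finish."""
    return {
        "completion_ratio": completed_rps / offered_rps,
        "send_window_s": num_prompts / offered_rps,    # time to schedule every request
        "drain_time_s": num_prompts / completed_rps,   # time to finish them all
    }

summary = peak_summary(num_prompts=300, offered_rps=1.5, completed_rps=0.78)
for key, value in summary.items():
    print(f"{key}: {value:.2f}")
```

Whenever the drain time exceeds the send window, requests are arriving faster than they complete, which is the queue growth this test is looking for.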

09 · vllm bench serve open-loop

Reasoning workload

Long chain-of-thought outputs · 1k in / 4k out · slow rate (0.2 req/s) because each request costs a lot of decode budget. Tests whether TTFT stays stable.

dataset random · rate (req/s) 0.20 · burstiness 1.0 · prompts 50

tokens/sec: 20.1 t/s

TTFT · p50: 161 ms

50 prompts · seed 42

docker exec vllm-bench vllm bench serve \
  --backend openai-chat \
  --base-url http://localhost:8000 \
  --endpoint /v1/chat/completions \
  --model mistralai/Ministral-3-3B-Instruct-2512 \
  --tokenizer mistralai/Ministral-3-3B-Instruct-2512 \
  --served-model-name Ministral-3-3B-Instruct-2512 \
  --dataset-name random \
  --random-input-len 1000 \
  --random-output-len 4000 \
  --random-range-ratio 0.9 \
  --num-prompts 50 \
  --request-rate 0.20 \
  --burstiness 1.0 \
  --percentile-metrics ttft,tpot,itl,e2el \
  --metric-percentiles 50,90,95,99 \
  --seed 42

05 What I thought

What works

8B-level quality in a 3B frame

GPQA Diamond 53.4 and LiveCodeBench 54.8 are numbers you normally need the 8B class for. MMLU 5-shot 70.7 completes the picture. A strong entry for the first Mistral model in the Arena.

What broke

Mistral stack requires its own flags

vLLM only loads Ministral-3 with --tokenizer_mode mistral --config_format mistral --load_format mistral. Without those three flags, model load crashes immediately. They are already in the profile, but for anyone coming from Qwen or Gemma: don't forget them.
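For anyone scripting the launch, the three flags can be pinned in one place. A minimal sketch: the flag spellings are copied from the text above, `vllm serve <model>` is the usual entry point, and the helper name is hypothetical. Any other options from the profile are left out because they are not listed here.

```python
# The three Mistral-format flags, spelled as in the text above.
MISTRAL_FLAGS = [
    "--tokenizer_mode", "mistral",
    "--config_format", "mistral",
    "--load_format", "mistral",
]

def vllm_serve_command(model):
    """Build the argv list for `vllm serve` with the Mistral-format flags appended."""
    return ["vllm", "serve", model, *MISTRAL_FLAGS]

cmd = vllm_serve_command("mistralai/Ministral-3-3B-Instruct-2512")
print(" ".join(cmd))
```

The list form can be passed straight to `subprocess.run(cmd)` without shell quoting issues.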

What disappointed

25k×10 concurrent breaks the KV cache

Per-user decode at pp25000 c=10 drops to 9.6 t/s, and prefill takes a median of 15.6 seconds. Dense BF16 without quantization simply has too little KV budget at this size for ten parallel 25k sessions. Not suitable for long-context workloads with ten simultaneous users.
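The KV pressure can be estimated with the standard formula for plain attention: keys plus values, per layer, per KV head, per token. This is a sketch with a hypothetical function name, and the layer and head counts in the example are placeholders, not values read from the model's config. Substitute the real ones before drawing conclusions. It also assumes a 2-byte (BF16) cache and no sliding-window or other cache-saving tricks.

```python
def kv_cache_bytes(tokens, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """KV-cache size: keys and values (factor 2) for every layer, KV head and token."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * tokens

# Tokens held at once in the 25k test: 10 sessions x (25000 prompt + 256 generated).
tokens_in_flight = 10 * (25000 + 256)

# PLACEHOLDER architecture values, NOT taken from the model's config.
total = kv_cache_bytes(tokens_in_flight, n_layers=32, n_kv_heads=8, head_dim=128)
print(f"{tokens_in_flight} tokens -> {total / 2**30:.1f} GiB (placeholder shapes)")
```

The token count is the part that comes from the test itself; the bytes-per-token factor depends entirely on the real architecture and cache dtype.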

What surprised

Long-output decode above average

On the long-output test (256 in / 4096 out, c=10) it hits 52.55 t/s per user with 340 t/s aggregate, faster than on chat itself (48.7 / 339). Mistral's decode loop is apparently better optimized than its handling of prefill-heavy workloads.

More numbers?
Read the full article.
