---
title: "Gemma-4-26B-A4B-it (BF16) · DGX Spark Arena"
canonical: "https://djangodevreng.nl/fr/arena/gemma-4-26b-a4b-it-bf16-v23/"
license: "CC-BY-4.0"
source: "https://github.com/djangodevreng/dgx-spark-benchmarks"
attribution: "Django de Vreng, https://djangodevreng.nl"
---

# Gemma-4-26B-A4B-it (BF16)

Control baseline for vLLM v0.23.0. Chat reaches 11.47 t/s/user at c=10, multi-turn reaches 10.69 t/s/user, and the office baseline stays green at 200/200. Useful as a reference, but MTP and NVFP4 show how much decode headroom is still available.

## Specifications

- Vendor: Google
- Architecture: MoE
- Parameters: 26B-A4B
- Precision: BF16
- Context: 256K
- VRAM: 52 GB
- Engine: vLLM v0.23.0
- Hardware: DGX Spark, NVIDIA GB10, 128 GB unified memory
- Model card: https://huggingface.co/google/gemma-4-26B-A4B-it
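
The 52 GB figure matches a weights-only estimate: 26 billion parameters at 2 bytes each in BF16. A minimal sketch of that arithmetic, assuming decimal gigabytes and ignoring KV cache, activations, and runtime overhead (the function name is illustrative, and "26B" is itself a rounded count):

```python
def weight_memory_gb(params_billion: float, bytes_per_param: float) -> float:
    """Weights-only footprint in decimal GB (1 GB = 1e9 bytes)."""
    return params_billion * 1e9 * bytes_per_param / 1e9


# BF16 stores each parameter in 2 bytes.
bf16_gb = weight_memory_gb(26, 2)
print(f"{bf16_gb:.0f} GB")
```

Because this is a MoE model, the estimate uses the total parameter count rather than the active count: all experts have to be resident in memory even though only a subset is used per token.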

## Quality (model cards)

| Benchmark | Score |
| --- | --- |
| MMLU-Pro | 82.6 |
| GPQA | 82.3 |
| HumanEval / LCB | 77.1 |
| Avg | 80.7 |

## Benchmarks on the DGX Spark

| Test | tokens/s per user | tokens/s total | TTFT (ms) |
| --- | --- | --- | --- |
| 01 Chat | 11 | 91 | 1343 |
| 02 RAG · 8k context | 10 | 78 | 8488 |
| 03 Long output / agents | 12 | 87 | 491 |
| 04 Large context · 25k | 5 | 28 | 39281 |
| 05 Multi-turn · office work | 11 | 98 | 2155 |
| 06 Realistic office baseline | 34 | 34 | 1330 |
| 07 Real conversations · ShareGPT | 8 | 8 | 327 |
| 08 Monday-morning peak | 44 | 44 | 1178 |
| 09 Reasoning workload | n/a | n/a | n/a |

## My take

**What worked: NVFP4 is the practical choice**

Chat runs at 21.59 t/s/user and multi-turn at 20.01 t/s/user at c=10. For local office chat, that does not feel like a compromise.
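
To put the NVFP4 numbers next to the BF16 baseline from the top of this page, the per-user decode rates can be turned into speedup ratios. A small sketch using only figures quoted on this page (the dictionary keys and helper name are illustrative):

```python
# Per-user decode rates at c=10, in tokens/s per user, as quoted on this page.
bf16 = {"chat": 11.47, "multi_turn": 10.69}
nvfp4 = {"chat": 21.59, "multi_turn": 20.01}


def speedup(baseline: float, candidate: float) -> float:
    """Ratio of candidate throughput to baseline throughput."""
    return candidate / baseline


ratios = {name: speedup(bf16[name], nvfp4[name]) for name in bf16}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x")
```

Both workloads come out a little under 2x, which is the decode headroom the BF16 run leaves unused.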

**What broke: 25k context remains painful in prefill**

Even NVFP4 sits at a mean TTFT of 38.58 s at 25k context and c=10. The serving profile helps decode, not the wait before large prompts.

**What it cost: MTP buys decode, not a perfect tail**

MTP beats BF16 on decode, but under the Monday-morning peak its p95 TTFT and p95 TPOT are worse than BF16's. Percentiles remain necessary.
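
Why a mean alone can hide this: two latency distributions can share the same average while their tails differ widely. The sketch below uses made-up TTFT samples, not data from these runs, purely to illustrate the effect; `p95` is a hypothetical helper built on the standard library.

```python
import statistics


def p95(samples: list[float]) -> float:
    """95th percentile: the 95th of the 99 cut points from statistics.quantiles."""
    return statistics.quantiles(samples, n=100, method="inclusive")[94]


# Hypothetical TTFT samples in ms (NOT measured values from this page).
steady = [1000.0] * 100                 # every request takes the same time
spiky = [800.0] * 90 + [2800.0] * 10    # most are faster, one in ten is slow

for name, samples in (("steady", steady), ("spiky", spiky)):
    print(f"{name}: mean={statistics.mean(samples):.0f} ms, p95={p95(samples):.0f} ms")
```

The two series have identical means, yet the second has a much higher p95, which is the pattern a mean-only comparison would miss.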

**What surprised: the ShareGPT replay is very forgiving**

NVFP4 completes 250/250 requests with a p95 TTFT of 225.09 ms and a p95 TPOT of 45.30 ms. Real short conversations are far lighter than random 4k.
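
For comparison with the tokens/s columns, a TPOT value can be converted into a per-user decode rate by taking its reciprocal. A minimal sketch, assuming TPOT here means the time per output token of a single request after the first token (the function name is illustrative):

```python
def tpot_ms_to_tokens_per_s(tpot_ms: float) -> float:
    """Convert time per output token (ms) into tokens per second for one stream."""
    return 1000.0 / tpot_ms


# p95 TPOT from the ShareGPT replay quoted above.
rate = tpot_ms_to_tokens_per_s(45.30)
print(f"{rate:.1f} tokens/s")
```

Since this is a p95 value, it describes the slow end of the distribution: under that assumption, 95% of requests decode at least this fast.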

## Notes

Google's BF16 model on vLLM v0.23.0, with fp8 KV cache, prefix caching disabled, and gpu-memory-utilization 0.85. New suite run on 2026-06-22 and 2026-06-23.
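
The exact launch command is not given on this page. As a rough sketch, the settings listed above map onto `vllm serve` flags along these lines, assuming the flag names of recent vLLM releases still apply in v0.23.0 (check `vllm serve --help` before relying on them). The helper name is hypothetical and the model ID is taken from the model card URL.

```python
import shlex


def build_serve_command(model: str) -> list[str]:
    """Assemble a vLLM launch command mirroring the settings in the notes."""
    return [
        "vllm", "serve", model,
        "--dtype", "bfloat16",                # BF16 weights
        "--kv-cache-dtype", "fp8",            # fp8 KV cache
        "--gpu-memory-utilization", "0.85",
        "--no-enable-prefix-caching",         # prefix caching disabled
    ]


cmd = build_serve_command("google/gemma-4-26B-A4B-it")
print(shlex.join(cmd))
```

Building the command as a list keeps each flag and its value separate, so it can be passed to `subprocess.run` without shell quoting issues.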

---

License: CC-BY-4.0 (https://creativecommons.org/licenses/by/4.0/). Django de Vreng, https://djangodevreng.nl.
Full arena: https://djangodevreng.nl/fr/arena/ · Raw runs (GitHub): https://github.com/djangodevreng/dgx-spark-benchmarks