---
title: "Gemma-4-26B-A4B-it (NVFP4) · DGX Spark Arena"
canonical: "https://djangodevreng.nl/fr/arena/gemma-4-26b-a4b-nvfp4/"
license: "CC-BY-4.0"
source: "https://github.com/djangodevreng/dgx-spark-benchmarks"
attribution: "Django de Vreng, https://djangodevreng.nl"
---

# Gemma-4-26B-A4B-it (NVFP4)

NVFP4 on the same Gemma-4-26B-A4B is almost a free lunch on this hardware. Chat decode runs at 20.9 t/s/user (BF16: 10.9), agents at 22.5 t/s/user, and even 25k context holds 7.6 t/s/user. Under the Monday-morning peak, total throughput reaches nearly 2,000 tokens/s with P99 TTFT under 7 s. This is what the Spark wants to run.

## Specifications

- Vendor: NVIDIA (re-quantization of Google's model)
- Architecture: MoE
- Parameters: 26B-A4B
- Precision: NVFP4
- VRAM: 24 GB
- Engine: vLLM v0.20.1
- Hardware: DGX Spark, NVIDIA GB10, 128 GB unified memory
- Model card: https://huggingface.co/nvidia/Gemma-4-26B-A4B-NVFP4

## Benchmarks on the DGX Spark

| Test | tokens/s per user | tokens/s total | TTFT (ms) |
| --- | --- | --- | --- |
| 01 Chat | 21 | n/a | 1100 |
| 02 RAG · 8k context | 16 | n/a | 8000 |
| 03 Long output / agents | 23 | n/a | 370 |
| 04 Large context · 25k | 8 | n/a | 35650 |
| 05 Multi-turn · office work | 20 | n/a | 1940 |
| 06 Realistic office baseline | 81 | n/a | 1006 |
| 07 Real conversations · ShareGPT | 13 | n/a | 152 |
| 08 Monday-morning peak | 73 | n/a | 920 |
| 09 Reasoning workload | 9 | n/a | 356 |

## My take

**What worked: NVFP4 nearly doubles decode, almost for free**

Chat 20.9 vs 10.9 t/s/user, agents 22.5 vs 11.8, multi-turn 19.5 vs 10.4. That is nearly 2× decode throughput for the same MoE, on the same Spark, at less than 0.5% quality drift according to NVIDIA's own evals.
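The "nearly 2×" figure can be checked directly from the three pairs quoted above. A minimal sketch, using only the numbers in this paragraph; the names `RATES` and `speedup` are illustrative and not part of the benchmark suite:

```python
# Per-user decode rates (tokens/s/user) quoted in the paragraph above,
# stored as (NVFP4, BF16) pairs.
RATES = {
    "chat": (20.9, 10.9),
    "agents": (22.5, 11.8),
    "multi-turn": (19.5, 10.4),
}


def speedup(nvfp4: float, bf16: float) -> float:
    """Ratio of NVFP4 to BF16 per-user decode rate."""
    return nvfp4 / bf16


speedups = {name: speedup(*pair) for name, pair in RATES.items()}

for name, ratio in speedups.items():
    print(f"{name}: {ratio:.2f}x")
```

All three ratios land at roughly 1.9×, which is where the "nearly 2×" reading comes from.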

**What broke: 25k context stays expensive**

Mean TTFT is 35.6 s at c=10, virtually the same as BF16. Quantization helps decode, but the prefill wall at 25k is hardware-limited, not precision-limited.
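To put the prefill wall in perspective, the mean TTFT can be turned into a rough per-request prompt-processing rate. This is a back-of-the-envelope sketch, not a measured figure: it assumes "25k" means exactly 25,000 prompt tokens, and because TTFT at c=10 also includes queueing behind the other requests, the result is an effective rate rather than raw prefill speed. The function name is hypothetical.

```python
def effective_prefill_rate(prompt_tokens: int, ttft_seconds: float) -> float:
    """Prompt tokens divided by time to first token, in tokens/s."""
    if ttft_seconds <= 0:
        raise ValueError("ttft_seconds must be positive")
    return prompt_tokens / ttft_seconds


PROMPT_TOKENS = 25_000  # assumption: "25k" taken as exactly 25,000 tokens
MEAN_TTFT_S = 35.65     # from the benchmark table (35650 ms)

rate = effective_prefill_rate(PROMPT_TOKENS, MEAN_TTFT_S)
print(f"{rate:.0f} prompt tokens/s per request")
```

Under these assumptions each request sees about 700 prompt tokens/s before its first output token arrives.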

**What it cost: single-stream decode is faster than c=10**

At c=1 on 4k context: 29.8 t/s/user. At c=10: 16.9 t/s/user. Not unexpected, but the gap is larger than with BF16 (24.1 to 9.5). Quantization is more sensitive to scheduling overhead.

**What surprised: total throughput hits 1984 t/s at peak**

Under J (max-concurrency 25, burstiness 1.0), total token throughput reaches 1984 t/s, one and a half times the BF16 version. NVFP4 on an MoE with 4B active parameters really takes off on this hardware.

## Notes

NVFP4 quantization by NVIDIA. 24 GB of weights, fp8 KV cache, prefix caching off. vLLM v0.20.1 with async scheduling. Full suite (A-J) run on 2026-05-06.
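For readers who want to approximate this setup, the settings above map onto a `vllm serve` invocation roughly as follows. This is a sketch under assumptions, not the exact command used for these runs: the flag names are the ones found in recent vLLM releases and have not been checked against v0.20.1, the model ID is read off the model card URL, and `build_serve_command` is a hypothetical helper that only assembles the argument list without launching anything.

```python
import shlex

# Assumption: model ID taken from the model card URL in the specifications.
MODEL_ID = "nvidia/Gemma-4-26B-A4B-NVFP4"


def build_serve_command(model_id: str) -> list[str]:
    """Assemble a vLLM server command mirroring the notes (assumed flags)."""
    return [
        "vllm", "serve", model_id,
        "--kv-cache-dtype", "fp8",      # fp8 KV cache
        "--no-enable-prefix-caching",   # prefix caching off
        "--async-scheduling",           # async scheduling on
    ]


command = build_serve_command(MODEL_ID)
print(shlex.join(command))
```

Building the command as a list keeps each setting on its own line, so a single flag can be toggled when comparing runs.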

---

License: CC-BY-4.0 (https://creativecommons.org/licenses/by/4.0/). Django de Vreng, https://djangodevreng.nl.
Full arena: https://djangodevreng.nl/fr/arena/ · Raw runs (GitHub): https://github.com/djangodevreng/dgx-spark-benchmarks