---
title: "Gemma-4-26B-A4B-it (NVFP4) · DGX Spark Arena"
canonical: "https://djangodevreng.nl/en/arena/gemma-4-26b-a4b-nvfp4-v23/"
license: "CC-BY-4.0"
source: "https://github.com/djangodevreng/dgx-spark-benchmarks"
attribution: "Django de Vreng, https://djangodevreng.nl"
---

# Gemma-4-26B-A4B-it (NVFP4)

Best practical v23 profile for this Spark. Chat reaches 21.59 t/s/user, multi-turn reaches 20.01 t/s/user, and ShareGPT replay stays at a p95 TTFT of 225.09 ms. A 25k context still hurts on prefill, but normal chat and agents feel fast locally.

## Specs

- Vendor: NVIDIA (re-quant of Google's model)
- Architecture: MoE
- Parameters: 26B-A4B
- Precision: NVFP4
- Context: 256K
- VRAM: 24 GB
- Engine: vLLM v0.23.0
- Hardware: DGX Spark, NVIDIA GB10, 128 GB unified memory
- Model card: https://huggingface.co/nvidia/Gemma-4-26B-A4B-NVFP4

## Quality (model cards)

| Benchmark | Score |
| --- | --- |
| MMLU-Pro | 84.8 |
| GPQA | 79.9 |
| HumanEval / LCB | 79.8 |
| Avg | 81.5 |

## Benchmarks on the DGX Spark

| Test | tokens/s per user | tokens/s total | TTFT (ms) |
| --- | --- | --- | --- |
| 01 Chat | 22 | 151 | 1138 |
| 02 RAG · 8k context | 16 | 119 | 8309 |
| 03 Long output / agents | 24 | 121 | 369 |
| 04 Large context · 25k | 7 | 34 | 38575 |
| 05 Multi-turn · office work | 20 | 183 | 1966 |
| 06 Realistic office baseline | 82 | 82 | 1051 |
| 07 Real conversations · ShareGPT | 13 | 13 | 146 |
| 08 Monday-morning peak | 73 | 73 | 967 |
| 09 Reasoning workload | n/a | n/a | n/a |

## What I made of it

**Worked: NVFP4 is the practical default**

Chat runs at 21.59 t/s/user and multi-turn at 20.01 t/s/user, both at c=10. For local office chat this does not feel like a compromise.

**Broke: 25k context is still prefill pain**

Even NVFP4 sits at 38.58 s average TTFT for a 25k context at c=10. The serving profile helps decode, not the wait before large prompts.
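To put that wait in perspective, here is a back-of-the-envelope sketch of the implied prefill rate. It rests on two assumptions that are not confirmed by the benchmark harness: that each prompt is exactly 25,000 tokens, and that average TTFT is dominated by prefill rather than queueing. The function name and constants are illustrative only.

```python
# Rough prefill estimate for the 25k test.
# ASSUMPTIONS (not taken from the benchmark harness): the prompt is exactly
# 25,000 tokens and average TTFT is dominated by prefill, not queueing.
PROMPT_TOKENS = 25_000   # assumed prompt length
AVG_TTFT_S = 38.575      # average TTFT from the benchmark table (38575 ms)
CONCURRENCY = 10         # c=10


def effective_prefill_rate(prompt_tokens: int, ttft_s: float) -> float:
    """Prompt tokens processed per second, as seen by a single request."""
    return prompt_tokens / ttft_s


per_request = effective_prefill_rate(PROMPT_TOKENS, AVG_TTFT_S)
# Rough aggregate if all ten prompts are prefilled in the same window.
aggregate = per_request * CONCURRENCY

print(f"per request: {per_request:.0f} prompt tokens/s")
print(f"aggregate at c={CONCURRENCY}: {aggregate:.0f} prompt tokens/s")
```

Under those assumptions each request sees roughly 650 prompt tokens per second, which is why a faster decode profile does little for the wait before the first token.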

**Cost: MTP buys decode, not perfect tail**

MTP beats BF16 on decode, but under Monday-morning peak load its p95 TTFT and p95 TPOT are worse than BF16's. Percentiles still matter.
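Why percentiles matter can be shown with a small sketch. The samples below are hypothetical, not measured data: two made-up TTFT profiles with the same mean but very different tails. The `percentile` helper uses the nearest-rank definition, which may differ from what the benchmark tool reports.

```python
import math
import statistics


def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: smallest sample with at least p% of values at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


# HYPOTHETICAL TTFT samples in ms (illustration only, not from the runs).
steady = [950, 960, 970, 980, 990, 1000, 1010, 1020, 1030, 1040]
spiky = [700, 710, 720, 730, 740, 750, 760, 770, 780, 3290]

for name, samples in (("steady", steady), ("spiky", spiky)):
    print(
        f"{name}: mean={statistics.mean(samples)} ms, "
        f"p95={percentile(samples, 95)} ms"
    )
```

Both profiles average 995 ms, yet the second one has a p95 more than three times higher. An average-only table would call them equal.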

**Surprised: ShareGPT replay is extremely friendly**

NVFP4 completes 250/250 requests with p95 TTFT 225.09 ms and p95 TPOT 45.30 ms. Real short conversations are much lighter than random 4k.
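TPOT (time per output token) is the inverse of per-stream decode speed, so the p95 figure converts directly to tokens per second. A minimal sketch, with a helper name of my own choosing:

```python
def tpot_ms_to_tokens_per_s(tpot_ms: float) -> float:
    """Convert time per output token (ms) into decode speed (tokens/s)."""
    return 1000.0 / tpot_ms


p95_tpot_ms = 45.30  # p95 TPOT from the ShareGPT replay
print(f"{tpot_ms_to_tokens_per_s(p95_tpot_ms):.1f} tokens/s")  # prints "22.1 tokens/s"
```

In other words, 95% of the replayed requests decoded at about 22 t/s or faster once the first token had arrived.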

## Notes

NVIDIA NVFP4 re-quant on vLLM v0.23.0. KV-cache fp8, prefix caching off, gpu-memory-utilization 0.85. Full suite run on 2026-06-23.
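The settings above map onto `vllm serve` flags roughly as follows. This is a sketch, not the exact command used for the runs: the model ID is taken from the model card URL, only the three settings named in the notes are included, and any other options (context length, MTP or speculative settings, port) are left out because they are not stated here. Check the flag names against `vllm serve --help` for your installed version.

```python
import shlex

# Model ID taken from the model card URL in the Specs section.
MODEL_ID = "nvidia/Gemma-4-26B-A4B-NVFP4"


def build_serve_command(model: str = MODEL_ID) -> list[str]:
    """Build an argv list for `vllm serve` with the three settings from the notes.

    ASSUMPTION: this covers only the documented settings; the real launch
    command for the benchmark may have carried additional flags.
    """
    return [
        "vllm", "serve", model,
        "--kv-cache-dtype", "fp8",           # KV-cache fp8
        "--no-enable-prefix-caching",        # prefix caching off
        "--gpu-memory-utilization", "0.85",  # gpu-memory-utilization 0.85
    ]


# Print a shell-safe command line; pass the list to subprocess.run to launch.
print(shlex.join(build_serve_command()))
```

Building the command as a list keeps it easy to diff between profiles and avoids shell-quoting mistakes when more flags are added.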

---

License: CC-BY-4.0 (https://creativecommons.org/licenses/by/4.0/). Django de Vreng, https://djangodevreng.nl.
Full arena: https://djangodevreng.nl/en/arena/ · Raw runs (GitHub): https://github.com/djangodevreng/dgx-spark-benchmarks
