---
title: "Gemma-4-26B-A4B-it (BF16) · DGX Spark Arena"
canonical: "https://djangodevreng.nl/en/arena/gemma-4-26b-a4b-it-bf16-v23/"
license: "CC-BY-4.0"
source: "https://github.com/djangodevreng/dgx-spark-benchmarks"
attribution: "Django de Vreng, https://djangodevreng.nl"
---

# Gemma-4-26B-A4B-it (BF16)

BF16 control line for vLLM v0.23.0. Chat reaches 11.47 t/s/user at c=10, multi-turn reaches 10.69 t/s/user, and the office baseline stays green at 200/200. Useful as a reference, but the MTP and NVFP4 runs show how much decode speed BF16 leaves on the table.

## Specs

- Vendor: Google
- Architecture: MoE
- Parameters: 26B-A4B
- Precision: BF16
- Context: 256K
- VRAM: 52 GB
- Engine: vLLM v0.23.0
- Hardware: DGX Spark, NVIDIA GB10, 128 GB unified memory
- Model card: https://huggingface.co/google/gemma-4-26B-A4B-it
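
The 52 GB figure matches a back-of-the-envelope weight estimate: BF16 stores each parameter in 2 bytes, and an MoE model keeps all experts in memory even though only a fraction (about 4B here, going by the A4B suffix) is active per token. A minimal sketch, assuming "26B" means a round 26e9 parameters and decimal gigabytes; the function name is illustrative, and KV cache and activations come on top of this number.

```python
def weight_memory_gb(total_params: float, bytes_per_param: float) -> float:
    """Approximate footprint of the weights alone, in decimal GB."""
    return total_params * bytes_per_param / 1e9


# Assumption: "26B" is treated as a round 26e9 total parameters.
bf16_gb = weight_memory_gb(26e9, 2)  # BF16 = 2 bytes per parameter
print(f"BF16 weights: ~{bf16_gb:.0f} GB")
```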

## Quality (model cards)

| Benchmark | Score |
| --- | --- |
| MMLU-Pro | 82.6 |
| GPQA | 82.3 |
| HumanEval / LCB | 77.1 |
| Avg | 80.7 |

## Benchmarks on the DGX Spark

| Test | tokens/s per user | tokens/s total | TTFT (ms) |
| --- | --- | --- | --- |
| 01 Chat | 11 | 91 | 1343 |
| 02 RAG · 8k context | 10 | 78 | 8488 |
| 03 Long output / agents | 12 | 87 | 491 |
| 04 Large context · 25k | 5 | 28 | 39281 |
| 05 Multi-turn · office work | 11 | 98 | 2155 |
| 06 Realistic office baseline | 34 | 34 | 1330 |
| 07 Real conversations · ShareGPT | 8 | 8 | 327 |
| 08 Monday-morning peak | 44 | 44 | 1178 |
| 09 Reasoning workload | n/a | n/a | n/a |

## What I made of it

**Worked: NVFP4 is the practical default**

NVFP4 reaches 21.59 t/s/user in chat and 20.01 t/s/user in multi-turn at c=10. For local office chat this does not feel like a compromise.
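
To put those numbers next to the BF16 control figures from the top of this page, a small sketch that computes the per-user speedup. The values are copied from this page; the dictionary keys are illustrative, and the comparison assumes both sets were measured at c=10.

```python
# Per-user decode rates in t/s/user, copied from this page.
bf16 = {"chat": 11.47, "multi_turn": 10.69}
nvfp4 = {"chat": 21.59, "multi_turn": 20.01}


def speedup(new: float, old: float) -> float:
    """Ratio of the new rate to the old rate."""
    return new / old


ratios = {name: speedup(nvfp4[name], bf16[name]) for name in bf16}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x")
```

Both ratios land just under 1.9x, which is the decode headroom the BF16 line leaves unused.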

**Broke: 25k context is still prefill pain**

Even NVFP4 averages 38.58 s TTFT at 25k context and c=10. The serving profile helps decode, not the wait before large prompts.

**Cost: MTP buys decode, not perfect tail**

MTP beats BF16 on decode, but under the Monday-morning peak load its p95 TTFT and p95 TPOT are worse than BF16's. Percentiles still matter.
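
Why a tail percentile can tell a different story than the mean is easy to show with a nearest-rank p95. This is a sketch only: how the benchmark suite computes its percentiles is not stated here, the function name is mine, and the sample latencies are made up for illustration, not taken from any run.

```python
import math


def percentile_nearest_rank(samples, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Illustrative latencies in ms, NOT measurements from these runs:
# 18 fast requests and 2 slow stragglers.
latencies_ms = [100] * 18 + [2000] * 2
mean_ms = sum(latencies_ms) / len(latencies_ms)
p50_ms = percentile_nearest_rank(latencies_ms, 50)
p95_ms = percentile_nearest_rank(latencies_ms, 95)
print(f"mean={mean_ms:.0f} ms, p50={p50_ms} ms, p95={p95_ms} ms")
```

In this toy sample the median stays at the fast value while the p95 lands on a straggler, which is the kind of gap an average hides.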

**Surprised: ShareGPT replay is extremely friendly**

NVFP4 completes 250/250 requests with p95 TTFT 225.09 ms and p95 TPOT 45.30 ms. Real, short conversations are much lighter than random 4k-token inputs.
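
TPOT (time per output token) converts directly into a per-request decode rate, which makes the p95 easier to compare with the t/s/user figures elsewhere on this page. A minimal sketch, assuming the simple reciprocal relation; the arena's own t/s/user numbers may be derived differently.

```python
def tpot_ms_to_tokens_per_s(tpot_ms: float) -> float:
    """Convert time per output token (ms) into a decode rate (tokens/s)."""
    if tpot_ms <= 0:
        raise ValueError("tpot_ms must be positive")
    return 1000.0 / tpot_ms


p95_tpot_ms = 45.30  # NVFP4 ShareGPT replay, from this page
rate = tpot_ms_to_tokens_per_s(p95_tpot_ms)
print(f"p95 TPOT {p95_tpot_ms} ms is about {rate:.1f} t/s")
```

Read that way, 95% of the replayed requests decode at roughly 22 t/s or faster.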

## Notes

Google BF16 model on vLLM v0.23.0, KV-cache fp8, prefix caching off, gpu-memory-utilization 0.85. New suite run on 2026-06-22 and 2026-06-23.
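
The exact launch command is not given on this page. As a rough sketch, the settings above would map onto `vllm serve` arguments along these lines. The flag names are the ones I know from vLLM's CLI and should be checked against `vllm serve --help` for v0.23.0; the helper function is hypothetical, and the model ID is taken from the model card URL.

```python
import shlex


def build_vllm_serve_command(model: str) -> list[str]:
    """Assemble an argv list mirroring the settings in the notes (assumed flag names)."""
    return [
        "vllm", "serve", model,
        "--dtype", "bfloat16",               # BF16 weights
        "--kv-cache-dtype", "fp8",           # KV-cache fp8
        "--no-enable-prefix-caching",        # prefix caching off
        "--gpu-memory-utilization", "0.85",  # gpu-memory-utilization 0.85
    ]


cmd = build_vllm_serve_command("google/gemma-4-26B-A4B-it")
print(shlex.join(cmd))
```

Any other options used in the actual runs (context length, batching limits, and so on) are not stated here and are left out on purpose.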

---

License: CC-BY-4.0 (https://creativecommons.org/licenses/by/4.0/). Django de Vreng, https://djangodevreng.nl.
Full arena: https://djangodevreng.nl/en/arena/ · Raw runs (GitHub): https://github.com/djangodevreng/dgx-spark-benchmarks