---
title: "Gemma-4-26B-A4B-it MTP (BF16 + MTP) · DGX Spark Arena"
canonical: "https://djangodevreng.nl/en/arena/gemma-4-26b-a4b-it-mtp-v23/"
license: "CC-BY-4.0"
source: "https://github.com/djangodevreng/dgx-spark-benchmarks"
attribution: "Django de Vreng, https://djangodevreng.nl"
---

# Gemma-4-26B-A4B-it MTP (BF16 + MTP)

A middle position on vLLM v0.23.0. MTP lifts chat to 17.79 t/s/user and multi-turn to 16.57 t/s/user without switching to the NVIDIA NVFP4 re-quant. Its tail latencies are not the best, but it decodes much faster than plain BF16.

## Specs

- Vendor: Google
- Architecture: MoE
- Parameters: 26B-A4B
- Precision: BF16 + MTP
- Context: 256K
- VRAM: 52 GB
- Engine: vLLM v0.23.0
- Hardware: DGX Spark, NVIDIA GB10, 128 GB unified memory
- Model card: https://huggingface.co/google/gemma-4-26B-A4B-it

## Quality (model cards)

| Benchmark | Score |
| --- | --- |
| MMLU-Pro | 82.6 |
| GPQA | 82.3 |
| HumanEval / LCB | 77.1 |
| Avg | 80.7 |
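
The Avg row looks like the unweighted mean of the three scores above it. The source does not say how it was computed, so this is an assumption. A minimal check:

```python
# Scores copied from the Quality table above.
scores = {
    "MMLU-Pro": 82.6,
    "GPQA": 82.3,
    "HumanEval / LCB": 77.1,
}

# Assumption: "Avg" is the plain arithmetic mean, with no weighting.
average = sum(scores.values()) / len(scores)

print(f"Avg: {average:.1f}")
```

The result rounds to the 80.7 shown in the table, which fits the unweighted-mean reading.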

## Benchmarks on the DGX Spark

| Test | tokens/s per user | tokens/s total | TTFT (ms) |
| --- | --- | --- | --- |
| 01 Chat | 18 | 139 | 1400 |
| 02 RAG · 8k context | 13 | 97 | 9519 |
| 03 Long output / agents | 18 | 128 | 564 |
| 04 Large context · 25k | 6 | 28 | 45640 |
| 05 Multi-turn · office work | 17 | 143 | 2368 |
| 06 Realistic office baseline | 48 | 48 | 1608 |
| 07 Real conversations · ShareGPT | 11 | 11 | 409 |
| 08 Monday morning peak | 53 | 53 | 1684 |
| 09 Reasoning workload | n/a | n/a | n/a |
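
To get a feel for what these rows mean for one user, the sketch below turns TTFT and the per-user rate into a rough wall-clock estimate for a single reply. It assumes the per-user column is a steady decode rate that starts after the first token, which the table does not state. The 300-token reply length is a hypothetical value chosen for illustration.

```python
def estimated_response_seconds(
    ttft_ms: float, tokens_per_s_per_user: float, output_tokens: int
) -> float:
    """Rough wall-clock time for one reply: wait for the first token, then decode.

    Assumes a constant decode rate, which real runs will not hold exactly.
    """
    if tokens_per_s_per_user <= 0:
        raise ValueError("tokens_per_s_per_user must be positive")
    if output_tokens < 0:
        raise ValueError("output_tokens must not be negative")
    return ttft_ms / 1000.0 + output_tokens / tokens_per_s_per_user


# Hypothetical reply length, not taken from the benchmark definition.
OUTPUT_TOKENS = 300

# TTFT and per-user rate copied from rows 01 and 04 of the table.
chat_seconds = estimated_response_seconds(1400, 18, OUTPUT_TOKENS)
large_context_seconds = estimated_response_seconds(45640, 6, OUTPUT_TOKENS)

print(f"01 Chat:          {chat_seconds:.1f} s")
print(f"04 Large context: {large_context_seconds:.1f} s")
```

Under these assumptions the chat reply is dominated by decode time, while the 25k case spends nearly half of its time waiting for the first token.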

## What I made of it

**Worked: NVFP4 is the practical default**

NVFP4 reaches 21.59 t/s/user on chat and 20.01 t/s/user on multi-turn at c=10 (10 concurrent users). For local office chat this does not feel like a compromise.

**Broke: 25k context is still prefill pain**

Even NVFP4 sits at 38.58 s average TTFT for 25k context at c=10. The serving profile helps decode, not the wait before large prompts.

**Cost: MTP buys decode, not a perfect tail**

MTP beats BF16 on decode, but under Monday peak load its p95 TTFT and p95 TPOT are worse than BF16. Percentiles still matter.
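
A small sketch of why the percentile view can disagree with the average. The TTFT samples are made up for illustration and are not from the arena runs. The p95 here uses the nearest-rank method, which may differ from how the benchmark tool computes it.

```python
import math
import statistics


def percentile_nearest_rank(samples: list[float], pct: int) -> float:
    """Return the pct-th percentile using the nearest-rank method."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    # Nearest rank: the smallest value with at least pct% of samples at or below it.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical TTFT samples in ms: 18 quick requests and 2 slow ones under load.
ttft_ms = [
    400, 420, 450, 430, 410, 440, 460, 415, 425, 435,
    445, 405, 455, 465, 470, 480, 490, 500, 3000, 5200,
]

median_ms = statistics.median(ttft_ms)
mean_ms = statistics.mean(ttft_ms)
p95_ms = percentile_nearest_rank(ttft_ms, 95)

print(f"median: {median_ms} ms, mean: {mean_ms} ms, p95: {p95_ms} ms")
```

Two slow requests out of twenty barely move the median, but they set the p95, which is the part a user notices during a peak.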

**Surprised: ShareGPT replay is extremely friendly**

NVFP4 completes 250/250 requests with p95 TTFT 225.09 ms and p95 TPOT 45.30 ms. Real short conversations are much lighter than random 4k-token prompts.

## Notes

Google model artifact with MTP profile on vLLM v0.23.0. KV-cache fp8, prefix caching off, gpu-memory-utilization 0.85. Full suite run on 2026-06-23.
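
As a rough illustration, the settings above could map onto a `vllm serve` command as sketched below. The flag names are the ones found in recent vLLM releases and should be checked against v0.23.0. The model id is taken from the model card URL. The MTP / speculative decoding settings are left out on purpose because they are not given here, so this is not the exact command used for the runs.

```python
import shlex

# Model id from the model card URL in the Specs section.
MODEL_ID = "google/gemma-4-26B-A4B-it"


def build_vllm_command(
    model_id: str,
    kv_cache_dtype: str = "fp8",
    gpu_memory_utilization: float = 0.85,
    prefix_caching: bool = False,
) -> list[str]:
    """Build an argument list for `vllm serve` from the settings in the Notes.

    Assumption: flag names follow recent vLLM releases. MTP settings are omitted.
    """
    args = [
        "vllm", "serve", model_id,
        "--kv-cache-dtype", kv_cache_dtype,
        "--gpu-memory-utilization", str(gpu_memory_utilization),
    ]
    # Prefix caching was off for this run.
    args.append(
        "--enable-prefix-caching" if prefix_caching else "--no-enable-prefix-caching"
    )
    return args


command = build_vllm_command(MODEL_ID)
print(shlex.join(command))
```

Building the command as a list keeps each setting visible and lets it be passed to `subprocess.run` without shell quoting issues.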

---

License: CC-BY-4.0 (https://creativecommons.org/licenses/by/4.0/). Django de Vreng, https://djangodevreng.nl.
Full arena: https://djangodevreng.nl/en/arena/ · Raw runs (GitHub): https://github.com/djangodevreng/dgx-spark-benchmarks
