The Arena.
17 models.
1 DGX Spark.

Models are ranked by throughput (tokens/sec) and weighted per use-case. No latency gates: a slow model stays on the list, just near the bottom. A 0.8B model wins its own class, not the main ring.


Methodology

How the Score is computed

score = wq·Q + wg·G + we·E

Three dimensions, three weights, summed: Q is Quality, G is Throughput, E is Efficiency. All dimensions are cohort-relative: each is a number from 0 to 100, where 100 is the best model in the active bucket.

The three dimensions

Quality

MMLU-Pro · GPQA · HumanEval, averaged. Normalised to cohort peak.

→ artificial analysis ↗

Throughput

tokens/sec on the preset bench (or mean over all 9 for Aggregate). Normalised to cohort peak.

→ vllm bench serve · llama-benchy ↗

Efficiency

Throughput ÷ VRAM. Tokens per second per GB. Normalised to cohort peak.

→ derived · per cohort ↗
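
A minimal sketch of how the three dimensions could be derived for one cohort, assuming each model carries raw benchmark scores, a measured tokens/sec and a VRAM footprint in GB. The field names (`mmlu_pro`, `gpqa`, `humaneval`, `tps`, `vram_gb`), the helper `normalise_to_peak` and the two sample models are hypothetical and only illustrate "normalised to cohort peak"; they are not taken from the Arena's own code.

```python
from statistics import mean


def normalise_to_peak(values):
    """Scale raw values so the best model in the cohort gets 100."""
    peak = max(values.values(), default=0.0)
    if peak <= 0:
        # Assumption: a cohort with no positive value scores 0 everywhere.
        return {name: 0.0 for name in values}
    return {name: 100.0 * v / peak for name, v in values.items()}


# Hypothetical cohort with made-up numbers, for illustration only.
cohort = {
    "model-a": {"mmlu_pro": 60, "gpqa": 40, "humaneval": 80, "tps": 40.0, "vram_gb": 20.0},
    "model-b": {"mmlu_pro": 30, "gpqa": 20, "humaneval": 40, "tps": 20.0, "vram_gb": 5.0},
}

# Quality: plain average of the three benchmarks, then cohort-normalised.
quality = normalise_to_peak(
    {n: mean([m["mmlu_pro"], m["gpqa"], m["humaneval"]]) for n, m in cohort.items()}
)
# Throughput: tokens/sec, cohort-normalised.
throughput = normalise_to_peak({n: m["tps"] for n, m in cohort.items()})
# Efficiency: tokens/sec per GB of VRAM, cohort-normalised.
efficiency = normalise_to_peak({n: m["tps"] / m["vram_gb"] for n, m in cohort.items()})
```

In this made-up cohort, model-a leads on Quality and Throughput while the smaller model-b leads on Efficiency, which is why the three dimensions are normalised separately.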

Weights per preset

Preset	Q	G	E	Bench
Aggregate	70%	20%	10%	mean over 9
Chat assistant	60%	30%	10%	chat
Agent / tool-use	70%	20%	10%	long-output
Batch / RAG offline	20%	70%	10%	rag-8k
Reasoning / long-output	60%	30%	10%	reasoning

Example: Aggregate

Say Q=99, G=48, E=22 →
score = 0.70·99 + 0.20·48 + 0.10·22
= 69.3 + 9.6 + 2.2 = 81.1
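
The same arithmetic as a short sketch, using the weights from the preset table. The names `PRESETS` and `weighted_score` are illustrative assumptions, not the Arena's actual identifiers.

```python
# Weights (wq, wg, we) per preset, copied from the table above.
PRESETS = {
    "Aggregate": (0.70, 0.20, 0.10),
    "Chat assistant": (0.60, 0.30, 0.10),
    "Agent / tool-use": (0.70, 0.20, 0.10),
    "Batch / RAG offline": (0.20, 0.70, 0.10),
    "Reasoning / long-output": (0.60, 0.30, 0.10),
}


def weighted_score(preset, q, g, e):
    """score = wq*Q + wg*G + we*E, with Q, G, E on the 0-100 cohort scale."""
    wq, wg, we = PRESETS[preset]
    return wq * q + wg * g + we * e


print(round(weighted_score("Aggregate", 99, 48, 22), 1))  # 81.1
```

With the same Q, G and E, the "Batch / RAG offline" preset gives 0.20·99 + 0.70·48 + 0.10·22 = 55.6, since it weights Throughput far more heavily than Quality.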

Colour codes in the Score column

81.1

Top

Top 25% of the cohort range. norm > 0.75.

54.0

Middle

Middle 50%. 0.25 ≤ norm ≤ 0.75.

26.4

Bottom

Bottom 25%. norm < 0.25.

Cohort-relative: norm = (score − min) / (max − min) over the visible models. Switch bucket or preset and the numbers recolour. The leader is always blue, the tail-ender always grey.
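
A sketch of the banding rule, assuming it is applied to the final scores of the visible models. The function name `colour_bands` and the band labels are hypothetical, and the handling of a cohort where every score is equal is an assumption, since the page does not describe that case.

```python
def colour_bands(scores):
    """Map each visible model's score to 'top', 'middle' or 'bottom'."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    span = hi - lo
    bands = {}
    for name, s in scores.items():
        # norm = (score - min) / (max - min) over the visible models.
        # Assumption: if all scores are equal, treat every model as the leader.
        norm = (s - lo) / span if span > 0 else 1.0
        if norm > 0.75:
            bands[name] = "top"
        elif norm < 0.25:
            bands[name] = "bottom"
        else:
            bands[name] = "middle"
    return bands


# The three sample scores from the legend above.
print(colour_bands({"a": 81.1, "b": 54.0, "c": 26.4}))
```

Because min and max come from the visible models only, the leader always has norm = 1 and the tail-ender norm = 0, whichever bucket or preset is active.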

Worth knowing

Cohort = bucket. Filter on "<8B" and a Throughput of 100 marks the fastest <8B model, not the global maximum. Scores from different buckets are therefore not directly comparable.
No measurement → tps 0. Models without benchmark data stay visible but drop to the bottom with "no measurement" in the Throughput column.
No latency gates. A slow model stays on the list, just near the bottom. No hidden SLA filter or quality-floor multiplier.
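
The "no measurement" rule can be sketched as follows, assuming a missing benchmark is stored as `None` and that rows are sorted by tokens/sec. The field names `name` and `tps` and the function `throughput_rows` are hypothetical.

```python
def throughput_rows(models):
    """Return (name, tps, label) rows sorted fastest first; nothing is filtered out."""
    rows = []
    for m in models:
        # A missing or zero measurement counts as 0 tokens/sec.
        tps = m.get("tps") or 0.0
        label = f"{tps:.1f} tok/s" if tps > 0 else "no measurement"
        rows.append((m["name"], tps, label))
    # No latency gate: slow and unmeasured models stay listed, just lower down.
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows


# Made-up models for illustration.
rows = throughput_rows([
    {"name": "unmeasured", "tps": None},
    {"name": "fast", "tps": 42.0},
    {"name": "slow", "tps": 3.5},
])
```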
