The Arena.
17 models.
1 DGX Spark.

Models are ranked by a composite score of quality, throughput (tokens/sec) and efficiency, weighted per use-case. No latency gates: a slow model stays on the list, just near the bottom. A 0.8B model wins its own class, not the main ring.

Models on the Pareto frontier (connected line) are non-dominated: no other model is both faster and smarter. Every model below the line is dominated: some other model beats it on both axes.
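The frontier test can be sketched in a few lines. This is a minimal illustration, not the page's actual implementation: the `Model` type, the sample values, and the use of the standard weak-dominance rule (at least as good on both axes, strictly better on one) are assumptions.

```python
from typing import NamedTuple


class Model(NamedTuple):
    name: str
    quality: float     # higher is smarter
    throughput: float  # tokens/sec, higher is faster


def pareto_frontier(models):
    """Return the models that no other model dominates."""
    def dominated(m):
        # Assumed rule: another model is at least as good on both axes
        # and strictly better on at least one.
        return any(
            o.quality >= m.quality
            and o.throughput >= m.throughput
            and (o.quality > m.quality or o.throughput > m.throughput)
            for o in models
        )
    return [m for m in models if not dominated(m)]


# Hypothetical sample data, not measurements from the Arena.
models = [
    Model("a", 90, 10),
    Model("b", 70, 40),
    Model("c", 60, 30),  # dominated by "b" on both axes
    Model("d", 40, 80),
]
frontier = pareto_frontier(models)
```

Here `frontier` keeps "a", "b" and "d"; "c" drops out because "b" is both faster and smarter.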

Legend: on the Pareto frontier · dominated · VRAM (small → large)
70% Quality: MMLU · GPQA · HumanEval
20% Throughput: tokens/sec, % of cohort max
10% Efficiency: throughput ÷ VRAM, % of cohort max

score = 70%·Quality + 20%·Throughput + 10%·Efficiency

Methodology

How the Score is computed

score = wq·Q + wg·G + we·E

Three dimensions, three weights, summed. All dimensions are cohort-relative: numbers 0–100, where 100 is the best model in the active bucket.

The three dimensions

Q · Quality
MMLU-Pro · GPQA · HumanEval, averaged. Normalised to cohort peak.
Source: Artificial Analysis

G · Throughput
Tokens/sec on the preset bench (or the mean over all 9 for Aggregate). Normalised to cohort peak.
Source: vllm bench serve · llama-benchy

E · Efficiency
Throughput ÷ VRAM: tokens per second per GB. Normalised to cohort peak.
Source: derived · per cohort
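A rough sketch of how the cohort-relative numbers for Throughput and Efficiency could be derived. It assumes that "normalised to cohort peak" means dividing by the best value in the bucket and scaling to 100; the function name, field names and sample values are hypothetical.

```python
def normalise_to_peak(values):
    """Scale values so the cohort's best value becomes 100."""
    peak = max(values, default=0)
    if peak <= 0:
        # Assumption: a cohort without any measurement scores 0 everywhere.
        return [0.0 for _ in values]
    return [100.0 * v / peak for v in values]


# Hypothetical cohort, not real benchmark data.
cohort = [
    {"name": "x", "tps": 50.0, "vram_gb": 10.0},
    {"name": "y", "tps": 100.0, "vram_gb": 40.0},
]

# G: throughput relative to the fastest model in the cohort.
G = normalise_to_peak([m["tps"] for m in cohort])

# E: tokens per second per GB, relative to the most efficient model.
E = normalise_to_peak([m["tps"] / m["vram_gb"] for m in cohort])
```

In this example "y" is the faster model (G = 100) but "x" is the more efficient one (E = 100), because it gets half the throughput from a quarter of the VRAM.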

Weights per preset

Preset                    Q     G     E     Bench
Aggregate                 70%   20%   10%   mean over 9
Chat assistant            60%   30%   10%   chat
Agent / tool-use          70%   20%   10%   long-output
Batch / RAG offline       20%   70%   10%   rag-8k
Reasoning / long-output   60%   30%   10%   reasoning

Example: Aggregate

Say Q=99, G=48, E=22 →
score = 0.70·99 + 0.20·48 + 0.10·22
= 69.3 + 9.6 + 2.2 = 81.1
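The same arithmetic as a small sketch, with the per-preset weights taken from the table above. The `PRESETS` dictionary and the `score` function are illustrative names, not the page's own code.

```python
# (wq, wg, we) per preset, as listed under "Weights per preset".
PRESETS = {
    "Aggregate": (0.70, 0.20, 0.10),
    "Chat assistant": (0.60, 0.30, 0.10),
    "Agent / tool-use": (0.70, 0.20, 0.10),
    "Batch / RAG offline": (0.20, 0.70, 0.10),
    "Reasoning / long-output": (0.60, 0.30, 0.10),
}


def score(preset, q, g, e):
    """score = wq·Q + wg·G + we·E for the chosen preset."""
    wq, wg, we = PRESETS[preset]
    return wq * q + wg * g + we * e


aggregate = score("Aggregate", 99, 48, 22)
batch = score("Batch / RAG offline", 99, 48, 22)
```

With Q=99, G=48, E=22 the Aggregate preset gives 81.1, while the throughput-heavy Batch / RAG offline preset gives 19.8 + 33.6 + 2.2 = 55.6 for the same model.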

Colour codes in the Score column

81.1 · Top: top 25% of the cohort range, norm > 0.75.
54.0 · Middle: middle 50%, 0.25 ≤ norm ≤ 0.75.
26.4 · Bottom: bottom 25%, norm < 0.25.

Colours are cohort-relative: norm = (score − min) / (max − min) over the visible models. Switch bucket or preset and the numbers recolour. The leader is always blue and the tail-ender always grey.
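The banding rule can be sketched as follows. The function name and the band labels are hypothetical, and the handling of a cohort where every score is equal is an assumption, since the page does not say what happens in that case.

```python
def colour_band(score, visible_scores):
    """Classify a score as top, middle or bottom within the visible cohort."""
    lo, hi = min(visible_scores), max(visible_scores)
    if hi == lo:
        # Assumption: with no spread in the cohort, treat everything as middle.
        return "middle"
    norm = (score - lo) / (hi - lo)
    if norm > 0.75:
        return "top"
    if norm < 0.25:
        return "bottom"
    return "middle"


# The three sample scores from the colour legend.
scores = [81.1, 54.0, 26.4]
bands = [colour_band(s, scores) for s in scores]
```

For these three scores, 54.0 normalises to roughly 0.50 and lands in the middle band, while the leader and the tail-ender normalise to 1 and 0.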

Worth knowing

  • Cohort = bucket. Filter on "<8B" and a throughput of 100 marks the fastest <8B model, not the global maximum. Scores are therefore not directly comparable between buckets.
  • No measurement → tps 0. Models without benchmark data stay visible but drop to the bottom with "no measurement" in the Throughput column.
  • No latency gates. A slow model stays on the list, just near the bottom. No hidden SLA filter or quality-floor multiplier.
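A minimal sketch of the first two points, looking at the throughput dimension only. The tuple layout, the `rank_bucket` name and the sample models are hypothetical; the sketch only illustrates that the peak is taken inside the bucket and that a missing measurement counts as 0 rather than removing the model.

```python
def rank_bucket(models, max_params_b):
    """Rank models below a parameter limit by cohort-relative throughput.

    models: list of (name, params_in_billions, tps or None).
    """
    bucket = [m for m in models if m[1] < max_params_b]
    # No measurement is treated as 0 tokens/sec; the model stays listed.
    peak = max((tps or 0.0 for _, _, tps in bucket), default=0.0)
    rows = []
    for name, _, tps in bucket:
        value = tps or 0.0
        g = 100.0 * value / peak if peak > 0 else 0.0
        label = "no measurement" if tps is None else f"{value:.1f}"
        rows.append((name, g, label))
    # Slow or unmeasured models sink to the bottom but are never dropped.
    rows.sort(key=lambda r: r[1], reverse=True)
    return rows


# Hypothetical models, not Arena results.
models = [
    ("tiny", 0.8, 30.0),
    ("small", 4.0, 60.0),
    ("unmeasured", 7.0, None),
    ("big", 70.0, 200.0),
]
rows = rank_bucket(models, 8.0)
```

Within the "<8B" bucket "small" scores 100 even though "big" is faster overall, and "unmeasured" remains in the list at 0 with the "no measurement" label.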