Compare side by side.
2-4 models · all metrics · per benchmark.
01 No models selected
Go back to the Arena and tick 2 to 4 models to compare. More than 4 gets too crowded on one screen.
A The selected models
B Aggregate metrics: best = blue · worst = dimmed
C Throughput per benchmark: tokens/sec · 9 benches
D Quality breakdown: MMLU-Pro · GPQA-Diamond · HumanEval
E The short version