
Why this blog and arena exist

I looked for concrete numbers on local AI on the DGX Spark and never found them. So I measure them myself, building the blog and arena as an open workbench.

Written by Django de Vreng

For clients of Kamoo I set up AI systems that sometimes have to stay close to home. Accountants, administrative offices, firms with personal data and financial documents. Exactly the kind of data that does not make your auditor any calmer when you say: “we’ll just send it off to America”.

That is why we have a DGX Spark here. 128 GB of unified memory, small enough for a server cabinet, big enough to run serious local models through vLLM. I collect what fits on it in practice on the overview page about local models on the DGX Spark.

Then the practical question started.

Which model do you use for what on this machine? Which precision do you pick? How much context still fits? Where does concurrency fall over? What happens on an ordinary Monday with ten people who are not all running a benchmark at the same time, but just doing their work?

I went looking for numbers on exactly those questions. Not a general leaderboard with a score that mostly looks good in a screenshot. Just: this chip, these models, these engines, these workloads, these limits.

I did not find them.

So I am building them myself.

The arena is the measuring bench

Right now there are ten benchmark profiles in the arena, with runs for things like context scaling, concurrency, output throughput, RAG-like workloads and a Monday-morning peak.

The arena has to do one thing well: show what you can realistically expect on a DGX Spark. Not which model is “the best” in some abstract sense, but which model stays usable on this hardware under the workloads I run into in client work.

For a few runs I have already written down what went wrong and what I took away from it. For instance where Gemma-4 starts to grind on the Spark, what NVFP4 gains over BF16 once the bugs are gone, and how three precisions of Nemotron-3 compare.

The raw output is public on GitHub: djangodevreng/dgx-spark-benchmarks. That is on purpose. If you have a Spark yourself, you should be able to walk the same route and get roughly the same numbers. If that does not work out, that is interesting data too.
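What "roughly the same numbers" means is a judgment call. A minimal sketch of one way to check it, assuming a relative tolerance of 10%: the metric names, the tolerance and the values below are all hypothetical and are not taken from the repository.

```python
def relative_diff(own, reference):
    """Signed relative difference of own versus reference (0.05 means 5% higher)."""
    if reference == 0:
        raise ValueError("reference value must be non-zero")
    return (own - reference) / reference


def compare_runs(own, reference, tolerance=0.10):
    """Compare the metrics present in both runs.

    Returns {metric: (relative_diff, within_tolerance)}.
    The 10% default tolerance is an assumption, not a rule from the arena.
    """
    report = {}
    for metric in sorted(own.keys() & reference.keys()):
        diff = relative_diff(own[metric], reference[metric])
        report[metric] = (diff, abs(diff) <= tolerance)
    return report


# Hypothetical metric names and made-up values, for illustration only.
reference = {"output_tokens_per_s": 100.0, "ttft_s": 0.50}
own = {"output_tokens_per_s": 94.0, "ttft_s": 0.65}

report = compare_runs(own, reference)
for metric, (diff, ok) in report.items():
    status = "ok" if ok else "investigate"
    print(f"{metric}: {diff:+.0%} ({status})")
```

A metric that lands outside the tolerance is not necessarily wrong. It is the kind of deviation worth reporting back.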

So the arena is not a static list. It is a workbench. New models get added, other precisions go next to them, workloads get tightened up, odd results get run again. Boring enough to actually be useful.

The blog is the context around it

Numbers are handy, but they do not tell the whole story.

A benchmark can say that NVFP4 is faster than BF16. The blog can tell you that the first runs fell apart on vLLM bugs, that a parameter was set wrong, that a model only became usable after the context length went down, or that the tail latency felt worse than the average let on.
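The gap between average and tail latency is easy to show with a few lines. A minimal sketch, assuming a nearest-rank percentile and made-up latencies; none of these numbers come from the arena.

```python
import math
import statistics


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of all samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def summarize(latencies_s):
    """Mean next to median, p95 and max, so the tail is visible beside the average."""
    return {
        "mean": statistics.fmean(latencies_s),
        "p50": percentile(latencies_s, 50),
        "p95": percentile(latencies_s, 95),
        "max": max(latencies_s),
    }


# Made-up request latencies in seconds: 18 quick requests and 2 slow ones.
latencies = [1.0] * 18 + [9.0, 11.0]

summary = summarize(latencies)
for name, value in summary.items():
    print(f"{name}: {value:.1f} s")
```

With these made-up samples the mean stays under two seconds while the p95 sits at nine. That is the difference between a run that looks fine on average and one that feels slow to the two people who hit the tail.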

That is the layer I missed myself when I started. Not just “here is a score”, but: this is what I tried, this broke, this is what I changed, and this is what I would do differently next time.

That is why the blog and arena sit side by side. The arena gives the measuring points. The blog gives the reasoning, the mistakes and the practical choices behind them.

Why local

Privacy is usually the polite explanation. It is also true. The more practical reason: some clients have no choice.

An accountancy firm cannot treat client data as if it were sample text in a demo. Municipalities have rules. Financial documents have rules. Personal data has rules. In practice it all comes down to the same question: can you set this up without legal, compliance and audit immediately slamming the door shut?

Then you have two options. Either AI does not fit there, or you run it locally.

We choose local where it is needed. The Spark suddenly makes that less exotic. It is not cheap, but it is manageable for an SME office that wants to do something serious without immediately building its own data center.

That is where the interesting work is for me: running models, measuring latency, testing prompts, pulling documents through a pipeline, and watching where it breaks.

Usually it breaks somewhere boring. Those are the best spots.

What I want to be able to answer

The arena ultimately has to answer questions that keep coming back in projects.

Which model is fast enough for internal document questions? Which precision gives enough room for several users at the same time? When is NVFP4 fine, when do you want FP8, and when is BF16 mostly an expensive default? How much context can you give before latency gets annoying? Which engine fits which workload better: vLLM, TensorRT-LLM or SGLang?
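The precision question starts with simple arithmetic on weight size. A rough sketch, assuming nominal bytes per parameter (2 for BF16, 1 for FP8, 0.5 for NVFP4, ignoring the small overhead of NVFP4's block scales) and a hypothetical 30-billion-parameter model that is not one of the arena models. The 128 GB is the Spark's unified memory mentioned earlier.

```python
# Nominal storage per parameter. Assumption: quantization scale factors and
# any layers kept in higher precision are ignored, so real checkpoints run a
# bit larger than this.
BYTES_PER_PARAM = {
    "BF16": 2.0,
    "FP8": 1.0,
    "NVFP4": 0.5,
}


def weight_gb(params_billion, precision):
    """Approximate weight size in GB (1 GB = 1e9 bytes)."""
    return params_billion * BYTES_PER_PARAM[precision]


def headroom_gb(params_billion, precision, total_gb=128.0):
    """Memory left after loading weights.

    This is an upper bound: the OS, the inference engine, activations and the
    KV cache all have to fit in what remains.
    """
    return total_gb - weight_gb(params_billion, precision)


# Hypothetical 30B-parameter model, for illustration only.
for precision in BYTES_PER_PARAM:
    weights = weight_gb(30, precision)
    left = headroom_gb(30, precision)
    print(f"{precision}: ~{weights:.0f} GB weights, at most ~{left:.0f} GB left")
```

The headroom is what concurrent users and long contexts compete for, since the KV cache grows with both. The arithmetic only says what could fit. Whether it stays usable is what the benchmark runs are for.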

These are not academic questions. They decide how you design an on-prem setup. How much hardware you need. Which data stays local. Which steps you might send off to a hosted model. And where you draw the line between “works in a demo” and “holds up on Monday morning”.

That last line is the whole reason this site exists.

Why I write this in public

Everything I use for this is open or public: vLLM, models on Hugging Face, benchmark scripts, loose JSON, the site itself. The secret is not access to some magic dashboard. It is in hours of trying, measuring, running again, hunting bugs and then measuring once more because your first run was suspiciously good.

That has cost me dozens of hours by now.

If someone else walks the same route, they do not have to trip over all the same paving stones again. And if someone contradicts my numbers with better runs: great. Then the arena gets better.

There is a second reason under it too. This site is itself part of the experiment. The blog, the arena, the flow from benchmark output to structured JSON to pages: that was largely built in a couple of weeks with agents that write and build along. I described the small version of that earlier in the OpenClaw setup on a Raspberry Pi.

That workflow is part of the work by now. I dump raw findings in Slack, let an agent read the repo and the writing guide, get a branch with a proposal back, run checks and review the diff myself. It does not save me any thinking. It does move a lot of preparation to a layer that just keeps working.

Writing about that process forces me to make it less messy than my terminal history. That helps. Not always fun, but necessary.

What I want to build next

First, more benchmarks. vLLM was the starting point, because it is quick to work with and widely used. TensorRT-LLM is already on the bench for Nemotron-3. SGLang is what I want to put next to the same workloads after that. Only with multiple engines do you see whether your model is slow, your engine is fighting you, or you just did something dumb.

After that I want to make bench-spark public: the benchmark runner the way I use it now. Not a perfect framework. But something with which someone on the same hardware can ask the same questions without first rebuilding my mistakes.

I also want to make a Dutch eval suite for local LLMs. Not another English reasoning benchmark, but office work: accountancy jargon, legal texts, financial documents, documents with odd formatting. Exactly the things local AI gets judged on in the Netherlands.

And there is more work coming around local RAG on large document sets. No platform pitch. Just figuring out how to get more than a million documents through an on-prem setup without storage, retrieval or OCR slowly starting to hate you.

What I skip

No daily AI newsletter. There are enough places for that already, some of them on purpose.

No general-purpose “we do everything with AI” story. Too broad, and usually it means nothing.

No thought-leader act. I would rather build something that creaks than an opinion that sounds smooth.

No platform of my own like OpenClaw either. I use it, I write about it, I build flows with it. But the platform layer itself I leave to the people who live in it every day.

What this should become

For clients this has to show what local AI costs in practice: hardware, latency, precision, maintenance, odd edge cases. For me it is the place where I pin down my own assumptions before the next benchmark knocks them out.

I am trying to keep a rhythm, without promising a weekly post. If there is nothing to report, nothing goes here. If there are bugs, runs and odd graphs, there is probably too much here.
