Pillar guide · on-prem AI

Running LLMs on the DGX Spark

Yes, you can seriously run local LLMs on it. A model that fits in NVFP4 or FP8 runs with enough throughput for a team or a production agent. The trick is in the precision choice and your engine config, not in brute force.


What the DGX Spark is

The DGX Spark is NVIDIA's small desktop AI machine. One GB10 superchip, memory that CPU and GPU share, and enough capacity to run models locally that you'd otherwise send to the cloud. No rack, no data center, just next to your desk.

The interesting detail is in the FP4 compute. NVIDIA advertises the Spark with a petaFLOP of FP4, but on sm_121 it lacks the instruction path that NVFP4 needs. vLLM warns about this itself and falls back to Marlin, which dequantizes the 4-bit weights to BF16. So you store 4-bit weights, and that memory win is real, but you compute at a higher precision. For raw compute that's a downside. For memory and bandwidth it isn't, and on this hardware that's where the win sits.

Chip: GB10 superchip
Memory: 128 GB unified
Compute: SM12.1, no native FP4
Price range: ~€3,700 excl. VAT

What fits on it

The rule of thumb: model size in billions of parameters, times the bytes per parameter, is what you spend in memory, in gigabytes. BF16 costs 2 bytes per parameter, FP8 one, NVFP4 half. So a 30B model in BF16 is about 60 GB of weights, in NVFP4 just 15 GB. On top of that comes the KV cache, and that grows with your context length.

With 128 GB of unified memory you have plenty of room for most open-weight models, as long as you don't cram everything into BF16. The precision choice is therefore not a detail, it decides whether a model fits and how much context you get with it.

BF16: 2 bytes / param
FP8: 1 byte / param
NVFP4: 0.5 bytes / param
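
The rule of thumb above can be written down as a small sketch. The function names are made up for illustration, and the KV-cache estimate assumes a plain attention layout (one key and one value vector per layer, per KV head, per token). Real models with sliding-window or shared KV layers will come in lower, and quantized formats carry a small overhead for scale factors that this ignores.

```python
# Bytes per parameter for the three precisions discussed above.
BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "nvfp4": 0.5}


def weight_memory_gb(params_billions: float, precision: str) -> float:
    """Approximate weight footprint in GB (1 GB = 1e9 bytes)."""
    return params_billions * BYTES_PER_PARAM[precision]


def kv_cache_gb(
    context_tokens: int,
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    bytes_per_value: float = 2.0,  # assumption: KV cache kept in BF16
    sequences: int = 1,
) -> float:
    """Rough KV-cache size: K and V per layer, per KV head, per token."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * context_tokens * sequences / 1e9


print(weight_memory_gb(30, "bf16"))   # 60.0
print(weight_memory_gb(30, "nvfp4"))  # 15.0
```

The architecture numbers you pass to `kv_cache_gb` come from the model's own config, so treat any value you plug in as specific to that model rather than a property of the Spark.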

How fast it is

Two numbers count. Prefill is how fast the model reads in your prompt, decode is how fast it spits out tokens after that. With short prompts you mostly notice decode, with long context prefill becomes the bottleneck. On the Spark decode scales nicely, while prefill hits a wall once your context grows to around 25k tokens.

Under pressure the machine behaves surprisingly grown-up. Send it more requests than it can handle and it queues them neatly. It doesn't crash, it gets slow. That's exactly what you want for a system a team leans on come Monday.

The exact numbers per model and context length are in the benchmark suite, run on a single Spark with a fixed measurement protocol. The reasoning behind these three numbers is in a separate essay.

Decode @ small: 20.9 t/s/user
Decode @ 25k: 7.6 t/s/user
Prefill wall: ~25k tokens
Stable streams: 25 parallel

Indicative figures for Gemma-4-26B-A4B in NVFP4. They differ per model and precision; the full numbers are in the arena.
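
To get a feel for what those rates mean for a single user, here is a minimal sketch that splits a request into its prefill and decode parts. The decode rates are the indicative figures listed above. The function name and the idea of a single flat prefill rate are simplifications for illustration, and you would have to supply your own measured prefill rate.

```python
def request_seconds(
    prompt_tokens: int,
    output_tokens: int,
    prefill_tps: float,
    decode_tps: float,
) -> float:
    """Wall-clock estimate: time to read the prompt plus time to write the answer."""
    prefill = prompt_tokens / prefill_tps
    decode = output_tokens / decode_tps
    return prefill + decode


def decode_seconds(output_tokens: int, decode_tps: float) -> float:
    """Decode-only part, useful when the prompt is short."""
    return output_tokens / decode_tps


# A 500-token answer at the two decode rates quoted above.
print(round(decode_seconds(500, 20.9), 1))  # 23.9 (small context)
print(round(decode_seconds(500, 7.6), 1))   # 65.8 (25k context)
```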

→ To the full benchmark suite

Which engine

vLLM. That's what it comes down to in short. It's the engine with the best support for NVFP4 on the Spark, decent chunked prefill, and a serve mode that behaves like a real inference server instead of a demo script.

You do need the right flags. Chunked prefill on, your max-num-batched-tokens set right, and memory utilization tuned to your context budget. Get that wrong and you get an unstable server or throughput that makes no sense. The working config is in the build log about Gemma-4.
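
As a rough illustration of which knobs are meant, the sketch below assembles a `vllm serve` command line. The flag names are vLLM's own. The model ID and every numeric value are placeholders chosen for the example, not the working config from the build log.

```python
import shlex


def build_vllm_command(
    model: str,
    max_model_len: int,
    max_num_batched_tokens: int,
    gpu_memory_utilization: float,
) -> list[str]:
    """Assemble a `vllm serve` invocation with chunked prefill enabled."""
    return [
        "vllm", "serve", model,
        "--enable-chunked-prefill",
        # Upper bound on tokens processed per scheduler step.
        "--max-num-batched-tokens", str(max_num_batched_tokens),
        # Fraction of device memory vLLM may claim for weights and KV cache.
        "--gpu-memory-utilization", str(gpu_memory_utilization),
        # Context budget per request.
        "--max-model-len", str(max_model_len),
    ]


cmd = build_vllm_command(
    model="your-org/your-nvfp4-model",  # placeholder, not a real repo ID
    max_model_len=32768,                # placeholder value
    max_num_batched_tokens=8192,        # placeholder value
    gpu_memory_utilization=0.85,        # placeholder value
)
print(shlex.join(cmd))
```

Building the command as a list keeps each flag and value separate, which makes it easy to pass straight to `subprocess.run` or to log the exact invocation next to your measurements.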

→ vLLM flags that work for us (build log)

What it costs

The purchase is one-off, the power keeps running. A Spark draws about 170 watts under load. Don't count on 24/7 at full tilt, it almost never hits that in practice. At about 8 hours of real load per day you land on ≈€130 per year of power. The marginal cost of one more token is then almost nothing, just power. But don't kid yourself: the real cost per token depends on how busy you keep the machine, and is far from always lower than a hosted API.

The break-even point depends on your volume. Run a prompt now and then and a cloud API is cheaper. Run day in day out with a team or a production workload and the hardware pays for itself. The full sum is on the cost page.

Purchase: ~€3,700 excl. VAT
Power under load: ≈170 W
Power per year: ≈€130/year
Per extra token: ≈€0.00

Power based on ~8 hours of load per day at €0.26/kWh. A Spark is rarely under full load 24/7, so running continuously overstates the bill.
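
The power figure is easy to check yourself. The sketch below redoes the sum with the numbers from this section and adds a simple break-even estimate. The function names are made up for illustration, and the monthly API spend you compare against is your own input, not something measured here.

```python
from typing import Optional


def annual_power_cost_eur(
    watts: float, hours_per_day: float, eur_per_kwh: float, days: int = 365
) -> float:
    """Yearly electricity cost for a given daily load."""
    kwh = watts / 1000 * hours_per_day * days
    return kwh * eur_per_kwh


def break_even_months(
    purchase_eur: float, api_spend_eur_per_month: float, power_eur_per_year: float
) -> Optional[float]:
    """Months until the hardware has paid for itself, or None if it never does."""
    monthly_saving = api_spend_eur_per_month - power_eur_per_year / 12
    if monthly_saving <= 0:
        return None  # the API bill is smaller than the power bill alone
    return purchase_eur / monthly_saving


power = annual_power_cost_eur(watts=170, hours_per_day=8, eur_per_kwh=0.26)
print(round(power, 2))  # 129.06
```

This deliberately leaves out everything except purchase price and power, so read the result as a lower bound on the break-even time.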

→ The full cost comparison: local vs cloud

Who it's for

Not for everyone. If your data can happily go to a hosted model, an API is simpler and often cheaper. The Spark gets interesting once that isn't possible or isn't allowed.

Think of SMBs and organizations that work with personal data, internal documents or customer data that has to stay close under GDPR. Then the question isn't "what's the fastest model", but "which part is even allowed to leave". A Spark under your own control answers that a lot more easily than a contract with a cloud provider.

It's also just nice to have your inference in-house. No rate limits, no price change per quarter, no model that changes under your feet without warning.

Reproduce it yourself

All the numbers on this site come from one Spark, with a fixed measurement protocol: the same prompts, the same seeds, three runs per measurement. No cherry-picking, no marketing benchmark. The config, the prompts and the raw output are open on GitHub.
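
The shape of such a protocol can be sketched in a few lines. This is not the code from the repo: `run_once` is a hypothetical callable that sends one prompt with one seed and returns a measurement such as tokens per second, and using one fixed seed per run is an assumption about how "the same seeds" and "three runs" fit together.

```python
import statistics
from typing import Callable, Dict, Sequence


def measure(
    run_once: Callable[[str, int], float],
    prompts: Sequence[str],
    seeds: Sequence[int] = (0, 1, 2),  # assumption: one fixed seed per run
) -> Dict[str, dict]:
    """Run every prompt once per seed and summarize the samples."""
    results = {}
    for prompt in prompts:
        samples = [run_once(prompt, seed) for seed in seeds]
        results[prompt] = {
            "samples": samples,
            "median": statistics.median(samples),
            "spread": max(samples) - min(samples),
        }
    return results
```

Keeping the raw samples next to the summary is what makes a result checkable: anyone rerunning with the same prompts and seeds can compare run by run instead of only against an average.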

Run it on a different Spark or a different vLLM version and get other numbers? Let me know. That's exactly the kind of feedback that makes this whole suite better.

→ The methodology in detail → The repo on GitHub

Read next

Gemma-4 on the DGX Spark: where context hurts

Gemma-4: NVFP4 vs BF16

Nemotron-3: BF16 vs FP8 vs NVFP4

What quantization became after three benchmark rounds

The three numbers behind a fast DGX Spark