Running LLMs on the DGX Spark
Yes, you can seriously run local LLMs on it. A model that fits in NVFP4 or FP8 runs with enough throughput for a team or a production agent. The trick is in the precision choice and your engine config, not in brute force.
What the DGX Spark is
The DGX Spark is NVIDIA's small desktop AI machine. One GB10 superchip, memory that CPU and GPU share, and enough capacity to run models locally that you'd otherwise send to the cloud. No rack, no data center, just next to your desk.
The interesting detail is the FP4 compute. NVIDIA advertises the Spark with a petaFLOP of FP4, but on sm_121 the instruction path that NVFP4 needs is missing. vLLM warns about this itself and falls back to Marlin, which dequantizes the 4-bit weights to BF16. So you store 4-bit weights, and that memory win is real, but the arithmetic runs at higher precision. For raw compute that's a downside; for memory and bandwidth it isn't, and on this hardware that's where the win is.
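To make "store 4-bit, compute higher" concrete, here is a minimal sketch in plain Python. It is an illustration under loud assumptions: it uses signed 4-bit integers with one scale per block, not the real NVFP4 number format, and it has nothing to do with how the Marlin kernel is actually implemented. The block size and all function names are made up for this example.

```python
# Illustration only: plain signed 4-bit integers with one scale per block.
# This is NOT the NVFP4 format and NOT the Marlin kernel; it only shows the
# idea of storing small codes and doing the arithmetic in full precision.

BLOCK = 16  # assumed block size for this sketch


def quantize_block(values):
    """Map a block of floats to 4-bit codes (-7..7) plus one float scale."""
    peak = max(abs(v) for v in values)
    scale = peak / 7 if peak else 1.0
    codes = [max(-7, min(7, round(v / scale))) for v in values]
    return codes, scale


def pack(codes):
    """Store two 4-bit codes per byte; this is where the memory win comes from."""
    nibbles = [c & 0xF for c in codes]  # two's-complement nibble
    return bytes(lo | (hi << 4) for lo, hi in zip(nibbles[0::2], nibbles[1::2]))


def unpack(packed):
    """Recover the signed 4-bit codes from the packed bytes."""
    codes = []
    for byte in packed:
        for nibble in (byte & 0xF, byte >> 4):
            codes.append(nibble - 16 if nibble >= 8 else nibble)
    return codes


def dequant_dot(packed, scale, activations):
    """Dequantize to ordinary floats first, then do the multiply-adds."""
    weights = [c * scale for c in unpack(packed)]
    return sum(w * a for w, a in zip(weights, activations))


weights = [0.7, -0.3, 0.1, 0.0, -0.7, 0.2, 0.5, -0.1] * 2  # 16 example values
codes, scale = quantize_block(weights)
packed = pack(codes)
activations = [1.0] * BLOCK

print(len(packed), "bytes stored for", len(weights), "weights")
print("dot product after dequantization:", dequant_dot(packed, scale, activations))
```

The storage side is half a byte per weight, while every multiply-add happens on floats. That is the shape of the trade-off described above: the memory and bandwidth saving survives, the compute saving does not.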
- Chip: GB10 superchip
- Memory: 128 GB unified
- Compute: SM 12.1 (sm_121), no native FP4
- Price range: ~€3,700 excl. VAT
What fits on it
The rule of thumb: model size in billions of parameters, times the bytes per parameter, is roughly what the weights cost you in GB of memory. BF16 costs 2 bytes per parameter, FP8 one, NVFP4 half. So a 30B model in BF16 is about 60 GB of weights, in NVFP4 just 15 GB. On top of that comes the KV cache, and that grows with your context length.
With 128 GB of unified memory you have plenty of room for most open-weight models, as long as you don't cram everything into BF16. The precision choice is therefore not a detail, it decides whether a model fits and how much context you get with it.
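The rule of thumb is easy to put into a few lines of Python. This is a rough sketch, not a sizing tool: it ignores quantization scale overhead and activations, the KV cache formula assumes standard attention with a 2-byte cache, and the layer count, KV head count and head dimension in the example are hypothetical, not the values of any particular model.

```python
# Bytes per parameter, as given in the text.
BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "nvfp4": 0.5}

UNIFIED_MEMORY_GB = 128  # shared by CPU and GPU, so not all of it is free


def weights_gb(params_billion, precision):
    """Weight memory in GB: billions of parameters times bytes per parameter."""
    return params_billion * BYTES_PER_PARAM[precision]


def kv_cache_gb(tokens, layers, kv_heads, head_dim, bytes_per_value=2.0):
    """KV cache in GB: one key and one value vector per layer per token."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens / 1e9


# Hypothetical architecture numbers, for illustration only.
cache = kv_cache_gb(tokens=32_000, layers=48, kv_heads=8, head_dim=128)

for precision in ("bf16", "fp8", "nvfp4"):
    total = weights_gb(30, precision) + cache
    print(f"30B in {precision}: {total:.1f} GB of {UNIFIED_MEMORY_GB} GB")
```

The cache term scales linearly with tokens, and with the number of concurrent requests if each has its own context. That is why the precision of the weights decides how much context you have left.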
- BF16: 2 bytes / param. Best quality, eats memory. For codegen where it has to be right.
- FP8: 1 byte / param. The middle ground. Halves memory, with a barely noticeable drop in quality.
- NVFP4: 0.5 bytes / param. Maximum room and throughput. Fine for RAG and agents, watch out for codegen.
How fast it is
Two numbers count. Prefill is how fast the model reads your prompt; decode is how fast it produces tokens after that. With short prompts you mostly notice decode; with long context, prefill becomes the bottleneck. On the Spark decode scales nicely, while prefill hits a wall once your context grows to roughly 25k tokens.
Under pressure the machine behaves surprisingly grown-up. Send it more requests than it can handle and it queues them neatly. It doesn't crash, it gets slow. That's exactly what you want for a system a team leans on come Monday.
The exact numbers per model and context length are in the benchmark suite, run on a single Spark with a fixed measurement protocol. The reasoning behind the three numbers that matter here (decode, prefill and queueing) is in a separate essay.
- Decode @ small context: 20.9 t/s/user
- Decode @ 25k context: 7.6 t/s/user
- Prefill wall: ~25k tokens
- Stable streams: 25 parallel
Indicative figures for Gemma-4-26B-A4B in NVFP4. They differ per model and precision; the full numbers are in the arena.
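To see how prefill and decode add up to the time a user actually waits, a simple two-term model helps. This is a simplification under stated assumptions: the decode rates are the per-user figures above, but the prefill rate is a made-up placeholder (the text gives no prefill throughput), and queueing and batching effects are ignored.

```python
def time_to_first_token(prompt_tokens, prefill_tps):
    """Seconds spent reading the prompt before the first token appears."""
    return prompt_tokens / prefill_tps


def response_seconds(prompt_tokens, output_tokens, prefill_tps, decode_tps):
    """Total time for one request: prefill first, then token-by-token decode."""
    return time_to_first_token(prompt_tokens, prefill_tps) + output_tokens / decode_tps


# ASSUMPTION: hypothetical prefill rate, not a measured value.
ASSUMED_PREFILL_TPS = 1000.0

# Decode rates taken from the figures above (tokens per second per user).
short_prompt = response_seconds(200, 500, ASSUMED_PREFILL_TPS, decode_tps=20.9)
long_prompt = response_seconds(25_000, 500, ASSUMED_PREFILL_TPS, decode_tps=7.6)

print(f"short prompt, 500 output tokens: {short_prompt:.1f} s")
print(f"25k prompt, 500 output tokens:   {long_prompt:.1f} s")
```

With a short prompt nearly all the time is decode. With a long one you pay twice: the wait before the first token grows, and every token after it comes out slower.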
→ To the full benchmark suite
Which engine
In short: vLLM. It's the engine with the best support for NVFP4 on the Spark, decent chunked prefill, and a serve mode that behaves like a real inference server instead of a demo script.
You do need the right flags. Chunked prefill on, your max-num-batched-tokens set right, and memory utilization tuned to your context budget. Get that wrong and you get an unstable server or throughput that makes no sense. The working config is in the build log about Gemma-4.
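As a rough idea of what those flags look like on the command line, here is a small Python helper that assembles a `vllm serve` invocation. Every value in it is a placeholder and an assumption, including the model name; it is not the working config from the build log, and the right numbers depend on your model and context budget.

```python
import shlex


def vllm_serve_command(model, max_model_len, max_num_batched_tokens,
                       gpu_memory_utilization):
    """Build the argument list for a vLLM server with chunked prefill enabled."""
    return [
        "vllm", "serve", model,
        "--enable-chunked-prefill",
        "--max-num-batched-tokens", str(max_num_batched_tokens),
        "--max-model-len", str(max_model_len),
        "--gpu-memory-utilization", str(gpu_memory_utilization),
    ]


# ASSUMPTION: placeholder model name and values, chosen only for illustration.
cmd = vllm_serve_command(
    model="your-org/your-model-nvfp4",
    max_model_len=32768,
    max_num_batched_tokens=8192,
    gpu_memory_utilization=0.85,
)

print(shlex.join(cmd))
```

Keeping the command in one function makes it easier to log the exact settings next to each benchmark run.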
What it costs
The purchase is one-off, the power keeps running. A Spark draws about 170 watts under load. Don't count on 24/7 at full tilt; it almost never hits that in practice. At about 8 hours of real load per day and €0.26/kWh you land on ≈€130 per year of power. The marginal cost of one more token is then almost nothing, just power. But don't kid yourself: the real cost per token depends on how busy you keep the machine, and it is by no means always lower than a hosted API.
The break-even point depends on your volume. Run a prompt now and then and a cloud API is cheaper. Run day in day out with a team or a production workload and the hardware pays for itself. The full sum is on the cost page.
- Purchase: ~€3,700 excl. VAT
- Power under load: ≈170 W
- Power per year: ≈€130
- Per extra token: ≈€0.00
Power based on ~8 hours of load per day at €0.26/kWh. A Spark is rarely under full load 24/7, so assuming continuous load overstates the bill.
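The power figure and the "depends on how busy you keep it" point can both be checked with a few lines of arithmetic. The wattage, hours and kWh price come from the text above; the three-year write-off period and the aggregate throughput of 100 tokens per second are assumptions picked purely for illustration.

```python
def yearly_power_eur(watts, hours_per_day, eur_per_kwh):
    """Electricity cost per year for a given draw and daily load."""
    return watts / 1000 * hours_per_day * 365 * eur_per_kwh


def tokens_per_year(aggregate_tps, busy_hours_per_day):
    """Tokens produced per year at a given aggregate throughput."""
    return aggregate_tps * busy_hours_per_day * 3600 * 365


def eur_per_million_tokens(purchase_eur, lifetime_years, power_eur, tokens):
    """Hardware written off over its lifetime plus power, per million tokens."""
    yearly_cost = purchase_eur / lifetime_years + power_eur
    return yearly_cost / tokens * 1_000_000


power = yearly_power_eur(watts=170, hours_per_day=8, eur_per_kwh=0.26)
print(f"power per year: ~EUR {power:.0f}")

# ASSUMPTIONS: 3-year write-off and 100 t/s aggregate throughput are hypothetical.
for busy_hours in (0.5, 2, 8):
    cost = eur_per_million_tokens(
        purchase_eur=3700,
        lifetime_years=3,
        power_eur=yearly_power_eur(170, busy_hours, 0.26),
        tokens=tokens_per_year(100, busy_hours),
    )
    print(f"{busy_hours} busy hours/day: ~EUR {cost:.2f} per million tokens")
```

The hardware write-off dominates when the machine sits idle, which is why the per-token cost drops sharply as utilization goes up. Compare the result with your own API pricing to find your break-even volume.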
→ The full cost comparison: local vs cloud
Who it's for
Not for everyone. If your data can happily go to a hosted model, an API is simpler and often cheaper. The Spark gets interesting once that isn't possible or isn't allowed.
Think of SMBs and organizations that work with personal data, internal documents or customer data that has to stay in-house under GDPR. Then the question isn't "what's the fastest model", but "which part is even allowed to leave". A Spark under your own control answers that a lot more easily than a contract with a cloud provider.
It's also just nice to have your inference in-house. No rate limits, no price change per quarter, no model that changes under your feet without warning.
Reproduce it yourself
All the numbers on this site come from one Spark, with a fixed measurement protocol: the same prompts, the same seeds, three runs per measurement. No cherry-picking, no marketing benchmark. The config, the prompts and the raw output are open on GitHub.
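The shape of such a protocol is simple enough to sketch. This is not the harness from the repository, only an outline of the idea: `measure` is a hypothetical callable that runs one prompt with a given seed and returns tokens per second, and here it is replaced by a deterministic stand-in so the example runs without a server.

```python
import statistics


def run_protocol(measure, prompts, seed, runs=3):
    """Run every prompt `runs` times with the same seed and keep all samples."""
    results = {}
    for prompt in prompts:
        samples = [measure(prompt, seed) for _ in range(runs)]
        results[prompt] = {
            "runs": samples,  # raw values stay in the output, nothing is dropped
            "median": statistics.median(samples),
            "spread": max(samples) - min(samples),
        }
    return results


def fake_measure(prompt, seed):
    """Stand-in for a real measurement; returns a made-up tokens-per-second value."""
    return 20.0 + len(prompt) % 3


results = run_protocol(fake_measure, prompts=["hello", "summarize this"], seed=42)
for prompt, stats in results.items():
    print(prompt, stats)
```

Reporting the median next to the raw runs and their spread is one way to make outliers visible instead of hiding them.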
Run it on a different Spark or a different vLLM version and get other numbers? Let me know. That's exactly the kind of feedback that makes this suite better.
Read next
- Gemma-4 on the DGX Spark: where context hurts. Nine benchmarks, the prefill wall in view, and the vLLM config that finally worked.
- Gemma-4: NVFP4 vs BF16 (Benchmark). The same nine tests, two precisions. Where NVFP4 nearly doubles throughput.
- Nemotron-3: BF16 vs FP8 vs NVFP4 (Benchmark). Three precisions side by side on the same model and the same Spark.
- What quantization became after three benchmark rounds (Quantization). The concept under the numbers: which task may run on which precision.
- The three numbers behind a fast DGX Spark (Lens). Decode, prefill and queueing: the one perspective that explained every benchmark run.