On-prem AI.
local models, no cloud
DGX Spark, local models, and everything that runs without the cloud. What fits in 128 GB, what just doesn't, and which quantization was worth it.
- 23-06-26 7 min
Gemma-4 v23 on the DGX Spark
New vLLM v0.23.0 runs for Gemma-4 on the DGX Spark: BF16, NVFP4 and MTP compared across decode, TTFT, tails and practical local-agent limits.
- 22-05-26 5 min
The three numbers behind a fast DGX Spark
Decode, prefill and queueing: three numbers decide whether a DGX Spark feels fast under a real workload, and those three are exactly what most reviews skip.
- 03-05-26 14 min
Gemma-4 on the DGX Spark: NVFP4 vs BF16
Nine identical benchmarks, two precisions. NVFP4 runs 22 to 92 percent faster per token, and peak-hour capacity grows 69 percent on the Spark.
- 03-05-26 18 min
Nemotron-3 on the DGX Spark: BF16 vs FP8 vs NVFP4
One model, three precisions, the same Spark. What memory budget, decode speed and tail latency do when you go from 16-bit to 8-bit to 4-bit.
- 01-05-26 27 min
Gemma-4 on the DGX Spark: the price of context
Nine benchmarks of Gemma-4-26B-A4B-it on the DGX Spark with llama-benchy and vLLM. Decode holds up; prefill and queueing decide how it feels.
- 01-05-26 8 min
I put a 24/7 assistant on a Raspberry Pi
A build log about OpenClaw on a Raspberry Pi 5: Slack as the interface, GPT-5.5 as the model, and the Pi as an always-on agent layer next to the DGX Spark.
- 01-05-26 8 min
What quantization turned out to be
A practical look back at quantization on the DGX Spark: what BF16, FP8 and NVFP4 do to memory, speed and tail latency, after three rounds with vLLM.