On-prem AI.

local models, no cloud

7 posts · ~1×/month

DGX Spark, local models, and everything that runs without the cloud. What fits in 128 GB, what just doesn't, and which quantization was worth it.

001 23-06-26

Gemma-4 v23 on the DGX Spark

New vLLM v0.23.0 runs for Gemma-4 on the DGX Spark: BF16, NVFP4 and MTP compared across decode, TTFT, tails and practical local-agent limits.

7 min

002 22-05-26

The three numbers behind a fast DGX Spark

Decode, prefill and queueing: three numbers decide whether a DGX Spark feels fast under a real workload, and those three are exactly what most reviews skip.

5 min

003 03-05-26

Gemma-4 on the DGX Spark: NVFP4 vs BF16

Nine identical benchmarks, two precisions. NVFP4 runs 22 to 92 percent faster per token, and peak-hour capacity grows 69 percent on the Spark.

14 min

004 03-05-26

Nemotron-3 on the DGX Spark: BF16 vs FP8 vs NVFP4

One model, three precisions, the same Spark. What happens to memory budget, decode speed and tail latency when you go from 16-bit to 8-bit to 4-bit.

18 min

005 01-05-26

Gemma-4 on the DGX Spark: the price of context

Nine benchmarks of Gemma-4-26B-A4B-it on the DGX Spark with llama-benchy and vLLM. Decode holds up; prefill and queueing decide how it feels.

27 min

006 01-05-26

I put a 24/7 assistant on a Raspberry Pi

A build log about OpenClaw on a Raspberry Pi 5: Slack as the interface, GPT-5.5 as the model, and the Pi as an always-on agent layer next to the DGX Spark.

8 min

007 01-05-26

What quantization turned out to be

A practical look back at quantization on the DGX Spark: what BF16, FP8 and NVFP4 do to memory, speed and tail latency, after three rounds with vLLM.

8 min