
What quantization turned out to be

A practical look back at quantization on the DGX Spark: what BF16, FP8 and NVFP4 do to memory, speed and tail latency, after three rounds with vLLM.

Written by Django de Vreng

This was the first blog post I put live on this site. When I wrote it, I had just gotten two models running on the DGX Spark: Gemma-4-26B-A4B-it, a MoE model, and a 31B dense model. Both local, both through vLLM.

At that point, quantization was still mostly a question for me. I knew the term, I roughly understood what it was about, but I had too few measurements of my own to say anything firm about it.

By now we’re a few benchmark rounds further along. First Gemma-4 on the DGX Spark. Then NVFP4 vs BF16 on that same model. And after that Nemotron-3 in BF16, FP8 and NVFP4. Together they make up the guide to running LLMs on the DGX Spark.

That changed this post, really. It’s less about “what is quantization?” and more about what happens when quantization stops being a model-card term and becomes an architecture choice.

The first question was simply: does it fit?

With hosted models you often start at quality. Which one is smarter, which follows instructions better, which writes better code?

Locally you start blunter: does it fit?

That sounds almost too simple, but on your own hardware that’s the first wall. A model name and a model card are paperwork; the weights have to actually fit in memory. After that you still want room for context, you want to handle several requests at once, and ideally see something come back within seconds.

On the DGX Spark you feel that immediately. You watch vLLM at work: downloading, loading, reserving memory, warming up. Only then does the discussion about throughput, latency and usability begin.

That’s a different feeling than an API call to Claude or GPT-5.5. There the infrastructure mostly exists as an abstraction. You send text in and get text back. Running locally, you see the back end. Sometimes that’s fun. Sometimes it mostly takes a while.

That’s exactly where quantization comes in.

My first picture was too narrow

My first working definition was tidy enough: quantization stores model weights more compactly. FP16 or BF16 uses more space than 8-bit or 4-bit. Fewer bits means less memory. Less memory means a model fits sooner, loads faster, or leaves room for more context and more requests.
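As a back-of-the-envelope check, that arithmetic can be sketched as below. This is a minimal sketch under stated assumptions: the parameter counts are read off the model names (26B and 31B), the bit widths are nominal, and the scale factors that low-bit formats store next to the weights, plus KV cache and runtime overhead, are left out. Real footprints will be somewhat larger.

```python
# Nominal bits per weight. Assumption: block scale factors and other
# per-format metadata are ignored, so these are lower bounds.
BITS_PER_WEIGHT = {"BF16": 16, "FP8": 8, "NVFP4": 4}


def weight_gb(params_billion: float, precision: str) -> float:
    """Approximate weight memory in GB (10^9 bytes) at a given precision."""
    bits = BITS_PER_WEIGHT[precision]
    total_bytes = params_billion * 1e9 * bits / 8
    return total_bytes / 1e9


if __name__ == "__main__":
    # Parameter counts taken from the model names; treat them as rough.
    for name, params in [("26B MoE", 26.0), ("31B dense", 31.0)]:
        row = ", ".join(
            f"{p}: {weight_gb(params, p):.1f} GB" for p in BITS_PER_WEIGHT
        )
        print(f"{name} -> {row}")
```

For a MoE model, the total parameter count is typically what has to fit in memory, even if only part of it is active per token.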

That’s correct, but it’s too small.

After the benchmarks I look at it differently. The question “does this model fit on this machine?” is only the start. After that comes the question of what you can do with that machine once the model fits.

Running one request is the demo. Running multiple requests is the workflow.

That’s where the difference sits for me. A local model that answers one prompt neatly is nice. A local model that can handle several users, agents or tasks at once without latency collapsing becomes useful.

So quantization decides how much room to move you have left.

vLLM makes it concrete

I use vLLM because one request at a time isn’t the situation I’m heading for. Starting a local chatbot is fine for testing, but the moment you talk about agents you get different traffic.

An agent fetches context, calls tools, splits up work, sometimes asks for things in parallel and waits for results in between. Meanwhile you want a second request not to have to wait until the first is completely done.

That’s where serving matters.

vLLM is the layer that makes this concrete: batching, scheduling, using memory more efficiently and handling multiple concurrent requests. It also makes visible that running locally is a system. The model, the precision, the context length, the number of simultaneous requests and the scheduler all pull on the same hardware.
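To make "multiple concurrent requests" tangible from the client side, here is a small sketch of sending several prompts at once and timing each one. It is not vLLM’s API: `fake_request` is a hypothetical stand-in that just sleeps, and `run_batch` and `max_in_flight` are names invented for this illustration. In a real test you would swap the stand-in for an HTTP call to your own server.

```python
import asyncio
import time


async def fake_request(prompt: str) -> str:
    """Hypothetical stand-in for a call to a locally served model."""
    await asyncio.sleep(0.05)  # pretend the server needs about 50 ms
    return f"echo: {prompt}"


async def run_batch(prompts, send=fake_request, max_in_flight=4):
    """Send prompts concurrently, at most max_in_flight at a time.

    Returns (replies, latencies). Each latency includes the time a
    request spent waiting for a free slot on the client side.
    """
    gate = asyncio.Semaphore(max_in_flight)

    async def one(prompt):
        start = time.perf_counter()
        async with gate:
            reply = await send(prompt)
        return reply, time.perf_counter() - start

    # gather keeps results in the same order as the prompts.
    pairs = await asyncio.gather(*(one(p) for p in prompts))
    return [r for r, _ in pairs], [t for _, t in pairs]


if __name__ == "__main__":
    prompts = [f"task {i}" for i in range(8)]
    replies, latencies = asyncio.run(run_batch(prompts))
    print(len(replies), "replies, slowest:", round(max(latencies), 3), "s")
```

Raising `max_in_flight` step by step and watching the slowest latency is one simple way to see where a setup starts to strain.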

That was the first real lesson for me. Quantization isn’t a separate trick at the bottom of the stack. It influences how the whole stack behaves.

BF16 felt like the safe choice at first

If you haven’t measured yet, higher precision quickly feels safer. BF16 sounds solid. More detail, less quality risk, less chance the model starts behaving oddly.

That was my first reflex too. If the hardware can handle it, why go lower?

The measurements made that less obvious. On the DGX Spark, BF16 often turned out to be the least practical choice in the later runs. BF16 isn’t “bad”; it’s just that the hardware and workload weigh more heavily than the tidy feeling of higher precision.

If a lower precision gives much more room for concurrency, context or throughput, then in practice that can be better. Certainly for workloads where speed and concurrency count for more than the last bit of model quality.

That’s the twist I found interesting. The highest precision intuitively feels like the serious choice. On this machine it was often mostly the most expensive one.

NVFP4 changed the Spark

The biggest shift came with NVFP4. In the benchmark posts and the arena you can see that NVFP4 nearly doubles what the DGX Spark gets through for many workloads. That’s no longer a small optimization. It changes what you dare to try on the same machine.

For on-prem AI that’s exactly the point. You buy hardware for a workflow, not for one pretty prompt. You want to know how much real work you can fit on that box.

If NVFP4 means you can run more requests at once, keep more headroom and bump into memory limits less quickly, then that’s not a detail in a table. Then your architecture changes.
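The headroom argument can be put into rough numbers. The sketch below estimates how many tokens of KV cache fit in whatever memory is left after the weights. Everything specific in it is an assumption for illustration: the memory budget, the layer count, the number of KV heads and the head size are hypothetical and do not describe any of the models above, and real servers also reserve memory for activations and other overhead.

```python
def kv_bytes_per_token(
    layers: int, kv_heads: int, head_dim: int, bytes_per_value: int = 2
) -> int:
    """KV cache bytes per token: one key and one value vector per layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


def tokens_that_fit(budget_gb: float, weights_gb: float, per_token_bytes: int) -> int:
    """Tokens of KV cache that fit in the memory left after the weights."""
    free_bytes = max(budget_gb - weights_gb, 0.0) * 1e9
    return int(free_bytes // per_token_bytes)


if __name__ == "__main__":
    # Hypothetical model shape and memory budget, for illustration only.
    per_token = kv_bytes_per_token(layers=48, kv_heads=8, head_dim=128)
    budget_gb = 100.0
    for label, weights_gb in [("16-bit weights", 62.0), ("4-bit weights", 15.5)]:
        tokens = tokens_that_fit(budget_gb, weights_gb, per_token)
        print(f"{label}: about {tokens:,} cached tokens across all requests")
```

Those cached tokens are shared between everything running at once, so the same number can be spent on a few long contexts or on many short requests.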

You can divide tasks differently. You can keep more local. You can experiment faster with agent steps that would otherwise go straight to a hosted model.

That made quantization more practical for me than I’d expected beforehand. It stopped being about a smaller model and became about enabling a different workflow.

FP8 had a different kind of upside

FP8 didn’t simply sit “between BF16 and NVFP4”. In the Nemotron-3 runs, tail latency was the interesting part. That draws less attention than a big throughput jump, but in use it matters at least as much.

Averages don’t necessarily lie, but they reassure you at the wrong moments. A workflow feels slow because of the few requests that keep hanging.

That’s why tail latency is so practical. If an agent workflow has multiple steps, delays stack. One slow step is annoying. Three slow steps in a row feel like the system is reconsidering its life choices.
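A small worked example shows both effects: how an average hides a slow request, and how the chance of hitting a slow step grows with the number of steps. The latency samples are made up for illustration, and the stacking formula assumes the steps are independent, which real workloads only approximate.

```python
import math


def percentile(samples, pct: float) -> float:
    """Nearest-rank percentile: the smallest sample that has at least
    pct percent of all samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(math.ceil(pct * len(ordered) / 100), 1)
    return ordered[rank - 1]


def chance_of_slow_step(steps: int, pct: float = 95.0) -> float:
    """Probability that at least one of `steps` independent requests
    lands beyond the given percentile."""
    return 1 - (pct / 100) ** steps


if __name__ == "__main__":
    # Made-up latencies in seconds: nine ordinary requests and one that hangs.
    latencies = [0.8, 0.9, 0.9, 1.0, 1.0, 1.1, 1.1, 1.2, 1.3, 6.5]
    mean = sum(latencies) / len(latencies)
    print(f"mean {mean:.2f}s, p50 {percentile(latencies, 50)}s, "
          f"p95 {percentile(latencies, 95)}s")
    for steps in (1, 3, 10):
        print(f"{steps} steps: {chance_of_slow_step(steps):.0%} chance "
              "of at least one tail request")
```

With these numbers the median is 1.0 s and the mean about 1.6 s, while the slowest request takes 6.5 s. A three-step workflow has roughly a 14% chance of including at least one request from the slowest 5%.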

FP8 looks useful in that corner: less extreme than NVFP4, but interesting when predictability matters more than running as much as possible at once.

That’s the nuance I didn’t have yet in the first version. Precision isn’t a ladder where lower is always faster and worse. It’s a set of choices with different trade-offs.

Quality stays the open question

The benchmarks answer memory, throughput and latency. They say less about behaviour.

That stays the hard side of quantization. You don’t always see quality loss neatly in one metric. Sometimes an answer gets flatter. Sometimes code goes wrong a bit more often. Sometimes an agent picks the wrong tool. Sometimes you notice nothing, until your task is just different from your test set.

For simple tasks that can be perfectly fine. Think classification, routing, first summaries, embeddings or a light pass over internal documents. Those tasks don’t always need the heaviest model.

For code generation and agent workflows it’s more sensitive. Small errors stack. One mediocre piece of reasoning is annoying. A wrong tool call is a different kind of problem.

That’s why I don’t want to benchmark quantized models on speed alone. I want to know where I dare to deploy them.

That’s a different question. And honestly, the only one that counts.

The split gets clearer

My expectation is still that the best on-prem setup becomes a mix. “Everything local” sounds tough, but usually it’s also needlessly strict.

The logical split looks more like:

  • embeddings local
  • sensitive documents local
  • routing and classification local
  • simple agent steps local
  • heavy reasoning to Claude or GPT-5.5 when needed

Quantization decides how big that local part can get. The more tasks run reliably and fast enough locally, the less you have to send out.
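As a sketch, the split above comes down to a small routing rule. The task labels, the `choose_backend` name and the two backend names are hypothetical and only mirror the list. A real setup would also need a way to decide what counts as sensitive.

```python
# Hypothetical task labels that mirror the list above.
LOCAL_TASKS = {
    "embedding",
    "sensitive_document",
    "routing",
    "classification",
    "simple_agent_step",
}


def choose_backend(task: str, contains_sensitive_data: bool = False) -> str:
    """Return "local" or "hosted" for a task.

    Sensitive data always stays local; everything not on the local
    list goes to a hosted model.
    """
    if contains_sensitive_data or task in LOCAL_TASKS:
        return "local"
    return "hosted"


if __name__ == "__main__":
    for task in ("embedding", "classification", "heavy_reasoning"):
        print(task, "->", choose_backend(task))
```

The more tasks a quantized local model handles reliably, the longer `LOCAL_TASKS` can get.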

That matters for client work. Not because every token has to stay within four walls, but because some data does belong there. And because latency, cost and control simply count in production.

An on-prem setup isn’t a belief. It’s a division of work.

What I’d measure differently now

In the first version of this post I mostly had a list of questions. How long does downloading take? How long does loading take? How much VRAM is left? How many concurrent requests can I send before latency gets annoying?

Those questions stay useful, but they’re the start. Exactly how I set up those measurements on the Spark is described in the arena methodology.

Now I’d put three things side by side per precision:

  • system behaviour: loading, memory, throughput, latency and tail latency
  • model behaviour: Dutch output, code questions, longer context, tool use
  • workflow fit: which tasks do I dare run locally with this

That last one is easy to miss if you only look at benchmark tables. A model can technically run and still be awkward to work with. Or it can score worse on paper and still be exactly good enough for routing or summarizing.

For production that makes the difference. Nobody buys “tokens per second” alone. You buy room in a workflow.

What I understand now

My working definition has shifted.

Quantization makes a model smaller, but that’s only the entrance. It changes how much work you get out of the same hardware, which latency you accept and which tasks you dare to keep local.

On the DGX Spark, the highest precision rarely seems to be the best choice by default. NVFP4 makes the machine much more usable for many workloads. FP8 is interesting when tail latency starts to matter. BF16 stays useful as a reference point, but on this hardware it feels less like the practical default.

That’s exactly why I wanted to do these measurements. A universal ranking helps little; better architecture choices do.

The question isn’t: which quantization level wins?

The question is: which task is allowed on which precision, on which machine, with how much risk?

That’s where on-prem AI starts getting interesting for me: at the division of work.
