Quantization Explained — Local AI Models

Local AI is mostly quantization in a trench coat. Every VRAM number on this site assumes Q4_K_M — the community default. This page is what that means and why it works.

Think of it like JPEG compression. A photo at JPEG 100% is huge but perfect. At 80% it's half the size and you can't tell the difference. At 40% you start seeing artifacts. At 10% it's a blurry mess — but loads instantly anywhere.

Quantization works the same way for AI models. The original 16-bit weights are the "full resolution." Compressing to around 4 bits per weight cuts VRAM by roughly two-thirds (a pure 4-bit format would save 75%, but Q4_K_M averages about 4.8 bits) and is barely distinguishable from full quality. Go below 3-bit and things start breaking.
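To make the idea concrete, here is a minimal sketch of block-wise round-to-nearest quantization in Python. It is a simplified illustration, not the actual Q4_K_M algorithm used by llama.cpp: the function names, the block size of 32, and the random Gaussian "weights" are all assumptions chosen for the demo.

```python
import random

def quantize_block(weights, bits=4):
    """Symmetric round-to-nearest quantization of one block of floats."""
    qmax = 2 ** (bits - 1) - 1                      # 7 for 4-bit, 127 for 8-bit
    scale = max(abs(w) for w in weights) / qmax or 1.0  # one scale per block
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return scale, codes

def dequantize_block(scale, codes):
    """Rebuild approximate floats from the stored scale and integer codes."""
    return [scale * c for c in codes]

def mean_abs_error(bits, block_size=32, n_blocks=64, seed=0):
    """Average round-trip error over random blocks (hypothetical test data)."""
    rng = random.Random(seed)
    total, count = 0.0, 0
    for _ in range(n_blocks):
        block = [rng.gauss(0.0, 0.02) for _ in range(block_size)]
        scale, codes = quantize_block(block, bits)
        restored = dequantize_block(scale, codes)
        total += sum(abs(a - b) for a, b in zip(block, restored))
        count += block_size
    return total / count

if __name__ == "__main__":
    for bits in (8, 4, 3, 2):
        print(f"{bits}-bit mean abs error: {mean_abs_error(bits):.6f}")
```

Running it should show the round-trip error growing as the bit width shrinks, which is the same trade the JPEG analogy describes. Real formats add refinements such as per-block minimums and mixed precision across tensors.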

The Compression Table

Llama 3.3 70B at every level

Same model, seven compression formats, with disk size, VRAM requirement, and preserved quality at each level. The Q4_K_M row is what you get by default.

Level	Bits	Disk	VRAM	Quality	Like JPEG…
FP16	16-bit	~140 GB	~142 GB	Perfect — baseline	100% — lossless
Q8_0	8-bit	~70 GB	~74 GB	Essentially identical	95% — can't tell
Q6_K	6.6-bit	~58 GB	~62 GB	Extremely close	90% — pixel peeping only
Q4_K_M	4.8-bit	~42 GB	~46 GB	99.5% preserved	80% — the sweet spot
Q3_K_M	3.9-bit	~34 GB	~38 GB	Subtle degradation	60% — trained eyes notice
Q2_K	2.6-bit	~24 GB	~28 GB	Noticeable loss	30% — artifacts visible
IQ2_XS	2.3-bit	~21 GB	~25 GB	Significant loss	15% — runs, barely
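
The disk column follows almost directly from parameters times bits per weight. A rough sketch of that arithmetic, assuming a round 70 billion parameters and the average bits-per-weight values from the table (the function name is illustrative):

```python
def model_size_gb(params_billion, bits_per_weight):
    """Approximate weight size in GB: parameters x bits, divided by 8 bits per byte."""
    return params_billion * bits_per_weight / 8

# Average bits per weight, copied from the table above.
LEVELS = {
    "FP16": 16,
    "Q8_0": 8,
    "Q6_K": 6.6,
    "Q4_K_M": 4.8,
    "Q3_K_M": 3.9,
    "Q2_K": 2.6,
    "IQ2_XS": 2.3,
}

if __name__ == "__main__":
    for name, bits in LEVELS.items():
        print(f"{name:8s} ~{model_size_gb(70, bits):.0f} GB on disk")
```

The estimate lands within a GB or two of the table for most rows. The VRAM column sits a few GB above the disk size because the runtime also needs room for the KV cache and working buffers.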

Sensitivity

What breaks first

Not all tasks degrade equally. Most-sensitive at top, most-resilient at bottom. Below Q4 is where this matters.

Task	Sensitivity	What goes wrong
Math & Logic	Critical	Arithmetic errors, proof steps skipped, wrong answers to multi-step problems
Structured Code	High	Syntax errors, wrong function signatures, broken JSON / API outputs
Reasoning Chains	Medium	Loses track of multi-step logic, contradicts itself, shallow analysis
Instruction Following	Medium	Ignores constraints, drifts from format, misses parts of complex prompts
Creative Writing	Low	Slightly less nuanced vocabulary, more repetitive patterns
Casual Chat	Minimal	Basically unaffected; conversational quality holds well even at Q3

Real Scenarios

Three actual tradeoffs

Theory is fine. Here's what these decisions look like with real models on real hardware.

Scenario 1 · RTX 4090 (24 GB)

Quality vs. Power: 32B@Q4 or 70B@Q4?

Two strong options at 24 GB:

Qwen3 32B @ Q4_K_M: fits on the card (~22 GB). Best quality-per-GB at this tier. Excellent for coding, reasoning, long-context. (Q8 would be ~37 GB and overflow a single 24 GB card.)
Llama 3.3 70B @ Q4_K_M: does not fit on the card. At ~46 GB it needs roughly half its layers offloaded to system RAM, and even Q3 at ~38 GB overflows 24 GB. More raw capability, but offloading costs speed, and dropping to Q3 costs math precision and complex reasoning.

Rule of thumb: A smaller model at higher quant often beats a larger model at lower quant, especially for coding and structured output. Here, the 32B@Q4_K_M running fully on the GPU will usually generate cleaner code than the 70B@Q3.
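
A quick way to sanity-check options like these is to compare each VRAM estimate against the card. The sketch below uses the figures quoted in this scenario; the helper name and the simple "overflow divided by total" offload estimate are assumptions for illustration, not how any particular runtime plans its layer split.

```python
def offload_plan(vram_needed_gb, card_gb):
    """Report whether a model fits and, if not, how much spills to system RAM."""
    overflow = max(0.0, vram_needed_gb - card_gb)
    return {
        "fits": overflow == 0.0,
        "overflow_gb": overflow,
        "offload_fraction": overflow / vram_needed_gb,
    }

# VRAM estimates taken from the scenario text above.
OPTIONS = {
    "Qwen3 32B @ Q4_K_M": 22,
    "Qwen3 32B @ Q8": 37,
    "Llama 3.3 70B @ Q4_K_M": 46,
    "Llama 3.3 70B @ Q3": 38,
}

if __name__ == "__main__":
    for name, need in OPTIONS.items():
        plan = offload_plan(need, card_gb=24)
        status = "fits" if plan["fits"] else f"offload ~{plan['offload_fraction']:.0%}"
        print(f"{name:24s} {need:3d} GB  {status}")
```

Only the first option runs entirely on a 24 GB card. Everything else pays an offloading penalty before quality even enters the comparison.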

Scenario 2 · 16 GB GPU

Coding assistant: 14B@Q8 or 30B MoE@Q4?

Two paths to a capable coding assistant:

Phi-4 14B @ Q8: ~16 GB, fills the card and leaves little headroom for KV cache. Essentially full-quality reasoning and coding. 16K context. Rock solid for structured tasks.
GLM-4.7-Flash 30B MoE @ Q4: ~20 GB total but only 3B active. 59.2% SWE-bench. Needs some CPU offloading, but the MoE speed advantage means it's still fast.

Verdict: GLM-4.7 wins for coding specifically — MoE keeps speed high even with offloading, and SWE-bench score crushes Phi-4. For general-purpose reasoning, Phi-4@Q8 is the safer choice.

Scenario 3 · Stretching the limits

Running Llama 4 Scout (109B) on 24 GB

The extreme case — a 109B model on consumer hardware:

An ultra-low ~1.78-bit quant (below the 2.3-bit IQ2_XS row in the table above) brings it down to ~24 GB VRAM
The 10M context window still works but KV cache eats VRAM fast
MoE helps — only 17B params activate per token, so speed stays usable
Quality: general chat and simple tasks fine. Complex coding and math noticeably degrade.

Verdict: Impressive that it runs at all. Fine for casual use and testing. For production agent work, you'd want this at Q4+ on multi-GPU — or pick a smaller model at higher quant.

Technical Notes — Formats, MoE, Context Scaling

GGUF vs GPTQ vs AWQ vs EXL2

GGUF: llama.cpp/Ollama format. Best for CPU+GPU split, and available for most models.
GPTQ: GPU-only, fast batched inference.
AWQ: activation-aware, better quality per bit.
EXL2: variable bits per layer, best quality at a given size.
For Ollama, use GGUF. For vLLM, use AWQ or GPTQ.

MoE Memory Trap

MoE models load ALL parameters into VRAM but only activate a subset per token. A 236B MoE with 22B active still needs ~120 GB at Q4 because every expert weight must be resident. Speed is set by active params. VRAM is set by total params.
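
The arithmetic behind that example, as a small sketch. It assumes a flat 4.0 bits per weight, which reproduces the ~120 GB figure; a real Q4_K_M file averages closer to 4.8 bits and would come out larger. The function name is illustrative.

```python
def moe_footprint(total_params_b, active_params_b, bits_per_weight):
    """Return (resident_gb, touched_per_token_gb) for an MoE model.

    resident_gb: every expert must sit in memory, so this uses total params.
    touched_per_token_gb: only the active experts are read per token,
    which is what drives generation speed.
    """
    resident_gb = total_params_b * bits_per_weight / 8
    touched_gb = active_params_b * bits_per_weight / 8
    return resident_gb, touched_gb

if __name__ == "__main__":
    resident, touched = moe_footprint(236, 22, bits_per_weight=4.0)
    print(f"Resident weights: ~{resident:.0f} GB")
    print(f"Read per token:   ~{touched:.0f} GB")
```

The gap between the two numbers is the trap: the model runs at roughly the speed of a 22B dense model but needs the memory of a 236B one.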

Context Length Eats VRAM

KV cache grows linearly with context. At 8K: +1–2 GB. At 32K: +4–8 GB. At 128K: +20–40 GB. Models with GQA (Llama 3) or MLA (DeepSeek) are much more efficient. Nemotron's 1M context uses linear attention to keep it manageable.
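
A rough sketch of where those numbers come from. The standard estimate is two tensors (keys and values) per layer, each storing one vector per KV head per token. The shape below (32 layers, 8 KV heads, head dimension 128, 16-bit cache) is assumed here as a Llama-3-8B-style configuration; larger models and unquantized caches scale the result up.

```python
def kv_cache_gib(n_layers, n_kv_heads, head_dim, context_tokens, bytes_per_value=2):
    """Estimate KV cache size in GiB for a single sequence."""
    # 2 = one key tensor plus one value tensor per layer.
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_value
    return total_bytes / 2**30

if __name__ == "__main__":
    for ctx in (8 * 1024, 32 * 1024, 128 * 1024):
        gqa = kv_cache_gib(32, 8, 128, ctx)    # grouped-query attention: 8 KV heads
        mha = kv_cache_gib(32, 32, 128, ctx)   # hypothetical: same model without GQA
        print(f"{ctx // 1024:4d}K context: GQA ~{gqa:.0f} GiB, full MHA ~{mha:.0f} GiB")
```

With these assumptions, GQA cuts the cache to a quarter of what full multi-head attention would need, which is why it matters so much at long context.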

CPU Offloading

If a model doesn't fit entirely in VRAM, llama.cpp and Ollama can split its layers between GPU and system RAM (Ollama does this automatically; llama.cpp takes the GPU layer count via --n-gpu-layers). Each offloaded layer is ~3–10× slower. A 70B@Q4 running 50/50 GPU/CPU is ~3× slower than full GPU, but it runs.
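
The ~3× figure can be reproduced with a toy model. This sketch assumes every layer costs the same time on the GPU, that an offloaded layer is slower by one fixed factor, and that transfer overhead is ignored; the function name and the factor of 5 (a value inside the 3–10× range above) are illustrative choices.

```python
def offload_slowdown(offload_fraction, cpu_layer_penalty):
    """Relative time per token versus running every layer on the GPU.

    offload_fraction: share of layers running on the CPU (0.0 to 1.0).
    cpu_layer_penalty: how many times slower an offloaded layer is.
    """
    gpu_share = 1.0 - offload_fraction
    return gpu_share + offload_fraction * cpu_layer_penalty

if __name__ == "__main__":
    for fraction in (0.0, 0.25, 0.5, 0.75):
        print(f"{fraction:.0%} offloaded: {offload_slowdown(fraction, 5):.1f}x slower")
```

Because the slow layers dominate the total, even a modest offload fraction costs a lot of speed. Fitting a few more layers on the GPU is usually worth the effort.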

Ollama Quant Selection

ollama pull llama3.3 gives you Q4_K_M by default. For specific quants: ollama pull llama3.3:70b-instruct-q8_0 or ollama pull llama3.3:70b-instruct-q2_K. Check available tags on ollama.com/library.

Unsloth Dynamic Quants

Unsloth's "Dynamic 2.0" method keeps important layers at higher precision (8 or 16-bit) while compressing less critical ones further. A 4-bit Dynamic quant often matches standard Q5 quality. Look for "Unsloth" GGUF uploads on Hugging Face.
