Local AI is mostly quantization in a trench coat. Every VRAM number on this site assumes Q4_K_M, the community default. This page explains what that means and why it works.

Think of it like JPEG compression. A photo at JPEG 100% is huge but perfect. At 80% it's half the size and you can't tell the difference. At 40% you start seeing artifacts. At 10% it's a blurry mess — but loads instantly anywhere.

Quantization works the same way for AI models. The original 16-bit weights are the "full resolution." Compressing to 4-bit cuts VRAM by roughly 70–75% with output that is barely distinguishable from full quality. Go below 3-bit and things start breaking.
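To make the idea concrete, here is a toy sketch of rounding a block of weights to 4-bit integers and back. It assumes a simple symmetric "absmax" scheme with one scale per block; this is an illustration only, not the actual Q4_K_M algorithm, and the function names and sample weights are made up for the example.

```python
# Toy symmetric quantization of one block of weights.
# Illustrative only: real GGUF formats use more elaborate schemes.

def quantize_block(weights, bits=4):
    """Map floats to signed integer codes in [-qmax, qmax] plus one scale."""
    qmax = 2 ** (bits - 1) - 1                      # 7 for 4-bit, 1 for 2-bit
    scale = max(abs(w) for w in weights) / qmax or 1.0
    codes = [round(w / scale) for w in weights]
    return codes, scale

def dequantize_block(codes, scale):
    """Recover approximate floats from integer codes."""
    return [c * scale for c in codes]

def max_error(weights, bits):
    """Largest absolute difference after a quantize/dequantize round trip."""
    codes, scale = quantize_block(weights, bits)
    restored = dequantize_block(codes, scale)
    return max(abs(w - r) for w, r in zip(weights, restored))

# Hypothetical sample weights, chosen only for the demonstration.
sample = [0.12, -0.53, 0.91, -0.07, 0.33, -0.88, 0.45, 0.02]
for bits in (8, 4, 2):
    print(f"{bits}-bit max round-trip error: {max_error(sample, bits):.4f}")
```

The round-trip error grows as the bit width shrinks, which is the same tradeoff the JPEG analogy describes.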

The Compression Table

Llama 3.3 70B at every level

Same model, seven compression formats. Disk size, VRAM need, and quality preserved at each level. The starred row is what you get by default.

| Level | Bits | Disk | VRAM | Quality | Like JPEG… |
| --- | --- | --- | --- | --- | --- |
| FP16 | 16-bit | ~140 GB | ~142 GB | Perfect (baseline) | 100%: lossless |
| Q8_0 | 8-bit | ~70 GB | ~74 GB | Essentially identical | 95%: can't tell |
| Q6_K | 6.6-bit | ~58 GB | ~62 GB | Extremely close | 90%: pixel peeping only |
| Q4_K_M ★ | 4.8-bit | ~42 GB | ~46 GB | 99.5% preserved | 80%: the sweet spot |
| Q3_K_M | 3.9-bit | ~34 GB | ~38 GB | Subtle degradation | 60%: trained eyes notice |
| Q2_K | 2.6-bit | ~24 GB | ~28 GB | Noticeable loss | 30%: artifacts visible |
| IQ2_XS | 2.3-bit | ~21 GB | ~25 GB | Significant loss | 15%: runs, barely |

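The disk figures in the table follow closely from a simple rule: parameters times bits per weight, divided by 8 bits per byte. A minimal sketch, assuming the bits-per-weight values listed in the table and a 70B parameter count (the function name is hypothetical):

```python
def weight_size_gb(params_billions, bits_per_weight):
    """Approximate weight size in GB: parameters x bits / 8 bits per byte."""
    return params_billions * bits_per_weight / 8

# Bits-per-weight figures as listed in the table above.
BITS = {
    "FP16": 16, "Q8_0": 8, "Q6_K": 6.6, "Q4_K_M": 4.8,
    "Q3_K_M": 3.9, "Q2_K": 2.6, "IQ2_XS": 2.3,
}

for level, bits in BITS.items():
    print(f"{level:7s} ~{weight_size_gb(70, bits):.0f} GB")
```

The VRAM column in the table runs a few GB above these estimates, since the weights are not the only thing that has to live in memory at runtime.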
Sensitivity

What breaks first

Not all tasks degrade equally. Most-sensitive at top, most-resilient at bottom. Below Q4 is where this matters.

| Task type | Sensitivity | What happens below Q4 |
| --- | --- | --- |
| Math & Logic | Critical | Arithmetic errors, proof steps skipped, wrong answers to multi-step problems |
| Structured Code | High | Syntax errors, wrong function signatures, broken JSON / API outputs |
| Reasoning Chains | Medium | Loses track of multi-step logic, contradicts itself, shallow analysis |
| Instruction Following | Medium | Ignores constraints, drifts from format, misses parts of complex prompts |
| Creative Writing | Low | Slightly less nuanced vocabulary, more repetitive patterns |
| Casual Chat | Minimal | Basically unaffected: conversational quality holds well even at Q3 |

Real Scenarios

Three actual tradeoffs

Theory is fine. Here's what these decisions look like with real models on real hardware.

Scenario 1 · RTX 4090 (24 GB)
Quality vs. Power: 32B@Q4 or 70B@Q4?

Two strong options at 24 GB:

  • Qwen3 32B @ Q4_K_M: fits (~22 GB), though with little headroom left. Best quality-per-GB at this tier. Excellent for coding, reasoning, long-context. (Q8 would be ~37 GB and overflow a single 24 GB card.)
  • Llama 3.3 70B @ Q4_K_M: does not fit (~46 GB), so it needs CPU offloading; even Q3 at ~38 GB still spills past 24 GB. More raw capability, but offloading costs speed and dropping to Q3 costs you math precision and complex reasoning.

Rule of thumb: A smaller model at higher quant often beats a larger model at lower quant, especially for coding and structured output. The 32B@Q4 will typically generate cleaner code than the 70B@Q3.
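The "does it fit?" check in this scenario can be sketched as a lookup over the VRAM column of the compression table. The helper name is hypothetical, and the figures are the Llama 3.3 70B numbers from that table, so treat this as a rough screen rather than a guarantee:

```python
# VRAM needs for Llama 3.3 70B in GB, copied from the compression table,
# ordered from highest quality to lowest.
VRAM_GB = [
    ("FP16", 142), ("Q8_0", 74), ("Q6_K", 62), ("Q4_K_M", 46),
    ("Q3_K_M", 38), ("Q2_K", 28), ("IQ2_XS", 25),
]

def best_fit(vram_gb, table=VRAM_GB):
    """Return the highest-quality level that fits fully in VRAM, else None."""
    for level, need in table:
        if need <= vram_gb:
            return level
    return None  # nothing fits without CPU offloading

for card in (24, 48, 80):
    print(f"{card} GB card -> {best_fit(card)}")
```

A result of None for a 24 GB card is the point of the scenario: every 70B row in the table needs offloading on that hardware, which is why the 32B model is the practical pick.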
Scenario 2 · 16 GB GPU
Coding assistant: 14B@Q8 or 30B MoE@Q4?

Two paths to a capable coding assistant:

  • Phi-4 14B @ Q8: ~16 GB, fills the card. Full-quality reasoning and coding. 16K context. Rock solid for structured tasks.
  • GLM-4.7-Flash 30B MoE @ Q4 — ~20 GB total but only 3B active. 59.2% SWE-bench. Needs some CPU offloading but MoE speed advantage means it's still fast.
Verdict: GLM-4.7 wins for coding specifically — MoE keeps speed high even with offloading, and SWE-bench score crushes Phi-4. For general-purpose reasoning, Phi-4@Q8 is the safer choice.
Scenario 3 · Stretching the limits
Running Llama 4 Scout (109B) on 24 GB

The extreme case — a 109B model on consumer hardware:

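  • A ~1.78-bit quant brings the weights down to ~24 GB (109B × 1.78 ÷ 8 ≈ 24 GB). That is lower than the 2.3-bit IQ2_XS row in the table above, which would be ~31 GB for this model.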
  • The 10M context window still works but KV cache eats VRAM fast
  • MoE helps — only 17B params activate per token, so speed stays usable
  • Quality: general chat and simple tasks fine. Complex coding and math noticeably degrade.
Verdict: Impressive that it runs at all. Fine for casual use and testing. For production agent work, you'd want this at Q4+ on multi-GPU — or pick a smaller model at higher quant.
Technical Notes — Formats, MoE, Context Scaling
GGUF vs GPTQ vs AWQ vs EXL2
  • GGUF = llama.cpp/Ollama format. Best for CPU+GPU split; most models are published in it.
  • GPTQ = GPU-only, fast batch inference.
  • AWQ = activation-aware, better quality per bit.
  • EXL2 = variable bits per layer, best quality at a given size.
For Ollama: GGUF. For vLLM: AWQ/GPTQ.
MoE Memory Trap
MoE models load ALL parameters into VRAM but only activate a subset per token. A 236B MoE with 22B active still needs roughly 120–140 GB at 4-bit (about 120 GB at a flat 4 bits per weight, about 140 GB at Q4_K_M's 4.8) because every expert weight must be resident. Speed is set by active params. VRAM is set by total params.
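The arithmetic behind the MoE trap can be sketched as follows. The function name is hypothetical and the bits-per-weight value is an assumption you should set to match your quant: a flat 4 bits gives about 118 GB for a 236B model, while the 4.8 bits listed for Q4_K_M gives about 142 GB.

```python
def moe_footprint_gb(total_b, active_b, bits_per_weight):
    """Return (resident_gb, touched_per_token_gb) for an MoE model.

    resident_gb: every expert must be loaded, so this uses TOTAL params.
    touched_per_token_gb: only active params are read per token, which is
    what drives generation speed.
    """
    resident = total_b * bits_per_weight / 8
    touched = active_b * bits_per_weight / 8
    return resident, touched

for bits in (4.0, 4.8):
    resident, touched = moe_footprint_gb(236, 22, bits)
    print(f"{bits} bits: ~{resident:.0f} GB resident, ~{touched:.0f} GB read per token")
```

The gap between the two numbers is why an MoE can feel fast once loaded yet still refuse to fit on the card.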
Context Length Eats VRAM
KV cache grows linearly with context. At 8K: +1–2 GB. At 32K: +4–8 GB. At 128K: +20–40 GB. Models with GQA (Llama 3) or MLA (DeepSeek) are much more efficient. Nemotron's 1M context uses linear attention to keep it manageable.
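A rough way to estimate KV cache size for a standard transformer is: 2 tensors (keys and values) × layers × KV heads × head dimension × context tokens × bytes per value. The sketch below assumes an FP16 cache and an 8B-class Llama 3 shape (32 layers, 8 KV heads with GQA, head dimension 128); those shape numbers are assumptions for illustration, so check the config of the model you actually run.

```python
def kv_cache_gib(n_layers, n_kv_heads, head_dim, context_tokens, bytes_per_value=2):
    """Approximate KV cache size in GiB for a standard attention stack."""
    # Keys and values (the factor of 2) are stored per layer, per KV head.
    total_bytes = (2 * n_layers * n_kv_heads * head_dim
                   * context_tokens * bytes_per_value)
    return total_bytes / 2**30

# Assumed 8B-class shape: 32 layers, 8 KV heads (GQA), head_dim 128.
for ctx in (8_192, 32_768, 131_072):
    print(f"{ctx:>7} tokens: {kv_cache_gib(32, 8, 128, ctx):.0f} GiB")

# Same shape without GQA (32 KV heads): 4x the cache at the same context.
print(f"no GQA at 8K: {kv_cache_gib(32, 32, 128, 8_192):.0f} GiB")
```

Larger models have more layers and so sit higher in the ranges quoted above; cutting the number of KV heads is what makes GQA models cheaper per token of context.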
CPU Offloading
If a model doesn't fit entirely in VRAM, llama.cpp and Ollama can split its layers between GPU and system RAM. Each offloaded layer runs ~3–10× slower. A 70B@Q4 running 50/50 GPU/CPU is ~3× slower than full GPU, but it runs.
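The slowdown figures can be reconciled with a simple model: if per-token time is the sum of per-layer times and all layers cost the same on the GPU, the overall slowdown is a weighted average of 1× (GPU layers) and the CPU penalty (offloaded layers). This is a simplifying assumption that ignores transfer overhead, and the function name is hypothetical.

```python
def offload_slowdown(cpu_fraction, cpu_layer_penalty):
    """Estimated slowdown vs. full-GPU when a fraction of layers runs on CPU.

    cpu_fraction: share of layers offloaded, between 0.0 and 1.0.
    cpu_layer_penalty: how many times slower one offloaded layer runs.
    """
    if not 0.0 <= cpu_fraction <= 1.0:
        raise ValueError("cpu_fraction must be between 0 and 1")
    return (1 - cpu_fraction) + cpu_fraction * cpu_layer_penalty

# A 50/50 split across the quoted 3-10x per-layer penalty range.
for penalty in (3, 5, 10):
    print(f"penalty {penalty}x -> {offload_slowdown(0.5, penalty):.1f}x slower overall")
```

Under this model, a 50/50 split at a 5× per-layer penalty works out to 3× overall, which lines up with the figure quoted above.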
Ollama Quant Selection
ollama pull llama3.3 gives you Q4_K_M by default. For specific quants: ollama pull llama3.3:70b-instruct-q8_0 or ollama pull llama3.3:70b-instruct-q2_K. Check available tags on ollama.com/library.
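If you script your pulls, a small helper can assemble the command line from a model name and an optional tag. This is a sketch with a hypothetical helper name; it only builds the command and does not run it, and the tags shown are the ones quoted above, so verify them against the library page before relying on them.

```python
import shlex

def ollama_pull_command(model, tag=None):
    """Build (but do not run) an `ollama pull` command as an argument list."""
    name = f"{model}:{tag}" if tag else model
    return ["ollama", "pull", name]

# Default tag, then two explicit quant tags from the text above.
print(shlex.join(ollama_pull_command("llama3.3")))
print(shlex.join(ollama_pull_command("llama3.3", "70b-instruct-q8_0")))
print(shlex.join(ollama_pull_command("llama3.3", "70b-instruct-q2_K")))
```

To actually download, the returned list can be passed to subprocess.run(cmd, check=True) on a machine with Ollama installed. Tag case is preserved as written, since the examples above mix q8_0 and q2_K.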
Unsloth Dynamic Quants
Unsloth's "Dynamic 2.0" method keeps important layers at higher precision (8 or 16-bit) while compressing less critical ones further. A 4-bit Dynamic quant often matches standard Q5 quality. Look for "Unsloth" GGUF uploads on Hugging Face.
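Mixed-precision quants are easiest to reason about through their average bits per weight. The sketch below computes that average from a list of (fraction of weights, bits) pairs. The split used in the example is purely hypothetical and is not Unsloth's actual layer selection or ratios.

```python
def effective_bits(layer_mix):
    """Average bits per weight for a mix of (fraction_of_weights, bits) pairs."""
    total = sum(fraction for fraction, _ in layer_mix)
    if abs(total - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    return sum(fraction * bits for fraction, bits in layer_mix)

# Hypothetical split: 10% of weights kept at 8-bit, 90% compressed to 4-bit.
mix = [(0.10, 8), (0.90, 4)]
avg = effective_bits(mix)
print(f"average bits per weight: {avg:.1f}")
print(f"70B model at this mix: ~{70 * avg / 8:.1f} GB")
```

Keeping a small share of weights at high precision raises the average only slightly, which is how such a quant can stay near 4-bit size while protecting the layers that matter most.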