Local AI is mostly quantization in a trench coat. Every VRAM number on this site assumes Q4_K_M, the community default. This page explains what that means and why it works.
Think of it like JPEG compression. A photo at JPEG 100% is huge but perfect. At 80% it's half the size and you can't tell the difference. At 40% you start seeing artifacts. At 10% it's a blurry mess — but loads instantly anywhere.
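Quantization works the same way for AI models. The original 16-bit weights are the "full resolution." Compressing to roughly 4 bits per weight cuts VRAM by about 70% (Q4_K_M averages ~4.8 bits, so it saves slightly less than the 75% a pure 4-bit format would) and is barely distinguishable from full quality. Go below 3-bit and things start breaking.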
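To make the idea concrete, here is a toy sketch of what "compressing weights" means: each block of weights is stored as small integers plus one shared scale, then multiplied back out at inference time. This is plain symmetric round-to-nearest quantization, used only for illustration. It is not the actual Q4_K_M scheme, and the block size of 32, the Gaussian test weights, and the function names are all assumptions made for the example.

```python
import random

def quantize_block(weights, bits=4):
    """Symmetric round-to-nearest quantization of one block of floats."""
    qmax = 2 ** (bits - 1) - 1                      # e.g. 7 for signed 4-bit codes
    scale = max(abs(w) for w in weights) / qmax or 1.0  # fall back to 1.0 for an all-zero block
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return scale, codes

def dequantize_block(scale, codes):
    """Recover approximate floats from the integer codes and the shared scale."""
    return [scale * c for c in codes]

def mean_abs_error(weights, bits):
    """Average absolute difference between original and round-tripped weights."""
    scale, codes = quantize_block(weights, bits)
    restored = dequantize_block(scale, codes)
    return sum(abs(a - b) for a, b in zip(weights, restored)) / len(weights)

random.seed(0)
block = [random.gauss(0.0, 0.02) for _ in range(32)]  # hypothetical weight block
for bits in (8, 4, 3, 2):
    print(f"{bits}-bit: mean abs error = {mean_abs_error(block, bits):.5f}")
```

The error grows as the bit width shrinks, which is the numerical version of the JPEG artifacts described above. Real formats add tricks (per-block minimums, mixed precision across tensors) to push that error down at the same size.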
Llama 3.3 70B at every level
Same model, seven compression formats, with disk size, VRAM need, and quality preserved at each level. The Q4_K_M row is what you get by default.
| Level | Bits | Disk | VRAM | Quality | Like JPEG… |
|---|---|---|---|---|---|
| FP16 | 16-bit | ~140 GB | ~142 GB | Perfect — baseline | 100% — lossless |
| Q8_0 | 8-bit | ~70 GB | ~74 GB | Essentially identical | 95% — can't tell |
| Q6_K | 6.6-bit | ~58 GB | ~62 GB | Extremely close | 90% — pixel peeping only |
| Q4_K_M | 4.8-bit | ~42 GB | ~46 GB | 99.5% preserved | 80% — the sweet spot |
| Q3_K_M | 3.9-bit | ~34 GB | ~38 GB | Subtle degradation | 60% — trained eyes notice |
| Q2_K | 2.6-bit | ~24 GB | ~28 GB | Noticeable loss | 30% — artifacts visible |
| IQ2_XS | 2.3-bit | ~21 GB | ~25 GB | Significant loss | 15% — runs, barely |
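The disk column follows from simple arithmetic: parameters times bits per weight, divided by 8 bits per byte. A rough sketch of that calculation is below. The flat 4 GB runtime overhead is an assumption read off the gap between the Disk and VRAM columns, not a measured value, and the function names are made up for this example.

```python
def weights_gb(params_billion, bits_per_weight):
    """Approximate weight size in GB: parameters x bits / 8 bits per byte."""
    return params_billion * bits_per_weight / 8

def vram_gb(params_billion, bits_per_weight, overhead_gb=4.0):
    """Weights plus an assumed flat allowance for KV cache and runtime buffers."""
    return weights_gb(params_billion, bits_per_weight) + overhead_gb

# Average bits per weight, taken from the table above.
LEVELS = {
    "FP16": 16, "Q8_0": 8, "Q6_K": 6.6, "Q4_K_M": 4.8,
    "Q3_K_M": 3.9, "Q2_K": 2.6, "IQ2_XS": 2.3,
}

for name, bits in LEVELS.items():
    print(f"{name:8s} ~{weights_gb(70, bits):6.1f} GB on disk, "
          f"~{vram_gb(70, bits):6.1f} GB VRAM")
```

For a 70B model the results land within a couple of GB of the table. Treat it as a quick sanity check before downloading, not a guarantee that a model will fit.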
What breaks first
Not all tasks degrade equally. Most-sensitive at top, most-resilient at bottom. Below Q4 is where this matters.
Three actual tradeoffs
Theory is fine. Here's what these decisions look like with real models on real hardware.
Two strong options at 24 GB:
- Qwen3 32B @ Q4_K_M: fits easily (~22 GB). Best quality-per-GB at this tier. Excellent for coding, reasoning, long-context. (Q8 would be ~37 GB and overflow a single 24 GB card.)
- Llama 3.3 70B @ Q4_K_M: does not fit (~46 GB, so about half the model must be offloaded to CPU; even Q3 at ~38 GB overflows a 24 GB card). More raw capability, but offloading costs speed, and dropping to Q3 costs math precision and complex reasoning.
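To get a feel for how much offloading the 70B option implies, here is a back-of-the-envelope split of layers between GPU and CPU. It assumes all layers are the same size and reserves 2 GB of VRAM for everything else; both are simplifications. The 80-layer count for Llama 3.3 70B and the helper name are assumptions for this sketch.

```python
import math

def gpu_layer_split(model_gb, vram_gb, n_layers, reserve_gb=2.0):
    """Return (layers_on_gpu, layers_on_cpu) assuming equal-sized layers."""
    per_layer_gb = model_gb / n_layers
    budget_gb = max(0.0, vram_gb - reserve_gb)      # VRAM left for weights
    on_gpu = min(n_layers, math.floor(budget_gb / per_layer_gb))
    return on_gpu, n_layers - on_gpu

# Llama 3.3 70B at Q4_K_M (~46 GB) on a 24 GB card, assuming 80 layers.
gpu, cpu = gpu_layer_split(model_gb=46, vram_gb=24, n_layers=80)
print(f"Q4_K_M: {gpu} layers on GPU, {cpu} on CPU")

# Same model at Q3 (~38 GB): more layers fit, but still not all of them.
gpu, cpu = gpu_layer_split(model_gb=38, vram_gb=24, n_layers=80)
print(f"Q3:     {gpu} layers on GPU, {cpu} on CPU")
```

The GPU layer count is the kind of number you would pass to llama.cpp's `--n-gpu-layers` option. In practice you would start from an estimate like this and adjust until the model loads without running out of memory.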
Two paths to a capable coding assistant:
- Phi-4 14B @ Q8: ~16 GB, fills the card. Full-quality reasoning and coding. 16K context. Rock solid for structured tasks.
- GLM-4.7-Flash 30B MoE @ Q4: ~20 GB total, but only 3B parameters are active per token. 59.2% on SWE-bench. Needs some CPU offloading on a 16 GB card, but the MoE speed advantage means it's still fast.
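The reason a 30B MoE can feel faster than a 14B dense model is that memory is sized by total parameters while per-token work is sized by active parameters. A rough sketch of that distinction follows. It ignores shared layers, router overhead, and KV cache reads, and the 4.8 bits per weight for "Q4" is an assumption carried over from the Q4_K_M row earlier on this page.

```python
def moe_footprint(total_params_b, active_params_b, bits_per_weight):
    """Return (resident_gb, per_token_gb) for a model at a given bit width.

    resident_gb:  weights that must be stored somewhere (VRAM or system RAM)
    per_token_gb: weights actually read to generate one token
    """
    bytes_per_param = bits_per_weight / 8
    resident_gb = total_params_b * bytes_per_param
    per_token_gb = active_params_b * bytes_per_param
    return resident_gb, per_token_gb

# 30B MoE with 3B active at ~4.8 bits vs a dense 14B model at 8 bits.
moe = moe_footprint(total_params_b=30, active_params_b=3, bits_per_weight=4.8)
dense = moe_footprint(total_params_b=14, active_params_b=14, bits_per_weight=8)
print(f"MoE:   {moe[0]:.1f} GB resident, {moe[1]:.1f} GB read per token")
print(f"Dense: {dense[0]:.1f} GB resident, {dense[1]:.1f} GB read per token")
```

The MoE needs more memory to hold but touches far fewer weights per token, which is why some CPU offloading hurts it less than it would hurt a dense model of the same total size.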
The extreme case — a 109B model on consumer hardware:
- An ultra-low-bit quant (~1.78 bits per weight, below even the 2.3-bit IQ2_XS level) brings the weights down to ~24 GB
- The 10M context window still works but KV cache eats VRAM fast
- MoE helps — only 17B params activate per token, so speed stays usable
- Quality: general chat and simple tasks fine. Complex coding and math noticeably degrade.
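The "KV cache eats VRAM fast" point can be put in numbers. For standard attention the cache stores one key and one value vector per layer, per KV head, per token. The sketch below applies that formula. The architecture values (48 layers, 8 KV heads, head dimension 128) and the 16-bit cache are illustrative assumptions, not confirmed specs for any particular model, and attention variants that shrink the cache are ignored.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_tokens, bytes_per_value=2):
    """Size of a standard attention KV cache in bytes.

    The leading 2 accounts for storing both keys and values.
    """
    return 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_value

GIB = 2 ** 30

# Hypothetical config: 48 layers, 8 KV heads, head_dim 128, fp16 cache.
for ctx in (8_192, 131_072, 10_000_000):
    size = kv_cache_bytes(48, 8, 128, ctx)
    print(f"{ctx:>10,} tokens -> {size / GIB:8.1f} GiB")
```

Under these assumptions the cache costs 1.5 GiB at 8K tokens and 24 GiB at 128K tokens, as much as the quantized weights themselves. A full 10M-token window would need well over a terabyte, so on consumer hardware the usable context is set by leftover VRAM rather than by the advertised maximum.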
Technical Notes — Formats, MoE, Context Scaling
- GGUF: the llama.cpp/Ollama format. Best for splitting a model across CPU and GPU, and available for most models.
- GPTQ: GPU-only, fast for batched inference.
- AWQ: activation-aware, better quality per bit.
- EXL2: variable bits per layer, for the best quality at a given size.
For Ollama, use GGUF. For vLLM, use AWQ or GPTQ.
`ollama pull llama3.3` gives you Q4_K_M by default. For specific quants, use `ollama pull llama3.3:70b-instruct-q8_0` or `ollama pull llama3.3:70b-instruct-q2_K`. Check the available tags at ollama.com/library.
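If you script your downloads, a small helper can assemble the tag for you. The sketch below builds the `model:size-variant-quant` string and the matching `ollama pull` command. The tag layout is inferred only from the two examples above, so it is an assumption that other models follow the same pattern, and the helper names are hypothetical. The command is printed rather than executed.

```python
import shlex

def ollama_tag(model, size=None, variant=None, quant=None):
    """Build a tag like 'llama3.3:70b-instruct-q8_0'; bare model name if no parts given."""
    parts = [p for p in (size, variant, quant) if p]
    return model if not parts else f"{model}:{'-'.join(parts)}"

def pull_command(tag):
    """Return the argv list for pulling a tag (suitable for subprocess.run)."""
    return ["ollama", "pull", tag]

# Default pull vs. two explicit quant levels.
for tag in (
    ollama_tag("llama3.3"),
    ollama_tag("llama3.3", "70b", "instruct", "q8_0"),
    ollama_tag("llama3.3", "70b", "instruct", "q2_K"),
):
    print(shlex.join(pull_command(tag)))
```

Not every combination exists for every model, so a tag built this way should still be checked against the model's page in the library before pulling.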