01 — Fundamentals: What Drives LLM Inference Speed?
Running a large language model locally is fundamentally a memory bandwidth problem, not a compute problem. Every time a dense model generates a token, it must read all of its weights from VRAM. The GPU does a small amount of math on them, outputs one token, then reads all the weights again for the next token.
This means three specs determine your performance, in order of importance:
| Spec | What It Determines | Example |
|---|---|---|
| VRAM Capacity | Does the model fit on the GPU at all? | 70B Q4 needs ~40GB VRAM |
| Memory Bandwidth | How fast can you generate tokens? | RTX 3090: 936 GB/s → fast. P40: 346 GB/s → slow |
| Price per GB VRAM | How much capability per dollar? | 3090: ~$33/GB. 4090: ~$71/GB. P40: ~$10/GB |
CUDA cores, clock speed, and tensor cores matter far less than you'd expect. The same goes for the CPU: when the model fits entirely in VRAM, a Ryzen 5 5600 performs within 5% of a Ryzen 9 for token generation, because the CPU barely participates.
Approximate tok/s ≈ GPU bandwidth (GB/s) ÷ model size in VRAM (GB)
Example: RTX 3090 (936 GB/s) running a 20GB model → ~47 tok/s. This is a simplified ceiling rather than a measured speed, but it is remarkably predictive of how cards rank against each other.
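The rule of thumb above can be sketched in a few lines of Python. The function name `estimate_tok_per_s` and the optional `efficiency` factor are illustrative assumptions, not part of the original formula. Treat the result as a ceiling, since it ignores KV-cache reads and runtime overhead.

```python
def estimate_tok_per_s(bandwidth_gb_s: float, model_gb: float,
                       efficiency: float = 1.0) -> float:
    """Bandwidth-bound estimate: every generated token reads the whole model once.

    `efficiency` is a hypothetical fudge factor (0 < efficiency <= 1) for
    real-world losses; 1.0 reproduces the plain formula from the text.
    """
    if bandwidth_gb_s <= 0 or model_gb <= 0:
        raise ValueError("bandwidth and model size must be positive")
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    return efficiency * bandwidth_gb_s / model_gb


if __name__ == "__main__":
    # Bandwidth figures are the ones quoted in the spec table above.
    for card, bandwidth in [("RTX 3090", 936.0), ("P40", 346.0)]:
        ceiling = estimate_tok_per_s(bandwidth, model_gb=20.0)
        print(f"{card}: ~{ceiling:.0f} tok/s ceiling on a 20GB model")
```

Passing an `efficiency` below 1.0 is one way to bring the estimate closer to speeds you actually measure on your own hardware.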
02 — Architecture: How Mixture-of-Experts (MoE) Models Work
Traditional "dense" models (like Llama 70B) activate every parameter for every token. A 70B model uses all 70 billion parameters on every forward pass, so at Q4 it needs ~40GB of VRAM and must read that full ~40GB for each token. Its generation speed is therefore inversely proportional to that 40GB.
MoE models take a radically different approach: they have a massive total parameter count but only activate a small fraction for each token. The model contains many specialized "expert" sub-networks, and a lightweight "router" network decides which experts to activate for each input.
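As a rough illustration of the routing idea, here is a toy top-k router in plain Python. Everything in it is a simplification for illustration: the experts are scalar functions, the router scores are passed in directly rather than computed by a learned layer, and normalising weights over only the selected experts is one common choice, not the only one.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def route(router_logits, top_k):
    """Pick the top_k highest-scoring experts and normalise their weights."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:top_k]
    weights = softmax([router_logits[i] for i in chosen])
    return list(zip(chosen, weights))


def moe_layer(x, experts, router_logits, top_k=2):
    """Run only the selected experts and blend their outputs by router weight."""
    return sum(weight * experts[idx](x)
               for idx, weight in route(router_logits, top_k))


if __name__ == "__main__":
    # Four toy "experts": each just scales its input by a different factor.
    experts = [lambda x, scale=scale: scale * x for scale in (1.0, 2.0, 3.0, 4.0)]
    # In a real model these scores come from a small learned layer applied to the token.
    router_logits = [0.1, 2.0, -1.0, 1.0]
    print(route(router_logits, top_k=2))  # experts 1 and 3 are selected
    print(moe_layer(1.0, experts, router_logits))
```

Only two of the four expert functions are ever called for this input, which is the property that keeps per-token work small even when the total expert count is large.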
[Figure: Dense Model vs MoE Model. Dense (Llama 70B): all parameters fire on every token. MoE (DeepSeek V3, 671B): the router selects a few experts per token.]
Why MoE Matters for Home Inference
The critical insight: all experts must live in memory (VRAM or RAM), but only the active parameters determine compute speed. This creates a unique tradeoff:
| Property | Dense 70B | MoE 671B (37B active) |
|---|---|---|
| Total Parameters | 70B | 671B |
| Active per Token | 70B (all) | 37B (~5.5%) |
| VRAM at Q4 | ~40GB | ~260GB |
| Knowledge Breadth | 70B model's worth | 671B model's worth |
| Generation Speed | Bound by ~40GB read per token | Bound by 37B active ≈ 22GB read per token |
| Quality | GPT-4o-mini class | GPT-5 competitive |
MoE models are more capable than dense models of comparable speed: DeepSeek V3 reads less per token than a dense 70B (~22GB versus ~40GB) yet draws on 671B parameters of stored knowledge. The cost is memory, because all 671B parameters must be stored even though only 37B fire for any given token. MoE models are fast to think but expensive to store.
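To make the storage-versus-speed split concrete, the sketch below takes the figures from the comparison table as inputs and asks two separate questions: does the model fit, and how fast could it go if it did? The class and function names are hypothetical, and the speed ceiling only applies when every weight sits in memory of the stated bandwidth.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelFootprint:
    name: str
    stored_gb: float          # everything that must live in VRAM or RAM
    read_per_token_gb: float  # weights actually read for each generated token


def fits(model: ModelFootprint, memory_gb: float) -> bool:
    """Storage question: total parameters decide whether the model fits."""
    return model.stored_gb <= memory_gb


def speed_ceiling(model: ModelFootprint, bandwidth_gb_s: float) -> float:
    """Speed question: only the weights read per token bound tok/s."""
    return bandwidth_gb_s / model.read_per_token_gb


# Figures taken from the Dense 70B vs MoE 671B table above.
DENSE_70B = ModelFootprint("Dense 70B", stored_gb=40, read_per_token_gb=40)
MOE_671B = ModelFootprint("MoE 671B (37B active)", stored_gb=260, read_per_token_gb=22)

if __name__ == "__main__":
    for model in (DENSE_70B, MOE_671B):
        print(f"{model.name}: fits in 96GB = {fits(model, 96)}, "
              f"ceiling at 936 GB/s = {speed_ceiling(model, 936):.1f} tok/s")
```

The MoE model has the higher speed ceiling but fails the fit check, which is the tradeoff described above in miniature.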
MoE Models Available Today
| Model | Total Params | Active / Token | Experts | VRAM (Q4) | Effective Speed* |
|---|---|---|---|---|---|
| Mixtral 8×7B | 46.7B | 12.9B | 8 total, 2 active | ~26GB | Like a fast 13B |
| Qwen3-30B-A3B | 30B | 3B | 128 total, 8 active | ~18GB | Like a blazing 3B |
| Qwen3-Next-80B-A3B | 80B | 3.9B | MoE + Mamba hybrid | ~48GB | Like a fast 4B |
| Llama 4 Scout | 109B | ~17B | 16 total, 2 active | ~35GB | Like a fast 17B |
| Mixtral 8×22B | 141B | 39B | 8 total, 2 active | ~80GB | Like a fast 39B |
| DeepSeek Coder V2 | 236B | 21B | 160 total, 6 active | ~90GB | Like a fast 21B |
| Qwen3-235B-A22B | 235B | 22B | 128 total, 8 active | ~90GB | Like a fast 22B |
| Qwen3-Coder-480B-A35B | 480B | 35B | MoE | ~180GB | Like a fast 35B |
| DeepSeek V3.2 | 671B | 37B | 256 total, 8 active + 1 shared | ~260GB | Like a fast 37B |
| DeepSeek R1 | 671B | 37B | Same as V3 | ~260GB | Like a fast 37B |
| Kimi K2.5 | 1,000B | 32B | 384 total, 8 active + 1 shared | ~240GB (1.8-bit) | Like a fast 32B |
*"Effective speed" means the model generates tokens at a rate comparable to a dense model of the active parameter size — but with the knowledge breadth of its full parameter count.
03 — Hardware Tiers: The Six Build Tiers
Each tier represents a meaningful capability jump. The key question at each tier: what models can I run at conversational speed (10+ tok/s)?
[Chart: VRAM & Bandwidth by Tier]
Tier 1: Dense Models
| Model | VRAM | Speed | Quality |
|---|---|---|---|
| Llama 3.1 8B Q4 | ~5GB | 25–30 t/s | Good agent/chat |
| Phi-4 14B Q4 | ~9GB | 15–20 t/s | Strong math+code |
| Qwen 2.5 32B Q4 | ~20GB | 6–10 t/s | Best at this size |
| Llama 3.3 70B Q4 | ~40GB | 1–3 t/s ❌ | Doesn't fit — spills to RAM |
Tier 1: MoE Models
| Model | VRAM | Speed | Notes |
|---|---|---|---|
| Qwen3-30B-A3B (3B active) | ~18GB | 15–22 t/s | MoE magic — 30B knowledge at 3B speed |
| Mixtral 8×7B (12.9B active) | ~26GB | 2–4 t/s | Slightly too large, heavy offload |
Tier 2: Dense Models
| Model | VRAM | Speed | Quality |
|---|---|---|---|
| Llama 3.1 8B Q4 | ~5GB | 40–50 t/s | Blazing watcher agent |
| Phi-4 14B Q4 | ~9GB | 25–35 t/s | Excellent math+code |
| Qwen 2.5 32B Q4 | ~20GB | 15–20 t/s | Daily driver ✓ |
| DeepSeek-R1-Distill-32B Q4 | ~20GB | 12–18 t/s | o1-mini class reasoning |
| Llama 3.3 70B Q4 | ~40GB | 3–6 t/s | With RAM offload — usable for batch |
Tier 2: MoE Models
| Model | VRAM | Speed | Notes |
|---|---|---|---|
| Qwen3-30B-A3B (3B active) | ~18GB | 20–30 t/s | Outstanding value — frontier-class MoE on one card |
| Qwen3-Coder-30B-A3B | ~18GB | 20–30 t/s | Agentic coding with 3B active — incredible speed |
| Mixtral 8×7B (12.9B active) | ~26GB | 8–12 t/s | Tight fit, slight offload |
Tier 3: Dense Models
| Model | VRAM | Speed | Quality |
|---|---|---|---|
| Llama 3.3 70B Q4 | ~40GB | 15–20 t/s ✅ | Conversational speed. The standard |
| Qwen 2.5 72B Q4 | ~42GB | 14–18 t/s | Best multilingual 70B |
| DeepSeek-R1-Distill-70B Q4 | ~40GB | 14–18 t/s | Best reasoning at 70B |
| Qwen 2.5 Coder 72B Q4 | ~42GB | 12–16 t/s | Top open-source coder |
Tier 3: MoE Models
| Model | VRAM | Speed | Notes |
|---|---|---|---|
| Mixtral 8×7B (12.9B active) | ~26GB | 20–25 t/s | Fits comfortably, fast MoE |
| Llama 4 Scout (~17B active) | ~35GB | 15–22 t/s | 10M context window — read entire codebases |
| Qwen3-30B-A3B (3B active) | ~18GB | 30–40 t/s | Room for agents alongside |
Tier 4: Dense Models
| Model | VRAM | Speed | Quality |
|---|---|---|---|
| Llama 3.3 70B Q8 (full quality) | ~70GB | 12–15 t/s | Near-zero quantization loss |
| Command R+ 104B Q4 | ~60GB | 8–12 t/s | Best RAG-optimized model |
| GLM-5 Reasoning Q4 | ~60GB | 10–14 t/s | Top leaderboard Jan 2026 |
Tier 4: MoE Models
| Model | VRAM | Speed | Notes |
|---|---|---|---|
| Mixtral 8×22B (39B active) | ~80GB ⚠️ | 8–12 t/s | Tight — some RAM offload needed |
| Qwen3-235B-A22B (22B active) Q2 | ~70GB | 4–8 t/s | Heavy quant but fits. Frontier reasoning |
| DeepSeek Coder V2 (21B active) | ~90GB ⚠️ | 3–6 t/s | With RAM offload. Top coding MoE |
Tier 5: Dense Models
| Model | VRAM | Speed | Quality |
|---|---|---|---|
| Llama 3.3 70B Q8 (full quality) | ~70GB | 15–20 t/s | Fast full-quality 70B |
| Llama 3.3 70B FP16 | ~140GB ⚠️ | 5–8 t/s | True full precision with RAM offload |
| GLM-5 Reasoning Q4 | ~60GB | 12–16 t/s | Top of leaderboards |
Tier 5: MoE Models (This Is Where Tier 5 Shines)
| Model | VRAM | Speed | Notes |
|---|---|---|---|
| Qwen3-235B-A22B (22B active) | ~90GB | 8–14 t/s ✅ | Frontier model. Rivals GPT-4o. Fits entirely in VRAM |
| Mixtral 8×22B (39B active) | ~80GB | 10–15 t/s | Fully in VRAM now. Fast MoE |
| DeepSeek V3.2 (37B active) | 96GB + RAM | 3–6 t/s | Frontier reasoning. GPT-5 competitive |
| DeepSeek R1 (37B active) | 96GB + RAM | 3–6 t/s | o1-class reasoning on your hardware |
| Qwen3-Coder-480B-A35B | ~180GB ⚠️ | 2–5 t/s | Claude Sonnet 4 class coding — heavy RAM offload |
Why Tier 6 Exists
Tier 6 trades raw VRAM-per-dollar (where quad 3090s win) for bandwidth, reliability, noise, power efficiency, and form factor. The A100 80GB's ~2 TB/s HBM2e memory is roughly 2× faster than a 3090's 936 GB/s GDDR6X per card, which means significantly faster per-token inference. The A100 also supports MIG partitioning: splitting one GPU into up to 7 hardware-isolated instances, each with its own memory and compute, so each can run a separate model simultaneously without interfering with the others.
Tier 6: Dense Models
| Model | Config | Speed | Notes |
|---|---|---|---|
| Llama 3.3 70B Q4 | A100 80GB | 25–35 t/s 🔥 | HBM2e bandwidth = speed king |
| Llama 3.3 70B Q8 | A100 80GB | 20–28 t/s | Full quality, still blazing |
| Qwen 2.5 72B Q4 | 2× A6000 NVLink | 16–22 t/s | 96GB combined across both cards |
Tier 6: MoE Models
| Model | Config | Speed | Notes |
|---|---|---|---|
| Qwen3-235B-A22B | 2× A6000 NVLink (96GB) | 10–16 t/s | Frontier MoE, 96GB combined across both cards |
| DeepSeek V3.2 (37B active) | A100 + 256GB RAM | 3–6 t/s | HBM2e accelerates active params |
A100 MIG Multi-Model (Unique to Tier 6)
| MIG Partition | VRAM | Model | Speed | Agent Role |
|---|---|---|---|---|
| Slice 1 | 10GB | Llama 8B Q4 | 15–20 t/s | Watcher |
| Slice 2 | 10GB | Phi-4 Q4 | 12–15 t/s | Security Scanner |
| Slice 3 | 20GB | Qwen 32B Q4 | 10–14 t/s | Worker |
| Slice 4 | 40GB | Llama 70B Q4 | 12–18 t/s | CEO Thinker |
Four independent models, one card, zero interference. This is the A100's killer feature.
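A partition plan like the one above can be sanity-checked before any hardware is configured. The sketch below uses only the slice sizes from the MIG table and the approximate Q4 weight sizes quoted in the earlier tier tables; the `check_plan` helper is a hypothetical name. Slices with little headroom leave limited space for KV cache, so the actual fit depends on context length.

```python
A100_VRAM_GB = 80

# (slice, slice VRAM in GB, model, approximate Q4 weight size in GB)
PLAN = [
    ("Slice 1", 10, "Llama 8B Q4", 5),
    ("Slice 2", 10, "Phi-4 Q4", 9),
    ("Slice 3", 20, "Qwen 32B Q4", 20),
    ("Slice 4", 40, "Llama 70B Q4", 40),
]


def check_plan(plan, total_gb):
    """Return per-slice headroom in GB; raise if the slices exceed the card."""
    allocated = sum(slice_gb for _, slice_gb, _, _ in plan)
    if allocated > total_gb:
        raise ValueError(f"plan allocates {allocated}GB on a {total_gb}GB card")
    return {name: slice_gb - weights_gb for name, slice_gb, _, weights_gb in plan}


if __name__ == "__main__":
    for name, headroom_gb in check_plan(PLAN, A100_VRAM_GB).items():
        print(f"{name}: {headroom_gb}GB left after weights")
```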
04 — Performance: Tokens Per Second Across All Tiers
[Chart: Dense Model Inference Speed by Tier]
[Chart: MoE Model Inference Speed by Tier]
05 — MoE Tier Mapping: Where Each MoE Model Becomes Usable
This chart shows the minimum tier where each MoE model first fits in VRAM, and the tier where it reaches conversational speed (10+ tok/s).
[Chart: MoE Models: VRAM Requirement vs Available VRAM by Tier]
| MoE Model | VRAM (Q4) | First Fits | Conversational Speed (10+ t/s) | Peak Speed (best tier) |
|---|---|---|---|---|
| Qwen3-30B-A3B | ~18GB | Tier 1 | Tier 1 (15–22 t/s) | T3: 30–40 t/s |
| Mixtral 8×7B | ~26GB | Tier 2 (tight) | Tier 3 (20–25 t/s) | T3: 20–25 t/s |
| Llama 4 Scout | ~35GB | Tier 3 | Tier 3 (15–22 t/s) | T5: 18–25 t/s |
| Mixtral 8×22B | ~80GB | Tier 5 | Tier 5 (10–15 t/s) | T5: 10–15 t/s |
| Qwen3-235B-A22B | ~90GB | Tier 5 | Tier 5 (8–14 t/s) | T6: 10–16 t/s |
| DeepSeek V3.2 / R1 | ~260GB | Tier 5 + 256GB RAM | Never (3–6 t/s max) | T5/T6: 3–6 t/s |
| Kimi K2.5 | ~240GB (1.8-bit) | Tier 5 + RAM (extreme quant) | Never (1–2 t/s max) | Needs 2× H100 |
Qwen3-235B-A22B on Tier 5 is the most exciting MoE model for homelab builders. With 235B total parameters but only 22B active, it delivers frontier-class intelligence (rivaling GPT-4o on many benchmarks) at speeds comparable to a dense 22B model. At 8–14 tok/s on quad 3090s, it sits right around the 10 tok/s conversational threshold: a full 235B model responding about as fast as you can read. This is the "holy grail" of local AI: proprietary-model quality with zero per-token costs.
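Because the speeds in these tables are ranges, a model can straddle the 10 tok/s line. One simple way to label that, sketched below with Tier 5 figures from the table, is to compare both ends of the range against the threshold. The three labels, including "borderline", are illustrative choices rather than terms from the guide.

```python
CONVERSATIONAL_TOK_S = 10


def classify(low_tok_s, high_tok_s, threshold=CONVERSATIONAL_TOK_S):
    """Label a speed range relative to the conversational threshold."""
    if low_tok_s > high_tok_s:
        raise ValueError("low end of the range must not exceed the high end")
    if low_tok_s >= threshold:
        return "conversational"
    if high_tok_s >= threshold:
        return "borderline"
    return "below threshold"


# Tier 5 speed ranges, copied from the mapping table above.
TIER_5_SPEEDS = {
    "Mixtral 8x22B": (10, 15),
    "Qwen3-235B-A22B": (8, 14),
    "DeepSeek V3.2 / R1": (3, 6),
}

if __name__ == "__main__":
    for model, (low, high) in TIER_5_SPEEDS.items():
        print(f"{model}: {low}-{high} t/s -> {classify(low, high)}")
```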
06 — Cost Analysis: Total Cost of Ownership
3-Year Total Cost of Ownership (Build + Electricity)
| Tier | Build Cost | Year 1 Electric | Year 1 Total | Year 3 Total | Break-Even vs API* |
|---|---|---|---|---|---|
| Tier 1 | ~$650 | ~$144 | $794 | $1,082 | ~5 months |
| Tier 2 | ~$1,200 | ~$216 | $1,416 | $1,848 | ~8 months |
| Tier 3 | ~$2,100 | ~$276 | $2,376 | $2,928 | ~13 months |
| Tier 4 | ~$3,100 | ~$346 | $3,446 | $4,138 | ~18 months |
| Tier 5 | ~$4,700 | ~$373 | $5,073 | $5,819 | ~24 months |
| Tier 6 | ~$8,000 | ~$240 | $8,240 | $8,720 | ~36 months |
*Break-even calculated against Claude Sonnet API moderate usage (~2M tokens/day = ~$2,160/year). Your actual usage will vary.
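The Year 1 and Year 3 totals in the table are build cost plus cumulative electricity. A short sketch, assuming a flat yearly electricity bill and using a hypothetical `total_cost` helper, reproduces them and makes it easy to plug in your own power price.

```python
# (build cost, yearly electricity) in USD, copied from the table above.
TIERS = {
    "Tier 1": (650, 144),
    "Tier 2": (1200, 216),
    "Tier 3": (2100, 276),
    "Tier 4": (3100, 346),
    "Tier 5": (4700, 373),
    "Tier 6": (8000, 240),
}


def total_cost(build_usd, electric_usd_per_year, years):
    """Build cost plus electricity, assuming a flat yearly electricity bill."""
    if years < 0:
        raise ValueError("years must be non-negative")
    return build_usd + electric_usd_per_year * years


if __name__ == "__main__":
    for tier, (build, electric) in TIERS.items():
        print(f"{tier}: year 1 ${total_cost(build, electric, 1):,}, "
              f"year 3 ${total_cost(build, electric, 3):,}")
```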
07 — Why RTX 3090? The GPU That Dominates Every Tier
Five of six tiers are built around the RTX 3090. Here's why no other card comes close at this price point:
[Chart: VRAM per Dollar, Used GPU Market (Feb 2026)]
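The 3090 offers 24GB at 936 GB/s for ~$800 used. The 4090 gives the same 24GB at 1,008 GB/s for ~$1,700: an 8% bandwidth gain for 112% more money. The 5090 offers 32GB at 1,792 GB/s for $2,500+, so three 3090s (72GB, 2,808 GB/s aggregate) cost about the same as one 5090 while delivering 2.25× the VRAM and 1.57× the aggregate bandwidth. That aggregate figure is only realized when the cards work on each token in parallel; when a model is simply split layer by layer across cards, per-token speed tracks a single card's bandwidth.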
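The comparison above is simple arithmetic, and a sketch like the following makes it easy to rerun with current prices. The prices are the approximate used-market figures from the text. The RTX 5090's 1,792 GB/s bandwidth is taken from its published spec and should be treated as an assumption here, as should the `Card` class and helper names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Card:
    name: str
    vram_gb: int
    bandwidth_gb_s: int
    price_usd: int

    @property
    def usd_per_gb(self) -> float:
        return self.price_usd / self.vram_gb


RTX_3090 = Card("RTX 3090", 24, 936, 800)
RTX_4090 = Card("RTX 4090", 24, 1008, 1700)
# Assumption: 1,792 GB/s is the RTX 5090's published memory bandwidth.
RTX_5090 = Card("RTX 5090", 32, 1792, 2500)


def percent_more(new: float, old: float) -> float:
    """How much larger `new` is than `old`, in percent."""
    return (new / old - 1) * 100


if __name__ == "__main__":
    for card in (RTX_3090, RTX_4090, RTX_5090):
        print(f"{card.name}: ${card.usd_per_gb:.0f}/GB")
    print(f"4090 vs 3090 bandwidth: +{percent_more(1008, 936):.0f}%")
    print(f"3x 3090 vs 5090 VRAM: {3 * RTX_3090.vram_gb / RTX_5090.vram_gb:.2f}x")
    print(f"3x 3090 vs 5090 aggregate bandwidth: "
          f"{3 * RTX_3090.bandwidth_gb_s / RTX_5090.bandwidth_gb_s:.2f}x")
```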
The only alternatives that make sense are professional/datacenter cards (A6000, A100) at Tier 6, where you're paying a premium for lower noise, better thermals, reliability, and per-card VRAM capacity rather than price/performance.
08 — Recommendations: Which Tier Should You Build?
| Your Goal | Recommended Tier | Why |
|---|---|---|
| Learn Linux + AI, run 8B/32B agents | Tier 1 ($650) | Cheapest entry. Validates the workflow |
| Serious daily driver, 32B at full speed | Tier 2 ($1,200) | Best value. Upgradeable to Tier 3 |
| 70B at conversational speed, character work | Tier 3 ($2,100) | The capability inflection point |
| 70B at full quality, large MoE experiments | Tier 4 ($3,100) | Q8 quality, near-zero quantization loss |
| Frontier MoE models, multi-agent swarm | Tier 5 ($4,700) | Qwen3-235B + DeepSeek R1 on your hardware |
| Enterprise reliability, silent 24/7, speed | Tier 6 ($8,000+) | A100's bandwidth + MIG is unmatched |
Start with an EPYC motherboard + single 3090 (~$1,800). This gives you Tier 2 performance on a Tier 5 platform. Add GPUs over time — each 3090 is a ~$800 modular upgrade. Full Tier 5 builds out over 6–12 months with zero wasted components.