01 - Fundamentals: What Drives LLM Inference Speed?

Running a large language model locally is fundamentally a memory bandwidth problem, not a compute problem. Every time a dense model generates a token, it must read all of its weights from VRAM. The GPU does a small amount of math on them, outputs one token, then reads all the weights again for the next token.

This means three specs determine your performance, in order of importance:

Spec | What It Determines | Example
VRAM Capacity | Does the model fit on the GPU at all? | 70B Q4 needs ~40GB VRAM
Memory Bandwidth | How fast can you generate tokens? | RTX 3090: 936 GB/s → fast. P40: 346 GB/s → slow
Price per GB VRAM | How much capability per dollar? | 3090: $35/GB. 4090: $73/GB. P40: $10/GB

CUDA cores, clock speed, and tensor cores matter far less than you'd expect. As long as the model sits entirely in VRAM, the CPU barely participates: a Ryzen 5 5600 performs within 5% of a Ryzen 9 for token generation.

Key Formula

Approximate tok/s ≈ GPU bandwidth (GB/s) ÷ model size in VRAM (GB)
Example: RTX 3090 (936 GB/s) running a 20GB model → ~47 tok/s. This is simplified but remarkably predictive.
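
As a minimal sketch, the key formula can be written as a single function. The function name is hypothetical, and the result is best read as a ceiling: the measured speeds in the tier tables later in this guide come in lower than this estimate.

```python
def estimate_tokens_per_second(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Ceiling estimate: each generated token needs one full read of the weights."""
    if model_size_gb <= 0:
        raise ValueError("model_size_gb must be positive")
    return bandwidth_gb_s / model_size_gb

# Bandwidth figures are the ones quoted in the spec table above.
for gpu, bandwidth in [("RTX 3090", 936), ("Tesla P40", 346)]:
    speed = estimate_tokens_per_second(bandwidth, model_size_gb=20)
    print(f"{gpu}: ~{speed:.0f} tok/s on a 20GB model")
```

The same 20GB model lands at very different ceilings on the two cards, which is the whole argument for ranking bandwidth above compute.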

02 - Architecture: How Mixture-of-Experts (MoE) Models Work

Traditional "dense" models (like Llama 70B) activate every parameter for every token. A 70B model uses all 70 billion parameters on every forward pass, which is why it needs ~40GB VRAM at Q4 and generates tokens at a speed proportional to that 40GB.
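
The ~40GB figure can be roughly reproduced from the parameter count. The sketch below assumes an effective rate of about 4.5 bits per weight for a Q4 quantization (an assumption, not a number from this guide) and counts weights only, so KV cache and runtime overhead come on top.

```python
def quantized_weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8

# ASSUMPTION: ~4.5 bits/weight as a rough effective rate for Q4 once
# per-block scales and similar metadata are included.
Q4_EFFECTIVE_BITS = 4.5

print(f"70B at Q4:   ~{quantized_weights_gb(70, Q4_EFFECTIVE_BITS):.0f} GB")
print(f"70B at Q8:   ~{quantized_weights_gb(70, 8):.0f} GB")
print(f"70B at FP16: ~{quantized_weights_gb(70, 16):.0f} GB")
```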

MoE models take a radically different approach: they have a massive total parameter count but only activate a small fraction for each token. The model contains many specialized "expert" sub-networks, and a lightweight "router" network decides which experts to activate for each input.

Dense Model vs MoE Model

Dense: ALL parameters fire every token. MoE: Router selects a few experts per token.

Dense (Llama 70B): Input Token → ALL 70B params → Output Token

MoE (DeepSeek V3, 671B): Input Token → Router (selects 8 of 256 experts) → selected experts (for example Experts 14, 87, 201, 55) → Output Token
256 total experts, 8 active = 37B active params
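
The routing step described above can be sketched as a toy top-k router. This is an illustration under simplifying assumptions, not how any particular model implements it: the "experts" here are plain scaling functions, the router is a single random weight matrix, and all names are hypothetical.

```python
import math
import random

def top_k_route(router_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    ranked = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:k]
    # Softmax over only the chosen experts so their weights sum to 1.
    peak = max(router_logits[i] for i in chosen)
    exps = [math.exp(router_logits[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]

def moe_layer(x, router_weights, experts, k):
    """Run one token vector x through a toy MoE layer."""
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in router_weights]
    routes = top_k_route(logits, k)
    out = [0.0] * len(x)
    for idx, weight in routes:
        y = experts[idx](x)  # only the selected experts do any work
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out, [idx for idx, _ in routes]

# Toy setup: 8 experts, each one just scales its input by a different factor.
random.seed(0)
DIM, NUM_EXPERTS, TOP_K = 4, 8, 2
router_weights = [[random.uniform(-1, 1) for _ in range(DIM)] for _ in range(NUM_EXPERTS)]
experts = [lambda v, s=s: [s * vi for vi in v] for s in range(1, NUM_EXPERTS + 1)]

token = [0.5, -1.0, 0.25, 2.0]
output, active = moe_layer(token, router_weights, experts, TOP_K)
print("active experts:", active)
```

All eight experts exist in memory, but only two are evaluated for this token, which mirrors the storage-versus-compute split discussed next.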

Why MoE Matters for Home Inference

The critical insight: all experts must live in memory (VRAM or RAM), but only the active parameters determine compute speed. This creates a unique tradeoff:

Property | Dense 70B | MoE 671B (37B active)
Total Parameters | 70B | 671B
Active per Token | 70B (all) | 37B (~5.5%)
VRAM at Q4 | ~40GB | ~260GB
Knowledge Breadth | 70B model's worth | 671B model's worth
Generation Speed | Proportional to 40GB | Proportional to 37B active ≈ 22GB
Quality | GPT-4-mini class | GPT-5 competitive

The MoE Paradox

MoE models are smarter than dense models of the same speed (because 671B total knowledge > 70B), but they need much more VRAM (because all 671B parameters must be stored, even though only 37B fire). They're fast to think but expensive to store.
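
To put numbers on the paradox, the sketch below separates what must be stored from what is read per token, using the figures from the comparison table. The class and field names are hypothetical, and the speed is a ceiling that assumes every byte read comes from memory at the stated bandwidth, which a 260GB model cannot do on a single 24GB card.

```python
from dataclasses import dataclass

@dataclass
class ModelProfile:
    name: str
    stored_gb: float          # all weights that must sit in memory
    read_per_token_gb: float  # weights actually read to produce one token

    def speed_ceiling(self, bandwidth_gb_s: float) -> float:
        """tok/s ceiling if all reads happen at this bandwidth."""
        return bandwidth_gb_s / self.read_per_token_gb

# Figures taken from the comparison table above.
dense = ModelProfile("Dense 70B (Q4)", stored_gb=40, read_per_token_gb=40)
moe = ModelProfile("MoE 671B, 37B active (Q4)", stored_gb=260, read_per_token_gb=22)

BANDWIDTH = 936  # GB/s, the RTX 3090 figure used throughout this guide
for m in (dense, moe):
    print(f"{m.name}: store {m.stored_gb:.0f} GB, "
          f"ceiling ~{m.speed_ceiling(BANDWIDTH):.0f} tok/s")
```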

MoE Models Available Today

Model | Total Params | Active / Token | Experts | VRAM (Q4) | Effective Speed*
Mixtral 8×7B | 46.7B | 12.9B | 8 total, 2 active | ~26GB | Like a fast 13B
Qwen3-30B-A3B | 30B | 3B | 128 total, 8 active | ~18GB | Like a blazing 3B
Qwen3-Next-80B-A3B | 80B | 3.9B | MoE + Mamba hybrid | ~48GB | Like a fast 4B
Llama 4 Scout | MoE | ~17B | 16 total, 2 active | ~35GB | Like a fast 17B
Mixtral 8×22B | 141B | 39B | 8 total, 2 active | ~80GB | Like a fast 39B
DeepSeek Coder V2 | 236B | 21B | 160 total, 6 active | ~90GB | Like a fast 21B
Qwen3-235B-A22B | 235B | 22B | 128 total, 8 active | ~90GB | Like a fast 22B
Qwen3-Coder-480B-A35B | 480B | 35B | MoE | ~180GB | Like a fast 35B
DeepSeek V3.2 | 671B | 37B | 256 total, 8 active + 1 shared | ~260GB | Like a fast 37B
DeepSeek R1 | 671B | 37B | Same as V3 | ~260GB | Like a fast 37B
Kimi K2.5 | 1,000B | 32B | 384 total, 8 active + 1 shared | ~240GB (1.8-bit) | Like a fast 32B

*"Effective speed" means the model generates tokens at a rate comparable to a dense model of the active parameter size — but with the knowledge breadth of its full parameter count.
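
For a quick feel of how sparse these models are, the active share can be computed directly from the table's parameter counts. This is a small illustrative sketch covering a subset of the rows; the helper name is hypothetical.

```python
# (total params in billions, active params in billions), from the table above
MOE_MODELS = {
    "Mixtral 8x7B": (46.7, 12.9),
    "Qwen3-30B-A3B": (30, 3),
    "Qwen3-235B-A22B": (235, 22),
    "DeepSeek V3.2": (671, 37),
    "Kimi K2.5": (1000, 32),
}

def active_fraction(total_b: float, active_b: float) -> float:
    """Share of the parameters used for any single token."""
    return active_b / total_b

# Print from sparsest to densest.
for name, (total, active) in sorted(MOE_MODELS.items(),
                                    key=lambda kv: active_fraction(*kv[1])):
    print(f"{name:<16} {active:>5}B of {total:>6}B = {active_fraction(total, active):.1%}")
```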

03 - Hardware Tiers: The Six Build Tiers

Each tier represents a meaningful capability jump. The key question at each tier: what models can I run at conversational speed (10+ tok/s)?

[Chart: VRAM & Bandwidth by Tier]

Tier 1: "The Scout" (Entry)
Build Cost: $550 – $700
GPU: 1× Tesla P40 24GB
Total VRAM: 24 GB
Bandwidth: 346 GB/s (GDDR5)
System RAM: 32 GB DDR4
CPU: Ryzen 5 5600
PSU: 650W
Power Draw: 80–200W
Monthly Electric: ~$12
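
The monthly electricity figure can be approximated from the power draw. The sketch below assumes a 110W average draw (inside the 80–200W range above) and a rate of $0.15/kWh; both values are assumptions chosen for illustration, not numbers from this guide.

```python
def monthly_electric_cost(avg_watts: float, usd_per_kwh: float,
                          hours_per_day: float = 24) -> float:
    """Monthly cost in USD for a machine drawing avg_watts, over a 30-day month."""
    kwh = avg_watts / 1000 * hours_per_day * 30
    return kwh * usd_per_kwh

# ASSUMPTIONS: 110W average draw and $0.15/kWh are illustrative values.
cost = monthly_electric_cost(avg_watts=110, usd_per_kwh=0.15)
print(f"~${cost:.0f}/month")
```

Plugging in your own utility rate and a realistic duty cycle will move this figure considerably.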

Dense Models

Model | VRAM | Speed | Quality
Llama 3.1 8B Q4 | ~5GB | 25–30 t/s | Good agent/chat
Phi-4 14B Q4 | ~9GB | 15–20 t/s | Strong math+code
Qwen 2.5 32B Q4 | ~20GB | 6–10 t/s | Best at this size
Llama 3.3 70B Q4 | ~40GB | 1–3 t/s ❌ | Doesn't fit; spills to RAM

MoE Models

Model | VRAM | Speed | Notes
Qwen3-30B-A3B (3B active) | ~18GB | 15–22 t/s | MoE magic: 30B knowledge at 3B speed
Mixtral 8×7B (12.9B active) | ~26GB | 2–4 t/s | Slightly too large, heavy offload

Tier 2: "The Workhorse" (Recommended Start)
Build Cost: $1,000 – $1,300
GPU: 1× RTX 3090 24GB
Total VRAM: 24 GB
Bandwidth: 936 GB/s (GDDR6X)
System RAM: 64 GB DDR4
CPU: Ryzen 5 5600
PSU: 850W
Power Draw: 100–320W
Monthly Electric: ~$18

Dense Models

Model | VRAM | Speed | Quality
Llama 3.1 8B Q4 | ~5GB | 40–50 t/s | Blazing watcher agent
Phi-4 14B Q4 | ~9GB | 25–35 t/s | Excellent math+code
Qwen 2.5 32B Q4 | ~20GB | 15–20 t/s | Daily driver ✓
DeepSeek-R1-Distill-32B Q4 | ~20GB | 12–18 t/s | o1-mini class reasoning
Llama 3.3 70B Q4 | ~40GB | 3–6 t/s | With RAM offload; usable for batch
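
The drop to single-digit speeds with RAM offload follows from the same bandwidth reasoning: weights left in system RAM are read far more slowly than weights in VRAM. The sketch below assumes roughly 50 GB/s for dual-channel DDR4 (an assumption, not a figure from this guide) and that the full 24GB of VRAM holds weights, which is optimistic.

```python
def offload_tokens_per_second(model_gb: float, vram_gb: float,
                              gpu_bw_gb_s: float, ram_bw_gb_s: float) -> float:
    """Estimate tok/s when weights that don't fit in VRAM are read from system RAM."""
    in_vram = min(model_gb, vram_gb)
    in_ram = model_gb - in_vram
    # Per-token time is the sum of reading each portion at its own bandwidth.
    seconds_per_token = in_vram / gpu_bw_gb_s + in_ram / ram_bw_gb_s
    return 1 / seconds_per_token

# ASSUMPTION: ~50 GB/s system RAM bandwidth (dual-channel DDR4).
speed = offload_tokens_per_second(model_gb=40, vram_gb=24,
                                  gpu_bw_gb_s=936, ram_bw_gb_s=50)
print(f"70B Q4 on one 3090 with offload: ~{speed:.1f} tok/s")
```

The 16GB that spills to RAM dominates the per-token time, which is why a second GPU helps far more than a faster one.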

MoE Models

Model | VRAM | Speed | Notes
Qwen3-30B-A3B (3B active) | ~18GB | 20–30 t/s | Outstanding value: frontier-class MoE on one card
Qwen3-Coder-30B-A3B | ~18GB | 20–30 t/s | Agentic coding with 3B active; incredible speed
Mixtral 8×7B (12.9B active) | ~26GB | 8–12 t/s | Tight fit, slight offload

Tier 3: "The Commander" (70B Unlocked)
Build Cost: $1,800 – $2,400
GPU: 2× RTX 3090 24GB
Total VRAM: 48 GB
Bandwidth: 1,872 GB/s combined
System RAM: 64 GB DDR4
CPU: Ryzen 5 5600
PSU: 1200W
Power Draw: 150–550W
Monthly Electric: ~$23

Dense Models

Model | VRAM | Speed | Quality
Llama 3.3 70B Q4 | ~40GB | 15–20 t/s ✅ | Conversational speed. The standard
Qwen 2.5 72B Q4 | ~42GB | 14–18 t/s | Best multilingual 70B
DeepSeek-R1-Distill-70B Q4 | ~40GB | 14–18 t/s | Best reasoning at 70B
Qwen 2.5 Coder 72B Q4 | ~42GB | 12–16 t/s | Top open-source coder

MoE Models

Model | VRAM | Speed | Notes
Mixtral 8×7B (12.9B active) | ~26GB | 20–25 t/s | Fits comfortably, fast MoE
Llama 4 Scout (~17B active) | ~35GB | 15–22 t/s | 10M context window: read entire codebases
Qwen3-30B-A3B (3B active) | ~18GB | 30–40 t/s | Room for agents alongside

Tier 4: "The General" (Full Quality)
Build Cost: $2,800 – $3,800
GPU: 2× RTX 3090 + 1× P40
Total VRAM: 72 GB
Bandwidth: 2,218 GB/s combined
System RAM: 128 GB DDR4
PSU: 1600W
Power Draw: 200–750W
Monthly Electric: ~$29
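
The "Total VRAM" and "combined" bandwidth lines for multi-GPU tiers are simple sums over the installed cards, as the sketch below shows. Treat the combined bandwidth as an aggregate figure: how much of it a single token stream actually sees depends on how the runtime splits the model across cards. The helper name is hypothetical.

```python
GPUS = {  # name: (VRAM in GB, bandwidth in GB/s), as quoted in this guide
    "RTX 3090": (24, 936),
    "Tesla P40": (24, 346),
}

def build_totals(cards):
    """Sum VRAM and bandwidth over a list of (gpu_name, count) pairs."""
    vram = sum(GPUS[name][0] * count for name, count in cards)
    bandwidth = sum(GPUS[name][1] * count for name, count in cards)
    return vram, bandwidth

tier_4 = [("RTX 3090", 2), ("Tesla P40", 1)]
vram, bandwidth = build_totals(tier_4)
print(f"Tier 4: {vram} GB VRAM, {bandwidth:,} GB/s combined")
```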

Dense Models

Model | VRAM | Speed | Quality
Llama 3.3 70B Q8 (full quality) | ~70GB | 12–15 t/s | Near-zero quantization loss
Command R+ 104B Q4 | ~60GB | 8–12 t/s | Best RAG-optimized model
GLM-5 Reasoning Q4 | ~60GB | 10–14 t/s | Top leaderboard Jan 2026

MoE Models

Model | VRAM | Speed | Notes
Mixtral 8×22B (39B active) | ~80GB ⚠️ | 8–12 t/s | Tight; some RAM offload needed
Qwen3-235B-A22B (22B active) Q2 | ~70GB | 4–8 t/s | Heavy quant but fits. Frontier reasoning
DeepSeek Coder V2 (21B active) | ~90GB ⚠️ | 3–6 t/s | With RAM offload. Top coding MoE

Tier 5: "The Admiral" (Frontier)
Build Cost: $4,500 – $6,500
GPU: 4× RTX 3090 24GB
Total VRAM: 96 GB
Bandwidth: 3,744 GB/s combined
System RAM: 128–256 GB DDR4
Platform: EPYC 7702 (recommended)
PSU: 2× 1000W (dual PSU)
Form Factor: Open-air frame
Monthly Electric: ~$31

Dense Models

Model | VRAM | Speed | Quality
Llama 3.3 70B Q8 (full quality) | ~70GB | 15–20 t/s | Fast full-quality 70B
Llama 3.3 70B FP16 | ~140GB ⚠️ | 5–8 t/s | True full precision with RAM offload
GLM-5 Reasoning Q4 | ~60GB | 12–16 t/s | Top of leaderboards

MoE Models — This Is Where Tier 5 Shines

Model | VRAM | Speed | Notes
Qwen3-235B-A22B (22B active) | ~90GB | 8–14 t/s ✅ | Frontier model. Rivals GPT-4o. Fits entirely in VRAM
Mixtral 8×22B (39B active) | ~80GB | 10–15 t/s | Fully in VRAM now. Fast MoE
DeepSeek V3.2 (37B active) | 96GB + RAM | 3–6 t/s | Frontier reasoning. GPT-5 competitive
DeepSeek R1 (37B active) | 96GB + RAM | 3–6 t/s | o1-class reasoning on your hardware
Qwen3-Coder-480B-A35B | ~180GB ⚠️ | 2–5 t/s | Claude Sonnet 4 class coding; heavy RAM offload

Tier 6: "The Fleet Commander" (Enterprise)
Build Cost: $8,000 – $15,000+
GPU Options: 2× A6000 48GB or 1× A100 80GB
Total VRAM: 80–96 GB
Bandwidth: Up to 2,000 GB/s (A100 HBM2e)
System RAM: 256 GB ECC DDR4
Platform: EPYC server (Supermicro/Dell)
Form Factor: Tower or 4U Rackmount
Noise: Whisper-quiet (blower/passive)
Monthly Electric: ~$20

Why Tier 6 Exists

Tier 6 trades raw VRAM-per-dollar (where quad 3090s win) for bandwidth, reliability, noise, power efficiency, and form factor. The A100's 2 TB/s HBM2e memory is roughly 2× faster than a 3090's GDDR6X per card (about 2,000 GB/s versus 936 GB/s), which means significantly faster per-token inference. The A100 also supports MIG partitioning: splitting one GPU into up to 7 hardware-isolated instances, each with its own dedicated memory and compute, so separate models can run side by side without contending for each other's resources.

Dense Models

Model | Config | Speed | Notes
Llama 3.3 70B Q4 | A100 80GB | 25–35 t/s 🔥 | HBM2e bandwidth = speed king
Llama 3.3 70B Q8 | A100 80GB | 20–28 t/s | Full quality, still blazing
Qwen 2.5 72B Q4 | 2× A6000 NVLink | 16–22 t/s | 96GB unified memory pool

MoE Models

Model | Config | Speed | Notes
Qwen3-235B-A22B | 2× A6000 NVLink (96GB) | 10–16 t/s | Frontier MoE, unified memory
DeepSeek V3.2 (37B active) | A100 + 256GB RAM | 3–6 t/s | HBM2e accelerates active params

A100 MIG Multi-Model (Unique to Tier 6)

MIG Partition | VRAM | Model | Speed | Agent Role
Slice 1 | 10GB | Llama 8B Q4 | 15–20 t/s | Watcher
Slice 2 | 10GB | Phi-4 Q4 | 12–15 t/s | Security Scanner
Slice 3 | 20GB | Qwen 32B Q4 | 10–14 t/s | Worker
Slice 4 | 40GB | Llama 70B Q4 | 12–18 t/s | CEO Thinker

Four independent models, one card, zero interference. This is the A100's killer feature.
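
A quick sanity check on the slice plan above: the four partitions should add up to no more than the card's 80GB and stay within the 7-instance limit. The sketch below checks only that arithmetic. Real MIG configurations use fixed instance profiles and placement rules, which this hypothetical helper does not model.

```python
A100_MEMORY_GB = 80

# Slice plan from the table above: (agent role, memory in GB)
mig_plan = [
    ("Watcher", 10),
    ("Security Scanner", 10),
    ("Worker", 20),
    ("CEO Thinker", 40),
]

def check_plan(plan, total_gb=A100_MEMORY_GB, max_instances=7):
    """True if the slices fit in the card's memory and the instance limit."""
    used = sum(gb for _, gb in plan)
    return used <= total_gb and len(plan) <= max_instances

print("memory used:", sum(gb for _, gb in mig_plan), "GB")
print("plan fits:", check_plan(mig_plan))
```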

04 - Performance: Tokens Per Second Across All Tiers

[Chart: Dense Model Inference Speed by Tier]

[Chart: MoE Model Inference Speed by Tier]

05 - MoE Tier Mapping: Where Each MoE Model Becomes Usable

The chart and table below show the minimum tier where each MoE model first fits in VRAM, and the tier where it reaches conversational speed (10+ tok/s).

[Chart: MoE Models: VRAM Requirement vs Available VRAM by Tier]

MoE Model | VRAM (Q4) | First Fits | Conversational Speed (10+ t/s) | Peak Speed (best tier)
Qwen3-30B-A3B | ~18GB | Tier 1 | Tier 1 (15–22 t/s) | T3: 30–40 t/s
Mixtral 8×7B | ~26GB | Tier 2 (tight) | Tier 3 (20–25 t/s) | T3: 20–25 t/s
Llama 4 Scout | ~35GB | Tier 3 | Tier 3 (15–22 t/s) | T5: 18–25 t/s
Mixtral 8×22B | ~80GB | Tier 5 | Tier 5 (10–15 t/s) | T5: 10–15 t/s
Qwen3-235B-A22B | ~90GB | Tier 5 | Tier 5 (8–14 t/s) | T6: 10–16 t/s
DeepSeek V3.2 / R1 | ~260GB | Tier 5 + 256GB RAM | Never (3–6 t/s max) | T5/T6: 3–6 t/s
Kimi K2.5 | ~240GB (1.8-bit) | Tier 5 + RAM (extreme quant) | Never (1–2 t/s max) | Needs 2× H100

The MoE Sweet Spot

Qwen3-235B-A22B on Tier 5 is the most exciting MoE model for homelab builders. With 235B total parameters but only 22B active, it delivers frontier-class intelligence (rivaling GPT-4o on many benchmarks) at speeds comparable to a dense 22B model. At 8–14 tok/s on quad 3090s, it's genuinely conversational — a full 235B model responding as fast as you can read. This is the "holy grail" of local AI: proprietary-model quality with zero token costs.
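
The "First Fits" column in the mapping table can be approximated with a strict VRAM comparison, sketched below with hypothetical names. Two caveats: this check ignores KV cache and runtime overhead, and it is stricter than the table, which lists Mixtral 8×7B as a tight Tier 2 fit because it tolerates slight offload.

```python
# Total VRAM per tier, from the tier specs in this guide.
# Tier 6 is omitted because it is 80–96 GB depending on configuration.
TIER_VRAM_GB = {1: 24, 2: 24, 3: 48, 4: 72, 5: 96}

def first_fit_tier(model_vram_gb: float):
    """Lowest tier whose total VRAM holds the model entirely, or None."""
    for tier in sorted(TIER_VRAM_GB):
        if TIER_VRAM_GB[tier] >= model_vram_gb:
            return tier
    return None

MODELS_GB = {
    "Qwen3-30B-A3B": 18,
    "Mixtral 8x7B": 26,
    "Llama 4 Scout": 35,
    "Mixtral 8x22B": 80,
    "Qwen3-235B-A22B": 90,
    "DeepSeek V3.2 / R1": 260,
}

for name, size in MODELS_GB.items():
    tier = first_fit_tier(size)
    label = f"Tier {tier}" if tier else "needs RAM offload at every tier"
    print(f"{name:<20} {size:>4} GB -> {label}")
```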

06 - Cost Analysis: Total Cost of Ownership

3-Year Total Cost of Ownership (Build + Electricity)

Tier | Build Cost | Year 1 Electric | Year 1 Total | Year 3 Total | Break-Even vs API*
Tier 1 | ~$650 | ~$144 | $794 | $1,082 | ~5 months
Tier 2 | ~$1,200 | ~$216 | $1,416 | $1,848 | ~8 months
Tier 3 | ~$2,100 | ~$276 | $2,376 | $2,928 | ~13 months
Tier 4 | ~$3,100 | ~$346 | $3,446 | $4,138 | ~18 months
Tier 5 | ~$4,700 | ~$373 | $5,073 | $5,819 | ~24 months
Tier 6 | ~$8,000 | ~$240 | $8,240 | $8,720 | ~36 months

*Break-even calculated against Claude Sonnet API moderate usage (~2M tokens/day = ~$2,160/year). Your actual usage will vary.
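
The Year 1 and Year 3 totals are the build cost plus electricity for that many years, which the sketch below reproduces. The break-even helper is one plausible model (build cost divided by monthly API spend avoided, net of electricity). It is an assumption about the method, and it will not match every entry in the Break-Even column, which may rest on usage assumptions the table does not spell out.

```python
def total_cost(build_usd: float, yearly_electric_usd: float, years: int) -> float:
    """Build cost plus electricity over the given number of years."""
    return build_usd + yearly_electric_usd * years

def break_even_months(build_usd: float, yearly_electric_usd: float,
                      yearly_api_usd: float) -> float:
    """Months until avoided API spend, net of electricity, covers the build."""
    monthly_saving = (yearly_api_usd - yearly_electric_usd) / 12
    if monthly_saving <= 0:
        return float("inf")  # the build never pays for itself
    return build_usd / monthly_saving

API_USD_PER_YEAR = 2160  # from the footnote above
TIERS = {  # tier: (build cost, yearly electricity), from the table above
    "Tier 1": (650, 144),
    "Tier 2": (1200, 216),
    "Tier 3": (2100, 276),
}

for name, (build, electric) in TIERS.items():
    print(f"{name}: year 1 ${total_cost(build, electric, 1):,}, "
          f"year 3 ${total_cost(build, electric, 3):,}, "
          f"break-even ~{break_even_months(build, electric, API_USD_PER_YEAR):.1f} months")
```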

07 - Why RTX 3090? The GPU That Dominates Every Tier

Four of the six tiers (Tiers 2 through 5) are built around the RTX 3090. Here's why no other card comes close at this price point:

[Chart: VRAM per Dollar, Used GPU Market (Feb 2026)]

The 3090 offers 24GB at 936 GB/s for ~$800 used. The 4090 gives the same 24GB at 1,008 GB/s for ~$1,700 — an 8% bandwidth gain for 112% more money. The 5090's 32GB at $2,500+ means three 3090s (72GB, 2,808 GB/s combined) cost the same as one 5090 while delivering 2.25× the VRAM and 1.57× the bandwidth.
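
The ratios in the paragraph above can be checked with a few lines of arithmetic. One input is not stated in this guide: the sketch assumes 1,792 GB/s for the RTX 5090, which is the value implied by the 1.57× bandwidth claim. Prices are the approximate used-market figures quoted above.

```python
CARDS = {
    # name: (VRAM in GB, bandwidth in GB/s, approximate price in USD)
    "RTX 3090": (24, 936, 800),
    "RTX 4090": (24, 1008, 1700),
    "RTX 5090": (32, 1792, 2500),  # ASSUMPTION: bandwidth not stated in this guide
}

def pct_more(new: float, old: float) -> float:
    """How much larger `new` is than `old`, in percent."""
    return (new / old - 1) * 100

v3090, bw3090, p3090 = CARDS["RTX 3090"]
v4090, bw4090, p4090 = CARDS["RTX 4090"]
v5090, bw5090, p5090 = CARDS["RTX 5090"]

print(f"4090 vs 3090: +{pct_more(bw4090, bw3090):.0f}% bandwidth, "
      f"+{pct_more(p4090, p3090):.1f}% price")
print(f"3x 3090 vs 5090: {3 * v3090 / v5090:.2f}x VRAM, "
      f"{3 * bw3090 / bw5090:.2f}x bandwidth, ${3 * p3090} vs ${p5090}")
for name, (vram, _, price) in CARDS.items():
    print(f"{name}: ${price / vram:.0f} per GB of VRAM")
```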

The only alternatives that make sense are professional/datacenter cards (A6000, A100) at Tier 6, where you're paying a premium for noise, thermals, reliability, and per-card VRAM capacity — not price/performance.

08 - Recommendations: Which Tier Should You Build?

Your Goal | Recommended Tier | Why
Learn Linux + AI, run 8B/32B agents | Tier 1 ($650) | Cheapest entry. Validates the workflow
Serious daily driver, 32B at full speed | Tier 2 ($1,200) | Best value. Upgradeable to Tier 3
70B at conversational speed, character work | Tier 3 ($2,100) | The capability inflection point
70B at full quality, large MoE experiments | Tier 4 ($3,100) | Q8 quality, zero quantization compromise
Frontier MoE models, multi-agent swarm | Tier 5 ($4,700) | Qwen3-235B + DeepSeek R1 on your hardware
Enterprise reliability, silent 24/7, speed | Tier 6 ($8,000+) | A100's bandwidth + MIG is unmatched

The Piecemeal Path to Tier 5

Start with an EPYC motherboard + single 3090 (~$1,800). This gives you Tier 2 performance on a Tier 5 platform. Add GPUs over time — each 3090 is a ~$800 modular upgrade. Full Tier 5 builds out over 6–12 months with zero wasted components.