Local AI Server Build Guide — Tiers 1–6

01 — Fundamentals

What Drives LLM Inference Speed?

Running a large language model locally is fundamentally a memory bandwidth problem, not a compute problem. Every time a dense model generates a token, it must read all of its weights from VRAM. The GPU does a small amount of math per weight, outputs one token, then reads every weight again for the next token.

This means three specs determine your performance, in order of importance:

Spec	What It Determines	Example
VRAM Capacity	Does the model fit on the GPU at all?	70B Q4 needs ~40GB VRAM
Memory Bandwidth	How fast can you generate tokens?	RTX 3090: 936 GB/s → fast. P40: 346 GB/s → slow
Price per GB VRAM	How much capability per dollar?	3090: $35/GB. 4090: $73/GB. P40: $10/GB

CUDA cores, clock speed, and tensor cores matter far less than you'd expect. A Ryzen 5 5600 performs within 5% of a Ryzen 9 for token generation when the model fits entirely in VRAM, because the CPU barely participates.

Key Formula

Approximate tok/s ≈ GPU bandwidth (GB/s) ÷ model size in VRAM (GB)
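Example: RTX 3090 (936 GB/s) running a 20GB model → ~47 tok/s. This is a simplified upper bound: measured speeds land below it (the tier tables in this guide list 15–20 t/s for a 20GB model on a 3090), but it ranks hardware reliably.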
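The formula is simple enough to sketch in a few lines. This is a minimal illustration of the rule of thumb above, not a benchmark; the function name is made up for this example, and the result should be read as a ceiling rather than a measured speed.

```python
def estimate_tokens_per_second(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Rule-of-thumb ceiling: one full read of the model per generated token."""
    if model_size_gb <= 0:
        raise ValueError("model_size_gb must be positive")
    return bandwidth_gb_s / model_size_gb


# Bandwidth figures are the ones quoted in this guide.
rtx_3090 = estimate_tokens_per_second(936, 20)
tesla_p40 = estimate_tokens_per_second(346, 20)

print(f"RTX 3090, 20GB model: ~{rtx_3090:.0f} tok/s")   # ~47
print(f"Tesla P40, 20GB model: ~{tesla_p40:.0f} tok/s")  # ~17
```

The same 20GB model is roughly 2.7× faster on the 3090 purely because of memory bandwidth, which is the point of the formula.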

02 — Architecture

How Mixture-of-Experts (MoE) Models Work

Traditional "dense" models (like Llama 70B) activate every parameter for every token. A 70B model uses all 70 billion parameters on every forward pass, which is why it needs ~40GB VRAM at Q4 and why its generation speed is limited by reading that full 40GB for every token.

MoE models take a radically different approach: they have a massive total parameter count but only activate a small fraction for each token. The model contains many specialized "expert" sub-networks, and a lightweight "router" network decides which experts to activate for each input.
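To make the router idea concrete, here is a minimal sketch of top-k routing, one common scheme. It assumes the router emits one score per expert and that the chosen experts' outputs are mixed with softmax weights; real models differ in the details, and the random scores below are only a stand-in for a trained router.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route(router_scores, k):
    """Return (expert_index, mixing_weight) pairs for the k highest-scoring experts."""
    ranked = sorted(range(len(router_scores)),
                    key=lambda i: router_scores[i], reverse=True)
    top = ranked[:k]
    weights = softmax([router_scores[i] for i in top])
    return list(zip(top, weights))


random.seed(0)
scores = [random.gauss(0, 1) for _ in range(256)]  # stand-in for router output
chosen = route(scores, k=8)

print(f"{len(chosen)} of {len(scores)} experts run for this token")
```

Only the selected experts' weights are read and multiplied for this token; the other experts stay idle in memory.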

Dense Model vs MoE Model

Dense: ALL parameters fire every token. MoE: Router selects a few experts per token.

Dense (Llama 70B)

Input Token → ALL 70B params → Output Token

MoE (DeepSeek V3, 671B)

Input Token → Router (selects 8 of 256 experts) → selected experts → Output Token

Experts shown: 14 ✓, 2, 87 ✓, 4, 5, 201 ✓, 7, 55 ✓, ... (256 total experts, 8 active = 37B active params)

Why MoE Matters for Home Inference

The critical insight: all experts must live in memory (VRAM or RAM), but only the active parameters are read and computed for each token, so only they determine generation speed. This creates a unique tradeoff:

Property	Dense 70B	MoE 671B (37B active)
Total Parameters	70B	671B
Active per Token	70B (all)	37B (~5.5%)
VRAM at Q4	~40GB	~260GB
Knowledge Breadth	70B model's worth	671B model's worth
Generation Speed	Limited by reading ~40GB per token	Limited by reading 37B active ≈ 22GB per token
Quality	GPT-4-mini class	GPT-5 competitive

The MoE Paradox

At the same generation speed, an MoE model is smarter than a dense model, because it draws on far more total parameters (671B of stored knowledge versus the 37B a comparably fast dense model would have). The cost is VRAM: all 671B parameters must be stored, even though only 37B fire per token. MoE models are fast to think but expensive to store.

MoE Models Available Today

Model	Total Params	Active / Token	Experts	VRAM (Q4)	Effective Speed*
Mixtral 8×7B	46.7B	12.9B	8 total, 2 active	~26GB	Like a fast 13B
Qwen3-30B-A3B	30B	3B	128 total, 8 active	~18GB	Like a blazing 3B
Qwen3-Next-80B-A3B	80B	3.9B	MoE + Gated DeltaNet hybrid	~48GB	Like a fast 4B
Llama 4 Scout	109B	~17B	16 total, 2 active	~35GB	Like a fast 17B
Mixtral 8×22B	141B	39B	8 total, 2 active	~80GB	Like a fast 39B
DeepSeek Coder V2	236B	21B	160 total, 6 active	~90GB	Like a fast 21B
Qwen3-235B-A22B	235B	22B	128 total, 8 active	~90GB	Like a fast 22B
Qwen3-Coder-480B-A35B	480B	35B	MoE	~180GB	Like a fast 35B
DeepSeek V3.2	671B	37B	256 total, 8 active + 1 shared	~260GB	Like a fast 37B
DeepSeek R1	671B	37B	Same as V3	~260GB	Like a fast 37B
Kimi K2.5	1,000B	32B	384 total, 8 active + 1 shared	~240GB (1.8-bit)	Like a fast 32B

*"Effective speed" means the model generates tokens at a rate comparable to a dense model of the active parameter size — but with the knowledge breadth of its full parameter count.
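The gap between what must be stored and what is used per token can be computed directly from the table. This short sketch takes a few rows (total and active parameters in billions, as listed above) and reports the share of weights that fire for each token; the helper name is illustrative only.

```python
# (total params in billions, active params per token in billions), from the table above
models = {
    "Mixtral 8x7B": (46.7, 12.9),
    "Qwen3-30B-A3B": (30, 3),
    "Qwen3-235B-A22B": (235, 22),
    "DeepSeek V3.2": (671, 37),
    "Kimi K2.5": (1000, 32),
}


def active_fraction(total_b: float, active_b: float) -> float:
    """Share of the model's parameters used for a single token."""
    return active_b / total_b


for name, (total_b, active_b) in models.items():
    share = active_fraction(total_b, active_b)
    print(f"{name}: {share:.1%} of weights active per token")
```

The larger the model, the smaller the active share tends to be, which is why the biggest MoE models are storage-bound long before they are speed-bound.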

03 — Hardware Tiers

The Six Build Tiers

Each tier represents a meaningful capability jump. The key question at each tier: what models can I run at conversational speed (10+ tok/s)?

[Chart: VRAM & Bandwidth by Tier]

Tier 1 — "The Scout" ENTRY

Build Cost	$550 – $700
GPU	1× Tesla P40 24GB
Total VRAM	24 GB
Bandwidth	346 GB/s (GDDR5)
System RAM	32 GB DDR4
CPU	Ryzen 5 5600
PSU	650W
Power Draw	80–200W
Monthly Electric	~$12
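The guide does not state how the monthly electric figures were derived. As a rough sketch of the arithmetic, the snippet below converts an average power draw into a monthly cost. Both inputs are assumptions made for this example, not figures from the guide: an average draw of 110W (inside Tier 1's 80–200W range) and a rate of $0.15 per kWh. Substitute your own measured draw and utility rate.

```python
def monthly_electric_cost(avg_watts: float, usd_per_kwh: float,
                          hours_per_month: float = 24 * 30) -> float:
    """Cost of running continuously for a month at a given average draw."""
    kwh = avg_watts / 1000 * hours_per_month
    return kwh * usd_per_kwh


ASSUMED_AVG_WATTS = 110     # assumption: mostly idle with some inference bursts
ASSUMED_USD_PER_KWH = 0.15  # assumption: varies widely by region

cost = monthly_electric_cost(ASSUMED_AVG_WATTS, ASSUMED_USD_PER_KWH)
print(f"~${cost:.2f} per month")
```

With these assumed inputs the result lands close to the ~$12 listed for Tier 1; a machine that sits near its peak draw all day would cost noticeably more.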

Dense Models

Model	VRAM	Speed	Quality
Llama 3.1 8B Q4	~5GB	25–30 t/s	Good agent/chat
Phi-4 14B Q4	~9GB	15–20 t/s	Strong math+code
Qwen 2.5 32B Q4	~20GB	6–10 t/s	Best at this size
Llama 3.3 70B Q4	~40GB	1–3 t/s ❌	Doesn't fit — spills to RAM

MoE Models

Model	VRAM	Speed	Notes
Qwen3-30B-A3B (3B active)	~18GB	15–22 t/s	MoE magic — 30B knowledge at 3B speed
Mixtral 8×7B (12.9B active)	~26GB	2–4 t/s	Slightly too large, heavy offload

Tier 2 — "The Workhorse" RECOMMENDED START

Build Cost	$1,000 – $1,300
GPU	1× RTX 3090 24GB
Total VRAM	24 GB
Bandwidth	936 GB/s (GDDR6X)
System RAM	64 GB DDR4
CPU	Ryzen 5 5600
PSU	850W
Power Draw	100–320W
Monthly Electric	~$18

Dense Models

Model	VRAM	Speed	Quality
Llama 3.1 8B Q4	~5GB	40–50 t/s	Blazing watcher agent
Phi-4 14B Q4	~9GB	25–35 t/s	Excellent math+code
Qwen 2.5 32B Q4	~20GB	15–20 t/s	Daily driver ✓
DeepSeek-R1-Distill-32B Q4	~20GB	12–18 t/s	o1-mini class reasoning
Llama 3.3 70B Q4	~40GB	3–6 t/s	With RAM offload — usable for batch

MoE Models

Model	VRAM	Speed	Notes
Qwen3-30B-A3B (3B active)	~18GB	20–30 t/s	Outstanding value — frontier-class MoE on one card
Qwen3-Coder-30B-A3B	~18GB	20–30 t/s	Agentic coding with 3B active — incredible speed
Mixtral 8×7B (12.9B active)	~26GB	8–12 t/s	Tight fit, slight offload

Tier 3 — "The Commander" 70B UNLOCKED

Build Cost	$1,800 – $2,400
GPU	2× RTX 3090 24GB
Total VRAM	48 GB
Bandwidth	1,872 GB/s combined
System RAM	64 GB DDR4
CPU	Ryzen 5 5600
PSU	1200W
Power Draw	150–550W
Monthly Electric	~$23

Dense Models

Model	VRAM	Speed	Quality
Llama 3.3 70B Q4	~40GB	15–20 t/s ✅	Conversational speed. The standard
Qwen 2.5 72B Q4	~42GB	14–18 t/s	Best multilingual 70B
DeepSeek-R1-Distill-70B Q4	~40GB	14–18 t/s	Best reasoning at 70B
Qwen 2.5 Coder 72B Q4	~42GB	12–16 t/s	Top open-source coder

MoE Models

Model	VRAM	Speed	Notes
Mixtral 8×7B (12.9B active)	~26GB	20–25 t/s	Fits comfortably, fast MoE
Llama 4 Scout (~17B active)	~35GB	15–22 t/s	10M context window — read entire codebases
Qwen3-30B-A3B (3B active)	~18GB	30–40 t/s	Room for agents alongside

Tier 4 — "The General" FULL QUALITY

Build Cost	$2,800 – $3,800
GPU	2× RTX 3090 + 1× P40
Total VRAM	72 GB
Bandwidth	2,218 GB/s combined
System RAM	128 GB DDR4
PSU	1600W
Power Draw	200–750W
Monthly Electric	~$29

Dense Models

Model	VRAM	Speed	Quality
Llama 3.3 70B Q8 (full quality)	~70GB	12–15 t/s	Zero quantization loss
Command R+ 104B Q4	~60GB	8–12 t/s	Best RAG-optimized model
GLM-5 Reasoning Q4	~60GB	10–14 t/s	Top leaderboard Jan 2026

MoE Models

Model	VRAM	Speed	Notes
Mixtral 8×22B (39B active)	~80GB ⚠️	8–12 t/s	Tight — some RAM offload needed
Qwen3-235B-A22B (22B active) Q2	~70GB	4–8 t/s	Heavy quant but fits. Frontier reasoning
DeepSeek Coder V2 (21B active)	~90GB ⚠️	3–6 t/s	With RAM offload. Top coding MoE

Tier 5 — "The Admiral" FRONTIER

Build Cost	$4,500 – $6,500
GPU	4× RTX 3090 24GB
Total VRAM	96 GB
Bandwidth	3,744 GB/s combined
System RAM	128–256 GB DDR4
Platform	EPYC 7702 (recommended)
PSU	2× 1000W (dual PSU)
Form Factor	Open-air frame
Monthly Electric	~$31

Dense Models

Model	VRAM	Speed	Quality
Llama 3.3 70B Q8 (full quality)	~70GB	15–20 t/s	Fast full-quality 70B
Llama 3.3 70B FP16	~140GB ⚠️	5–8 t/s	True full precision with RAM offload
GLM-5 Reasoning Q4	~60GB	12–16 t/s	Top of leaderboards

MoE Models — This Is Where Tier 5 Shines

Model	VRAM	Speed	Notes
Qwen3-235B-A22B (22B active)	~90GB	8–14 t/s ✅	Frontier model. Rivals GPT-4o. Fits entirely in VRAM
Mixtral 8×22B (39B active)	~80GB	10–15 t/s	Fully in VRAM now. Fast MoE
DeepSeek V3.2 (37B active)	96GB + RAM	3–6 t/s	Frontier reasoning. GPT-5 competitive
DeepSeek R1 (37B active)	96GB + RAM	3–6 t/s	o1-class reasoning on your hardware
Qwen3-Coder-480B-A35B	~180GB ⚠️	2–5 t/s	Claude Sonnet 4 class coding — heavy RAM offload

Tier 6 — "The Fleet Commander" ENTERPRISE

Build Cost	$8,000 – $15,000+
GPU Options	2× A6000 48GB or 1× A100 80GB
Total VRAM	80–96 GB
Bandwidth	Up to 2,000 GB/s (A100 HBM2e)
System RAM	256 GB ECC DDR4
Platform	EPYC server (Supermicro/Dell)
Form Factor	Tower or 4U Rackmount
Noise	Whisper-quiet (blower/passive)
Monthly Electric	~$20

Why Tier 6 Exists

Tier 6 trades raw VRAM-per-dollar (where quad 3090s win) for bandwidth, reliability, noise, power efficiency, and form factor. The A100's 2 TB/s HBM2e memory is roughly 2× faster than a 3090's GDDR6X per card — meaning significantly faster per-token inference. The A100 also supports MIG partitioning: splitting one GPU into up to 7 independent instances, each running a separate model simultaneously with zero interference.

Dense Models

Model	Config	Speed	Notes
Llama 3.3 70B Q4	A100 80GB	25–35 t/s 🔥	HBM2e bandwidth = speed king
Llama 3.3 70B Q8	A100 80GB	20–28 t/s	Full quality, still blazing
Qwen 2.5 72B Q4	2× A6000 NVLink	16–22 t/s	96GB unified memory pool

MoE Models

Model	Config	Speed	Notes
Qwen3-235B-A22B	2× A6000 NVLink (96GB)	10–16 t/s	Frontier MoE, unified memory
DeepSeek V3.2 (37B active)	A100 + 256GB RAM	3–6 t/s	HBM2e accelerates active params

A100 MIG Multi-Model (Unique to Tier 6)

MIG Partition	VRAM	Model	Speed	Agent Role
Slice 1	10GB	Llama 8B Q4	15–20 t/s	Watcher
Slice 2	10GB	Phi-4 Q4	12–15 t/s	Security Scanner
Slice 3	20GB	Qwen 32B Q4	10–14 t/s	Worker
Slice 4	40GB	Llama 70B Q4	12–18 t/s	CEO Thinker

Four independent models, one card, zero interference. This is the A100's killer feature.
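A quick sanity check of a partition plan like this one can be sketched as follows. The slice sizes come from the table above and the model sizes from the tier tables earlier in this guide. The check is deliberately simplified: it ignores KV cache and runtime overhead, and it does not model the fixed set of slice profiles that real MIG offers, so treat it as a budgeting aid rather than a deployment tool.

```python
A100_VRAM_GB = 80
MAX_MIG_INSTANCES = 7

# (agent role, slice VRAM in GB, approximate model VRAM in GB)
plan = [
    ("Watcher", 10, 5),            # Llama 8B Q4
    ("Security Scanner", 10, 9),   # Phi-4 Q4
    ("Worker", 20, 20),            # Qwen 32B Q4
    ("CEO Thinker", 40, 40),       # Llama 70B Q4
]


def check_plan(plan, total_vram_gb=A100_VRAM_GB, max_instances=MAX_MIG_INSTANCES):
    """Return a list of human-readable problems; an empty list means the plan fits."""
    problems = []
    if len(plan) > max_instances:
        problems.append(f"{len(plan)} slices exceeds the limit of {max_instances}")
    used = sum(slice_gb for _, slice_gb, _ in plan)
    if used > total_vram_gb:
        problems.append(f"slices need {used}GB but the card has {total_vram_gb}GB")
    for role, slice_gb, model_gb in plan:
        if model_gb > slice_gb:
            problems.append(f"{role}: {model_gb}GB model does not fit a {slice_gb}GB slice")
    return problems


def headroom(plan):
    """GB left in each slice after loading weights (before any KV cache)."""
    return {role: slice_gb - model_gb for role, slice_gb, model_gb in plan}


print(check_plan(plan) or "plan fits")
print(headroom(plan))
```

The headroom output is worth reading closely: the two largest slices have no spare memory once weights are loaded, so context length on those models would need to be kept short or a smaller quantization chosen.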

04 — Performance

Tokens Per Second Across All Tiers

[Chart: Dense Model Inference Speed by Tier]

[Chart: MoE Model Inference Speed by Tier]

05 — MoE Tier Mapping

Where Each MoE Model Becomes Usable

The table below shows the minimum tier at which each MoE model first fits in VRAM, and the tier at which it reaches conversational speed (10+ tok/s).

[Chart: MoE Models: VRAM Requirement vs Available VRAM by Tier]

MoE Model	VRAM (Q4)	First Fits	Conversational Speed (10+ t/s)	Peak Speed (best tier)
Qwen3-30B-A3B	~18GB	Tier 1	Tier 1 (15–22 t/s)	T3: 30–40 t/s
Mixtral 8×7B	~26GB	Tier 2 (tight, slight offload)	Tier 3 (20–25 t/s)	T3: 20–25 t/s
Llama 4 Scout	~35GB	Tier 3	Tier 3 (15–22 t/s)	T5: 18–25 t/s
Mixtral 8×22B	~80GB	Tier 5	Tier 5 (10–15 t/s)	T5: 10–15 t/s
Qwen3-235B-A22B	~90GB	Tier 5	Tier 5 (8–14 t/s)	T6: 10–16 t/s
DeepSeek V3.2 / R1	~260GB	Tier 5 + 256GB RAM	Never (3–6 t/s max)	T5/T6: 3–6 t/s
Kimi K2.5	~240GB (1.8-bit)	Tier 5 + RAM (extreme quant)	Never (1–2 t/s max)	Needs 2× H100
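The "first fits" column can be approximated with a simple lookup. The sketch below uses each tier's total VRAM from the spec tables and a strict rule: a model fits only if its listed size is no larger than the tier's VRAM. That rule is an assumption of this example; it ignores KV cache and does not count partial RAM offload as fitting.

```python
# Total VRAM per tier in GB, taken from the tier spec tables.
# Tier 6 uses the 2x A6000 option (96GB); the single A100 option has 80GB.
TIER_VRAM_GB = {1: 24, 2: 24, 3: 48, 4: 72, 5: 96, 6: 96}

# Approximate model sizes in GB, taken from the table above.
MOE_MODELS_GB = {
    "Qwen3-30B-A3B": 18,
    "Mixtral 8x7B": 26,
    "Llama 4 Scout": 35,
    "Mixtral 8x22B": 80,
    "Qwen3-235B-A22B": 90,
    "DeepSeek V3.2 / R1": 260,
}


def first_tier_that_fits(model_gb, tiers=TIER_VRAM_GB):
    """Lowest tier whose total VRAM holds the whole model, or None."""
    for tier in sorted(tiers):
        if model_gb <= tiers[tier]:
            return tier
    return None


for name, size_gb in MOE_MODELS_GB.items():
    tier = first_tier_that_fits(size_gb)
    label = f"Tier {tier}" if tier else "no tier (needs system RAM)"
    print(f"{name} (~{size_gb}GB): {label}")
```

Under the strict rule Mixtral 8×7B lands on Tier 3 rather than Tier 2, because its Tier 2 entry depends on offloading a small part of the model to system RAM.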

The MoE Sweet Spot

Qwen3-235B-A22B on Tier 5 is the most exciting MoE model for homelab builders. With 235B total parameters but only 22B active, it delivers frontier-class intelligence (rivaling GPT-4o on many benchmarks) at speeds comparable to a dense 22B model. At 8–14 tok/s on quad 3090s, it's genuinely conversational — a full 235B model responding as fast as you can read. This is the "holy grail" of local AI: proprietary-model quality with zero token costs.

06 — Cost Analysis

Total Cost of Ownership

3-Year Total Cost of Ownership (Build + Electricity)

Tier	Build Cost	Year 1 Electric	Year 1 Total	Year 3 Total	Break-Even vs API*
Tier 1	~$650	~$144	$794	$1,082	~4 months
Tier 2	~$1,200	~$216	$1,416	$1,848	~7 months
Tier 3	~$2,100	~$276	$2,376	$2,928	~13 months
Tier 4	~$3,100	~$346	$3,446	$4,138	~21 months
Tier 5	~$4,700	~$373	$5,073	$5,819	~32 months
Tier 6	~$8,000	~$240	$8,240	$8,720	~50 months

*Break-even = build cost ÷ monthly savings, where monthly savings is moderate Claude Sonnet API usage (~2M tokens/day = ~$2,160/year, or $180/month) minus that tier's monthly electricity. Your actual usage will vary.
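The totals in the table are straightforward to reproduce, and doing so makes it easy to plug in your own API spend. In this sketch, break-even is modeled as build cost divided by monthly savings (API spend minus electricity). That model is an assumption of the example, and the result is sensitive to the API figure you choose.

```python
API_COST_PER_YEAR = 2160  # moderate usage figure quoted in the footnote

# tier: (build cost in USD, electricity per year in USD), from the table above
TIERS = {
    "Tier 1": (650, 144),
    "Tier 2": (1200, 216),
    "Tier 3": (2100, 276),
    "Tier 4": (3100, 346),
    "Tier 5": (4700, 373),
    "Tier 6": (8000, 240),
}


def total_cost(build, electric_per_year, years):
    """Build cost plus electricity over the given number of years."""
    return build + electric_per_year * years


def break_even_months(build, electric_per_year, api_per_year=API_COST_PER_YEAR):
    """Months until API savings repay the build; None if there are no savings."""
    monthly_saving = (api_per_year - electric_per_year) / 12
    if monthly_saving <= 0:
        return None
    return build / monthly_saving


for name, (build, electric) in TIERS.items():
    months = break_even_months(build, electric)
    print(f"{name}: year 1 ${total_cost(build, electric, 1):,}, "
          f"year 3 ${total_cost(build, electric, 3):,}, "
          f"break-even ~{months:.1f} months")
```

Halving the assumed API spend roughly doubles every break-even figure, so light users should rerun the numbers before committing to a higher tier.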

07 — Why RTX 3090?

The GPU That Dominates Every Tier

Four of the six tiers (Tiers 2 through 5) are built around the RTX 3090. Here's why no other card comes close at this price point:

[Chart: VRAM per Dollar, Used GPU Market (Feb 2026)]

The 3090 offers 24GB at 936 GB/s for ~$800 used. The 4090 gives the same 24GB at 1,008 GB/s for ~$1,700: an 8% bandwidth gain for 112% more money. The 5090 offers 32GB at 1,792 GB/s for $2,500+, so three 3090s (72GB, 2,808 GB/s combined) cost about the same as one 5090 while delivering 2.25× the VRAM and 1.57× the bandwidth.
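The ratios in this comparison can be checked with a few lines of arithmetic. Prices are the approximate used-market figures quoted in this section. The RTX 5090's 1,792 GB/s bandwidth is NVIDIA's published spec, included here because the 1.57× bandwidth ratio depends on it.

```python
CARDS = {
    "RTX 3090": {"vram_gb": 24, "bandwidth_gb_s": 936, "price_usd": 800},
    "RTX 4090": {"vram_gb": 24, "bandwidth_gb_s": 1008, "price_usd": 1700},
    "RTX 5090": {"vram_gb": 32, "bandwidth_gb_s": 1792, "price_usd": 2500},
}


def usd_per_gb(card):
    """Price per GB of VRAM."""
    return card["price_usd"] / card["vram_gb"]


def percent_more(new, old):
    """How much larger `new` is than `old`, in percent."""
    return (new / old - 1) * 100


r3090, r4090, r5090 = CARDS["RTX 3090"], CARDS["RTX 4090"], CARDS["RTX 5090"]

for name, card in CARDS.items():
    print(f"{name}: ${usd_per_gb(card):.0f} per GB of VRAM")

bw_gain = percent_more(r4090["bandwidth_gb_s"], r3090["bandwidth_gb_s"])
price_gain = percent_more(r4090["price_usd"], r3090["price_usd"])
print(f"4090 vs 3090: +{bw_gain:.1f}% bandwidth for +{price_gain:.1f}% price")

# Three 3090s against a single 5090
vram_ratio = 3 * r3090["vram_gb"] / r5090["vram_gb"]
bw_ratio = 3 * r3090["bandwidth_gb_s"] / r5090["bandwidth_gb_s"]
print(f"3x 3090 vs 5090: {vram_ratio:.2f}x VRAM, {bw_ratio:.2f}x bandwidth")
```

Note that summed bandwidth across three cards is a best-case figure; how much of it a given inference setup realizes depends on how the model is split across GPUs.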

The only alternatives that make sense are professional/datacenter cards (A6000, A100) at Tier 6, where you're paying a premium for lower noise, better thermals, reliability, and per-card VRAM capacity rather than price/performance.

08 — Recommendations

Which Tier Should You Build?

Your Goal	Recommended Tier	Why
Learn Linux + AI, run 8B/32B agents	Tier 1 ($650)	Cheapest entry. Validates the workflow
Serious daily driver, 32B at full speed	Tier 2 ($1,200)	Best value. Upgradeable to Tier 3
70B at conversational speed, character work	Tier 3 ($2,100)	The capability inflection point
70B at full quality, large MoE experiments	Tier 4 ($3,100)	Q8 quality, zero quantization compromise
Frontier MoE models, multi-agent swarm	Tier 5 ($4,700)	Qwen3-235B + DeepSeek R1 on your hardware
Enterprise reliability, silent 24/7, speed	Tier 6 ($8,000+)	A100's bandwidth + MIG is unmatched

The Piecemeal Path to Tier 5

Start with an EPYC motherboard + single 3090 (~$1,800). This gives you Tier 2 performance on a Tier 5 platform. Add GPUs over time — each 3090 is a ~$800 modular upgrade. Full Tier 5 builds out over 6–12 months with zero wasted components.
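The upgrade path can be laid out step by step. This sketch uses the two prices given above (~$1,800 for the EPYC platform with one 3090, ~$800 per additional 3090) and counts GPU spend only; any extra PSU, RAM, or frame purchases along the way are not included.

```python
BASE_COST_USD = 1800   # EPYC motherboard platform + first RTX 3090
GPU_COST_USD = 800     # each additional used RTX 3090
VRAM_PER_GPU_GB = 24


def upgrade_path(max_gpus=4):
    """Cumulative cost and VRAM after each GPU is added."""
    steps = []
    for gpus in range(1, max_gpus + 1):
        steps.append({
            "gpus": gpus,
            "vram_gb": gpus * VRAM_PER_GPU_GB,
            "cumulative_cost_usd": BASE_COST_USD + GPU_COST_USD * (gpus - 1),
        })
    return steps


for step in upgrade_path():
    print(f"{step['gpus']}x 3090: {step['vram_gb']}GB VRAM, "
          f"~${step['cumulative_cost_usd']:,} spent")
```

Each step corresponds to a tier's VRAM total (24, 48, 72, then 96GB), so the path passes through Tier 2 and Tier 3 capability on the way to Tier 5.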