AI Hardware Stack Builder
Plan your local AI workstation — see what fits, how fast it runs, and build your perfect stack
// PRIMER
This tool estimates whether a local AI model fits your GPU and how fast it will run. The short version: a model's VRAM use is mostly its parameter count × the bytes each parameter takes at a given quantization. A 7B model needs ~14 GB at FP16 but only ~3.5 GB at 4-bit, before KV cache and runtime overhead. Pick your hardware below to see what fits, or jump to how VRAM sizing works.
01 Select Your Hardware
CPU-Only Inference
CPU-only inference is possible but much slower than GPU (typically 3-15 tok/s for 7B models). Models load into regular system RAM instead of VRAM. Enter your total system RAM below to see what fits.
Speed estimate: roughly 1 tok/s per 2 GB of model size at Q4 quantization. Leave ~4 GB free for OS overhead.
02 What Fits
03 Your Stack
04 Stack Summary
All speed estimates are approximate. Each one is GPU memory bandwidth divided by model size at the selected quantization, scaled by ~60% real-world efficiency. Actual performance varies with context length, batch size, prompt complexity, and system config. See reference table below.
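The speed rule above can be sketched in a few lines. This is a minimal illustration of the stated formula, not the tool's actual code: the function name and the example figures (a hypothetical 1000 GB/s card and a 4 GB model file) are assumptions chosen only to show the arithmetic.

```python
def estimate_tokens_per_second(
    bandwidth_gb_s: float,
    model_size_gb: float,
    efficiency: float = 0.6,  # the ~60% real-world efficiency quoted above
) -> float:
    """Rough decode speed: each new token reads the whole model from memory once."""
    if model_size_gb <= 0:
        raise ValueError("model_size_gb must be positive")
    return bandwidth_gb_s * efficiency / model_size_gb


# Hypothetical inputs: 1000 GB/s memory bandwidth, 4 GB model at the chosen quant.
example_speed = estimate_tokens_per_second(1000.0, 4.0)
print(f"{example_speed:.0f} tok/s")
```

Because model size sits in the denominator, doubling the model size at the same quantization roughly halves the estimate.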
05 Recommended Stacks
06 Token Speed Reference
Estimated tok/s at Q4_K_M quantization.
07 How VRAM Sizing Works
VRAM is the single biggest constraint on running models locally. The math is simple enough to do in your head, and it explains every “fits / doesn't fit” result above.
model VRAM ≈ parameters × bytes/param + KV cache + overhead
// example a 7B model at 4-bit ≈ 7 × 0.5 = 3.5 GB of weights, plus ~1–2 GB for the KV cache and runtime → ~4.5–5.5 GB in total, so budget ~5–6 GB free. At FP16 (2 bytes per parameter) the weights alone would need ~14 GB.
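The same sizing formula, written as a small sketch. The bytes-per-parameter values are the ones this page uses (2, 1, and 0.5); the function names and the single `extra_gb` term, which lumps KV cache and runtime overhead together at the midpoint of the ~1–2 GB range, are assumptions made for illustration.

```python
# Bytes per parameter at each precision, as used on this page.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "q4": 0.5}


def weights_gb(params_billions: float, quant: str) -> float:
    """Weight size in GB: 1 billion parameters at 1 byte each is 1 GB."""
    return params_billions * BYTES_PER_PARAM[quant]


def estimate_vram_gb(params_billions: float, quant: str, extra_gb: float = 1.5) -> float:
    """Weights plus an assumed allowance for KV cache and runtime overhead."""
    return weights_gb(params_billions, quant) + extra_gb


print(weights_gb(7, "q4"))        # weights only, 4-bit
print(estimate_vram_gb(7, "q4"))  # weights plus the assumed 1.5 GB allowance
print(weights_gb(7, "fp16"))      # weights only, FP16
```

Real quantized files are often slightly larger than the 0.5 bytes/param figure suggests, so treat the result as a lower bound rather than an exact size.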
// What runs on what — at 4-bit (Q4)
| VRAM | 7B | 13B | 32B | 70B |
|---|---|---|---|---|
| 8 GB | runs | no | no | no |
| 12 GB | runs | runs | no | no |
| 16 GB | runs | runs | no | no |
| 24 GB | runs | runs | tight | no |
| 48 GB | runs | runs | runs | tight |
Rough guide at 4-bit quantization with moderate context. Higher precision, long context, or running multiple models need more. Use the builder above for your exact GPU and model.
Common questions
Can I run a 70B model on a 24 GB GPU?
Not at usable quality. A 70B model needs ~40 GB even at 4-bit. Options: two 24 GB cards (VRAM combines for inference), a single 48 GB card, an aggressive 2–3 bit quant (quality drops), or a smaller model.
FP16 vs 8-bit vs 4-bit — what's the tradeoff?
Quantization shrinks each weight. FP16 (2 bytes) is full quality; 8-bit (1 byte) halves VRAM with near-zero quality loss; 4-bit (0.5 byte) quarters it with a small, usually-acceptable quality hit. 4-bit (Q4_K_M) is the common sweet spot for local use.
Does context length change VRAM?
Yes. The KV cache grows with context length, so a 32K-token session needs noticeably more headroom than a 4K one. The estimate above assumes moderate context — long context can add several GB.
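To see why context length matters, here is a rough sketch of the standard KV cache size calculation: two tensors (keys and values) per layer, each holding one vector per KV head per token. The model dimensions below are illustrative assumptions (roughly a 7B-class transformer, with and without grouped-query attention), not values taken from this tool.

```python
def kv_cache_gib(
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    n_tokens: int,
    bytes_per_element: int = 2,  # FP16 cache entries (assumed)
) -> float:
    """KV cache size in GiB: keys + values for every layer, KV head, and token."""
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_element * n_tokens
    return total_bytes / 2**30


# Assumed dimensions: 32 layers, head_dim 128.
for kv_heads in (32, 8):  # 32 = full multi-head attention, 8 = grouped-query attention
    for tokens in (4096, 32768):
        size = kv_cache_gib(32, kv_heads, 128, tokens)
        print(f"kv_heads={kv_heads:2d} tokens={tokens:5d} -> {size:.1f} GiB")
```

The cache grows linearly with token count, so going from 4K to 32K multiplies it by eight. Models that use grouped-query attention keep fewer KV heads and need proportionally less.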
Do two GPUs combine their VRAM?
For inference, yes — the model is split across cards, so 2×24 GB can hold a model needing ~40 GB. Speed is gated by the slower card's bandwidth and the interconnect; it doesn't double throughput.