
AI Hardware Stack Builder

Plan your local AI workstation — see what fits, how fast it runs, and build your perfect stack


// PRIMER

This tool estimates whether a local AI model fits your GPU and how fast it will run. The short version: a model's VRAM use is mostly its parameter count × bytes per parameter, which its quantization sets. A 7B model needs ~14 GB at FP16 but only ~3.5 GB at 4-bit. Pick your hardware below to see what fits, or jump to section 07, How VRAM Sizing Works.

01 Select Your Hardware


CPU-Only Inference

CPU-only inference is possible but much slower than GPU (typically 3-15 tok/s for 7B models). Models load into regular system RAM instead of VRAM. Enter your total system RAM below to see what fits.

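Speed estimate: CPU inference is limited by system RAM bandwidth, so tok/s falls roughly in proportion to model size. A model twice as large at Q4 quantization runs about half as fast. Leave ~4 GB of RAM free for OS overhead.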

02 What Fits

03 Your Stack

Add models from the list above to build your stack. If the stack's total VRAM exceeds your hardware limit, some models may not run simultaneously.

04 Stack Summary


All speed estimates are approximate: tok/s ≈ GPU memory bandwidth ÷ model size at the selected quantization, scaled to ~60% real-world efficiency. Actual performance varies with context length, batch size, prompt complexity, and system configuration. See the Token Speed Reference below.
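The speed estimate can be sketched in a few lines. This is a minimal illustration and not the tool's actual code: the function name `estimate_tok_per_s` and the 1000 GB/s bandwidth figure are assumptions made for the example; only the formula (bandwidth ÷ model size × ~60% efficiency) comes from the text.

```python
def estimate_tok_per_s(bandwidth_gb_s: float, model_size_gb: float,
                       efficiency: float = 0.6) -> float:
    """Rough decode speed, assuming each generated token reads all weights once.

    bandwidth_gb_s: memory bandwidth of the GPU in GB/s
    model_size_gb:  size of the model weights at the chosen quantization
    efficiency:     fraction of peak bandwidth reached in practice (~0.6 per the text)
    """
    if bandwidth_gb_s <= 0 or model_size_gb <= 0:
        raise ValueError("bandwidth and model size must be positive")
    return bandwidth_gb_s / model_size_gb * efficiency


# Hypothetical card with 1000 GB/s of bandwidth running a 7B model at 4-bit (3.5 GB).
speed = estimate_tok_per_s(1000.0, 3.5)
print(f"{speed:.0f} tok/s")
```

Because the model size sits in the denominator, halving the bytes per parameter roughly doubles the estimate, which is why lower-bit quantization is faster as well as smaller.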

05 Recommended Stacks

06 Token Speed Reference

Estimated tok/s at Q4_K_M quantization.

07 How VRAM Sizing Works

VRAM is the single biggest constraint on running models locally. The math is simple enough to do in your head, and it explains every “fits / doesn't fit” result above.

// THE ESTIMATE

model VRAM ≈ parameters × bytes/param + KV cache + overhead

bytes/param by quantization: FP16 = 2.0 · 8-bit = 1.0 · 4-bit = 0.5

// example  a 7B model at 4-bit ≈ 7B params × 0.5 bytes = 3.5 GB of weights, plus ~1–2 GB for the KV cache and runtime → you need ~5–6 GB free. The same model at FP16 would need ~14 GB for the weights alone.
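The same arithmetic as a small sketch. The function names and the single `extra_gb` term (1.5 GB, the midpoint of the ~1–2 GB quoted for KV cache and runtime) are assumptions for illustration; the bytes-per-parameter values are the ones listed above.

```python
# Bytes per parameter for each quantization level, as listed in the estimate.
BYTES_PER_PARAM = {"fp16": 2.0, "8bit": 1.0, "4bit": 0.5}


def weights_gb(params_billions: float, quant: str = "4bit") -> float:
    """Size of the weights alone: billions of params x bytes/param = GB."""
    return params_billions * BYTES_PER_PARAM[quant]


def estimate_vram_gb(params_billions: float, quant: str = "4bit",
                     extra_gb: float = 1.5) -> float:
    """Weights plus a flat allowance for KV cache and runtime overhead.

    extra_gb is an assumed midpoint; long contexts need a larger allowance.
    """
    return weights_gb(params_billions, quant) + extra_gb


print(weights_gb(7, "4bit"))        # weights only for a 7B model at 4-bit
print(estimate_vram_gb(7, "4bit"))  # weights plus the assumed allowance
print(weights_gb(7, "fp16"))        # weights only at FP16
```

Passing a larger `extra_gb` is the simplest way to account for long context or several models sharing one GPU.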

// What runs on what — at 4-bit (Q4)

VRAM    7B    13B   32B    70B
8 GB    runs  no    no     no
12 GB   runs  runs  no     no
16 GB   runs  runs  no     no
24 GB   runs  runs  tight  no
48 GB   runs  runs  runs   tight

Rough guide at 4-bit quantization with moderate context. Higher precision, longer context, or running several models at once needs more VRAM. Use the builder above for your exact GPU and model.

Common questions

Can I run a 70B model on a 24 GB GPU?

Not at usable quality. A 70B model needs ~40 GB even at 4-bit. Options: two 24 GB cards (VRAM combines for inference), a single 48 GB card, an aggressive 2–3 bit quant (quality drops), or a smaller model.

FP16 vs 8-bit vs 4-bit — what's the tradeoff?

Quantization stores each weight in fewer bits. FP16 (2 bytes per parameter) is the full-quality baseline; 8-bit (1 byte) halves VRAM with near-zero quality loss; 4-bit (0.5 bytes) cuts it to a quarter with a small, usually acceptable quality hit. 4-bit (Q4_K_M) is the common sweet spot for local use.

Does context length change VRAM?

Yes. The KV cache grows with context length, so a 32K-token session needs noticeably more headroom than a 4K one. The estimate above assumes moderate context — long context can add several GB.
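To see why context matters, here is a rough sketch of the standard KV cache size calculation for a transformer: one key and one value vector per layer, per KV head, per token. The model shape used (32 layers, 32 KV heads, head dimension 128, FP16 cache) is a hypothetical 7B-class configuration chosen for illustration, not a figure from this page.

```python
GIB = 1024 ** 3


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_tokens: int, bytes_per_elem: int = 2) -> int:
    """KV cache size in bytes for a single sequence.

    The leading 2 counts the key tensor and the value tensor in each layer.
    bytes_per_elem=2 assumes the cache is stored in FP16.
    """
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_tokens


# Hypothetical 7B-class shape; real models differ, check the model's config.
for context in (4096, 32768):
    size = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, n_tokens=context)
    print(f"{context:>6} tokens: {size / GIB:.1f} GiB")
```

The cache grows linearly with the number of tokens. Models that use grouped-query attention have fewer KV heads, which shrinks the cache by the same factor.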

Do two GPUs combine their VRAM?

For inference, yes — the model is split across cards, so 2×24 GB can hold a model needing ~40 GB. Speed is gated by the slower card's bandwidth and the interconnect; it doesn't double throughput.
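A minimal sketch of the capacity check for a split model. The helper names and the 1 GB per-card reserve are assumptions for illustration; real runtimes decide the split themselves and have their own per-card overhead.

```python
def fits_across_gpus(required_gb: float, gpu_vram_gb: list[float],
                     reserve_per_gpu_gb: float = 1.0) -> bool:
    """True if the model fits in the combined VRAM of all cards.

    reserve_per_gpu_gb is an assumed per-card allowance for runtime buffers.
    """
    usable = [max(vram - reserve_per_gpu_gb, 0.0) for vram in gpu_vram_gb]
    return sum(usable) >= required_gb


def split_proportionally(required_gb: float, gpu_vram_gb: list[float]) -> list[float]:
    """Share of the model placed on each card, proportional to its VRAM."""
    total = sum(gpu_vram_gb)
    return [required_gb * vram / total for vram in gpu_vram_gb]


# A model needing ~40 GB: one 24 GB card versus two.
print(fits_across_gpus(40.0, [24.0]))
print(fits_across_gpus(40.0, [24.0, 24.0]))
print(split_proportionally(40.0, [24.0, 24.0]))
```

This only answers whether the model fits. It says nothing about speed, which still depends on each card's bandwidth and the link between them.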