AI Hardware Stack Builder — Local AI Models

// PRIMER

This tool estimates whether a local AI model fits your GPU and how fast it will run. The short version: a model's VRAM use is mostly its parameter count × its quantization — a 7B model needs ~14 GB at full precision but only ~3.5 GB at 4-bit. Pick your hardware below to see what fits, or jump to how VRAM sizing works.

01 Select Your Hardware

Available VRAM

CPU-Only Inference

CPU-only inference is possible but much slower than GPU (typically 3-15 tok/s for 7B models). Models load into regular system RAM instead of VRAM. Enter your total system RAM below to see what fits.

System RAM (GB):

Speed estimate: roughly 1 tok/s per 2 GB of model size at Q4 quantization. Leave ~4 GB free for OS overhead.

Custom VRAM:

02 What Fits

03 Your Stack

Context Length 8K

VRAM Usage No stack selected

Total VRAM exceeds hardware limit. Some models may not run simultaneously.

Add models from the list above to build your stack

04 Stack Summary

VRAM Capacity

0used of 0 GB

All speed estimates are approximate, based on GPU memory bandwidth / model size at the selected quantization with ~60% real-world efficiency. Actual performance varies by context length, batch size, prompt complexity, and system config. See reference table below.

05 Recommended Stacks

06 Token Speed Reference

Estimated tok/s at Q4_K_M quantization. Click a column header to sort.

07 How VRAM Sizing Works

VRAM is the single biggest constraint on running models locally. The math is simple enough to do in your head, and it explains every “fits / doesn't fit” result above.

// THE ESTIMATE

model VRAM ≈ parameters × bytes/param + KV cache + overhead

bytes/param by quantization: FP16 2.0 8-bit 1.0 4-bit 0.5

// example a 7B model at 4-bit ≈ 7 × 0.5 = 3.5 GB of weights, plus ~1–2 GB for the KV cache and runtime → you need ~5–6 GB free. Full precision (FP16) would need ~14 GB.

// What runs on what — at 4-bit (Q4)

VRAM	7B	13B	32B	70B
8 GB	runs	no	no	no
12 GB	runs	runs	no	no
16 GB	runs	runs	no	no
24 GB	runs	runs	tight	no
48 GB	runs	runs	runs	tight

Rough guide at 4-bit quantization with moderate context. Higher precision, long context, or running multiple models need more. Use the builder above for your exact GPU and model.

▶ Common questions

Can I run a 70B model on a 24 GB GPU?

Not at usable quality. A 70B model needs ~40 GB even at 4-bit. Options: two 24 GB cards (VRAM combines for inference), a single 48 GB card, an aggressive 2–3 bit quant (quality drops), or a smaller model.

FP16 vs 8-bit vs 4-bit — what's the tradeoff?

Quantization shrinks each weight. FP16 (2 bytes) is full quality; 8-bit (1 byte) halves VRAM with near-zero quality loss; 4-bit (0.5 byte) quarters it with a small, usually-acceptable quality hit. 4-bit (Q4_K_M) is the common sweet spot for local use.

Does context length change VRAM?

Yes. The KV cache grows with context length, so a 32K-token session needs noticeably more headroom than a 4K one. The estimate above assumes moderate context — long context can add several GB.

Do two GPUs combine their VRAM?

For inference, yes — the model is split across cards, so 2×24 GB can hold a model needing ~40 GB. Speed is gated by the slower card's bandwidth and the interconnect; it doesn't double throughput.