01 Side-by-side: Three Speeds
Same prompt, same response, three different speeds. The slow lane (CPU-only) is real life for anyone without a GPU. The fast lane (RTX 4090) is what enthusiasts pay for. Watch and feel the difference.
02 Estimate Your Hardware
Pick your GPU, model size, and quantization. Estimates use published llama.cpp / Ollama community benchmarks at typical context lengths. Real-world numbers vary by ±20%.
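To make the ±20% spread concrete, here is a minimal sketch that turns a single benchmark figure into a plausible range. The function name `estimate_range` and the 50 tok/s input are hypothetical, chosen only for illustration; they are not taken from any benchmark table.

```python
def estimate_range(benchmark_tok_s: float, spread: float = 0.20) -> tuple[float, float]:
    """Return (low, high) tok/s for a benchmark figure and a fractional spread.

    The default spread of 0.20 mirrors the ±20% variation quoted above.
    """
    low = benchmark_tok_s * (1 - spread)
    high = benchmark_tok_s * (1 + spread)
    return low, high


# 50.0 tok/s is an illustrative placeholder, not a measured value.
low, high = estimate_range(50.0)
print(f"Expect roughly {low:.0f} to {high:.0f} tok/s")
```

Treat the result as a band to plan around rather than a guarantee; a setup with poor cooling or an unusually long prompt can fall outside it.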
03 What Tokens / Second Actually Means
Speed isn't just "faster is better." Different workloads care about different things. Here's how to think about it.
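One way to compare two setups is to add the startup delay to the generation time. The sketch below does that for a 200-token answer at 30 tok/s with a 4 s wait before the first token, against a 500-token answer at 80 tok/s. The 0.5 s startup delay for the faster setup is an assumed value for illustration, and `response_time` is a hypothetical helper, not part of any library.

```python
def response_time(tokens: int, tok_per_s: float, ttft_s: float) -> float:
    """Seconds from sending the prompt until the last token arrives.

    ttft_s is the time to first token (prompt evaluation plus any load delay);
    tokens / tok_per_s is the time spent generating the answer itself.
    """
    return ttft_s + tokens / tok_per_s


slow = response_time(tokens=200, tok_per_s=30, ttft_s=4.0)
# Assumed startup delay of 0.5 s for the faster setup (illustrative only).
fast = response_time(tokens=500, tok_per_s=80, ttft_s=0.5)

print(f"200 tokens @ 30 tok/s, 4.0 s to start: {slow:.2f} s total")
print(f"500 tokens @ 80 tok/s, 0.5 s to start: {fast:.2f} s total")
```

Under these assumptions the longer answer finishes first, and it also starts appearing on screen seconds earlier, which is what a reader actually notices.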
Time to first token often matters more than tok/s. A 200-token answer at 30 tok/s feels worse than a 500-token answer at 80 tok/s if the first one took 4 s to start. Watch prompt-eval speed too, since it determines how long a long prompt takes to process before the first token appears.
04 Sources & Methodology
Estimates derived from published llama.cpp benchmark threads, Ollama community testing, and HuggingFace model cards. Real-world performance varies ±20% depending on driver, batch settings, prompt length, and thermal headroom.
Benchmark Sources
- llama.cpp GPU performance discussion (#4167) — community-maintained tok/s tables across CUDA, Metal, ROCm
- llama.cpp wiki — official benchmark methodology, M-series Apple Silicon results
- Ollama Library — model cards with reference hardware speeds
- HuggingFace Hub — published quantization tradeoffs and inference speeds
- r/LocalLLaMA benchmark threads — real-world community numbers across hardware
Speed estimates assume single-user inference, Q4_K_M quantization, default context window, and a fully cached model. Multi-user serving (vLLM, TGI) achieves higher aggregate throughput across requests via batching, so its numbers are not directly comparable to these single-user figures. Last calibrated: May 25, 2026.