01 Side-by-side: Three Speeds

Same prompt, same response, three different speeds. The slow lane (CPU-only) is real life for anyone without a GPU. The fast lane (RTX 4090) is what enthusiasts pay for. Watch and feel the difference.

Slow Lane · 5 tok/s
Laptop CPU · no GPU
Painful: slower than you read

Sweet Spot · 25 tok/s
M2 / RTX 3060 · 7B Q4_K_M
Comfortable: matches reading pace

Fast Lane · 80 tok/s
RTX 4090 · 7B Q4_K_M
Effortless: faster than you read

Where the thresholds are
Average reading speed is around 250 words/min, or roughly 5–7 tokens/sec. Below that you're waiting on the model. Above ~30 tok/s you can't read fast enough, and the speed becomes invisible.
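The conversion from words per minute to tokens per second can be sketched as below. The tokens-per-word ratios (1.3 and 1.6) are assumptions for illustration, not figures from the benchmarks; the real ratio depends on the tokenizer and the text.

```python
def reading_speed_tok_per_sec(words_per_min: float, tokens_per_word: float) -> float:
    """Convert a reading speed in words/min to an equivalent tokens/sec."""
    return words_per_min * tokens_per_word / 60.0


# Assumed tokens-per-word ratios; these vary by tokenizer and language.
low = reading_speed_tok_per_sec(250, 1.3)
high = reading_speed_tok_per_sec(250, 1.6)

print(f"Reading pace: {low:.1f} to {high:.1f} tok/s")
```

Under those assumed ratios, 250 words/min lands inside the 5–7 tok/s band quoted above.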

02 Estimate Your Hardware

Pick your GPU, model size, and quantization. Estimates use published llama.cpp / Ollama community benchmarks at typical context lengths. Real-world numbers vary by ±20%.

105 tokens/second (estimated)
Effortless — faster than you read
RTX 4090 running Llama 3.1 8B at Q4_K_M, 8K context. Streaming feels instant — text arrives faster than you can read it. This is the consumer reference point.

03 What Tokens / Second Actually Means

Speed isn't just "faster is better." Different workloads care about different things. Here's how to think about it.

CHAT
Reading speed = ~5–7 tok/s
The average adult reads ~250 wpm, which is roughly 5–7 LLM tokens/sec. Below that, you're staring at the screen waiting. Above ~30 tok/s the streaming becomes invisible: you end up reading the whole response after it's done.
RAG
RAG cares about TTFT, not throughput
For retrieval workflows, time-to-first-token matters more than tok/s. A 200-token answer at 30 tok/s feels worse than a 500-token answer at 80 tok/s if the first one took 4s to start. Watch prompt-eval speed too.
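A rough way to see why the start-up delay dominates is to add time-to-first-token to generation time. This is a minimal sketch: the 0.5 s TTFT for the faster answer, the 4,000-token prompt, and the 1,000 tok/s prompt-eval speed are assumed values for illustration, and the TTFT estimate ignores retrieval and model-load time.

```python
def total_latency_s(ttft_s: float, n_tokens: int, tok_per_s: float) -> float:
    """Wall-clock time for a response: wait for the first token, then stream the rest."""
    return ttft_s + n_tokens / tok_per_s


def estimate_ttft_s(prompt_tokens: int, prompt_eval_tok_per_s: float) -> float:
    """Approximate TTFT as prompt processing time only (an assumed simplification)."""
    return prompt_tokens / prompt_eval_tok_per_s


# 200-token answer at 30 tok/s that took 4 s to start (figures from the text above).
slow_start = total_latency_s(4.0, 200, 30)
# 500-token answer at 80 tok/s; the 0.5 s TTFT is an assumption.
fast_start = total_latency_s(0.5, 500, 80)

print(f"slow start: {slow_start:.2f}s, fast start: {fast_start:.2f}s")
print(f"TTFT for a 4000-token prompt at 1000 tok/s: {estimate_ttft_s(4000, 1000):.1f}s")
```

Even though the second answer is 2.5× longer, it finishes sooner, and the long retrieved prompt alone accounts for the 4 s wait in this example.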
CODE
Coding tools want headroom
Aider, Continue, and Cursor pipe responses through diff parsers and apply patches live. 80+ tok/s keeps the loop tight; under 20 tok/s, you're babysitting. Match your hardware to your workflow, not just your model.
BATCH
Batch jobs want total throughput
Summarizing 10,000 tickets overnight? You don't care about streaming; you care about tokens per hour. Batch inference engines (vLLM, TGI) can push aggregate throughput to 5–10× the single-stream speed by serving many requests in parallel.
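As a back-of-the-envelope sketch of the overnight job: the 300 output tokens per ticket and the 25 tok/s single-stream baseline are assumptions for illustration, and prompt processing time is ignored.

```python
def job_hours(n_items: int, tokens_per_item: int, tok_per_s: float) -> float:
    """Hours needed to generate n_items * tokens_per_item tokens at a given throughput."""
    return n_items * tokens_per_item / tok_per_s / 3600.0


TICKETS = 10_000
TOKENS_PER_TICKET = 300   # assumed summary length
SINGLE_STREAM = 25.0      # assumed single-stream tok/s

hours_single = job_hours(TICKETS, TOKENS_PER_TICKET, SINGLE_STREAM)
hours_5x = job_hours(TICKETS, TOKENS_PER_TICKET, SINGLE_STREAM * 5)
hours_10x = job_hours(TICKETS, TOKENS_PER_TICKET, SINGLE_STREAM * 10)

print(f"single stream: {hours_single:.1f} h")
print(f"batched 5x:    {hours_5x:.1f} h")
print(f"batched 10x:   {hours_10x:.1f} h")
```

With these assumed numbers, a job that would not finish overnight on one stream fits comfortably once batching multiplies aggregate throughput.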
CTX
Long context taxes both ends
Doubling context cuts speed roughly 15–25% from KV-cache pressure. A 128K context can run half the speed of 8K. If you don't need long context, don't load it — most chat workflows fit fine in 4K.
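The per-doubling rule of thumb compounds, which is how 8K to 128K ends up near half speed. This is a minimal sketch that assumes the same fractional drop applies at every doubling; real curves depend on hardware and attention implementation.

```python
import math


def relative_speed(base_ctx: int, target_ctx: int, drop_per_doubling: float) -> float:
    """Speed at target_ctx as a fraction of speed at base_ctx.

    Assumes each doubling of context removes the same fraction of speed.
    """
    doublings = math.log2(target_ctx / base_ctx)
    return (1.0 - drop_per_doubling) ** doublings


# 8K -> 128K is four doublings; try both ends of the 15-25% range.
optimistic = relative_speed(8192, 131072, 0.15)
pessimistic = relative_speed(8192, 131072, 0.25)

print(f"128K vs 8K: {pessimistic:.0%} to {optimistic:.0%} of the 8K speed")
```

At the 15% end of the range the compounded result is roughly half the 8K speed; at the 25% end it is closer to a third.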
QUANT
Quantization buys you speed AND VRAM
Going from FP16 to Q4_K_M gives you roughly 4× the speed at ¼ the VRAM, with ~99.5% of the quality. Q4_K_M is the community default for a reason. Only drop to Q3 if you're VRAM-starved.
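The VRAM side of that trade can be sketched from bits per weight. This treats Q4_K_M as a flat 4 bits per weight to match the ¼ figure above, which is a simplifying assumption: real Q4_K_M files are somewhat larger because of mixed-precision blocks and scale factors, and the estimate ignores KV cache and activations.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of model weights in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8.0


PARAMS_B = 8.0  # an 8B-parameter model, as in the estimator example

fp16_gb = weights_gb(PARAMS_B, 16)  # FP16: 16 bits per weight
q4_gb = weights_gb(PARAMS_B, 4)     # assumed 4 bits per weight for Q4_K_M

print(f"FP16 weights: {fp16_gb:.1f} GB")
print(f"Q4 weights:   {q4_gb:.1f} GB ({q4_gb / fp16_gb:.2f} of FP16)")
```

Because single-user token generation is typically limited by how fast the weights can be read from memory, a model that is a quarter of the size can generate up to about four times as fast, which is where the speed figure comes from.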

04 Sources & Methodology

Estimates derived from published llama.cpp benchmark threads, Ollama community testing, and HuggingFace model cards. Real-world performance varies ±20% depending on driver, batch settings, prompt length, and thermal headroom.

Benchmark Sources

Speed estimates assume single-user inference, Q4_K_M quantization, default context window, and a fully cached model. Multi-user serving (vLLM, TGI) achieves higher aggregate throughput across requests via batching, though each individual stream is not faster. Last calibrated: May 25, 2026.
