01 Side-by-side: Three Speeds

Same prompt, same response, three different speeds. The slow lane (CPU-only) is real life for anyone without a GPU. The fast lane (RTX 4090) is what enthusiasts pay for. Watch and feel the difference.

5 tok/s

Slow Lane

Laptop CPU · no GPU

Painful — slower than you read

25 tok/s

Sweet Spot

M2 / RTX 3060 · 7B Q4_K_M

Comfortable — matches reading pace

80 tok/s

Fast Lane

RTX 4090 · 7B Q4_K_M

Effortless — faster than you read

Where the thresholds are

Average reading speed is about 250 words/min, which works out to roughly 5–7 tokens/sec. Below that you're waiting on the model. Above ~30 tok/s text arrives far faster than you can read it, and any extra speed becomes invisible.
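To see where the 5–7 tok/s figure comes from, here is a minimal sketch of the conversion. The tokens-per-word ratios (1.2 and 1.7) are assumptions chosen for illustration, since the real ratio depends on the tokenizer and the text, and the bucket names in `describe_speed` are hypothetical labels rather than anything the simulator defines.

```python
def words_per_min_to_tokens_per_sec(wpm: float, tokens_per_word: float) -> float:
    """Convert a reading speed in words/min to an equivalent token rate."""
    return wpm * tokens_per_word / 60.0


def describe_speed(tok_per_sec: float,
                   reading_tps: float = 7.0,
                   invisible_tps: float = 30.0) -> str:
    """Bucket a generation speed using the two thresholds described above.

    reading_tps uses the upper end of the 5-7 tok/s reading range (a choice,
    not a measured value); invisible_tps is the ~30 tok/s mark.
    """
    if tok_per_sec < reading_tps:
        return "waiting on the model"
    if tok_per_sec <= invisible_tps:
        return "keeps up with reading"
    return "faster than you can read"


# Assumed tokens-per-word ratios; real values vary by tokenizer and language.
low = words_per_min_to_tokens_per_sec(250, tokens_per_word=1.2)
high = words_per_min_to_tokens_per_sec(250, tokens_per_word=1.7)
print(f"250 wpm is about {low:.1f} to {high:.1f} tok/s")

# The three lanes from the side-by-side demo.
for speed in (5, 25, 80):
    print(f"{speed} tok/s: {describe_speed(speed)}")
```

With these assumed ratios, 250 wpm lands between about 5 and 7 tok/s, and the three demo lanes fall into three different buckets.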

02 Estimate Your Hardware

Pick your hardware, model size, quantization, and context length. Estimates use published llama.cpp / Ollama community benchmarks at typical context lengths. Real-world numbers vary by about ±20%.

Inputs: Hardware · Model Size · Quantization · Context Length

Estimated Tokens / Second: 105

Effortless: faster than you read

RTX 4090 running Llama 3.1 8B at Q4_K_M, 8K context. Streaming feels instant — text arrives faster than you can read it. This is the consumer reference point.
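Since every estimate carries a ±20% spread, it helps to read the headline number as a range. The sketch below applies that tolerance to the 105 tok/s example; the function name `estimate_band` is hypothetical and not part of any tool mentioned here.

```python
def estimate_band(point_estimate_tps: float, tolerance: float = 0.20) -> tuple[float, float]:
    """Return the (low, high) tok/s range implied by a +/- tolerance."""
    if point_estimate_tps < 0:
        raise ValueError("point_estimate_tps must be non-negative")
    if not 0 <= tolerance < 1:
        raise ValueError("tolerance must be in [0, 1)")
    return point_estimate_tps * (1 - tolerance), point_estimate_tps * (1 + tolerance)


# The RTX 4090 / Llama 3.1 8B / Q4_K_M / 8K example above.
low, high = estimate_band(105)
print(f"105 tok/s estimated, so expect roughly {low:.0f} to {high:.0f} tok/s")
```

Even the low end of that range sits well above the ~30 tok/s mark, so the "effortless" label holds across the whole band.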

03 What Tokens / Second Actually Means

Speed isn't just "faster is better." Different workloads care about different things. Here's how to think about it.

CHAT

Reading speed = ~5–7 tok/s

The average adult reads ~250 wpm, which is roughly 5–7 LLM tokens/sec. Below that, you're staring at the screen waiting. Above ~30 tok/s the streaming becomes invisible: the response is finished long before you've read it.

RAG

RAG cares about TTFT, not throughput

For retrieval workflows, time-to-first-token (TTFT) matters more than tok/s. A 200-token answer at 30 tok/s feels worse than a 500-token answer at 80 tok/s if the shorter answer took 4 s to start. Watch prompt-eval speed too, because RAG prompts are long and prompt processing is what sets TTFT.
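The arithmetic behind that comparison is simple: total wait is TTFT plus generation time. In this sketch the 4 s TTFT comes from the example above, while the 0.5 s TTFT for the faster setup is an assumed value, since the example doesn't give one.

```python
def time_to_complete(ttft_s: float, answer_tokens: int, tok_per_sec: float) -> float:
    """Seconds until the full answer is on screen: TTFT plus generation time."""
    return ttft_s + answer_tokens / tok_per_sec


# 200-token answer at 30 tok/s that took 4 s to start (from the example above).
slow_start = time_to_complete(ttft_s=4.0, answer_tokens=200, tok_per_sec=30)

# 500-token answer at 80 tok/s; the 0.5 s TTFT is an assumption for illustration.
fast_start = time_to_complete(ttft_s=0.5, answer_tokens=500, tok_per_sec=80)

print(f"short answer, slow start: {slow_start:.2f} s")
print(f"long answer, fast start:  {fast_start:.2f} s")
```

Under these assumptions the longer answer finishes several seconds sooner, and the user sees its first token after half a second instead of four.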

CODE

Coding tools want headroom

Aider, Continue, and Cursor pipe responses through diff parsers and apply patches live. 80+ tok/s keeps the loop tight; under 20 tok/s, you're babysitting. Match your hardware to your workflow, not just your model.

BATCH

Batch jobs want total throughput

Summarizing 10,000 tickets overnight? You don't care about streaming; you care about tokens-per-hour. Batch inference engines (vLLM, TGI) can push aggregate throughput to 5–10× the single-stream speed by serving many requests in parallel.
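A back-of-the-envelope sketch of the overnight job, counting generated tokens only. The 150 tokens per summary and the 25 tok/s single-stream baseline are assumptions for illustration; prompt processing time is ignored, so real runs will take longer.

```python
def batch_hours(num_docs: int, tokens_per_doc: int, aggregate_tok_per_sec: float) -> float:
    """Hours needed to generate tokens_per_doc output tokens for each document."""
    total_tokens = num_docs * tokens_per_doc
    return total_tokens / aggregate_tok_per_sec / 3600


NUM_TICKETS = 10_000
TOKENS_PER_SUMMARY = 150   # assumed output length per ticket
PER_STREAM_TPS = 25        # assumed single-stream speed

single = batch_hours(NUM_TICKETS, TOKENS_PER_SUMMARY, PER_STREAM_TPS)
batched_5x = batch_hours(NUM_TICKETS, TOKENS_PER_SUMMARY, PER_STREAM_TPS * 5)
batched_10x = batch_hours(NUM_TICKETS, TOKENS_PER_SUMMARY, PER_STREAM_TPS * 10)

print(f"single stream: {single:.1f} h")
print(f"batched at 5x:  {batched_5x:.1f} h")
print(f"batched at 10x: {batched_10x:.1f} h")
```

With these numbers a single stream needs most of a day, while a 5–10× aggregate speedup brings the job comfortably inside one night.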

CTX

Long context taxes both ends

Doubling context cuts speed roughly 15–25% from KV-cache pressure. Going from 8K to 128K is four doublings, so a 128K context can run at half the speed of 8K or less. If you don't need long context, don't load it; most chat workflows fit fine in 4K.
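Here is a rough sketch of how a per-doubling slowdown adds up. It assumes the loss compounds multiplicatively at each doubling, which is a simplification chosen for illustration rather than a measured curve.

```python
import math


def relative_speed(context_tokens: int,
                   baseline_tokens: int = 8192,
                   loss_per_doubling: float = 0.15) -> float:
    """Speed relative to the baseline context, compounding a loss per doubling."""
    doublings = math.log2(context_tokens / baseline_tokens)
    return (1 - loss_per_doubling) ** doublings


# 8K -> 128K is four doublings.
mild = relative_speed(128 * 1024, loss_per_doubling=0.15)
harsh = relative_speed(128 * 1024, loss_per_doubling=0.25)

print(f"128K vs 8K at 15% per doubling: {mild:.2f}x")
print(f"128K vs 8K at 25% per doubling: {harsh:.2f}x")
```

Under this assumption, 128K runs at roughly 0.52× the 8K speed at the mild end and about 0.32× at the harsh end.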

QUANT

Quantization buys you speed AND VRAM

Going from FP16 to Q4_K_M gives roughly 4× the speed at about ¼ the VRAM, with ~99.5% of the quality. Q4_K_M is the community default for a reason. Only drop to Q3 if you're VRAM-starved.
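The VRAM side of that trade can be sketched from bits per weight alone. This counts model weights only and leaves out the KV cache and runtime overhead. The "about ¼" figure corresponds to a nominal 4 bits per weight; Q4_K_M mixes quantization types, so its effective size is somewhat higher, and the 4.85 used below is an assumed illustrative value rather than an official number.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate memory for model weights only, in GB (1 GB = 1e9 bytes)."""
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


fp16 = weight_memory_gb(7, bits_per_weight=16)
nominal_q4 = weight_memory_gb(7, bits_per_weight=4)       # the "one quarter" case
assumed_q4_k_m = weight_memory_gb(7, bits_per_weight=4.85)  # assumed effective size

print(f"7B at FP16:            {fp16:.2f} GB")
print(f"7B at nominal 4-bit:   {nominal_q4:.2f} GB")
print(f"7B at ~4.85 bits/wt:   {assumed_q4_k_m:.2f} GB")
print(f"FP16 / assumed Q4_K_M: {fp16 / assumed_q4_k_m:.1f}x smaller")
```

For a 7B model that is about 14 GB at FP16 against roughly 3.5 to 4.2 GB quantized, which is the difference between needing a 16 GB+ card and fitting on a 6 GB one.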

04 Sources & Methodology

Estimates derived from published llama.cpp benchmark threads, Ollama community testing, and HuggingFace model cards. Real-world performance varies ±20% depending on driver, batch settings, prompt length, and thermal headroom.

Benchmark Sources

llama.cpp GPU performance discussion (#4167) — community-maintained tok/s tables across CUDA, Metal, ROCm
llama.cpp wiki — official benchmark methodology, M-series Apple Silicon results
Ollama Library — model cards with reference hardware speeds
HuggingFace Hub — published quantization tradeoffs and inference speeds
r/LocalLLaMA benchmark threads — real-world community numbers across hardware

Speed estimates assume single-user inference, Q4_K_M quantization, default context window, and a fully cached model. Multi-user serving (vLLM, TGI) achieves higher aggregate throughput via batching, though each individual request typically streams no faster. Last calibrated: May 25, 2026.
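If you want to check these estimates against your own setup, one way is to time the token stream yourself. The sketch below works on any Python iterable of tokens; `fake_stream` is a hypothetical stand-in for a real model's streaming output, and the injectable `clock` parameter is there only to make the timing testable.

```python
import time
from typing import Callable, Iterable, Iterator


def measure_stream(tokens: Iterable[str],
                   clock: Callable[[], float] = time.perf_counter) -> dict:
    """Measure time-to-first-token and the decode rate after the first token."""
    start = clock()
    first = None
    last = start
    count = 0
    for _ in tokens:
        now = clock()
        if first is None:
            first = now          # arrival of the first token
        last = now
        count += 1
    if first is None:
        return {"tokens": 0, "ttft_s": None, "tok_per_sec": None}
    decode_time = last - first
    # Rate over the tokens that arrived after the first one.
    rate = (count - 1) / decode_time if count > 1 and decode_time > 0 else None
    return {"tokens": count, "ttft_s": first - start, "tok_per_sec": rate}


def fake_stream(n: int, delay_s: float) -> Iterator[str]:
    """Hypothetical stand-in for a model stream: n tokens at a fixed delay."""
    for i in range(n):
        time.sleep(delay_s)
        yield f"tok{i}"


print(measure_stream(fake_stream(20, 0.01)))
```

Keeping TTFT separate from the decode rate matters because a long prompt can make the first token slow even when the steady-state tok/s looks healthy.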

Token Speed Simulator
