Local AI Model Cheat Sheet

Q: How much VRAM do I need to run a local LLM?

It depends on model size and quantization. At Q4_K_M (the community default), 7B models need 5–7 GB, 14B models need 10–11 GB, 32B models need 20–22 GB, and 70B models need 45–50 GB. The rough formula is VRAM ≈ (Params_B × 0.5) + KV_overhead. A consumer GPU like the RTX 4090 (24 GB) can run 32B models, though the fit is tight and leaves limited room for long contexts.

Q: What is the best local LLM in 2026?

For 24GB GPUs: Qwen3 32B (Apache 2.0, dual-mode thinking, excellent multilingual/coding). For 16GB: Phi-4 14B (MIT, exceptional reasoning/math). For 8GB or less: Qwen3 8B or Llama 3.2 3B. For frontier quality: GLM-5 744B or DeepSeek V3.2 671B (datacenter hardware required).

Q: What is Q4_K_M quantization?

Q4_K_M is a quantization format that averages about 4.8 bits per weight, cutting model size by roughly 70% relative to FP16 while preserving about 99.5% of quality. It's the community default for running LLMs locally via Ollama and llama.cpp. Think of it like JPEG at 80% quality: barely distinguishable from the original but dramatically smaller.
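As a rough check of the arithmetic behind these numbers, the sketch below computes weight-only size from bits per weight. It assumes an FP16 baseline of 16 bits per weight and takes the 4.8-bit figure quoted above at face value; `weights_gb` and `size_reduction` are illustrative names, and real model files also carry some metadata.

```python
def weights_gb(params_b: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB (10^9 bytes) for params_b billion parameters."""
    return params_b * bits_per_weight / 8.0


def size_reduction(bits_per_weight: float, baseline_bits: float = 16.0) -> float:
    """Fraction of size saved relative to the baseline precision."""
    return 1.0 - bits_per_weight / baseline_bits


if __name__ == "__main__":
    for params_b in (7, 14, 32, 70):
        fp16 = weights_gb(params_b, 16.0)
        q4 = weights_gb(params_b, 4.8)
        print(f"{params_b}B: FP16 ~{fp16:.1f} GB, Q4_K_M ~{q4:.1f} GB")
    print(f"reduction vs FP16: {size_reduction(4.8):.0%}")
```

By this weight-only arithmetic, 4.8 bits per weight works out to roughly a 70% reduction from FP16; the rest of the VRAM in the figures above goes to the KV cache and runtime buffers.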

Q: Can I run AI models locally without internet?

Yes. Once downloaded, all models on this page run entirely offline with zero cloud dependency. Tools like Ollama, LM Studio, and llama.cpp handle local inference. This is ideal for air-gapped environments, HIPAA compliance, and privacy-sensitive workflows.
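To illustrate what offline inference can look like in practice, here is a minimal sketch that talks to a locally running Ollama server over HTTP using only the Python standard library. It assumes Ollama is listening on its default local address (`http://localhost:11434`) and that the model tag has already been pulled; the helper names are illustrative, not part of any tool.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumed default local address


def build_generate_request(model: str, prompt: str, url: str = OLLAMA_URL) -> urllib.request.Request:
    """Build a non-streaming generate request for a local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str) -> str:
    """Send the request to the local server and return the generated text."""
    with urllib.request.urlopen(build_generate_request(model, prompt)) as resp:
        return json.loads(resp.read().decode("utf-8"))["response"]


# Example (requires a running Ollama server with the model already pulled):
# print(generate("llama3.2:3b", "Summarize what quantization does in one sentence."))
```

Because the request targets localhost, it works with the network cable unplugged once the model has been downloaded.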

Models cataloged: 100

VRAM range: 0.3–1500 GB

Parameter range: 82M–744B


VRAM ≈ (Params_B × 0.5) + KV_overhead

Rule of thumb for Q4_K_M quantization at 8K context. Add 20–50% for 32K–128K contexts. MoE models load all params but only activate a fraction per token.
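The rule of thumb above can be sketched as a small function. This is an illustration under stated assumptions: the default `kv_overhead_gb` of 2.0 is a placeholder rather than a measured value, and the long-context surcharge is applied to the whole estimate, which is one possible reading of "add 20–50%".

```python
def estimate_vram_gb(params_b: float, kv_overhead_gb: float = 2.0, context_extra: float = 0.0) -> float:
    """Rule-of-thumb VRAM estimate for Q4_K_M: (params_b * 0.5) + KV overhead.

    context_extra is the fractional surcharge for long contexts
    (0.2 to 0.5 for 32K-128K, per the rule of thumb above).
    kv_overhead_gb defaults to an assumed placeholder of 2.0 GB.
    """
    if params_b <= 0 or kv_overhead_gb < 0 or context_extra < 0:
        raise ValueError("params_b must be positive; other inputs must be non-negative")
    base = params_b * 0.5 + kv_overhead_gb
    return base * (1.0 + context_extra)


if __name__ == "__main__":
    for name, params_b in [("8B", 8), ("14B", 14), ("32B", 32), ("70B", 70)]:
        short = estimate_vram_gb(params_b)
        long_ctx = estimate_vram_gb(params_b, context_extra=0.5)
        print(f"{name}: ~{short:.1f} GB at 8K, up to ~{long_ctx:.1f} GB at 128K")
```

For MoE models, pass the total parameter count rather than the active count, since all experts have to be loaded.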

Tier 1 — Entry Level

4–8 GB VRAM / CPU

Laptops, iGPUs, M-series Macs. Models up to ~8B. Basic chat, summaries, simple automation. ~30–80 t/s on 8B models, ~15–25 t/s on M-series Macs.

Tier 2 — Mid Range

12–24 GB VRAM

RTX 4070/4080/4090. Models 14B–32B. Strong coding, reasoning, agents. ~40–60 t/s on 14B, ~20–35 t/s on 32B. Sweet spot for power users.

Tier 3 — High End

32–80 GB VRAM

Multi-GPU or A100/H100. Models 70B+. Near-cloud quality. ~15–30 t/s on 70B, ~40+ t/s on A100. Production-grade local inference.

Tier 4 — Datacenter

80+ GB / Multi-Node

H100/H200 clusters. Full-scale MoE giants (200B–744B). Frontier-level capabilities. ~50–100+ t/s with tensor parallelism.
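The four tiers above amount to a lookup on available VRAM, sketched below. The tier headings leave gaps (for example between 8 and 12 GB); assigning those cards to the tier below is an assumption of this sketch, as is treating exactly 80 GB as Tier 3 (a single A100/H100).

```python
def hardware_tier(vram_gb: float) -> str:
    """Map available VRAM (GB) to the cheat sheet's hardware tiers."""
    if vram_gb < 0:
        raise ValueError("vram_gb must be non-negative")
    if vram_gb > 80:
        return "Tier 4: Datacenter"    # more than one 80 GB card's worth
    if vram_gb >= 32:
        return "Tier 3: High End"
    if vram_gb >= 12:
        return "Tier 2: Mid Range"
    return "Tier 1: Entry Level"       # includes CPU-only setups


if __name__ == "__main__":
    for gb in (8, 16, 24, 48, 80, 640):
        print(f"{gb} GB -> {hardware_tier(gb)}")
```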

Dedicated Guides

Deep dives, one click away

The hard stuff — install paths, quantization, complete stacks, weekly changelog — each lives on its own page.

Changelog · Live

What's new

Eight weeks of model releases and feature launches — Nemotron 3, GLM-5 744B, Music + 3D categories, Community Analytics integration.

8 entries · weekly cadence

Live Data

Community pulse

Real HuggingFace download counts and likes. What's trending, what's rising, what's discussed — refreshed every 6 hours.

From zero to running

Install Ollama, pull a model, add a UI, scale up. Plus the eight frontend tools worth knowing — Open WebUI, LM Studio, ComfyUI, Pinokio.

4 steps · 8 tools compared

Tech Article

What happens when you compress a model

The JPEG analogy. The seven-row compression table from FP16 down to IQ2_XS. The "what breaks first" sensitivity matrix. Three real-world tradeoff scenarios for RTX 4090 / 16 GB / Llama 4 Scout 109B.

7 quant levels · 6 task types · 3 verdicts

Recipes

Six pre-vetted stacks by hardware tier

Stop picking individual models — pick a complete stack. RTX 4090 power-user, 16 GB coding, 12 GB budget, Mac M-series, air-gapped compliance, ComfyUI creative production. Filter by tier, copy the stack, ship.

6 stacks · LLM + RAG + STT + TTS

Tiny Models — ≤ 8B Parameters

Meta — Llama

Llama 3.2 (1B / 3B)

1B / 3B · 1.5–3.6 GB · 128K ctx · Community · Tool Use

Meta's ultra-lightweight on-device models. Fast inference (50–80 t/s), solid instruction following and multilingual support. The 3B variant is a popular starting point for testing local AI setups.

Strengths

Extremely fast — runs on CPU-only setups
Strong instruction following for size
Massive fine-tune ecosystem (100K+ HF variants)
Good multilingual support

Weaknesses

Limited reasoning depth on complex tasks
Shallower knowledge than 7B+ peers
Can hallucinate on niche domains

Best For

Always-on chatbots and agents
Mobile / edge / low-power devices
Quick local automation scripts
Testing and prototyping pipelines

VRAM Breakdown

1B Q4: ~1.5–2 GB VRAM
3B Q4: ~2.3 GB VRAM
CPU-only: 8+ GB RAM recommended

Ollama

ollama pull llama3.2:1b
ollama pull llama3.2:3b

Google — Gemma

Gemma 3 (1B / 4B)

1B / 4B · 2–4 GB · 128K ctx · Vision · Gemma · Tool Use

Google's highly efficient small model family trained for maximum quality-per-parameter. The 4B variant includes vision support and handles 140+ languages. Power-efficient enough for mobile and embedded deployments.

Strengths

Exceptional efficiency — punches above its weight
Multilingual support (140+ languages)
Vision-ready in 4B+ variants
Very power-efficient for edge/mobile

Weaknesses

Lower benchmark scores on heavy reasoning vs denser peers
Limited coding depth compared to Qwen/Phi

Best For

On-device classification and summaries
Multilingual text processing
Quick image understanding tasks
Edge deployments and IoT

VRAM Breakdown

1B Q4: ~2 GB
4B Q4: ~4 GB
Runs well on Apple M-series via MLX

Ollama

ollama pull gemma3:1b
ollama pull gemma3:4b

Alibaba — Qwen

Qwen3 (0.6B / 1.7B / 4B)

0.6B–4B · 1–4 GB · 32K ctx · Apache 2.0 · Tool Use

Alibaba's smallest Qwen3 variants bring dual-mode thinking (fast vs. chain-of-thought) even to tiny form factors. Excellent multilingual coverage (100+ languages) and surprisingly strong coding/math ability for their size.

Strengths

Outstanding multilingual (100+ languages)
Dual-mode: fast inference or thinking mode
Strong coding/math for size
Apache 2.0 — fully permissive license

Weaknesses

Very small variants hallucinate on niche topics
Limited context window vs. larger siblings

Ollama

ollama pull qwen3:0.6b
ollama pull qwen3:4b

Alibaba — Qwen

Qwen3.5 (4B / 9B / 27B)

4B–27B · 3–18 GB · 262K ctx · Apache 2.0 · Tool Use

Next-gen Qwen with hybrid Gated DeltaNet + Attention architecture, native vision-language fusion, and 262K native context (extensible to 1M). Supports 201 languages. Surpasses Qwen3 across reasoning, coding, and visual understanding.

Strengths

262K native context, extensible to 1M tokens
Native multimodal — vision + text fused
201 languages supported
Apache 2.0 — fully commercial

Weaknesses

Very new — limited community fine-tunes
Requires newer inference engines for hybrid attention

Ollama

ollama pull qwen3.5:4b
ollama pull qwen3.5:27b

Microsoft — Phi

Phi-4-mini (3.8B)

3.8B · ~3 GB · 128K ctx · MIT · Tool Use

Microsoft's tiny powerhouse, specifically tuned for reasoning and math. Often beats 7–13B models on STEM benchmarks despite its compact size. Supports 128K context and function calling.

Strengths

Exceptional reasoning/math — beats larger models
128K context window
Very fast inference
Function calling support

Weaknesses

Narrower general knowledge
Less creative writing ability

Ollama

ollama pull phi4-mini

Alibaba — Qwen

Qwen3 (8B)

8B · 6–7 GB · 128K ctx · Apache 2.0 · Tool Use

The community's top all-around 8B model. Leads benchmarks in multilingual, coding, and long-context tasks for its size class. Dual-mode thinking enables both quick responses and deep chain-of-thought reasoning.

Strengths

Leading multilingual / coding / long-context for 8B
Dual-mode thinking (fast + deep)
Beats many larger models on benchmarks
Excellent community daily driver

Weaknesses

Occasional inconsistency in fast mode
Knowledge depth limited vs 14B+

Ollama

ollama pull qwen3:8b

Meta — Llama

Llama 3.1 (8B)

8B · ~6.2 GB · 128K ctx · Community · Tool Use

Meta's ecosystem king at the 8B tier. The most fine-tuned open model in history with vast community support. Strong generalist with 128K context and tool-calling capabilities. Ideal as a fine-tuning base.

Strengths

Massive fine-tune ecosystem
Strong generalist — tool-calling, agents
128K context window
Excellent fine-tuning base

Weaknesses

Safety alignment can refuse creative prompts
Slightly behind Qwen3 8B on benchmarks

Ollama

ollama pull llama3.1:8b

Mistral AI

Mistral 7B / Nemo 12B

7B / 12B · 5–9 GB · 128K ctx · Apache 2.0

Mistral's foundational models — Apache 2.0 licensed with outstanding instruction-following, creativity, and European-language performance. Nemo 12B (built with NVIDIA) brings 128K context in a very efficient package.

Strengths

Superb instruction following and creativity
Excellent European language performance
Fully permissive Apache 2.0 license
Very efficient inference

Weaknesses

Older base — less competitive on 2026 reasoning benchmarks
Smaller models limited on very complex tasks

Ollama

ollama pull mistral
ollama pull mistral-nemo

DeepSeek

DeepSeek-R1 Distilled (7B / 8B)

7B / 8B · 5–7 GB · 64K ctx · MIT

Distilled versions of DeepSeek's R1 reasoning model, bringing reinforcement-learning "thinking" mode to tiny form factors. The thinking traces enable complex problem-solving that rivals much larger dense models.

Strengths

RL-based thinking mode — rivals o1-level reasoning
Excellent coding/math for size
MIT license — fully permissive

Weaknesses

Occasionally less coherent without thinking mode
Slower when thinking traces are long

Ollama

ollama pull deepseek-r1:8b

Xiaomi — MiMo

MiMo-7B-RL

7B · 5–7 GB · 48K ctx · MIT

Xiaomi's 7B reasoning model that punches far above its weight. Trained on 25T tokens with RL, it matches OpenAI o1-mini on math and code tasks, outperforming many 32B models. Multi-Token Prediction enables ~90% speculative decoding acceptance.

Strengths

Matches o1-mini — MATH-500: 95.8%, AIME 2024: 68.2%
Outperforms 32B models in reasoning
MTP speculative decoding — very fast
MIT license — fully permissive

Weaknesses

Requires trust_remote_code for deployment
48K context — smaller than competitors

Ollama

ollama pull mimo:7b

Hugging Face

SmolLM2 (135M / 360M / 1.7B)

135M–1.7B · 0.3–1.5 GB · 8K ctx · Apache 2.0 · Tool Use

Hugging Face's ultra-compact language models designed for extreme edge deployment. The smallest viable LLMs for on-device inference where every megabyte counts.

Strengths

Incredibly small — runs on anything
Sub-1GB models available
Fast training/fine-tuning

Weaknesses

Very limited capability — basic tasks only
High hallucination rate

Ollama

ollama pull smollm2:1.7b

Hugging Face

SmolLM3 (3B)

3B · ~3 GB (Q4) · 128K ctx · Apache 2.0

Hugging Face's fully transparent 3B reasoning model. Outperforms Llama 3.2 3B and Qwen2.5 3B across 12 benchmarks. Dual-mode thinking, NoPE architecture, trained on 11.2T tokens. Full training recipe published.

Strengths

Outperforms Llama 3.2 3B and Qwen2.5 3B
Dual-mode reasoning (think/no_think)
128K context via YaRN extrapolation
Fully open — weights + training recipe

Weaknesses

Falls behind Qwen3 4B on math tasks
6 languages only (EN, FR, ES, DE, IT, PT)

Ollama

ollama pull smollm3:3b

Google — Gemma

Gemma 3n (2B / 4B)

2B / 4B · ~2–4 GB · Apache 2.0 · Vision

Google's edge-optimized multimodal model with vision and audio understanding. Built for on-device deployment via MediaPipe, enabling efficient inference on mobile and edge hardware without cloud dependency.

Strengths

Multimodal — vision + audio in tiny package
MediaPipe optimized for mobile/edge
Runs on phones and low-end hardware
Apache 2.0 — fully open

Weaknesses

Limited general reasoning at 2–4B scale
Narrower language coverage than larger Gemma

Ollama

ollama pull gemma3n

BigCode / Hugging Face

StarCoder2 (3B / 7B)

3B / 7B · ~3–5 GB · 16K ctx · BigCode OpenRAIL-M

Dedicated code completion model trained on The Stack v2. Supports 600+ programming languages with fill-in-the-middle capability. Lightweight enough for real-time IDE integration on consumer hardware.

Strengths

600+ language support — widest code coverage
Fill-in-the-middle for IDE completion
Lightweight — real-time on CPU
Trained on ethically-sourced code (The Stack v2)

Weaknesses

Code-only — no general chat ability
16K context (smaller than competitors)

Ollama

ollama pull starcoder2:3b

Small Models — 9–14B Parameters

Microsoft — Phi

Phi-4 (14B)

14B · ~11 GB · 128K ctx · MIT · Tool Use

Microsoft's reasoning champion at 14B parameters. Tops charts for math and reasoning in its size class (84%+ MMLU). A compact powerhouse that fits comfortably on an RTX 4090 with room to spare for context.

Strengths

Exceptional reasoning/math — tops 14B class
84%+ MMLU, strong GPQA scores
128K context, fast inference
MIT license — fully open

Weaknesses

Less creative / broad general knowledge than Llama/Qwen
Narrower training data focus

Best For

STEM tasks, math tutoring
Research assistants on mid-range GPUs
Technical document analysis
Code review and generation

VRAM Breakdown

Q4: ~11 GB — fits RTX 4060 Ti 16GB
Q8: ~16 GB — fits RTX 4090
FP16: ~28 GB

Ollama

ollama pull phi4:14b
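The Phi-4 VRAM breakdown can be turned into a quick "which quantization fits my card" check. The sketch below uses the three figures from that breakdown; the 2 GB headroom for context and runtime buffers is an assumed safety margin, and `best_quant` is an illustrative name.

```python
# VRAM figures copied from the Phi-4 breakdown, ordered from highest to lowest precision.
PHI4_VRAM_GB = [("FP16", 28.0), ("Q8", 16.0), ("Q4", 11.0)]


def best_quant(vram_gb: float, options=PHI4_VRAM_GB, headroom_gb: float = 2.0):
    """Return the highest-precision option that fits with headroom, or None if nothing fits.

    headroom_gb is an assumed margin for context and runtime buffers.
    """
    for name, needed_gb in options:
        if needed_gb + headroom_gb <= vram_gb:
            return name
    return None


if __name__ == "__main__":
    for card_gb in (8, 16, 24, 32):
        print(f"{card_gb} GB card -> {best_quant(card_gb)}")
```

With these inputs a 16 GB card lands on Q4 and a 24 GB card on Q8, which matches the pairings in the breakdown.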

Alibaba — Qwen

Qwen3 (14B)

14B · ~10.7 GB · 128K ctx · Apache 2.0

Alibaba's 14B dense model with dual-mode thinking. Excels at multilingual tasks, coding, and long-context processing. A top community recommendation for users with 16GB VRAM GPUs.

Strengths

Top-tier multilingual and coding for 14B class
Dual-mode: fast + thinking
128K context window
Apache 2.0 license

Weaknesses

Occasional fast-mode inconsistency
Slightly below Phi-4 on pure math

Ollama

ollama pull qwen3:14b

DeepSeek

DeepSeek-R1 (14B distilled)

14B · ~11 GB · 64K ctx · MIT · Tool Use

The 14B distillation of DeepSeek-R1, bringing near-SOTA reasoning and coding ability to a single consumer GPU. Thinking mode enables complex multi-step problem solving.

Strengths

Near-SOTA coding and reasoning for size
RL-trained thinking traces
Excellent for agentic workflows

Weaknesses

Distilled — some quality loss vs full R1
Can be verbose in thinking mode

Ollama

ollama pull deepseek-r1:14b

NVIDIA — Nemotron

Nemotron Nano 12B v2 VL

12B · ~9 GB · 128K ctx · Vision+Video · NVIDIA Open · Tool Use

NVIDIA's multimodal reasoning model designed for document intelligence, video understanding, and visual Q&A. Hybrid Transformer-Mamba architecture combines accuracy with memory efficiency.

Strengths

Multi-image and video understanding
Leading OCR and document intelligence
Hybrid architecture — efficient memory
Reasoning mode toggle via system prompt

Weaknesses

Reasoning not supported for video inputs
Newer — smaller community ecosystem

Ollama

ollama pull nemotron-nano:12b

Mistral AI

Mistral Small 3.1 (24B)

24B · ~15 GB · 128K ctx · Apache 2.0 · Tool Use

Updated with multimodal vision support and improved text performance. Outperforms Gemma 3 27B and GPT-4o Mini on most benchmarks while hitting 150 tok/s inference. Runs on a single RTX 4090 or 32GB Mac.

Strengths

Vision + text multimodal in one model
Outperforms Gemma 3 and GPT-4o Mini
150 tok/s — very fast inference
Apache 2.0 — commercial friendly

Weaknesses

Heavier than 14B models — needs 16GB+ VRAM
Not as strong on pure math as Phi-4

Ollama

ollama pull mistral-small3.1

IBM

Granite 4.0 (350M → 32B MoE)

350M–32B · 1B–9B active · 0.5–20 GB · 128K ctx · Apache 2.0 · Tool Use

IBM's enterprise-grade model family with a novel hybrid Mamba-2 architecture for faster inference and lower memory. Spans edge to server: Nano (350M/1B), Micro (3B), Tiny (7B MoE/1B active), Small (32B MoE/9B active). Trained on 15T tokens with strong tool calling and 12-language support.

Strengths

Hybrid Mamba-2 arch — faster inference, lower memory
Edge to server in one family (350M to 32B)
Strong instruction following and tool calling
Apache 2.0 — clean enterprise licensing

Weaknesses

Mamba-2 backend support still maturing in some tools
Smaller community than Llama/Qwen ecosystems
MoE variants need compatible inference stacks

Ollama

ollama pull granite4
ollama pull granite4:small-h

MiniMax

MiniMax M2.5 (45.9B MoE / 8.6B active)

45.9B total · 8.6B active · ~12 GB (Q4) · 128K ctx · MoE · Apache 2.0

MiniMax's dual-mode thinking model with MoE efficiency. Only 8.6B active parameters keep VRAM low while the full 45.9B architecture delivers strong multilingual reasoning. Competitive with much larger dense models on general benchmarks.

Strengths

Dual-mode thinking — fast and deep reasoning
MoE efficiency — only 8.6B active params
Strong multilingual performance
128K context, Apache 2.0

Weaknesses

Smaller community than Qwen/Llama ecosystems
Fewer fine-tunes and adapters available

Ollama

ollama pull minimax-m2.5

Cohere — Command

Command A (111B MoE / ~28B active)

111B total · ~28B active · ~25 GB (Q4) · 256K ctx · MoE · CC-BY-NC · Tool Use

Cohere's enterprise-focused MoE model built for RAG and agentic workflows. 256K context with strong grounded generation and citation support. MoE architecture keeps active params at ~28B for efficient inference on a single 48GB GPU.

Strengths

256K context — massive document processing
Enterprise RAG with inline citations
MoE — fits on single 48GB GPU (Q4)
Strong agentic tool-use capabilities

Weaknesses

CC-BY-NC — no commercial use without agreement
25GB+ VRAM even with MoE efficiency
Smaller fine-tune ecosystem

Ollama

ollama pull command-a

Medium Models — 15–35B Parameters

OpenAI — GPT-OSS

GPT-OSS (20B)

20B · ~12 GB · 128K ctx · Open · Tool Use

OpenAI's open-weight model, designed with GPT-style structured reasoning and tool-use capabilities. Agent-friendly with strong structured output generation. Runs well quantized on 24GB cards.

Strengths

GPT-like structured reasoning and tool-use
Agent-friendly design with function calling
Runs well quantized on consumer GPUs
Strong structured outputs

Weaknesses

Newer — smaller community than Llama
128K context limit (no 1M option)

Ollama

ollama pull gpt-oss:20b

Mistral AI

Devstral Small 2 (24B)

24B · ~15 GB · 128K ctx · Apache 2.0

Mistral's dedicated coding model optimized for software engineering and agentic coding tasks. Top SWE-bench scores in its size class with strong multi-file code understanding.

Strengths

68% SWE-bench — exceptional coding
Strong agentic task completion
Multi-file code understanding

Weaknesses

Coding-focused — less general purpose
Not ideal for creative or chat tasks

Ollama

ollama pull devstral-small

Google — Gemma

Gemma 3 (27B)

27B · ~22.5 GB · 128K ctx · Vision · Gemma · Tool Use

Google's strong all-rounder with multimodal vision capabilities. Efficient for its size with good performance across reasoning, coding, and multilingual tasks. Fits on a 24GB GPU at Q4.

Strengths

Well-balanced across all benchmarks
Built-in vision support
Efficient for size, good throughput
140+ language support

Weaknesses

Tight fit on 24GB at full context
Not absolute SOTA on any single benchmark

Ollama

ollama pull gemma3:27b

Zhipu AI (Z.ai) — GLM

GLM-4.7-Flash (30B MoE / 3B active)

30B total · 3B active · ~20 GB · 200K ctx · MoE · Open

The efficiency king for coding on consumer hardware. A 30B MoE model that only activates 3B parameters per token, delivering 59.2% SWE-bench at 60–80 t/s. Community calls it "best 70B-or-less model" for UI generation and tool calling. Interleaved thinking between actions.

Strengths

59.2% SWE-bench — top for local coding
60–80 t/s at 4-bit — very fast
Interleaved + preserved thinking modes
Excellent tool-use and agentic capabilities
Runs on RTX 3090/4090 and Mac M-series

Weaknesses

Chat template issues with some runtimes
Needs 24GB+ VRAM for good experience
Smaller community than Qwen/Llama

Ollama

ollama pull glm4:latest

llama.cpp recommended; use the --jinja flag.
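The MoE trade-off in this entry can be made concrete with a little arithmetic. The sketch below applies the cheat sheet's Q4 rule of thumb (0.5 GB per billion parameters) to the total parameter count, since every expert must be resident in memory, and reports the active fraction that drives per-token compute. The function name is illustrative, and KV-cache overhead is deliberately left out.

```python
def moe_profile(total_b: float, active_b: float, gb_per_b_params: float = 0.5) -> dict:
    """Weight memory scales with total params; per-token compute scales with active params."""
    if not 0 < active_b <= total_b:
        raise ValueError("need 0 < active_b <= total_b")
    return {
        "weights_gb": total_b * gb_per_b_params,  # every expert is loaded
        "active_fraction": active_b / total_b,    # share of params used per token
    }


if __name__ == "__main__":
    p = moe_profile(total_b=30, active_b=3)
    print(f"weights ~{p['weights_gb']:.0f} GB at Q4, "
          f"{p['active_fraction']:.0%} of params active per token")
```

For a 30B-total, 3B-active model this gives about 15 GB of weights, in line with the ~20 GB listed once KV cache and buffers are added, while only a tenth of the parameters do work on each token.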

NVIDIA — Nemotron

Nemotron 3 Nano (30B-A3B)

31.6B total · 3.6B active · ~20 GB · 1M ctx · Hybrid MoE · NVIDIA Open · Tool Use

NVIDIA's hybrid Mamba-Transformer MoE model designed for agentic AI. 1M token context window with 4x faster inference than predecessors. Activates only 3.6B of 31.6B parameters. 91% Math 500 score — the highest among peers. Open weights, training data, AND recipes.

Strengths

91% Math Index — top in class
1M token context window
3.3x faster throughput than Qwen3-30B
Hybrid Mamba-Transformer architecture
Fully open: weights + data + recipes

Weaknesses

Automatic CPU offloading can reduce speed
Context cliff at high token counts
Newer Mamba architecture — less battle-tested

Ollama

ollama pull nemotron-nano

NVIDIA — Nemotron

Nemotron 3 Super (120B-A12B)

120B MoE · ~64 GB · 1M ctx · NVIDIA Open · Tool Use

NVIDIA's hybrid Mamba-Transformer MoE model with 120.6B total params but only 12.7B active per forward pass. 7x higher throughput than previous gen. Supports multi-token prediction for faster generation. Built for agentic reasoning and collaborative multi-agent workflows. 1M token context via linear attention.

Strengths

7x throughput improvement over previous gen
1M token context window
Multi-token prediction for faster inference
Excellent agentic reasoning and tool use

Weaknesses

Requires 64GB+ VRAM/RAM at minimum
New architecture may have limited tooling support initially
Enterprise-focused — less community fine-tuning

Best For

Multi-agent orchestration
Enterprise agentic AI workflows
IT ticket automation and complex reasoning
High-throughput production inference

VRAM Breakdown

Q4: ~40 GB (12B active params)
Q8: ~64 GB
FP16: ~256 GB (multi-GPU)
Runs on Mac Studio M4 Ultra 192GB

Ollama

ollama pull nemotron-3-super

NVIDIA — Nemotron

Nemotron 3 Omni

MoE · ~24 GB · 1M ctx · Vision · NVIDIA Open

NVIDIA's omni-understanding model for AI agents that need natural conversations, complex reasoning, and advanced visual capabilities. Analyzes video content, documents, and images. Built on the Nemotron 3 architecture with multimodal extensions.

Strengths

Video and document understanding
Natural conversational abilities
Complex visual reasoning
1M token context for long documents

Weaknesses

New release — limited community benchmarks
NVIDIA GPU optimization bias
Less tested on consumer hardware

Best For

Document and video analysis agents
Visual question answering
Multimodal enterprise workflows
Real-time conversational AI with vision

VRAM Breakdown

Q4: ~16-24 GB (estimated)
Optimized for NVIDIA GPUs
vLLM and TensorRT-LLM support

NVIDIA

pip install nemo-toolkit[all]

Alibaba — Qwen

Qwen3 (32B)

32B · ~22 GB · 128K ctx · Apache 2.0

The community's top recommendation for 24GB GPUs. Exceptional multilingual, coding, and long-context stability: it maintains 33–53 t/s at 48K tokens with zero offloading. The long-context stability leader in its class.

Strengths

Best long-context stability — 100% GPU at 48K
Exceptional multilingual and coding
Dual-mode thinking
Apache 2.0 license

Weaknesses

Tight fit on 24GB — context limited
Dense model — heavier than MoE alternatives

Ollama

ollama pull qwen3:32b

DeepSeek

DeepSeek-R1 (32B)

32B · ~22 GB · 64K ctx · MIT

The 32B distillation of DeepSeek-R1, delivering near-SOTA coding, math, and reasoning with thinking mode at a size that fits on a single 24GB GPU. The go-to for developer tools and agentic workflows.

Strengths

Near-SOTA reasoning with thinking traces
Fits on a single 24GB GPU at Q4
Excellent for agentic/developer workflows
MIT license

Weaknesses

May require vLLM patches for optimal speed
Verbose thinking traces eat context

Ollama

ollama pull deepseek-r1:32b

Alibaba — Qwen

QwQ (32B)

32B · ~22 GB · 128K ctx · Apache 2.0 · Tool Use

Qwen's dedicated reasoning model, trained specifically for deep chain-of-thought problem solving. Excels at complex multi-step math and logic puzzles where extended reasoning is needed.

Strengths

Purpose-built for deep reasoning
Exceptional on math competitions (AIME)
128K context for long reasoning chains

Weaknesses

Not a generalist — specialized for reasoning
Very verbose reasoning traces

Ollama

ollama pull qwq

Allen AI (Ai2) — OLMo

OLMo 3.1 (7B / 32B)

7B / 32B · 5–20 GB · 65K ctx · Apache 2.0

The only truly open-source model family — not just open weights, but full training code, Dolma 3 dataset (6T tokens), intermediate checkpoints, reward models, and OlmoTrace for auditing outputs back to training data. Think variant competitive with Qwen3 32B on reasoning.

Strengths

Fully open: code, data, checkpoints, logs — auditable end-to-end
Think variant strong on math, code, and reasoning
OlmoTrace lets you trace outputs to training data
Apache 2.0 — clean for enterprise and compliance

Weaknesses

Slightly behind Qwen3 on general chat tasks
Smaller community ecosystem than Llama/Qwen
65K context (vs 128K+ on competitors)

Best For

Compliance environments requiring full auditability
Research where training data provenance matters
Reasoning tasks (Think variant)
Organizations that need Apache 2.0 + full transparency

VRAM Breakdown

7B Q4: ~5 GB VRAM
32B Q4: ~18–20 GB VRAM
32B Q8: ~34 GB VRAM

Ollama

ollama pull olmo-3.1
ollama pull olmo-3.1:7b

Alibaba — Qwen

Qwen3.5 (32B)

32B · ~20 GB (Q4) · 128K ctx · Apache 2.0 · Tool Use

The latest generation of Alibaba's Qwen series, surpassing Qwen3 on reasoning, coding, and multilingual benchmarks. A strong all-rounder that fits on a single RTX 4090 at Q4 quantization.

Strengths

Surpasses Qwen3 32B across benchmarks
Strong reasoning and coding performance
Fits on 24GB GPU at Q4
Apache 2.0 — fully commercial

Weaknesses

Newer — fewer fine-tunes than Qwen3
Tight fit on 24GB cards with long contexts

Ollama

ollama pull qwen3.5:32b

TII — Falcon

Falcon 3 (7B / 10B)

7B / 10B · ~8–10 GB · Apache 2.0

TII's latest Falcon generation delivering state-of-the-art results for its size class. Strong multilingual capabilities with efficient inference. A solid contender in the 7–10B parameter range.

Strengths

State-of-the-art for 7–10B size class
Strong multilingual performance
Efficient inference on consumer GPUs
Apache 2.0 — fully open

Weaknesses

Smaller ecosystem than Llama/Qwen
Fewer community fine-tunes available

Ollama

ollama pull falcon3:7b
ollama pull falcon3:10b

Large Models — 36–80B Parameters

Meta — Llama

Llama 3.3 (70B)

70B · 45–50 GB · 128K ctx · Community · Tool Use

Near-405B performance in a 70B package. The most versatile large dense model with the biggest ecosystem of fine-tunes, adapters, and tooling. Production-grade quality for virtually any general task.

Strengths

Near-405B performance on many tasks
Massive ecosystem — thousands of fine-tunes
Versatile generalist with strong tool-calling
Production-grade quality

Weaknesses

45–50 GB VRAM — needs multi-GPU or offloading
Safety alignment can feel restrictive for creative tasks

Ollama

ollama pull llama3.3:70b

Alibaba — Qwen

Qwen3 (72B)

72B · ~50 GB · 128K ctx · Apache 2.0

Alibaba's flagship dense model. Frontier-level multilingual and coding performance with the full dual-mode thinking system. Apache 2.0 licensed for commercial use.

Strengths

Frontier-level multilingual/coding/reasoning
Dual-mode thinking system
Apache 2.0 — fully commercial

Weaknesses

~50 GB VRAM — multi-GPU needed
HF approval may be required

Ollama

ollama pull qwen3:72b

DeepSeek

DeepSeek-R1 (70B)

70B · ~45 GB · 128K ctx · MIT · Tool Use

The 70B distillation of DeepSeek-R1, delivering top-tier reasoning and coding with thinking traces. High-stakes problem solving with MIT licensing.

Strengths

Top reasoning and coding at 70B tier
Thinking mode for complex problems
MIT license — fully permissive

Weaknesses

45+ GB VRAM requirement
Verbose thinking traces

Ollama

ollama pull deepseek-r1:70b

Frontier Models — 80B+ Parameters

Meta — Llama 4

Llama 4 Scout (109B MoE / 17B active)

109B total · 17B active · 55+ GB (Q4) · 10M ctx · MoE · 16 experts · Vision · Llama 4 · Tool Use

Meta's revolutionary MoE model with an industry-leading 10M token context window. Only 17B parameters active per token across 16 experts, enabling near-70B quality at a fraction of the inference cost. Natively multimodal with text and image processing. Ultra-quantized versions (1.78-bit) fit on a single 24GB GPU.

Strengths

10M token context — industry leading
MoE: 70B+ quality at 17B inference cost
Natively multimodal (text + vision)
1.78-bit quant fits on 24GB (~20 t/s)
Fits on single H100 at INT4

Weaknesses

Full weights need 216GB VRAM
Initial reviews show room for improvement vs hype
Extreme quant degrades quality noticeably

Ollama

ollama pull llama4-scout

OpenAI — GPT-OSS

GPT-OSS (120B MoE / 5B active)

120B total · 5B active · 60+ GB · 128K ctx · MoE · Open · Tool Use

OpenAI's largest open model with 62.7% SWE-bench. MoE architecture activates only 5B of 120B parameters per token. Strong GPT-style reasoning with advanced tool-use and structured outputs.

Strengths

62.7% SWE-bench — strong coding
GPT-style reasoning quality
Advanced tool-use and agents

Weaknesses

Needs 60+ GB VRAM
Newer ecosystem — fewer fine-tunes

Ollama

ollama pull gpt-oss:120b

Alibaba — Qwen

Qwen3 (235B MoE / 22B active)

235B total · 22B active · 55+ GB · 128K ctx · MoE · Apache 2.0 · Tool Use

Near-frontier multilingual and long-context performance via Mixture of Experts. Quality Index of 57 — among the top open models. Activates 22B of 235B parameters per token for efficient inference.

Strengths

Near-frontier quality (Quality Index: 57)
22B active params — efficient MoE
Exceptional multilingual/long-context
Apache 2.0 license

Weaknesses

55+ GB VRAM minimum
Multi-GPU needed for full power

Ollama

ollama pull qwen3:235b

Alibaba — Qwen

Qwen3.5 (397B MoE / 17B active)

397B total · 17B active · 230+ GB · 262K ctx · MoE · 512 experts · Apache 2.0 · Tool Use

Qwen3.5 flagship — hybrid Gated DeltaNet + sparse MoE with 512 experts. Native vision-language fusion, 262K context (extensible to 1M), 201 languages. 3.5M downloads in 5 weeks. MMLU-Pro 87.8%, LiveCodeBench v6 83.6%, MMMU 85.0%.

Strengths

Frontier performance — 87.8% MMLU-Pro, 83.6% LiveCodeBench
Native multimodal — text + vision fused
262K–1M context, 201 languages
Apache 2.0 — fully commercial

Weaknesses

230+ GB — multi-GPU cluster required (8-way TP)
Needs SGLang/vLLM — no simple Ollama yet

SGLang / vLLM

python -m sglang.launch_server --model-path Qwen/Qwen3.5-397B-A17B --tp-size 8

Meta — Llama 4

Llama 4 Maverick (400B MoE / 17B active)

400B total · 17B active · 200+ GB (Q4) · 1M ctx · MoE · 128 experts · Vision · Llama 4 · Tool Use

Meta's highest-performance open model. 128 experts with 1M context window. Competes with GPT-4o class models. Natively multimodal with text and image understanding. Requires multi-GPU setup for production use.

Strengths

GPT-4o class performance
128 experts, 1M context
Natively multimodal
Co-distilled from Llama Behemoth

Weaknesses

200+ GB VRAM — requires 4+ H100s at Q4
1.78-bit quant needs 2x48GB (~40 t/s)
Extremely resource intensive

Ollama

ollama pull llama4-maverick
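The multi-GPU requirement in this entry follows from simple division. The sketch below counts how many cards are needed to hold a given footprint; rounding up to a power of two is an assumption reflecting a common tensor-parallel constraint, and the calculation ignores KV cache and activation headroom, so treat the result as a lower bound.

```python
import math


def gpus_needed(model_gb: float, gpu_gb: float = 80.0, power_of_two: bool = True) -> int:
    """Smallest GPU count whose combined memory holds model_gb.

    power_of_two rounds up to 1, 2, 4, 8, ... which is an assumed
    constraint typical of tensor-parallel setups.
    """
    if model_gb <= 0 or gpu_gb <= 0:
        raise ValueError("sizes must be positive")
    count = math.ceil(model_gb / gpu_gb)
    if power_of_two:
        count = 1 << (count - 1).bit_length()
    return count


if __name__ == "__main__":
    print(gpus_needed(200))                      # 200 GB footprint on 80 GB cards
    print(gpus_needed(200, power_of_two=False))  # raw count without rounding
```

For a 200 GB footprint on 80 GB cards the raw count is 3, which rounds up to 4 under the power-of-two assumption.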

Zhipu AI (Z.ai) — GLM

GLM-4.7 Full (355B MoE / 32B active)

355B total · 32B active · 205+ GB · 200K ctx · MoE · Open · Tool Use

Zhipu's flagship MoE model with interleaved thinking, preserved thinking, and turn-level thinking. 73.8% SWE-bench, 66.7% SWE-bench Multilingual. Exceptional for agentic coding and complex multi-step tasks.

Strengths

73.8% SWE-bench — top-tier coding
Advanced thinking modes (interleaved, preserved)
Strong multilingual agentic coding
Competitive with Claude 3.5 Sonnet for coding

Weaknesses

205+ GB minimum — needs multi-GPU cluster
2-bit GGUF needs 135GB + 128GB RAM

llama.cpp / vLLM recommended

ollama pull glm4.7

Xiaomi — MiMo

MiMo-V2-Flash (309B MoE / 15B active)

309B total · 15B active · 180+ GB · 256K ctx · MoE · MIT

Xiaomi's breakout frontier model that shocked the community — mistaken for DeepSeek V4 on OpenRouter. Hybrid SWA/Global attention with 5:1 ratio slashes KV-cache 6x. Multi-Token Prediction triples generation speed. 73.4% SWE-bench, 94.1% AIME 2025.

Strengths

73.4% SWE-bench, 94.1% AIME 2025
Only 15B active — extremely efficient MoE
6x KV-cache reduction via hybrid attention
MIT license — fully permissive

Weaknesses

Multi-GPU required (8-way TP recommended)
Needs SGLang or KTransformers — no simple Ollama

SGLang / KTransformers

python3 -m sglang.launch_server --model-path XiaomiMiMo/MiMo-V2-Flash --tp-size 8

DeepSeek

DeepSeek V3.2 / R1 (671B MoE / 37B active)

671B total · 37B active · 300+ GB · 128K ctx · MoE · MIT · Tool Use

DeepSeek's full-scale MoE model — one of the most capable open models ever released. Rivals GPT-5 class on coding/reasoning benchmarks. MLA architecture provides extreme KV cache efficiency. Quality Index: 57.

Strengths

Near-frontier on all benchmarks
MLA architecture — extreme KV cache efficiency
37B active params — efficient despite size
MIT license — fully permissive

Weaknesses

Datacenter-scale hardware required
Requires patched vLLM for optimal speed
300+ GB VRAM minimum

vLLM / SGLang recommended for production; Ollama: ollama pull deepseek-v3.2-exp

Zhipu AI (Z.ai) — GLM

GLM-5 (744B MoE / 40B active)

744B total 40B active 1.5 TB (FP16) 200K ctx MoE MIT Tool Use

The largest open-weight model as of Feb 2026. 744B parameters (40B active) trained on 28.5T tokens. 86% GPQA-Diamond (graduate-level reasoning), 90% HumanEval. DeepSeek Sparse Attention for long-context efficiency. 2-bit GGUF: 241GB — fits 256GB unified memory Mac.

Strengths

86% GPQA-Diamond — exceptional reasoning
90% HumanEval — top coding
Quality Index: 49.64 — #1 open source
MIT license — fully open
2-bit GGUF fits 256GB Mac

Weaknesses

FP16 needs ~1.5 TB VRAM, more than a single 8x H200 node provides (8 × 141 GB ≈ 1.1 TB)
2-bit quant: ~5 t/s with offloading
Consumer-unfriendly at full quality

llama.cpp / vLLM (8xH200+ for full quality); Ollama: ollama pull glm5
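
The FP16 and 2-bit sizes quoted for GLM-5 can be sanity-checked with parameter-count arithmetic. This is a rough sketch that counts weight bytes only (no KV cache or activations) and uses decimal gigabytes; the helper names are illustrative.

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def effective_bits(params_billion: float, file_gb: float) -> float:
    """Average bits per weight implied by a quantized file size."""
    return file_gb * 8 / params_billion


if __name__ == "__main__":
    print(f"FP16 weights: {weight_gb(744, 16):.0f} GB")
    print(f"241 GB file: {effective_bits(744, 241):.2f} bits/weight")
```

A "2-bit" file that averages somewhat more than 2 bits per weight is plausible, since low-bit GGUF quantizations typically keep some tensors at higher precision.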

Mistral AI

Mistral Large 3 (675B MoE)

675B total 300+ GB 128K ctx MoE Apache 2.0 Tool Use

Mistral's largest model to date and one of the strongest open-weight choices for advanced reasoning and high-end self-hosted assistants.

Strengths

Premium reasoning and creative quality
Strong European language support
Advanced instruction following

Weaknesses

Datacenter-scale hardware required

vLLM / SGLang ollama pull mistral-large

Specialty — Coding, Vision, Embedding

Alibaba — Qwen

Qwen3-Coder (480B MoE / 35B active)

480B total 35B active 200+ GB 256K ctx MoE Apache 2.0

Alibaba's dedicated agentic coding model with massive 480B MoE architecture. 55.4% SWE-bench. Optimized for large-scale code generation and software engineering tasks.

HF / vLLM ollama pull qwen3-coder

DeepSeek

DeepSeek Coder V2 (16B / 236B)

16B / 236B 10–50+ GB 128K ctx MIT Tool Use

Purpose-built for code generation with MoE efficiency. Excellent for code completion, generation, and review. The 16B version fits on consumer GPUs.

Ollama ollama pull deepseek-coder-v2

Moonshot AI — Kimi

Kimi K2.5

MoE Varies 262K ctx Open

Moonshot's thinking-focused model with systematic reasoning for research and planning tasks. Exceptional multi-step reasoning with 262K context window. Strong math competition performance.

Ollama / vLLM ollama pull kimi-k2.5

Cohere — Command R

Command R+ (104B)

104B ~60 GB (Q4) 128K ctx CC-BY-NC Tool Use

Purpose-built for RAG and multi-step tool use. Generates grounded responses with inline citations. Highly efficient multilingual tokenizer (10+ languages) means lower cost per token for non-English content. Outperforms GPT-4 Turbo on tool-use benchmarks.

Strengths

Best-in-class RAG — grounded generation with citations
Zero-shot multi-step tool use
Efficient tokenizer cuts cost for multilingual workloads
Strong enterprise workflow integration

Weaknesses

CC-BY-NC license — no commercial use without Cohere agreement
104B needs 60+ GB VRAM (Q4) — multi-GPU or 80GB cards
Older dense architecture (not MoE): every token uses all 104B parameters, so generation is slower than a comparably sized MoE

Ollama ollama pull command-r-plus

Mistral — Codestral

Codestral 25.05 (25B)

25B ~16 GB (Q4) 256K ctx Non-commercial Tool Use

Mistral's dedicated coding model with 256K context and 80+ programming language support. Built for agentic coding workflows with strong code generation, review, and refactoring capabilities.

Strengths

256K context — handles entire codebases
80+ programming languages
Strong agentic coding capabilities
Fits on 24GB GPU at Q4

Weaknesses

Non-commercial license — research/personal only
Code-focused — weaker on general tasks

Ollama ollama pull codestral

Alibaba — Qwen

Qwen3-Coder (32B Dense)

32B ~20 GB (Q4) 256K ctx Apache 2.0 Tool Use

The consumer-friendly dense variant of Qwen3-Coder. All 32B parameters active (no MoE), delivering strong agentic coding on a single RTX 4090. Excellent for software engineering tasks and IDE integration.

Strengths

Dense 32B — no MoE complexity
256K context for large codebases
Strong agentic coding performance
Apache 2.0 — fully commercial

Weaknesses

Tight fit on 24GB GPUs with long context
Code-focused — use Qwen3.5 for general tasks

Ollama ollama pull qwen3-coder:32b

Various — Vision Models

LLaVA / Qwen2.5-VL / InternVL2.5 / Llama 3.2-Vision

7B–72B +2–5 GB overhead 128K ctx Vision

Vision-language models that add image understanding to base LLMs. LLaVA pioneered the approach. Qwen2.5-VL leads benchmarks with document OCR and video understanding. InternVL2.5 (Shanghai AI Lab) is the top open-source vision model for complex reasoning over images. Llama 3.2-Vision rounds out Meta's multimodal offering.

Top Picks

Qwen2.5-VL: Best all-around — OCR, video, charts, documents
InternVL2.5: Strongest visual reasoning and multi-image
Llama 3.2-Vision: Best Meta ecosystem integration
LLaVA: Lightweight, great for experimentation

Notes

Add 2–5 GB VRAM overhead on top of base model
Larger vision models (72B) need 48 GB+ VRAM
Quality varies heavily by size — 7B vision ≠ 72B vision

Ollama: ollama pull llava; ollama pull llama3.2-vision

Hugging Face

SmolVLM (256M / 500M / 2B)

256M–2B <1 GB Apache 2.0 Vision

Ultra-tiny vision-language models that run on as little as 1GB VRAM. Capable of document OCR, image captioning, and visual question answering at a fraction of the cost of larger VLMs. Perfect for edge deployment.

Strengths

Runs on 1GB VRAM — smallest viable VLM
Document OCR and image understanding
Multiple sizes for flexibility
Apache 2.0 — fully open

Weaknesses

Limited reasoning at tiny scale
Lower accuracy than larger VLMs

pip pip install transformers

Ai2 (Allen AI)

Molmo (7B / 72B)

7B / 72B 5–50 GB Apache 2.0 Vision

Ai2's fully open vision model with unique pointing and grounding capabilities. Can identify and locate objects in images with spatial coordinates. Fully open — weights, data, and code all available under Apache 2.0.

Strengths

Pointing/grounding — locates objects in images
Fully open: weights, data, and code
7B version runs on consumer GPUs
Apache 2.0 — fully commercial

Weaknesses

72B version needs datacenter hardware
Smaller ecosystem than Qwen2.5-VL

pip pip install transformers

Various — Embeddings

nomic-embed-text / bge-m3 / snowflake-arctic-embed

137M–335M <1 GB 8K tokens Apache 2.0

Lightweight embedding models for RAG pipelines, semantic search, and memory systems. Run alongside any LLM with negligible VRAM overhead. Essential for building retrieval-augmented generation systems.

Ollama: ollama pull nomic-embed-text; ollama pull bge-m3

Speech-to-Text (STT / ASR)

OpenAI — Whisper

Whisper Large V3 / V3 Turbo

1.55B / 809M 6–10 GB 99+ languages MIT

The gold standard for multilingual speech recognition. 99+ languages, automatic language detection, phrase-level timestamps, and punctuation. V3 Turbo cuts decoder layers from 32 to 4, delivering 6x faster inference with only 1–2% accuracy loss. 7.4% WER average on mixed benchmarks.

Strengths

99+ language support — best multilingual STT
Automatic language identification
Phrase-level timestamps
Turbo variant: 6x faster, 809M params
Handles noise and accents well

Weaknesses

Large V3 needs ~10 GB VRAM
Not streaming-native (batch-oriented)
Can hallucinate on silence or music

pip: pip install openai-whisper (or pip install faster-whisper for the CTranslate2 backend)
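
As a minimal sketch of putting Whisper's timestamps to use: the `openai-whisper` package returns a `segments` list whose entries carry `start`, `end`, and `text` fields, and these can be rendered as SRT subtitles. The helper names are hypothetical, and `transcribe_to_srt` only runs with the package, ffmpeg, and downloaded weights available.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments) -> str:
    """Render dicts with 'start', 'end', 'text' keys as SRT text."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        span = f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}"
        blocks.append(f"{index}\n{span}\n{seg['text'].strip()}\n")
    return "\n".join(blocks)


def transcribe_to_srt(audio_path: str, model_name: str = "large-v3") -> str:
    """Transcribe locally with openai-whisper (imported lazily)."""
    import whisper  # pip install openai-whisper

    model = whisper.load_model(model_name)
    result = model.transcribe(audio_path)
    return segments_to_srt(result["segments"])
```

Keeping the formatting helpers separate from the model call means they can be reused with any transcriber that yields start, end, and text per segment.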

Hugging Face

Distil-Whisper

756M ~3 GB MIT

6x faster than Whisper Large V3 with only 1% WER degradation. Distilled from Whisper for low-latency transcription. Ideal for real-time applications on consumer hardware.

pip pip install transformers accelerate
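
The WER figures quoted in this section are word error rates: the word-level edit distance between a reference transcript and the model output, divided by the number of reference words. A minimal sketch follows; real evaluations normally normalize case and punctuation first, whereas this version only splits on whitespace.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Levenshtein distance over words, computed one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.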

NVIDIA — NeMo

Parakeet TDT (0.6B / 1.1B)

0.6B / 1.1B 2–4 GB CC-BY-4.0

NVIDIA's speed-optimized ASR model. RTFx near 2,000x — processes audio dramatically faster than Whisper. RNN-Transducer architecture enables streaming recognition with minimal latency. Trained on 65,000 hours of English audio.

Strengths

Among the fastest open ASR models
Streaming-capable — real-time transcription
65K hours of training data
Optimized for NVIDIA GPUs

Weaknesses

English-only
Ranks lower on pure accuracy vs Whisper
Speed-optimized — accuracy tradeoff

pip pip install "nemo_toolkit[asr]"
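
RTFx is the inverse real-time factor: seconds of audio divided by seconds of processing time. The sketch below shows what a figure near 2,000x means in practice; the function names are illustrative.

```python
def rtfx(audio_seconds: float, processing_seconds: float) -> float:
    """Inverse real-time factor: how many times faster than real time."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


def processing_time(audio_seconds: float, rtfx_value: float) -> float:
    """Seconds needed to transcribe `audio_seconds` at a given RTFx."""
    if rtfx_value <= 0:
        raise ValueError("rtfx_value must be positive")
    return audio_seconds / rtfx_value


if __name__ == "__main__":
    # One hour of audio at RTFx 2000.
    print(processing_time(3600, 2000), "seconds")
```

Published RTFx numbers are usually measured with batched inference on a specific GPU, so single-stream throughput on other hardware can differ substantially.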

NVIDIA / IBM

Canary Qwen 2.5B / Granite Speech 3.3 8B

2.5B / 8B 4–10 GB Various

Top-accuracy English STT models. Canary Qwen combines speech recognition with the Qwen language model for superior contextual understanding. IBM Granite Speech brings enterprise-grade accuracy. Best for strict accuracy requirements.

NeMo / HF Transformers pip install "nemo_toolkit[asr]"

Useful Sensors

Moonshine (Tiny / Base)

27M / 61M <0.5 GB MIT

Ultra-lightweight edge ASR model. Outperforms Whisper Tiny and Small despite being significantly smaller. Designed for smartphones, IoT, and offline environments where every MB counts.

pip pip install moonshine

Useful Sensors

Moonshine Gen 2 (0.3B / 0.5B)

0.3B / 0.5B <0.5 GB MIT

Second-generation edge ASR with streaming real-time transcription. Improved accuracy over the original Moonshine while maintaining ultra-low latency. Ideal for always-on voice interfaces and IoT devices.

Strengths

Streaming real-time ASR — sub-100ms latency
Edge-optimized — runs on microcontrollers
Improved accuracy over Moonshine v1
MIT license

Weaknesses

English-focused
Lower accuracy than Whisper on complex audio

pip pip install moonshine

OpenAI — Whisper

Whisper Large V4 (1.5B)

1.5B ~10 GB 99+ languages MIT

The latest iteration of OpenAI's Whisper, improving accuracy over V3 across multilingual benchmarks. Retains full 99+ language support with better handling of accents, noise, and domain-specific terminology.

Strengths

Improved accuracy over Whisper V3
99+ languages — best multilingual coverage
Better noise and accent handling
MIT license — fully open

Weaknesses

Needs ~10 GB VRAM
Not streaming-native (batch-oriented)
Still hallucination-prone on silence

pip pip install openai-whisper

Text-to-Speech (TTS / Voice Synthesis)

Canopy Labs

Orpheus TTS (3B)

3B ~4 GB Apache 2.0

The breakthrough TTS model of late 2025. Human-like emotional speech that rivals ElevenLabs — laughing, crying, whispering on command. Real-time on modern GPUs. State-of-the-art naturalness with emotional control, completely free and local.

Strengths

State-of-the-art naturalness — rivals ElevenLabs
Emotional control (laughing, crying, whispering)
Real-time on modern GPUs
Apache 2.0 — fully commercial

Weaknesses

English-focused (multilingual expanding)
Needs GPU for real-time speeds
Newer — smaller community than Piper

pip pip install orpheus-tts

Kokoro

Kokoro-82M

82M <1 GB Apache 2.0

The breakout star of 2026 local TTS. Only 82M parameters yet delivers neural-quality speech with breathing and natural pauses. Runs on CPU, Apple Silicon, or any GPU. Shockingly good for its tiny size.

Strengths

82M params — runs on anything
Neural quality (breathing, pausing)
CPU and Apple Silicon capable
Near-zero VRAM requirement

Weaknesses

Limited voice cloning ability
Fewer emotional controls than Orpheus

pip pip install kokoro

Fish Audio

Fish Speech V1.5 (S1-mini)

~500M 2–4 GB Apache 2.0

The go-to open-source model for voice cloning across languages. Handles code-switching (e.g., Spanglish) better than most paid APIs. Strong multilingual voice synthesis with zero-shot cloning from short reference audio.

Strengths

Best open-source voice cloning
Excellent cross-language code-switching
Zero-shot cloning from ~10s audio
Strong multilingual support

Weaknesses

Higher VRAM than Kokoro/Piper
Quality varies by language

pip pip install fish-speech

Coqui AI

XTTS v2 / Coqui TTS

~450M 2–4 GB 17 languages MPL-2.0

High-quality multilingual TTS with voice cloning from a 6-second reference. 17 languages supported. The broadest toolkit in open-source TTS with pre-trained voices, fine-tuning, and extensive documentation. Runs well on MacBook Air with 16GB.

Strengths

17 language support out-of-box
Voice cloning from 6-second reference
1100+ pre-trained voices
Extensive documentation and community

Weaknesses

Higher latency than Piper/MeloTTS
MPL license (some restrictions)

pip pip install TTS

Open Home Foundation

Piper TTS

Various (~15M–60M) <0.5 GB 40+ languages MIT

Ultra-fast, ultra-lightweight neural TTS designed for offline and embedded use. Sub-second latency, 40+ languages, runs on Raspberry Pi. The default TTS for Home Assistant. Doesn't clone voices but offers many pre-trained speakers.

Strengths

Fastest open TTS — sub-second latency
Runs on Raspberry Pi / embedded
40+ languages, 100+ voices
Home Assistant integration

Weaknesses

No voice cloning
Less natural than Orpheus/XTTS
Fixed voice catalog

pip pip install piper-tts

Suno AI

Bark

~900M 4–6 GB MIT

Not just TTS — Bark generates music, sound effects, and non-verbal vocalizations (laughs, sighs, throat clears). The most creative and fun model in the list. Can generate background music and ambient sounds alongside speech.

Strengths

Speech + music + sound effects
Non-verbal vocalizations
Creative and expressive
MIT license

Weaknesses

High latency — not real-time
Unpredictable output quality
GPU-heavy for best results

pip pip install git+https://github.com/suno-ai/bark.git

MyShell

MeloTTS

~200M <1 GB 5 languages MIT

Lightweight and remarkably consistent TTS. Maintains low latency even with long texts — processes short texts in under a second. 5 languages with natural prosody. Ideal for low-resource devices and consistent production use.

pip pip install melotts

Resemble AI

Chatterbox TTS (0.4B)

0.4B ~2 GB MIT

Zero-shot voice cloning TTS with emotional control. Clone any voice from a short audio sample and generate speech with adjustable emotion, pitch, and speaking style. Compact and fast on consumer hardware.

Strengths

Zero-shot voice cloning from short samples
Emotional control — adjust tone and style
Compact 0.4B — fast inference
MIT license — fully open

Weaknesses

English-focused
Cloned voice quality depends on input sample

pip pip install chatterbox-tts

Nari Labs

Dia TTS (1.6B)

1.6B ~3 GB Apache 2.0

Multi-speaker dialogue TTS that generates natural conversations between multiple speakers with integrated sound effects. Perfect for audiobook production, podcast generation, and interactive storytelling.

Strengths

Multi-speaker dialogue generation
Integrated sound effects
Natural conversational flow
Apache 2.0 — fully commercial

Weaknesses

Larger than single-speaker TTS models
Sound effects limited to trained set

pip pip install dia-tts

Amphion

MaskGCT (0.4B)

0.4B ~2 GB MIT

Non-autoregressive TTS using masked generative codec transformers for fast zero-shot multilingual speech synthesis. Generates speech in parallel rather than sequentially, enabling significantly faster inference.

Strengths

Non-autoregressive — very fast generation
Zero-shot multilingual voice cloning
Compact 0.4B parameters
MIT license

Weaknesses

Newer — smaller community
Non-AR can sacrifice some prosody quality

pip pip install amphion

Embedding, Search & Retrieval

Nomic AI

Nomic Embed Text V2

475M (305M active) <1 GB 8K tokens MoE Apache 2.0

First MoE embedding model. Trained on 1.6B multilingual pairs across 100+ languages. Supports flexible output dimensions (256–768) via Matryoshka learning. Competitive with models twice its size on BEIR and MIRACL benchmarks. 86.2% top-5 accuracy.

Strengths

MoE — efficient with 305M active params
100+ languages, 100+ code languages
Flexible dimensions (256–768)
Top BEIR/MIRACL scores for size

Weaknesses

Accuracy can drop on noisy or mixed-language data
Needs prefix prompts for optimal results

Ollama ollama pull nomic-embed-text
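
Two points from this entry can be sketched in plain Python: prefix prompts and Matryoshka-style dimension reduction. The prefix strings below are the ones commonly documented for Nomic's embedding models, but treat them as an assumption and check the model card. The truncation helper keeps the first `dim` components and re-normalizes, which is the usual way a Matryoshka-trained vector is shortened.

```python
import math

# Assumed task prefixes; verify against the model card before relying on them.
PREFIXES = {"query": "search_query: ", "document": "search_document: "}


def add_prefix(text: str, kind: str) -> str:
    """Prepend the task prefix expected for queries or documents."""
    return PREFIXES[kind] + text


def truncate_embedding(vector, dim: int):
    """Keep the first `dim` components, then L2-normalize the result."""
    if not 0 < dim <= len(vector):
        raise ValueError("dim must be between 1 and len(vector)")
    head = list(vector[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in head]


if __name__ == "__main__":
    print(add_prefix("how much VRAM for a 32B model?", "query"))
    print(truncate_embedding([3.0, 4.0, 12.0], 2))
```

Queries and documents must be truncated to the same dimension before they are compared.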

BAAI (Beijing Academy)

BGE-M3

~570M <1 GB 8K tokens MIT

The Swiss army knife of embedding models. M3 = Multi-functionality (dense + sparse + ColBERT retrieval), Multi-linguality (100+ languages), Multi-granularity (up to 8K tokens). SOTA on MIRACL and MKQA. The first model to unify all three retrieval methods.

Strengths

Dense + sparse + ColBERT in one model
100+ languages — best cross-lingual
8K token input length
SOTA on multilingual benchmarks

Weaknesses

Slightly slower than single-mode models
Requires prompt engineering for best results

Ollama ollama pull bge-m3

Alibaba — Qwen

Qwen3-Embedding (0.6B / 4B / 8B)

0.6B–8B 0.5–7 GB 32K tokens Apache 2.0

Alibaba's instruction-aware embedding models built on Qwen3. Support user-defined task instructions for 1–5% accuracy improvement. Flexible output dimensions (32–1024). 100+ natural and programming languages. The 4B and 8B variants outperform most competitors.

pip pip install sentence-transformers

Snowflake

Arctic Embed (Various sizes)

22M–335M <0.5 GB 512 tokens Apache 2.0

Snowflake's optimized embedding suite. Multiple sizes from 22M to 335M for different accuracy/speed tradeoffs. Strong English retrieval performance with focus on enterprise search workloads.

Ollama ollama pull snowflake-arctic-embed

BAAI (Beijing Academy)

BGE Reranker v2-M3 / Gemma-based

~570M–9B 0.5–8 GB 8K tokens Apache 2.0

The most popular open-source rerankers for RAG pipelines. Cross-encoder architecture processes query + document together for precise relevance scoring. Adds 100–500ms latency but significantly improves retrieval quality. Run after initial embedding search to rerank top-K candidates.

Strengths

Dramatically improves RAG accuracy
Runs on consumer hardware
Multiple sizes for speed/accuracy tradeoff
Apache 2.0 — no licensing fees

Weaknesses

Adds 100–500ms latency per query
Cross-encoder — can't precompute
Larger variants need significant VRAM

pip pip install FlagEmbedding
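
The two-stage pattern described for rerankers (cheap retrieval first, expensive cross-encoder scoring on the top-K only) can be sketched with pluggable scoring functions. Everything here is illustrative: `fast_score` stands in for an embedding similarity, `slow_score` for a cross-encoder reranker, and `word_overlap` is a toy scorer used only for the demo.

```python
from typing import Callable, List, Sequence, Tuple

Scorer = Callable[[str, str], float]


def retrieve_then_rerank(query: str, documents: Sequence[str],
                         fast_score: Scorer, slow_score: Scorer,
                         top_k: int = 20, top_n: int = 5) -> List[Tuple[str, float]]:
    """Rank all documents cheaply, then rescore only the top_k."""
    # Stage 1: cheap score over the whole collection.
    candidates = sorted(documents, key=lambda d: fast_score(query, d),
                        reverse=True)[:top_k]
    # Stage 2: expensive score on the shortlist only.
    rescored = [(doc, slow_score(query, doc)) for doc in candidates]
    rescored.sort(key=lambda pair: pair[1], reverse=True)
    return rescored[:top_n]


def word_overlap(query: str, doc: str) -> float:
    """Toy stand-in for embedding similarity: count shared words."""
    return float(len(set(query.lower().split()) & set(doc.lower().split())))


if __name__ == "__main__":
    docs = ["cats purr", "dogs bark loudly", "cats and dogs", "fish swim"]
    # Toy slow scorer that prefers shorter documents.
    print(retrieve_then_rerank("cats dogs", docs, word_overlap,
                               lambda q, d: -float(len(d)), top_k=3, top_n=2))
```

The added latency scales with `top_k`, since the slow scorer runs once per shortlisted document, which is why the shortlist is kept small.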

Stanford / AnswerAI

ColBERT v2 (ColBERTv2.0)

~110M <1 GB 512 tokens MIT

Late-interaction retrieval model using token-level embeddings for scalable BERT-based search in milliseconds. Superior to traditional single-vector embeddings while maintaining speed. Uses MaxSim operator for efficient contextual matching across large datasets.

pip pip install colbert-ai
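
The MaxSim operator mentioned above scores a document by taking, for each query token embedding, its best match among the document's token embeddings, and summing those maxima. A minimal pure-Python sketch, assuming the vectors are already normalized so that a dot product acts as the similarity:

```python
from typing import Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_vecs: Sequence[Vector], doc_vecs: Sequence[Vector]) -> float:
    """Late-interaction score: sum over query tokens of the max similarity."""
    if not doc_vecs:
        raise ValueError("document must have at least one token vector")
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


if __name__ == "__main__":
    query = [[1.0, 0.0], [0.0, 1.0]]
    document = [[1.0, 0.0], [0.5, 0.5]]
    print(maxsim_score(query, document))
```

Because document token vectors do not depend on the query, they can be computed and indexed ahead of time, unlike cross-encoder scores.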

Jina AI

Jina Embeddings v3 (0.6B)

0.6B <1 GB 8K tokens Apache 2.0

Task-specific embedding model using LoRA adapters for retrieval, classification, separation, and text-matching tasks. Automatically selects the optimal adapter based on query intent, improving accuracy across diverse embedding use cases.

Strengths

Task-specific LoRA adapters — optimized per use case
8K context — handles longer documents
Strong multilingual support
Apache 2.0 — fully open

Weaknesses

Larger than simpler embedding models
LoRA switching adds minor complexity

pip pip install jina-embeddings

Image Generation (Diffusion Models)

Stability AI

Stable Diffusion 1.5

860M 4–6 GB 512×512 native CreativeML Open

The model that started the local image generation revolution. Still viable in 2026 for low-VRAM setups. The largest ecosystem of LoRAs, checkpoints, and fine-tunes. Thousands of community models on CivitAI. Best entry point for learning diffusion workflows.

Strengths

Runs on 4 GB VRAM — most accessible
Largest LoRA/checkpoint ecosystem ever
Fastest generation times
Mature tooling (A1111, ComfyUI)

Weaknesses

512×512 native — needs upscaling
Weaker prompt adherence vs newer models
Poor text rendering in images

ComfyUI / A1111 pip install diffusers transformers accelerate

Stability AI

Stable Diffusion XL (SDXL 1.0)

3.5B (UNet) 8–12 GB 1024×1024 native CreativeML Open

Among the most widely used open-source image models. 1024×1024 native with dramatically better prompt adherence, photorealism, and composition than SD 1.5. Deep community ecosystem: thousands of fine-tuned checkpoints (Juggernaut XL, RealVis, DreamShaper) and LoRAs on CivitAI.

Strengths

1024×1024 native resolution
Very large fine-tune ecosystem (Juggernaut, RealVis, etc.)
Excellent photorealism with right checkpoint
ControlNet, inpainting, upscaling support
Commercial use allowed

Weaknesses

8 GB minimum, 12 GB recommended
Refiner adds complexity and VRAM
Text rendering still inconsistent

Variants SDXL 1.0 · SDXL Turbo · SDXL Lightning

SDXL Lightning: 1–4 step generation. SDXL Turbo: real-time. Use ComfyUI or Forge WebUI.

Stability AI

Stable Diffusion 3.5 (Medium / Large / Turbo)

2.5B–8B 12–24 GB 1024×1024 Stability Community

Stability's latest architecture with improved text rendering inside images and better prompt understanding. Uses MMDiT (Multi-Modal Diffusion Transformer). Better typography than SDXL but smaller community fine-tune ecosystem. Medium variant fits 12 GB, Large needs 24 GB.

Strengths

Improved text rendering in images
Better prompt fidelity than SDXL
Multiple size variants
Compatible with existing SD tooling

Weaknesses

12–24 GB VRAM requirement
Much smaller LoRA/fine-tune ecosystem
Community license — not fully open

ComfyUI pip install diffusers transformers

Black Forest Labs

FLUX.1 (Schnell / Dev)

12B 8–24 GB 1024×1024+ Apache 2.0 (Schnell) / NC (Dev)

The next generation from the original Stable Diffusion creators. 12B parameter DiT architecture delivering Midjourney-level quality locally. Best text rendering of any open model. Schnell = 4-step fast generation (Apache 2.0 commercial). Dev = 28–35 step high quality (non-commercial). GGUF quants run on 6–8 GB.

Strengths

Midjourney-level quality — locally
Best text rendering in open models
Schnell: 4 steps, Apache 2.0 commercial
GGUF/NF4 quants: 6–8 GB VRAM
Excellent anatomy and photorealism

Weaknesses

Full FP16: needs 24 GB VRAM
Dev license is non-commercial
Smaller fine-tune ecosystem than SDXL
No A1111 support — ComfyUI or Forge only

ComfyUI git clone https://github.com/comfyanonymous/ComfyUI.git

NF4 quant: 8 GB VRAM, slight detail loss. FP8: 16 GB, near-lossless. Full BF16: 24 GB. LoRA training needs 24 GB+.

Black Forest Labs

FLUX.2 (Dev / Pro / Flex / Klein)

4B–32B 13–24 GB 1024×1024+ Various

Released November 2025. Production-grade successor to FLUX.1 with state-of-the-art quality. Klein (4B) fits 13–16 GB. Dev (32B) is open-weight. Pro rivals top proprietary models. Flex variant gives fine-grained control over generation parameters.

Strengths

State-of-the-art open image quality
Klein 4B variant fits consumer GPUs
Exceptional prompt fidelity
Flex: developer-friendly parameter control

Weaknesses

Dev 32B needs significant VRAM
Pro is API-only
Commercial licensing required for some variants

ComfyUI / Diffusers pip install diffusers transformers

Alibaba — Tongyi-MAI

Z-Image-Turbo

6B ≤16 GB 1024×1024+ Apache 2.0

Sub-second inference on enterprise GPUs, comfortable on 16 GB consumer cards. Matches or exceeds FLUX.2 Dev and HunyuanImage 3.0 on benchmarks at a fraction of the compute. Standout bilingual text rendering (English + Chinese) with high clarity. Apache 2.0 — fully commercial.

Strengths

Sub-second generation — fastest quality model
Fits 16 GB consumer GPUs
Best bilingual text rendering (EN/CN)
Apache 2.0 — fully commercial
Beats models 10x its size

Weaknesses

New — small community ecosystem
Fewer LoRAs/fine-tunes available

Diffusers / ComfyUI pip install diffusers

Stability AI

Stable Cascade

~5B (staged) 8–16 GB 1024×1024 Stability Community

Three-stage architecture (Stage A/B/C) that compresses the latent space dramatically — 40% less VRAM than SDXL for comparable quality. Much improved text rendering inside images. Faster training with fewer data requirements.

Diffusers pip install diffusers

PixArt (Alpha Lab)

PixArt-α / PixArt-Σ

600M 4–8 GB 1024×1024+ Apache 2.0

Remarkably efficient DiT model — only 600M params yet competitive with SDXL. PixArt-Σ supports 4K resolution generation. Trained with 90% less compute than SD. Excellent for resource-constrained setups that still need high-quality output.

Diffusers pip install diffusers

Tencent — Hunyuan

Hunyuan Image (1.5B DiT)

1.5B 8–12 GB 1024×1024+ Apache 2.0

Tencent's bilingual (English/Chinese) DiT image generator with high-quality output. Strong prompt adherence and aesthetic quality competitive with FLUX at a smaller parameter count.

Strengths

Bilingual EN/CN prompt support
High aesthetic quality
Smaller than FLUX — faster inference
Apache 2.0 — fully open

Weaknesses

Smaller LoRA ecosystem than SD/FLUX
Less community tooling support

Diffusers pip install diffusers

NVIDIA / MIT

Sana (1.6B)

1.6B 6–10 GB Up to 4096×4096 Apache 2.0

Efficient DiT model capable of generating up to 4K resolution images with very fast generation times. Uses linear attention for efficient scaling to high resolutions without the quadratic memory cost of standard transformers.

Strengths

Up to 4K resolution generation
Very fast — efficient linear attention
Compact 1.6B parameters
Apache 2.0 — fully open

Weaknesses

Newer — limited community fine-tunes
Less refined than FLUX on complex prompts

Diffusers pip install diffusers

Various

ESRGAN / Real-ESRGAN / 4x-UltraSharp

~16M 1–2 GB BSD / Apache

AI upscaling models essential for any image generation pipeline. Real-ESRGAN handles 4x upscaling with face enhancement. 4x-UltraSharp is the community favorite for sharp detail. Tiny VRAM footprint — pairs with any generation model.

pip pip install realesrgan

Video Generation (Text/Image-to-Video)

Alibaba — Wan

Wan 2.1 / 2.2 (1.3B / 5B / 14B / 27B MoE)

1.3B–27B 8–24+ GB Apache 2.0

The most versatile open-source video model suite. Text-to-video, image-to-video, video editing, and video-to-audio. The 1.3B runs on consumer GPUs (~8 GB), while the 14B delivers SOTA quality that rivals commercial models. Wan 2.2 adds MoE architecture (27B total / 14B active) for improved detail with same inference cost. Bilingual English/Chinese text generation in video.

Strengths

1.3B variant fits on almost any GPU (~8 GB)
SOTA among open-source video models
T2V, I2V, video editing, V2A — full pipeline
ComfyUI + Diffusers integration
FP8 and GGUF quants available

Weaknesses

14B needs offloading on 24 GB (4+ min per clip)
480p is more stable than 720p on 1.3B
Complex face consistency can vary

Diffusers: pip install diffusers (ComfyUI recommended)

Tencent — Hunyuan

HunyuanVideo (13B)

13B 14–80 GB Community

One of the largest open-source video generation models. Cinema-quality output with exceptional motion coherence and facial realism. Uses a "dual-stream to single-stream" transformer with a causal 3D VAE. Supports FP8 weights for lower VRAM. Has spawned community fine-tunes like SkyReels V1 for cinematic human-centric content.

Strengths

Best face/human rendering among open models
Cinema-grade temporal consistency
Camera motion control (zoom, pan, tilt)
xDiT multi-GPU parallelism support
FP8 mode runs on 14 GB with offloading

Weaknesses

8–15 min per clip on RTX 4090
Full precision needs 80 GB+
Prompt engineering required for best results

ComfyUI pip install diffusers

Lightricks

LTX Video (2B / 13B)

2B / 13B 12–24 GB Open

The speed champion — generates 30fps video at 1216×704 faster than real-time on RTX 4090. First DiT-based model optimized for rapid iteration. Multiple variants: 13B dev, 13B distilled, 2B distilled, and FP8 builds. Includes spatial and temporal upscalers.

Strengths

5–10 second generation on RTX 4090
30fps output at up to 1216×704
FP8 and distilled variants for efficiency
Built-in upscaler pipeline

Weaknesses

Lower quality ceiling than Wan/Hunyuan
Struggles with close-up faces
Best with concise 10–20 word prompts

Diffusers pip install diffusers

Zhipu AI — CogVideo

CogVideoX (2B / 5B)

2B / 5B 8–16 GB Apache 2.0

Solid mid-range video generation with 3D Causal VAE technology. Generates 6-second 720×480 clips with strong prompt adherence. CogVideoX 1.5 supports 10-second videos at higher resolution and I2V generation at any resolution. Good LoRA fine-tuning support via CogKit framework.

Strengths

Excellent detail and semantic accuracy
Strong Diffusers integration
LoRA fine-tuning with CogKit
5B fits on consumer 16 GB GPUs
Supports quantized inference (TorchAO)

Weaknesses

Close-up faces can struggle
6–12 min generation on RTX 4090
Fixed resolution modes

pip pip install cogkit

Genmo AI

Mochi 1 (10B)

10B 60–80 GB Apache 2.0

One of the largest dedicated text-to-video models under Apache 2.0. Asymmetric Diffusion Transformer (AsymmDiT) architecture with a custom VAE that compresses video by a factor of 128. Exceptional motion fluidity at 30fps and strong photorealistic faces in slow-motion shots. Best for high-fidelity short clips.

Strengths

Best natural motion quality among open models
Apache 2.0 — fully commercial
30fps photorealistic output
Fine-tuning with custom video datasets

Weaknesses

Needs 60–80 GB VRAM (A100/H100 territory)
480p max resolution currently
10–20 min per clip on RTX 4090 (with offload)

pip pip install diffusers

Music Generation (Text-to-Music / Audio)

Ace Studio

ACE-Step (0.9B)

0.9B ~4 GB Up to 4 min Apache 2.0

Text-to-music and singing voice generation model that can produce full songs up to 4 minutes long. Supports lyrics input for vocal tracks and instrumental generation from text descriptions.

Strengths

Full songs up to 4 minutes
Text-to-music + singing voice
Lyrics-conditioned vocal generation
Apache 2.0 — fully commercial

Weaknesses

Music quality below commercial services
Limited style control granularity

pip pip install ace-step

Stability AI

Stable Audio Open (1.3B)

1.3B ~6 GB 47s stereo CC-BY-SA

Stability AI's open-weight audio generation model producing up to 47 seconds of stereo audio from text descriptions. Supports music, sound effects, and ambient soundscapes with good fidelity.

Strengths

47s stereo audio generation
Music, SFX, and ambient sounds
Good audio fidelity
Open weights

Weaknesses

CC-BY-SA — share-alike requirement
47s max length — no long-form
Less coherent on complex musical structures

Diffusers pip install diffusers

Meta — AudioCraft

MusicGen (1.5B / 3.3B)

1.5B / 3.3B 6–12 GB MIT

Meta's text-to-music model from the AudioCraft suite. Generates high-quality music in various styles from text descriptions or melody conditioning. Well-established with strong community support and tooling.

Strengths

High-quality music generation
Melody conditioning — hum a tune, get a track
Multiple sizes for speed/quality tradeoff
MIT license — fully open

Weaknesses

30s max length per generation
3.3B version needs 12+ GB VRAM
No vocal/lyrics generation

pip pip install audiocraft

3D Generation (Text/Image-to-3D)

Microsoft

TRELLIS 2 (2B)

2B ~8 GB MIT

Microsoft's image and text-to-3D generation model. Produces high-quality 3D assets exportable as GLB and OBJ files for direct use in game engines and 3D applications.

Strengths

Image + text-to-3D generation
GLB/OBJ export — game-engine ready
High-quality mesh output
MIT license — fully open

Weaknesses

Complex scenes can lack detail
Texturing quality varies by input

pip pip install trellis3d

Tencent — Hunyuan

Hunyuan3D 2.0

Various 8–16 GB Tencent

Tencent's high-fidelity 3D generation system from text or image input. Produces detailed meshes with PBR textures suitable for professional 3D workflows and game asset pipelines.

Strengths

High-fidelity 3D mesh generation
PBR texture output
Text and image input support
Professional-quality assets

Weaknesses

Tencent license — check terms for commercial use
Higher VRAM requirements

pip pip install hunyuan3d

Tripo / Stability AI

TripoSR (1B)

1B ~4 GB MIT

Ultra-fast single-image-to-3D reconstruction. Generates a 3D mesh from a single photograph in under one second on modern GPUs. Ideal for rapid prototyping and batch 3D asset creation.

Strengths

Sub-1-second 3D generation
Single image input — no multi-view needed
Compact 1B parameters
MIT license — fully open

Weaknesses

Single-view — back of objects can be inaccurate
Lower detail than multi-step pipelines

pip pip install triposr

❓ Frequently Asked Questions


How much VRAM do I need to run a local LLM?

It depends on model size and quantization. At Q4_K_M (the community default): 7B models need 5–7 GB, 14B models need 10–11 GB, 32B models need 20–22 GB, and 70B models need 45–50 GB. As a rule of thumb, in GB: VRAM ≈ Parameters_B × 0.5 + KV_overhead. Consumer GPUs like the RTX 4090 (24 GB) comfortably accommodate 32B models.
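The rule of thumb can be turned into a quick fit check. This is a rough sketch that takes the formula literally. The 2 GB default for `kv_overhead_gb` is a placeholder assumption, not a figure from this page, and the bare formula can land below the ranges quoted above for larger models, so treat the result as a lower bound.

```python
def estimate_vram_gb(params_b: float, kv_overhead_gb: float = 2.0) -> float:
    """Rule-of-thumb VRAM in GB at Q4_K_M: Params_B x 0.5 + KV_overhead.

    kv_overhead_gb is an assumed placeholder; it grows with context length.
    """
    return params_b * 0.5 + kv_overhead_gb


def fits(params_b: float, gpu_vram_gb: float, kv_overhead_gb: float = 2.0) -> bool:
    """True if the estimate is within the GPU's VRAM."""
    return estimate_vram_gb(params_b, kv_overhead_gb) <= gpu_vram_gb


if __name__ == "__main__":
    for size in (7, 14, 32, 70):
        verdict = "fits" if fits(size, 24) else "does not fit"
        print(f"{size}B: ~{estimate_vram_gb(size):.1f} GB, {verdict} in 24 GB")
```

Raising `kv_overhead_gb` is the simplest way to model longer contexts.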

What is the best local LLM in 2026?

Top choices vary by hardware tier. For 24 GB GPUs: Qwen3 32B (Apache 2.0 license, dual-mode thinking, multilingual/coding excellence). For 16 GB: Phi-4 14B (MIT-licensed, exceptional reasoning/math). For 8 GB or less: Qwen3 8B or Llama 3.2 3B. Frontier-grade models include GLM-5 744B and DeepSeek V3.2 671B (requiring datacenter infrastructure).

What is Q4_K_M quantization?

Q4_K_M is a quantization format that averages about 4.8 bits per weight. Relative to 16-bit weights, it shrinks a model by roughly 70–75% while preserving about 99.5% of quality. It is the community default for running LLMs locally via Ollama and llama.cpp. Think of it as JPEG at 80% quality: barely distinguishable from the original but dramatically smaller.
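The size reduction follows from bit widths alone. The sketch below assumes a 16-bit (FP16) baseline, which the answer does not state explicitly, and counts weights only, ignoring file metadata and runtime buffers.

```python
def quantized_size_gb(params_b: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB: billions of params x bits / 8 bits per byte."""
    return params_b * bits_per_weight / 8


def size_reduction(bits_per_weight: float, baseline_bits: float = 16.0) -> float:
    """Fraction of size saved relative to the baseline precision."""
    return 1 - bits_per_weight / baseline_bits


if __name__ == "__main__":
    for size in (7, 14, 32, 70):
        fp16 = quantized_size_gb(size, 16)
        q4 = quantized_size_gb(size, 4.8)
        print(f"{size}B: {fp16:.1f} GB at 16-bit, {q4:.1f} GB at 4.8-bit")
    print(f"saved vs 16-bit: {size_reduction(4.8):.0%}")
```

At exactly 4 bits per weight the saving would be 75%. The extra 0.8 bits pay for the scaling data that helps preserve quality.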

Can I run AI models locally without internet?

Yes. Once downloaded, all models operate entirely offline with zero cloud dependency. Tools like Ollama, LM Studio, and llama.cpp enable local inference. This approach suits air-gapped environments, HIPAA compliance requirements, and privacy-sensitive workflows.
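As an illustration of offline use, the sketch below sends a prompt to an Ollama server on the same machine using only the Python standard library, so no traffic leaves localhost. It assumes Ollama is running on its default port and that the model has already been pulled. The tag `qwen3:8b` and the helper names are illustrative.

```python
import json
import urllib.request

# Ollama's default local endpoint; nothing here contacts a remote host.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(prompt: str, model: str = "qwen3:8b") -> urllib.request.Request:
    """Build a non-streaming generate request for a locally pulled model."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )


def ask_local(prompt: str, model: str = "qwen3:8b") -> str:
    """Send the prompt to the local server and return the generated text."""
    with urllib.request.urlopen(build_request(prompt, model)) as resp:
        return json.loads(resp.read())["response"]
```

With the server running, `ask_local("Summarize this note in one line.")` returns the model's reply as a string. Disconnecting the network beforehand is a simple way to confirm nothing depends on the cloud.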
