Local AI Model Index

Every major open-weight model you can run locally via Ollama, llama.cpp, or vLLM — cataloged.

Updated June 19, 2026
100
Models Cataloged
16
Model Families
0.3–1500
GB VRAM Range
82M–744B
Parameter Range
📐
VRAM ≈ (Params_B × 0.5) + KV_overhead
Rule of thumb for Q4_K_M quantization at 8K context. Add 20–50% for 32K–128K contexts. MoE models load all params but only activate a fraction per token.
Tier 1 — Entry Level
4–8 GB VRAM / CPU
Laptops, iGPUs, M-series Macs. Models up to ~8B. Basic chat, summaries, simple automation. ~30–80 t/s on 8B models, ~15–25 t/s on M-series Macs.
Tier 2 — Mid Range
12–24 GB VRAM
RTX 4070/4080/4090. Models 14B–32B. Strong coding, reasoning, agents. ~40–60 t/s on 14B, ~20–35 t/s on 32B. Sweet spot for power users.
Tier 3 — High End
32–80 GB VRAM
Multi-GPU or A100/H100. Models 70B+. Near-cloud quality. ~15–30 t/s on 70B, ~40+ t/s on A100. Production-grade local inference.
Tier 4 — Datacenter
80+ GB / Multi-Node
H100/H200 clusters. Full-scale MoE giants (200B–744B). Frontier-level capabilities. ~50–100+ t/s with tensor parallelism.
Dedicated Guides

Deep dives, one click away

The hard stuff — install paths, quantization, complete stacks, weekly changelog — each lives on its own page.

Tiny Models — ≤ 8B Parameters

Meta — Llama
Llama 3.2 (1B / 3B)
1B / 3B 1.5–3.6 GB 128K ctx Community Tool Use
Meta's ultra-lightweight on-device models. Fast inference (50–80 t/s), solid instruction following and multilingual support. The 3B variant is a popular starting point for testing local AI setups.
Strengths
  • Extremely fast — runs on CPU-only setups
  • Strong instruction following for size
  • Massive fine-tune ecosystem (100K+ HF variants)
  • Good multilingual support
Weaknesses
  • Limited reasoning depth on complex tasks
  • Shallower knowledge than 7B+ peers
  • Can hallucinate on niche domains
Best For
  • Always-on chatbots and agents
  • Mobile / edge / low-power devices
  • Quick local automation scripts
  • Testing and prototyping pipelines
VRAM Breakdown
  • 1B Q4: ~1.5–2 GB VRAM
  • 3B Q4: ~2.3 GB VRAM
  • CPU-only: 8+ GB RAM recommended
Ollama ollama pull llama3.2:1b ollama pull llama3.2:3b
Google — Gemma
Gemma 3 (1B / 4B)
1B / 4B 2–4 GB 128K ctx Vision Gemma Tool Use
Google's highly efficient small model family trained for maximum quality-per-parameter. The 4B variant includes vision support and handles 140+ languages. Power-efficient enough for mobile and embedded deployments.
Strengths
  • Exceptional efficiency — punches above its weight
  • Multilingual support (140+ languages)
  • Vision-ready in 4B+ variants
  • Very power-efficient for edge/mobile
Weaknesses
  • Lower benchmark scores on heavy reasoning vs denser peers
  • Limited coding depth compared to Qwen/Phi
Best For
  • On-device classification and summaries
  • Multilingual text processing
  • Quick image understanding tasks
  • Edge deployments and IoT
VRAM Breakdown
  • 1B Q4: ~2 GB
  • 4B Q4: ~4 GB
  • Runs well on Apple M-series via MLX
Ollama ollama pull gemma3:1b ollama pull gemma3:4b
Alibaba — Qwen
Qwen3 (0.6B / 1.7B / 4B)
0.6B–4B 1–4 GB 32K ctx Apache 2.0 Tool Use
Alibaba's smallest Qwen3 variants bring dual-mode thinking (fast vs. chain-of-thought) even to tiny form factors. Excellent multilingual coverage (100+ languages) and surprising early coding/math ability for size.
Strengths
  • Outstanding multilingual (100+ languages)
  • Dual-mode: fast inference or thinking mode
  • Strong early coding/math for size
  • Apache 2.0 — fully permissive license
Weaknesses
  • Very small variants hallucinate on niche topics
  • Limited context window vs. larger siblings
Ollama ollama pull qwen3:0.6b ollama pull qwen3:4b
Alibaba — Qwen
Qwen3.5 (4B / 9B / 27B)
4B–27B 3–18 GB 262K ctx Apache 2.0 Tool Use
Next-gen Qwen with hybrid Gated DeltaNet + Attention architecture, native vision-language fusion, and 262K native context (extensible to 1M). Supports 201 languages. Surpasses Qwen3 across reasoning, coding, and visual understanding.
Strengths
  • 262K native context, extensible to 1M tokens
  • Native multimodal — vision + text fused
  • 201 languages supported
  • Apache 2.0 — fully commercial
Weaknesses
  • Very new — limited community fine-tunes
  • Requires newer inference engines for hybrid attention
Ollama ollama pull qwen3.5:4b ollama pull qwen3.5:27b
Microsoft — Phi
Phi-4-mini (3.8B)
3.8B ~3 GB 128K ctx MIT Tool Use
Microsoft's tiny powerhouse, specifically tuned for reasoning and math. Often beats 7–13B models on STEM benchmarks despite its compact size. Supports 128K context and function calling.
Strengths
  • Exceptional reasoning/math — beats larger models
  • 128K context window
  • Very fast inference
  • Function calling support
Weaknesses
  • Narrower general knowledge
  • Less creative writing ability
Ollama ollama pull phi4-mini
Alibaba — Qwen
Qwen3 (8B)
8B 6–7 GB 128K ctx Apache 2.0 Tool Use
The community's top all-around 8B model. Leads benchmarks in multilingual, coding, and long-context tasks for its size class. Dual-mode thinking enables both quick responses and deep chain-of-thought reasoning.
Strengths
  • Leading multilingual / coding / long-context for 8B
  • Dual-mode thinking (fast + deep)
  • Beats many larger models on benchmarks
  • Excellent community daily driver
Weaknesses
  • Occasional inconsistency in fast mode
  • Knowledge depth limited vs 14B+
Ollama ollama pull qwen3:8b
Meta — Llama
Llama 3.1 (8B)
8B ~6.2 GB 128K ctx Community Tool Use
Meta's ecosystem king at the 8B tier. The most fine-tuned open model in history with vast community support. Strong generalist with 128K context and tool-calling capabilities. Ideal as a fine-tuning base.
Strengths
  • Massive fine-tune ecosystem
  • Strong generalist — tool-calling, agents
  • 128K context window
  • Excellent fine-tuning base
Weaknesses
  • Safety alignment can refuse creative prompts
  • Slightly behind Qwen3 8B on benchmarks
Ollama ollama pull llama3.1:8b
Mistral AI
Mistral 7B / Nemo 12B
7B / 12B 5–9 GB 128K ctx Apache 2.0
Mistral's foundational models — Apache 2.0 licensed with outstanding instruction-following, creativity, and European-language performance. Nemo 12B (built with NVIDIA) brings 128K context in a very efficient package.
Strengths
  • Superb instruction following and creativity
  • Excellent European language performance
  • Fully permissive Apache 2.0 license
  • Very efficient inference
Weaknesses
  • Older base — less competitive on 2026 reasoning benchmarks
  • Smaller models limited on very complex tasks
Ollama ollama pull mistral ollama pull mistral-nemo
DeepSeek
DeepSeek-R1 Distilled (7B / 8B)
7B / 8B 5–7 GB 64K ctx MIT
Distilled versions of DeepSeek's R1 reasoning model, bringing reinforcement-learning "thinking" mode to tiny form factors. The thinking traces enable complex problem-solving that rivals much larger dense models.
Strengths
  • RL-based thinking mode — rivals o1-level reasoning
  • Excellent coding/math for size
  • MIT license — fully permissive
Weaknesses
  • Occasionally less coherent without thinking mode
  • Slower when thinking traces are long
Ollama ollama pull deepseek-r1:8b
Xiaomi — MiMo
MiMo-7B-RL
7B 5–7 GB 48K ctx MIT
Xiaomi's 7B reasoning model that punches far above its weight. Trained on 25T tokens with RL, it matches OpenAI o1-mini on math and code tasks, outperforming many 32B models. Multi-Token Prediction enables ~90% speculative decoding acceptance.
Strengths
  • Matches o1-mini — MATH-500: 95.8%, AIME 2024: 68.2%
  • Outperforms 32B models in reasoning
  • MTP speculative decoding — very fast
  • MIT license — fully permissive
Weaknesses
  • Requires trust_remote_code for deployment
  • 48K context — smaller than competitors
Ollama ollama pull mimo:7b
Hugging Face
SmolLM2 (135M / 360M / 1.7B)
135M–1.7B 0.3–1.5 GB 8K ctx Apache 2.0 Tool Use
Hugging Face's ultra-compact language models designed for extreme edge deployment. The smallest viable LLMs for on-device inference where every megabyte counts.
Strengths
  • Incredibly small — runs on anything
  • Sub-1GB models available
  • Fast training/fine-tuning
Weaknesses
  • Very limited capability — basic tasks only
  • High hallucination rate
Ollama ollama pull smollm2:1.7b
Hugging Face
SmolLM3 (3B)
3B ~3 GB (Q4) 128K ctx Apache 2.0
Hugging Face's fully transparent 3B reasoning model. Outperforms Llama 3.2 3B and Qwen2.5 3B across 12 benchmarks. Dual-mode thinking, NoPE architecture, trained on 11.2T tokens. Full training recipe published.
Strengths
  • Outperforms Llama 3.2 3B and Qwen2.5 3B
  • Dual-mode reasoning (think/no_think)
  • 128K context via YaRN extrapolation
  • Fully open — weights + training recipe
Weaknesses
  • Falls behind Qwen3 4B on math tasks
  • 6 languages only (EN, FR, ES, DE, IT, PT)
Ollama ollama pull smollm3:3b
Google — Gemma
Gemma 3n (2B / 4B)
2B / 4B ~2–4 GB Apache 2.0 Vision
Google's edge-optimized multimodal model with vision and audio understanding. Built for on-device deployment via MediaPipe, enabling efficient inference on mobile and edge hardware without cloud dependency.
Strengths
  • Multimodal — vision + audio in tiny package
  • MediaPipe optimized for mobile/edge
  • Runs on phones and low-end hardware
  • Apache 2.0 — fully open
Weaknesses
  • Limited general reasoning at 2–4B scale
  • Narrower language coverage than larger Gemma
Ollama ollama pull gemma3n
BigCode / Hugging Face
StarCoder2 (3B / 7B)
3B / 7B ~3–5 GB 16K ctx BigCode OpenRAIL-M
Dedicated code completion model trained on The Stack v2. Supports 600+ programming languages with fill-in-the-middle capability. Lightweight enough for real-time IDE integration on consumer hardware.
Strengths
  • 600+ language support — widest code coverage
  • Fill-in-the-middle for IDE completion
  • Lightweight — real-time on CPU
  • Trained on ethically-sourced code (The Stack v2)
Weaknesses
  • Code-only — no general chat ability
  • 16K context (smaller than competitors)
Ollama ollama pull starcoder2:3b

Small Models — 9–14B Parameters

Microsoft — Phi
Phi-4 (14B)
14B ~11 GB 128K ctx MIT Tool Use
Microsoft's reasoning champion at 14B parameters. Tops charts for math and reasoning in its size class (84%+ MMLU). A compact powerhouse that fits comfortably on an RTX 4090 with room to spare for context.
Strengths
  • Exceptional reasoning/math — tops 14B class
  • 84%+ MMLU, strong GPQA scores
  • 128K context, fast inference
  • MIT license — fully open
Weaknesses
  • Less creative / broad general knowledge than Llama/Qwen
  • Narrower training data focus
Best For
  • STEM tasks, math tutoring
  • Research assistants on mid-range GPUs
  • Technical document analysis
  • Code review and generation
VRAM Breakdown
  • Q4: ~11 GB — fits RTX 4060 Ti 16GB
  • Q8: ~16 GB — fits RTX 4090
  • FP16: ~28 GB
Ollama ollama pull phi4:14b
Alibaba — Qwen
Qwen3 (14B)
14B ~10.7 GB 128K ctx Apache 2.0
Alibaba's 14B dense model with dual-mode thinking. Excels at multilingual tasks, coding, and long-context processing. A top community recommendation for users with 16GB VRAM GPUs.
Strengths
  • Top-tier multilingual and coding for 14B class
  • Dual-mode: fast + thinking
  • 128K context window
  • Apache 2.0 license
Weaknesses
  • Occasional fast-mode inconsistency
  • Slightly below Phi-4 on pure math
Ollama ollama pull qwen3:14b
DeepSeek
DeepSeek-R1 (14B distilled)
14B ~11 GB 64K ctx MIT Tool Use
The 14B distillation of DeepSeek-R1, bringing near-SOTA reasoning and coding ability to a single consumer GPU. Thinking mode enables complex multi-step problem solving.
Strengths
  • Near-SOTA coding and reasoning for size
  • RL-trained thinking traces
  • Excellent for agentic workflows
Weaknesses
  • Distilled — some quality loss vs full R1
  • Can be verbose in thinking mode
Ollama ollama pull deepseek-r1:14b
NVIDIA — Nemotron
Nemotron Nano 12B v2 VL
12B ~9 GB 128K ctx Vision+Video NVIDIA Open Tool Use
NVIDIA's multimodal reasoning model designed for document intelligence, video understanding, and visual Q&A. Hybrid Transformer-Mamba architecture combines accuracy with memory efficiency.
Strengths
  • Multi-image and video understanding
  • Leading OCR and document intelligence
  • Hybrid architecture — efficient memory
  • Reasoning mode toggle via system prompt
Weaknesses
  • Reasoning not supported for video inputs
  • Newer — smaller community ecosystem
Ollama ollama pull nemotron-nano:12b
Mistral AI
Mistral Small 3.1 (24B)
24B ~15 GB 128K ctx Apache 2.0 Tool Use
Updated with multimodal vision support and improved text performance. Outperforms Gemma 3 27B and GPT-4o Mini on most benchmarks while hitting 150 tok/s inference. Runs on a single RTX 4090 or 32GB Mac.
Strengths
  • Vision + text multimodal in one model
  • Outperforms Gemma 3 and GPT-4o Mini
  • 150 tok/s — very fast inference
  • Apache 2.0 — commercial friendly
Weaknesses
  • Heavier than 14B models — needs 16GB+ VRAM
  • Not as strong on pure math as Phi-4
Ollama ollama pull mistral-small3.1
NVIDIA / IBM
Granite 4.0 (350M → 32B MoE)
350M–32B 1B–9B active 0.5–20 GB 128K ctx Apache 2.0 Tool Use
IBM's enterprise-grade model family with a novel hybrid Mamba-2 architecture for faster inference and lower memory. Spans edge to server: Nano (350M/1B), Micro (3B), Tiny (7B MoE/1B active), Small (32B MoE/9B active). Trained on 15T tokens with strong tool calling and 12-language support.
Strengths
  • Hybrid Mamba-2 arch — faster inference, lower memory
  • Edge to server in one family (350M to 32B)
  • Strong instruction following and tool calling
  • Apache 2.0 — clean enterprise licensing
Weaknesses
  • Mamba-2 backend support still maturing in some tools
  • Smaller community than Llama/Qwen ecosystems
  • MoE variants need compatible inference stacks
Ollama ollama pull granite4 ollama pull granite4:small-h
MiniMax
MiniMax M2.5 (45.9B MoE / 8.6B active)
45.9B total 8.6B active ~12 GB (Q4) 128K ctx MoE Apache 2.0
MiniMax's dual-mode thinking model with MoE efficiency. Only 8.6B active parameters keep VRAM low while the full 45.9B architecture delivers strong multilingual reasoning. Competitive with much larger dense models on general benchmarks.
Strengths
  • Dual-mode thinking — fast and deep reasoning
  • MoE efficiency — only 8.6B active params
  • Strong multilingual performance
  • 128K context, Apache 2.0
Weaknesses
  • Smaller community than Qwen/Llama ecosystems
  • Fewer fine-tunes and adapters available
Ollama ollama pull minimax-m2.5
Cohere — Command
Command A (111B MoE / ~28B active)
111B total ~28B active ~25 GB (Q4) 256K ctx MoE CC-BY-NC Tool Use
Cohere's enterprise-focused MoE model built for RAG and agentic workflows. 256K context with strong grounded generation and citation support. MoE architecture keeps active params at ~28B for efficient inference on a single 48GB GPU.
Strengths
  • 256K context — massive document processing
  • Enterprise RAG with inline citations
  • MoE — fits on single 48GB GPU (Q4)
  • Strong agentic tool-use capabilities
Weaknesses
  • CC-BY-NC — no commercial use without agreement
  • 25GB+ VRAM even with MoE efficiency
  • Smaller fine-tune ecosystem
Ollama ollama pull command-a

Medium Models — 15–35B Parameters

OpenAI — GPT-OSS
GPT-OSS (20B)
20B ~12 GB 128K ctx Open Tool Use
OpenAI's open-weight model, designed with GPT-style structured reasoning and tool-use capabilities. Agent-friendly with strong structured output generation. Runs well quantized on 24GB cards.
Strengths
  • GPT-like structured reasoning and tool-use
  • Agent-friendly design with function calling
  • Runs well quantized on consumer GPUs
  • Strong structured outputs
Weaknesses
  • Newer — smaller community than Llama
  • 128K context limit (no 1M option)
Ollama ollama pull gpt-oss:20b
Mistral AI
Devstral Small 2 (24B)
24B ~15 GB 128K ctx Apache 2.0
Mistral's dedicated coding model optimized for software engineering and agentic coding tasks. Top SWE-bench scores in its size class with strong multi-file code understanding.
Strengths
  • 68% SWE-bench — exceptional coding
  • Strong agentic task completion
  • Multi-file code understanding
Weaknesses
  • Coding-focused — less general purpose
  • Not ideal for creative or chat tasks
Ollama ollama pull devstral-small
Google — Gemma
Gemma 3 (27B)
27B ~22.5 GB 128K ctx Vision Gemma Tool Use
Google's strong all-rounder with multimodal vision capabilities. Efficient for its size with good performance across reasoning, coding, and multilingual tasks. Fits on a 24GB GPU at Q4.
Strengths
  • Well-balanced across all benchmarks
  • Built-in vision support
  • Efficient for size, good throughput
  • 140+ language support
Weaknesses
  • Tight fit on 24GB at full context
  • Not absolute SOTA on any single benchmark
Ollama ollama pull gemma3:27b
Zhipu AI (Z.ai) — GLM
GLM-4.7-Flash (30B MoE / 3B active)
30B total 3B active ~20 GB 200K ctx MoE Open
The efficiency king for coding on consumer hardware. A 30B MoE model that only activates 3B parameters per token, delivering 59.2% SWE-bench at 60–80 t/s. Community calls it "best 70B-or-less model" for UI generation and tool calling. Interleaved thinking between actions.
Strengths
  • 59.2% SWE-bench — top for local coding
  • 60–80 t/s at 4-bit — very fast
  • Interleaved + preserved thinking modes
  • Excellent tool-use and agentic capabilities
  • Runs on RTX 3090/4090 and Mac M-series
Weaknesses
  • Chat template issues with some runtimes
  • Needs 24GB+ VRAM for good experience
  • Smaller community than Qwen/Llama
Ollama ollama pull glm4:latest llama.cpp recommended — use --jinja flag
NVIDIA — Nemotron
Nemotron 3 Nano (30B-A3B)
31.6B total 3.6B active ~20 GB 1M ctx Hybrid MoE NVIDIA Open Tool Use
NVIDIA's hybrid Mamba-Transformer MoE model designed for agentic AI. 1M token context window with 4x faster inference than predecessors. Activates only 3.6B of 31.6B parameters. 91% Math 500 score — the highest among peers. Open weights, training data, AND recipes.
Strengths
  • 91% Math Index — top in class
  • 1M token context window
  • 3.3x faster throughput than Qwen3-30B
  • Hybrid Mamba-Transformer architecture
  • Fully open: weights + data + recipes
Weaknesses
  • Automatic CPU offloading can reduce speed
  • Context cliff at high token counts
  • Newer Mamba architecture — less battle-tested
Ollama ollama pull nemotron-nano
NVIDIA — Nemotron
Nemotron 3 Super (120B-A12B)
120B MoE ~64 GB 1M ctx NVIDIA Open Tool Use
NVIDIA's hybrid Mamba-Transformer MoE model with 120.6B total params but only 12.7B active per forward pass. 7x higher throughput than previous gen. Supports multi-token prediction for faster generation. Built for agentic reasoning and collaborative multi-agent workflows. 1M token context via linear attention.
Strengths
  • 7x throughput improvement over previous gen
  • 1M token context window
  • Multi-token prediction for faster inference
  • Excellent agentic reasoning and tool use
Weaknesses
  • Requires 64GB+ VRAM/RAM at minimum
  • New architecture may have limited tooling support initially
  • Enterprise-focused — less community fine-tuning
Best For
  • Multi-agent orchestration
  • Enterprise agentic AI workflows
  • IT ticket automation and complex reasoning
  • High-throughput production inference
VRAM Breakdown
  • Q4: ~40 GB (12B active params)
  • Q8: ~64 GB
  • FP16: ~256 GB (multi-GPU)
  • Runs on Mac Studio M4 Ultra 192GB
OLLAMA ollama pull nemotron-3-super
NVIDIA — Nemotron
Nemotron 3 Omni
MoE ~24 GB 1M ctx Vision NVIDIA Open
NVIDIA's omni-understanding model for AI agents that need natural conversations, complex reasoning, and advanced visual capabilities. Analyzes video content, documents, and images. Built on the Nemotron 3 architecture with multimodal extensions.
Strengths
  • Video and document understanding
  • Natural conversational abilities
  • Complex visual reasoning
  • 1M token context for long documents
Weaknesses
  • New release — limited community benchmarks
  • NVIDIA GPU optimization bias
  • Less tested on consumer hardware
Best For
  • Document and video analysis agents
  • Visual question answering
  • Multimodal enterprise workflows
  • Real-time conversational AI with vision
VRAM Breakdown
  • Q4: ~16-24 GB (estimated)
  • Optimized for NVIDIA GPUs
  • vLLM and TensorRT-LLM support
NVIDIA pip install nemo-toolkit[all]
Alibaba — Qwen
Qwen3 (32B)
32B ~22 GB 128K ctx Apache 2.0
The community's top recommendation for 24GB GPUs. Exceptional multilingual, coding, and long-context stability — maintains 33–53 t/s at 48K tokens with zero offloading. The best long-context stability king in its class.
Strengths
  • Best long-context stability — 100% GPU at 48K
  • Exceptional multilingual and coding
  • Dual-mode thinking
  • Apache 2.0 license
Weaknesses
  • Tight fit on 24GB — context limited
  • Dense model — heavier than MoE alternatives
Ollama ollama pull qwen3:32b
DeepSeek
DeepSeek-R1 (32B)
32B ~22 GB 64K ctx MIT
Near-SOTA coding/math/reasoning with thinking mode at a size that fits on a single 24GB GPU. KV cache efficiency via MLA architecture. The go-to for developer tools and agentic workflows.
Strengths
  • Near-SOTA reasoning with thinking traces
  • KV cache efficiency (MLA architecture)
  • Excellent for agentic/developer workflows
  • MIT license
Weaknesses
  • May require vLLM patches for optimal speed
  • Verbose thinking traces eat context
Ollama ollama pull deepseek-r1:32b
Alibaba — Qwen
QwQ (32B)
32B ~22 GB 128K ctx Apache 2.0 Tool Use
Qwen's dedicated reasoning model, trained specifically for deep chain-of-thought problem solving. Excels at complex multi-step math and logic puzzles where extended reasoning is needed.
Strengths
  • Purpose-built for deep reasoning
  • Exceptional on math competitions (AIME)
  • 128K context for long reasoning chains
Weaknesses
  • Not a generalist — specialized for reasoning
  • Very verbose reasoning traces
Ollama ollama pull qwq
Allen AI (Ai2) — OLMo
OLMo 3.1 (7B / 32B)
7B / 32B 5–20 GB 65K ctx Apache 2.0
The only truly open-source model family — not just open weights, but full training code, Dolma 3 dataset (6T tokens), intermediate checkpoints, reward models, and OlmoTrace for auditing outputs back to training data. Think variant competitive with Qwen3 32B on reasoning.
Strengths
  • Fully open: code, data, checkpoints, logs — auditable end-to-end
  • Think variant strong on math, code, and reasoning
  • OlmoTrace lets you trace outputs to training data
  • Apache 2.0 — clean for enterprise and compliance
Weaknesses
  • Slightly behind Qwen3 on general chat tasks
  • Smaller community ecosystem than Llama/Qwen
  • 65K context (vs 128K+ on competitors)
Best For
  • Compliance environments requiring full auditability
  • Research where training data provenance matters
  • Reasoning tasks (Think variant)
  • Organizations that need Apache 2.0 + full transparency
VRAM Breakdown
  • 7B Q4: ~5 GB VRAM
  • 32B Q4: ~18–20 GB VRAM
  • 32B Q8: ~34 GB VRAM
Ollama ollama pull olmo-3.1 ollama pull olmo-3.1:7b
Alibaba — Qwen
Qwen3.5 (32B)
32B ~20 GB (Q4) 128K ctx Apache 2.0 Tool Use
The latest generation of Alibaba's Qwen series, surpassing Qwen3 on reasoning, coding, and multilingual benchmarks. A strong all-rounder that fits on a single RTX 4090 at Q4 quantization.
Strengths
  • Surpasses Qwen3 32B across benchmarks
  • Strong reasoning and coding performance
  • Fits on 24GB GPU at Q4
  • Apache 2.0 — fully commercial
Weaknesses
  • Newer — fewer fine-tunes than Qwen3
  • Tight fit on 24GB cards with long contexts
Ollama ollama pull qwen3.5:32b
TII — Falcon
Falcon 3 (7B / 10B)
7B / 10B ~8–10 GB Apache 2.0
TII's latest Falcon generation delivering state-of-the-art results for its size class. Strong multilingual capabilities with efficient inference. A solid contender in the 7–10B parameter range.
Strengths
  • State-of-the-art for 7–10B size class
  • Strong multilingual performance
  • Efficient inference on consumer GPUs
  • Apache 2.0 — fully open
Weaknesses
  • Smaller ecosystem than Llama/Qwen
  • Fewer community fine-tunes available
Ollama ollama pull falcon3:7b ollama pull falcon3:10b

Large Models — 36–80B Parameters

Meta — Llama
Llama 3.3 (70B)
70B 45–50 GB 128K ctx Community Tool Use
Near-405B performance in a 70B package. The most versatile large dense model with the biggest ecosystem of fine-tunes, adapters, and tooling. Production-grade quality for virtually any general task.
Strengths
  • Near-405B performance on many tasks
  • Massive ecosystem — thousands of fine-tunes
  • Versatile generalist with strong tool-calling
  • Production-grade quality
Weaknesses
  • 45–50 GB VRAM — needs multi-GPU or offloading
  • Safety alignment can feel restrictive for creative tasks
Ollama ollama pull llama3.3:70b
Alibaba — Qwen
Qwen3 (72B)
72B ~50 GB 128K ctx Apache 2.0
Alibaba's flagship dense model. Frontier-level multilingual and coding performance with the full dual-mode thinking system. Apache 2.0 licensed for commercial use.
Strengths
  • Frontier-level multilingual/coding/reasoning
  • Dual-mode thinking system
  • Apache 2.0 — fully commercial
Weaknesses
  • ~50 GB VRAM — multi-GPU needed
  • HF approval may be required
Ollama ollama pull qwen3:72b
DeepSeek
DeepSeek-R1 (70B)
70B ~45 GB 128K ctx MIT Tool Use
The 70B distillation of DeepSeek-R1, delivering top-tier reasoning and coding with thinking traces. High-stakes problem solving with MIT licensing.
Strengths
  • Top reasoning and coding at 70B tier
  • Thinking mode for complex problems
  • MIT license — fully permissive
Weaknesses
  • 45+ GB VRAM requirement
  • Verbose thinking traces
Ollama ollama pull deepseek-r1:70b

Frontier Models — 80B+ Parameters

Meta — Llama 4
Llama 4 Scout (109B MoE / 17B active)
109B total 17B active 55+ GB (Q4) 10M ctx MoE 16 experts Vision Llama 4 Tool Use
Meta's revolutionary MoE model with an industry-leading 10M token context window. Only 17B parameters active per token across 16 experts, enabling near-70B quality at a fraction of the inference cost. Natively multimodal with text and image processing. Ultra-quantized versions (1.78-bit) fit on a single 24GB GPU.
Strengths
  • 10M token context — industry leading
  • MoE: 70B+ quality at 17B inference cost
  • Natively multimodal (text + vision)
  • 1.78-bit quant fits on 24GB (~20 t/s)
  • Fits on single H100 at INT4
Weaknesses
  • Full weights need 216GB VRAM
  • Initial reviews show room for improvement vs hype
  • Extreme quant degrades quality noticeably
Ollama ollama pull llama4-scout
OpenAI — GPT-OSS
GPT-OSS (120B MoE / 5B active)
120B total 5B active 60+ GB 128K ctx MoE Open Tool Use
OpenAI's largest open model with 62.7% SWE-bench. MoE architecture activates only 5B of 120B parameters per token. Strong GPT-style reasoning with advanced tool-use and structured outputs.
Strengths
  • 62.7% SWE-bench — strong coding
  • GPT-style reasoning quality
  • Advanced tool-use and agents
Weaknesses
  • Needs 48GB+ VRAM
  • Newer ecosystem — fewer fine-tunes
Ollama ollama pull gpt-oss:120b
Alibaba — Qwen
Qwen3 (235B MoE / 22B active)
235B total 22B active 55+ GB 128K ctx MoE Apache 2.0 Tool Use
Near-frontier multilingual and long-context performance via Mixture of Experts. Quality Index of 57 — among the top open models. Activates 22B of 235B parameters per token for efficient inference.
Strengths
  • Near-frontier quality (Quality Index: 57)
  • 22B active params — efficient MoE
  • Exceptional multilingual/long-context
  • Apache 2.0 license
Weaknesses
  • 55+ GB VRAM minimum
  • Multi-GPU needed for full power
Ollama ollama pull qwen3:235b
Alibaba — Qwen
Qwen3.5 (397B MoE / 17B active)
397B total 17B active 230+ GB 262K ctx MoE 512 experts Apache 2.0 Tool Use
Qwen3.5 flagship — hybrid Gated DeltaNet + sparse MoE with 512 experts. Native vision-language fusion, 262K context (extensible to 1M), 201 languages. 3.5M downloads in 5 weeks. MMLU-Pro 87.8%, LiveCodeBench v6 83.6%, MMMU 85.0%.
Strengths
  • Frontier performance — 87.8% MMLU-Pro, 83.6% LiveCodeBench
  • Native multimodal — text + vision fused
  • 262K–1M context, 201 languages
  • Apache 2.0 — fully commercial
Weaknesses
  • 230+ GB — multi-GPU cluster required (8-way TP)
  • Needs SGLang/vLLM — no simple Ollama yet
SGLang / vLLM python -m sglang.launch_server --model-path Qwen/Qwen3.5-397B-A17B --tp-size 8
Meta — Llama 4
Llama 4 Maverick (400B MoE / 17B active)
400B total 17B active 200+ GB (Q4) 1M ctx MoE 128 experts Vision Llama 4 Tool Use
Meta's highest-performance open model. 128 experts with 1M context window. Competes with GPT-4o class models. Natively multimodal with text and image understanding. Requires multi-GPU setup for production use.
Strengths
  • GPT-4o class performance
  • 128 experts, 1M context
  • Natively multimodal
  • Co-distilled from Llama Behemoth
Weaknesses
  • 200+ GB VRAM — requires 4+ H100s at Q4
  • 1.78-bit quant needs 2x48GB (~40 t/s)
  • Extremely resource intensive
Ollama ollama pull llama4-maverick
Zhipu AI (Z.ai) — GLM
GLM-4.7 Full (355B MoE / 32B active)
355B total 32B active 205+ GB 200K ctx MoE Open Tool Use
Zhipu's flagship MoE model with interleaved thinking, preserved thinking, and turn-level thinking. 73.8% SWE-bench, 66.7% SWE-bench Multilingual. Exceptional for agentic coding and complex multi-step tasks.
Strengths
  • 73.8% SWE-bench — top-tier coding
  • Advanced thinking modes (interleaved, preserved)
  • Strong multilingual agentic coding
  • Competitive with Sonnet 3.5 for coding
Weaknesses
  • 205+ GB minimum — needs multi-GPU cluster
  • 2-bit GGUF needs 135GB + 128GB RAM
llama.cpp / vLLM recommended ollama pull glm4.7
Xiaomi — MiMo
MiMo-V2-Flash (309B MoE / 15B active)
309B total 15B active 180+ GB 256K ctx MoE MIT
Xiaomi's breakout frontier model that shocked the community — mistaken for DeepSeek V4 on OpenRouter. Hybrid SWA/Global attention with 5:1 ratio slashes KV-cache 6x. Multi-Token Prediction triples generation speed. 73.4% SWE-bench, 94.1% AIME 2025.
Strengths
  • 73.4% SWE-bench, 94.1% AIME 2025
  • Only 15B active — extremely efficient MoE
  • 6x KV-cache reduction via hybrid attention
  • MIT license — fully permissive
Weaknesses
  • Multi-GPU required (8-way TP recommended)
  • Needs SGLang or KTransformers — no simple Ollama
SGLang / KTransformers python3 -m sglang.launch_server --model-path XiaomiMiMo/MiMo-V2-Flash --tp-size 8
DeepSeek
DeepSeek V3.2 / R1 (671B MoE / 37B active)
671B total 37B active 300+ GB 128K ctx MoE MIT Tool Use
DeepSeek's full-scale MoE model — one of the most capable open models ever released. Rivals GPT-5 class on coding/reasoning benchmarks. MLA architecture provides extreme KV cache efficiency. Quality Index: 57.
Strengths
  • Near-frontier on all benchmarks
  • MLA architecture — extreme KV cache efficiency
  • 37B active params — efficient despite size
  • MIT license — fully permissive
Weaknesses
  • Datacenter-scale hardware required
  • Requires patched vLLM for optimal speed
  • 300+ GB VRAM minimum
vLLM / SGLang recommended for production ollama pull deepseek-v3.2-exp
Zhipu AI (Z.ai) — GLM
GLM-5 (744B MoE / 40B active)
744B total 40B active 1.5 TB (FP16) 200K ctx MoE MIT Tool Use
The largest open-weight model as of Feb 2026. 744B parameters (40B active) trained on 28.5T tokens. 86% GPQA-Diamond (graduate-level reasoning), 90% HumanEval. DeepSeek Sparse Attention for long-context efficiency. 2-bit GGUF: 241GB — fits 256GB unified memory Mac.
Strengths
  • 86% GPQA-Diamond — exceptional reasoning
  • 90% HumanEval — top coding
  • Quality Index: 49.64 — #1 open source
  • MIT license — fully open
  • 2-bit GGUF fits 256GB Mac
Weaknesses
  • FP16 needs 1.5TB VRAM (8x H200 minimum)
  • 2-bit quant: ~5 t/s with offloading
  • Consumer-unfriendly at full quality
llama.cpp / vLLM (8xH200+ for full quality) ollama pull glm5
Mistral AI
Mistral Large 3 (675B MoE)
675B total 300+ GB 128K ctx MoE Mistral Tool Use
Mistral's largest model to date. Positions itself as one of the strongest open-weight choices for advanced reasoning and high-end self-hosted assistants. Premium quality local inference.
Strengths
  • Premium reasoning and creative quality
  • Strong European language support
  • Advanced instruction following
Weaknesses
  • Datacenter-scale hardware required
  • Restrictive license vs Apache 2.0 models
vLLM / SGLang ollama pull mistral-large

Specialty — Coding, Vision, Embedding

Alibaba — Qwen
Qwen3-Coder (480B MoE / 35B active)
480B total 35B active 200+ GB 256K ctx MoE Apache 2.0
Alibaba's dedicated agentic coding model with massive 480B MoE architecture. 55.4% SWE-bench. Optimized for large-scale code generation and software engineering tasks.
HF / vLLM ollama pull qwen3-coder
DeepSeek
DeepSeek Coder V2 (16B / 236B)
16B / 236B 10–50+ GB 128K ctx MIT Tool Use
Purpose-built for code generation with MoE efficiency. Excellent for code completion, generation, and review. The 16B version fits on consumer GPUs.
Ollama ollama pull deepseek-coder-v2
Moonshot AI — Kimi
Kimi K2.5
MoE Varies 262K ctx Open
Moonshot's thinking-focused model with systematic reasoning for research and planning tasks. Exceptional multi-step reasoning with 262K context window. Strong math competition performance.
Ollama / vLLM ollama pull kimi-k2.5
Cohere — Command R
Command R+ (104B)
104B ~60 GB (Q4) 128K ctx CC-BY-NC Tool Use
Purpose-built for RAG and multi-step tool use. Generates grounded responses with inline citations. Highly efficient multilingual tokenizer (10+ languages) means lower cost per token for non-English content. Outperforms GPT-4 Turbo on tool-use benchmarks.
Strengths
  • Best-in-class RAG — grounded generation with citations
  • Zero-shot multi-step tool use
  • Efficient tokenizer cuts cost for multilingual workloads
  • Strong enterprise workflow integration
Weaknesses
  • CC-BY-NC license — no commercial use without Cohere agreement
  • 104B needs 60+ GB VRAM (Q4) — multi-GPU or 80GB cards
  • Older architecture — not MoE, so VRAM scales linearly
Ollama ollama pull command-r-plus
Mistral — Codestral
Codestral 25.05 (25B)
25B ~16 GB (Q4) 256K ctx Non-commercial Tool Use
Mistral's dedicated coding model with 256K context and 80+ programming language support. Built for agentic coding workflows with strong code generation, review, and refactoring capabilities.
Strengths
  • 256K context — handles entire codebases
  • 80+ programming languages
  • Strong agentic coding capabilities
  • Fits on 24GB GPU at Q4
Weaknesses
  • Non-commercial license — research/personal only
  • Code-focused — weaker on general tasks
Ollama ollama pull codestral
Alibaba — Qwen
Qwen3-Coder (32B Dense)
32B ~20 GB (Q4) 256K ctx Apache 2.0 Tool Use
The consumer-friendly dense variant of Qwen3-Coder. All 32B parameters active (no MoE), delivering strong agentic coding on a single RTX 4090. Excellent for software engineering tasks and IDE integration.
Strengths
  • Dense 32B — no MoE complexity
  • 256K context for large codebases
  • Strong agentic coding performance
  • Apache 2.0 — fully commercial
Weaknesses
  • Tight fit on 24GB GPUs with long context
  • Code-focused — use Qwen3.5 for general tasks
Ollama ollama pull qwen3-coder:32b
Various — Vision Models
LLaVA / Qwen2.5-VL / InternVL2.5 / Llama 3.2-Vision
7B–72B +2–5 GB overhead 128K ctx Vision
Vision-language models that add image understanding to base LLMs. LLaVA pioneered the approach. Qwen2.5-VL leads benchmarks with document OCR and video understanding. InternVL2.5 (Shanghai AI Lab) is the top open-source vision model for complex reasoning over images. Llama 3.2-Vision rounds out Meta's multimodal offering.
Top Picks
  • Qwen2.5-VL: Best all-around — OCR, video, charts, documents
  • InternVL2.5: Strongest visual reasoning and multi-image
  • Llama 3.2-Vision: Best Meta ecosystem integration
  • LLaVA: Lightweight, great for experimentation
Notes
  • Add 2–5 GB VRAM overhead on top of base model
  • Larger vision models (72B) need 48 GB+ VRAM
  • Quality varies heavily by size — 7B vision ≠ 72B vision
Ollama ollama pull llava ollama pull llama3.2-vision
Hugging Face
SmolVLM (256M / 500M / 2B)
256M–2B <1 GB Apache 2.0 Vision
Ultra-tiny vision-language models that run on as little as 1GB VRAM. Capable of document OCR, image captioning, and visual question answering at a fraction of the cost of larger VLMs. Perfect for edge deployment.
Strengths
  • Runs on 1GB VRAM — smallest viable VLM
  • Document OCR and image understanding
  • Multiple sizes for flexibility
  • Apache 2.0 — fully open
Weaknesses
  • Limited reasoning at tiny scale
  • Lower accuracy than larger VLMs
pip pip install transformers
Ai2 (Allen AI)
Molmo (7B / 72B)
7B / 72B 5–50 GB Apache 2.0 Vision
Ai2's fully open vision model with unique pointing and grounding capabilities. Can identify and locate objects in images with spatial coordinates. Fully open — weights, data, and code all available under Apache 2.0.
Strengths
  • Pointing/grounding — locates objects in images
  • Fully open: weights, data, and code
  • 7B version runs on consumer GPUs
  • Apache 2.0 — fully commercial
Weaknesses
  • 72B version needs datacenter hardware
  • Smaller ecosystem than Qwen2.5-VL
pip pip install transformers
Various — Embeddings
nomic-embed-text / bge-m3 / snowflake-arctic-embed
137M–335M <1 GB 8K tokens Apache 2.0
Lightweight embedding models for RAG pipelines, semantic search, and memory systems. Run alongside any LLM with negligible VRAM overhead. Essential for building retrieval-augmented generation systems.
Ollama ollama pull nomic-embed-text ollama pull bge-m3

Speech-to-Text (STT / ASR)

OpenAI — Whisper
Whisper Large V3 / V3 Turbo
1.55B / 809M 6–10 GB 99+ languages MIT
The gold standard for multilingual speech recognition. 99+ languages, automatic language detection, phrase-level timestamps, and punctuation. V3 Turbo cuts decoder layers from 32 to 4, delivering 6x faster inference with only 1–2% accuracy loss. 7.4% WER average on mixed benchmarks.
Strengths
  • 99+ language support — best multilingual STT
  • Automatic language identification
  • Phrase-level timestamps
  • Turbo variant: 6x faster, 809M params
  • Handles noise and accents well
Weaknesses
  • Large V3 needs ~10 GB VRAM
  • Not streaming-native (batch-oriented)
  • Can hallucinate on silence or music
pip pip install openai-whisper or faster-whisper for CTranslate2
Hugging Face
Distil-Whisper
756M ~3 GB MIT
6x faster than Whisper Large V3 with only 1% WER degradation. Distilled from Whisper for low-latency transcription. Ideal for real-time applications on consumer hardware.
pip pip install transformers accelerate
NVIDIA — NeMo
Parakeet TDT (0.6B / 1.1B)
0.6B / 1.1B 2–4 GB CC-BY-4.0
NVIDIA's speed-optimized ASR model. RTFx near 2,000x — processes audio dramatically faster than Whisper. RNN-Transducer architecture enables streaming recognition with minimal latency. Trained on 65,000 hours of English audio.
Strengths
  • Among the fastest open ASR models
  • Streaming-capable — real-time transcription
  • 65K hours of training data
  • Optimized for NVIDIA GPUs
Weaknesses
  • English-only
  • Ranks lower on pure accuracy vs Whisper
  • Speed-optimized — accuracy tradeoff
pip pip install nemo_toolkit[asr]
NVIDIA / IBM
Canary Qwen 2.5B / Granite Speech 3.3 8B
2.5B / 8B 4–10 GB Various
Top-accuracy English STT models. Canary Qwen combines speech recognition with the Qwen language model for superior contextual understanding. IBM Granite Speech brings enterprise-grade accuracy. Best for strict accuracy requirements.
NeMo / HF Transformers pip install nemo_toolkit[asr]
Useful Sensors
Moonshine (Tiny / Base)
27M / 61M <0.5 GB MIT
Ultra-lightweight edge ASR model. Outperforms Whisper Tiny and Small despite being significantly smaller. Designed for smartphones, IoT, and offline environments where every MB counts.
pip pip install moonshine
Useful Sensors
Moonshine Gen 2 (0.3B / 0.5B)
0.3B / 0.5B <0.5 GB MIT
Second-generation edge ASR with streaming real-time transcription. Improved accuracy over the original Moonshine while maintaining ultra-low latency. Ideal for always-on voice interfaces and IoT devices.
Strengths
  • Streaming real-time ASR — sub-100ms latency
  • Edge-optimized — runs on microcontrollers
  • Improved accuracy over Moonshine v1
  • MIT license
Weaknesses
  • English-focused
  • Lower accuracy than Whisper on complex audio
pip pip install moonshine
OpenAI — Whisper
Whisper Large V4 (1.5B)
1.5B ~10 GB 99+ languages MIT
The latest iteration of OpenAI's Whisper, improving accuracy over V3 across multilingual benchmarks. Retains full 99+ language support with better handling of accents, noise, and domain-specific terminology.
Strengths
  • Improved accuracy over Whisper V3
  • 99+ languages — best multilingual coverage
  • Better noise and accent handling
  • MIT license — fully open
Weaknesses
  • Needs ~10 GB VRAM
  • Not streaming-native (batch-oriented)
  • Still hallucination-prone on silence
pip pip install openai-whisper

Text-to-Speech (TTS / Voice Synthesis)

Canopy Labs
Orpheus TTS (3B)
3B ~4 GB Apache 2.0
The breakthrough TTS model of late 2025. Human-like emotional speech that rivals ElevenLabs — laughing, crying, whispering on command. Real-time on modern GPUs. State-of-the-art naturalness with emotional control, completely free and local.
Strengths
  • State-of-the-art naturalness — rivals ElevenLabs
  • Emotional control (laughing, crying, whispering)
  • Real-time on modern GPUs
  • Apache 2.0 — fully commercial
Weaknesses
  • English-focused (multilingual expanding)
  • Needs GPU for real-time speeds
  • Newer — smaller community than Piper
pip pip install orpheus-tts
Kokoro
Kokoro-82M
82M <1 GB Apache 2.0
The breakout star of 2026 local TTS. Only 82M parameters yet delivers neural-quality speech with breathing and natural pauses. Runs on CPU, Apple Silicon, or any GPU. Shockingly good for its tiny size.
Strengths
  • 82M params — runs on anything
  • Neural quality (breathing, pausing)
  • CPU and Apple Silicon capable
  • Near-zero VRAM requirement
Weaknesses
  • Limited voice cloning ability
  • Fewer emotional controls than Orpheus
pip pip install kokoro
Fish Audio
Fish Speech V1.5 (S1-mini)
~500M 2–4 GB Apache 2.0
The go-to open-source model for voice cloning across languages. Handles code-switching (e.g., Spanglish) better than most paid APIs. Strong multilingual voice synthesis with zero-shot cloning from short reference audio.
Strengths
  • Best open-source voice cloning
  • Excellent cross-language code-switching
  • Zero-shot cloning from ~10s audio
  • Strong multilingual support
Weaknesses
  • Higher VRAM than Kokoro/Piper
  • Quality varies by language
pip pip install fish-speech
Coqui AI
XTTS v2 / Coqui TTS
~450M 2–4 GB 17 languages MPL-2.0
High-quality multilingual TTS with voice cloning from a 6-second reference. 17 languages supported. The broadest toolkit in open-source TTS with pre-trained voices, fine-tuning, and extensive documentation. Runs well on MacBook Air with 16GB.
Strengths
  • 17 language support out-of-box
  • Voice cloning from 6-second reference
  • 1100+ pre-trained voices
  • Extensive documentation and community
Weaknesses
  • Higher latency than Piper/MeloTTS
  • MPL license (some restrictions)
pip pip install TTS
Open Home Foundation
Piper TTS
Various (~15M–60M) <0.5 GB 40+ languages MIT
Ultra-fast, ultra-lightweight neural TTS designed for offline and embedded use. Sub-second latency, 40+ languages, runs on Raspberry Pi. The default TTS for Home Assistant. Doesn't clone voices but offers many pre-trained speakers.
Strengths
  • Fastest open TTS — sub-second latency
  • Runs on Raspberry Pi / embedded
  • 40+ languages, 100+ voices
  • Home Assistant integration
Weaknesses
  • No voice cloning
  • Less natural than Orpheus/XTTS
  • Fixed voice catalog
pip pip install piper-tts
Suno AI
Bark
~900M 4–6 GB MIT
Not just TTS — Bark generates music, sound effects, and non-verbal vocalizations (laughs, sighs, throat clears). The most creative and fun model in the list. Can generate background music and ambient sounds alongside speech.
Strengths
  • Speech + music + sound effects
  • Non-verbal vocalizations
  • Creative and expressive
  • MIT license
Weaknesses
  • High latency — not real-time
  • Unpredictable output quality
  • GPU-heavy for best results
pip pip install git+https://github.com/suno-ai/bark.git
MyShell
MeloTTS
~200M <1 GB 5 languages MIT
Lightweight and remarkably consistent TTS. Maintains low latency even with long texts — processes short texts in under a second. 5 languages with natural prosody. Ideal for low-resource devices and consistent production use.
pip pip install melotts
Resemble AI
Chatterbox TTS (0.4B)
0.4B ~2 GB MIT
Zero-shot voice cloning TTS with emotional control. Clone any voice from a short audio sample and generate speech with adjustable emotion, pitch, and speaking style. Compact and fast on consumer hardware.
Strengths
  • Zero-shot voice cloning from short samples
  • Emotional control — adjust tone and style
  • Compact 0.4B — fast inference
  • MIT license — fully open
Weaknesses
  • English-focused
  • Cloned voice quality depends on input sample
pip pip install chatterbox-tts
Nari Labs
Dia TTS (1.6B)
1.6B ~3 GB Apache 2.0
Multi-speaker dialogue TTS that generates natural conversations between multiple speakers with integrated sound effects. Perfect for audiobook production, podcast generation, and interactive storytelling.
Strengths
  • Multi-speaker dialogue generation
  • Integrated sound effects
  • Natural conversational flow
  • Apache 2.0 — fully commercial
Weaknesses
  • Larger than single-speaker TTS models
  • Sound effects limited to trained set
pip pip install dia-tts
Amphion
MaskGCT (0.4B)
0.4B ~2 GB MIT
Non-autoregressive TTS using masked generative codec transformers for fast zero-shot multilingual speech synthesis. Generates speech in parallel rather than sequentially, enabling significantly faster inference.
Strengths
  • Non-autoregressive — very fast generation
  • Zero-shot multilingual voice cloning
  • Compact 0.4B parameters
  • MIT license
Weaknesses
  • Newer — smaller community
  • Non-AR can sacrifice some prosody quality
pip pip install amphion

Embedding, Search & Retrieval

Nomic AI
Nomic Embed Text V2
475M (305M active) <1 GB 8K tokens MoE Apache 2.0
First MoE embedding model. Trained on 1.6B multilingual pairs across 100+ languages. Supports flexible output dimensions (256–768) via Matryoshka learning. Competitive with models twice its size on BEIR and MIRACL benchmarks. 86.2% top-5 accuracy.
Strengths
  • MoE — efficient with 305M active params
  • 100+ languages, 100+ code languages
  • Flexible dimensions (256–768)
  • Top BEIR/MIRACL scores for size
Weaknesses
  • Can drop on noisy/multilingual data
  • Needs prefix prompts for optimal results
Ollama ollama pull nomic-embed-text
BAAI (Beijing Academy)
BGE-M3
~570M <1 GB 8K tokens MIT
The Swiss army knife of embedding models. M3 = Multi-functionality (dense + sparse + ColBERT retrieval), Multi-linguality (100+ languages), Multi-granularity (up to 8K tokens). SOTA on MIRACL and MKQA. The first model to unify all three retrieval methods.
Strengths
  • Dense + sparse + ColBERT in one model
  • 100+ languages — best cross-lingual
  • 8K token input length
  • SOTA on multilingual benchmarks
Weaknesses
  • Slightly slower than single-mode models
  • Requires prompt engineering for best results
Ollama ollama pull bge-m3
Alibaba — Qwen
Qwen3-Embedding (0.6B / 4B / 8B)
0.6B–8B 0.5–7 GB 32K tokens Apache 2.0
Alibaba's instruction-aware embedding models built on Qwen3. Support user-defined task instructions for 1–5% accuracy improvement. Flexible output dimensions (32–1024). 100+ natural and programming languages. The 4B and 8B variants outperform most competitors.
pip pip install sentence-transformers
Snowflake
Arctic Embed (Various sizes)
22M–335M <0.5 GB 512 tokens Apache 2.0
Snowflake's optimized embedding suite. Multiple sizes from 22M to 335M for different accuracy/speed tradeoffs. Strong English retrieval performance with focus on enterprise search workloads.
Ollama ollama pull snowflake-arctic-embed
BAAI (Beijing Academy)
BGE Reranker v2-M3 / Gemma-based
~570M–9B 0.5–8 GB 8K tokens Apache 2.0
The most popular open-source rerankers for RAG pipelines. Cross-encoder architecture processes query + document together for precise relevance scoring. Adds 100–500ms latency but significantly improves retrieval quality. Run after initial embedding search to rerank top-K candidates.
Strengths
  • Dramatically improves RAG accuracy
  • Runs on consumer hardware
  • Multiple sizes for speed/accuracy tradeoff
  • Apache 2.0 — no licensing fees
Weaknesses
  • Adds 100–500ms latency per query
  • Cross-encoder — can't precompute
  • Larger variants need significant VRAM
pip pip install FlagEmbedding
Stanford / AnswerAI
ColBERT v2 / ColBERTv2.0
~110M <1 GB 512 tokens MIT
Late-interaction retrieval model using token-level embeddings for scalable BERT-based search in milliseconds. Superior to traditional single-vector embeddings while maintaining speed. Uses MaxSim operator for efficient contextual matching across large datasets.
pip pip install colbert-ai
Jina AI
Jina Embeddings v3 (0.6B)
0.6B <1 GB 8K tokens Apache 2.0
Task-specific embedding model using LoRA adapters for retrieval, classification, separation, and text-matching tasks. Automatically selects the optimal adapter based on query intent, improving accuracy across diverse embedding use cases.
Strengths
  • Task-specific LoRA adapters — optimized per use case
  • 8K context — handles longer documents
  • Strong multilingual support
  • Apache 2.0 — fully open
Weaknesses
  • Larger than simpler embedding models
  • LoRA switching adds minor complexity
pip pip install jina-embeddings

Image Generation (Diffusion Models)

Stability AI
Stable Diffusion 1.5
860M 4–6 GB 512×512 native CreativeML Open
The model that started the local image generation revolution. Still viable in 2026 for low-VRAM setups. The largest ecosystem of LoRAs, checkpoints, and fine-tunes. Thousands of community models on CivitAI. Best entry point for learning diffusion workflows.
Strengths
  • Runs on 4 GB VRAM — most accessible
  • Largest LoRA/checkpoint ecosystem ever
  • Fastest generation times
  • Mature tooling (A1111, ComfyUI)
Weaknesses
  • 512×512 native — needs upscaling
  • Weaker prompt adherence vs newer models
  • Poor text rendering in images
ComfyUI / A1111 pip install diffusers transformers accelerate
Stability AI
Stable Diffusion XL (SDXL 1.0)
3.5B (UNet) 8–12 GB 1024×1024 native CreativeML Open
The most widely used open-source image model in existence. 1024×1024 native with dramatically better prompt adherence, photorealism, and composition than SD 1.5. Deepest community ecosystem — thousands of fine-tuned checkpoints (Juggernaut XL, RealVis, DreamShaper) and LoRAs on CivitAI.
Strengths
  • 1024×1024 native resolution
  • Largest fine-tune ecosystem (Juggernaut, RealVis, etc.)
  • Excellent photorealism with right checkpoint
  • ControlNet, inpainting, upscaling support
  • Commercial use allowed
Weaknesses
  • 8 GB minimum, 12 GB recommended
  • Refiner adds complexity and VRAM
  • Text rendering still inconsistent
Variants SDXL 1.0 · SDXL Turbo · SDXL Lightning
SDXL Lightning: 1–4 step generation. SDXL Turbo: real-time. Use ComfyUI or Forge WebUI.
Stability AI
Stable Diffusion 3.5 (Medium / Large / Turbo)
2.5B–8B 12–24 GB 1024×1024 Stability Community
Stability's latest architecture with improved text rendering inside images and better prompt understanding. Uses MMDiT (Multi-Modal Diffusion Transformer). Better typography than SDXL but smaller community fine-tune ecosystem. Medium variant fits 12 GB, Large needs 24 GB.
Strengths
  • Improved text rendering in images
  • Better prompt fidelity than SDXL
  • Multiple size variants
  • Compatible with existing SD tooling
Weaknesses
  • 12–24 GB VRAM requirement
  • Much smaller LoRA/fine-tune ecosystem
  • Community license — not fully open
ComfyUI pip install diffusers transformers
Black Forest Labs
FLUX.1 (Schnell / Dev)
12B 8–24 GB 1024×1024+ Apache 2.0 (Schnell) / NC (Dev)
The next generation from the original Stable Diffusion creators. 12B parameter DiT architecture delivering Midjourney-level quality locally. Best text rendering of any open model. Schnell = 4-step fast generation (Apache 2.0 commercial). Dev = 28–35 step high quality (non-commercial). GGUF quants run on 6–8 GB.
Strengths
  • Midjourney-level quality — locally
  • Best text rendering in open models
  • Schnell: 4 steps, Apache 2.0 commercial
  • GGUF/NF4 quants: 6–8 GB VRAM
  • Excellent anatomy and photorealism
Weaknesses
  • Full FP16: needs 24 GB VRAM
  • Dev license is non-commercial
  • Smaller fine-tune ecosystem than SDXL
  • No A1111 support — ComfyUI or Forge only
ComfyUI git clone https://github.com/comfyanonymous/ComfyUI.git
NF4 quant: 8 GB VRAM, slight detail loss. FP8: 16 GB, near-lossless. Full BF16: 24 GB. LoRA training needs 24 GB+.
Black Forest Labs
FLUX.2 (Dev / Pro / Flex / Klein)
4B–32B 13–24 GB 1024×1024+ Various
Released November 2025. Production-grade successor to FLUX.1 with state-of-the-art quality. Klein (4B) fits 13–16 GB. Dev (32B) is open-weight. Pro rivals top proprietary models. Flex variant gives fine-grained control over generation parameters.
Strengths
  • State-of-the-art open image quality
  • Klein 4B variant fits consumer GPUs
  • Exceptional prompt fidelity
  • Flex: developer-friendly parameter control
Weaknesses
  • Dev 32B needs significant VRAM
  • Pro is API-only
  • Commercial licensing required for some variants
ComfyUI / Diffusers pip install diffusers transformers
Zhipu AI
Z-Image-Turbo
~3B ≤16 GB 1024×1024+ Apache 2.0
Sub-second inference on enterprise GPUs, comfortable on 16 GB consumer cards. Matches or exceeds FLUX.2 Dev and HunyuanImage 3.0 on benchmarks at a fraction of the compute. Standout bilingual text rendering (English + Chinese) with high clarity. Apache 2.0 — fully commercial.
Strengths
  • Sub-second generation — fastest quality model
  • Fits 16 GB consumer GPUs
  • Best bilingual text rendering (EN/CN)
  • Apache 2.0 — fully commercial
  • Beats models 10x its size
Weaknesses
  • New — small community ecosystem
  • Fewer LoRAs/fine-tunes available
Diffusers / ComfyUI pip install diffusers
Stability AI
Stable Cascade
~5B (staged) 8–16 GB 1024×1024 Stability Community
Three-stage architecture (Stage A/B/C) that compresses the latent space dramatically — 40% less VRAM than SDXL for comparable quality. Much improved text rendering inside images. Faster training with fewer data requirements.
Diffusers pip install diffusers
PixArt (Alpha Lab)
PixArt-α / PixArt-Σ
600M 4–8 GB 1024×1024+ Apache 2.0
Remarkably efficient DiT model — only 600M params yet competitive with SDXL. PixArt-Σ supports 4K resolution generation. Trained with 90% less compute than SD. Excellent for resource-constrained setups that still need high-quality output.
Diffusers pip install diffusers
Tencent — Hunyuan
Hunyuan Image (1.5B DiT)
1.5B 8–12 GB 1024×1024+ Apache 2.0
Tencent's bilingual (English/Chinese) DiT image generator with high-quality output. Strong prompt adherence and aesthetic quality competitive with FLUX at a smaller parameter count.
Strengths
  • Bilingual EN/CN prompt support
  • High aesthetic quality
  • Smaller than FLUX — faster inference
  • Apache 2.0 — fully open
Weaknesses
  • Smaller LoRA ecosystem than SD/FLUX
  • Less community tooling support
Diffusers pip install diffusers
NVIDIA / MIT
Sana (1.6B)
1.6B 6–10 GB Up to 4096×4096 Apache 2.0
Efficient DiT model capable of generating up to 4K resolution images with very fast generation times. Uses linear attention for efficient scaling to high resolutions without the quadratic memory cost of standard transformers.
Strengths
  • Up to 4K resolution generation
  • Very fast — efficient linear attention
  • Compact 1.6B parameters
  • Apache 2.0 — fully open
Weaknesses
  • Newer — limited community fine-tunes
  • Less refined than FLUX on complex prompts
Diffusers pip install diffusers
Various
ESRGAN / Real-ESRGAN / 4x-UltraSharp
~16M 1–2 GB BSD / Apache
AI upscaling models essential for any image generation pipeline. Real-ESRGAN handles 4x upscaling with face enhancement. 4x-UltraSharp is the community favorite for sharp detail. Tiny VRAM footprint — pairs with any generation model.
pip pip install realesrgan

Video Generation (Text/Image-to-Video)

Alibaba — Wan
Wan 2.1 / 2.2 (1.3B / 5B / 14B / 27B MoE)
1.3B–27B 8–24+ GB Apache 2.0
The most versatile open-source video model suite. Text-to-video, image-to-video, video editing, and video-to-audio. The 1.3B runs on consumer GPUs (~8 GB), while the 14B delivers SOTA quality that rivals commercial models. Wan 2.2 adds MoE architecture (27B total / 14B active) for improved detail with same inference cost. Bilingual English/Chinese text generation in video.
Strengths
  • 1.3B variant fits on almost any GPU (~8 GB)
  • SOTA among open-source video models
  • T2V, I2V, video editing, V2A — full pipeline
  • ComfyUI + Diffusers integration
  • FP8 and GGUF quants available
Weaknesses
  • 14B needs offloading on 24 GB (4+ min per clip)
  • 480p is more stable than 720p on 1.3B
  • Complex face consistency can vary
Diffusers pip install diffusers ComfyUI recommended
Tencent — Hunyuan
HunyuanVideo (13B)
13B 14–80 GB Community
The largest open-source video generation model. Cinema-quality output with exceptional motion coherence and facial realism. Uses a "dual-stream to single-stream" transformer with a causal 3D VAE. Supports FP8 weights for lower VRAM. Has spawned community fine-tunes like SkyReels V1 for cinematic human-centric content.
Strengths
  • Best face/human rendering among open models
  • Cinema-grade temporal consistency
  • Camera motion control (zoom, pan, tilt)
  • xDiT multi-GPU parallelism support
  • FP8 mode runs on 14 GB with offloading
Weaknesses
  • 8–15 min per clip on RTX 4090
  • Full precision needs 80 GB+
  • Prompt engineering required for best results
ComfyUI pip install diffusers
Lightricks
LTX Video (2B / 13B)
2B / 13B 12–24 GB Open
The speed champion — generates 30fps video at 1216×704 faster than real-time on RTX 4090. First DiT-based model optimized for rapid iteration. Multiple variants: 13B dev, 13B distilled, 2B distilled, and FP8 builds. Includes spatial and temporal upscalers.
Strengths
  • 5–10 second generation on RTX 4090
  • 30fps output at up to 1216×704
  • FP8 and distilled variants for efficiency
  • Built-in upscaler pipeline
Weaknesses
  • Lower quality ceiling than Wan/Hunyuan
  • Struggles with close-up faces
  • Best with concise 10–20 word prompts
Diffusers pip install diffusers
Zhipu AI — CogVideo
CogVideoX (2B / 5B)
2B / 5B 8–16 GB Apache 2.0
Solid mid-range video generation with 3D Causal VAE technology. Generates 6-second 720×480 clips with strong prompt adherence. CogVideoX 1.5 supports 10-second videos at higher resolution and I2V generation at any resolution. Good LoRA fine-tuning support via CogKit framework.
Strengths
  • Excellent detail and semantic accuracy
  • Strong Diffusers integration
  • LoRA fine-tuning with CogKit
  • 5B fits on consumer 16 GB GPUs
  • Supports quantized inference (TorchAO)
Weaknesses
  • Close-up faces can struggle
  • 6–12 min generation on RTX 4090
  • Fixed resolution modes
pip pip install cogkit
Genmo AI
Mochi 1 (10B)
10B 60–80 GB Apache 2.0
The largest dedicated text-to-video model under Apache 2.0. Asymmetric Diffusion Transformer (AsymmDiT) architecture with a custom VAE that compresses video to 128× smaller. Exceptional motion fluidity at 30fps and strong photorealistic faces in slow-motion shots. Best for high-fidelity short clips.
Strengths
  • Best natural motion quality among open models
  • Apache 2.0 — fully commercial
  • 30fps photorealistic output
  • Fine-tuning with custom video datasets
Weaknesses
  • Needs 60–80 GB VRAM (A100/H100 territory)
  • 480p max resolution currently
  • 10–20 min per clip on RTX 4090 (with offload)
pip pip install diffusers

Music Generation (Text-to-Music / Audio)

Ace Studio
ACE-Step (0.9B)
0.9B ~4 GB Up to 4 min Apache 2.0
Text-to-music and singing voice generation model that can produce full songs up to 4 minutes long. Supports lyrics input for vocal tracks and instrumental generation from text descriptions.
Strengths
  • Full songs up to 4 minutes
  • Text-to-music + singing voice
  • Lyrics-conditioned vocal generation
  • Apache 2.0 — fully commercial
Weaknesses
  • Music quality below commercial services
  • Limited style control granularity
pip pip install ace-step
Stability AI
Stable Audio Open (1.3B)
1.3B ~6 GB 47s stereo CC-BY-SA
Stability AI's open-weight audio generation model producing up to 47 seconds of stereo audio from text descriptions. Supports music, sound effects, and ambient soundscapes with good fidelity.
Strengths
  • 47s stereo audio generation
  • Music, SFX, and ambient sounds
  • Good audio fidelity
  • Open weights
Weaknesses
  • CC-BY-SA — share-alike requirement
  • 47s max length — no long-form
  • Less coherent on complex musical structures
Diffusers pip install diffusers
Meta — AudioCraft
MusicGen (1.5B / 3.3B)
1.5B / 3.3B 6–12 GB MIT
Meta's text-to-music model from the AudioCraft suite. Generates high-quality music in various styles from text descriptions or melody conditioning. Well-established with strong community support and tooling.
Strengths
  • High-quality music generation
  • Melody conditioning — hum a tune, get a track
  • Multiple sizes for speed/quality tradeoff
  • MIT license — fully open
Weaknesses
  • 30s max length per generation
  • 3.3B version needs 12+ GB VRAM
  • No vocal/lyrics generation
pip pip install audiocraft

3D Generation (Text/Image-to-3D)

Microsoft
TRELLIS 2 (2B)
2B ~8 GB MIT
Microsoft's image and text-to-3D generation model. Produces high-quality 3D assets exportable as GLB and OBJ files for direct use in game engines and 3D applications.
Strengths
  • Image + text-to-3D generation
  • GLB/OBJ export — game-engine ready
  • High-quality mesh output
  • MIT license — fully open
Weaknesses
  • Complex scenes can lack detail
  • Texturing quality varies by input
pip pip install trellis3d
Tencent — Hunyuan
Hunyuan3D 2.0
Various 8–16 GB Tencent
Tencent's high-fidelity 3D generation system from text or image input. Produces detailed meshes with PBR textures suitable for professional 3D workflows and game asset pipelines.
Strengths
  • High-fidelity 3D mesh generation
  • PBR texture output
  • Text and image input support
  • Professional-quality assets
Weaknesses
  • Tencent license — check terms for commercial use
  • Higher VRAM requirements
pip pip install hunyuan3d
Tripo / Stability AI
TripoSR (1B)
1B ~4 GB MIT
Ultra-fast single-image-to-3D reconstruction. Generates a 3D mesh from a single photograph in under one second on modern GPUs. Ideal for rapid prototyping and batch 3D asset creation.
Strengths
  • Sub-1-second 3D generation
  • Single image input — no multi-view needed
  • Compact 1B parameters
  • MIT license — fully open
Weaknesses
  • Single-view — back of objects can be inaccurate
  • Lower detail than multi-step pipelines
pip pip install triposr

💬 Suggest an Improvement

Missing a model? Found incorrect info? Have a feature request? Help make this reference better for everyone.

✓ Thanks! Your suggestion has been submitted.

❓ Frequently Asked Questions

Click to expand ▼

How much VRAM do I need to run a local LLM?

It depends on model size and quantization. At Q4_K_M (the community default): 7B models need 5–7 GB, 14B models need 10–11 GB, 32B models need 20–22 GB, and 70B models need 45–50 GB. The formula approximates: VRAM ≈ Parameters_B × 0.5 + KV_overhead. Consumer GPUs like the RTX 4090 (24 GB) comfortably accommodate 32B models.

What is the best local LLM in 2026?

Top choices vary by hardware tier. For 24 GB GPUs: Qwen3 32B (Apache 2.0 license, dual-mode thinking, multilingual/coding excellence). For 16 GB: Phi-4 14B (MIT-licensed, exceptional reasoning/math). For 8 GB or less: Qwen3 8B or Llama 3.2 3B. Frontier-grade models include GLM-5 744B and DeepSeek V3.2 671B (requiring datacenter infrastructure).

What is Q4_K_M quantization?

Q4_K_M is a 4.8-bit quantization format that reduces model size by approximately 75% while preserving 99.5% of quality. It represents the community standard for local LLM operation via Ollama and llama.cpp. Think of it as "JPEG at 80% quality — barely distinguishable from original but dramatically smaller."

Can I run AI models locally without internet?

Yes. Once downloaded, all models operate entirely offline with zero cloud dependency. Tools like Ollama, LM Studio, and llama.cpp enable local inference. This approach suits air-gapped environments, HIPAA compliance requirements, and privacy-sensitive workflows.