Local AI Model Index

Every major open-weight model you can run locally via Ollama, llama.cpp, or vLLM — cataloged.

Updated March 2026
73+
Models Cataloged
16
Model Families
0.3–1500
GB VRAM Range
82M–744B
Parameter Range
📐
VRAM ≈ (Params_B × 0.5) + KV_overhead
Rule of thumb for Q4_K_M quantization at 8K context. Add 20–50% for 32K–128K contexts. MoE models load all params but only activate a fraction per token.
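The rule of thumb above is easy to script — a minimal sketch, assuming ~2 GB of KV overhead at 8K context (the helper name is illustrative):

```shell
# Illustrative helper for the rule of thumb above:
# VRAM_GB ≈ params_in_billions × 0.5 + KV_overhead_GB (Q4_K_M, 8K context)
vram_estimate() {
  awk -v p="$1" -v kv="${2:-2}" 'BEGIN { printf "%.1f GB\n", p * 0.5 + kv }'
}
vram_estimate 8       # 8B model  → 6.0 GB
vram_estimate 70      # 70B model → 37.0 GB
vram_estimate 32 6    # 32B model with ~6 GB KV for a longer context
```

The ×0.5 factor is deliberately rough; the quantization table later on this page uses measured sizes, which work out closer to ~0.6 GB per billion parameters at Q4_K_M plus runtime overhead.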
Tier 1 — Entry Level
4–8 GB VRAM / CPU
Laptops, iGPUs, M-series Macs. Models up to ~8B. Basic chat, summaries, simple automation. ~30–80 t/s on 8B models, ~15–25 t/s on M-series Macs.
Tier 2 — Mid Range
12–24 GB VRAM
RTX 4070/4080/4090. Models 14B–32B. Strong coding, reasoning, agents. ~40–60 t/s on 14B, ~20–35 t/s on 32B. Sweet spot for power users.
Tier 3 — High End
32–80 GB VRAM
Multi-GPU or A100/H100. Models 70B+. Near-cloud quality. ~15–30 t/s on 70B, ~40+ t/s on A100. Production-grade local inference.
Tier 4 — Datacenter
80+ GB / Multi-Node
H100/H200 clusters. Full-scale MoE giants (200B–744B). Frontier-level capabilities. ~50–100+ t/s with tensor parallelism.

🚀 Getting Started — From Zero to Running in 5 Minutes


Every model on this page can be up and running on your machine in minutes. Here's the path most people take:

1
Install Ollama
One installer, all platforms. Handles model downloading, GPU detection, quantization, and an OpenAI-compatible API on localhost:11434.
curl -fsSL https://ollama.com/install.sh | sh
Or download from ollama.com for Windows/Mac.
2
Pull Your First Model
One command downloads and runs it. Ollama auto-selects the best quantization for your hardware.
ollama run llama3.2:3b
~2 GB download. You're chatting in under a minute.
3
Add a Web UI
Ollama's CLI works, but a web interface makes it actually enjoyable. Open WebUI is the community standard.
docker run -d -p 3000:8080 --add-host=host.docker.internal:host-gateway ghcr.io/open-webui/open-webui:main
Then open localhost:3000. ChatGPT-like interface, your machine.
4
Scale Up
Ready for more? Pull bigger models, add RAG with your documents, or set up image/video generation.
ollama pull qwen3:32b
Browse the catalog below to find what fits your GPU.
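Since Ollama serves an API on localhost:11434 (step 1), anything that speaks HTTP can drive your model. A minimal sketch with curl — the model tag assumes you pulled llama3.2:3b in step 2:

```shell
# Build the request body, then POST it to Ollama's generate endpoint.
# "stream": false returns one JSON object instead of a token stream.
payload='{"model": "llama3.2:3b", "prompt": "Say hello in five words.", "stream": false}'
curl -s http://localhost:11434/api/generate -d "$payload" \
  || echo "Ollama is not running — start it first"
```

The same server also answers OpenAI-style requests at /v1/chat/completions, which is how most of the UIs below plug in.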

Frontend Tools — What You Actually Interact With

Models are engines. These tools are the steering wheel. Each serves a different workflow — most people end up using 2–3 of them.

Ollama Essential
The foundation. CLI tool that downloads, manages, and serves LLMs locally. Built on llama.cpp. OpenAI-compatible API on port 11434 means everything else can plug into it.
Best for: Backend engine, API integration, developers who live in the terminal. Pairs with every UI below.
Open WebUI Most Popular
ChatGPT-like web interface for Ollama. Multi-user support, document upload, RAG, web search, plugins, and conversation history. 45K+ GitHub stars. Docker deploy.
Best for: Daily driver chat UI. Teams sharing a local AI server. Anyone who wants the ChatGPT experience privately.
LM Studio Beginner Friendly
Desktop app with built-in model browser and one-click downloads from Hugging Face. No Docker or CLI needed. Excellent Apple Silicon optimization. MCP tool support.
Best for: Non-technical users, Mac users, people who want zero CLI. Also provides a local API server.
AnythingLLM RAG Focused
Desktop + server app built around document Q&A. Drag-and-drop file ingestion, workspace-based chat, built-in vector store, and no-code agent builder. Connects to Ollama or cloud APIs.
Best for: Chatting with your own documents. Consulting firms with multiple client knowledge bases. RAG without writing code.
ComfyUI Creative
Node-based visual workflow editor for image and video generation. The standard tool for Stable Diffusion, FLUX, Wan, and HunyuanVideo. Handles model loading, LoRA stacking, and pipeline orchestration.
Best for: All image/video generation on this page. Visual pipelines. LoRA workflows. The creative production stack.
Jan / GPT4All / LobeChat Alternatives
Jan: zero-config desktop app, bundles everything, no Ollama needed. GPT4All: Nomic's lightweight chat app with LocalDocs for file Q&A. LobeChat: polished mobile-first PWA with voice and multimodal support.
Best for: Jan for non-technical family/colleagues. GPT4All for simple local file chat. LobeChat for mobile-first users.
Pinokio One-Click
The "Steam for AI." A browser-based launcher that one-click installs and manages local AI apps — ComfyUI, Ollama, Open WebUI, Stable Diffusion, text-generation-webui, and dozens more. No terminal, no Python environments, no dependency conflicts. It installs the tools; the models come from this page.
Best for: Absolute beginners who want to skip all setup. Artists and creators who just want to generate. Installing multiple AI tools without breaking your system.
Text Generation WebUI Power User
Oobabooga's feature-rich web interface. Multiple backends (llama.cpp, transformers, ExLlama2, AutoGPTQ, AWQ). LoRA loading, fine-grained parameter control, extensions system, multimodal support. Zero telemetry, full privacy.
Best for: Advanced users who want maximum control over inference parameters. LoRA experimentation. Running models in formats beyond GGUF (GPTQ, AWQ, safetensors).

📐 Quantization — What Happens When You Compress a Model

Think of it like JPEG compression. A photo saved at JPEG 100% is huge but perfect. At 80% it's half the size and you can't tell the difference. At 40% you start seeing artifacts. At 10% it's a blurry mess — but it loads instantly on any device.

Quantization works the same way for AI models. The original 16-bit weights are the "full resolution." Compressing to 4-bit cuts VRAM by 75% and is barely distinguishable from full quality. Go below 3-bit and things start breaking.

The Compression Table — 70B Model (e.g., Llama 3.3)

Every VRAM estimate on this page uses Q4_K_M as the baseline — the community default. Here's how the same 70B model looks at each compression level:

Level Bits Size on Disk VRAM Quality Like JPEG...
FP16 16-bit ~140 GB ~142 GB Perfect — baseline 100% — lossless
Q8_0 8-bit ~70 GB ~74 GB Essentially identical 95% — can't tell
Q6_K 6.6-bit ~58 GB ~62 GB Extremely close 90% — pixel peeping only
Q4_K_M ★ 4.8-bit ~42 GB ~46 GB 99.5% preserved 80% — the sweet spot
Q3_K_M 3.9-bit ~34 GB ~38 GB Subtle degradation 60% — trained eyes notice
Q2_K 2.6-bit ~24 GB ~28 GB Noticeable loss 30% — artifacts visible
IQ2_XS 2.3-bit ~21 GB ~25 GB Significant loss 15% — it runs, barely
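The disk sizes in this table follow directly from bits-per-weight, so you can estimate any quant of any model — a quick sketch (function name illustrative):

```shell
# size_on_disk_GB ≈ params_in_billions × bits_per_weight / 8
quant_size() {
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.0f GB\n", p * b / 8 }'
}
quant_size 70 16     # FP16 baseline → 140 GB, matching the first row
quant_size 70 4.8    # Q4_K_M        → 42 GB, matching the starred row
quant_size 32 4.8    # a 32B model at Q4_K_M
```

Add a few GB on top for VRAM (runtime buffers and KV cache), as the gap between the Size and VRAM columns shows.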

What Breaks First

Not all tasks degrade equally. Here's what you lose first as you compress harder — from most sensitive to most resilient:

Task Type · Sensitivity · What Happens Below Q4
Math & Logic · ● Critical · Arithmetic errors, proof steps skipped, wrong answers to multi-step problems
Structured Code · ● High · Syntax errors, wrong function signatures, broken JSON/API outputs
Reasoning Chains · ● Medium · Loses track of multi-step logic, contradicts itself, shallow analysis
Instruction Following · ● Medium · Ignores constraints, drifts from format, misses parts of complex prompts
Creative Writing · ● Low · Slightly less nuanced vocabulary, more repetitive patterns
Casual Chat · ● Minimal · Basically unaffected — conversational quality holds well even at Q3

Real-World Scenarios

Theory is nice, but here's what the tradeoffs actually look like with models from this page:

Scenario 1 — RTX 4090 (24 GB)
Quality vs. Power: 32B@Q8 or 70B@Q4?

You have two strong options with 24 GB:

  • Qwen3 32B @ Q8 — Fits easily (~20 GB). Near-perfect quality. Excellent for coding, reasoning, and long-context work where accuracy matters.
  • Llama 3.3 70B @ Q4_K_M — Doesn't fit in 24 GB (~46 GB at Q4 needs heavy CPU offloading; even Q3 at ~38 GB spills over). More raw knowledge and capability, but quantization costs you some math precision and complex reasoning.
Rule of thumb: A smaller model at higher quant often beats a larger model at lower quant — especially for coding and structured output. The 32B@Q8 will generate cleaner code than the 70B@Q3.
Scenario 2 — 16 GB GPU
Coding Assistant: 14B@Q8 or 30B MoE@Q4?

Two paths to a capable coding assistant:

  • Phi-4 14B @ Q8 — ~16 GB, fills the card. Full-quality reasoning and coding. 128K context. Rock solid for structured tasks.
  • GLM-4.7-Flash 30B MoE @ Q4 — ~20 GB total but only 3B active. 59.2% SWE-bench. Needs some CPU offloading but the MoE speed advantage means it's still fast.
Verdict: GLM-4.7 wins for coding specifically — MoE architecture means speed stays high even with offloading, and its SWE-bench score crushes Phi-4. For general-purpose reasoning, Phi-4@Q8 is the safer choice.
Scenario 3 — Stretching the Limits
Running Llama 4 Scout (109B) on 24 GB

The extreme case — a 109B model on consumer hardware:

  • A 1.78-bit Unsloth-style dynamic quant brings it down to ~24 GB VRAM
  • The 10M context window still works but KV cache eats VRAM fast
  • MoE helps — only 17B params activate per token, so speed is still usable
  • Quality: general chat and simple tasks work fine. Complex coding and math noticeably degrade.
Verdict: It's impressive that it runs at all. Fine for casual use and testing. For production agent work, you'd want this model at Q4+ on multi-GPU — or pick a smaller model at higher quant.
Technical Notes — Formats, MoE, Context Scaling
GGUF vs. GPTQ vs. AWQ vs. EXL2 GGUF = llama.cpp/Ollama format. Best for CPU+GPU split. Most models available here. GPTQ = GPU-only, fast batch inference. AWQ = activation-aware, better quality at same bits. EXL2 = variable bits-per-layer, optimal quality. For Ollama: GGUF. For vLLM: AWQ/GPTQ.
MoE Memory Trap MoE models load ALL parameters into VRAM but only activate a subset per token. A 236B MoE with 22B active still needs ~120 GB at Q4 because every expert weight must be resident. Speed is set by active params. VRAM is set by total params.
Context Length Eats VRAM KV cache grows with context. At 8K: +1–2 GB. At 32K: +4–8 GB. At 128K: +20–40 GB. Models with GQA or MLA (DeepSeek, Llama 3) are much more efficient. Nemotron's 1M context uses linear attention to keep it manageable.
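The growth above is roughly linear in context length. A purely illustrative helper, assuming ~1.5 GB per 8K tokens for a 70B-class model with GQA (the constant varies widely by architecture):

```shell
# Very rough KV-cache estimate: scales linearly with context length.
# 1.5 GB per 8192 tokens is an assumed constant, not a measured one.
kv_overhead() {
  awk -v ctx="$1" 'BEGIN { printf "%.1f GB\n", ctx / 8192 * 1.5 }'
}
kv_overhead 8192     # 8K
kv_overhead 32768    # 32K
kv_overhead 131072   # 128K
```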
CPU Offloading If a model doesn't fit entirely in VRAM, llama.cpp/Ollama automatically split layers between GPU and system RAM. Each offloaded layer is ~3–10x slower. A 70B@Q4 running 50/50 GPU/CPU is ~3x slower than full GPU — but it runs.
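You can also pin the GPU/CPU split yourself in Ollama via a Modelfile — a sketch, assuming the num_gpu parameter (which maps to llama.cpp's n-gpu-layers; the right layer count depends on the model and your VRAM):

```
# Modelfile sketch: keep 40 layers on GPU, offload the rest to system RAM
FROM llama3.3:70b
PARAMETER num_gpu 40
```

Build it with ollama create llama33-offload -f Modelfile and run as usual; raise or lower num_gpu until it fits.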
Ollama Quant Selection ollama pull llama3.3 gives you Q4_K_M by default. For specific quants: ollama pull llama3.3:70b-instruct-q8_0 or ollama pull llama3.3:70b-instruct-q2_K. Check available tags on ollama.com/library.
Unsloth Dynamic Quants Unsloth's "Dynamic 2.0" method keeps important layers at higher precision (8 or 16-bit) while compressing less critical ones further. A 4-bit Dynamic quant often matches standard Q5 quality. Look for "Unsloth" GGUF uploads on Hugging Face.

🧱 Recommended Stacks — Complete Local AI Setups by Hardware


Stop thinking about individual models — here's what a complete local AI stack looks like at each hardware tier. Every model listed below is on this page with full details. All LLM VRAM at Q4_K_M unless noted.

RTX 4090 · 24 GB — Power User
Full-Stack Agent Workstation
LLM · Qwen3 32B @ Q8 · ~20 GB
RAG · Nomic Embed Text V2 · ~0.5 GB
STT · Whisper V3 Turbo · ~3 GB
TTS · Kokoro 82M · ~0.3 GB
Peak VRAM (LLM + RAG concurrent) · ~20.5 GB
STT/TTS load on-demand, not concurrent with LLM. Swap LLM to Llama 3.3 70B@Q4 for broader knowledge (needs CPU offload). Add FLUX.1 Schnell for image gen — runs separately at ~8 GB.
RTX 4070 Ti Super · 16 GB — Mid Range
Coding & RAG Assistant
LLM · Phi-4 14B @ Q8 · ~16 GB
RAG · Snowflake Arctic 335M · ~0.3 GB
STT · Distil-Whisper · ~3 GB
TTS · Piper (VITS) · ~0.1 GB
Peak VRAM (LLM + RAG) · ~16.3 GB
Tight fit — LLM fills the card. STT/TTS run after unloading LLM, or use CPU. Alt LLM: GLM-4.7-Flash 30B MoE@Q4 for coding (needs offload but faster via MoE).
RTX 3060/4060 · 12 GB — Budget
Capable Local Chat
LLM · Gemma 3 12B @ Q4 · ~8 GB
RAG · Nomic Embed Text V2 · ~0.5 GB
STT · Moonshine Base · ~0.2 GB
TTS · Piper (VITS) · ~0.1 GB
Peak VRAM (all concurrent) · ~8.8 GB
Room to breathe. Gemma 3 12B is surprisingly capable with vision. For image gen, SDXL Turbo fits at ~8 GB when LLM is unloaded.
Mac M-Series · 32 GB Unified — Apple Silicon
Unified Memory Advantage
LLM · Llama 3.3 70B @ Q4 · ~40 GB
RAG · BGE-M3 · ~0.5 GB
STT · Whisper V3 Turbo · ~3 GB
TTS · Kokoro 82M · ~0.3 GB
Total memory footprint · ~44 GB
A 64 GB M-series Mac runs this 70B stack comfortably. On 32 GB the ~44 GB footprint does not fit — drop to Qwen3 32B@Q6 or cut context sharply. MLX framework recommended. Generation ~15–25 t/s on M3 Max.
Air-Gapped · 24 GB — Compliance Ready
Zero Cloud, Fully Licensed Stack
LLM · Qwen3 32B @ Q8 (Apache 2.0) · ~20 GB
RAG · Nomic Embed V2 (Apache 2.0) · ~0.5 GB
STT · Whisper V3 Turbo (MIT) · ~3 GB
TTS · Piper VITS (MIT) · ~0.1 GB
Rerank · BGE Reranker v2-M3 (MIT) · ~0.5 GB
Peak VRAM (LLM + RAG + Rerank) · ~21 GB
Every component Apache 2.0 or MIT — audit-ready for HIPAA, FedRAMP, and regulated environments. Qwen3 supports native tool calling for agentic workflows. No internet required post-deployment. Swap Qwen3 for Llama 3.3 70B@Q4 (Community license) if broader knowledge is needed and Meta's license terms are acceptable.
RTX 4090 · 24 GB — Creative & Content
Image + Video + Voice Production
Image · FLUX.1 Schnell (Apache 2.0) · ~8 GB
Video · Wan 2.1 14B (Apache 2.0) · ~14 GB
Upscale · Real-ESRGAN 4x · ~1 GB
TTS · Orpheus 3B · ~4 GB
STT · Whisper V3 Turbo (MIT) · ~3 GB
Peak VRAM (one model at a time) · ~14 GB
Run sequentially, not concurrent — load Wan for video, unload, load FLUX for images. ComfyUI manages model swapping automatically. Wan 1.3B variant at ~8 GB for faster iteration. Add Qwen3 8B (~6 GB) as a prompt enhancer between generations. Full content pipeline: script → voice → images → video → upscale.

Tiny Models — ≤ 8B Parameters

Meta — Llama
Llama 3.2 (1B / 3B)
1B / 3B 1.5–3.6 GB 128K ctx Community Tool Use
Meta's ultra-lightweight on-device models. Fast inference (50–80 t/s), solid instruction following and multilingual support. The 3B variant is a popular starting point for testing local AI setups.
Strengths
  • Extremely fast — runs on CPU-only setups
  • Strong instruction following for size
  • Massive fine-tune ecosystem (100K+ HF variants)
  • Good multilingual support
Weaknesses
  • Limited reasoning depth on complex tasks
  • Shallower knowledge than 7B+ peers
  • Can hallucinate on niche domains
Best For
  • Always-on chatbots and agents
  • Mobile / edge / low-power devices
  • Quick local automation scripts
  • Testing and prototyping pipelines
VRAM Breakdown
  • 1B Q4: ~1.5–2 GB VRAM
  • 3B Q4: ~3.6 GB VRAM
  • CPU-only: 8+ GB RAM recommended
Ollama ollama pull llama3.2:1b · ollama pull llama3.2:3b
Google — Gemma
Gemma 3 (1B / 4B)
1B / 4B 2–4 GB 128K ctx Vision Gemma Tool Use
Google's highly efficient small model family trained for maximum quality-per-parameter. The 4B variant includes vision support and handles 140+ languages. Power-efficient enough for mobile and embedded deployments.
Strengths
  • Exceptional efficiency — punches above its weight
  • Multilingual support (140+ languages)
  • Vision-ready in 4B+ variants
  • Very power-efficient for edge/mobile
Weaknesses
  • Lower benchmark scores on heavy reasoning vs denser peers
  • Limited coding depth compared to Qwen/Phi
Best For
  • On-device classification and summaries
  • Multilingual text processing
  • Quick image understanding tasks
  • Edge deployments and IoT
VRAM Breakdown
  • 1B Q4: ~2 GB
  • 4B Q4: ~4 GB
  • Runs well on Apple M-series via MLX
Ollama ollama pull gemma3:1b · ollama pull gemma3:4b
Alibaba — Qwen
Qwen3 (0.6B / 1.7B / 4B)
0.6B–4B 1–4 GB 32K ctx Apache 2.0 Tool Use
Alibaba's smallest Qwen3 variants bring dual-mode thinking (fast vs. chain-of-thought) even to tiny form factors. Excellent multilingual coverage (100+ languages) and surprising early coding/math ability for size.
Strengths
  • Outstanding multilingual (100+ languages)
  • Dual-mode: fast inference or thinking mode
  • Strong early coding/math for size
  • Apache 2.0 — fully permissive license
Weaknesses
  • Very small variants hallucinate on niche topics
  • Limited context window vs. larger siblings
Ollama ollama pull qwen3:0.6b · ollama pull qwen3:4b
Microsoft — Phi
Phi-4-mini (3.8B)
3.8B ~3 GB 128K ctx MIT Tool Use
Microsoft's tiny powerhouse, specifically tuned for reasoning and math. Often beats 7–13B models on STEM benchmarks despite its compact size. Supports 128K context and function calling.
Strengths
  • Exceptional reasoning/math — beats larger models
  • 128K context window
  • Very fast inference
  • Function calling support
Weaknesses
  • Narrower general knowledge
  • Less creative writing ability
Ollama ollama pull phi4-mini
Alibaba — Qwen
Qwen3 (8B)
8B 6–7 GB 128K ctx Apache 2.0 Tool Use
The community's top all-around 8B model. Leads benchmarks in multilingual, coding, and long-context tasks for its size class. Dual-mode thinking enables both quick responses and deep chain-of-thought reasoning.
Strengths
  • Leading multilingual / coding / long-context for 8B
  • Dual-mode thinking (fast + deep)
  • Beats many larger models on benchmarks
  • Excellent community daily driver
Weaknesses
  • Occasional inconsistency in fast mode
  • Knowledge depth limited vs 14B+
Ollama ollama pull qwen3:8b
Meta — Llama
Llama 3.1 (8B)
8B ~6.2 GB 128K ctx Community Tool Use
Meta's ecosystem king at the 8B tier. The most fine-tuned open model in history with vast community support. Strong generalist with 128K context and tool-calling capabilities. Ideal as a fine-tuning base.
Strengths
  • Massive fine-tune ecosystem
  • Strong generalist — tool-calling, agents
  • 128K context window
  • Excellent fine-tuning base
Weaknesses
  • Safety alignment can refuse creative prompts
  • Slightly behind Qwen3 8B on benchmarks
Ollama ollama pull llama3.1:8b
Mistral AI
Mistral 7B / Nemo 12B
7B / 12B 5–9 GB 128K ctx Apache 2.0
Mistral's foundational models — Apache 2.0 licensed with outstanding instruction-following, creativity, and European-language performance. Nemo 12B (built with NVIDIA) brings 128K context in a very efficient package.
Strengths
  • Superb instruction following and creativity
  • Excellent European language performance
  • Fully permissive Apache 2.0 license
  • Very efficient inference
Weaknesses
  • Older base — less competitive on 2026 reasoning benchmarks
  • Smaller models limited on very complex tasks
Ollama ollama pull mistral · ollama pull mistral-nemo
DeepSeek
DeepSeek-R1 Distilled (7B / 8B)
7B / 8B 5–7 GB 64K ctx MIT
Distilled versions of DeepSeek's R1 reasoning model, bringing reinforcement-learning "thinking" mode to tiny form factors. The thinking traces enable complex problem-solving that rivals much larger dense models.
Strengths
  • RL-based thinking mode — rivals o1-level reasoning
  • Excellent coding/math for size
  • MIT license — fully permissive
Weaknesses
  • Occasionally less coherent without thinking mode
  • Slower when thinking traces are long
Ollama ollama pull deepseek-r1:8b
Hugging Face
SmolLM2 (135M / 360M / 1.7B)
135M–1.7B 0.3–1.5 GB 8K ctx Apache 2.0 Tool Use
Hugging Face's ultra-compact language models designed for extreme edge deployment. The smallest viable LLMs for on-device inference where every megabyte counts.
Strengths
  • Incredibly small — runs on anything
  • Sub-1GB models available
  • Fast training/fine-tuning
Weaknesses
  • Very limited capability — basic tasks only
  • High hallucination rate
Ollama ollama pull smollm2:1.7b

Small Models — 9–14B Parameters

Microsoft — Phi
Phi-4 (14B)
14B ~11 GB 128K ctx MIT Tool Use
Microsoft's reasoning champion at 14B parameters. Tops charts for math and reasoning in its size class (84%+ MMLU). A compact powerhouse that fits comfortably on an RTX 4090 with room to spare for context.
Strengths
  • Exceptional reasoning/math — tops 14B class
  • 84%+ MMLU, strong GPQA scores
  • 128K context, fast inference
  • MIT license — fully open
Weaknesses
  • Less creative / broad general knowledge than Llama/Qwen
  • Narrower training data focus
Best For
  • STEM tasks, math tutoring
  • Research assistants on mid-range GPUs
  • Technical document analysis
  • Code review and generation
VRAM Breakdown
  • Q4: ~11 GB — fits RTX 4060 Ti 16GB
  • Q8: ~16 GB — fits RTX 4090
  • FP16: ~28 GB
Ollama ollama pull phi4:14b
Alibaba — Qwen
Qwen3 (14B)
14B ~10.7 GB 128K ctx Apache 2.0
Alibaba's 14B dense model with dual-mode thinking. Excels at multilingual tasks, coding, and long-context processing. A top community recommendation for users with 16GB VRAM GPUs.
Strengths
  • Top-tier multilingual and coding for 14B class
  • Dual-mode: fast + thinking
  • 128K context window
  • Apache 2.0 license
Weaknesses
  • Occasional fast-mode inconsistency
  • Slightly below Phi-4 on pure math
Ollama ollama pull qwen3:14b
DeepSeek
DeepSeek-R1 (14B distilled)
14B ~11 GB 64K ctx MIT Tool Use
The 14B distillation of DeepSeek-R1, bringing near-SOTA reasoning and coding ability to a single consumer GPU. Thinking mode enables complex multi-step problem solving.
Strengths
  • Near-SOTA coding and reasoning for size
  • RL-trained thinking traces
  • Excellent for agentic workflows
Weaknesses
  • Distilled — some quality loss vs full R1
  • Can be verbose in thinking mode
Ollama ollama pull deepseek-r1:14b
NVIDIA — Nemotron
Nemotron Nano 12B v2 VL
12B ~9 GB 128K ctx Vision+Video NVIDIA Open Tool Use
NVIDIA's multimodal reasoning model designed for document intelligence, video understanding, and visual Q&A. Hybrid Transformer-Mamba architecture combines accuracy with memory efficiency.
Strengths
  • Multi-image and video understanding
  • Leading OCR and document intelligence
  • Hybrid architecture — efficient memory
  • Reasoning mode toggle via system prompt
Weaknesses
  • Reasoning not supported for video inputs
  • Newer — smaller community ecosystem
Ollama ollama pull nemotron-nano:12b
Mistral AI
Mistral Small 3 (24B)
24B ~15 GB 128K ctx Apache 2.0 Tool Use
Sets the benchmark for sub-70B instruction-following. Excellent for creative writing, multilingual tasks, and coding. Strong European language support with Apache 2.0 licensing.
Strengths
  • Benchmark leader in the "small" LLM category
  • Strong coding and instruction following
  • Apache 2.0 — commercial friendly
Weaknesses
  • Heavier than 14B models — needs 16GB+ VRAM
  • Not as strong on pure math as Phi-4
Ollama ollama pull mistral-small
IBM — Granite
Granite 4.0 (350M → 32B MoE)
350M–32B 1B–9B active 0.5–20 GB 128K ctx Apache 2.0 Tool Use
IBM's enterprise-grade model family with a novel hybrid Mamba-2 architecture for faster inference and lower memory. Spans edge to server: Nano (350M/1B), Micro (3B), Tiny (7B MoE/1B active), Small (32B MoE/9B active). Trained on 15T tokens with strong tool calling and 12-language support.
Strengths
  • Hybrid Mamba-2 arch — faster inference, lower memory
  • Edge to server in one family (350M to 32B)
  • Strong instruction following and tool calling
  • Apache 2.0 — clean enterprise licensing
Weaknesses
  • Mamba-2 backend support still maturing in some tools
  • Smaller community than Llama/Qwen ecosystems
  • MoE variants need compatible inference stacks
Ollama ollama pull granite4 · ollama pull granite4:small-h

Medium Models — 15–35B Parameters

OpenAI — GPT-OSS
GPT-OSS (20B)
20B ~12 GB 128K ctx Open Tool Use
OpenAI's open-weight model, designed with GPT-style structured reasoning and tool-use capabilities. Agent-friendly with strong structured output generation. Runs well quantized on 24GB cards.
Strengths
  • GPT-like structured reasoning and tool-use
  • Agent-friendly design with function calling
  • Runs well quantized on consumer GPUs
  • Strong structured outputs
Weaknesses
  • Newer — smaller community than Llama
  • 128K context limit (no 1M option)
Ollama ollama pull gpt-oss:20b
Mistral AI
Devstral Small 2 (24B)
24B ~15 GB 128K ctx Apache 2.0
Mistral's dedicated coding model optimized for software engineering and agentic coding tasks. Top SWE-bench scores in its size class with strong multi-file code understanding.
Strengths
  • 68% SWE-bench — exceptional coding
  • Strong agentic task completion
  • Multi-file code understanding
Weaknesses
  • Coding-focused — less general purpose
  • Not ideal for creative or chat tasks
Ollama ollama pull devstral-small
Google — Gemma
Gemma 3 (27B)
27B ~22.5 GB 128K ctx Vision Gemma Tool Use
Google's strong all-rounder with multimodal vision capabilities. Efficient for its size with good performance across reasoning, coding, and multilingual tasks. Fits on a 24GB GPU at Q4.
Strengths
  • Well-balanced across all benchmarks
  • Built-in vision support
  • Efficient for size, good throughput
  • 140+ language support
Weaknesses
  • Tight fit on 24GB at full context
  • Not absolute SOTA on any single benchmark
Ollama ollama pull gemma3:27b
Zhipu AI (Z.ai) — GLM
GLM-4.7-Flash (30B MoE / 3B active)
30B total 3B active ~20 GB 200K ctx MoE Open
The efficiency king for coding on consumer hardware. A 30B MoE model that only activates 3B parameters per token, delivering 59.2% SWE-bench at 60–80 t/s. Community calls it "best 70B-or-less model" for UI generation and tool calling. Interleaved thinking between actions.
Strengths
  • 59.2% SWE-bench — top for local coding
  • 60–80 t/s at 4-bit — very fast
  • Interleaved + preserved thinking modes
  • Excellent tool-use and agentic capabilities
  • Runs on RTX 3090/4090 and Mac M-series
Weaknesses
  • Chat template issues with some runtimes
  • Needs 24GB+ VRAM for good experience
  • Smaller community than Qwen/Llama
Ollama ollama pull glm4:latest llama.cpp recommended — use --jinja flag
NVIDIA — Nemotron
Nemotron 3 Nano (30B-A3B)
31.6B total 3.6B active ~20 GB 1M ctx Hybrid MoE NVIDIA Open Tool Use
NVIDIA's hybrid Mamba-Transformer MoE model designed for agentic AI. 1M token context window with 4x faster inference than predecessors. Activates only 3.6B of 31.6B parameters. 91% Math 500 score — the highest among peers. Open weights, training data, AND recipes.
Strengths
  • 91% Math Index — top in class
  • 1M token context window
  • 3.3x faster throughput than Qwen3-30B
  • Hybrid Mamba-Transformer architecture
  • Fully open: weights + data + recipes
Weaknesses
  • Automatic CPU offloading can reduce speed
  • Context cliff at high token counts
  • Newer Mamba architecture — less battle-tested
Ollama ollama pull nemotron-nano
Alibaba — Qwen
Qwen3 (32B)
32B ~22 GB 128K ctx Apache 2.0
The community's top recommendation for 24GB GPUs. Exceptional multilingual, coding, and long-context stability — maintains 33–53 t/s at 48K tokens with zero offloading. The best long-context stability king in its class.
Strengths
  • Best long-context stability — 100% GPU at 48K
  • Exceptional multilingual and coding
  • Dual-mode thinking
  • Apache 2.0 license
Weaknesses
  • Tight fit on 24GB — context limited
  • Dense model — heavier than MoE alternatives
Ollama ollama pull qwen3:32b
DeepSeek
DeepSeek-R1 (32B)
32B ~22 GB 64K ctx MIT
Near-SOTA coding/math/reasoning with thinking mode at a size that fits on a single 24GB GPU. KV cache efficiency via MLA architecture. The go-to for developer tools and agentic workflows.
Strengths
  • Near-SOTA reasoning with thinking traces
  • KV cache efficiency (MLA architecture)
  • Excellent for agentic/developer workflows
  • MIT license
Weaknesses
  • May require vLLM patches for optimal speed
  • Verbose thinking traces eat context
Ollama ollama pull deepseek-r1:32b
Alibaba — Qwen
QwQ (32B)
32B ~22 GB 128K ctx Apache 2.0 Tool Use
Qwen's dedicated reasoning model, trained specifically for deep chain-of-thought problem solving. Excels at complex multi-step math and logic puzzles where extended reasoning is needed.
Strengths
  • Purpose-built for deep reasoning
  • Exceptional on math competitions (AIME)
  • 128K context for long reasoning chains
Weaknesses
  • Not a generalist — specialized for reasoning
  • Very verbose reasoning traces
Ollama ollama pull qwq
Allen AI (Ai2) — OLMo
OLMo 3.1 (7B / 32B)
7B / 32B 5–20 GB 65K ctx Apache 2.0
The only truly open-source model family — not just open weights, but full training code, Dolma 3 dataset (6T tokens), intermediate checkpoints, reward models, and OlmoTrace for auditing outputs back to training data. Think variant competitive with Qwen3 32B on reasoning.
Strengths
  • Fully open: code, data, checkpoints, logs — auditable end-to-end
  • Think variant strong on math, code, and reasoning
  • OlmoTrace lets you trace outputs to training data
  • Apache 2.0 — clean for enterprise and compliance
Weaknesses
  • Slightly behind Qwen3 on general chat tasks
  • Smaller community ecosystem than Llama/Qwen
  • 65K context (vs 128K+ on competitors)
Best For
  • Compliance environments requiring full auditability
  • Research where training data provenance matters
  • Reasoning tasks (Think variant)
  • Organizations that need Apache 2.0 + full transparency
VRAM Breakdown
  • 7B Q4: ~5 GB VRAM
  • 32B Q4: ~18–20 GB VRAM
  • 32B Q8: ~34 GB VRAM
Ollama ollama pull olmo-3.1 · ollama pull olmo-3.1:7b

Large Models — 36–80B Parameters

Meta — Llama
Llama 3.3 (70B)
70B 45–50 GB 128K ctx Community Tool Use
Near-405B performance in a 70B package. The most versatile large dense model with the biggest ecosystem of fine-tunes, adapters, and tooling. Production-grade quality for virtually any general task.
Strengths
  • Near-405B performance on many tasks
  • Massive ecosystem — thousands of fine-tunes
  • Versatile generalist with strong tool-calling
  • Production-grade quality
Weaknesses
  • 45–50 GB VRAM — needs multi-GPU or offloading
  • Safety alignment can feel restrictive for creative tasks
Ollama ollama pull llama3.3:70b
Alibaba — Qwen
Qwen3 (72B)
72B ~50 GB 128K ctx Apache 2.0
Alibaba's flagship dense model. Frontier-level multilingual and coding performance with the full dual-mode thinking system. Apache 2.0 licensed for commercial use.
Strengths
  • Frontier-level multilingual/coding/reasoning
  • Dual-mode thinking system
  • Apache 2.0 — fully commercial
Weaknesses
  • ~50 GB VRAM — multi-GPU needed
  • HF approval may be required
Ollama ollama pull qwen3:72b
DeepSeek
DeepSeek-R1 (70B)
70B ~45 GB 128K ctx MIT Tool Use
The 70B distillation of DeepSeek-R1, delivering top-tier reasoning and coding with thinking traces. High-stakes problem solving with MIT licensing.
Strengths
  • Top reasoning and coding at 70B tier
  • Thinking mode for complex problems
  • MIT license — fully permissive
Weaknesses
  • 45+ GB VRAM requirement
  • Verbose thinking traces
Ollama ollama pull deepseek-r1:70b

Frontier Models — 80B+ Parameters

Meta — Llama 4
Llama 4 Scout (109B MoE / 17B active)
109B total 17B active 55+ GB (Q4) 10M ctx MoE 16 experts Vision Llama 4 Tool Use
Meta's revolutionary MoE model with an industry-leading 10M token context window. Only 17B parameters active per token across 16 experts, enabling near-70B quality at a fraction of the inference cost. Natively multimodal with text and image processing. Ultra-quantized versions (1.78-bit) fit on a single 24GB GPU.
Strengths
  • 10M token context — industry leading
  • MoE: 70B+ quality at 17B inference cost
  • Natively multimodal (text + vision)
  • 1.78-bit quant fits on 24GB (~20 t/s)
  • Fits on single H100 at INT4
Weaknesses
  • Full weights need 216GB VRAM
  • Early reviews found quality short of the launch hype
  • Extreme quant degrades quality noticeably
Ollama ollama pull llama4-scout
OpenAI — GPT-OSS
GPT-OSS (120B MoE / 5B active)
120B total 5B active 60+ GB 128K ctx MoE Open Tool Use
OpenAI's largest open model with 62.7% SWE-bench. MoE architecture activates only 5B of 120B parameters per token. Strong GPT-style reasoning with advanced tool-use and structured outputs.
Strengths
  • 62.7% SWE-bench — strong coding
  • GPT-style reasoning quality
  • Advanced tool-use and agents
Weaknesses
  • Needs 48GB+ VRAM
  • Newer ecosystem — fewer fine-tunes
Ollama ollama pull gpt-oss:120b
Alibaba — Qwen
Qwen3 (235B MoE / 22B active)
235B total 22B active 55+ GB 128K ctx MoE Apache 2.0 Tool Use
Near-frontier multilingual and long-context performance via Mixture of Experts. Quality Index of 57 — among the top open models. Activates 22B of 235B parameters per token for efficient inference.
Strengths
  • Near-frontier quality (Quality Index: 57)
  • 22B active params — efficient MoE
  • Exceptional multilingual/long-context
  • Apache 2.0 license
Weaknesses
  • 55+ GB VRAM minimum
  • Multi-GPU needed for full power
Ollama ollama pull qwen3:235b
Meta — Llama 4
Llama 4 Maverick (400B MoE / 17B active)
400B total 17B active 200+ GB (Q4) 1M ctx MoE 128 experts Vision Llama 4 Tool Use
Meta's highest-performance open model. 128 experts with 1M context window. Competes with GPT-4o class models. Natively multimodal with text and image understanding. Requires multi-GPU setup for production use.
Strengths
  • GPT-4o class performance
  • 128 experts, 1M context
  • Natively multimodal
  • Co-distilled from Llama Behemoth
Weaknesses
  • 200+ GB VRAM — requires 4+ H100s at Q4
  • 1.78-bit quant needs 2x48GB (~40 t/s)
  • Extremely resource intensive
Ollama ollama pull llama4-maverick
Zhipu AI (Z.ai) — GLM
GLM-4.7 Full (355B MoE / 32B active)
355B total 32B active 205+ GB 200K ctx MoE Open Tool Use
Zhipu's flagship MoE model with interleaved thinking, preserved thinking, and turn-level thinking. 73.8% SWE-bench, 66.7% SWE-bench Multilingual. Exceptional for agentic coding and complex multi-step tasks.
Strengths
  • 73.8% SWE-bench — top-tier coding
  • Advanced thinking modes (interleaved, preserved)
  • Strong multilingual agentic coding
  • Competitive with Sonnet 3.5 for coding
Weaknesses
  • 205+ GB minimum — needs multi-GPU cluster
  • 2-bit GGUF needs 135GB + 128GB RAM
llama.cpp / vLLM recommended ollama pull glm4.7
DeepSeek
DeepSeek V3.2 / R1 (671B MoE / 37B active)
671B total 37B active 300+ GB 128K ctx MoE MIT Tool Use
DeepSeek's full-scale MoE model — one of the most capable open models ever released. Rivals GPT-5 class on coding/reasoning benchmarks. MLA architecture provides extreme KV cache efficiency. Quality Index: 57.
Strengths
  • Near-frontier on all benchmarks
  • MLA architecture — extreme KV cache efficiency
  • 37B active params — efficient despite size
  • MIT license — fully permissive
Weaknesses
  • Datacenter-scale hardware required
  • Requires patched vLLM for optimal speed
  • 300+ GB VRAM minimum
vLLM / SGLang recommended for production ollama pull deepseek-v3.2-exp
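At this scale the practical question is GPU count. A back-of-the-envelope sizing helper for vLLM's `--tensor-parallel-size` flag — the power-of-two TP degrees and the 85% headroom reserved for KV cache and activations are assumptions for the sketch, not vLLM requirements:

```python
def min_tensor_parallel(model_gb: float, gpu_gb: float = 80.0,
                        headroom: float = 0.85) -> int:
    """Smallest power-of-two GPU count whose combined memory (after
    reserving headroom for KV cache and activations) fits the weights.
    Tensor-parallel degrees are conventionally powers of two."""
    tp = 1
    while model_gb > tp * gpu_gb * headroom:
        tp *= 2
    return tp

# DeepSeek V3.2 at ~300 GB in Q4: how many 80 GB H100s?
print(min_tensor_parallel(300))  # 8
```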
Zhipu AI (Z.ai) — GLM
GLM-5 (744B MoE / 40B active)
744B total 40B active 1.5 TB (FP16) 200K ctx MoE MIT Tool Use
The largest open-weight model as of Feb 2026. 744B parameters (40B active) trained on 28.5T tokens. 86% GPQA-Diamond (graduate-level reasoning), 90% HumanEval. DeepSeek Sparse Attention for long-context efficiency. 2-bit GGUF: 241GB — fits 256GB unified memory Mac.
Strengths
  • 86% GPQA-Diamond — exceptional reasoning
  • 90% HumanEval — top coding
  • Quality Index: 49.64 — among the top open models
  • MIT license — fully open
  • 2-bit GGUF fits 256GB Mac
Weaknesses
  • FP16 needs 1.5TB VRAM (8x H200 minimum)
  • 2-bit quant: ~5 t/s with offloading
  • Consumer-unfriendly at full quality
llama.cpp / vLLM (8xH200+ for full quality) ollama pull glm5
Mistral AI
Mistral Large 3 (675B MoE)
675B total 300+ GB 128K ctx MoE Mistral Tool Use
Mistral's largest model to date. Positions itself as one of the strongest open-weight choices for advanced reasoning and high-end self-hosted assistants. Premium quality local inference.
Strengths
  • Premium reasoning and creative quality
  • Strong European language support
  • Advanced instruction following
Weaknesses
  • Datacenter-scale hardware required
  • Restrictive license vs Apache 2.0 models
vLLM / SGLang ollama pull mistral-large

Specialty — Coding, Vision, Embedding

Alibaba — Qwen
Qwen3-Coder (480B MoE / 35B active)
480B total 35B active 200+ GB 256K ctx MoE Apache 2.0
Alibaba's dedicated agentic coding model with massive 480B MoE architecture. 55.4% SWE-bench. Optimized for large-scale code generation and software engineering tasks.
HF / vLLM ollama pull qwen3-coder
DeepSeek
DeepSeek Coder V2 (16B / 236B)
16B / 236B 10–50+ GB 128K ctx MoE MIT Tool Use
Purpose-built for code generation with MoE efficiency. Excellent for code completion, generation, and review. The 16B version fits on consumer GPUs.
Ollama ollama pull deepseek-coder-v2
Moonshot AI — Kimi
Kimi K2.5
MoE Varies 262K ctx Open
Moonshot's thinking-focused model with systematic reasoning for research and planning tasks. Exceptional multi-step reasoning with 262K context window. Strong math competition performance.
Ollama / vLLM ollama pull kimi-k2.5
Cohere — Command R
Command R+ (104B)
104B ~60 GB (Q4) 128K ctx CC-BY-NC Tool Use
Purpose-built for RAG and multi-step tool use. Generates grounded responses with inline citations. Highly efficient multilingual tokenizer (10+ languages) means lower cost per token for non-English content. Outperforms GPT-4 Turbo on tool-use benchmarks.
Strengths
  • Best-in-class RAG — grounded generation with citations
  • Zero-shot multi-step tool use
  • Efficient tokenizer cuts cost for multilingual workloads
  • Strong enterprise workflow integration
Weaknesses
  • CC-BY-NC license — no commercial use without Cohere agreement
  • 104B needs 60+ GB VRAM (Q4) — multi-GPU or 80GB cards
  • Older architecture — not MoE, so VRAM scales linearly
Ollama ollama pull command-r-plus
Various — Vision Models
LLaVA / Qwen2.5-VL / InternVL2.5 / Llama 3.2-Vision
7B–72B +2–5 GB overhead 128K ctx Vision
Vision-language models that add image understanding to base LLMs. LLaVA pioneered the approach. Qwen2.5-VL leads benchmarks with document OCR and video understanding. InternVL2.5 (Shanghai AI Lab) is the top open-source vision model for complex reasoning over images. Llama 3.2-Vision rounds out Meta's multimodal offering.
Top Picks
  • Qwen2.5-VL: Best all-around — OCR, video, charts, documents
  • InternVL2.5: Strongest visual reasoning and multi-image
  • Llama 3.2-Vision: Best Meta ecosystem integration
  • LLaVA: Lightweight, great for experimentation
Notes
  • Add 2–5 GB VRAM overhead on top of base model
  • Larger vision models (72B) need 48 GB+ VRAM
  • Quality varies heavily by size — 7B vision ≠ 72B vision
Ollama ollama pull llava ollama pull llama3.2-vision
Various — Embeddings
nomic-embed-text / bge-m3 / snowflake-arctic-embed
137M–335M <1 GB 8K tokens Apache 2.0
Lightweight embedding models for RAG pipelines, semantic search, and memory systems. Run alongside any LLM with negligible VRAM overhead. Essential for building retrieval-augmented generation systems.
Ollama ollama pull nomic-embed-text ollama pull bge-m3
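The retrieval core these models power is small enough to sketch directly, using Ollama's `/api/embeddings` endpoint and plain cosine similarity — the query and documents are toy examples:

```python
import json
import math
import urllib.request

def cosine(a, b) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def embed(text: str, model: str = "nomic-embed-text",
          host: str = "http://localhost:11434"):
    """Fetch one embedding from Ollama's /api/embeddings endpoint."""
    payload = json.dumps({"model": model, "prompt": text}).encode()
    req = urllib.request.Request(f"{host}/api/embeddings", data=payload,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["embedding"]

def demo():
    # Requires a running Ollama server with the embedding model pulled.
    q = embed("How do I reset my password?")
    docs = ["Resetting your password", "Quarterly revenue report"]
    best = max(docs, key=lambda d: cosine(q, embed(d)))
    print(best)  # the semantically closer document wins
```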

Speech-to-Text (STT / ASR)

OpenAI — Whisper
Whisper Large V3 / V3 Turbo
1.55B / 809M 6–10 GB 99+ languages MIT
The gold standard for multilingual speech recognition. 99+ languages, automatic language detection, phrase-level timestamps, and punctuation. V3 Turbo cuts decoder layers from 32 to 4, delivering 6x faster inference with only 1–2% accuracy loss. 7.4% WER average on mixed benchmarks.
Strengths
  • 99+ language support — best multilingual STT
  • Automatic language identification
  • Phrase-level timestamps
  • Turbo variant: 6x faster, 809M params
  • Handles noise and accents well
Weaknesses
  • Large V3 needs ~10 GB VRAM
  • Not streaming-native (batch-oriented)
  • Can hallucinate on silence or music
pip pip install openai-whisper or faster-whisper for CTranslate2
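A transcription sketch using the faster-whisper (CTranslate2) route the card recommends, plus a small pure helper for SRT-style timestamps. The model name, audio path, and int8 choice are illustrative:

```python
def srt_time(seconds: float) -> str:
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def demo():
    # Requires `pip install faster-whisper` and an audio file on disk.
    from faster_whisper import WhisperModel
    model = WhisperModel("large-v3-turbo", compute_type="int8")  # int8 keeps VRAM low
    segments, info = model.transcribe("meeting.wav")
    for seg in segments:
        print(f"{srt_time(seg.start)} --> {srt_time(seg.end)}  {seg.text.strip()}")
```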
Hugging Face
Distil-Whisper
756M ~3 GB MIT
6x faster than Whisper Large V3 with only 1% WER degradation. Distilled from Whisper for low-latency transcription. Ideal for real-time applications on consumer hardware.
pip pip install transformers accelerate
NVIDIA — NeMo
Parakeet TDT (0.6B / 1.1B)
0.6B / 1.1B 2–4 GB CC-BY-4.0
NVIDIA's speed-optimized ASR model. RTFx near 2,000x — processes audio dramatically faster than Whisper. RNN-Transducer architecture enables streaming recognition with minimal latency. Trained on 65,000 hours of English audio.
Strengths
  • Among the fastest open ASR models
  • Streaming-capable — real-time transcription
  • 65K hours of training data
  • Optimized for NVIDIA GPUs
Weaknesses
  • English-only
  • Ranks lower on pure accuracy vs Whisper
  • Speed-optimized — accuracy tradeoff
pip pip install nemo_toolkit[asr]
NVIDIA / IBM
Canary Qwen 2.5B / Granite Speech 3.3 8B
2.5B / 8B 4–10 GB Various
Top-accuracy English STT models. Canary Qwen combines speech recognition with the Qwen language model for superior contextual understanding. IBM Granite Speech brings enterprise-grade accuracy. Best for strict accuracy requirements.
NeMo / HF Transformers pip install nemo_toolkit[asr]
Useful Sensors
Moonshine (Tiny / Base)
27M / 61M <0.5 GB MIT
Ultra-lightweight edge ASR model. Outperforms Whisper Tiny and Small despite being significantly smaller. Designed for smartphones, IoT, and offline environments where every MB counts.
pip pip install moonshine

Text-to-Speech (TTS / Voice Synthesis)

Canopy Labs
Orpheus TTS (3B)
3B ~4 GB Apache 2.0
The breakthrough TTS model of late 2025. Human-like emotional speech that rivals ElevenLabs — laughing, crying, whispering on command. Real-time on modern GPUs. State-of-the-art naturalness with emotional control, completely free and local.
Strengths
  • State-of-the-art naturalness — rivals ElevenLabs
  • Emotional control (laughing, crying, whispering)
  • Real-time on modern GPUs
  • Apache 2.0 — fully commercial
Weaknesses
  • English-focused (multilingual expanding)
  • Needs GPU for real-time speeds
  • Newer — smaller community than Piper
pip pip install orpheus-tts
Kokoro
Kokoro-82M
82M <1 GB Apache 2.0
The breakout star of local TTS in 2026. Only 82M parameters yet delivers neural-quality speech with breathing and natural pauses. Runs on CPU, Apple Silicon, or any GPU. Shockingly good for its tiny size.
Strengths
  • 82M params — runs on anything
  • Neural quality (breathing, pausing)
  • CPU and Apple Silicon capable
  • Near-zero VRAM requirement
Weaknesses
  • Limited voice cloning ability
  • Fewer emotional controls than Orpheus
pip pip install kokoro
Fish Audio
Fish Speech V1.5 (S1-mini)
~500M 2–4 GB Apache 2.0
The go-to open-source model for voice cloning across languages. Handles code-switching (e.g., Spanglish) better than most paid APIs. Strong multilingual voice synthesis with zero-shot cloning from short reference audio.
Strengths
  • Best open-source voice cloning
  • Excellent cross-language code-switching
  • Zero-shot cloning from ~10s audio
  • Strong multilingual support
Weaknesses
  • Higher VRAM than Kokoro/Piper
  • Quality varies by language
pip pip install fish-speech
Coqui AI
XTTS v2 / Coqui TTS
~450M 2–4 GB 17 languages MPL-2.0
High-quality multilingual TTS with voice cloning from a 6-second reference. 17 languages supported. The broadest toolkit in open-source TTS with pre-trained voices, fine-tuning, and extensive documentation. Runs well on MacBook Air with 16GB.
Strengths
  • 17 language support out-of-box
  • Voice cloning from 6-second reference
  • 1100+ pre-trained voices
  • Extensive documentation and community
Weaknesses
  • Higher latency than Piper/MeloTTS
  • MPL license (some restrictions)
pip pip install TTS
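Long-form synthesis with XTTS is usually done clip by clip. A sketch assuming the Coqui `TTS` Python API; the ~250-character chunk ceiling is a conservative assumption for stability, not a documented hard limit:

```python
import re

def sentence_chunks(text: str, max_chars: int = 250):
    """Split long text into sentence-aligned chunks; long-form TTS is
    more stable when each call stays under a few hundred characters."""
    chunks, current = [], ""
    for sent in re.split(r"(?<=[.!?])\s+", text.strip()):
        if current and len(current) + len(sent) + 1 > max_chars:
            chunks.append(current)
            current = sent
        else:
            current = f"{current} {sent}".strip()
    if current:
        chunks.append(current)
    return chunks

def demo():
    # Requires `pip install TTS` plus a ~6 s reference clip for cloning.
    from TTS.api import TTS
    tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")
    for i, chunk in enumerate(sentence_chunks(open("chapter.txt").read())):
        tts.tts_to_file(text=chunk, speaker_wav="reference.wav",
                        language="en", file_path=f"out_{i:03d}.wav")
```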
Open Home Foundation
Piper TTS
Various (~15M–60M) <0.5 GB 40+ languages MIT
Ultra-fast, ultra-lightweight neural TTS designed for offline and embedded use. Sub-second latency, 40+ languages, runs on Raspberry Pi. The default TTS for Home Assistant. Doesn't clone voices but offers many pre-trained speakers.
Strengths
  • Fastest open TTS — sub-second latency
  • Runs on Raspberry Pi / embedded
  • 40+ languages, 100+ voices
  • Home Assistant integration
Weaknesses
  • No voice cloning
  • Less natural than Orpheus/XTTS
  • Fixed voice catalog
pip pip install piper-tts
Suno AI
Bark
~900M 4–6 GB MIT
Not just TTS — Bark generates music, sound effects, and non-verbal vocalizations (laughs, sighs, throat clears). The most creative and playful model on this list. Can generate background music and ambient sounds alongside speech.
Strengths
  • Speech + music + sound effects
  • Non-verbal vocalizations
  • Creative and expressive
  • MIT license
Weaknesses
  • High latency — not real-time
  • Unpredictable output quality
  • GPU-heavy for best results
pip pip install git+https://github.com/suno-ai/bark.git
MyShell
MeloTTS
~200M <1 GB 5 languages MIT
Lightweight and remarkably consistent TTS. Maintains low latency even with long texts — processes short texts in under a second. 5 languages with natural prosody. Ideal for low-resource devices and consistent production use.
pip pip install melotts

Embedding, Search & Retrieval

Nomic AI
Nomic Embed Text V2
475M (305M active) <1 GB 8K tokens MoE Apache 2.0
First MoE embedding model. Trained on 1.6B multilingual pairs across 100+ languages. Supports flexible output dimensions (256–768) via Matryoshka learning. Competitive with models twice its size on BEIR and MIRACL benchmarks. 86.2% top-5 accuracy.
Strengths
  • MoE — efficient with 305M active params
  • 100+ languages supported
  • Flexible dimensions (256–768)
  • Top BEIR/MIRACL scores for size
Weaknesses
  • Accuracy can drop on noisy or out-of-domain data
  • Needs prefix prompts for optimal results
Ollama ollama pull nomic-embed-text
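Matryoshka-trained embeddings can be shortened after the fact: keep a prefix of the vector and L2-renormalize. A minimal sketch — the 768→256 cut is just one example within the card's 256–768 range:

```python
import math

def truncate_embedding(vec, dim: int):
    """Matryoshka-style truncation: keep the first `dim` components,
    then renormalize so cosine similarity still behaves."""
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head)) or 1.0
    return [x / norm for x in head]

# A 768-d embedding shrunk to 256-d for a cheaper vector index.
v = truncate_embedding([0.3] * 768, 256)
print(len(v))  # 256
```

The payoff is a 3× smaller index and faster ANN search with only a modest recall hit, which is exactly what the flexible-dimension training is for.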
BAAI (Beijing Academy)
BGE-M3
~570M <1 GB 8K tokens MIT
The Swiss army knife of embedding models. M3 = Multi-functionality (dense + sparse + ColBERT retrieval), Multi-linguality (100+ languages), Multi-granularity (up to 8K tokens). SOTA on MIRACL and MKQA. The first model to unify all three retrieval methods.
Strengths
  • Dense + sparse + ColBERT in one model
  • 100+ languages — best cross-lingual
  • 8K token input length
  • SOTA on multilingual benchmarks
Weaknesses
  • Slightly slower than single-mode models
  • Requires prompt engineering for best results
Ollama ollama pull bge-m3
Alibaba — Qwen
Qwen3-Embedding (0.6B / 4B / 8B)
0.6B–8B 0.5–7 GB 32K tokens Apache 2.0
Alibaba's instruction-aware embedding models built on Qwen3. Support user-defined task instructions for 1–5% accuracy improvement. Flexible output dimensions (32–1024). 100+ natural and programming languages. The 4B and 8B variants outperform most competitors.
pip pip install sentence-transformers
Snowflake
Arctic Embed (Various sizes)
22M–335M <0.5 GB 512 tokens Apache 2.0
Snowflake's optimized embedding suite. Multiple sizes from 22M to 335M for different accuracy/speed tradeoffs. Strong English retrieval performance with focus on enterprise search workloads.
Ollama ollama pull snowflake-arctic-embed
BAAI (Beijing Academy)
BGE Reranker v2-M3 / Gemma-based
~570M–9B 0.5–8 GB 8K tokens Apache 2.0
The most popular open-source rerankers for RAG pipelines. Cross-encoder architecture processes query + document together for precise relevance scoring. Adds 100–500ms latency but significantly improves retrieval quality. Run after initial embedding search to rerank top-K candidates.
Strengths
  • Dramatically improves RAG accuracy
  • Runs on consumer hardware
  • Multiple sizes for speed/accuracy tradeoff
  • Apache 2.0 — no licensing fees
Weaknesses
  • Adds 100–500ms latency per query
  • Cross-encoder — can't precompute
  • Larger variants need significant VRAM
pip pip install FlagEmbedding
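The two-stage pattern the card describes — embed-retrieve, then rerank the top-K — looks roughly like this with the FlagEmbedding package (model name per the card; the query and candidates are toy data):

```python
def rerank(scores, docs, top_k: int = 3):
    """Pair cross-encoder scores with docs and keep the best top_k."""
    ranked = sorted(zip(scores, docs), key=lambda p: p[0], reverse=True)
    return [doc for _, doc in ranked[:top_k]]

def demo():
    # Requires `pip install FlagEmbedding`; downloads the model on first run.
    from FlagEmbedding import FlagReranker
    reranker = FlagReranker("BAAI/bge-reranker-v2-m3", use_fp16=True)
    query = "how to cancel a subscription"
    candidates = ["Cancelling your plan", "Upgrading storage", "Refund policy"]
    scores = reranker.compute_score([[query, d] for d in candidates])
    print(rerank(scores, candidates, top_k=2))
```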
Stanford / AnswerAI
ColBERTv2
~110M <1 GB 512 tokens MIT
Late-interaction retrieval model that indexes token-level embeddings for scalable BERT-quality search at millisecond latency. More precise than traditional single-vector embeddings at comparable speed; its MaxSim operator performs efficient contextual matching across large collections.
pip pip install colbert-ai

Image Generation (Diffusion Models)

Stability AI
Stable Diffusion 1.5
860M 4–6 GB 512×512 native CreativeML Open
The model that started the local image generation revolution. Still viable in 2026 for low-VRAM setups. The largest ecosystem of LoRAs, checkpoints, and fine-tunes. Thousands of community models on CivitAI. Best entry point for learning diffusion workflows.
Strengths
  • Runs on 4 GB VRAM — most accessible
  • Largest LoRA/checkpoint ecosystem ever
  • Fastest generation times
  • Mature tooling (A1111, ComfyUI)
Weaknesses
  • 512×512 native — needs upscaling
  • Weaker prompt adherence vs newer models
  • Poor text rendering in images
ComfyUI / A1111 pip install diffusers transformers accelerate
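A minimal diffusers sketch for SD 1.5, plus a pure helper showing why dimensions should be multiples of 8 (the VAE downsamples 8×). CPU offload keeps the pipeline within low-VRAM cards at some speed cost; the prompt and filename are illustrative:

```python
def latent_shape(width: int, height: int, channels: int = 4, factor: int = 8):
    """SD-style VAEs downsample by 8x, so pixel dims must divide evenly
    (multiples of 64 give the best results)."""
    assert width % factor == 0 and height % factor == 0, "use multiples of 8"
    return (channels, height // factor, width // factor)

def demo():
    # Requires `pip install diffusers transformers accelerate` and a GPU.
    import torch
    from diffusers import StableDiffusionPipeline
    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16)
    pipe.enable_model_cpu_offload()  # trades speed for ~4 GB-class VRAM use
    image = pipe("a lighthouse at dawn, oil painting",
                 width=512, height=512, num_inference_steps=25).images[0]
    image.save("lighthouse.png")
```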
Stability AI
Stable Diffusion XL (SDXL 1.0)
3.5B (UNet) 8–12 GB 1024×1024 native CreativeML Open
The most widely used open-source image model in existence. 1024×1024 native with dramatically better prompt adherence, photorealism, and composition than SD 1.5. Deepest community ecosystem — thousands of fine-tuned checkpoints (Juggernaut XL, RealVis, DreamShaper) and LoRAs on CivitAI.
Strengths
  • 1024×1024 native resolution
  • Largest fine-tune ecosystem (Juggernaut, RealVis, etc.)
  • Excellent photorealism with right checkpoint
  • ControlNet, inpainting, upscaling support
  • Commercial use allowed
Weaknesses
  • 8 GB minimum, 12 GB recommended
  • Refiner adds complexity and VRAM
  • Text rendering still inconsistent
Variants SDXL 1.0 · SDXL Turbo · SDXL Lightning
SDXL Lightning: 1–4 step generation. SDXL Turbo: real-time. Use ComfyUI or Forge WebUI.
Stability AI
Stable Diffusion 3.5 (Medium / Large / Turbo)
2.5B–8B 12–24 GB 1024×1024 Stability Community
Stability's latest architecture with improved text rendering inside images and better prompt understanding. Uses MMDiT (Multi-Modal Diffusion Transformer). Better typography than SDXL but smaller community fine-tune ecosystem. Medium variant fits 12 GB, Large needs 24 GB.
Strengths
  • Improved text rendering in images
  • Better prompt fidelity than SDXL
  • Multiple size variants
  • Compatible with existing SD tooling
Weaknesses
  • 12–24 GB VRAM requirement
  • Much smaller LoRA/fine-tune ecosystem
  • Community license — not fully open
ComfyUI pip install diffusers transformers
Black Forest Labs
FLUX.1 (Schnell / Dev)
12B 8–24 GB 1024×1024+ Apache 2.0 (Schnell) / NC (Dev)
The next generation from the original Stable Diffusion creators. 12B parameter DiT architecture delivering Midjourney-level quality locally. Best text rendering of any open model. Schnell = 4-step fast generation (Apache 2.0 commercial). Dev = 28–35 step high quality (non-commercial). GGUF quants run on 6–8 GB.
Strengths
  • Midjourney-level quality — locally
  • Best text rendering in open models
  • Schnell: 4 steps, Apache 2.0 commercial
  • GGUF/NF4 quants: 6–8 GB VRAM
  • Excellent anatomy and photorealism
Weaknesses
  • Full FP16: needs 24 GB VRAM
  • Dev license is non-commercial
  • Smaller fine-tune ecosystem than SDXL
  • No A1111 support — ComfyUI or Forge only
ComfyUI git clone https://github.com/comfyanonymous/ComfyUI.git
NF4 quant: 8 GB VRAM, slight detail loss. FP8: 16 GB, near-lossless. Full BF16: 24 GB. LoRA training needs 24 GB+.
Black Forest Labs
FLUX.2 (Dev / Pro / Flex / Klein)
4B–32B 13–24 GB 1024×1024+ Various
Released November 2025. Production-grade successor to FLUX.1 with state-of-the-art quality. Klein (4B) fits 13–16 GB. Dev (32B) is open-weight. Pro rivals top proprietary models. Flex variant gives fine-grained control over generation parameters.
Strengths
  • State-of-the-art open image quality
  • Klein 4B variant fits consumer GPUs
  • Exceptional prompt fidelity
  • Flex: developer-friendly parameter control
Weaknesses
  • Dev 32B needs significant VRAM
  • Pro is API-only
  • Commercial licensing required for some variants
ComfyUI / Diffusers pip install diffusers transformers
Zhipu AI
Z-Image-Turbo
~3B ≤16 GB 1024×1024+ Apache 2.0
Sub-second inference on enterprise GPUs, comfortable on 16 GB consumer cards. Matches or exceeds FLUX.2 Dev and HunyuanImage 3.0 on benchmarks at a fraction of the compute. Standout bilingual text rendering (English + Chinese) with high clarity. Apache 2.0 — fully commercial.
Strengths
  • Sub-second generation — fastest quality model
  • Fits 16 GB consumer GPUs
  • Best bilingual text rendering (EN/CN)
  • Apache 2.0 — fully commercial
  • Beats models 10x its size
Weaknesses
  • New — small community ecosystem
  • Fewer LoRAs/fine-tunes available
Diffusers / ComfyUI pip install diffusers
Stability AI
Stable Cascade
~5B (staged) 8–16 GB 1024×1024 Stability Community
Three-stage architecture (Stage A/B/C) that compresses the latent space dramatically — 40% less VRAM than SDXL for comparable quality. Much improved text rendering inside images. Faster training with fewer data requirements.
Diffusers pip install diffusers
PixArt (Alpha Lab)
PixArt-α / PixArt-Σ
600M 4–8 GB 1024×1024+ Apache 2.0
Remarkably efficient DiT model — only 600M params yet competitive with SDXL. PixArt-Σ supports 4K resolution generation. Trained with 90% less compute than SD. Excellent for resource-constrained setups that still need high-quality output.
Diffusers pip install diffusers
Various
ESRGAN / Real-ESRGAN / 4x-UltraSharp
~16M 1–2 GB BSD / Apache
AI upscaling models essential for any image generation pipeline. Real-ESRGAN handles 4x upscaling with face enhancement. 4x-UltraSharp is the community favorite for sharp detail. Tiny VRAM footprint — pairs with any generation model.
pip pip install realesrgan

Video Generation (Text/Image-to-Video)

Alibaba — Wan
Wan 2.1 / 2.2 (1.3B / 5B / 14B / 27B MoE)
1.3B–27B 8–24+ GB Apache 2.0
The most versatile open-source video model suite. Text-to-video, image-to-video, video editing, and video-to-audio. The 1.3B runs on consumer GPUs (~8 GB), while the 14B delivers SOTA quality that rivals commercial models. Wan 2.2 adds MoE architecture (27B total / 14B active) for improved detail with same inference cost. Bilingual English/Chinese text generation in video.
Strengths
  • 1.3B variant fits on almost any GPU (~8 GB)
  • SOTA among open-source video models
  • T2V, I2V, video editing, V2A — full pipeline
  • ComfyUI + Diffusers integration
  • FP8 and GGUF quants available
Weaknesses
  • 14B needs offloading on 24 GB (4+ min per clip)
  • 480p is more stable than 720p on 1.3B
  • Complex face consistency can vary
Diffusers pip install diffusers ComfyUI recommended
Tencent — Hunyuan
HunyuanVideo (13B)
13B 14–80 GB Community
One of the largest open-source video generation models. Cinema-quality output with exceptional motion coherence and facial realism. Uses a "dual-stream to single-stream" transformer with a causal 3D VAE. Supports FP8 weights for lower VRAM. Has spawned community fine-tunes like SkyReels V1 for cinematic human-centric content.
Strengths
  • Best face/human rendering among open models
  • Cinema-grade temporal consistency
  • Camera motion control (zoom, pan, tilt)
  • xDiT multi-GPU parallelism support
  • FP8 mode runs on 14 GB with offloading
Weaknesses
  • 8–15 min per clip on RTX 4090
  • Full precision needs 80 GB+
  • Prompt engineering required for best results
ComfyUI pip install diffusers
Lightricks
LTX Video (2B / 13B)
2B / 13B 12–24 GB Open
The speed champion — generates 30fps video at 1216×704 faster than real-time on RTX 4090. First DiT-based model optimized for rapid iteration. Multiple variants: 13B dev, 13B distilled, 2B distilled, and FP8 builds. Includes spatial and temporal upscalers.
Strengths
  • 5–10 second generation on RTX 4090
  • 30fps output at up to 1216×704
  • FP8 and distilled variants for efficiency
  • Built-in upscaler pipeline
Weaknesses
  • Lower quality ceiling than Wan/Hunyuan
  • Struggles with close-up faces
  • Best with concise 10–20 word prompts
Diffusers pip install diffusers
Zhipu AI — CogVideo
CogVideoX (2B / 5B)
2B / 5B 8–16 GB Apache 2.0
Solid mid-range video generation with 3D Causal VAE technology. Generates 6-second 720×480 clips with strong prompt adherence. CogVideoX 1.5 supports 10-second videos at higher resolution and I2V generation at any resolution. Good LoRA fine-tuning support via CogKit framework.
Strengths
  • Excellent detail and semantic accuracy
  • Strong Diffusers integration
  • LoRA fine-tuning with CogKit
  • 5B fits on consumer 16 GB GPUs
  • Supports quantized inference (TorchAO)
Weaknesses
  • Close-up faces can struggle
  • 6–12 min generation on RTX 4090
  • Fixed resolution modes
pip pip install cogkit
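A generation sketch with the diffusers `CogVideoXPipeline`, plus the frame-count convention behind the card's 6-second clips (8 fps, so 8·t + 1 frames); the prompt and output path are illustrative:

```python
def frames_for(seconds: float, fps: int = 8) -> int:
    """CogVideoX samples at 8 fps with frame counts of the form 8·t + 1
    (the stock 6-second clip is 49 frames)."""
    return int(fps * seconds) + 1

def demo():
    # Requires `pip install diffusers transformers accelerate` and a GPU.
    import torch
    from diffusers import CogVideoXPipeline
    from diffusers.utils import export_to_video
    pipe = CogVideoXPipeline.from_pretrained(
        "THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16)
    pipe.enable_model_cpu_offload()  # keeps peak VRAM within consumer range
    frames = pipe("a paper boat drifting down a rainy street",
                  num_frames=frames_for(6)).frames[0]
    export_to_video(frames, "boat.mp4", fps=8)
```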
Genmo AI
Mochi 1 (10B)
10B 60–80 GB Apache 2.0
Among the largest dedicated text-to-video models under Apache 2.0. Asymmetric Diffusion Transformer (AsymmDiT) architecture with a custom VAE that compresses video 128×. Exceptional motion fluidity at 30fps and strong photorealistic faces in slow-motion shots. Best for high-fidelity short clips.
Strengths
  • Best natural motion quality among open models
  • Apache 2.0 — fully commercial
  • 30fps photorealistic output
  • Fine-tuning with custom video datasets
Weaknesses
  • Needs 60–80 GB VRAM (A100/H100 territory)
  • 480p max resolution currently
  • 10–20 min per clip on RTX 4090 (with offload)
pip pip install diffusers

💬 Suggest an Improvement

Missing a model? Found incorrect info? Have a feature request? Help make this reference better for everyone.
