Local AI Model Index
Deep dives, one click away
The hard stuff — install paths, quantization, complete stacks, weekly changelog — each lives on its own page.
What's new
Eight weeks of model releases and feature launches — Nemotron 3, GLM-5 744B, Music + 3D categories, Community Analytics integration.
8 entries · weekly cadence View timelineCommunity pulse
Real HuggingFace download counts and likes. What's trending, what's rising, what's discussed — refreshed every 6 hours.
Trending: Qwen3 32B · +18% momentum See community dataFrom zero to running
Install Ollama, pull a model, add a UI, scale up. Plus the eight frontend tools worth knowing — Open WebUI, LM Studio, ComfyUI, Pinokio.
4 steps · 8 tools compared Open the guideWhat happens when you compress a model
The JPEG analogy. The seven-row compression table from FP16 down to IQ2_XS. The "what breaks first" sensitivity matrix. Three real-world tradeoff scenarios for RTX 4090 / 16 GB / Llama 4 Scout 109B.
7 quant levels · 6 task types · 3 verdicts Read the deep diveSix pre-vetted stacks by hardware tier
Stop picking individual models — pick a complete stack. RTX 4090 power-user, 16 GB coding, 12 GB budget, Mac M-series, air-gapped compliance, ComfyUI creative production. Filter by tier, copy the stack, ship.
6 stacks · LLM + RAG + STT + TTS Browse the stacksTiny Models — ≤ 8B Parameters
- Exceptional efficiency — punches above its weight
- Multilingual support (140+ languages)
- Vision-ready in 4B+ variants
- Very power-efficient for edge/mobile
- Lower benchmark scores on heavy reasoning vs denser peers
- Limited coding depth compared to Qwen/Phi
- On-device classification and summaries
- Multilingual text processing
- Quick image understanding tasks
- Edge deployments and IoT
- 1B Q4: ~2 GB
- 4B Q4: ~4 GB
- Runs well on Apple M-series via MLX
ollama pull gemma3:1b
ollama pull gemma3:4b
- Outstanding multilingual (100+ languages)
- Dual-mode: fast inference or thinking mode
- Strong early coding/math for size
- Apache 2.0 — fully permissive license
- Very small variants hallucinate on niche topics
- Limited context window vs. larger siblings
ollama pull qwen3:0.6b
ollama pull qwen3:4b
- 262K native context, extensible to 1M tokens
- Native multimodal — vision + text fused
- 201 languages supported
- Apache 2.0 — fully commercial
- Very new — limited community fine-tunes
- Requires newer inference engines for hybrid attention
ollama pull qwen3.5:4b
ollama pull qwen3.5:27b
- Exceptional reasoning/math — beats larger models
- 128K context window
- Very fast inference
- Function calling support
- Narrower general knowledge
- Less creative writing ability
ollama pull phi4-mini
- Leading multilingual / coding / long-context for 8B
- Dual-mode thinking (fast + deep)
- Beats many larger models on benchmarks
- Excellent community daily driver
- Occasional inconsistency in fast mode
- Knowledge depth limited vs 14B+
ollama pull qwen3:8b
- Superb instruction following and creativity
- Excellent European language performance
- Fully permissive Apache 2.0 license
- Very efficient inference
- Older base — less competitive on 2026 reasoning benchmarks
- Smaller models limited on very complex tasks
ollama pull mistral
ollama pull mistral-nemo
- RL-based thinking mode — rivals o1-level reasoning
- Excellent coding/math for size
- MIT license — fully permissive
- Occasionally less coherent without thinking mode
- Slower when thinking traces are long
ollama pull deepseek-r1:8b
- Matches o1-mini — MATH-500: 95.8%, AIME 2024: 68.2%
- Outperforms 32B models in reasoning
- MTP speculative decoding — very fast
- MIT license — fully permissive
- Requires trust_remote_code for deployment
- 48K context — smaller than competitors
ollama pull mimo:7b
- Incredibly small — runs on anything
- Sub-1GB models available
- Fast training/fine-tuning
- Very limited capability — basic tasks only
- High hallucination rate
ollama pull smollm2:1.7b
- Outperforms Llama 3.2 3B and Qwen2.5 3B
- Dual-mode reasoning (think/no_think)
- 128K context via YaRN extrapolation
- Fully open — weights + training recipe
- Falls behind Qwen3 4B on math tasks
- 6 languages only (EN, FR, ES, DE, IT, PT)
ollama pull smollm3:3b
- Multimodal — vision + audio in tiny package
- MediaPipe optimized for mobile/edge
- Runs on phones and low-end hardware
- Apache 2.0 — fully open
- Limited general reasoning at 2–4B scale
- Narrower language coverage than larger Gemma
ollama pull gemma3n
- 600+ language support — widest code coverage
- Fill-in-the-middle for IDE completion
- Lightweight — real-time on CPU
- Trained on ethically-sourced code (The Stack v2)
- Code-only — no general chat ability
- 16K context (smaller than competitors)
ollama pull starcoder2:3b
Small Models — 9–14B Parameters
- Exceptional reasoning/math — tops 14B class
- 84%+ MMLU, strong GPQA scores
- 128K context, fast inference
- MIT license — fully open
- Less creative / broad general knowledge than Llama/Qwen
- Narrower training data focus
- STEM tasks, math tutoring
- Research assistants on mid-range GPUs
- Technical document analysis
- Code review and generation
- Q4: ~11 GB — fits RTX 4060 Ti 16GB
- Q8: ~16 GB — fits RTX 4090
- FP16: ~28 GB
ollama pull phi4:14b
- Top-tier multilingual and coding for 14B class
- Dual-mode: fast + thinking
- 128K context window
- Apache 2.0 license
- Occasional fast-mode inconsistency
- Slightly below Phi-4 on pure math
ollama pull qwen3:14b
- Near-SOTA coding and reasoning for size
- RL-trained thinking traces
- Excellent for agentic workflows
- Distilled — some quality loss vs full R1
- Can be verbose in thinking mode
ollama pull deepseek-r1:14b
- Multi-image and video understanding
- Leading OCR and document intelligence
- Hybrid architecture — efficient memory
- Reasoning mode toggle via system prompt
- Reasoning not supported for video inputs
- Newer — smaller community ecosystem
ollama pull nemotron-nano:12b
- Vision + text multimodal in one model
- Outperforms Gemma 3 and GPT-4o Mini
- 150 tok/s — very fast inference
- Apache 2.0 — commercial friendly
- Heavier than 14B models — needs 16GB+ VRAM
- Not as strong on pure math as Phi-4
ollama pull mistral-small3.1
- Hybrid Mamba-2 arch — faster inference, lower memory
- Edge to server in one family (350M to 32B)
- Strong instruction following and tool calling
- Apache 2.0 — clean enterprise licensing
- Mamba-2 backend support still maturing in some tools
- Smaller community than Llama/Qwen ecosystems
- MoE variants need compatible inference stacks
ollama pull granite4
ollama pull granite4:small-h
- Dual-mode thinking — fast and deep reasoning
- MoE efficiency — only 8.6B active params
- Strong multilingual performance
- 128K context, Apache 2.0
- Smaller community than Qwen/Llama ecosystems
- Fewer fine-tunes and adapters available
ollama pull minimax-m2.5
- 256K context — massive document processing
- Enterprise RAG with inline citations
- MoE — fits on single 48GB GPU (Q4)
- Strong agentic tool-use capabilities
- CC-BY-NC — no commercial use without agreement
- 25GB+ VRAM even with MoE efficiency
- Smaller fine-tune ecosystem
ollama pull command-a
Medium Models — 15–35B Parameters
- GPT-like structured reasoning and tool-use
- Agent-friendly design with function calling
- Runs well quantized on consumer GPUs
- Strong structured outputs
- Newer — smaller community than Llama
- 128K context limit (no 1M option)
ollama pull gpt-oss:20b
- 68% SWE-bench — exceptional coding
- Strong agentic task completion
- Multi-file code understanding
- Coding-focused — less general purpose
- Not ideal for creative or chat tasks
ollama pull devstral-small
- Well-balanced across all benchmarks
- Built-in vision support
- Efficient for size, good throughput
- 140+ language support
- Tight fit on 24GB at full context
- Not absolute SOTA on any single benchmark
ollama pull gemma3:27b
- 59.2% SWE-bench — top for local coding
- 60–80 t/s at 4-bit — very fast
- Interleaved + preserved thinking modes
- Excellent tool-use and agentic capabilities
- Runs on RTX 3090/4090 and Mac M-series
- Chat template issues with some runtimes
- Needs 24GB+ VRAM for good experience
- Smaller community than Qwen/Llama
ollama pull glm4:latest
llama.cpp recommended — use --jinja flag
- 91% Math Index — top in class
- 1M token context window
- 3.3x faster throughput than Qwen3-30B
- Hybrid Mamba-Transformer architecture
- Fully open: weights + data + recipes
- Automatic CPU offloading can reduce speed
- Context cliff at high token counts
- Newer Mamba architecture — less battle-tested
ollama pull nemotron-nano
- 7x throughput improvement over previous gen
- 1M token context window
- Multi-token prediction for faster inference
- Excellent agentic reasoning and tool use
- Requires 64GB+ VRAM/RAM at minimum
- New architecture may have limited tooling support initially
- Enterprise-focused — less community fine-tuning
- Multi-agent orchestration
- Enterprise agentic AI workflows
- IT ticket automation and complex reasoning
- High-throughput production inference
- Q4: ~40 GB (12B active params)
- Q8: ~64 GB
- FP16: ~256 GB (multi-GPU)
- Runs on Mac Studio M4 Ultra 192GB
ollama pull nemotron-3-super
- Video and document understanding
- Natural conversational abilities
- Complex visual reasoning
- 1M token context for long documents
- New release — limited community benchmarks
- NVIDIA GPU optimization bias
- Less tested on consumer hardware
- Document and video analysis agents
- Visual question answering
- Multimodal enterprise workflows
- Real-time conversational AI with vision
- Q4: ~16-24 GB (estimated)
- Optimized for NVIDIA GPUs
- vLLM and TensorRT-LLM support
pip install nemo-toolkit[all]
- Best long-context stability — 100% GPU at 48K
- Exceptional multilingual and coding
- Dual-mode thinking
- Apache 2.0 license
- Tight fit on 24GB — context limited
- Dense model — heavier than MoE alternatives
ollama pull qwen3:32b
- Near-SOTA reasoning with thinking traces
- KV cache efficiency (MLA architecture)
- Excellent for agentic/developer workflows
- MIT license
- May require vLLM patches for optimal speed
- Verbose thinking traces eat context
ollama pull deepseek-r1:32b
- Purpose-built for deep reasoning
- Exceptional on math competitions (AIME)
- 128K context for long reasoning chains
- Not a generalist — specialized for reasoning
- Very verbose reasoning traces
ollama pull qwq
- Fully open: code, data, checkpoints, logs — auditable end-to-end
- Think variant strong on math, code, and reasoning
- OlmoTrace lets you trace outputs to training data
- Apache 2.0 — clean for enterprise and compliance
- Slightly behind Qwen3 on general chat tasks
- Smaller community ecosystem than Llama/Qwen
- 65K context (vs 128K+ on competitors)
- Compliance environments requiring full auditability
- Research where training data provenance matters
- Reasoning tasks (Think variant)
- Organizations that need Apache 2.0 + full transparency
- 7B Q4: ~5 GB VRAM
- 32B Q4: ~18–20 GB VRAM
- 32B Q8: ~34 GB VRAM
ollama pull olmo-3.1
ollama pull olmo-3.1:7b
- Surpasses Qwen3 32B across benchmarks
- Strong reasoning and coding performance
- Fits on 24GB GPU at Q4
- Apache 2.0 — fully commercial
- Newer — fewer fine-tunes than Qwen3
- Tight fit on 24GB cards with long contexts
ollama pull qwen3.5:32b
- State-of-the-art for 7–10B size class
- Strong multilingual performance
- Efficient inference on consumer GPUs
- Apache 2.0 — fully open
- Smaller ecosystem than Llama/Qwen
- Fewer community fine-tunes available
ollama pull falcon3:7b
ollama pull falcon3:10b
Large Models — 36–80B Parameters
- Frontier-level multilingual/coding/reasoning
- Dual-mode thinking system
- Apache 2.0 — fully commercial
- ~50 GB VRAM — multi-GPU needed
- HF approval may be required
ollama pull qwen3:72b
- Top reasoning and coding at 70B tier
- Thinking mode for complex problems
- MIT license — fully permissive
- 45+ GB VRAM requirement
- Verbose thinking traces
ollama pull deepseek-r1:70b
Frontier Models — 80B+ Parameters
- 62.7% SWE-bench — strong coding
- GPT-style reasoning quality
- Advanced tool-use and agents
- Needs 48GB+ VRAM
- Newer ecosystem — fewer fine-tunes
ollama pull gpt-oss:120b
- Near-frontier quality (Quality Index: 57)
- 22B active params — efficient MoE
- Exceptional multilingual/long-context
- Apache 2.0 license
- 55+ GB VRAM minimum
- Multi-GPU needed for full power
ollama pull qwen3:235b
- Frontier performance — 87.8% MMLU-Pro, 83.6% LiveCodeBench
- Native multimodal — text + vision fused
- 262K–1M context, 201 languages
- Apache 2.0 — fully commercial
- 230+ GB — multi-GPU cluster required (8-way TP)
- Needs SGLang/vLLM — no simple Ollama yet
python -m sglang.launch_server --model-path Qwen/Qwen3.5-397B-A17B --tp-size 8
- 73.8% SWE-bench — top-tier coding
- Advanced thinking modes (interleaved, preserved)
- Strong multilingual agentic coding
- Competitive with Sonnet 3.5 for coding
- 205+ GB minimum — needs multi-GPU cluster
- 2-bit GGUF needs 135GB + 128GB RAM
ollama pull glm4.7
- 73.4% SWE-bench, 94.1% AIME 2025
- Only 15B active — extremely efficient MoE
- 6x KV-cache reduction via hybrid attention
- MIT license — fully permissive
- Multi-GPU required (8-way TP recommended)
- Needs SGLang or KTransformers — no simple Ollama
python3 -m sglang.launch_server --model-path XiaomiMiMo/MiMo-V2-Flash --tp-size 8
- Near-frontier on all benchmarks
- MLA architecture — extreme KV cache efficiency
- 37B active params — efficient despite size
- MIT license — fully permissive
- Datacenter-scale hardware required
- Requires patched vLLM for optimal speed
- 300+ GB VRAM minimum
ollama pull deepseek-v3.2-exp
- 86% GPQA-Diamond — exceptional reasoning
- 90% HumanEval — top coding
- Quality Index: 49.64 — #1 open source
- MIT license — fully open
- 2-bit GGUF fits 256GB Mac
- FP16 needs 1.5TB VRAM (8x H200 minimum)
- 2-bit quant: ~5 t/s with offloading
- Consumer-unfriendly at full quality
ollama pull glm5
- Premium reasoning and creative quality
- Strong European language support
- Advanced instruction following
- Datacenter-scale hardware required
- Restrictive license vs Apache 2.0 models
ollama pull mistral-large
Specialty — Coding, Vision, Embedding
ollama pull qwen3-coder
ollama pull deepseek-coder-v2
ollama pull kimi-k2.5
- Best-in-class RAG — grounded generation with citations
- Zero-shot multi-step tool use
- Efficient tokenizer cuts cost for multilingual workloads
- Strong enterprise workflow integration
- CC-BY-NC license — no commercial use without Cohere agreement
- 104B needs 60+ GB VRAM (Q4) — multi-GPU or 80GB cards
- Older architecture — not MoE, so VRAM scales linearly
ollama pull command-r-plus
- 256K context — handles entire codebases
- 80+ programming languages
- Strong agentic coding capabilities
- Fits on 24GB GPU at Q4
- Non-commercial license — research/personal only
- Code-focused — weaker on general tasks
ollama pull codestral
- Dense 32B — no MoE complexity
- 256K context for large codebases
- Strong agentic coding performance
- Apache 2.0 — fully commercial
- Tight fit on 24GB GPUs with long context
- Code-focused — use Qwen3.5 for general tasks
ollama pull qwen3-coder:32b
- Qwen2.5-VL: Best all-around — OCR, video, charts, documents
- InternVL2.5: Strongest visual reasoning and multi-image
- Llama 3.2-Vision: Best Meta ecosystem integration
- LLaVA: Lightweight, great for experimentation
- Add 2–5 GB VRAM overhead on top of base model
- Larger vision models (72B) need 48 GB+ VRAM
- Quality varies heavily by size — 7B vision ≠ 72B vision
ollama pull llava
ollama pull llama3.2-vision
- Runs on 1GB VRAM — smallest viable VLM
- Document OCR and image understanding
- Multiple sizes for flexibility
- Apache 2.0 — fully open
- Limited reasoning at tiny scale
- Lower accuracy than larger VLMs
pip install transformers
- Pointing/grounding — locates objects in images
- Fully open: weights, data, and code
- 7B version runs on consumer GPUs
- Apache 2.0 — fully commercial
- 72B version needs datacenter hardware
- Smaller ecosystem than Qwen2.5-VL
pip install transformers
ollama pull nomic-embed-text
ollama pull bge-m3
Speech-to-Text (STT / ASR)
- 99+ language support — best multilingual STT
- Automatic language identification
- Phrase-level timestamps
- Turbo variant: 6x faster, 809M params
- Handles noise and accents well
- Large V3 needs ~10 GB VRAM
- Not streaming-native (batch-oriented)
- Can hallucinate on silence or music
pip install openai-whisper
or faster-whisper for CTranslate2
pip install transformers accelerate
- Among the fastest open ASR models
- Streaming-capable — real-time transcription
- 65K hours of training data
- Optimized for NVIDIA GPUs
- English-only
- Ranks lower on pure accuracy vs Whisper
- Speed-optimized — accuracy tradeoff
pip install nemo_toolkit[asr]
pip install nemo_toolkit[asr]
pip install moonshine
- Streaming real-time ASR — sub-100ms latency
- Edge-optimized — runs on microcontrollers
- Improved accuracy over Moonshine v1
- MIT license
- English-focused
- Lower accuracy than Whisper on complex audio
pip install moonshine
- Improved accuracy over Whisper V3
- 99+ languages — best multilingual coverage
- Better noise and accent handling
- MIT license — fully open
- Needs ~10 GB VRAM
- Not streaming-native (batch-oriented)
- Still hallucination-prone on silence
pip install openai-whisper
Text-to-Speech (TTS / Voice Synthesis)
- State-of-the-art naturalness — rivals ElevenLabs
- Emotional control (laughing, crying, whispering)
- Real-time on modern GPUs
- Apache 2.0 — fully commercial
- English-focused (multilingual expanding)
- Needs GPU for real-time speeds
- Newer — smaller community than Piper
pip install orpheus-tts
- 82M params — runs on anything
- Neural quality (breathing, pausing)
- CPU and Apple Silicon capable
- Near-zero VRAM requirement
- Limited voice cloning ability
- Fewer emotional controls than Orpheus
pip install kokoro
- Best open-source voice cloning
- Excellent cross-language code-switching
- Zero-shot cloning from ~10s audio
- Strong multilingual support
- Higher VRAM than Kokoro/Piper
- Quality varies by language
pip install fish-speech
- 17 language support out-of-box
- Voice cloning from 6-second reference
- 1100+ pre-trained voices
- Extensive documentation and community
- Higher latency than Piper/MeloTTS
- MPL license (some restrictions)
pip install TTS
- Fastest open TTS — sub-second latency
- Runs on Raspberry Pi / embedded
- 40+ languages, 100+ voices
- Home Assistant integration
- No voice cloning
- Less natural than Orpheus/XTTS
- Fixed voice catalog
pip install piper-tts
- Speech + music + sound effects
- Non-verbal vocalizations
- Creative and expressive
- MIT license
- High latency — not real-time
- Unpredictable output quality
- GPU-heavy for best results
pip install git+https://github.com/suno-ai/bark.git
pip install melotts
- Zero-shot voice cloning from short samples
- Emotional control — adjust tone and style
- Compact 0.4B — fast inference
- MIT license — fully open
- English-focused
- Cloned voice quality depends on input sample
pip install chatterbox-tts
- Multi-speaker dialogue generation
- Integrated sound effects
- Natural conversational flow
- Apache 2.0 — fully commercial
- Larger than single-speaker TTS models
- Sound effects limited to trained set
pip install dia-tts
- Non-autoregressive — very fast generation
- Zero-shot multilingual voice cloning
- Compact 0.4B parameters
- MIT license
- Newer — smaller community
- Non-AR can sacrifice some prosody quality
pip install amphion
Embedding, Search & Retrieval
- MoE — efficient with 305M active params
- 100+ languages, 100+ code languages
- Flexible dimensions (256–768)
- Top BEIR/MIRACL scores for size
- Can drop on noisy/multilingual data
- Needs prefix prompts for optimal results
ollama pull nomic-embed-text
- Dense + sparse + ColBERT in one model
- 100+ languages — best cross-lingual
- 8K token input length
- SOTA on multilingual benchmarks
- Slightly slower than single-mode models
- Requires prompt engineering for best results
ollama pull bge-m3
pip install sentence-transformers
ollama pull snowflake-arctic-embed
- Dramatically improves RAG accuracy
- Runs on consumer hardware
- Multiple sizes for speed/accuracy tradeoff
- Apache 2.0 — no licensing fees
- Adds 100–500ms latency per query
- Cross-encoder — can't precompute
- Larger variants need significant VRAM
pip install FlagEmbedding
pip install colbert-ai
- Task-specific LoRA adapters — optimized per use case
- 8K context — handles longer documents
- Strong multilingual support
- Apache 2.0 — fully open
- Larger than simpler embedding models
- LoRA switching adds minor complexity
pip install jina-embeddings
Image Generation (Diffusion Models)
- Runs on 4 GB VRAM — most accessible
- Largest LoRA/checkpoint ecosystem ever
- Fastest generation times
- Mature tooling (A1111, ComfyUI)
- 512×512 native — needs upscaling
- Weaker prompt adherence vs newer models
- Poor text rendering in images
pip install diffusers transformers accelerate
- 1024×1024 native resolution
- Largest fine-tune ecosystem (Juggernaut, RealVis, etc.)
- Excellent photorealism with right checkpoint
- ControlNet, inpainting, upscaling support
- Commercial use allowed
- 8 GB minimum, 12 GB recommended
- Refiner adds complexity and VRAM
- Text rendering still inconsistent
SDXL 1.0 · SDXL Turbo · SDXL Lightning
- Improved text rendering in images
- Better prompt fidelity than SDXL
- Multiple size variants
- Compatible with existing SD tooling
- 12–24 GB VRAM requirement
- Much smaller LoRA/fine-tune ecosystem
- Community license — not fully open
pip install diffusers transformers
- Midjourney-level quality — locally
- Best text rendering in open models
- Schnell: 4 steps, Apache 2.0 commercial
- GGUF/NF4 quants: 6–8 GB VRAM
- Excellent anatomy and photorealism
- Full FP16: needs 24 GB VRAM
- Dev license is non-commercial
- Smaller fine-tune ecosystem than SDXL
- No A1111 support — ComfyUI or Forge only
git clone https://github.com/comfyanonymous/ComfyUI.git
- State-of-the-art open image quality
- Klein 4B variant fits consumer GPUs
- Exceptional prompt fidelity
- Flex: developer-friendly parameter control
- Dev 32B needs significant VRAM
- Pro is API-only
- Commercial licensing required for some variants
pip install diffusers transformers
- Sub-second generation — fastest quality model
- Fits 16 GB consumer GPUs
- Best bilingual text rendering (EN/CN)
- Apache 2.0 — fully commercial
- Beats models 10x its size
- New — small community ecosystem
- Fewer LoRAs/fine-tunes available
pip install diffusers
pip install diffusers
pip install diffusers
- Bilingual EN/CN prompt support
- High aesthetic quality
- Smaller than FLUX — faster inference
- Apache 2.0 — fully open
- Smaller LoRA ecosystem than SD/FLUX
- Less community tooling support
pip install diffusers
- Up to 4K resolution generation
- Very fast — efficient linear attention
- Compact 1.6B parameters
- Apache 2.0 — fully open
- Newer — limited community fine-tunes
- Less refined than FLUX on complex prompts
pip install diffusers
pip install realesrgan
Video Generation (Text/Image-to-Video)
- 1.3B variant fits on almost any GPU (~8 GB)
- SOTA among open-source video models
- T2V, I2V, video editing, V2A — full pipeline
- ComfyUI + Diffusers integration
- FP8 and GGUF quants available
- 14B needs offloading on 24 GB (4+ min per clip)
- 480p is more stable than 720p on 1.3B
- Complex face consistency can vary
pip install diffusers
ComfyUI recommended
- Best face/human rendering among open models
- Cinema-grade temporal consistency
- Camera motion control (zoom, pan, tilt)
- xDiT multi-GPU parallelism support
- FP8 mode runs on 14 GB with offloading
- 8–15 min per clip on RTX 4090
- Full precision needs 80 GB+
- Prompt engineering required for best results
pip install diffusers
- 5–10 second generation on RTX 4090
- 30fps output at up to 1216×704
- FP8 and distilled variants for efficiency
- Built-in upscaler pipeline
- Lower quality ceiling than Wan/Hunyuan
- Struggles with close-up faces
- Best with concise 10–20 word prompts
pip install diffusers
- Excellent detail and semantic accuracy
- Strong Diffusers integration
- LoRA fine-tuning with CogKit
- 5B fits on consumer 16 GB GPUs
- Supports quantized inference (TorchAO)
- Close-up faces can struggle
- 6–12 min generation on RTX 4090
- Fixed resolution modes
pip install cogkit
- Best natural motion quality among open models
- Apache 2.0 — fully commercial
- 30fps photorealistic output
- Fine-tuning with custom video datasets
- Needs 60–80 GB VRAM (A100/H100 territory)
- 480p max resolution currently
- 10–20 min per clip on RTX 4090 (with offload)
pip install diffusers
Music Generation (Text-to-Music / Audio)
- Full songs up to 4 minutes
- Text-to-music + singing voice
- Lyrics-conditioned vocal generation
- Apache 2.0 — fully commercial
- Music quality below commercial services
- Limited style control granularity
pip install ace-step
- 47s stereo audio generation
- Music, SFX, and ambient sounds
- Good audio fidelity
- Open weights
- CC-BY-SA — share-alike requirement
- 47s max length — no long-form
- Less coherent on complex musical structures
pip install diffusers
3D Generation (Text/Image-to-3D)
- Image + text-to-3D generation
- GLB/OBJ export — game-engine ready
- High-quality mesh output
- MIT license — fully open
- Complex scenes can lack detail
- Texturing quality varies by input
pip install trellis3d
- High-fidelity 3D mesh generation
- PBR texture output
- Text and image input support
- Professional-quality assets
- Tencent license — check terms for commercial use
- Higher VRAM requirements
pip install hunyuan3d
- Sub-1-second 3D generation
- Single image input — no multi-view needed
- Compact 1B parameters
- MIT license — fully open
- Single-view — back of objects can be inaccurate
- Lower detail than multi-step pipelines
pip install triposr
💬 Suggest an Improvement
Missing a model? Found incorrect info? Have a feature request? Help make this reference better for everyone.
❓ Frequently Asked Questions
Click to expand ▼How much VRAM do I need to run a local LLM?
It depends on model size and quantization. At Q4_K_M (the community default): 7B models need 5–7 GB, 14B models need 10–11 GB, 32B models need 20–22 GB, and 70B models need 45–50 GB. The formula approximates: VRAM ≈ Parameters_B × 0.5 + KV_overhead. Consumer GPUs like the RTX 4090 (24 GB) comfortably accommodate 32B models.
What is the best local LLM in 2026?
Top choices vary by hardware tier. For 24 GB GPUs: Qwen3 32B (Apache 2.0 license, dual-mode thinking, multilingual/coding excellence). For 16 GB: Phi-4 14B (MIT-licensed, exceptional reasoning/math). For 8 GB or less: Qwen3 8B or Llama 3.2 3B. Frontier-grade models include GLM-5 744B and DeepSeek V3.2 671B (requiring datacenter infrastructure).
What is Q4_K_M quantization?
Q4_K_M is a 4.8-bit quantization format that reduces model size by approximately 75% while preserving 99.5% of quality. It represents the community standard for local LLM operation via Ollama and llama.cpp. Think of it as "JPEG at 80% quality — barely distinguishable from original but dramatically smaller."
Can I run AI models locally without internet?
Yes. Once downloaded, all models operate entirely offline with zero cloud dependency. Tools like Ollama, LM Studio, and llama.cpp enable local inference. This approach suits air-gapped environments, HIPAA compliance requirements, and privacy-sensitive workflows.