LLMfit: 206 Models, One Command to Find What Runs on Your Hardware
Stop guessing which LLM your hardware can run. This Rust-based terminal tool detects your system, scores 206 models across quality, speed, fit, and context — and tells you exactly what works before you download a single byte.
📺 Watch the Full Video Guide
Visual walkthrough of LLMfit's TUI, scoring system, and how to find the perfect model for your hardware.
Everyone running local models has done the dance. You hear about a new LLM, download 8 GB of weights, wait twenty minutes, launch it, and discover it either doesn't fit in VRAM or crawls at three tokens per second. Then you try a smaller quantization, then a different model, and before you know it you've burned an afternoon on trial and error.
LLMfit kills that loop. It's a Rust-based terminal utility that detects your hardware, scans a database of 206 models across 57 providers, and tells you exactly which ones will run well on your machine — before you download a single byte.
Built by Alex Jones and written entirely in Rust, LLMfit ships with an interactive TUI (terminal user interface) and a classic CLI mode. It supports multi-GPU setups, Mixture-of-Experts architectures, dynamic quantization selection, speed estimation, and direct Ollama integration for one-click model downloads.
What Is LLMfit?
LLMfit is a command-line tool that answers one question: "Which LLM models will actually run well on my specific hardware?"
It works in three steps:
- Scans your hardware — Multi-GPU VRAM (aggregated across cards), CPU cores, RAM, and backend detection for CUDA, Metal, ROCm, SYCL, or CPU-only with ARM/x86 distinction.
- Evaluates every model — For each of the 206 models, it calculates optimal quantization level, estimates tokens-per-second, determines whether the model fits in VRAM or needs CPU offloading, and classifies the fit as Perfect, Good, Marginal, or Too Tight.
- Scores and ranks — Each model gets four scores (0–100) for Quality, Speed, Fit, and Context, weighted by your use case. The result is a ranked list of models that will actually work on your machine.
The model database is sourced from the HuggingFace API, stored in data/hf_models.json, and embedded at compile time. It covers Meta Llama, Mistral, Qwen, Google Gemma, Microsoft Phi, DeepSeek, IBM Granite, xAI Grok, Cohere, BigCode, and dozens more providers.
Key Features
Multi-Dimensional Scoring
Unlike simple "fits / doesn't fit" tools, LLMfit evaluates models across four dimensions:
| Dimension | What It Measures |
|---|---|
| Quality | Parameter count, model family reputation, quantization penalty, task alignment |
| Speed | Estimated tokens/sec based on backend, params, and quantization |
| Fit | Memory utilization efficiency (sweet spot: 50–80% of available memory) |
| Context | Context window capability vs. target for the use case |
Weights vary by use-case category. Chat weights Speed higher (0.35), while Reasoning weights Quality higher (0.55). This means a coding assistant and a chatbot will get different top recommendations on the same hardware.
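The weighting described above can be sketched as a simple linear blend. This is an illustrative sketch, not LLMfit's actual internals: the struct and function names are my own, and only the Chat and Reasoning weight values come from the article.

```rust
// Minimal sketch of use-case-weighted composite scoring.
// Weight values for Chat and Reasoning are from the article;
// everything else here is illustrative.
struct Weights {
    quality: f64,
    speed: f64,
    fit: f64,
    context: f64,
}

const CHAT: Weights = Weights { quality: 0.20, speed: 0.35, fit: 0.25, context: 0.20 };
const REASONING: Weights = Weights { quality: 0.55, speed: 0.15, fit: 0.15, context: 0.15 };

fn composite(q: f64, s: f64, f: f64, c: f64, w: &Weights) -> f64 {
    // Each sub-score is 0-100; the weights sum to 1.0.
    q * w.quality + s * w.speed + f * w.fit + c * w.context
}

fn main() {
    // A high-quality but slow model ranks very differently per use case:
    let (q, s, f, c) = (90.0, 40.0, 70.0, 80.0);
    println!("chat:      {:.1}", composite(q, s, f, c, &CHAT)); // speed-weighted
    println!("reasoning: {:.1}", composite(q, s, f, c, &REASONING)); // quality-weighted
}
```

The same model scores 65.5 for Chat but 78.0 for Reasoning, which is exactly why the top recommendation shifts with the use case.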
Mixture-of-Experts (MoE) Support
Models like Mixtral 8x7B and DeepSeek-V3 use MoE architectures where only a subset of experts activates per token. LLMfit detects this automatically and calculates the effective VRAM requirement. For example, Mixtral 8x7B has 46.7B total parameters but only activates ~12.9B per token, reducing VRAM from 23.9 GB to ~6.6 GB with expert offloading.
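The Mixtral numbers above fall out of simple back-of-envelope arithmetic, sketched here under the assumption of roughly 0.512 effective bytes per parameter (a Q4-class figure; LLMfit's exact accounting may differ):

```rust
// Back-of-envelope MoE memory estimate. The 0.512 bytes/param figure
// is an assumption chosen to match the article's Mixtral numbers.
fn weights_gb(params_billion: f64, bytes_per_param: f64) -> f64 {
    // billions of params x bytes per param = gigabytes of weights
    params_billion * bytes_per_param
}

fn main() {
    let bpp = 0.512;
    println!("all experts resident: {:.1} GB", weights_gb(46.7, bpp)); // ~23.9 GB
    println!("active experts only:  {:.1} GB", weights_gb(12.9, bpp)); // ~6.6 GB
}
```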
Dynamic Quantization Selection
Instead of assuming a fixed quantization, LLMfit walks a hierarchy from Q8_0 (best quality) down to Q2_K (most compressed), picking the highest quality that fits in available memory. If nothing fits at full context, it tries again at half context.
Speed Estimation
LLMfit uses empirically derived baseline constants for each backend:
| Backend | Baseline (tokens/sec) |
|---|---|
| CUDA (NVIDIA) | 220 |
| ROCm (AMD) | 180 |
| Metal (Apple Silicon) | 160 |
| SYCL (Intel Arc) | 100 |
| CPU ARM | 90 |
| CPU x86 | 70 |
These are adjusted for model size, quantization level, CPU offload penalties (0.5× for partial, 0.3× for CPU-only), and MoE expert switching overhead (0.8×). They aren't real benchmarks, but they're directionally useful when choosing between twenty candidate models.
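The layered estimate — backend baseline times penalty multipliers — can be sketched like this. The 0.5×/0.3×/0.8× penalties come from the article; the size scaling relative to a 7B reference is my own assumption:

```rust
// Sketch of a layered tokens/sec estimate. Offload and MoE penalties
// are from the article; the inverse size scaling is an assumption.
#[derive(Clone, Copy)]
enum Offload {
    None,    // model fully in VRAM
    Partial, // some layers spill to system RAM
    CpuOnly, // no GPU acceleration at all
}

fn estimate_tps(baseline: f64, params_b: f64, offload: Offload, moe: bool) -> f64 {
    // Assumed: throughput falls roughly inversely with size vs. a 7B reference.
    let size = (7.0 / params_b).min(1.5);
    let off = match offload {
        Offload::None => 1.0,
        Offload::Partial => 0.5,
        Offload::CpuOnly => 0.3,
    };
    let moe_penalty = if moe { 0.8 } else { 1.0 };
    baseline * size * off * moe_penalty
}

fn main() {
    // A 13B dense model on CUDA (baseline 220), partially offloaded:
    println!("{:.0} tok/s", estimate_tps(220.0, 13.0, Offload::Partial, false));
}
```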
How the Scoring Works
The composite score combines all four dimensions with use-case-specific weights. Here's how different categories prioritize:
| Use Case | Quality | Speed | Fit | Context |
|---|---|---|---|---|
| General | 0.30 | 0.25 | 0.25 | 0.20 |
| Coding | 0.35 | 0.30 | 0.20 | 0.15 |
| Reasoning | 0.55 | 0.15 | 0.15 | 0.15 |
| Chat | 0.20 | 0.35 | 0.25 | 0.20 |
| Multimodal | 0.35 | 0.20 | 0.25 | 0.20 |
| Embedding | 0.25 | 0.30 | 0.30 | 0.15 |
Models classified as "Too Tight" (not enough VRAM or system RAM) are always ranked at the bottom, regardless of their quality scores. The fit levels are:
- Perfect — Recommended memory met on GPU. Requires GPU acceleration.
- Good — Fits with headroom. Best achievable for MoE offload or CPU+GPU.
- Marginal — Tight fit, or CPU-only (CPU-only always caps here).
- Too Tight — Not enough VRAM or system RAM anywhere.
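One way to encode that ladder is shown below. The specific thresholds are assumptions; only the ordering, the CPU-only cap at Marginal, and the "nowhere to put it" bottom rung come from the description above:

```rust
// Sketch of the four-level fit ladder. Thresholds (0.8, 0.9) are
// illustrative assumptions, not LLMfit's actual cutoffs.
#[derive(Debug, PartialEq)]
enum Fit {
    Perfect,
    Good,
    Marginal,
    TooTight,
}

fn classify(need_gb: f64, vram_gb: f64, ram_gb: f64, cpu_only: bool) -> Fit {
    let total = vram_gb + ram_gb;
    if need_gb > total {
        Fit::TooTight // not enough memory anywhere
    } else if cpu_only || need_gb > total * 0.9 {
        Fit::Marginal // CPU-only always caps here; so does a very tight squeeze
    } else if need_gb <= vram_gb * 0.8 {
        Fit::Perfect // comfortably within GPU memory
    } else {
        Fit::Good // fits with headroom via CPU+GPU offload
    }
}

fn main() {
    println!("{:?}", classify(10.0, 24.0, 32.0, false)); // small model, big GPU
    println!("{:?}", classify(100.0, 24.0, 32.0, false)); // far too large
}
```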
Hardware Detection
LLMfit auto-detects your hardware using platform-specific methods:
- NVIDIA — Multi-GPU support via nvidia-smi. Aggregates VRAM across all detected GPUs. Falls back to VRAM estimation from the GPU model name if reporting fails.
- AMD — Detected via rocm-smi.
- Intel Arc — Discrete VRAM via sysfs, integrated via lspci.
- Apple Silicon — Unified memory via system_profiler. VRAM equals system RAM, so a MacBook Pro with 36 GB sees models scored against the full 36 GB.
- CPU-only — Falls back to system RAM with ARM/x86 distinction for speed estimation.
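The NVIDIA path can be sketched roughly as below. The nvidia-smi query flags are real CLI options; the parsing and function names are illustrative, not LLMfit's actual code:

```rust
// Sketch of multi-GPU VRAM aggregation via nvidia-smi.
// --query-gpu=memory.total with csv,noheader,nounits prints one
// MiB value per GPU, one per line.
use std::process::Command;

fn parse_vram_mib(csv: &str) -> u64 {
    // e.g. "24576\n24576\n" for two RTX 3090s -> 49152 MiB total
    csv.lines()
        .filter_map(|line| line.trim().parse::<u64>().ok())
        .sum()
}

fn detect_total_vram_mib() -> Option<u64> {
    let out = Command::new("nvidia-smi")
        .args(["--query-gpu=memory.total", "--format=csv,noheader,nounits"])
        .output()
        .ok()?;
    out.status
        .success()
        .then(|| parse_vram_mib(&String::from_utf8_lossy(&out.stdout)))
}

fn main() {
    // Returns None gracefully when nvidia-smi is missing or broken,
    // which is exactly when you'd reach for the --memory override.
    println!("detected VRAM: {:?} MiB", detect_total_vram_mib());
}
```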
For systems where GPU VRAM autodetection fails (VMs, passthrough setups, broken nvidia-smi), the --memory flag lets you override manually:
# Override with 32 GB VRAM
llmfit --memory=32G
# Works with all modes
llmfit --memory=24G --cli
llmfit --memory=24G fit --perfect -n 5
Dynamic Quantization
Quantization reduces model precision to fit in less memory. LLMfit doesn't just pick one — it walks the full hierarchy to find the best quality that fits your hardware:
Quantization Hierarchy (Best → Most Compressed)
Q8_0 → Q6_K → Q5_K_M → Q5_K_S → Q4_K_M → Q4_K_S → Q4_0 → Q3_K_M → Q3_K_S → Q2_K
LLMfit tries each level top-down. If nothing fits at full context, it retries at half context before giving up.
This is particularly valuable because the difference between Q8_0 and Q4_K_M can be 50% less memory with only marginal quality loss for most tasks. LLMfit finds that sweet spot automatically.
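The walk itself is a simple first-fit search down the hierarchy. A minimal sketch, with assumed bytes-per-parameter figures (approximate GGUF overheads, not LLMfit's exact tables) and no context-window term:

```rust
// Sketch of the top-down quantization walk. The bytes/param figures
// are rough GGUF approximations, assumed for illustration.
const HIERARCHY: &[(&str, f64)] = &[
    // (quantization, approx bytes per parameter)
    ("Q8_0", 1.06), ("Q6_K", 0.86), ("Q5_K_M", 0.75), ("Q5_K_S", 0.73),
    ("Q4_K_M", 0.60), ("Q4_K_S", 0.58), ("Q4_0", 0.57),
    ("Q3_K_M", 0.49), ("Q3_K_S", 0.47), ("Q2_K", 0.39),
];

fn best_quant(params_b: f64, budget_gb: f64) -> Option<&'static str> {
    // First (highest-quality) level whose weights fit the memory budget.
    HIERARCHY
        .iter()
        .find(|(_, bpp)| params_b * bpp <= budget_gb)
        .map(|(name, _)| *name)
}

fn main() {
    // A 7B model with a 6 GB budget lands on a mid-tier quant;
    // a 70B model on 8 GB fits at no level.
    println!("{:?}", best_quant(7.0, 6.0));
    println!("{:?}", best_quant(70.0, 8.0));
}
```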
The TUI Experience
LLMfit's default interface is a full terminal UI built on ratatui. Your system specs appear at the top. Models fill a scrollable table sorted by composite score. Each row shows score, estimated tok/s, best quantization, run mode, memory usage, and use-case category.
Key bindings make navigation fast:
| Key | Action |
|---|---|
| / | Search (partial match on name, provider, params, use case) |
| f | Cycle fit filter: All → Runnable → Perfect → Good → Marginal |
| s | Cycle sort: Score → Params → Mem% → Ctx → Date → Use Case |
| p | Open provider filter popup |
| t | Cycle color theme (6 built-in themes, saved automatically) |
| d | Pull/download selected model via Ollama |
| i | Toggle installed-first sorting (Ollama only) |
| Enter | Toggle detail view for selected model |
Six built-in themes — Default, Dracula, Solarized, Nord, Monokai, and Gruvbox — let you match your terminal aesthetic. Your selection persists across sessions at ~/.config/llmfit/theme.
CLI Mode & Automation
For scripting and quick lookups, --cli gives classic table output. Subcommands target specific needs:
# Show detected hardware
llmfit system
# Top 5 perfectly fitting models
llmfit fit --perfect -n 5
# Search by name or size
llmfit search "qwen 8b"
# Detailed info on a specific model
llmfit info "Mistral-7B"
# Machine-readable recommendations (JSON)
llmfit recommend --json --limit 5
# Recommendations filtered by use case
llmfit recommend --json --use-case coding --limit 3
The recommend subcommand with --json is particularly useful for automation. Pipe it into scripts that auto-configure Ollama, or use it in CI/CD pipelines to select the right model for your deployment environment.
Ollama Integration
LLMfit connects to Ollama's local API at http://localhost:11434 (or a custom host via OLLAMA_HOST). If Ollama is running, you get:
- Install detection — Green ✓ marks for models you already have
- One-key download — Press d to pull any model directly from the TUI
- Progress tracking — Real-time download progress with animated indicators
- Remote support — Connect to Ollama on another machine via OLLAMA_HOST="http://192.168.1.100:11434" llmfit
If Ollama isn't running, LLMfit works normally — the download features are simply hidden from the status bar.
💡 Pro Tip: Remote GPU Server
Run LLMfit on your laptop while Ollama serves from your GPU server. Use OLLAMA_HOST to score models on the server's hardware and download them remotely — all from your local terminal.
Use Cases
1. First-Time Local AI Setup
You just bought a GPU or got a new MacBook. Run llmfit, filter by "Perfect" fit, and know exactly which models to download. No Reddit trawling required.
2. Coding Assistant Selection
Filter by coding use case to see CodeLlama, StarCoder2, WizardCoder, Qwen2.5-Coder, and Qwen3-Coder ranked specifically for your hardware. Speed gets weighted higher because latency matters when you're waiting for code completions.
3. Multi-GPU Rig Optimization
LLMfit aggregates VRAM across multiple GPUs. If you have 2× RTX 3090 (48 GB total), you'll see models that wouldn't fit on a single card suddenly ranked as "Perfect."
4. Edge Deployment Planning
Use --memory to simulate different hardware. Test what fits on a Jetson Nano (8 GB), a Raspberry Pi 5 (8 GB ARM), or a cloud VM with specific VRAM allocations before provisioning.
5. Automated Model Selection
Use llmfit recommend --json --use-case coding --limit 3 in CI/CD pipelines to automatically select and deploy the best model for your infrastructure.
Getting Started
Installation
Four options, all straightforward:
# Homebrew (macOS/Linux)
brew tap AlexsJones/llmfit
brew install llmfit
# Cargo (cross-platform, requires Rust)
cargo install llmfit
# Quick install script
curl -fsSL https://llmfit.axjns.dev/install.sh | sh
# From source
git clone https://github.com/AlexsJones/llmfit.git
cd llmfit && cargo build --release
First Run
Just type llmfit and the TUI launches. Your hardware is detected automatically. Browse, search, filter, and when you find a model you like — press d to download it via Ollama.
Quick Answers
# "What's the best model for my hardware?"
llmfit fit --perfect -n 1
# "Can I run Llama 3.1 70B?"
llmfit info "Llama-3.1-70B"
# "What coding models fit?"
llmfit recommend --json --use-case coding --limit 5
Competitors & Alternatives
| Tool | Approach | Key Difference |
|---|---|---|
| LLMfit | Hardware-aware model scoring | Multi-dimensional scoring, TUI, dynamic quantization, Ollama integration |
| Ollama | Model runner + registry | Runs models but doesn't recommend which ones fit your hardware |
| LM Studio | GUI model browser + runner | Desktop app with download + chat UI, but less scoring transparency |
| GPT4All | Desktop chat app | Focuses on ease of use over hardware optimization |
| Manual research | Reddit, HuggingFace browsing | Time-consuming, hardware-agnostic, easy to miss good options |
LLMfit's niche is clear: it's not a model runner. It's a model selector. It answers "what should I run?" so tools like Ollama and LM Studio can answer "how do I run it?"
Pros & Cons
✅ Pros
- Saves hours of trial-and-error model selection
- Multi-dimensional scoring is more useful than binary fits/doesn't-fit
- Excellent Apple Silicon support (unified memory = VRAM)
- MoE-aware — correctly scores Mixtral, DeepSeek-V3 effective memory
- Ollama integration for seamless download-from-TUI workflow
- Written in Rust — fast startup, single binary, no dependencies
- JSON output for scripting and automation
- 6 color themes (Dracula, Nord, Gruvbox, etc.)
- Free, open source (MIT license)
⚠️ Cons
- Speed estimates are theoretical, not actual benchmarks
- Model database is compiled-in — requires rebuild to add new models
- 206 models is comprehensive but can't cover every fine-tune or merge
- No Windows GUI — terminal-only (though CLI mode works in any terminal)
- Scoring weights are opinionated (may not match your priorities)
References
- LLMfit GitHub Repository — Source code, documentation, and model database
- LLMfit on crates.io — Rust package registry listing
- LLMfit Documentation on docs.rs — API documentation
- AwesomeAgents: LLMfit Review — "Stop Guessing Which LLM Your Hardware Can Actually Run"
- Show HN: LLMfit — Hacker News community discussion
- Ratatui — The TUI framework LLMfit is built on
- Ollama — Local LLM runner integrated with LLMfit
- HuggingFace — Source of LLMfit's 206-model database
- llama.cpp — Quantization formats (GGUF) used by LLMfit's scoring
- Rustup — Rust toolchain installer for building from source
- LLMfit Official Site — Quick install script and project homepage