LLMfit: 206 Models, One Command to Find What Runs on Your Hardware

Stop guessing which LLM your hardware can run. This Rust-based terminal tool detects your system, scores 206 models across quality, speed, fit, and context — and tells you exactly what works before you download a single byte.


Everyone running local models has done the dance. You hear about a new LLM, download 8 GB of weights, wait twenty minutes, launch it, and discover it either doesn't fit in VRAM or crawls at three tokens per second. Then you try a smaller quantization, then a different model, and before you know it you've burned an afternoon on trial and error.

LLMfit kills that loop. It's a Rust-based terminal utility that detects your hardware, scans a database of 206 models across 57 providers, and tells you exactly which ones will run well on your machine — before you download a single byte.

At a glance: 206 models in the database · 57 providers covered · 4 scoring dimensions · 6 use-case categories.

Built by Alex Jones and written entirely in Rust, LLMfit ships with an interactive TUI (terminal user interface) and a classic CLI mode. It supports multi-GPU setups, Mixture-of-Experts architectures, dynamic quantization selection, speed estimation, and direct Ollama integration for one-click model downloads.

What Is LLMfit?

LLMfit is a command-line tool that answers one question: "Which LLM models will actually run well on my specific hardware?"

It works in three steps:

  1. Scans your hardware — Multi-GPU VRAM (aggregated across cards), CPU cores, RAM, and backend detection for CUDA, Metal, ROCm, SYCL, or CPU-only with ARM/x86 distinction.
  2. Evaluates every model — For each of the 206 models, it calculates optimal quantization level, estimates tokens-per-second, determines whether the model fits in VRAM or needs CPU offloading, and classifies the fit as Perfect, Good, Marginal, or Too Tight.
  3. Scores and ranks — Each model gets four scores (0–100) for Quality, Speed, Fit, and Context, weighted by your use case. The result is a ranked list of models that will actually work on your machine.

The model database is sourced from the HuggingFace API, stored in data/hf_models.json, and embedded at compile time. It covers Meta Llama, Mistral, Qwen, Google Gemma, Microsoft Phi, DeepSeek, IBM Granite, xAI Grok, Cohere, BigCode, and dozens more providers.

Key Features

Multi-Dimensional Scoring

Unlike simple "fits / doesn't fit" tools, LLMfit evaluates models across four dimensions:

Dimension   What It Measures
Quality     Parameter count, model family reputation, quantization penalty, task alignment
Speed       Estimated tokens/sec based on backend, params, and quantization
Fit         Memory utilization efficiency (sweet spot: 50–80% of available memory)
Context     Context window capability vs. target for the use case
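The Fit dimension's "sweet spot" idea can be sketched in a few lines. This is an illustrative curve, not LLMfit's actual code: the article only states that 50–80% memory utilization scores best, so the exact shape below (and the numbers 60, 80, 300) are my own assumptions.

```python
def fit_score(model_mem_gb: float, available_mem_gb: float) -> float:
    """Return a 0-100 fit score from memory utilization (illustrative)."""
    util = model_mem_gb / available_mem_gb
    if util > 1.0:
        return 0.0            # doesn't fit at all ("Too Tight")
    if 0.5 <= util <= 0.8:
        return 100.0          # sweet spot: large model, comfortable headroom
    if util < 0.5:
        # fits easily but under-uses the hardware; ramp up toward 100 at 50%
        return 60.0 + 80.0 * util
    # 0.8 < util <= 1.0: fits, but dangerously tight; fall off toward 40
    return 100.0 - 300.0 * (util - 0.8)
```

A 12 GB model on a 24 GB card (50% utilization) lands squarely in the sweet spot, while the same model on a 13 GB card would score much lower despite technically fitting.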

Weights vary by use-case category. Chat weights Speed higher (0.35), while Reasoning weights Quality higher (0.55). This means a coding assistant and a chatbot will get different top recommendations on the same hardware.

Mixture-of-Experts (MoE) Support

Models like Mixtral 8x7B and DeepSeek-V3 use MoE architectures where only a subset of experts activates per token. LLMfit detects this automatically and calculates the effective VRAM requirement. For example, Mixtral 8x7B has 46.7B total parameters but only activates ~12.9B per token, reducing VRAM from 23.9 GB to ~6.6 GB with expert offloading.
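The arithmetic behind that adjustment is simple to sketch. Under the assumption (mine, not a documented formula) that with expert offloading only the active experts' weights need to stay resident, effective VRAM scales with the ratio of active to total parameters:

```python
def moe_effective_vram(full_vram_gb: float,
                       total_params_b: float,
                       active_params_b: float) -> float:
    """Scale full-model VRAM by the fraction of parameters active per token."""
    return full_vram_gb * (active_params_b / total_params_b)

# Mixtral 8x7B at ~Q4: 46.7B total params, ~12.9B active per token
print(round(moe_effective_vram(23.9, 46.7, 12.9), 1))  # ≈ 6.6
```

That matches the article's Mixtral numbers: 23.9 GB shrinks to roughly 6.6 GB of required VRAM.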

Dynamic Quantization Selection

Instead of assuming a fixed quantization, LLMfit walks a hierarchy from Q8_0 (best quality) down to Q2_K (most compressed), picking the highest quality that fits in available memory. If nothing fits at full context, it tries again at half context.

Speed Estimation

LLMfit uses empirically derived baseline constants for each backend:

Backend                Baseline (tokens/sec)
CUDA (NVIDIA)          220
ROCm (AMD)             180
Metal (Apple Silicon)  160
SYCL (Intel Arc)       100
CPU ARM                90
CPU x86                70

These are adjusted for model size, quantization level, CPU offload penalties (0.5× for partial, 0.3× for CPU-only), and MoE expert switching overhead (0.8×). Not benchmarks — but directionally useful when choosing between twenty candidate models.
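Putting those pieces together, the estimate looks roughly like this. The backend baselines and the 0.5×/0.3×/0.8× penalties come straight from the article; the size scaling (inverse with parameter count, normalized at 7B) is my own assumption for illustration:

```python
# Backend baselines in tokens/sec (from the article)
BASELINES = {"cuda": 220, "rocm": 180, "metal": 160,
             "sycl": 100, "cpu_arm": 90, "cpu_x86": 70}

def estimate_tps(backend: str, params_b: float,
                 offload: str = "none", is_moe: bool = False) -> float:
    tps = float(BASELINES[backend])
    tps *= 7.0 / params_b        # assumed: throughput falls off linearly with size
    if offload == "partial":
        tps *= 0.5               # partial CPU offload penalty
    elif offload == "cpu_only":
        tps *= 0.3               # CPU-only penalty
    if is_moe:
        tps *= 0.8               # MoE expert-switching overhead
    return tps

print(round(estimate_tps("cuda", 7)))              # 220
print(round(estimate_tps("cuda", 70, "partial")))  # 11
```

A 70B model partially offloaded on a CUDA card lands around 11 tok/s in this sketch, which is exactly the kind of "don't bother" signal the tool surfaces before you download anything.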

How the Scoring Works

The composite score combines all four dimensions with use-case-specific weights. Here's how different categories prioritize:

Use Case    Quality  Speed  Fit   Context
General     0.30     0.25   0.25  0.20
Coding      0.35     0.30   0.20  0.15
Reasoning   0.55     0.15   0.15  0.15
Chat        0.20     0.35   0.25  0.20
Multimodal  0.35     0.20   0.25  0.20
Embedding   0.25     0.30   0.30  0.15

Models classified as "Too Tight" (not enough VRAM or system RAM) are always ranked at the bottom, regardless of their quality scores. The fit levels are:

  • Perfect — Recommended memory met on GPU. Requires GPU acceleration.
  • Good — Fits with headroom. Best achievable for MoE offload or CPU+GPU.
  • Marginal — Tight fit, or CPU-only (CPU-only always caps here).
  • Too Tight — Not enough VRAM or system RAM anywhere.
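The ranking rule above can be sketched as a weighted sum with a demotion step. The per-use-case weights are the published values from the table; the exact demotion mechanism (a large fixed penalty here) is my assumption:

```python
# Published per-use-case weights (subset of the table above)
WEIGHTS = {
    "general":   {"quality": 0.30, "speed": 0.25, "fit": 0.25, "context": 0.20},
    "coding":    {"quality": 0.35, "speed": 0.30, "fit": 0.20, "context": 0.15},
    "reasoning": {"quality": 0.55, "speed": 0.15, "fit": 0.15, "context": 0.15},
    "chat":      {"quality": 0.20, "speed": 0.35, "fit": 0.25, "context": 0.20},
}

def composite(scores: dict, use_case: str, fit_level: str) -> float:
    """Weighted composite score; 'Too Tight' models always sink to the bottom."""
    w = WEIGHTS[use_case]
    total = sum(w[dim] * scores[dim] for dim in w)
    if fit_level == "too_tight":
        total -= 1000.0   # assumed: any penalty large enough to outrank nothing
    return total

# A slow, high-quality model ranks very differently per use case:
scores = {"quality": 90, "speed": 40, "fit": 70, "context": 60}
print(round(composite(scores, "reasoning", "good"), 1))  # 75.0
print(round(composite(scores, "chat", "good"), 1))       # 61.5
```

The same model scores 75.0 for reasoning but only 61.5 for chat, which is why the top recommendation changes with the use-case filter even on identical hardware.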

Hardware Detection

LLMfit auto-detects your hardware using platform-specific methods:

  • NVIDIA — Multi-GPU support via nvidia-smi. Aggregates VRAM across all detected GPUs. Falls back to VRAM estimation from GPU model name if reporting fails.
  • AMD — Detected via rocm-smi.
  • Intel Arc — Discrete VRAM via sysfs, integrated via lspci.
  • Apple Silicon — Unified memory via system_profiler. VRAM equals system RAM, so a MacBook Pro with 36 GB sees models scored against the full 36 GB.
  • CPU-only — Falls back to system RAM with ARM/x86 distinction for speed estimation.

For systems where GPU VRAM autodetection fails (VMs, passthrough setups, broken nvidia-smi), the --memory flag lets you override manually:

# Override with 32 GB VRAM
llmfit --memory=32G

# Works with all modes
llmfit --memory=24G --cli
llmfit --memory=24G fit --perfect -n 5

Dynamic Quantization

Quantization reduces model precision to fit in less memory. LLMfit doesn't just pick one — it walks the full hierarchy to find the best quality that fits your hardware:

Quantization Hierarchy (Best → Most Compressed)

Q8_0 → Q6_K → Q5_K_M → Q5_K_S → Q4_K_M → Q4_K_S → Q4_0 → Q3_K_M → Q3_K_S → Q2_K

LLMfit tries each level top-down. If nothing fits at full context, it retries at half context before giving up.

This is particularly valuable because the difference between Q8_0 and Q4_K_M can be 50% less memory with only marginal quality loss for most tasks. LLMfit finds that sweet spot automatically.
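The walk itself is a straightforward nested loop. The level ordering matches the hierarchy above; the per-level byte costs and the KV-cache term are illustrative assumptions, not exact GGUF sizes:

```python
QUANT_ORDER = ["Q8_0", "Q6_K", "Q5_K_M", "Q5_K_S", "Q4_K_M",
               "Q4_K_S", "Q4_0", "Q3_K_M", "Q3_K_S", "Q2_K"]
# Approximate GB per billion parameters at each level (assumed values)
GB_PER_B = {"Q8_0": 1.0, "Q6_K": 0.8, "Q5_K_M": 0.69, "Q5_K_S": 0.66,
            "Q4_K_M": 0.58, "Q4_K_S": 0.55, "Q4_0": 0.5,
            "Q3_K_M": 0.44, "Q3_K_S": 0.41, "Q2_K": 0.35}

def pick_quant(params_b: float, kv_cache_gb: float, mem_gb: float):
    """Best-quality quant that fits, trying full context then half context."""
    for ctx_frac in (1.0, 0.5):          # full context first, then half
        for q in QUANT_ORDER:            # best quality first
            need = params_b * GB_PER_B[q] + kv_cache_gb * ctx_frac
            if need <= mem_gb:
                return q, ctx_frac
    return None                          # nothing fits: "Too Tight"

print(pick_quant(8, 2, 10))    # ('Q8_0', 1.0)
print(pick_quant(13, 4, 10))   # ('Q3_K_M', 1.0)
```

An 8B model on a 10 GB card gets full-quality Q8_0, while a 13B model on the same card has to drop to Q3_K_M, which is exactly the trade-off the hierarchy encodes.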

The TUI Experience

LLMfit's default interface is a full terminal UI built on ratatui. Your system specs appear at the top. Models fill a scrollable table sorted by composite score. Each row shows score, estimated tok/s, best quantization, run mode, memory usage, and use-case category.

Key bindings make navigation fast:

Key    Action
/      Search (partial match on name, provider, params, use case)
f      Cycle fit filter: All → Runnable → Perfect → Good → Marginal
s      Cycle sort: Score → Params → Mem% → Ctx → Date → Use Case
p      Open provider filter popup
t      Cycle color theme (6 built-in themes, saved automatically)
d      Pull/download selected model via Ollama
i      Toggle installed-first sorting (Ollama only)
Enter  Toggle detail view for selected model

Six built-in themes — Default, Dracula, Solarized, Nord, Monokai, and Gruvbox — let you match your terminal aesthetic. Your selection persists across sessions at ~/.config/llmfit/theme.

CLI Mode & Automation

For scripting and quick lookups, --cli gives classic table output. Subcommands target specific needs:

# Show detected hardware
llmfit system

# Top 5 perfectly fitting models
llmfit fit --perfect -n 5

# Search by name or size
llmfit search "qwen 8b"

# Detailed info on a specific model
llmfit info "Mistral-7B"

# Machine-readable recommendations (JSON)
llmfit recommend --json --limit 5

# Recommendations filtered by use case
llmfit recommend --json --use-case coding --limit 3

The recommend subcommand with --json is particularly useful for automation. Pipe it into scripts that auto-configure Ollama, or use it in CI/CD pipelines to select the right model for your deployment environment.
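A minimal consumer might look like the sketch below. The `recommend` flags are the documented ones, but the JSON schema assumed here (a list of objects with "name" and "score" keys) is a guess for illustration; inspect your llmfit version's actual output before relying on field names:

```python
import json
import subprocess

def pick_model(json_text: str) -> str:
    """Return the top-scoring model name from recommend --json output
    (assumed schema: list of {"name": ..., "score": ...} objects)."""
    recs = json.loads(json_text)
    return max(recs, key=lambda r: r["score"])["name"]

def recommend_json(use_case: str, limit: int = 3) -> str:
    """Shell out to llmfit with the flags shown above."""
    return subprocess.run(
        ["llmfit", "recommend", "--json", "--use-case", use_case,
         "--limit", str(limit)],
        capture_output=True, text=True, check=True,
    ).stdout

# With a hypothetical payload (model names here are examples only):
sample = '[{"name": "qwen2.5-coder:7b", "score": 87}, {"name": "codellama:13b", "score": 81}]'
print(pick_model(sample))  # qwen2.5-coder:7b
```

From there, the winning name can be fed straight into `ollama pull` in a provisioning script.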

Ollama Integration

LLMfit connects to Ollama's local API at http://localhost:11434 (or a custom host via OLLAMA_HOST). If Ollama is running, you get:

  • Install detection — Green ✓ marks for models you already have
  • One-key download — Press d to pull any model directly from the TUI
  • Progress tracking — Real-time download progress with animated indicators
  • Remote support — Connect to Ollama on another machine via OLLAMA_HOST="http://192.168.1.100:11434" llmfit

If Ollama isn't running, LLMfit still works normally; the download features are simply hidden from the status bar.

💡 Pro Tip: Remote GPU Server

Run LLMfit on your laptop while Ollama serves from your GPU server. Use OLLAMA_HOST to score models on the server's hardware and download them remotely — all from your local terminal.

Use Cases

1. First-Time Local AI Setup

You just bought a GPU or got a new MacBook. Run llmfit, filter by "Perfect" fit, and know exactly which models to download. No Reddit trawling required.

2. Coding Assistant Selection

Filter by coding use case to see CodeLlama, StarCoder2, WizardCoder, Qwen2.5-Coder, and Qwen3-Coder ranked specifically for your hardware. Speed gets weighted higher because latency matters when you're waiting for code completions.

3. Multi-GPU Rig Optimization

LLMfit aggregates VRAM across multiple GPUs. If you have 2× RTX 3090 (48 GB total), you'll see models that wouldn't fit on a single card suddenly ranked as "Perfect."

4. Edge Deployment Planning

Use --memory to simulate different hardware. Test what fits on a Jetson Nano (8 GB), a Raspberry Pi 5 (8 GB ARM), or a cloud VM with specific VRAM allocations before provisioning.

5. Automated Model Selection

Use llmfit recommend --json --use-case coding --limit 3 in CI/CD pipelines to automatically select and deploy the best model for your infrastructure.

Getting Started

Installation

Four options, all straightforward:

# Homebrew (macOS/Linux)
brew tap AlexsJones/llmfit
brew install llmfit

# Cargo (cross-platform, requires Rust)
cargo install llmfit

# Quick install script
curl -fsSL https://llmfit.axjns.dev/install.sh | sh

# From source
git clone https://github.com/AlexsJones/llmfit.git
cd llmfit && cargo build --release

First Run

Just type llmfit and the TUI launches. Your hardware is detected automatically. Browse, search, filter, and when you find a model you like — press d to download it via Ollama.

Quick Answers

# "What's the best model for my hardware?"
llmfit fit --perfect -n 1

# "Can I run Llama 3.1 70B?"
llmfit info "Llama-3.1-70B"

# "What coding models fit?"
llmfit recommend --json --use-case coding --limit 5

Competitors & Alternatives

Tool             Approach                      Key Difference
LLMfit           Hardware-aware model scoring  Multi-dimensional scoring, TUI, dynamic quantization, Ollama integration
Ollama           Model runner + registry       Runs models but doesn't recommend which ones fit your hardware
LM Studio        GUI model browser + runner    Desktop app with download + chat UI, but less scoring transparency
GPT4All          Desktop chat app              Focuses on ease of use over hardware optimization
Manual research  Reddit, HuggingFace browsing  Time-consuming, hardware-agnostic, easy to miss good options

LLMfit's niche is clear: it's not a model runner. It's a model selector. It answers "what should I run?" so tools like Ollama and LM Studio can answer "how do I run it?"

Pros & Cons

✅ Pros

  • Saves hours of trial-and-error model selection
  • Multi-dimensional scoring is more useful than binary fits/doesn't-fit
  • Excellent Apple Silicon support (unified memory = VRAM)
  • MoE-aware — correctly scores Mixtral, DeepSeek-V3 effective memory
  • Ollama integration for seamless download-from-TUI workflow
  • Written in Rust — fast startup, single binary, no dependencies
  • JSON output for scripting and automation
  • 6 color themes (Dracula, Nord, Gruvbox, etc.)
  • Free, open source (MIT license)

⚠️ Cons

  • Speed estimates are theoretical, not actual benchmarks
  • Model database is compiled-in — requires rebuild to add new models
  • 206 models is comprehensive but can't cover every fine-tune or merge
  • Terminal-only, no GUI on any platform (though the TUI and CLI run in any modern terminal, including on Windows)
  • Scoring weights are opinionated (may not match your priorities)
