๐ŸŽง
Listen to this article
AI-narrated ยท ~12 minutes
โ–ถ๏ธ Prefer video? Watch on YouTube โ†’ We break down the architecture and run the benchmarks live.

OpenAI Goes Open โ€” Why This Matters

On August 5, 2025, OpenAI did something it hadn't done since 2019: it released open-weight models. The last time this happened, the model was GPT-2 โ€” a 1.5B parameter transformer that OpenAI famously held back for months out of concern about misuse. Six years and several generations of closed frontier models later, the company shipped two open-weight LLMs under Apache 2.0: gpt-oss-120b and gpt-oss-20b.

These are not hobbyist releases. GPT-OSS-120b benchmarks near o4-mini on core reasoning tasks. GPT-OSS-20b compares favorably to o3-mini. Both support 128K context, full chain-of-thought reasoning, tool use, and structured outputs. Both can be self-hosted, fine-tuned, and deployed commercially โ€” with no royalties, no usage caps, and no vendor lock-in.

The models collectively accumulated 9 million downloads on HuggingFace within weeks of release. Enterprise partners โ€” AI Sweden, Orange, Snowflake โ€” announced immediate on-premises deployments. The open-source AI ecosystem lit up. This guide covers everything: what these models are, how they work under the hood, what hardware you need, and who should actually use them.

  • 9M+ HuggingFace downloads within weeks
  • 128K context window
  • 2019: last OpenAI open-weight release
  • Apache 2.0 license

What Is GPT-OSS? The Two Models

GPT-OSS ships as two distinct models optimized for different deployment tiers. Both share the same architectural family โ€” Mixture of Experts with chain-of-thought reasoning โ€” but differ dramatically in scale, hardware footprint, and target use case.

Property | gpt-oss-20b | gpt-oss-120b
Total Parameters | 21B | 117B
Active Params / Token | 3.6B | 5.1B
Transformer Layers | 24 | 36
Expert Count | 32 experts | 128 experts
Active Experts / Token | 4 | 4
Context Window | 128K tokens | 128K tokens
Quantization | MXFP4 (MoE weights) | MXFP4 (MoE weights)
Min VRAM | ~12–16GB | ~80–96GB
Reasoning Benchmark | ≈ o3-mini | ≈ o4-mini
License | Apache 2.0 | Apache 2.0
Primary Tier | Edge / Consumer / On-device | Enterprise / Single H100

The key insight from these numbers: despite gpt-oss-120b having 117 billion total parameters, only 5.1 billion are active during any given token generation step. This is the MoE efficiency story โ€” and it's why the model fits on a single 80GB GPU at all.

Architecture Deep Dive

Mixture of Experts (MoE)

Both GPT-OSS models use a Mixture of Experts architecture. In a standard dense transformer, every parameter fires for every token. In MoE, the feedforward layers are replaced with a collection of specialized "expert" sub-networks, and a learned routing function selects only a small subset for each token.

For gpt-oss-120b, there are 128 experts per MoE layer, and only 4 are activated per token. For gpt-oss-20b, there are 32 experts with 4 active. The result: you get the capacity and expressive power of a very large model, but the compute cost of a much smaller one. Inference is fast because you're only running 5.1B active parameters per forward pass, not 117B.
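The routing step can be sketched in a few lines of numpy (shapes and the toy experts here are illustrative; the real router is trained jointly with the experts and uses additional load-balancing objectives):

```python
import numpy as np

def moe_forward(x, gate_w, experts, k=4):
    """Route one token through a top-k Mixture of Experts layer.

    x:       (d,) token hidden state
    gate_w:  (d, n_experts) router weights
    experts: list of callables, each mapping (d,) -> (d,)
    """
    logits = x @ gate_w                    # one router score per expert
    top = np.argsort(logits)[-k:]          # indices of the k best experts
    weights = np.exp(logits[top] - logits[top].max())
    weights /= weights.sum()               # softmax over the selected k only
    # Only k expert networks run; the remaining n_experts - k stay idle.
    return sum(w * experts[i](x) for w, i in zip(weights, top))

# Toy setup mirroring gpt-oss-120b's routing shape: 128 experts, 4 active.
rng = np.random.default_rng(0)
d, n_experts = 16, 128
gate_w = rng.standard_normal((d, n_experts))
experts = [lambda v, W=rng.standard_normal((d, d)) * 0.1: W @ v
           for _ in range(n_experts)]
out = moe_forward(rng.standard_normal(d), gate_w, experts)
print(out.shape)  # (16,)
```

Whatever the expert count, the per-token cost is fixed by k, which is why capacity can scale without inference cost scaling with it.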

This architecture follows the same MoE playbook as DeepSeek-V3 and Qwen3, but OpenAI adds a novel twist: alternating dense and locally banded sparse attention. Dense attention layers handle global context integration at regular intervals; sparse attention with locally banded patterns handles the bulk of computation efficiently. This hybrid lets the model maintain global coherence without the quadratic cost of full attention at every layer.

MXFP4 Quantization

One of the most technically interesting aspects of GPT-OSS is its use of MXFP4 โ€” MicroScaling FP4 โ€” applied specifically to the MoE weights. Standard INT4 quantization often degrades model quality significantly. MXFP4 uses a block-scaled floating-point format that retains approximately 94% of the original model accuracy at 4.25 bits per parameter.
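The block-scaling idea can be illustrated with a simplified sketch (the real MXFP4 format follows the OCP Microscaling spec, with E2M1 4-bit elements and a shared 8-bit power-of-two scale per 32-element block; the details below are illustrative, not a spec-exact encoder):

```python
import numpy as np

FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])  # E2M1 magnitudes

def mx_quantize_block(block):
    """Quantize one 32-element block: a shared power-of-two scale plus
    4-bit FP4 elements. Storing the scale in 8 bits means a block costs
    32*4 + 8 = 136 bits, i.e. 4.25 bits per parameter."""
    assert block.size == 32
    amax = np.abs(block).max()
    # Pick the power-of-two scale that maps amax near E2M1's top exponent (2).
    scale = 2.0 ** (np.floor(np.log2(amax)) - 2) if amax > 0 else 1.0
    scaled = block / scale
    # Snap each value to the nearest representable FP4 magnitude, keeping sign.
    idx = np.abs(np.abs(scaled)[:, None] - FP4_GRID).argmin(axis=1)
    return scale, np.sign(scaled) * FP4_GRID[idx]

rng = np.random.default_rng(0)
w = rng.standard_normal(32).astype(np.float32)
scale, codes = mx_quantize_block(w)
err = np.abs(scale * codes - w).max()
print(f"max abs reconstruction error: {err:.3f}")
```

Because the scale is shared across only 32 values, each block adapts to its local dynamic range, which is where the accuracy advantage over plain INT4 comes from.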

The practical result: gpt-oss-120b with MXFP4 quantization on its MoE layers fits within 80โ€“96GB of GPU VRAM โ€” the capacity of a single H100 80GB. Without quantization, a 117B-parameter model would require substantially more memory. The quantization is applied only to the MoE feedforward weights, not the attention layers, which preserves precision where it matters most for reasoning quality.
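A back-of-envelope check of that VRAM figure, assuming (illustratively) that roughly 110B of the 117B parameters sit in the MoE expert layers:

```python
# Rough weight-memory estimate for gpt-oss-120b. The MoE/dense parameter
# split below is an assumption for illustration, not a published figure.
total_params = 117e9
moe_params = 110e9                      # assumed bulk of params in experts
dense_params = total_params - moe_params

moe_bytes = moe_params * 4.25 / 8       # MXFP4: 4.25 bits per parameter
dense_bytes = dense_params * 2          # attention etc. kept in bf16
total_gb = (moe_bytes + dense_bytes) / 1e9
print(f"~{total_gb:.0f} GB for weights")
```

That lands in the low 70s of GB for weights alone; KV cache and activations for long contexts account for the rest of the 80–96GB envelope.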

Chain-of-Thought and Adjustable Reasoning Effort

Both GPT-OSS models support full chain-of-thought reasoning, trained using reinforcement learning techniques derived from o3 and other OpenAI frontier models. The reasoning is not just bolted-on prompting โ€” it's native to the model's training process.

Critically, reasoning effort is adjustable. You can dial from fast, shallow responses suitable for simple tasks to extended multi-step reasoning for complex problems. This mirrors the "thinking budget" concept introduced in frontier reasoning models, and it means you're not paying the compute cost of deep reasoning on every inference call โ€” only when you need it.
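In practice the effort level is conveyed through the system message, per the pattern used by gpt-oss's chat template (a minimal sketch; exact phrasing and plumbing can vary by serving stack):

```python
def build_messages(user_prompt, effort="medium"):
    """Build a chat request selecting gpt-oss's reasoning effort.

    The "Reasoning: <level>" system line follows the pattern used by the
    gpt-oss chat template; treat the exact string as stack-dependent.
    """
    assert effort in ("low", "medium", "high")
    return [
        {"role": "system", "content": f"Reasoning: {effort}"},
        {"role": "user", "content": user_prompt},
    ]

# Cheap call for a lookup-style question...
quick = build_messages("What year was GPT-2 released?", effort="low")
# ...extended multi-step reasoning only when the task warrants it.
deep = build_messages("Plan a multi-stage data migration.", effort="high")
print(quick[0]["content"])  # Reasoning: low
```

Because the knob is per-request, an application can route easy queries at low effort and reserve high effort for the small fraction of calls that need it.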

Extended Context: YaRN + Sliding Window

The native context window for both models is 4,096 tokens, extended to 128K via YaRN (Yet another RoPE extensioN) combined with a sliding window attention mechanism. YaRN rescales the rotary positional embedding frequencies to handle longer sequences without catastrophic degradation; the sliding window bounds the attention compute cost at long contexts while preserving local coherence.
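The core idea can be seen in a simplified position-interpolation sketch (YaRN's actual scheme is more nuanced: it interpolates per-frequency rather than uniformly and adds an attention temperature term):

```python
import numpy as np

def rope_angles(pos, dim=64, base=10000.0, scale=1.0):
    """Rotary embedding angles for one position.

    scale > 1 compresses positions so a longer sequence maps back into the
    originally trained range; a simplified stand-in for YaRN's rescaling.
    """
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)
    return (pos / scale) * inv_freq

native, extended = 4096, 131072
scale = extended / native  # 32x context extension
# The last position of the extended window lands inside the trained range:
a = rope_angles(131071, scale=scale)
b = rope_angles(131071 / scale)   # identical to a native position near 4096
print(np.allclose(a, b))  # True
```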

Tool Use and API Compatibility

Both models are natively trained for tool use: web search, Python code execution, and function calling. They are fully compatible with OpenAI's Responses API and Structured Outputs format. The Harmony response format โ€” which supports multi-channel message delivery โ€” is also supported, making these models drop-in compatible with existing OpenAI-based agent frameworks.

Performance Benchmarks

OpenAI's positioning is straightforward: gpt-oss-120b performs near o4-mini on core reasoning benchmarks; gpt-oss-20b performs comparably to o3-mini. Both outperform other open-weight models at their respective parameter scales.

HealthBench: A Standout Result

The most striking benchmark result is HealthBench, OpenAI's medical AI evaluation suite. GPT-OSS outperforms not just other open-weight models but proprietary ones โ€” including o1 and GPT-4o โ€” on this dataset. This is significant for healthcare and regulated industries: it means a locally deployable, HIPAA-compatible model can now outperform the cloud-based APIs that previously set the standard.

Model | Reasoning | HealthBench | Deployment
gpt-oss-120b | ≈ o4-mini | Beats o1, GPT-4o | Single H100 80GB
gpt-oss-20b | ≈ o3-mini | Strong | 16GB consumer GPU
o4-mini (OpenAI) | Frontier | Baseline | API only
o3-mini (OpenAI) | Strong | Baseline | API only
GPT-4o | Strong | Below gpt-oss-120b | API only

๐Ÿ’ก The Efficiency Story

gpt-oss-120b has 117B total parameters but only activates 5.1B per token. Compare this to a dense 70B model that activates all 70B per token โ€” GPT-OSS is doing less compute per token while delivering better benchmark results. MoE efficiency is no longer theoretical.
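The gap is easy to quantify, using the rule of thumb that transformer FLOPs per token scale roughly as twice the active parameter count:

```python
# Per-token compute is driven by *active* parameters, not total parameters.
dense_70b_active = 70e9       # a dense model activates everything
gpt_oss_120b_active = 5.1e9   # MoE activates only the routed experts

ratio = dense_70b_active / gpt_oss_120b_active
print(f"A dense 70B model does ~{ratio:.0f}x more compute per token")
```

Roughly 14x less compute per token, at comparable or better benchmark quality, is the whole economic argument for MoE.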

Hardware Requirements

gpt-oss-20b: Consumer Tier

With MXFP4 quantization, gpt-oss-20b can run in approximately 12โ€“16GB of GPU VRAM. This puts it within reach of widely-available consumer hardware:

Hardware | VRAM | Verdict
NVIDIA RTX 4090 | 24GB | ✅ Comfortable — room for context
NVIDIA RTX 3090 | 24GB | ✅ Comfortable
NVIDIA RTX 4080 | 16GB | ⚠️ Tight — limited context length
Apple M2/M3 (16GB+) | 16GB+ | ✅ Runs well via llama.cpp/MLX
Apple M3 Ultra (192GB) | 192GB | ✅ Both models simultaneously
AMD RX 7900 XTX (ROCm) | 24GB | ✅ ROCm support confirmed

Storage recommendation: NVMe SSD for fast model load times. gpt-oss-20b with MXFP4 weighs approximately 10โ€“12GB on disk.

gpt-oss-120b: Prosumer / Enterprise Tier

The 120B model requires substantially more memory. With MXFP4 quantization applied to MoE weights, the VRAM requirement is approximately 80โ€“96GB:

Hardware | VRAM | Verdict
NVIDIA H100 80GB (single) | 80GB | ✅ Ideal — purpose-built for this
NVIDIA A100 80GB × 2 | 160GB | ✅ More than enough headroom
NVIDIA RTX 3090 × 4 | 96GB | ✅ Multi-GPU DIY rig — works via tensor parallel
Apple M3 Ultra (192GB) | 192GB | ✅ Unified memory — excellent throughput
NVIDIA RTX 4090 × 2 | 48GB | ⚠️ Insufficient — need heavier quantization
Dell Pro Max GB300 (NVL72) | 1.5TB+ | ✅ Enterprise — multiple instances

For teams without H100 access, the 4ร— RTX 3090 route (PCIe, tensor-parallel via vLLM) is the most cost-effective path. Expect roughly 10โ€“15 tokens/second in this configuration โ€” fast enough for most interactive and batch use cases.

How to Run Locally

Ollama (Easiest)

# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Run gpt-oss-20b (consumer GPUs)
ollama run gpt-oss:20b

# Run gpt-oss-120b (H100 / multi-GPU)
ollama run gpt-oss:120b

# OpenAI-compatible REST API (served automatically by ollama serve)
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-oss:20b",
    "messages": [{"role": "user", "content": "Explain MoE in one paragraph"}]
  }'

vLLM (Production Serving)

pip install vllm

# Serve gpt-oss-20b
vllm serve openai/gpt-oss-20b \
  --tensor-parallel-size 1 \
  --max-model-len 131072 \
  --gpu-memory-utilization 0.92

# Serve gpt-oss-120b across 4ร— RTX 3090
vllm serve openai/gpt-oss-120b \
  --tensor-parallel-size 4 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90

HuggingFace Transformers

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "openai/gpt-oss-20b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto"
)

messages = [
    {"role": "system", "content": "You are a helpful assistant. Think step by step."},
    {"role": "user", "content": "What is the difference between MoE and dense transformers?"}
]

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=1024)
print(tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))

Both models are also supported by llama.cpp and Ollama for CPU/Apple Silicon inference, with Flash Attention 3 support for NVIDIA users and confirmed AMD ROCm compatibility.

Use Cases In Depth

๐Ÿฅ Healthcare & Regulated Industries

GPT-OSS's HealthBench performance โ€” surpassing o1 and GPT-4o โ€” combined with local deployment makes it uniquely positioned for HIPAA-compliant AI. PHI never leaves your firewall. Fine-tune on clinical notes, radiology reports, or proprietary medical literature without sending data to the cloud.

๐Ÿข Private Enterprise Deployment

Run behind your own firewall. No API keys, no usage tracking, no third-party data exposure. Partners like AI Sweden, Orange, and Snowflake are deploying on-premises for exactly this reason โ€” full control over data residency and model behavior.

๐Ÿค– Agentic Workflows

Native tool use (web search, Python execution, function calling), Responses API compatibility, and Structured Outputs make GPT-OSS a drop-in backbone for agent frameworks. Adjustable reasoning effort means you can dial cost vs. quality per-task in an agent loop.
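A tool is declared to the model in the OpenAI function-calling schema, which gpt-oss's tool-use training targets (the `get_weather` tool below is a hypothetical example, not a built-in):

```python
import json

# Hypothetical tool declared in the OpenAI function-calling schema.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["city"],
        },
    },
}]

# In an agent loop, the model emits the call's arguments as JSON,
# which the framework parses and dispatches to the real function:
model_tool_call = '{"city": "Stockholm", "unit": "celsius"}'
args = json.loads(model_tool_call)
print(args["city"])  # Stockholm
```

The same request/response shape works against Ollama's and vLLM's OpenAI-compatible endpoints, which is what makes GPT-OSS a drop-in swap for API-backed agents.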

๐Ÿ”ฌ Fine-Tuning on Specialized Data

Apache 2.0 licensing allows unrestricted fine-tuning and commercial deployment of derivatives. Teams with domain-specific datasets โ€” legal contracts, scientific literature, proprietary codebases โ€” can fine-tune both models without restriction.

๐Ÿ“ฑ Edge & On-Device AI

gpt-oss-20b at 12โ€“16GB VRAM opens on-device deployment on Apple M-series laptops, edge servers, and embedded GPU systems. This is the first OpenAI-trained model in this capability tier that can run entirely offline.

๐Ÿ’ป Code Generation & Developer Tooling

Strong tool-use training and CoT reasoning make GPT-OSS capable for coding agents and dev tooling. Run as a local Copilot alternative โ€” no subscription, no data sent to external servers, configurable context length up to 128K for large codebases.

Who It's For: A Tiered Breakdown

๐Ÿง‘โ€๐Ÿ’ป Hobbyist / Developer (gpt-oss-20b)

You have an RTX 3090/4090 or an Apple M2/M3 Mac with 16GB+ unified memory. You want a capable, fast, locally-running LLM for coding, writing, and general-purpose tasks. gpt-oss-20b gives you o3-mini-class reasoning on hardware you already own. Ollama installation is five minutes.

๐Ÿ—๏ธ Small Team / Startup (gpt-oss-20b or 120b)

You're building a product and want to avoid per-token API costs at scale. gpt-oss-20b works on a single consumer GPU server. gpt-oss-120b works on a single rented H100 (available from Lambda Labs, RunPod, Vast.ai at $2โ€“3/hr). Apache 2.0 means you can build commercial products without licensing headaches.

๐Ÿฅ Healthcare / Legal / Finance (gpt-oss-120b)

Data residency requirements, HIPAA/GDPR constraints, or proprietary training data make cloud API use untenable. gpt-oss-120b's HealthBench superiority over o1 and GPT-4o makes it the strongest open-weight option for regulated industry AI. Deploy on-premises, fine-tune on your data, maintain full control.

๐Ÿข Enterprise (gpt-oss-120b)

You need fleet-scale deployment, SLA guarantees, and security control that cloud APIs can't offer. Partners like Snowflake and Orange are already here. gpt-oss-120b fits a single H100 80GB โ€” one node per model instance, horizontally scalable. Azure and Dell have announced certified deployment configurations.

๐Ÿ”ฌ AI Researcher / Fine-Tuner

You want to study and improve frontier-class reasoning models. The Apache 2.0 license and open weights give you full access to the model internals. Train domain-specific derivatives, study the MoE routing behavior, run ablations. This is the first OpenAI model you can actually dissect since GPT-2.

GPT-OSS Safeguard: The Safety Companion Models

In October 2025, OpenAI followed up with gpt-oss-safeguard โ€” safety reasoning models in both 120B and 20B sizes. These are specialized models trained for policy classification: evaluating whether model outputs comply with content policies, detecting harmful completions, and providing reasoning-level safety oversight.

The safeguard models are designed to be run alongside the main GPT-OSS models in production pipelines โ€” a companion that checks outputs before they reach users. This is particularly relevant for regulated industry deployments where human-in-the-loop review is insufficient at scale but policy compliance is non-negotiable.
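The pipeline shape is simple: generate, then classify against a written policy before releasing the output. A minimal control-flow sketch, with both model calls stubbed (in a real deployment these would be requests to two locally served models):

```python
def guarded_generate(prompt, generate, classify, policy):
    """Return the model's answer only if the safeguard approves it."""
    draft = generate(prompt)
    verdict = classify(policy=policy, output=draft)  # safeguard reasons over the policy text
    return draft if verdict["compliant"] else "[withheld: " + verdict["reason"] + "]"

# Stubs standing in for the two models:
generate = lambda p: f"Answer to: {p}"
classify = lambda policy, output: {
    "compliant": "bomb" not in output,   # toy stand-in for policy reasoning
    "reason": "violates: " + policy,
}

ok = guarded_generate("How do MoE routers work?", generate, classify,
                      policy="No weapons instructions.")
blocked = guarded_generate("bomb recipe", generate, classify,
                           policy="No weapons instructions.")
print(ok)
print(blocked)
```

Because the safeguard takes the policy as input rather than baking it into weights, the same classifier can enforce different policies per product or jurisdiction.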

Before release, gpt-oss-120b itself was adversarially fine-tuned and evaluated under OpenAI's Preparedness Framework, the same safety evaluation process applied to OpenAI's own frontier models, to estimate worst-case misuse potential. For enterprise teams deploying AI in high-stakes contexts, this represents a significant step: you're not just getting the model, you're getting the safety infrastructure that OpenAI uses internally.

The Bigger Picture: OpenAI's Strategic Shift

GPT-OSS represents a meaningful reversal of the trajectory OpenAI had followed since GPT-2. For six years, every major capability improvement stayed behind the API. The open-source ecosystem learned to chase OpenAI rather than collaborate with it. The release of GPT-OSS changes that dynamic in ways that will take time to fully manifest.

Why now? Several forces converged:

  • Competitive pressure from open-weight peers. DeepSeek-V3, Qwen3, and Llama 4 demonstrated that capable open models were closing the gap on proprietary ones. Staying closed no longer guaranteed capability leadership.
  • Enterprise demand for on-premises. The regulated industries โ€” healthcare, finance, government โ€” needed self-hostable models that they could fine-tune on proprietary data. No amount of API-level privacy guarantees fully addresses HIPAA and GDPR concerns.
  • Ecosystem positioning. By releasing open-weight models, OpenAI enters the broader developer ecosystem as a participant rather than a gatekeeper. Nine million downloads in weeks signal that this positioning is working.
  • Safety as a differentiator. The gpt-oss-safeguard release makes the point explicitly: OpenAI believes it can deliver open weights safely, and it wants to model what responsible open-weight releases look like for the industry.

The Apache 2.0 license is the most permissive choice possible. No royalties, no usage restrictions beyond legal compliance, no "not for commercial use" carve-outs. Combined with the minimal usage policy ("comply with applicable law"), this gives developers and enterprises maximum flexibility.

What this means practically: the GPT-OSS release is likely not a one-time event. OpenAI has signaled a new willingness to engage the open-weight ecosystem. Future releases โ€” smaller distillations, specialized fine-tunes, updated versions โ€” are a reasonable expectation.

โœ… Strengths

  • Apache 2.0 โ€” full commercial freedom
  • Near o4-mini reasoning (120b) and o3-mini (20b)
  • MXFP4: 94% accuracy retention at 4.25 bits/param
  • Best-in-class HealthBench โ€” beats o1 and GPT-4o
  • Native tool use: search, Python, function calling
  • Adjustable CoT reasoning effort
  • Wide ecosystem support: Ollama, vLLM, llama.cpp, Transformers, ROCm
  • gpt-oss-20b fits on consumer hardware (16GB)
  • gpt-oss-120b fits on single H100 80GB

โš ๏ธ Considerations

  • gpt-oss-120b requires H100 or equivalent โ€” not truly consumer-accessible
  • 4K native context (128K via YaRN โ€” may degrade at extreme lengths)
  • Limited public benchmarks outside OpenAI's own reporting
  • No vision/multimodal capability (text only)
  • Fine-tuning at 120B scale requires significant compute
