Running TTS Locally — How We Ditched OpenAI's Voice API for a GPU Rig

We replaced OpenAI's paid text-to-speech API with open-source Kokoro TTS running on a 4× RTX 3090 GPU rig. Every article narration, every podcast, and every video voiceover now costs $0 in API fees.

February 28, 2026

Builder's Guide

We were spending real money — every single day — on OpenAI's text-to-speech API. News articles, research post narrations, podcast episodes, YouTube video voiceovers. Every character cost money. Then we built a GPU rig, installed an open-source TTS model, and changed one URL. Now it's all free. Here's exactly how we did it.

This isn't a theoretical guide. This is a build log from our actual infrastructure at ThinkSmart.Life. We run an AI-powered content pipeline that generates research posts, narrates them with human-quality audio, creates videos, and publishes everything automatically. Every piece of that pipeline used to call OpenAI's /v1/audio/speech endpoint. Today, it calls the same endpoint — but on our own machine, at http://10.0.0.79:8880.

The model is called Kokoro. It's 82 million parameters, open-source, Apache 2.0 licensed, and it sounds remarkably good. Let's break it all down.

Cost per audio generation: $0
Kokoro model parameters: 82M
Total VRAM (4× RTX 3090): 96GB
Config change to switch: 1 URL

The Cloud TTS Problem

OpenAI's TTS API is excellent. The voices sound natural, the latency is low, and integration is dead simple. But it has one unavoidable problem: it charges per character.

Here's the current pricing as of early 2026:

Model	Price	Quality
`tts-1` (Standard)	$15 / 1M characters	Good
`tts-1-hd` (HD)	$30 / 1M characters	Excellent
`gpt-4o-mini-tts`	$0.60 / 1M input tokens + $12 / 1M audio output tokens	Excellent + controllable

For a single blog post narration (~5,000 characters), you're looking at roughly $0.075 on tts-1 or $0.15 on tts-1-hd. That sounds cheap — until you're generating audio for 10+ articles per week, plus podcasts, plus video narrations, plus agent voice responses.
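The per-post figures follow directly from the per-character rates in the table. Here is a minimal sketch of that arithmetic; the function and constant names are illustrative, not part of any API:

```python
# List prices from the table above, in USD per 1M characters.
PRICE_PER_MILLION_CHARS = {"tts-1": 15.0, "tts-1-hd": 30.0}


def narration_cost(num_chars: int, model: str) -> float:
    """Return the USD cost of synthesizing num_chars characters with model."""
    return num_chars * PRICE_PER_MILLION_CHARS[model] / 1_000_000


if __name__ == "__main__":
    for model in PRICE_PER_MILLION_CHARS:
        cost = narration_cost(5_000, model)
        print(f"{model}: ${cost:.3f} per 5,000-character post")
```

Swapping in your own character counts gives a quick estimate before committing to either route.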

⚠️ The Math Adds Up Fast

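At 10 articles/week × 5,000 characters × tts-1-hd pricing, blog narrations alone come to about $1.50/week, or roughly $6.50/month ($78/year). Add podcast episodes (20,000+ chars each), video scripts, and agent TTS, and the bill grows with every new use case. Reaching $200-400/month takes roughly 7-13 million characters per month at tts-1-hd rates, which is where a serious content operation can end up.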

And that's before you factor in the other cloud dependency risks: rate limits during peak usage, API changes that break your pipeline, latency spikes that slow down your content workflow, and zero control over voice model updates that might change how your content sounds.

What Is Kokoro TTS?

Kokoro is a frontier text-to-speech model with only 82 million parameters. That "only" is doing serious work: despite being a fraction of the size of other open models such as XTTS v2 (~467M) and F5-TTS (~335M), Kokoro produces speech quality that rivals commercial offerings from OpenAI, Google, and ElevenLabs.

Key Specs

Size: 82M parameters — small enough to run on a single GPU or even a CPU
Output: 24kHz high-fidelity audio
Format: Available as ONNX (cross-platform, GPU/CPU) and PyTorch
License: Apache 2.0 — fully open, commercial use allowed
Languages: English, Japanese, Chinese, with more coming
Voices: 50+ built-in voices with the ability to blend them
Architecture: Single-pass generation (no separate vocoder needed)

The model was created by hexgrad and has been embraced by the open-source community. The ONNX version, maintained by thewh1teagle, makes it trivially easy to run anywhere — Python, JavaScript, Rust, even mobile devices via Expo.

"Kokoro is a frontier TTS model for its size. 82 million parameters, text in, audio out." — Kokoro model card, Hugging Face

Why Kokoro Over Other Open-Source TTS?

The open-source TTS landscape has exploded in 2025-2026. Here's how Kokoro compares:

Model	Parameters	Voice Cloning	Speed	Quality
Kokoro-82M	82M	No (50+ built-in + blending)	Very fast	Excellent
XTTS v2 (Coqui)	~467M	Yes (zero-shot)	Moderate	Very Good
F5-TTS	~335M	Yes (zero-shot)	Fast	Excellent
Piper	Varies	No	Very fast	Good

Kokoro's sweet spot: it's the fastest high-quality option. If you need voice cloning, look at F5-TTS or XTTS v2. If you need speed + quality with great built-in voices, Kokoro wins.

Our GPU Rig: The Hardware

We built a dedicated AI inference machine. It's not a gaming PC — it's a workhorse designed for running multiple AI models simultaneously.

The Specs

GPUs: 4× NVIDIA RTX 3090 (24GB VRAM each = 96GB total)
OS: Ubuntu Server
Primary use: Ollama (LLM inference) + Kokoro TTS + future models
Network: Exposed on local network at 10.0.0.79

💡 You Don't Need 4× 3090s for TTS

Kokoro-82M is tiny. It runs comfortably on a single GPU — even an RTX 3060 — or on CPU. Our 4× 3090 setup is because we also run Ollama for LLM inference (Llama, DeepSeek, etc.). The TTS is almost a freebie on top of our existing rig. Even a $200 used RTX 3060 would be more than enough for Kokoro alone.

The economics are simple: a used RTX 3090 costs ~$700. That one GPU can serve TTS, LLM inference, image generation, and more. If it replaces $200-400/month in cloud API costs, the hardware pays for itself in about 2-4 months.
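The payback estimate is just the hardware cost divided by the monthly API spend it replaces. A minimal sketch follows; payback_months is an illustrative name, and electricity is ignored for simplicity:

```python
def payback_months(hardware_cost: float, monthly_api_spend: float) -> float:
    """Months until the hardware cost equals the API spend it replaces."""
    if monthly_api_spend <= 0:
        raise ValueError("monthly_api_spend must be positive")
    return hardware_cost / monthly_api_spend


if __name__ == "__main__":
    # One used RTX 3090 (~$700) against the $200-400/month range quoted above.
    for spend in (200, 400):
        print(f"${spend}/month -> {payback_months(700, spend):.2f} months")
```

The result is only as good as the spend estimate, so it is worth plugging in your real monthly bill rather than a projected one.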

Setup Guide: From Zero to Local TTS

Here's exactly how we set up Kokoro TTS with an OpenAI-compatible API. This uses Kokoro-FastAPI, a Dockerized wrapper that exposes Kokoro as a drop-in replacement for OpenAI's TTS API.

Step 1: Prerequisites

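# Ubuntu with NVIDIA drivers installed
nvidia-smi  # Verify GPU is detected

# Install Docker
sudo apt-get update
sudo apt-get install -y docker.io

# Install the NVIDIA Container Toolkit. The package comes from NVIDIA's apt
# repository, so add that repository first (see NVIDIA's install guide).
sudo apt-get install -y nvidia-container-toolkit

# Register the NVIDIA runtime with Docker, then restart the daemon
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker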

Step 2: Pull and Run Kokoro-FastAPI

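# Clone the repo
git clone https://github.com/remsky/Kokoro-FastAPI.git
cd Kokoro-FastAPI

# Run with Docker Compose (GPU mode)
# Note: the compose file location depends on the repo version. If your checkout
# has no docker-compose.gpu.yml at the root, look under docker/gpu/ instead.
docker compose -f docker-compose.gpu.yml up -d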

That's it. The server starts on port 8880 by default.

Step 3: Test It

# Generate speech (same format as OpenAI!)
curl -X POST http://localhost:8880/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "model": "kokoro",
    "input": "Hello! This is Kokoro running locally on my GPU rig.",
    "voice": "af_bella"
  }' \
  --output test.mp3

# Play it
mpv test.mp3  # or any audio player
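The same test can be run from Python without any SDK. This is a minimal sketch using only the standard library; build_speech_request and synthesize are illustrative helper names, and the payload mirrors the curl call above:

```python
import json
import urllib.request


def build_speech_request(base_url: str, text: str, voice: str = "af_bella") -> urllib.request.Request:
    """Build (but do not send) a POST request for /v1/audio/speech."""
    payload = {"model": "kokoro", "input": text, "voice": voice, "response_format": "mp3"}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize(base_url: str, text: str, out_path: str, voice: str = "af_bella") -> None:
    """Send the request and write the returned audio bytes to out_path."""
    request = build_speech_request(base_url, text, voice)
    with urllib.request.urlopen(request, timeout=60) as response:
        audio = response.read()
    with open(out_path, "wb") as f:
        f.write(audio)
```

With the server running, synthesize("http://localhost:8880", "Hello from Python.", "test_py.mp3") should write a playable file. Keeping request construction separate from sending makes the payload easy to inspect without a live server.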

Step 4: Expose on Your Network

# By default, Kokoro-FastAPI binds to 0.0.0.0:8880
# It's already accessible from other machines on your network at:
# http://YOUR_GPU_RIG_IP:8880

# In our case:
# http://10.0.0.79:8880/v1/audio/speech
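Before pointing a pipeline at the rig, it helps to confirm the port is reachable from the client machine. A minimal sketch using a plain TCP connect; is_port_open is an illustrative name, and a successful connect only shows that something is listening, not that the TTS server is healthy:

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False
```

From another machine on the network, is_port_open("10.0.0.79", 8880) returning False usually points to a firewall rule or a container that is not running.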

✅ Available Voices

Kokoro comes with 50+ voices. Some highlights: af_bella, af_sarah, am_adam, am_michael, bf_emma, bm_george. You can even blend voices: af_bella+am_adam creates a mix. List all available voices at GET /v1/audio/voices.
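A small sketch of how voice listing and blending might look from Python. The helper names are illustrative, and the {"voices": [...]} response shape is an assumption about the server's JSON; check what your version actually returns:

```python
import json
import urllib.request


def blend_voices(*voices: str) -> str:
    """Join voice names with '+', the blend syntax shown above (af_bella+am_adam)."""
    if not voices:
        raise ValueError("at least one voice is required")
    return "+".join(voices)


def list_voices(base_url: str) -> list[str]:
    """Fetch voice names from GET /v1/audio/voices.

    Assumes a JSON body shaped like {"voices": [...]}; adjust the key if your
    server version responds differently.
    """
    url = f"{base_url.rstrip('/')}/v1/audio/voices"
    with urllib.request.urlopen(url, timeout=10) as response:
        return json.loads(response.read())["voices"]
```

The string returned by blend_voices("af_bella", "am_adam") can be passed as the "voice" field of a speech request.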

The OpenAI-Compatible Trick

This is the most powerful part of the setup. Kokoro-FastAPI exposes the exact same API contract as OpenAI's TTS:

POST /v1/audio/speech
{
  "model": "kokoro",
  "input": "Text to speak",
  "voice": "af_bella",
  "response_format": "mp3",
  "speed": 1.0
}

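This means any tool, script, or application that uses OpenAI's TTS API can switch to your local server by pointing it at a different base URL. No library swaps and no changes to request logic: the only values that differ are the model and voice names (kokoro and af_bella in the examples here, in place of OpenAI's).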

Python Example (OpenAI SDK)

from openai import OpenAI

# Before: using OpenAI's cloud
# client = OpenAI(api_key="sk-...")

# After: using our local Kokoro server
client = OpenAI(
    base_url="http://10.0.0.79:8880/v1",
    api_key="not-needed",  # the SDK requires a value; no auth is needed locally
)

# Stream the audio straight to disk. Calling stream_to_file() on the plain
# create() response is deprecated in current SDK versions.
with client.audio.speech.with_streaming_response.create(
    model="kokoro",
    voice="af_bella",
    input="This is being generated locally, for free.",
) as response:
    response.stream_to_file("output.mp3")

curl Example

# Just change the URL — same payload works
curl -X POST http://10.0.0.79:8880/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"model":"kokoro","input":"Free speech synthesis.","voice":"af_bella"}' \
  -o output.mp3

Environment Variable Swap

# In your .env or config:
# OLD:
# TTS_BASE_URL=https://api.openai.com/v1
# TTS_API_KEY=sk-your-key-here

# NEW:
TTS_BASE_URL=http://10.0.0.79:8880/v1
TTS_API_KEY=not-needed
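One way a script might consume these two variables is sketched below. The helper names are illustrative, and the defaults are assumptions chosen so the code falls back to OpenAI's hosted endpoint when nothing is set:

```python
import os


def load_tts_settings(env=os.environ) -> dict:
    """Read TTS_BASE_URL and TTS_API_KEY from env, with fallback defaults."""
    return {
        "base_url": env.get("TTS_BASE_URL", "https://api.openai.com/v1"),
        # A real key is required when talking to OpenAI; the placeholder is
        # only suitable for a local server that does not check it.
        "api_key": env.get("TTS_API_KEY", "not-needed"),
    }


def make_client(env=os.environ):
    """Build an OpenAI SDK client from the settings (requires the openai package)."""
    from openai import OpenAI  # imported lazily so load_tts_settings has no dependencies

    return OpenAI(**load_tts_settings(env))
```

Because the endpoint lives in configuration rather than code, switching back to the cloud is a matter of restoring the old values.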

💡 This Pattern is Everywhere

The "OpenAI-compatible API" pattern is becoming the standard interface for self-hosted AI. Ollama uses it for LLMs (/v1/chat/completions). Kokoro-FastAPI uses it for TTS (/v1/audio/speech). This means you can self-host your entire AI stack behind the same API contract your tools already expect.

Real-World Integration: Our News Pipeline

At ThinkSmart.Life, we run an automated content pipeline. Our AI agents research topics, write blog posts, generate audio narrations, create video slides, and publish — all hands-free. Here's how we integrated local TTS:

The Config Change

# In our tts-generator.py config:
LOCAL_TTS_URL = "http://10.0.0.79:8880"

# The generator already used OpenAI's API format,
# so the only change was this URL.
# Before: https://api.openai.com
# After:  http://10.0.0.79:8880

What's Now Running Locally

Research post narrations — Every article on /research/ gets an audio version
Video voiceovers — YouTube videos use locally-generated narration
News digest audio — Daily news summaries converted to listenable audio
Agent voice responses — AI agents can speak through local TTS

All of this was previously hitting OpenAI's API. Now it hits our local rig. The audio quality difference? Minimal. The cost difference? 100%.

Cost Analysis: The Real Numbers

Let's do honest math on what this saves.

Our Monthly TTS Usage (Estimated)

Content Type	Volume	Characters/Month	OpenAI tts-1-hd Cost
Research post narrations	~15 posts	~75,000	$2.25
Video voiceovers	~15 videos	~120,000	$3.60
News digests	~30 digests	~300,000	$9.00
Agent TTS (misc)	Variable	~100,000	$3.00
Total		~595,000	$17.85/month

At our current scale, the direct API savings are about $18/month. That's modest. But here's the thing: we're scaling up. As we add more content types, more languages, and more agents that speak, that number grows linearly with OpenAI — but stays at $0 with local TTS.
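The total in the table can be reproduced from the per-row character counts. A minimal sketch, with illustrative names and the tts-1-hd rate of $30 per 1M characters:

```python
HD_PRICE_PER_MILLION = 30.0  # USD per 1M characters for tts-1-hd

# Estimated characters per month, copied from the table above.
MONTHLY_CHARS = {
    "Research post narrations": 75_000,
    "Video voiceovers": 120_000,
    "News digests": 300_000,
    "Agent TTS (misc)": 100_000,
}


def monthly_cost(chars_by_type: dict, price_per_million: float) -> float:
    """Total monthly USD cost for the given character volumes."""
    return sum(chars_by_type.values()) * price_per_million / 1_000_000


if __name__ == "__main__":
    total_chars = sum(MONTHLY_CHARS.values())
    cost = monthly_cost(MONTHLY_CHARS, HD_PRICE_PER_MILLION)
    print(f"{total_chars:,} chars -> ${cost:.2f}/month")
```

Because the cost is linear in characters, ten times the volume means ten times the bill on the API side.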

🔑 The Real Value: Unlimited Usage

The biggest win isn't cost savings on current usage — it's removing the cost ceiling entirely. Want to generate 100 audio versions of a post in different voices for A/B testing? Free. Want every agent to have a voice? Free. Want to run TTS on every incoming email? Free. When something is free, you use it in ways you'd never consider at $15-30 per million characters.

Hardware ROI

Our GPU rig cost roughly $4,000 to build (4× used RTX 3090s at ~$700 each + motherboard, PSU, etc.). But TTS is just one of many services running on it. The rig also runs Ollama for LLM inference (saving us from GPT-4 API calls), and will soon run image generation. The TTS is effectively a free bonus.

If you're building the rig just for TTS, a single used RTX 3060 (~$200) would be sufficient. At our current ~$18/month in API costs that takes about 11 months to pay back; at $100-200/month of API spend, ROI is 1-2 months.

Quality Comparison: Kokoro vs. OpenAI

Let's be honest about the tradeoffs:

Aspect	OpenAI tts-1-hd	Kokoro-82M
Naturalness	⭐⭐⭐⭐⭐	⭐⭐⭐⭐
Voice variety	9 voices	50+ voices + blending
Long-form stability	Excellent	Very good (occasional artifacts on very long texts)
Latency (first byte)	~500ms (network dependent)	~100-300ms (local)
Cost	$30/1M chars	$0 per character (after hardware and electricity)
Privacy	Data sent to OpenAI	100% local
Uptime	Depends on OpenAI	Depends on your hardware

OpenAI's tts-1-hd is still the gold standard for naturalness. By our own informal listening, Kokoro is 85-90% of the way there, and it keeps improving with each release. For content pipelines, podcasts, and video narration, the quality difference is negligible to most listeners. For applications where you need the absolute best voice quality (like a customer-facing voice assistant), you might still want OpenAI or ElevenLabs.
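For the long-form artifacts noted in the table, one common client-side mitigation is to split the script at sentence boundaries and synthesize each chunk separately. The sketch below is an assumption about how that could be done, not a description of our pipeline; chunk_text and the 1,000-character default are illustrative:

```python
import re


def chunk_text(text: str, max_chars: int = 1000) -> list[str]:
    """Split text into chunks of at most max_chars, breaking at sentence ends.

    A single sentence longer than max_chars is kept whole rather than cut mid-word.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would overflow, so close the current chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be sent as its own speech request and the resulting audio files concatenated in order.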

What's Next: The Self-Hosted AI Stack

Local TTS is just one piece of the puzzle. Here's our roadmap for fully self-hosted AI:

Already Running

✅ LLM Inference: Ollama on the GPU rig (Llama 3, DeepSeek, Qwen)
✅ Text-to-Speech: Kokoro TTS via Kokoro-FastAPI

Coming Soon

🔜 Voice Cloning: F5-TTS for zero-shot voice cloning — create custom brand voices from a 10-second audio sample
🔜 XTTS v2: Coqui's voice cloning for multilingual TTS with custom voices
🔜 More Kokoro Voices: Community-trained voice packs and voice blending experiments
🔜 All Agents on Local TTS: Every AI agent in our system speaks through the local rig
🔜 Streaming TTS: Real-time speech generation for conversational agents

The Vision

We're building toward a fully self-hosted AI stack where the only cloud dependencies are the frontier models (Claude, GPT-4) that we can't run locally — yet. Everything else — inference, TTS, image generation, embeddings — runs on our own metal. The goal is sovereignty over our AI infrastructure.

💡 The Moment to Go Self-Hosted Is Now

Open-source AI models have crossed the quality threshold. Kokoro proves that a tiny 82M parameter model can rival billion-dollar commercial APIs. The hardware is affordable. The tooling is mature. Docker makes deployment trivial. If you're running any kind of content pipeline, automation system, or AI-powered product — the ROI on self-hosting has never been better.

"The best time to start self-hosting AI was last year. The second best time is today." — Every builder who's made the switch

References

Kokoro-82M Model Card — Hugging Face, hexgrad
kokoro-onnx: TTS with Kokoro and ONNX Runtime — GitHub, thewh1teagle
Kokoro-FastAPI: Dockerized OpenAI-compatible TTS Server — GitHub, remsky
Kokoro-82M-v1.0-ONNX — Hugging Face, ONNX Community
OpenAI API Pricing — TTS pricing: $15/1M chars (standard), $30/1M chars (HD)
OpenAI TTS API Pricing Calculator — CostGoat, February 2026
Meet Kokoro: The Lightweight TTS Model Delivering High-Quality Speech Synthesis — AI Mind, Okan Yenigün, February 2025
Introducing kokoro-onnx TTS — r/LocalLLaMA, January 2025
F5-TTS: Fairytaler that Fakes Fluent and Faithful Speech — GitHub, SWivid
Kokoro FastAPI — Self Hosted Text to Speech Platform Installation Guide — Noted, February 2025
Kokoro TTS Documentation — Voxta
Show HN: kokoro-onnx TTS — Hacker News Discussion