Running TTS Locally — How We Ditched OpenAI's Voice API for a GPU Rig
We replaced OpenAI's paid text-to-speech API with open-source Kokoro TTS running on a 4× RTX 3090 GPU rig. Every article narration, every podcast, every video voiceover now costs $0 in API fees.
February 28, 2026 · 14 min read · Builder's Guide
We were spending real money — every single day — on OpenAI's text-to-speech API. News articles, research post narrations, podcast episodes, YouTube video voiceovers. Every character cost money. Then we built a GPU rig, installed an open-source TTS model, and changed one URL. Now it's all free. Here's exactly how we did it.
This isn't a theoretical guide. This is a build log from our actual infrastructure at ThinkSmart.Life. We run an AI-powered content pipeline that generates research posts, narrates them with human-quality audio, creates videos, and publishes everything automatically. Every piece of that pipeline used to call OpenAI's /v1/audio/speech endpoint. Today, it calls the same endpoint — but on our own machine, at http://10.0.0.79:8880.
The model is called Kokoro. It's 82 million parameters, open-source, Apache 2.0 licensed, and it sounds remarkably good. Let's break it all down.
Cost per audio generation: $0
Kokoro model parameters: 82M
Total VRAM (4× RTX 3090): 96GB
Config change to switch: 1 URL
The Cloud TTS Problem
OpenAI's TTS API is excellent. The voices sound natural, the latency is low, and integration is dead simple. But it has one unavoidable problem: it charges per character.
Here's the current pricing as of early 2026:
Model | Price | Quality
tts-1 (Standard) | $15 / 1M characters | Good
tts-1-hd (HD) | $30 / 1M characters | Excellent
gpt-4o-mini-tts | $0.60 / 1M input tokens + $12 / 1M audio tokens | Excellent + controllable
For a single blog post narration (~5,000 characters), you're looking at roughly $0.075 on tts-1 or $0.15 on tts-1-hd. That sounds cheap — until you're generating audio for 10+ articles per week, plus podcasts, plus video narrations, plus agent voice responses.
⚠️ The Math Adds Up Fast
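At 10 articles/week × 5,000 characters × tts-1-hd pricing, that's $1.50 a week: about $6.50/month, or $78/year, just for blog narrations. Add podcast episodes (20,000+ chars each), video scripts, and agent TTS, and the bill climbs with every character. Reaching $200-400/month at tts-1-hd rates takes roughly 7-13 million characters a month, which is where a serious content operation can end up.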
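The per-character arithmetic is easy to check in a few lines of Python. This is a minimal sketch: the prices come from the pricing table above, and the names `PRICE_PER_MILLION` and `tts_cost` are made up for illustration.

```python
# USD per 1M characters, copied from the pricing table above.
PRICE_PER_MILLION = {"tts-1": 15.00, "tts-1-hd": 30.00}


def tts_cost(characters: int, model: str = "tts-1-hd") -> float:
    """Return the API cost in USD for synthesizing `characters` characters."""
    return characters * PRICE_PER_MILLION[model] / 1_000_000


if __name__ == "__main__":
    post = 5_000  # characters in a typical blog narration
    print(f"one post, tts-1:     ${tts_cost(post, 'tts-1'):.3f}")
    print(f"one post, tts-1-hd:  ${tts_cost(post, 'tts-1-hd'):.3f}")

    weekly = tts_cost(10 * post)  # 10 articles a week on tts-1-hd
    print(f"10 posts a week:     ${weekly:.2f}/week, ${weekly * 52:.2f}/year")
```

Plugging in your own character counts is the quickest way to see where your usage falls before deciding whether a local rig is worth it.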
And that's before you factor in the other cloud dependency risks: rate limits during peak usage, API changes that break your pipeline, latency spikes that slow down your content workflow, and zero control over voice model updates that might change how your content sounds.
What Is Kokoro TTS?
Kokoro is a frontier text-to-speech model with only 82 million parameters. That "only" is doing serious work — despite being orders of magnitude smaller than proprietary models, Kokoro produces speech quality that rivals commercial offerings from OpenAI, Google, and ElevenLabs.
Key Specs
Size: 82M parameters — small enough to run on a single GPU or even a CPU
Output: 24kHz high-fidelity audio
Format: Available as ONNX (cross-platform, GPU/CPU) and PyTorch
License: Apache 2.0 — fully open, commercial use allowed
Languages: English, Japanese, Chinese, with more coming
Voices: 50+ built-in voices with the ability to blend them
Architecture: Single-pass generation (no separate vocoder needed)
The model was created by hexgrad and has been embraced by the open-source community. The ONNX version, maintained by thewh1teagle, makes it trivially easy to run anywhere — Python, JavaScript, Rust, even mobile devices via Expo.
"Kokoro is a frontier TTS model for its size. 82 million parameters, text in, audio out." — Kokoro model card, Hugging Face
Why Kokoro Over Other Open-Source TTS?
The open-source TTS landscape has exploded in 2025-2026. Here's how Kokoro compares:
Model | Parameters | Voice Cloning | Speed | Quality
Kokoro-82M | 82M | No (50+ built-in + blending) | Very fast | Excellent
XTTS v2 (Coqui) | ~467M | Yes (zero-shot) | Moderate | Very Good
F5-TTS | ~335M | Yes (zero-shot) | Fast | Excellent
Piper | Varies | No | Very fast | Good
Kokoro's sweet spot: it's the fastest high-quality option. If you need voice cloning, look at F5-TTS or XTTS v2. If you need speed + quality with great built-in voices, Kokoro wins.
Our GPU Rig: The Hardware
We built a dedicated AI inference machine. It's not a gaming PC — it's a workhorse designed for running multiple AI models simultaneously.
Kokoro-82M is tiny. It runs comfortably on a single GPU — even an RTX 3060 — or on CPU. Our 4× 3090 setup is because we also run Ollama for LLM inference (Llama, DeepSeek, etc.). The TTS is almost a freebie on top of our existing rig. Even a $200 used RTX 3060 would be more than enough for Kokoro alone.
The economics are simple: a used RTX 3090 costs ~$700. That one GPU can serve TTS, LLM inference, image generation, and more. At $200-400/month in cloud API costs, the hardware pays for itself in 2-4 months.
Setup Guide: From Zero to Local TTS
Here's exactly how we set up Kokoro TTS with an OpenAI-compatible API. This uses Kokoro-FastAPI, a Dockerized wrapper that exposes Kokoro as a drop-in replacement for OpenAI's TTS API.
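# Clone the repo
git clone https://github.com/remsky/Kokoro-FastAPI.git
cd Kokoro-FastAPI
# Run with Docker Compose (GPU mode)
# Note: the compose file's location depends on the repo version; if this file
# is missing in your checkout, look for the GPU compose file under docker/gpu/.
docker compose -f docker-compose.gpu.yml up -d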
That's it. The server starts on port 8880 by default.
Step 3: Test It
# Generate speech (same format as OpenAI!)
curl -X POST http://localhost:8880/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "model": "kokoro",
    "input": "Hello! This is Kokoro running locally on my GPU rig.",
    "voice": "af_bella"
  }' \
  --output test.mp3
# Play it
mpv test.mp3 # or any audio player
Step 4: Expose on Your Network
# By default, Kokoro-FastAPI binds to 0.0.0.0:8880
# It's already accessible from other machines on your network at:
# http://YOUR_GPU_RIG_IP:8880
# In our case:
# http://10.0.0.79:8880/v1/audio/speech
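Before pointing a pipeline at the rig, it helps to confirm the port is reachable from the client machine. The helper below is a minimal sketch using only the standard library. The name `is_reachable` is hypothetical, and it only checks that something accepts TCP connections on the port, not that it is Kokoro.

```python
import socket


def is_reachable(host: str, port: int = 8880, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        # create_connection resolves the host and attempts a TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unroutable hosts.
        return False
```

Calling `is_reachable("10.0.0.79")` from another machine on the network should return True once the container is up. If it returns False, a host firewall or the Docker port mapping is the usual place to look.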
✅ Available Voices
Kokoro comes with 50+ voices. Some highlights: af_bella, af_sarah, am_adam, am_michael, bf_emma, bm_george. You can even blend voices: af_bella+am_adam creates a mix. List all available voices at GET /v1/audio/voices.
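Voice blending and the voice listing can be wrapped in two small helpers. This is a sketch under assumptions: `BASE_URL` points at the local server from the steps above, the function names are made up, and `list_voices` returns whatever JSON the server sends back without assuming its exact shape.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8880"  # assumption: the server started in the setup steps


def blend(*voices: str) -> str:
    """Join voice names with '+', the blend syntax described above."""
    if not voices:
        raise ValueError("need at least one voice name")
    return "+".join(voices)


def list_voices(base_url: str = BASE_URL) -> object:
    """Call GET /v1/audio/voices and return the decoded JSON response as-is."""
    with urllib.request.urlopen(f"{base_url}/v1/audio/voices", timeout=10) as resp:
        return json.load(resp)
```

The string from `blend("af_bella", "am_adam")` goes straight into the `voice` field of a speech request.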
The OpenAI-Compatible Trick
This is the most powerful part of the setup. Kokoro-FastAPI exposes the exact same API contract as OpenAI's TTS:
POST /v1/audio/speech
{
  "model": "kokoro",
  "input": "Text to speak",
  "voice": "af_bella",
  "response_format": "mp3",
  "speed": 1.0
}
This means any tool, script, or application that uses OpenAI's TTS API can switch to your local server by pointing it at a different base URL. No library swaps are needed; the only other values that change are the model and voice names (here "kokoro" and "af_bella"), if your code hard-codes them.
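To make the shared contract concrete, here is a standard-library sketch that builds the same POST request for either backend. The function names `build_speech_request` and `synthesize` are hypothetical, and sending a Bearer header to the local server is an assumption on our part (it is needed for OpenAI's cloud and assumed to be harmless locally).

```python
import json
import urllib.request


def build_speech_request(
    base_url: str,
    text: str,
    voice: str = "af_bella",
    model: str = "kokoro",
    response_format: str = "mp3",
    speed: float = 1.0,
    api_key: str = "not-needed",
) -> urllib.request.Request:
    """Build a POST {base_url}/audio/speech request carrying the payload shown above."""
    payload = {
        "model": model,
        "input": text,
        "voice": voice,
        "response_format": response_format,
        "speed": speed,
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def synthesize(request: urllib.request.Request, out_path: str) -> None:
    """Send the request and write the returned audio bytes to `out_path`."""
    with urllib.request.urlopen(request, timeout=60) as resp, open(out_path, "wb") as f:
        f.write(resp.read())
```

Passing "http://10.0.0.79:8880/v1" or "https://api.openai.com/v1" as `base_url` produces requests that differ only in host, plus whatever model and voice names you choose.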
Python Example (OpenAI SDK)
from openai import OpenAI
# Before: using OpenAI's cloud
# client = OpenAI(api_key="sk-...")
# After: using our local Kokoro server
client = OpenAI(
    base_url="http://10.0.0.79:8880/v1",
    api_key="not-needed",  # no auth required locally
)
# Stream the audio straight to disk; the SDK's streaming-response helper
# replaces the deprecated response.stream_to_file() on a plain response
with client.audio.speech.with_streaming_response.create(
    model="kokoro",
    voice="af_bella",
    input="This is being generated locally, for free.",
) as response:
    response.stream_to_file("output.mp3")
curl Example
# Just change the URL — same payload works
curl -X POST http://10.0.0.79:8880/v1/audio/speech \
-H "Content-Type: application/json" \
-d '{"model":"kokoro","input":"Free speech synthesis.","voice":"af_bella"}' \
-o output.mp3
Environment Variable Swap
# In your .env or config:
# OLD:
# TTS_BASE_URL=https://api.openai.com/v1
# TTS_API_KEY=sk-your-key-here
# NEW:
TTS_BASE_URL=http://10.0.0.79:8880/v1
TTS_API_KEY=not-needed
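On the application side, reading those two variables can look like the sketch below. `TTSConfig` and `load_tts_config` are hypothetical names, and falling back to OpenAI's URL when `TTS_BASE_URL` is unset is an assumption for illustration, not a description of how our generator behaves.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class TTSConfig:
    base_url: str
    api_key: str


def load_tts_config(env: Optional[Mapping[str, str]] = None) -> TTSConfig:
    """Read TTS_BASE_URL and TTS_API_KEY from `env` (defaults to os.environ)."""
    env = os.environ if env is None else env
    return TTSConfig(
        # Assumed default: the cloud endpoint, so an unset variable keeps the old behavior.
        base_url=env.get("TTS_BASE_URL", "https://api.openai.com/v1").rstrip("/"),
        api_key=env.get("TTS_API_KEY", ""),
    )
```

The resulting `base_url` and `api_key` can be handed to whatever client the pipeline already uses, so switching backends stays a deployment setting rather than a code change.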
💡 This Pattern is Everywhere
The "OpenAI-compatible API" pattern is becoming the standard interface for self-hosted AI. Ollama uses it for LLMs (/v1/chat/completions). Kokoro-FastAPI uses it for TTS (/v1/audio/speech). This means you can self-host your entire AI stack behind the same API contract your tools already expect.
Real-World Integration: Our News Pipeline
At ThinkSmart.Life, we run an automated content pipeline. Our AI agents research topics, write blog posts, generate audio narrations, create video slides, and publish — all hands-free. Here's how we integrated local TTS:
The Config Change
# In our tts-generator.py config:
LOCAL_TTS_URL = "http://10.0.0.79:8880"
# The generator already used OpenAI's API format,
# so the only change was this URL.
# Before: https://api.openai.com
# After: http://10.0.0.79:8880
What's Now Running Locally
Research post narrations — Every article on /research/ gets an audio version
Video voiceovers — YouTube videos use locally-generated narration
Agent voice responses — AI agents can speak through local TTS
All of this was previously hitting OpenAI's API. Now it hits our local rig. The audio quality difference? Minimal. The cost difference? 100%.
Cost Analysis: The Real Numbers
Let's do honest math on what this saves.
Our Monthly TTS Usage (Estimated)
Content Type | Volume | Characters/Month | OpenAI tts-1-hd Cost
Research post narrations | ~15 posts | ~75,000 | $2.25
Video voiceovers | ~15 videos | ~120,000 | $3.60
News digests | ~30 digests | ~300,000 | $9.00
Agent TTS (misc) | Variable | ~100,000 | $3.00
Total | | ~595,000 | $17.85/month
At our current scale, the direct API savings are about $18/month. That's modest. But here's the thing: we're scaling up. As we add more content types, more languages, and more agents that speak, that number grows linearly with OpenAI — but stays at $0 with local TTS.
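The table's total can be reproduced directly from the character counts. This is a small sketch: the usage figures are the estimates from the table, the $30 per 1M characters rate is the tts-1-hd price, and the names `USAGE` and `monthly_cost` are illustrative.

```python
# Estimated characters per month, from the usage table above.
USAGE = {
    "Research post narrations": 75_000,
    "Video voiceovers": 120_000,
    "News digests": 300_000,
    "Agent TTS (misc)": 100_000,
}
TTS_1_HD_PER_MILLION = 30.00  # USD per 1M characters


def monthly_cost(chars_by_type: dict, price_per_million: float) -> tuple:
    """Return (total characters, total USD) for one month of usage."""
    total_chars = sum(chars_by_type.values())
    return total_chars, total_chars * price_per_million / 1_000_000


if __name__ == "__main__":
    chars, cost = monthly_cost(USAGE, TTS_1_HD_PER_MILLION)
    print(f"{chars:,} characters -> ${cost:.2f}/month")
```

Because the cost is linear in characters, multiplying every count by ten multiplies the bill by ten, which is the scaling concern described above.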
🔑 The Real Value: Unlimited Usage
The biggest win isn't cost savings on current usage — it's removing the cost ceiling entirely. Want to generate 100 audio versions of a post in different voices for A/B testing? Free. Want every agent to have a voice? Free. Want to run TTS on every incoming email? Free. When something is free, you use it in ways you'd never consider at $15-30 per million characters.
Hardware ROI
Our GPU rig cost roughly $4,000 to build (4× used RTX 3090s at ~$700 each + motherboard, PSU, etc.). But TTS is just one of many services running on it. The rig also runs Ollama for LLM inference (saving us from GPT-4 API calls), and will soon run image generation. The TTS is effectively a free bonus.
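If you're building the rig just for TTS, a single used RTX 3060 (~$200) would be sufficient. ROI depends on volume: at our current ~$18/month the card pays for itself in about 11 months; at $100-200/month of API spend, in 1-2 months.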
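Payback time is hardware cost divided by the monthly API spend it replaces. The sketch below uses figures quoted in this post and deliberately ignores electricity and maintenance, which is a simplifying assumption; `payback_months` is a hypothetical helper name.

```python
def payback_months(hardware_cost: float, monthly_api_spend: float) -> float:
    """Months until avoided API fees equal the hardware cost (power costs ignored)."""
    if monthly_api_spend <= 0:
        raise ValueError("monthly_api_spend must be positive")
    return hardware_cost / monthly_api_spend


if __name__ == "__main__":
    scenarios = [
        ("RTX 3060 at our current usage", 200, 17.85),
        ("RTX 3090 at $200/month", 700, 200),
        ("RTX 3090 at $400/month", 700, 400),
    ]
    for label, hardware, spend in scenarios:
        print(f"{label}: {payback_months(hardware, spend):.1f} months")
```

If the GPU is shared with LLM inference or other workloads, only part of its cost belongs to TTS, which shortens the effective payback further.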
Quality Comparison: Kokoro vs. OpenAI
Let's be honest about the tradeoffs:
Aspect | OpenAI tts-1-hd | Kokoro-82M
Naturalness | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐
Voice variety | 6 voices | 50+ voices + blending
Long-form stability | Excellent | Very good (occasional artifacts on very long texts)
Latency (first byte) | ~500ms (network dependent) | ~100-300ms (local)
Cost | $15-30/1M chars | $0 (after hardware)
Privacy | Data sent to OpenAI | 100% local
Uptime | Depends on OpenAI | Depends on your hardware
OpenAI's tts-1-hd is still the gold standard for naturalness. But Kokoro is 85-90% of the way there — and improving with every release. For content pipelines, podcasts, and video narration, the quality difference is negligible to most listeners. For applications where you need the absolute best voice quality (like a customer-facing voice assistant), you might still want OpenAI or ElevenLabs.
What's Next: The Self-Hosted AI Stack
Local TTS is just one piece of the puzzle. Here's our roadmap for fully self-hosted AI:
Already Running
✅ LLM Inference: Ollama on the GPU rig (Llama 3, DeepSeek, Qwen)
✅ Text-to-Speech: Kokoro TTS via Kokoro-FastAPI
Coming Soon
🔜 Voice Cloning: F5-TTS for zero-shot voice cloning — create custom brand voices from a 10-second audio sample
🔜 XTTS v2: Coqui's voice cloning for multilingual TTS with custom voices
🔜 More Kokoro Voices: Community-trained voice packs and voice blending experiments
🔜 All Agents on Local TTS: Every AI agent in our system speaks through the local rig
🔜 Streaming TTS: Real-time speech generation for conversational agents
The Vision
We're building toward a fully self-hosted AI stack where the only cloud dependencies are the frontier models (Claude, GPT-4) that we can't run locally — yet. Everything else — inference, TTS, image generation, embeddings — runs on our own metal. The goal is sovereignty over our AI infrastructure.
💡 The Moment to Go Self-Hosted Is Now
Open-source AI models have crossed the quality threshold. Kokoro proves that a tiny 82M parameter model can rival billion-dollar commercial APIs. The hardware is affordable. The tooling is mature. Docker makes deployment trivial. If you're running any kind of content pipeline, automation system, or AI-powered product — the ROI on self-hosting has never been better.
"The best time to start self-hosting AI was last year. The second best time is today." — Every builder who's made the switch