Introduction
You've built an AI agent that speaks. It generates audio using a TTS engine — maybe OpenAI's tts-1-hd, maybe a self-hosted Kokoro instance on a GPU rig. The audio sounds great. Then you try to send it as a voice message on Telegram, and it arrives as a generic file attachment instead of an inline voice note. What happened?
The answer is format compatibility. Telegram requires an Ogg container with the Opus codec at 48 kHz. Your TTS engine probably output MP3 or, worse, Ogg Vorbis masquerading as "Opus." This mismatch between what TTS engines produce and what platforms consume is one of the most common and least documented pain points in voice AI.
This guide covers everything: what each format actually is, what every major platform requires, what every major TTS engine outputs, and the exact FFmpeg commands to bridge the gap.
A common trap for local setups: Kokoro can return Ogg Vorbis even when you request the opus format. The file extension says .ogg, but the codec inside is Vorbis, not Opus. Telegram, WhatsApp, and Signal will reject it as a voice message. The fix is one FFmpeg command; we'll get there.
Codec vs Container: The Fundamental Distinction
Before diving into formats, you need to understand the most important concept in audio engineering: the difference between a codec and a container.
Codec (Coder-Decoder)
The codec is the algorithm that compresses and decompresses audio data. It determines the audio quality, compression ratio, and computational requirements. Examples: Opus, Vorbis, MP3 (MPEG-1 Layer III), AAC, FLAC, PCM.
Container (File Format)
The container is the file wrapper that holds the compressed audio data plus metadata (duration, sample rate, chapters, album art). Examples: .ogg (can hold Opus OR Vorbis), .mp4/.m4a (holds AAC), .webm (holds Opus or Vorbis), .wav (holds PCM or other codecs), .mkv (holds almost anything).
Why This Matters
Two files can both be .ogg but contain completely different codecs. An .ogg file with Vorbis inside is not the same as an .ogg file with Opus inside. Telegram requires the latter. If you don't check what's actually inside the container, you'll waste hours debugging why your "Ogg" file doesn't work as a voice message.
To inspect what codec is inside a file:
ffprobe -v quiet -show_streams -select_streams a:0 myfile.ogg | grep codec_name
# codec_name=opus ← Telegram will accept this
# codec_name=vorbis ← Telegram will reject this as voice
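Because the extension proves nothing, a pipeline can ask ffprobe directly instead of trusting the filename. The following is a minimal sketch, assuming ffprobe is on the PATH; the helper names `audio_codec` and `is_opus` are illustrative and not part of FFmpeg.

```shell
# Print the codec of the first audio stream, e.g. "opus" or "vorbis".
# audio_codec is a hypothetical helper name; ffprobe does the real work.
audio_codec() {
  ffprobe -v quiet -select_streams a:0 \
    -show_entries stream=codec_name -of csv=p=0 "$1"
}

# Succeed only when the first audio stream is Opus, whatever the extension says.
is_opus() {
  [ "$(audio_codec "$1")" = "opus" ]
}

# Usage:
#   is_opus reply.ogg && echo "voice-note ready" || echo "needs re-encoding"
```

Restricting the probe to `a:0` keeps the output to a single line even when the file carries extra streams.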
The Major TTS Audio Formats
Opus
The modern king of audio codecs. Developed by Xiph.Org and IETF (RFC 6716), Opus is royalty-free, open-source, and excels at everything from speech (6 kbps) to music (510 kbps). It's the default voice codec for WebRTC, Discord, WhatsApp, Telegram, and Signal. At 48 kbps it sounds as good as MP3 at 128 kbps.[1]
- Typical bitrates: 16–128 kbps for speech, 64–256 kbps for music
- Sample rates: 8, 12, 16, 24, 48 kHz (internally always 48 kHz)
- Containers: .ogg, .opus, .webm, .mkv, .mp4 (rare)
- Latency: 2.5–60 ms (best-in-class for real-time)
Ogg Vorbis
The predecessor to Opus in the Ogg family. Vorbis is a mature, royalty-free codec that's been around since 2000. It's good for music but less efficient than Opus for speech. Amazon Polly still outputs Ogg Vorbis as its primary compressed format.[2]
- Typical bitrates: 64–320 kbps
- Sample rates: 8–192 kHz
- Containers: .ogg, .oga
- Key limitation: No low-bitrate speech mode (Opus wins below 64 kbps)
MP3 (MPEG-1 Audio Layer III)
The most universally supported audio format. Every device, browser, and platform on Earth can play MP3. Patents expired in 2017, making it effectively free. It's the safest default if you don't know what your target platform supports.[3]
- Typical bitrates: 128–320 kbps
- Sample rates: 8–48 kHz
- Container: .mp3 (self-contained)
- Trade-off: Larger files than Opus at equivalent quality
AAC (Advanced Audio Coding)
Apple's preferred codec, also widely used in YouTube, Spotify, and Android. Better quality than MP3 at the same bitrate. The default for iPhone voice memos (inside .m4a container) and the standard audio codec in MP4 video files.[4]
- Typical bitrates: 96–256 kbps
- Sample rates: 8–96 kHz
- Containers: .m4a, .mp4, .aac (raw), .caf (Apple)
- Key advantage: Native hardware decoding on all Apple devices
WAV / PCM (Uncompressed)
Raw, uncompressed audio. Perfect quality because there's no compression at all — but files are huge (about 10 MB per minute at 44.1 kHz stereo). Used as intermediate format in audio pipelines and for archival.[5]
- Bitrate: ~1,411 kbps (CD quality, 16-bit 44.1kHz stereo)
- Container: .wav (RIFF header + PCM data)
- Use case: TTS engine raw output, audio processing intermediate
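The size figures for uncompressed audio follow directly from sample rate, bit depth, and channel count. A short worked example in shell arithmetic; the function names are made up for illustration, and integer division truncates the results.

```shell
# Uncompressed PCM bitrate in kbps: sample rate x bit depth x channels.
pcm_kbps() {
  echo $(( $1 * $2 * $3 / 1000 ))
}

# Bytes needed for one minute of audio at a given bitrate in bits per second.
bytes_per_minute() {
  echo $(( $1 / 8 * 60 ))
}

pcm_kbps 44100 16 2            # 1411 (CD quality, as quoted above)
bytes_per_minute 1411200       # 10584000, roughly 10 MB
bytes_per_minute 48000         # 360000, a 48 kbps Opus stream for comparison
```

The last line shows why chat platforms prefer Opus: at 48 kbps a minute of speech is about 30 times smaller than the same minute of CD-quality PCM.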
FLAC (Free Lossless Audio Codec)
Lossless compression — bit-perfect reconstruction of the original audio at 50–70% of WAV file size. Useful when you need to preserve every detail of TTS output for post-processing, but don't want WAV-sized files.
- Typical bitrates: 600–1,000 kbps
- Container: .flac, or inside .ogg, .mkv
- Trade-off: Still 3–5× larger than Opus at perceptually similar quality
WebM
Google's container format for web media. Typically holds Opus or Vorbis audio (and VP8/VP9/AV1 video). Native to Chrome and Firefox, used by YouTube internally.
- Audio codecs: Opus, Vorbis
- Container: .webm (Matroska-based)
- Use case: Web audio/video delivery, browser recording
Format Comparison at a Glance
| Format | Type | Typical Speech Bitrate | Quality at 48kbps | Latency | Patent-Free |
|---|---|---|---|---|---|
| Opus | Lossy | 24–64 kbps | Excellent | 2.5 ms | ✅ |
| Vorbis | Lossy | 64–128 kbps | Poor | ~100 ms | ✅ |
| MP3 | Lossy | 128–192 kbps | Poor | ~100 ms | ✅ (since 2017) |
| AAC | Lossy | 64–128 kbps | Good | ~20 ms | ❌ (licensed) |
| WAV/PCM | Uncompressed | ~1,411 kbps | Perfect | 0 ms | ✅ |
| FLAC | Lossless | ~700 kbps | Perfect | ~50 ms | ✅ |
Platform Requirements for Voice Messages
Each messaging platform has specific format requirements for audio to display as an inline voice message (rather than a generic file attachment). Get this wrong and your perfectly good audio arrives as a downloadable file instead of a playable voice note.
| Platform | Required Format | Codec | Sample Rate | Notes |
|---|---|---|---|---|
| Telegram | .ogg | Opus (libopus) | 48 kHz | sendVoice API requires OGG+Opus specifically[6] |
| WhatsApp | .ogg | Opus | 48 kHz | Business API accepts audio/ogg; codecs=opus |
| Discord | .ogg | Opus | 48 kHz | Voice channels use Opus; file uploads support MP3/AAC too |
| Signal | .ogg | Opus | 48 kHz | Same as Telegram/WhatsApp |
| iMessage | .caf / .m4a | AAC / AMR | Various | Apple ecosystem; native CoreAudio formats |
| Slack | .mp3, .m4a, .ogg, .wav | Various | Various | Most flexible — accepts almost anything[7] |
| Zoom/Meet | Internal | AAC / Opus | 48 kHz | Real-time codec handled internally |
OS and Runtime Support Matrix
Not every operating system natively supports every codec. Here's what you can expect:
| OS / Runtime | Native Codecs | Opus Support | Notes |
|---|---|---|---|
| Linux | Everything via FFmpeg ecosystem | ✅ libopus, libvorbis | Best codec support via package managers. apt install libopus-dev |
| macOS / iOS | AAC, ALAC, MP3, CAF, WAV | ⚠️ Via libopus only | CoreAudio provides native AAC encode/decode. Opus requires libopus (Homebrew: brew install opus) |
| Windows | MP3, AAC, WMA, WAV, FLAC | ⚠️ Via codec pack | Media Foundation handles MP3/AAC natively. Opus needs third-party codec or FFmpeg |
| Android | AAC, MP3, Vorbis, WAV, FLAC | ✅ API 29+ (Android 10) | Opus decode since Android 5.0, encode since Android 10. MediaCodec API[8] |
| Web Browsers | MP3, AAC, Opus, Vorbis, WAV | ✅ All modern browsers | Chrome, Firefox, Safari 15+, Edge all support Opus playback |
What TTS Engines Actually Output
Here's the critical reference table — what each major TTS engine produces and in what formats:
| TTS Engine | Output Formats | Default | Telegram-Ready? |
|---|---|---|---|
| OpenAI tts-1 / tts-1-hd | MP3, Opus, AAC, FLAC, WAV, PCM | MP3 | ✅ Request opus format — outputs real Ogg+Opus[9] |
| ElevenLabs | MP3 (various bitrates), PCM (8–48kHz), Opus (48kHz, 32–192kbps), μ-law, A-law | MP3 44.1kHz 128kbps | ✅ Request opus_48000_64 or similar[10] |
| Google Cloud TTS | MP3, OGG_OPUS, LINEAR16 (WAV), MULAW, ALAW | MP3 | ✅ Request OGG_OPUS encoding[11] |
| Amazon Polly | MP3, OGG Vorbis, PCM | MP3 | ❌ No Opus output — must re-encode with FFmpeg[2] |
| Azure TTS | MP3 (various), WAV (various), OGG Opus, WebM Opus, RAW | WAV 16kHz 16-bit | ✅ Request ogg-48khz-16bit-mono-opus[12] |
| Kokoro (local) | Ogg Vorbis (regardless of format parameter) | Ogg Vorbis | ❌ Outputs Vorbis even when "opus" is requested — MUST re-encode[13] |
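For the engines marked ✅, the fix is simply to ask for Opus in the request. As one illustration, here is a sketch against OpenAI's speech endpoint. The `alloy` voice, the helper names, and the output filename are assumptions for the example, and the payload builder does no JSON escaping, so it assumes text without double quotes or backslashes (a JSON tool such as jq would be safer for arbitrary input).

```shell
# Build the JSON request body. Hypothetical helper; performs no JSON escaping.
tts_payload() {
  printf '{"model":"tts-1","input":"%s","voice":"alloy","response_format":"opus"}' "$1"
}

# Request speech and write the response to a file (requires OPENAI_API_KEY).
tts_opus() {
  curl -sS https://api.openai.com/v1/audio/speech \
    -H "Authorization: Bearer $OPENAI_API_KEY" \
    -H "Content-Type: application/json" \
    -d "$(tts_payload "$1")" \
    --output "$2"
}

# Usage:
#   tts_opus "Your build finished." reply.ogg
```

Whatever the engine, it is worth running ffprobe on the first file you receive before wiring the output straight into a voice-message upload.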
The Kokoro Problem (And How to Fix It)
This is the most common issue for anyone running local TTS. Kokoro — and many OpenAI-compatible TTS servers built on top of it — expose an API that accepts a response_format parameter. You set it to opus, expecting Ogg+Opus output. What you actually get is Ogg+Vorbis.
Why? The likely cause is that the server writes Ogg through a Python audio library whose default Ogg codec is Vorbis, while the format parameter is ignored or mapped incorrectly. The output file has an .ogg extension, which looks correct, but the codec inside is Vorbis, and that's what matters.
How to Detect the Problem
# Check what codec is inside your .ogg file
ffprobe -v quiet -show_entries stream=codec_name -of csv=p=0 audio.ogg
# If it says "vorbis" — you have the wrong codec for Telegram/WhatsApp
# More detailed check
ffprobe -v quiet -show_streams audio.ogg 2>&1 | grep -E "codec_name|sample_rate|bit_rate"
The Fix: One FFmpeg Command
# Convert Ogg Vorbis → Ogg Opus (Telegram-compatible)
ffmpeg -i input.ogg -c:a libopus -b:a 48k -ar 48000 output.ogg
# Breakdown:
# -c:a libopus → use the Opus codec (not Vorbis)
# -b:a 48k → 48 kbps bitrate (excellent for speech)
# -ar 48000 → 48 kHz sample rate (required by most platforms)
Verify the result: ffprobe -v quiet -show_entries stream=codec_name -of csv=p=0 output.ogg should print opus. If FFmpeg instead stops with an unknown-encoder error for libopus, your build wasn't compiled with libopus support. Install one that is: brew install ffmpeg (macOS) or apt install ffmpeg libopus-dev (Linux).
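Whether a given FFmpeg build can encode Opus can be checked up front by looking for libopus in its encoder list. A small sketch; `has_libopus` is an illustrative name, not an FFmpeg command.

```shell
# Succeed if this ffmpeg build lists the libopus encoder.
has_libopus() {
  ffmpeg -hide_banner -encoders 2>/dev/null | grep -qw libopus
}

# Usage:
#   has_libopus || echo "ffmpeg lacks libopus; install a build that includes it" >&2
```

Running this once at startup gives a clear error message instead of a failed conversion in the middle of a pipeline.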
Automating It in Your Pipeline
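#!/bin/bash
# tts-to-telegram.sh: convert any TTS output to Telegram voice format
set -euo pipefail
INPUT="${1:?usage: tts-to-telegram.sh INPUT [OUTPUT]}"
OUTPUT="${2:-telegram-voice.ogg}"
# Print the codec of the first audio stream only (ignores cover art and extra streams)
probe_codec() {
  ffprobe -v quiet -select_streams a:0 -show_entries stream=codec_name -of csv=p=0 "$1"
}
# Detect current codec
CODEC=$(probe_codec "$INPUT")
if [ "$CODEC" = "opus" ]; then
  # Remux instead of cp so Opus from .webm or .mkv still lands in an Ogg container
  echo "Already Opus, remuxing into Ogg without re-encoding"
  ffmpeg -y -i "$INPUT" -vn -c:a copy "$OUTPUT"
else
  echo "Converting $CODEC → Opus"
  ffmpeg -y -i "$INPUT" -vn -c:a libopus -b:a 48k -ar 48000 "$OUTPUT"
fi
# Verify
FINAL=$(probe_codec "$OUTPUT")
echo "Output codec: $FINAL"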
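Once the file is Ogg Opus, it can be uploaded through the Bot API sendVoice method mentioned in the platform table. A minimal sketch using curl; `send_voice` is a hypothetical helper, and the bot token and chat id are placeholders you supply.

```shell
# Upload an Ogg Opus file through the Bot API sendVoice method.
# Arguments: bot token, chat id, path to the .ogg file.
send_voice() {
  curl -sS -X POST "https://api.telegram.org/bot$1/sendVoice" \
    -F "chat_id=$2" \
    -F "voice=@$3"
}

# Usage:
#   send_voice "$BOT_TOKEN" 123456789 telegram-voice.ogg
```

The `@` prefix tells curl to upload the file contents as multipart form data rather than sending the path as text.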
FFmpeg Conversion Cheat Sheet
Copy-paste commands for the most common TTS audio conversions:
To Ogg Opus (Telegram, WhatsApp, Discord, Signal)
# From MP3
ffmpeg -i input.mp3 -c:a libopus -b:a 48k -ar 48000 output.ogg
# From AAC / M4A
ffmpeg -i input.aac -c:a libopus -b:a 48k -ar 48000 output.ogg
# From WAV / PCM
ffmpeg -i input.wav -c:a libopus -b:a 48k -ar 48000 output.ogg
# From Ogg Vorbis (the Kokoro fix)
ffmpeg -i input.ogg -c:a libopus -b:a 48k -ar 48000 output.ogg
# From FLAC
ffmpeg -i input.flac -c:a libopus -b:a 48k -ar 48000 output.ogg
To AAC (iMessage, Apple ecosystem)
# From MP3
ffmpeg -i input.mp3 -c:a aac -b:a 128k output.m4a
# From Ogg (any codec)
ffmpeg -i input.ogg -c:a aac -b:a 128k output.m4a
# From WAV
ffmpeg -i input.wav -c:a aac -b:a 128k output.m4a
To MP3 (Universal compatibility)
# From any format
ffmpeg -i input.ogg -c:a libmp3lame -b:a 128k output.mp3
# Higher quality for podcast/RSS
ffmpeg -i input.wav -c:a libmp3lame -b:a 192k -ar 44100 output.mp3
To WAV (Processing intermediate)
# From any compressed format
ffmpeg -i input.ogg -c:a pcm_s16le -ar 44100 output.wav
Batch Convert All Files in a Directory
# Convert all .ogg files to Telegram-compatible Opus
# Output goes to ./opus so a rerun doesn't pick up its own results
mkdir -p opus
for f in *.ogg; do
  ffmpeg -y -i "$f" -c:a libopus -b:a 48k -ar 48000 "opus/$f"
done
Recommended Format by Use Case
| Use Case | Best Format | Why |
|---|---|---|
| Chat app voice messages (Telegram, WhatsApp, Discord, Signal) | Ogg Opus, 48kHz, 48kbps | Required by all major chat platforms. Small files, excellent speech quality. |
| Podcast / RSS feed | MP3, 44.1kHz, 128–192kbps | Universal podcast player support. Every RSS reader and podcast app handles MP3. |
| Web audio player (blog, documentation) | AAC in M4A, 128kbps or MP3 128kbps | AAC for quality/size; MP3 for maximum browser compatibility. |
| Phone call / VoIP | Opus, 16kHz, 16–32kbps or μ-law/A-law 8kHz | Low latency critical. Opus dominates WebRTC; μ-law for traditional telephony (PSTN). |
| Apple ecosystem (iMessage, Siri, HomePod) | AAC in CAF or M4A | Native CoreAudio hardware decode. Zero-effort playback on all Apple devices. |
| Archival / post-processing | WAV (PCM 16-bit) or FLAC | Lossless. Preserve full TTS output quality for future re-encoding. |
| "I don't know the target" | MP3, 44.1kHz, 128kbps | Works literally everywhere. The safe default. |
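The table can be folded into a small lookup so that one conversion script serves several targets. This is only a sketch: the use-case keys are invented for the example, and the options mirror the cheat-sheet commands above.

```shell
# Map a use case from the table to FFmpeg audio options.
# Unknown keys fall back to the "safe default" MP3 settings.
ffmpeg_audio_args() {
  case "$1" in
    chat)    echo "-c:a libopus -b:a 48k -ar 48000" ;;
    podcast) echo "-c:a libmp3lame -b:a 192k -ar 44100" ;;
    apple)   echo "-c:a aac -b:a 128k" ;;
    archive) echo "-c:a pcm_s16le" ;;
    *)       echo "-c:a libmp3lame -b:a 128k -ar 44100" ;;
  esac
}

# Usage (word splitting of the options is intentional):
#   ffmpeg -i input.wav $(ffmpeg_audio_args chat) output.ogg
```

The output extension still has to match the codec (.ogg for chat, .mp3 for podcast, .m4a for apple, .wav for archive), since FFmpeg picks the container from the filename.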
References
[1] Opus Codec, Official Site. Xiph.Org / IETF RFC 6716
[2] Amazon Polly SynthesizeSpeech API. AWS Documentation
[3] MP3 Patent Expiration. Wikipedia, 2017
[4] AVFAudio, Apple Developer Documentation. Apple CoreAudio AAC
[5] FFmpeg Codec Documentation. PCM, Opus, AAC encoders
[6] Telegram Bot API, sendVoice. "audio must be in an .ogg file encoded with OPUS"
[7] Slack API, Files and Uploads. Supported audio formats
[8] Android Supported Media Formats. Google Developer Docs
[9] OpenAI Text-to-Speech Guide. Output formats: mp3, opus, aac, flac, wav, pcm
[10] ElevenLabs, What audio formats do you support? MP3, PCM, Opus, μ-law
[11] Google Cloud TTS AudioEncoding. LINEAR16, MP3, OGG_OPUS, MULAW, ALAW
[12] Azure Speech Service REST API. Output format options
[13] How to convert OGG file to Telegram voice format. Stack Overflow
[14] FFmpeg, Convert any media file to Telegram Opus audio note. Super User