
Introduction

You've built an AI agent that speaks. It generates audio using a TTS engine — maybe OpenAI's tts-1-hd, maybe a self-hosted Kokoro instance on a GPU rig. The audio sounds great. Then you try to send it as a voice message on Telegram, and it arrives as a generic file attachment instead of an inline voice note. What happened?

The answer is format compatibility. Telegram requires an Ogg container with the Opus codec at 48 kHz. Your TTS engine probably output MP3 or, worse, Ogg Vorbis masquerading as "Opus." This mismatch between what TTS engines produce and what platforms consume is one of the most common and least documented pain points in voice AI.

This guide covers everything: what each format actually is, what every major platform requires, what every major TTS engine outputs, and the exact FFmpeg commands to bridge the gap.

⚠️ The #1 Gotcha
Kokoro and many OpenAI-compatible local TTS servers output Ogg Vorbis even when you request opus format. The file extension says .ogg, but the codec inside is Vorbis, not Opus. Telegram, WhatsApp, and Signal will reject it as a voice message. The fix is one FFmpeg command, covered in the Kokoro section of this guide.

Codec vs Container: The Fundamental Distinction

Before diving into formats, you need to understand the difference between a codec and a container. Most format-compatibility bugs in voice pipelines come down to confusing the two.

Codec (Coder-Decoder)

The codec is the algorithm that compresses and decompresses audio data. It determines the audio quality, compression ratio, and computational requirements. Examples: Opus, Vorbis, MP3 (MPEG-1 Layer III), AAC, FLAC, PCM.

Container (File Format)

The container is the file wrapper that holds the compressed audio data plus metadata (duration, sample rate, chapters, album art). Examples: .ogg (can hold Opus OR Vorbis), .mp4/.m4a (typically holds AAC), .webm (holds Opus or Vorbis), .wav (usually holds PCM, but can hold other codecs), .mkv (holds almost anything).

Why This Matters

Two files can both be .ogg but contain completely different codecs. An .ogg file with Vorbis inside is not the same as an .ogg file with Opus inside. Telegram requires the latter. If you don't check what's actually inside the container, you'll waste hours debugging why your "Ogg" file doesn't work as a voice message.

To inspect what codec is inside a file:

ffprobe -v quiet -show_streams -select_streams a:0 myfile.ogg | grep codec_name
# codec_name=opus    ← Telegram will accept this
# codec_name=vorbis  ← Telegram will reject this as voice
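
If ffprobe isn't installed on the machine where you need to check, the codec can usually be read straight from the file header. The sketch below relies on two facts about the Ogg format: every page starts with the bytes OggS, and the first packet of an Opus stream starts with OpusHead while the first packet of a Vorbis stream starts with the byte 0x01 followed by vorbis. The function name ogg_codec is made up for this example, and the check assumes a single-stream audio file. Multiplexed Ogg files may put a different stream first.

```shell
#!/bin/sh
# ogg_codec FILE: print "opus", "vorbis", "unknown-ogg", or "not-ogg".
# Hypothetical helper for illustration; it only inspects the first 64 bytes.
ogg_codec() {
    # Map every non-printable byte to "." so the header is safe to keep
    # in a shell variable (NUL bytes would otherwise be dropped).
    head_bytes=$(head -c 64 "$1" | LC_ALL=C tr -c '[:print:]' '.')
    case "$head_bytes" in
        OggS*OpusHead*) echo "opus" ;;         # Opus identification header
        OggS*.vorbis*)  echo "vorbis" ;;       # 0x01 + "vorbis" identification header
        OggS*)          echo "unknown-ogg" ;;  # Ogg, but some other codec
        *)              echo "not-ogg" ;;      # MP3, WAV, M4A, ...
    esac
}

# Usage: ogg_codec audio.ogg
```

Treat this as a quick pre-check rather than a replacement for ffprobe, which also reports sample rate, channels, and bitrate.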

The Major TTS Audio Formats

Opus

The modern king of audio codecs. Developed by Xiph.Org and standardized by the IETF as RFC 6716, Opus is royalty-free, open-source, and covers everything from low-bitrate speech (6 kbps) to high-quality music (510 kbps). It's the voice codec used by WebRTC, Discord, WhatsApp, Telegram, and Signal. For speech, Opus at 48 kbps is generally considered comparable to MP3 at 128 kbps.[1]

Ogg Vorbis

The predecessor to Opus in the Ogg family. Vorbis is a mature, royalty-free codec that's been around since 2000. It's good for music but less efficient than Opus for speech. Amazon Polly still offers Ogg Vorbis, not Opus, as its Ogg output format.[2]

MP3 (MPEG-1 Audio Layer III)

The most universally supported audio format. Every device, browser, and platform on Earth can play MP3. Patents expired in 2017, making it effectively free. It's the safest default if you don't know what your target platform supports.[3]

AAC (Advanced Audio Coding)

Apple's preferred codec, also widely used in YouTube, Spotify, and Android. Better quality than MP3 at the same bitrate. The default for iPhone voice memos (inside .m4a container) and the standard audio codec in MP4 video files.[4]

WAV / PCM (Uncompressed)

Raw, uncompressed audio. Perfect quality because there's no compression at all, but files are huge (about 10 MB per minute at 16-bit, 44.1 kHz stereo). Used as an intermediate format in audio pipelines and for archival.[5]
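
To see where the "10 MB per minute" figure comes from, here is the arithmetic as a small shell sketch. The function names are invented for this example. The 24 kHz mono case is only an illustrative assumption about a typical TTS output, not a property of any specific engine.

```shell
#!/bin/sh
# pcm_bytes_per_minute SAMPLE_RATE_HZ BITS_PER_SAMPLE CHANNELS
# Uncompressed size = rate * bytes per sample * channels * 60 seconds.
pcm_bytes_per_minute() {
    echo $(( $1 * $2 / 8 * $3 * 60 ))
}

# pcm_kbps SAMPLE_RATE_HZ BITS_PER_SAMPLE CHANNELS
# Uncompressed bitrate in kbps (integer part).
pcm_kbps() {
    echo $(( $1 * $2 * $3 / 1000 ))
}

# encoded_bytes_per_minute KBPS
# Approximate audio payload of a constant-bitrate lossy stream.
encoded_bytes_per_minute() {
    echo $(( $1 * 1000 / 8 * 60 ))
}

pcm_bytes_per_minute 44100 16 2   # 10584000 bytes, about 10.6 MB
pcm_kbps 44100 16 2               # 1411 kbps, the WAV/PCM figure in the comparison table
pcm_bytes_per_minute 24000 16 1   # 2880000 bytes for an assumed 24 kHz mono TTS output
encoded_bytes_per_minute 48       # 360000 bytes, Opus at 48 kbps
```

The last line shows why chat platforms standardize on Opus: a minute of speech at 48 kbps is roughly 360 KB, plus a small amount of container overhead.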

FLAC (Free Lossless Audio Codec)

Lossless compression — bit-perfect reconstruction of the original audio at 50–70% of WAV file size. Useful when you need to preserve every detail of TTS output for post-processing, but don't want WAV-sized files.

WebM

Google's container format for web media. Typically holds Opus or Vorbis audio (and VP8/VP9/AV1 video). Native to Chrome and Firefox, used by YouTube internally.

Format Comparison at a Glance

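| Format | Type | Typical Speech Bitrate | Quality at 48 kbps | Latency | Patent-Free |
|---|---|---|---|---|---|
| Opus | Lossy | 24–64 kbps | Excellent | 2.5 ms | ✅ |
| Vorbis | Lossy | 64–128 kbps | Poor | ~100 ms | ✅ |
| MP3 | Lossy | 128–192 kbps | Poor | ~100 ms | ✅ (since 2017) |
| AAC | Lossy | 64–128 kbps | Good | ~20 ms | ❌ (licensed) |
| WAV/PCM | Uncompressed | ~1,411 kbps | Perfect | 0 ms | ✅ |
| FLAC | Lossless | ~700 kbps | Perfect | ~50 ms | ✅ |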

Platform Requirements for Voice Messages

Each messaging platform has specific format requirements for audio to display as an inline voice message (rather than a generic file attachment). Get this wrong and your perfectly good audio arrives as a downloadable file instead of a playable voice note.

| Platform | Required Format | Codec | Sample Rate | Notes |
|---|---|---|---|---|
| Telegram | .ogg | Opus (libopus) | 48 kHz | sendVoice API requires OGG+Opus specifically[6] |
| WhatsApp | .ogg | Opus | 48 kHz | Business API accepts audio/ogg; codecs=opus |
| Discord | .ogg | Opus | 48 kHz | Voice channels use Opus; file uploads support MP3/AAC too |
| Signal | .ogg | Opus | 48 kHz | Same as Telegram/WhatsApp |
| iMessage | .caf / .m4a | AAC / AMR | Various | Apple ecosystem; native CoreAudio formats |
| Slack | .mp3, .m4a, .ogg, .wav | Various | Various | Most flexible; accepts almost anything[7] |
| Zoom/Meet | Internal | AAC / Opus | 48 kHz | Real-time codec handled internally |

🎯 The Pattern
The dominant format for voice messages across chat platforms is Ogg container + Opus codec at 48 kHz. If you target Telegram, WhatsApp, Discord, and Signal, one format serves them all. The outlier is Apple's iMessage, which needs AAC.
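
For Telegram specifically, once the file is Ogg+Opus, sending it as a voice note is one multipart upload to the Bot API's sendVoice method, with the chat ID in chat_id and the file in voice. The following is a minimal sketch rather than a complete client: the function name send_voice and the DRY_RUN switch are invented for this example, and the token and chat ID are placeholders you would supply yourself.

```shell
#!/bin/sh
# send_voice TOKEN CHAT_ID FILE
# Hypothetical helper: uploads FILE through the Telegram Bot API sendVoice method.
# Set DRY_RUN=1 to print the request (with the token masked) instead of sending it.
send_voice() {
    token=$1
    chat_id=$2
    file=$3
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "curl -sS --fail -F chat_id=${chat_id} -F voice=@${file} https://api.telegram.org/bot<token>/sendVoice"
        return 0
    fi
    # -F sends multipart/form-data; the @ prefix tells curl to upload the file's contents.
    curl -sS --fail \
        -F "chat_id=${chat_id}" \
        -F "voice=@${file}" \
        "https://api.telegram.org/bot${token}/sendVoice"
}

# Usage: send_voice "$BOT_TOKEN" "$CHAT_ID" telegram-voice.ogg
```

If the upload succeeds but the message still shows up as a file attachment, check the codec inside the .ogg before suspecting the request itself.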

OS and Runtime Support Matrix

Not every operating system natively supports every codec. Here's what you can expect:

| OS / Runtime | Native Codecs | Opus Support | Notes |
|---|---|---|---|
| Linux | Everything via FFmpeg ecosystem | ✅ libopus, libvorbis | Best codec support via package managers. apt install libopus-dev |
| macOS / iOS | AAC, ALAC, MP3, CAF, WAV | ⚠️ Via libopus only | CoreAudio provides native AAC encode/decode. Opus requires libopus (Homebrew: brew install opus) |
| Windows | MP3, AAC, WMA, WAV, FLAC | ⚠️ Via codec pack | Media Foundation handles MP3/AAC natively. Opus needs third-party codec or FFmpeg |
| Android | AAC, MP3, Vorbis, WAV, FLAC | ✅ API 29+ (Android 10) | Opus decode since Android 5.0, encode since Android 10. MediaCodec API[8] |
| Web Browsers | MP3, AAC, Opus, Vorbis, WAV | ✅ All modern browsers | Chrome, Firefox, Safari 15+, Edge all support Opus playback |

What TTS Engines Actually Output

Here's the critical reference table — what each major TTS engine produces and in what formats:

| TTS Engine | Output Formats | Default | Telegram-Ready? |
|---|---|---|---|
| OpenAI tts-1 / tts-1-hd | MP3, Opus, AAC, FLAC, WAV, PCM | MP3 | ✅ Request opus format; outputs real Ogg+Opus[9] |
| ElevenLabs | MP3 (various bitrates), PCM (8–48kHz), Opus (48kHz, 32–192kbps), μ-law, A-law | MP3 44.1kHz 128kbps | ✅ Request opus_48000_64 or similar[10] |
| Google Cloud TTS | MP3, OGG_OPUS, LINEAR16 (WAV), MULAW, ALAW | MP3 | ✅ Request OGG_OPUS encoding[11] |
| Amazon Polly | MP3, OGG Vorbis, PCM | MP3 | ❌ No Opus output; must re-encode with FFmpeg[2] |
| Azure TTS | MP3 (various), WAV (various), OGG Opus, WebM Opus, RAW | WAV 16kHz 16-bit | ✅ Request ogg-48khz-16bit-mono-opus[12] |
| Kokoro (local) | Ogg Vorbis (regardless of format parameter) | Ogg Vorbis | ❌ Outputs Vorbis even when "opus" is requested; must re-encode[13] |
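
As an illustration of "request opus format", here is a hedged sketch of calling OpenAI's speech endpoint (POST /v1/audio/speech) with response_format set to opus. The helper names tts_payload and request_opus are invented for this example, the alloy voice is just one valid choice, and the escaping only handles backslashes and double quotes in single-line text. An OpenAI-compatible local server would need a different base URL, which is an assumption you'd have to fill in for your own setup.

```shell
#!/bin/sh
# tts_payload MODEL VOICE FORMAT TEXT: print the JSON request body.
# Hypothetical helper; escapes backslashes and double quotes only,
# so multi-line input text is not handled here.
tts_payload() {
    text=$(printf '%s' "$4" | sed 's/\\/\\\\/g; s/"/\\"/g')
    printf '{"model":"%s","voice":"%s","response_format":"%s","input":"%s"}\n' \
        "$1" "$2" "$3" "$text"
}

# request_opus TEXT OUTPUT_FILE
# Sends the request and writes the returned audio to OUTPUT_FILE.
# Expects the API key in the OPENAI_API_KEY environment variable.
request_opus() {
    curl -sS --fail https://api.openai.com/v1/audio/speech \
        -H "Authorization: Bearer ${OPENAI_API_KEY}" \
        -H "Content-Type: application/json" \
        -d "$(tts_payload tts-1 alloy opus "$1")" \
        -o "$2"
}

# Usage: request_opus "Hello from my agent" speech.ogg
```

Whatever engine you call, inspect the first file it returns before wiring it into a bot. The response_format you asked for and the codec you received are not guaranteed to match.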

The Kokoro Problem (And How to Fix It)

This is the most common issue for anyone running local TTS. Kokoro — and many OpenAI-compatible TTS servers built on top of it — expose an API that accepts a response_format parameter. You set it to opus, expecting Ogg+Opus output. What you actually get is Ogg+Vorbis.

Why? The likely cause is that the server's audio pipeline writes Ogg files through a Python audio library whose default Ogg codec is Vorbis, while the format parameter is ignored or mapped incorrectly. The output file has an .ogg extension, which looks correct, but the codec inside is Vorbis, and that's what matters.

How to Detect the Problem

# Check what codec is inside your .ogg file
ffprobe -v quiet -show_entries stream=codec_name -of csv=p=0 audio.ogg
# If it says "vorbis" — you have the wrong codec for Telegram/WhatsApp

# More detailed check
ffprobe -v quiet -show_streams audio.ogg 2>&1 | grep -E "codec_name|sample_rate|bit_rate"

The Fix: One FFmpeg Command

# Convert Ogg Vorbis → Ogg Opus (Telegram-compatible)
ffmpeg -i input.ogg -c:a libopus -b:a 48k -ar 48000 output.ogg

# Breakdown:
# -c:a libopus   → use the Opus codec (not Vorbis)
# -b:a 48k       → 48 kbps bitrate (excellent for speech)
# -ar 48000      → 48 kHz sample rate (required by most platforms)

💡 Pro Tip: Validate After Conversion
Always verify the output: ffprobe -v quiet -show_entries stream=codec_name -of csv=p=0 output.ogg should print opus. If the conversion instead fails with Unknown encoder 'libopus', your FFmpeg build wasn't compiled with libopus support. Install a build that includes it: brew install ffmpeg (macOS) or apt install ffmpeg libopus-dev (Linux).

Automating It in Your Pipeline

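#!/bin/bash
# tts-to-telegram.sh: convert any TTS output to Telegram voice format
# Usage: tts-to-telegram.sh INPUT [OUTPUT]
set -euo pipefail

INPUT="${1:?usage: tts-to-telegram.sh INPUT [OUTPUT]}"
OUTPUT="${2:-telegram-voice.ogg}"

# Print the codec of the first audio stream only (ignores cover art or extra streams)
probe_codec() {
    ffprobe -v error -select_streams a:0 -show_entries stream=codec_name -of csv=p=0 "$1"
}

# Detect current codec and container
CODEC=$(probe_codec "$INPUT")
CONTAINER=$(ffprobe -v error -show_entries format=format_name -of csv=p=0 "$INPUT")

if [ "$CODEC" = "opus" ] && [ "$CONTAINER" = "ogg" ]; then
    echo "Already Ogg Opus, copying"
    cp "$INPUT" "$OUTPUT"
else
    echo "Converting $CODEC ($CONTAINER) → Ogg Opus"
    ffmpeg -y -i "$INPUT" -c:a libopus -b:a 48k -ar 48000 "$OUTPUT"
fi

# Verify: fail loudly if the result is not Opus
FINAL=$(probe_codec "$OUTPUT")
echo "Output codec: $FINAL"
[ "$FINAL" = "opus" ] || { echo "Conversion failed: expected opus, got '$FINAL'" >&2; exit 1; }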

FFmpeg Conversion Cheat Sheet

Copy-paste commands for the most common TTS audio conversions:

To Ogg Opus (Telegram, WhatsApp, Discord, Signal)

# From MP3
ffmpeg -i input.mp3 -c:a libopus -b:a 48k -ar 48000 output.ogg

# From AAC / M4A
ffmpeg -i input.aac -c:a libopus -b:a 48k -ar 48000 output.ogg

# From WAV / PCM
ffmpeg -i input.wav -c:a libopus -b:a 48k -ar 48000 output.ogg

# From Ogg Vorbis (the Kokoro fix)
ffmpeg -i input.ogg -c:a libopus -b:a 48k -ar 48000 output.ogg

# From FLAC
ffmpeg -i input.flac -c:a libopus -b:a 48k -ar 48000 output.ogg

To AAC (iMessage, Apple ecosystem)

# From MP3
ffmpeg -i input.mp3 -c:a aac -b:a 128k output.m4a

# From Ogg (any codec)
ffmpeg -i input.ogg -c:a aac -b:a 128k output.m4a

# From WAV
ffmpeg -i input.wav -c:a aac -b:a 128k output.m4a

To MP3 (Universal compatibility)

# From any format
ffmpeg -i input.ogg -c:a libmp3lame -b:a 128k output.mp3

# Higher quality for podcast/RSS
ffmpeg -i input.wav -c:a libmp3lame -b:a 192k -ar 44100 output.mp3

To WAV (Processing intermediate)

# From any compressed format
ffmpeg -i input.ogg -c:a pcm_s16le -ar 44100 output.wav

Batch Convert All Files in a Directory

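# Convert all .ogg files to Telegram-compatible Opus
for f in *.ogg; do
    [ -e "$f" ] || continue                  # no .ogg files: the glob stays literal
    case "$f" in opus_*) continue ;; esac    # skip outputs from a previous run
    ffmpeg -nostdin -y -i "$f" -c:a libopus -b:a 48k -ar 48000 "opus_${f}"
done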

Recommended Format by Use Case

| Use Case | Best Format | Why |
|---|---|---|
| Chat app voice messages (Telegram, WhatsApp, Discord, Signal) | Ogg Opus, 48kHz, 48kbps | Required by all major chat platforms. Small files, excellent speech quality. |
| Podcast / RSS feed | MP3, 44.1kHz, 128–192kbps | Universal podcast player support. Every RSS reader and podcast app handles MP3. |
| Web audio player (blog, documentation) | AAC in M4A, 128kbps, or MP3 128kbps | AAC for quality/size; MP3 for maximum browser compatibility. |
| Phone call / VoIP | Opus, 16kHz, 16–32kbps, or μ-law/A-law 8kHz | Low latency critical. Opus dominates WebRTC; μ-law for traditional telephony (PSTN). |
| Apple ecosystem (iMessage, Siri, HomePod) | AAC in CAF or M4A | Native CoreAudio hardware decode. Zero-effort playback on all Apple devices. |
| Archival / post-processing | WAV (PCM 16-bit) or FLAC | Lossless. Preserve full TTS output quality for future re-encoding. |
| "I don't know the target" | MP3, 44.1kHz, 128kbps | Works literally everywhere. The safe default. |

💡 The Pragmatic Approach
Generate your TTS in the engine's best format (often MP3 or WAV), then convert to the target format with FFmpeg as the last step in your pipeline. This keeps your pipeline flexible: one source, multiple targets. Store WAV as your archival copy.
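
One way to keep "one source, multiple targets" manageable is to put the encoder settings from this guide behind a single lookup. The sketch below is only an illustration: the target labels (chat, apple, podcast, archive) and both function names are invented here, while the FFmpeg options are the same ones used in the cheat sheet above.

```shell
#!/bin/sh
# encoder_args TARGET: print FFmpeg audio options for a delivery target.
# Target names are hypothetical labels chosen for this example.
encoder_args() {
    case "$1" in
        chat)    echo "-c:a libopus -b:a 48k -ar 48000" ;;      # Telegram, WhatsApp, Discord, Signal
        apple)   echo "-c:a aac -b:a 128k" ;;                   # iMessage, Apple ecosystem
        podcast) echo "-c:a libmp3lame -b:a 192k -ar 44100" ;;  # podcast / RSS
        archive) echo "-c:a pcm_s16le" ;;                       # uncompressed WAV, source sample rate
        *)       echo "-c:a libmp3lame -b:a 128k -ar 44100" ;;  # unknown target: safe default
    esac
}

# output_ext TARGET: print the matching file extension.
output_ext() {
    case "$1" in
        chat)    echo "ogg" ;;
        apple)   echo "m4a" ;;
        archive) echo "wav" ;;
        *)       echo "mp3" ;;
    esac
}

# Usage (the unquoted substitution is intentional so the options split into words):
#   ffmpeg -i speech.wav $(encoder_args chat) "voice.$(output_ext chat)"
```

Keeping the mapping in one place means a new platform only needs a new case branch, not another copy of the conversion script.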

References

  1. Opus Codec — Official Site — Xiph.Org / IETF RFC 6716
  2. Amazon Polly SynthesizeSpeech API — AWS Documentation
  3. MP3 Patent Expiration — Wikipedia, 2017
  4. AVFAudio — Apple Developer Documentation — Apple CoreAudio AAC
  5. FFmpeg Codec Documentation — PCM, Opus, AAC encoders
  6. Telegram Bot API — sendVoice — "audio must be in an .ogg file encoded with OPUS"
  7. Slack API — Files and Uploads — Supported audio formats
  8. Android Supported Media Formats — Google Developer Docs
  9. OpenAI Text-to-Speech Guide — Output formats: mp3, opus, aac, flac, wav, pcm
  10. ElevenLabs — What audio formats do you support? — MP3, PCM, Opus, μ-law
  11. Google Cloud TTS AudioEncoding — LINEAR16, MP3, OGG_OPUS, MULAW, ALAW
  12. Azure Speech Service REST API — Output format options
  13. How to convert OGG file to Telegram voice format — Stack Overflow
  14. FFmpeg — Convert any media file to Telegram Opus audio note — Super User