Local Kokoro TTS · 14 Voices · No API Costs · Auto-Ducks Music · AI Captions

AI Voiceover for Your Video —
14 Voices, No API Costs

Type a script, pick from 14 English voices — Kokoro TTS runs locally, your script never goes to ElevenLabs, OpenAI, or any external API. Auto-ducks 500,000 CC0 music tracks during speech. Captions optional in the same render.

Unlimited voiceovers · Cancel anytime
See it in action — a real render
Want to test with a real input file? Open the live sample animation →
How it works
Drop your video
.mp4 · .mov · .m4v · .webm
Type your script
Up to 500 characters · 14 voices
Download narrated MP4
Music auto-ducks under speech

Studio-quality narration without the studio bill

14 local voices — your script stays private
5 American female (Bella, Nicole, Sarah, Sky, Kore), 4 American male (Adam, Michael, Onyx, Eric), 3 British female (Alice, Emma, Lily), 2 British male (Daniel, George). Kokoro runs on-server — your script never goes to ElevenLabs, OpenAI, or Azure.
Auto-ducked music — sounds like a podcast
When voiceover and background music are both present, FFmpeg's sidechaincompress drops the music ~8 dB during speech and lifts it back up between phrases. Voice on top, 500,000 CC0 tracks underneath — no manual mixing.
AI captions in the same render
Add voiceover and captions together — Whisper transcribes the generated speech and burns time-synced captions into frames. Four styles. One render pass: voiceover, captions, music, logo — all at once.
$5/month flat — no per-character bill
ElevenLabs Starter is $22/mo + per-character. OpenAI TTS is $15/M characters. Animation Machine is $5/month for unlimited voiceover renders, unlimited captions, and unlimited music. Flat rate, forever.
Pull script from on-screen text
Click "Pull script from video" and Tesseract OCRs visible text from your video frames, pre-filling the script box. Edit before rendering. Useful for re-narrating kinetic-typography content.
Fast — under a minute total
Kokoro needs roughly 1.2× the audio's duration to synthesize on our CPU box, so a 200-character script (~20 seconds of speech) generates in ~25 seconds. A full render with voiceover, captions, and music typically completes in under a minute.
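For the caption step described above (Whisper transcribes the generated speech into time-synced captions), a minimal sketch of turning transcript segments into an SRT file might look like this. It assumes the open-source `openai-whisper` package, whose `transcribe()` result carries a `segments` list with `start`, `end`, and `text`; the function names and the `base` model choice are illustrative, not this service's actual code.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments) -> str:
    """Convert Whisper-style segments (dicts with start/end/text) to SRT text."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        start = srt_timestamp(seg["start"])
        end = srt_timestamp(seg["end"])
        blocks.append(f"{index}\n{start} --> {end}\n{seg['text'].strip()}\n")
    return "\n".join(blocks)


def transcribe_to_srt(wav_path: str, model_name: str = "base") -> str:
    """Transcribe a WAV with Whisper and return SRT text (model name is an assumption)."""
    import whisper  # imported lazily so the helpers above work without it

    result = whisper.load_model(model_name).transcribe(wav_path)
    return segments_to_srt(result["segments"])
```

The resulting SRT text could then be burned into frames with a subtitle-rendering step; how the four caption styles are actually rendered is not covered by this sketch.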

Adding Voiceover to Video — Common Questions

How does the voiceover feature work?
Type a script (up to 500 characters) and pick one of 14 voices. We synthesize the audio using Kokoro v1.0 — an open-source TTS model based on StyleTTS2 — running locally on our server. The generated WAV is mixed into your video alongside any background music you've added.
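As a rough illustration of that synthesis step, here is a minimal sketch using the open-source `kokoro` Python package. The voice ids (`af_bella`, `bm_george`, and so on) follow Kokoro's published naming, but the mapping from display names, the lazy imports, and the output path are assumptions for illustration, not the service's real code.

```python
MAX_SCRIPT_CHARS = 500  # limit stated on this page

# Display name -> assumed Kokoro voice id.
# Prefix: a/b = American/British, f/m = female/male.
VOICES = {
    "Bella": "af_bella", "Nicole": "af_nicole", "Sarah": "af_sarah",
    "Sky": "af_sky", "Kore": "af_kore",
    "Adam": "am_adam", "Michael": "am_michael",
    "Onyx": "am_onyx", "Eric": "am_eric",
    "Alice": "bf_alice", "Emma": "bf_emma", "Lily": "bf_lily",
    "Daniel": "bm_daniel", "George": "bm_george",
}


def resolve_voice(name: str) -> tuple[str, str]:
    """Return (lang_code, voice_id) for a display name."""
    voice_id = VOICES[name]
    return voice_id[0], voice_id  # first letter doubles as the lang_code


def synthesize(script: str, voice_name: str, out_path: str = "voiceover.wav") -> str:
    """Write the script as a 24 kHz WAV and return its path."""
    if not script.strip():
        raise ValueError("script is empty")
    if len(script) > MAX_SCRIPT_CHARS:
        raise ValueError(f"script exceeds {MAX_SCRIPT_CHARS} characters")
    lang_code, voice_id = resolve_voice(voice_name)

    # Imported lazily so the helpers above work without the model installed.
    import numpy as np
    import soundfile as sf
    from kokoro import KPipeline

    pipeline = KPipeline(lang_code=lang_code)
    # The pipeline yields (graphemes, phonemes, audio) for each text chunk.
    chunks = [np.asarray(audio) for _, _, audio in pipeline(script, voice=voice_id)]
    sf.write(out_path, np.concatenate(chunks), 24000)
    return out_path
```

Calling `synthesize("Welcome to the demo.", "Bella")` would write `voiceover.wav`, which a later step can mix into the video.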
What voices are available?
14 voices total: 5 American female (Bella, Nicole, Sarah, Sky, Kore), 4 American male (Adam, Michael, Onyx, Eric), 3 British female (Alice, Emma, Lily), and 2 British male (Daniel, George). All English. Quality is comparable to ElevenLabs for short narration scripts.
Does this cost extra per character?
No. Kokoro TTS runs entirely on our server — no third-party API calls, no per-character billing. Your $5/month subscription covers unlimited voiceovers.
Can I record my own voice instead of using AI?
Yes — upload your own audio file in the music panel. The mixing pipeline treats it like any other audio track, so it'll play alongside your video.
What if I don't know what to say?
If your input is a video with text on screen (kinetic-typography animation, marketing video with overlays), click the "Pull script from video" button. Tesseract OCRs the on-screen text and pre-fills the script field for you. Edit before rendering if needed.
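One way such a "pull script" step could be put together is sketched below: sample frames with FFmpeg, OCR each with Tesseract through `pytesseract`, and merge the text while dropping lines that repeat across frames. The one-frame-per-second sampling, the de-duplication rule, and every function name here are assumptions made for illustration.

```python
import subprocess
from pathlib import Path

MAX_SCRIPT_CHARS = 500  # script limit stated on this page


def merge_frame_text(frame_texts: list[str], limit: int = MAX_SCRIPT_CHARS) -> str:
    """Join OCR lines from successive frames, skipping repeats, capped at `limit`."""
    seen: set[str] = set()
    lines: list[str] = []
    for text in frame_texts:
        for line in text.splitlines():
            line = " ".join(line.split())  # collapse runs of whitespace
            key = line.lower()
            if line and key not in seen:
                seen.add(key)
                lines.append(line)
    return " ".join(lines)[:limit].rstrip()


def pull_script(video: str, workdir: str = "frames") -> str:
    """Sample one frame per second, OCR each frame, and merge the text."""
    import pytesseract  # imported lazily; requires the Tesseract binary
    from PIL import Image

    Path(workdir).mkdir(exist_ok=True)
    subprocess.run(
        ["ffmpeg", "-y", "-i", video, "-vf", "fps=1", f"{workdir}/%04d.png"],
        check=True,
    )
    frames = sorted(Path(workdir).glob("*.png"))
    return merge_frame_text(
        [pytesseract.image_to_string(Image.open(frame)) for frame in frames]
    )
```

Because text usually stays on screen for several frames, the de-duplication in `merge_frame_text` keeps the pre-filled script from repeating the same line.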
Will the music drown out the voiceover?
No. When both are present, the music is automatically ducked under speech via FFmpeg's sidechaincompress filter: it drops about 8 dB whenever the voice signal exceeds the compressor threshold and lifts back up between phrases.
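To make the ducking concrete, here is a sketch of an FFmpeg invocation built in Python. The filter names (`asplit`, `apad`, `sidechaincompress`, `amix`) are real FFmpeg filters, but the threshold, ratio, attack, and release values are illustrative guesses rather than the settings that produce the 8 dB drop described above, and the file arguments are placeholders.

```python
def build_duck_command(video: str, voice: str, music: str, out: str) -> list[str]:
    """Return an ffmpeg argv that ducks `music` under `voice` and muxes with `video`."""
    filter_graph = (
        # Pad the voice with silence, then split it: one copy is mixed,
        # the other drives the compressor as the sidechain signal.
        "[1:a]apad,asplit=2[vo][sc];"
        # Compress the music whenever the sidechain crosses the threshold.
        "[2:a][sc]sidechaincompress="
        "threshold=0.05:ratio=8:attack=20:release=400[ducked];"
        # Mix the voice over the ducked music.
        "[vo][ducked]amix=inputs=2:duration=longest[aout]"
    )
    return [
        "ffmpeg", "-y",
        "-i", video,   # input 0: picture
        "-i", voice,   # input 1: generated voiceover
        "-i", music,   # input 2: background music
        "-filter_complex", filter_graph,
        "-map", "0:v", "-map", "[aout]",
        "-c:v", "copy", "-c:a", "aac",
        "-shortest",   # stop at the end of the video stream
        out,
    ]
```

Passing the returned list to `subprocess.run(..., check=True)` would run the render. A slower `release` makes the music swell back more gently between phrases.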
What input formats are supported?
MP4, MOV, M4V, and WebM video files. You can also add voiceover to HTML animations (Claude Design exports, Lottie, GSAP, CSS), which is useful for narrating a silent kinetic-typography piece.
How long does voiceover generation take?
Kokoro needs roughly 1.2× the audio's duration to synthesize on our 2-core CPU box, so a 200-character script (~20 seconds of speech) generates in ~25 seconds. Total render overhead is typically under 2 minutes.
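The arithmetic behind that estimate can be sketched as follows. The 10 characters per second rate is simply what the "200 characters ≈ 20 seconds" example implies, and the 1.2 factor is read here as synthesis time relative to audio length; both constants are assumptions that real speech and hardware will vary from.

```python
CHARS_PER_SECOND = 10.0  # implied by "200 characters ~ 20 seconds of speech"
SYNTH_FACTOR = 1.2       # synthesis time as a multiple of audio duration


def estimate_speech_seconds(script: str) -> float:
    """Approximate spoken duration of a script."""
    return len(script) / CHARS_PER_SECOND


def estimate_synthesis_seconds(script: str) -> float:
    """Approximate time to synthesize the script."""
    return estimate_speech_seconds(script) * SYNTH_FACTOR
```

For a 200-character script this gives about 24 seconds of synthesis time, in line with the ~25 seconds quoted; a maximum-length 500-character script would take about a minute.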
How much does it cost?
$5 per month, unlimited renders, unlimited voiceover characters. Cancel anytime.
More tools

Related converters & tools

Add Text to Video · Add an Intro to Video · Add an Outro to Video · Add a Watermark to Video
Sign in to continue
Sign in with Google to start — 1 free watermarked render included. Subscribe for $5/month to unlock unlimited clean exports.