Local Whisper AI · No API Costs · 4 Caption Styles · 500k CC0 Tracks · Voiceover
AI Captions for Your Video: Local Whisper, No API Costs
Local Whisper (small.en) transcribes your video's audio and burns time-synced captions directly into frames. Four styles, every aspect ratio, every platform. Your audio never leaves our server — no OpenAI, no AssemblyAI, no per-minute fees.
1 free render — no card needed · $5/mo or $36/yr after · Cancel anytime
How it works
1. Drop your video (.mp4 · .mov · .m4v · .webm)
2. Whisper transcribes (local · ~1.3× realtime · ~95% accuracy)
3. Download captioned MP4 (burned in · plays everywhere)
Why use this
Captions that look right and sound right
Local Whisper — ~95% accuracy
OpenAI Whisper (small.en) runs on-server via whisper.cpp. ~95% word accuracy on clear English audio. Handles accents, background music, and natural speech — without sending a byte to any external transcription service.
Audio never leaves our server
Unlike services such as Rev, Kapwing, or Descript, we never upload your audio to OpenAI, Google, or AssemblyAI. Whisper runs entirely on our box. The temporary audio extract is deleted at the end of your render.
4 caption styles + OCR fallback
Subtle (bottom-center, default), Bold (thick outline for TikTok/Reels), Cinematic (italic, raised), Block (semi-opaque background). No spoken audio? Tesseract reads on-screen text and uses that as the caption track instead.
$5/month — unlimited captions
Flat rate, no per-minute charge. Most caption services bill $0.10–$0.50 per minute of audio — at $5/mo you break even after 10–50 minutes. One-click presets for TikTok, Reels, YouTube Shorts, YouTube, and Instagram.
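The break-even range is the flat fee divided by the per-minute rate. A minimal sketch of that arithmetic, using only the figures quoted above (the function name is illustrative):

```python
def break_even_minutes(flat_fee: float, per_minute_rate: float) -> float:
    """Minutes of audio at which a per-minute service costs as much as the flat fee."""
    if per_minute_rate <= 0:
        raise ValueError("per_minute_rate must be positive")
    return flat_fee / per_minute_rate


# Figures quoted above: $5/month flat, $0.10 to $0.50 per minute elsewhere.
low = break_even_minutes(5.00, 0.50)   # against the most expensive rate
high = break_even_minutes(5.00, 0.10)  # against the cheapest rate
print(f"Break-even after {low:.0f} to {high:.0f} minutes of audio per month")
```

Any month in which you caption more audio than that comes out cheaper on the flat rate.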
Add AI voiceover too
Captions and voiceover work in the same render. Kokoro TTS narrates from a script, Whisper captions the speech, FFmpeg ducks the music during narration. One render, fully produced video.
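One common way to duck music under narration in FFmpeg is the sidechaincompress filter, which lowers one signal whenever a second "key" signal is loud. The sketch below builds such a command. It is an assumption about the general approach, not the exact production command: the function name, file names, and the threshold, ratio, attack, and release values are all illustrative.

```python
def build_ducking_command(video: str, narration: str, music: str, output: str) -> list[str]:
    """Build an FFmpeg command that lowers the music whenever narration is present."""
    filter_graph = (
        # Split the narration: one copy goes to the final mix, one drives the compressor.
        "[1:a]asplit=2[voice][key];"
        # Compress the music (input 2), using the narration copy as the sidechain signal.
        "[2:a][key]sidechaincompress=threshold=0.05:ratio=8:attack=5:release=300[ducked];"
        # Mix narration and ducked music into a single audio stream.
        "[voice][ducked]amix=inputs=2:duration=longest[aout]"
    )
    return [
        "ffmpeg", "-i", video, "-i", narration, "-i", music,
        "-filter_complex", filter_graph,
        "-map", "0:v", "-map", "[aout]",
        "-c:v", "copy",   # video frames are untouched in this step
        "-shortest",      # stop at the shortest stream so music does not overrun the video
        output,
    ]


command = build_ducking_command("clip.mp4", "voiceover.wav", "music.mp3", "clip_mixed.mp4")
```

Pass the returned list to subprocess.run(command, check=True) to execute it. A lower threshold or higher ratio ducks the music harder.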
500,000 CC0 music tracks
Pick background music from Freesound's CC0 catalog and bake captions in the same pass. Music auto-ducks under voiceover narration. No attribution required, no extra cost beyond the $5/month flat rate.
FAQ
Adding Captions to Video — Common Questions
How does Animation Machine generate captions?
We run OpenAI's Whisper (small.en model, ~487MB) locally on our server using whisper.cpp — the optimized C++ port. The model transcribes your video's audio with high accuracy and produces an SRT subtitle track. We then burn those captions directly into the video frames via FFmpeg's libass-based subtitles filter, so they're permanently visible everywhere it plays — no separate caption file required.
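The three steps described above (extract audio, transcribe to SRT, burn in) can be sketched as a list of commands. This is a minimal illustration, not the exact production pipeline: the function name, model path, and output file names are assumptions, and the whisper.cpp binary is called whisper-cli in recent builds but main in older ones.

```python
from pathlib import Path


def build_caption_pipeline(video: str, model: str = "models/ggml-small.en.bin") -> list[list[str]]:
    """Return three commands: extract audio, transcribe to SRT, burn captions in."""
    stem = Path(video).stem
    wav = f"{stem}.wav"
    srt = f"{stem}.srt"
    out = f"{stem}_captioned.mp4"

    # 1. whisper.cpp expects 16 kHz mono 16-bit PCM WAV input.
    extract = ["ffmpeg", "-y", "-i", video, "-vn",
               "-ar", "16000", "-ac", "1", "-c:a", "pcm_s16le", wav]

    # 2. -osrt writes an SRT file; -of sets the output path without its extension.
    transcribe = ["whisper-cli", "-m", model, "-f", wav, "-osrt", "-of", stem]

    # 3. The subtitles filter renders the SRT through libass onto every frame.
    burn = ["ffmpeg", "-y", "-i", video, "-vf", f"subtitles={srt}", out]

    return [extract, transcribe, burn]


commands = build_caption_pipeline("clip.mp4")
```

Each command can be run in order with subprocess.run(cmd, check=True). File names containing characters such as colons or quotes need extra escaping inside the subtitles filter argument, which this sketch does not handle.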
How accurate are the captions?
Whisper is one of the strongest open-source ASR (automatic speech recognition) models available. For clear English audio, the small.en model produces about 95% word accuracy. It handles accents, background music, and natural speech well. Quality drops on heavily distorted audio or unusual jargon, but typical YouTube-style narration transcribes cleanly.
Can I customize how the captions look?
Yes. Four built-in caption styles: Subtle (small bottom-center, default), Bold (large white-on-black TikTok-style at 42pt with thick outline), Cinematic (italic with raised margin), and Block (semi-opaque dark background pill). All scale appropriately at every resolution.
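With FFmpeg's subtitles filter, styles like these are typically expressed as ASS style overrides passed through force_style. The sketch below shows one way the four presets could be represented. Only the 42pt size for Bold comes from the description above; every other value, and the helper names, are illustrative guesses rather than the actual presets.

```python
# ASS style overrides understood by libass. Alignment=2 means bottom-center.
CAPTION_STYLES = {
    "subtle": {"FontSize": 18, "Outline": 1, "Alignment": 2, "MarginV": 20},
    "bold": {"FontSize": 42, "Bold": 1, "Outline": 4,
             "PrimaryColour": "&H00FFFFFF",   # white text (AABBGGRR, 00 = opaque)
             "OutlineColour": "&H00000000",   # black outline
             "Alignment": 2},
    # A larger MarginV lifts the caption away from the bottom edge.
    "cinematic": {"FontSize": 22, "Italic": 1, "Outline": 1, "Alignment": 2, "MarginV": 60},
    # BorderStyle=3 draws a box in the outline colour behind the text;
    # the 80 alpha byte makes that box semi-transparent.
    "block": {"FontSize": 22, "BorderStyle": 3, "OutlineColour": "&H80000000", "Alignment": 2},
}


def force_style(style_name: str) -> str:
    """Render a preset as the comma-separated string that force_style accepts."""
    fields = CAPTION_STYLES[style_name]
    return ",".join(f"{key}={value}" for key, value in fields.items())


def subtitles_filter(srt_path: str, style_name: str = "subtle") -> str:
    """Build the value for FFmpeg's -vf option."""
    return f"subtitles={srt_path}:force_style='{force_style(style_name)}'"


bold_filter = subtitles_filter("clip.srt", "bold")
```

Note that libass interprets font sizes relative to the subtitle script's reference resolution rather than output pixels, which is why one preset can scale across resolutions.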
Does this work for animations or videos with no spoken audio?
Yes. When Whisper finds no usable speech, the system automatically falls back to OCR (Tesseract) and reads on-screen text from your video frames instead. So a kinetic-typography marketing animation gets captioned from its visible text, even though there's no spoken narration.
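The fallback decision can be sketched as a simple check on the transcript. How "no usable speech" is actually detected is not specified here, so the word-count threshold, the segment format, and the function names below are all assumptions for illustration.

```python
import re

# Bracketed or parenthesized tags such as "[BLANK_AUDIO]" or "(music)" are not speech.
NON_SPEECH = re.compile(r"\[[^\]]*\]|\([^)]*\)")


def count_spoken_words(segments: list[dict]) -> int:
    """Count words in the transcript after removing non-speech tags."""
    return sum(len(NON_SPEECH.sub(" ", seg["text"]).split()) for seg in segments)


def pick_caption_source(whisper_segments: list[dict], ocr_lines: list[dict],
                        min_words: int = 3) -> tuple[str, list[dict]]:
    """Prefer the speech transcript; fall back to OCR text when speech is missing."""
    if count_spoken_words(whisper_segments) >= min_words:
        return "whisper", whisper_segments
    return "ocr", ocr_lines


# Hypothetical inputs: a narrated clip, a silent animation, and OCR text from its frames.
narrated = [{"start": 0.0, "end": 2.0, "text": "Welcome to the demo"}]
silent = [{"start": 0.0, "end": 5.0, "text": "[BLANK_AUDIO]"}]
on_screen = [{"start": 0.0, "end": 5.0, "text": "Launch faster"}]

source_for_narrated, _ = pick_caption_source(narrated, on_screen)
source_for_silent, cues = pick_caption_source(silent, on_screen)
```

In this sketch the narrated clip keeps its Whisper transcript, while the silent animation is captioned from the on-screen text.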
Are my videos uploaded to third parties for transcription?
No. Whisper runs entirely on our server. Your audio is never sent to OpenAI, Google, AssemblyAI, Deepgram, or any other transcription service. The temporary audio extract is deleted at the end of your render.
What input formats are supported?
MP4, MOV, M4V, and WebM video files. You can also caption HTML animations (Claude Design exports, Lottie, GSAP, CSS) — the OCR fallback reads on-screen text since these typically don't have audio.
How long does captioning take?
Whisper transcription runs at roughly 1.3× realtime on our 2-core CPU box, meaning slightly faster than playback speed, so a 30-second video adds about 25 seconds to the render. The caption burn-in step adds another 5–10 seconds or so. Total overhead is typically 30–60 seconds for short videos.
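The estimate above is the video duration divided by the transcription speed, plus the burn-in time. A minimal sketch of that arithmetic using the quoted figures (the function name is illustrative, and real render times will vary with the audio):

```python
def caption_overhead_seconds(duration_s: float, speed: float = 1.3,
                             burn_in_s: tuple[float, float] = (5.0, 10.0)) -> tuple[float, float]:
    """Return the (low, high) extra render time in seconds for a video of duration_s."""
    if speed <= 0:
        raise ValueError("speed must be positive")
    # At 1.3x realtime, one second of audio takes 1/1.3 seconds to transcribe.
    transcribe_s = duration_s / speed
    return (transcribe_s + burn_in_s[0], transcribe_s + burn_in_s[1])


low, high = caption_overhead_seconds(30)
print(f"30 s video: roughly {low:.0f} to {high:.0f} s of added render time")
```

For a 30-second video the transcription term works out to about 23 seconds, close to the figure quoted above, and the total lands near the low end of the typical range.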
How much does it cost?
$5 per month (or $36 per year) for unlimited renders, including unlimited captions. Your first render is free, with no card needed. No per-video fee, no per-minute fee, no API surcharges. Cancel anytime.