Question 1

How does Animation Machine generate captions?

Accepted Answer

We run OpenAI's Whisper (small.en model, ~487MB) locally on our server using whisper.cpp, the optimized C++ port. The model transcribes your video's audio and produces an SRT subtitle track. We then burn those captions directly into the video frames via FFmpeg's libass-based subtitles filter, so they stay visible wherever the video plays, with no separate caption file required.
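
A pipeline like this can be sketched as three command lines: extract audio, transcribe, burn in. This is an illustration under assumptions, not the actual server code. The file names, the model path, and the `whisper-cli` binary name (older whisper.cpp builds ship it as `main`) are all hypothetical. The function only builds the commands; it does not run them.

```python
from pathlib import PurePosixPath


def build_caption_commands(video, workdir, model):
    """Return [extract, transcribe, burn] as argument lists for subprocess.run."""
    workdir = PurePosixPath(workdir)
    audio = workdir / "audio.wav"        # hypothetical temp file name
    srt_base = workdir / "captions"      # whisper.cpp appends ".srt" itself
    output = workdir / "captioned.mp4"   # hypothetical output name

    # whisper.cpp expects 16 kHz mono 16-bit PCM WAV input.
    extract = ["ffmpeg", "-y", "-i", str(video), "-vn",
               "-ar", "16000", "-ac", "1", "-c:a", "pcm_s16le", str(audio)]

    # -osrt writes an SRT file; -of sets the output path without extension.
    transcribe = ["whisper-cli", "-m", str(model), "-f", str(audio),
                  "-osrt", "-of", str(srt_base)]

    # The subtitles filter renders the SRT through libass onto each frame.
    # Paths with special characters (":", "'", "\") would need escaping here.
    burn = ["ffmpeg", "-y", "-i", str(video),
            "-vf", f"subtitles={srt_base}.srt", "-c:a", "copy", str(output)]

    return [extract, transcribe, burn]
```

Each list could be passed to `subprocess.run(cmd, check=True)` in order. Because the video is re-encoded during burn-in, only the audio stream is copied unchanged.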

Question 2

How accurate are the captions?

Accepted Answer

Whisper is one of the strongest open-source ASR (automatic speech recognition) models available. For clear English audio, the small.en model typically reaches 95%+ word accuracy. It handles accents, background music, and natural speech well. Quality drops on heavily distorted audio or unusual jargon, but typical YouTube-style narration transcribes cleanly.

Question 3

Can I customize how the captions look?

Accepted Answer

Yes. There are four built-in caption styles: Subtle (small, bottom-center, the default), Bold (large white-on-black TikTok-style text, 42pt with a thick outline), Cinematic (italic with a raised margin), and Block (a semi-opaque dark background pill). All four scale appropriately at 720p, 1080p, and 4K.
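
One way to express styles like these is the `force_style` option of FFmpeg's `subtitles` filter, which overrides ASS style fields. The table below is a sketch only. Apart from the 42pt size for Bold, every value (other font sizes, outline width, margin, colours) is a placeholder assumption rather than the product's real settings, and resolution scaling is not modelled.

```python
# Hypothetical style table. ASS colours are &HAABBGGRR (AA = 00 is opaque).
CAPTION_STYLES = {
    "subtle": {"FontSize": 18, "Alignment": 2},            # 2 = bottom-center
    "bold": {"FontSize": 42, "Bold": 1,
             "PrimaryColour": "&H00FFFFFF",                # white text
             "OutlineColour": "&H00000000",                # black outline
             "Outline": 3},
    "cinematic": {"Italic": 1, "MarginV": 60, "Alignment": 2},
    # BorderStyle=3 draws an opaque box in the outline colour.
    "block": {"BorderStyle": 3, "OutlineColour": "&H80000000"},
}


def subtitles_filter(srt_path, style="subtle"):
    """Build a -vf argument that applies one of the named styles."""
    fields = ",".join(f"{key}={value}"
                      for key, value in CAPTION_STYLES[style].items())
    return f"subtitles={srt_path}:force_style='{fields}'"
```

For example, `subtitles_filter("captions.srt", "bold")` yields a filter string that can be passed to `ffmpeg -vf`.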

Question 4

Does this work for animations or videos with no spoken audio?

Accepted Answer

Yes — when Whisper finds no usable speech, the system automatically falls back to OCR (Tesseract) and reads on-screen text from your video frames instead. So a kinetic-typography marketing animation gets captioned from its visible text, even though there's no spoken narration.
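
The fallback decision can be pictured as a check on the transcript itself. How the system actually decides that speech is "usable" is not documented here, so the sketch below assumes a simple rule: count real words in the SRT output, ignoring bracketed non-speech tags such as `[BLANK_AUDIO]`, and switch to OCR below a hypothetical threshold of three words.

```python
import re

# Matches an SRT timing line such as "00:00:01,000 --> 00:00:04,000".
TIMESTAMP = re.compile(r"^\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}")
# Matches bracketed or parenthesised tags such as [MUSIC] or (silence).
NON_SPEECH_TAG = re.compile(r"[\[(][^\])]*[\])]")


def count_spoken_words(srt_text):
    """Count words in caption lines, skipping indices, timings, and tags."""
    words = 0
    for line in srt_text.splitlines():
        line = line.strip()
        # Note: a caption consisting only of digits is treated as a cue index.
        if not line or line.isdigit() or TIMESTAMP.match(line):
            continue
        line = NON_SPEECH_TAG.sub("", line)
        words += len(re.findall(r"[A-Za-z0-9']+", line))
    return words


def choose_caption_source(srt_text, min_words=3):
    """Return "whisper" if the transcript has enough words, else "ocr"."""
    return "whisper" if count_spoken_words(srt_text) >= min_words else "ocr"
```

With this rule, a transcript containing only `[BLANK_AUDIO]` cues selects the OCR path, while ordinary narration stays on the Whisper path.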

Question 5

Are my videos uploaded to third parties for transcription?

Accepted Answer

No. Whisper runs entirely on our server. Your audio is never sent to OpenAI, Google, AssemblyAI, Deepgram, or any third-party transcription service. The temporary audio extract is deleted at the end of your render.

Question 6

What input formats are supported?

Accepted Answer

MP4, MOV, M4V, and WebM video files. You can also caption HTML animations (Claude Design exports, Lottie, GSAP, CSS) — the OCR fallback reads on-screen text since these typically don't have audio.

Question 7

Can I render captioned video for TikTok or Reels?

Accepted Answer

Yes. Use a one-click platform preset (TikTok, Reels, YouTube Shorts, Instagram, etc.) and the captions automatically render with the right aspect ratio and crop. Bold caption style works particularly well for short-form vertical content.

Question 8

How long does captioning take?

Accepted Answer

Whisper transcription runs at roughly 1.3× realtime (slightly faster than playback speed) on our 2-core CPU box, so a 30-second video adds about 25 seconds to the render. The caption burn-in step adds another ~5–10 seconds. Total overhead is typically 30–60 seconds for short videos.
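
As a back-of-the-envelope check on these figures, the estimate is the video length divided by the transcription speed, plus the burn-in range. The function below is a sketch that uses only the numbers quoted above. Fixed costs such as model loading and audio extraction are not stated, so they are left out, which is one reason real totals can land higher.

```python
def caption_overhead_seconds(video_seconds, speed=1.3, burn_in=(5.0, 10.0)):
    """Return a (low, high) estimate of the extra render time in seconds."""
    transcribe = video_seconds / speed   # 1.3x realtime: 30 s of audio takes ~23 s
    return transcribe + burn_in[0], transcribe + burn_in[1]


low, high = caption_overhead_seconds(30)
print(f"{low:.0f}-{high:.0f} seconds")
```

For a 30-second video this gives roughly 28 to 33 seconds, consistent with the lower end of the typical range.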

Question 9

How much does it cost?

Accepted Answer

$5 per month for unlimited renders, including unlimited captions. No per-video fee, no per-minute fee, no API surcharges. Cancel anytime.

AI Captions for Your Video: Local Whisper, No API Costs

Captions that look right and sound right

Adding Captions to Video — Common Questions

Related converters & tools
