Transcribe · sync · render

Captioned video in minutes,
not Saturdays.

Drop in audio or video. CaptionFit transcribes it, syncs the captions, and renders a finished video with burned-in captions in 9x16 or 16x9, with your font and your color.

Design as you wish
9x16 or 16x9 render
Download SRT or MP4
How it works

Three steps. That's the whole thing.

Transcribe audio. Sync captions. Render video. No timeline-scrubbing, no manually nudging timestamps, no weird XML.

— STEP 01 · TRANSCRIBE

Drop in audio or video

MP3, WAV, M4A, MP4 and more, up to 2 hours per file on paid plans (10 minutes on Free). Pick a language (or auto-detect) and a caption length, then hit Transcribe.

rooftop_demo.mp3 · 3.4 MB · 0:24
verse_takes_03.wav · 28.1 MB · 4:12 · aligning…
+ drop more files
— STEP 02 · SYNC

Paste script or lyrics (optional)

Got the words already? Paste them — one line per caption — to fix spelling and word grouping. We snap them to the audio.

LYRICS.TXT
When the night is bright
we run a little wild
city lights below ↳ snap to 0:08
I'll meet you on the rooftop
underneath the neon
for one more song
— STEP 03 · RENDER

Render video or grab the SRT

Pick a font, size, color and aspect ratio (9x16 or 16x9). Hit Render Video — or just download the SRT.

1
00:00:01,200 --> 00:00:03,800
When the night is bright

2
00:00:04,100 --> 00:00:06,500
we run a little wild

3
00:00:08,400 --> 00:00:11,100
city lights below
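
If you want to see how little there is to an SRT file, here is a minimal Python sketch that rebuilds the three cues shown above. It is not CaptionFit's code; the function names `srt_timestamp` and `to_srt` and the `(start_ms, end_ms, text)` tuple layout are assumptions made for illustration. The only format facts it relies on are the standard SRT ones: a cue number, a `HH:MM:SS,mmm --> HH:MM:SS,mmm` line, the caption text, and a blank line between cues.

```python
def srt_timestamp(ms: int) -> str:
    """Format a millisecond offset as an SRT timestamp (HH:MM:SS,mmm)."""
    hours, rem = divmod(ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    seconds, millis = divmod(rem, 1_000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"


def to_srt(segments):
    """Render (start_ms, end_ms, text) tuples as numbered SRT cues.

    The tuple layout is a hypothetical choice for this sketch.
    """
    blocks = []
    for index, (start_ms, end_ms, text) in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n"
            f"{srt_timestamp(start_ms)} --> {srt_timestamp(end_ms)}\n"
            f"{text}\n"
        )
    # A blank line separates consecutive cues.
    return "\n".join(blocks)


# The three cues from the preview above, in milliseconds.
segments = [
    (1200, 3800, "When the night is bright"),
    (4100, 6500, "we run a little wild"),
    (8400, 11100, "city lights below"),
]
srt_text = to_srt(segments)
```

Writing `srt_text` to a file with a `.srt` extension gives you something most players and editors will load as a caption track.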
What you get

Built for the last-mile stuff that always eats your night.

Fast

Faster than real-time

A 4-minute song aligns in about 20 seconds. We run on dedicated GPUs so your queue stays empty.

CaptionFit · 12s
Service B · 1m 42s
Service C · 3m 08s
By hand · ~40m
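
For a sense of scale, the numbers above work out as follows. This is plain arithmetic on the figures quoted on this page, not a new benchmark; the helper name `to_seconds` is just a convenience for the sketch.

```python
def to_seconds(minutes=0, seconds=0):
    """Convert a minutes/seconds pair to total seconds."""
    return minutes * 60 + seconds


# "A 4-minute song aligns in about 20 seconds."
song_length_s = to_seconds(minutes=4)
align_time_s = 20
realtime_factor = song_length_s / align_time_s  # audio seconds per processing second

# Timings from the comparison above, in seconds.
timings_s = {
    "CaptionFit": 12,
    "Service B": to_seconds(1, 42),
    "Service C": to_seconds(3, 8),
    "By hand": to_seconds(40),  # listed as roughly 40 minutes
}
baseline_s = timings_s["CaptionFit"]
speedups = {
    name: seconds / baseline_s
    for name, seconds in timings_s.items()
    if name != "CaptionFit"
}
```

On those figures, alignment runs at about 12x real time, and the 12-second result is 8.5x faster than Service B and roughly 200x faster than doing it by hand.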
Lyric-aware

It uses what you give it

Paste lyrics or a script and CaptionFit aligns to those exact words. No more "ahh-vuh-tahn-deal" mishears.

Audio-only
we run a little while
city light's below
underneath the neo
+ Lyrics
we run a little wild
city lights below
underneath the neon
Render-ready

Burn-in, your way

Pick a font, dial in size and color, choose 9x16 for Reels or 16x9 for YouTube. Hit Render Video.

Noto Sans ▾ A− A+
9x16
16x9 · Cover
Render queue: 12s Render Video →
In the editor

Preview, tweak, render. All in one tab.

captionfit / projects / rooftop_demo

Recent projects

rooftop_demo · just now
verse_takes_03 · 2h ago
podcast_ep_42 · yesterday
livestream_clip · Mon
lecture_intro · Apr 28
New transcription
rooftop_demo.mp3 · 6 segments
00:24 · english · lyric-aligned
00:01,200 · When the night is bright · 0.99
00:04,100 · we run a little wild · 0.97
00:08,400 · city lights below · 0.99
00:12,300 · I'll meet you on the rooftop · 0.95
00:16,000 · underneath the neon · 0.98
00:20,100 · for one more song · 0.99
Preview & edit

Tweak captions while the video plays.

Burn-in preview updates live. Edit a line, nudge a timestamp, split or merge — render when it feels right.

16x9 · 1080p
00:00 / 00:24
Noto Sans · 48 · White
Position · Bottom
Render Video
J back 1s · K play/pause · L ahead 1s · split here
From the inbox

People who used to dread Sunday captioning.

I had a 3-minute song and the lyrics in a Notes file. CaptionFit gave me an SRT in 18 seconds and I uploaded the video before my coffee was cold.

Mara Reyes
Independent songwriter

The lyric-paste feature is the unlock. Other tools mishear half my band's vocals — pasting the words means it just works.

Devon Wu
Music video editor

Replaced an internal Python script we'd been duct-taping for a year. The keyboard shortcuts in the editor are chef's kiss.

Priya Kothari
Podcast producer, Loopfield
FAQ

Things people ask before signing up.

How accurate is the alignment?

When you paste lyrics or a script, alignment is typically within 80–150ms of the spoken word — good enough that you'll rarely need to nudge anything. Audio-only transcription depends on the recording, but you can always paste a correction and re-align.

Which formats can I upload?

MP3, WAV, M4A, FLAC, AAC, OGG, plus video formats (MP4, MOV, WebM, MKV). Up to 2 hours per file on paid plans, 10 minutes on Free.

What aspect ratios can I render?

9x16 for Reels, TikTok, and Shorts, and 16x9 for YouTube and the web. Toggle Cover to fit a horizontal source into a vertical canvas (or vice versa).
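
Curious what a cover-style fit does to your frame? The sketch below assumes Cover behaves like the familiar "scale until the canvas is filled, then center-crop the overflow" approach (the same idea as CSS `object-fit: cover`). That is an assumption about the toggle, not a description of CaptionFit's renderer, and `cover_fit` is a hypothetical helper name.

```python
def cover_fit(src_w, src_h, canvas_w, canvas_h):
    """Return (scale, crop_x, crop_y) for a centered cover-style fit.

    Assumption: the source is scaled uniformly until it fills the canvas
    in both dimensions, and the overflow is cropped equally from each side.
    """
    scale = max(canvas_w / src_w, canvas_h / src_h)
    scaled_w = src_w * scale
    scaled_h = src_h * scale
    crop_x = (scaled_w - canvas_w) / 2  # pixels trimmed from left and from right
    crop_y = (scaled_h - canvas_h) / 2  # pixels trimmed from top and from bottom
    return scale, crop_x, crop_y


# A 1920x1080 (16x9) source placed on a 1080x1920 (9x16) canvas.
scale, crop_x, crop_y = cover_fit(1920, 1080, 1080, 1920)
```

Under that assumption a 16x9 clip is scaled up by about 1.78x and loses roughly 1,167 pixels (at the scaled size) from each side, so keep the important action near the center of the frame.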

Can I download captions as something other than a video?

Yep — download a clean SRT any time, even before rendering.

Does CaptionFit train on my audio?

No. Your files are processed and deleted from our servers within 24 hours unless you pin them to a project. We never use your audio or transcripts to train models.

What about long files or batch jobs?

On paid plans you can drop a folder of files at once and we'll align them in parallel. Long files (lectures, audiobooks) are chunked automatically — you still get one clean SRT at the end.
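
How can chunks still add up to one SRT? A rough sketch of the general idea, assuming each chunk is transcribed separately and reports cue times relative to its own start: shift every cue by the chunk's offset in the full file, then concatenate in order. How CaptionFit actually chunks and stitches is not described here, so `merge_chunks` and its data layout are hypothetical.

```python
def merge_chunks(chunks):
    """Merge per-chunk cues into one timeline for the whole file.

    chunks: list of (chunk_start_ms, cues), where each cue is
    (start_ms, end_ms, text) measured from the start of its chunk.
    Assumes chunks do not overlap.
    """
    merged = []
    # Sort by chunk offset so results from parallel jobs can arrive in any order.
    for chunk_start_ms, cues in sorted(chunks, key=lambda chunk: chunk[0]):
        for start_ms, end_ms, text in cues:
            merged.append((chunk_start_ms + start_ms, chunk_start_ms + end_ms, text))
    return merged


# Two chunks of a long recording, delivered out of order.
chunks = [
    (600_000, [(500, 2_000, "second chunk, first line")]),
    (0, [(1_200, 3_800, "first chunk, first line")]),
]
merged = merge_chunks(chunks)
```

The merged list is in file order with absolute timestamps, ready to be numbered and written out as a single SRT.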

Ready when you are

Drop a track. Get a captioned video.

No card required, no setup call, no "book a demo." Free tier covers most one-off projects.