AI Subtitle Generator — Free Auto Subtitles (.srt) Online

Generate accurate .srt subtitles from any video. Whisper.cpp transcribes the audio entirely in your browser and outputs a timestamped subtitle file ready to drop into your editor.

Supports MP4, WebM, MOV, MKV, OGG, AVI, up to 200 MB

Runs entirely in your browser

Related tools

Add Subtitles to Video

Burn an SRT or ASS subtitle file directly into your video frames so the captions show in every player. Files are processed entirely in your browser with FFmpeg WebAssembly.

video

AI Audio Transcriber

Transcribe any audio file to text in your browser. Whisper.cpp recognises 99 languages and exports a clean .txt or timestamped transcript without uploading the recording.

audio

Video to MP3

Extract the audio from any video as an MP3 file — entirely in your browser. Pick bitrate and channels, then download the result. Files are processed locally with FFmpeg WebAssembly.

video

Video Trimmer

Set precise in and out timestamps, snap to keyframes when needed, and document handles for social-safe cutdowns.

video

Video Compressor

Shrink any video with five quality presets — from visually lossless to tiny — using H.264 + AAC at the proven CRF rate-control. Files are processed entirely in your browser with FFmpeg WebAssembly.

video

Video Converter

Convert any video to MP4, WebM, MOV, or MKV — entirely inside your browser. Files are processed locally with FFmpeg WebAssembly, so nothing is uploaded.

video

About AI Subtitle Generator

Subtitling a video used to be a paid SaaS chore: upload the file to a subscription transcription service or video editor, wait for someone else's GPU to grind through it, then pay per minute to download the .srt. The AI Subtitle Generator does the same job inside your browser. A copy of whisper.cpp compiled to WebAssembly transcribes the audio and emits a timestamped subtitle file you can drop straight into a professional video editor, CapCut, FFmpeg, or YouTube Studio. The whole video stays on your device; the only network traffic is the one-time download of the engine and model files.

The pipeline starts by extracting the audio track. We use the same FFmpeg-WASM binary that powers Compress Video and Convert Video, so MP4, MOV, MKV, WebM, AVI and OGV are all supported without any plugins. The audio is resampled to 16 kHz mono — the format Whisper expects — and handed to the inference engine in 30-second windows. Each window comes back as a list of segments with start and end timestamps and recognised text. The tool stitches those segments together into a standards-compliant .srt file: numeric cue index, hh:mm:ss,xxx --> hh:mm:ss,xxx timestamps, and the cue text on the next line.
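The cue layout described above can be sketched in a few lines of Python. This is an illustration of the .srt format only, not the tool's own code: the `Segment` class and the `srt_timestamp` and `build_srt` helpers are hypothetical names, and the sketch assumes each segment carries start and end times in seconds.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One recognised segment; times are seconds from the start of the audio."""
    start: float
    end: float
    text: str


def srt_timestamp(seconds: float) -> str:
    """Format seconds as hh:mm:ss,xxx (SRT puts a comma before the milliseconds)."""
    total_ms = max(0, round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, millis = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def build_srt(segments: list[Segment]) -> str:
    """Join segments into SRT cues: index line, timestamp line, text, blank line."""
    cues = []
    for index, seg in enumerate(segments, start=1):
        cues.append(
            f"{index}\n"
            f"{srt_timestamp(seg.start)} --> {srt_timestamp(seg.end)}\n"
            f"{seg.text.strip()}\n"
        )
    # Cues are separated by one blank line.
    return "\n".join(cues)
```

Calling `build_srt([Segment(0.0, 2.5, "Hello"), Segment(2.5, 4.0, "World")])` yields two numbered cues separated by a blank line, which is the shape most editors and players expect.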

Whisper-tiny — the model variant used here — was trained on 680,000 hours of multilingual speech. It auto-detects the language of the audio (no need to pre-select), and accuracy on clear English speech reaches well above 90%. Spanish, French, German, Italian, Portuguese, Hindi, Mandarin, and Japanese also produce solid results. Heavy accents, overlapping voices, distant microphones, and background music are the universal hard cases for any speech model; the typical fix is to run the source through Audio Noise Reducer first, or to extract just the speaker track if you have a separate dialogue stem.

The 200 MB file-size cap covers anything from a short clip to a half-hour podcast video. Performance is bounded by your CPU rather than your network: a 10-minute clip transcribes in roughly 2-4 minutes on a modern laptop and 4-8 minutes on a phone, after the one-time 39 MB model download. Subsequent clips reuse the cached model, so each new run only pays the inference cost. The model files come from the same Cloudflare R2 bucket that hosts the FFmpeg core and the Tesseract OCR engine, which is the only third-party host the production CSP allows for this site.
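As a rough worked example, the figures above imply a processing-to-audio ratio of about 0.2 to 0.4 on a laptop and 0.4 to 0.8 on a phone. The Python sketch below scales that ratio to other clip lengths. It assumes processing time grows linearly with clip length, which the figures above do not guarantee, and `estimate_minutes` and the two constants are hypothetical names used only for illustration.

```python
# Ratios derived from the quoted figures: 2-4 min (laptop) and 4-8 min (phone)
# of processing per 10 minutes of audio. Linear scaling is an assumption.
LAPTOP_FACTOR = (0.2, 0.4)
PHONE_FACTOR = (0.4, 0.8)


def estimate_minutes(clip_minutes: float, factor_range: tuple[float, float]) -> tuple[float, float]:
    """Return a (low, high) estimate of transcription time in minutes."""
    low, high = factor_range
    return clip_minutes * low, clip_minutes * high


if __name__ == "__main__":
    for label, factor in (("laptop", LAPTOP_FACTOR), ("phone", PHONE_FACTOR)):
        low, high = estimate_minutes(30, factor)
        print(f"30-minute clip on a {label}: about {low:.0f}-{high:.0f} minutes")
```

The one-time model download is not included in the estimate, since it only applies to the first run on a device.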

How it works

1. Drop a video onto the upload area. MP4, MOV, MKV, WebM, AVI and OGV are all accepted, up to 200 MB.
2. On first use, the whisper.cpp WebAssembly engine and the Whisper-tiny model (~39 MB combined) download from the Favtoo CDN and are cached in your browser. This is a one-time cost per device.
3. FFmpeg extracts the audio from the video, resamples it to 16 kHz mono, and feeds it to the speech model in 30-second windows.
4. The model auto-detects the language and returns segment-level timestamps and recognised text. Progress is shown so you can see how many minutes have been processed.
5. Preview the cues in the timeline editor. Edit any line whose text needs cleanup; timestamps stay locked to the audio.
6. Export the .srt file. Drop it into your video editor, or use Add Subtitles to Video to burn the cues directly into the video frames.
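The windowing in step 3 comes down to simple arithmetic: at 16 kHz mono, a 30-second window is 480,000 samples. The Python sketch below shows one way to cut a sample buffer into such windows and shift window-relative timestamps back onto the full-clip timeline. It is a simplified illustration, not the way whisper.cpp manages its window internally, and `split_into_windows` and `to_absolute` are hypothetical helper names.

```python
SAMPLE_RATE = 16_000   # Hz, mono
WINDOW_SECONDS = 30    # window length fed to the model


def split_into_windows(samples, sample_rate=SAMPLE_RATE, window_seconds=WINDOW_SECONDS):
    """Yield (offset_seconds, chunk) pairs; the last chunk may be shorter."""
    window_len = sample_rate * window_seconds
    for start in range(0, len(samples), window_len):
        yield start / sample_rate, samples[start:start + window_len]


def to_absolute(offset_seconds, window_segments):
    """Shift window-relative (start, end, text) tuples onto the full-clip timeline."""
    return [
        (offset_seconds + start, offset_seconds + end, text)
        for start, end, text in window_segments
    ]


if __name__ == "__main__":
    audio = [0.0] * (SAMPLE_RATE * 70)  # 70 seconds of silence as stand-in audio
    for offset, chunk in split_into_windows(audio):
        print(f"window at {offset:.0f}s, {len(chunk) / SAMPLE_RATE:.0f}s long")
```

A segment reported at 1.0 to 2.5 seconds inside the window that starts at 30 seconds therefore lands at 31.0 to 32.5 seconds in the exported file.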

Common use cases

Subtitle a 10-minute YouTube tutorial you narrated yourself, without paying per-minute transcription fees
Caption a customer-testimonial reel before uploading to Instagram or LinkedIn so the audio-off audience still gets the message
Generate an English transcript of a foreign-language interview clip for a journalism piece
Add accessibility captions to an internal training video before sharing it on a corporate intranet
Produce subtitle files for an indie short film without uploading the unreleased cut to a third-party SaaS
Caption a school assignment recording so a teacher can grade against a written transcript

FAQ

Which languages are supported?

The Whisper-tiny model recognises 99 languages out of the box and auto-detects the language of your audio. Accuracy is highest on English; Spanish, French, German, Italian, Portuguese, Hindi and Japanese also work well.

Does my video upload anywhere?

No. The audio is decoded with FFmpeg-WASM, fed to whisper.cpp running locally in your browser, and the .srt file is built in memory. Nothing leaves the device.

How big can the video be?

Files up to 200 MB. For longer recordings, run the source through Video Compressor first or extract the audio with Video to MP3 to cut the working size.

How long does a 10-minute clip take?

Roughly 2–4 minutes on a modern laptop after the one-time 39 MB model download. Phones are slower — expect 4–8 minutes for the same clip.

Can I edit the subtitles before exporting?

Yes — every cue is editable in the preview pane and the timestamps stay locked to the source. After exporting, drop the .srt into your video editor or burn it in with Add Subtitles to Video.

How is this different from the captions YouTube generates?

YouTube’s auto-captions use a server-side speech model and only work if the video is uploaded to YouTube. The AI Subtitle Generator runs entirely on your device, gives you the .srt file directly so you can edit and use it anywhere, and works on any video file regardless of where you plan to host it. Accuracy is comparable on clean English speech; YouTube’s model is slightly stronger on heavily accented audio because it has access to far more compute at inference time.

Will the original video file be uploaded anywhere?

No. The video bytes are read into memory by JavaScript inside your browser tab, decoded by FFmpeg-WASM running locally, and the audio samples are handed to whisper.cpp running locally. The .srt file is built in memory and offered as a download. Open the Network tab in DevTools while the tool runs and you will only see the initial CDN requests for the engine + model — no requests carrying your video.

Can it transcribe multiple speakers separately?

No — the Whisper-tiny model is a transcriber, not a diarizer. Every spoken line is captured but the .srt does not say which speaker said which line. For interview-style content where speaker separation matters, you can manually annotate the cues afterwards or run the audio through a dedicated diarization tool first.

What if the language detection picks the wrong language?

You can override it with the language dropdown. Pick the dominant language of your audio; whisper.cpp will still recognise the occasional foreign word but it will treat the bulk of the audio as the language you selected. For genuinely bilingual content, transcribe twice (once per language) and merge the .srt files manually.
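If you would rather script the merge than do it by hand, the Python sketch below interleaves two .srt files by start time and renumbers the cues. It assumes you have already deleted the wrong-language cues from each file, so the two inputs do not cover the same speech twice; `parse_srt` and `merge_srt` are hypothetical helper names, not part of this tool.

```python
import re

# Matches an SRT timing line such as "00:00:02,500 --> 00:00:04,500".
TIMING_RE = re.compile(r"(\d{2}:\d{2}:\d{2},\d{3}) --> (\d{2}:\d{2}:\d{2},\d{3})")


def parse_srt(text):
    """Return (start, end, body) tuples; timestamps are kept as strings."""
    cues = []
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.splitlines()
        if len(lines) < 3:
            continue  # not a complete cue (index, timing, text)
        match = TIMING_RE.fullmatch(lines[1].strip())
        if match:
            cues.append((match.group(1), match.group(2), "\n".join(lines[2:])))
    return cues


def merge_srt(first, second):
    """Interleave the cues of two SRT files by start time and renumber them."""
    # Fixed-width hh:mm:ss,xxx strings sort in chronological order as plain text.
    cues = sorted(parse_srt(first) + parse_srt(second), key=lambda cue: cue[0])
    return "\n".join(
        f"{index}\n{start} --> {end}\n{body}\n"
        for index, (start, end, body) in enumerate(cues, start=1)
    )
```

Check the merged file in the preview of your editor afterwards, since two passes over the same audio can still leave overlapping cues that need a manual decision.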

Why is the model only 39 MB? Is bigger better?

Whisper-tiny is the smallest variant in the family, at roughly 39 million parameters. The base, small, medium, and large variants have roughly 74 million, 244 million, 769 million and 1.55 billion parameters respectively, with each step up giving meaningfully better accuracy at the cost of a much heavier download and substantially slower inference. Tiny is the practical sweet spot for in-browser use: it gives 90%+ accuracy on clean English speech without making mobile users wait minutes for a model download.

Will it work on a phone?

Yes, on any modern smartphone browser. Phones are CPU-bound for inference, so transcription runs roughly 2× slower than on a laptop (about 4-8 minutes versus 2-4 minutes for a 10-minute clip). The model still loads and runs from the browser cache; you can put the phone to sleep mid-job and the tab will resume when you wake it up.

Can I transcribe a music video?

You can try, but speech recognition models are trained on speech and will struggle with vocals over instrumental backing. Isolate the vocals first with a vocal-separation tool (the kind used to make karaoke tracks) so you have a clean voice track, then transcribe that.
