AI Subtitle Generator — Free Auto Subtitles (.srt) Online

Generate accurate .srt subtitles from any video. Whisper.cpp transcribes the audio entirely in your browser and outputs a timestamped subtitle file ready to drop into your editor.

Supports MP4, WebM, MOV, MKV, OGG, AVI, up to 200 MB

Runs entirely in your browser

About AI Subtitle Generator

Subtitling a video used to be a paid SaaS chore: upload the file to a subscription transcription service or video editor, wait for someone else's GPU to grind through it, then pay per minute to download the .srt. The AI Subtitle Generator does the same job inside your browser. A copy of whisper.cpp compiled to WebAssembly transcribes the audio and emits a timestamped subtitle file you can drop straight into a professional video editor, CapCut, FFmpeg, or YouTube Studio. The video stays on your device; the only network traffic is the download of the engine and model files.

The pipeline starts by extracting the audio track. We use the same FFmpeg-WASM binary that powers Compress Video and Convert Video, so MP4, MOV, MKV, WebM, AVI and OGV are all supported without any plugins. The audio is resampled to 16 kHz mono (the format Whisper expects) and handed to the inference engine in 30-second windows. Each window comes back as a list of segments with start and end timestamps and recognised text. The tool stitches those segments together into a well-formed .srt file, where each cue is a numeric index, a hh:mm:ss,mmm --> hh:mm:ss,mmm timestamp line, the cue text on the following line, and a blank line before the next cue.
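As an illustration of the .srt layout described above, here is a minimal Python sketch that turns a list of timed segments into subtitle text. It is not the tool's own code (which runs in the browser); the names Segment, srt_timestamp and build_srt are hypothetical and exist only for this example.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One recognised segment (hypothetical type for this sketch)."""
    start: float  # seconds from the start of the audio
    end: float    # seconds from the start of the audio
    text: str


def srt_timestamp(seconds: float) -> str:
    """Format seconds as hh:mm:ss,mmm (SRT puts a comma before the milliseconds)."""
    total_ms = max(0, round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def build_srt(segments: list[Segment]) -> str:
    """Serialise segments as cues: index, timestamp line, text, blank separator."""
    cues = []
    index = 0
    for seg in segments:
        text = seg.text.strip()
        if not text:
            continue  # skip empty segments so cue numbers stay consecutive
        index += 1
        cues.append(
            f"{index}\n{srt_timestamp(seg.start)} --> {srt_timestamp(seg.end)}\n{text}\n"
        )
    # Joining with "\n" leaves exactly one blank line between cues.
    return "\n".join(cues)
```

Calling build_srt([Segment(0.0, 2.5, "Hello")]) yields a single cue numbered 1 with the timestamp line 00:00:00,000 --> 00:00:02,500.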

Whisper-tiny — the model variant used here — was trained on 680,000 hours of multilingual speech. It auto-detects the language of the audio (no need to pre-select), and accuracy on clear English speech reaches well above 90%. Spanish, French, German, Italian, Portuguese, Hindi, Mandarin, and Japanese also produce solid results. Heavy accents, overlapping voices, distant microphones, and background music are the universal hard cases for any speech model; the typical fix is to run the source through Audio Noise Reducer first, or to extract just the speaker track if you have a separate dialogue stem.

The 200 MB upload cap covers anything from a short clip to a half-hour podcast video. Performance is bounded by your CPU rather than your network: a 10-minute clip transcribes in roughly 2-4 minutes on a modern laptop and 4-8 minutes on a phone, after the one-time 39 MB model download. Subsequent clips reuse the cached model, so each new run only pays the inference cost. The model files come from the same Cloudflare R2 bucket that hosts the FFmpeg core and the Tesseract OCR engine — it is the only third-party host the production CSP allows for this site.
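For a rough sense of how long other clip lengths might take, the quoted figures can be scaled. The Python sketch below assumes transcription time grows linearly with clip length, which is an assumption made for illustration and not a measured property of the tool; estimate_minutes is a hypothetical helper, and the one-time model download is not included.

```python
# Ranges quoted above for a 10-minute clip, in minutes of processing time.
MINUTES_PER_TEN = {"laptop": (2.0, 4.0), "phone": (4.0, 8.0)}


def estimate_minutes(clip_minutes: float, device: str = "laptop") -> tuple[float, float]:
    """Return a (low, high) estimate, assuming time scales linearly with length."""
    low, high = MINUTES_PER_TEN[device]
    scale = clip_minutes / 10.0
    return (low * scale, high * scale)
```

Under that assumption, a half-hour video would take about 6 to 12 minutes on a laptop and 12 to 24 minutes on a phone.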

How it works

  1. Drop a video onto the upload area. MP4, MOV, MKV, WebM, AVI and OGV are all accepted, up to 200 MB.
  2. On first use, the whisper.cpp WebAssembly engine and the Whisper-tiny model (~39 MB combined) download from the Favtoo CDN and are cached in your browser. This is a one-time cost per device.
  3. FFmpeg extracts the audio from the video, resamples it to 16 kHz mono, and feeds it to the speech model in 30-second windows.
  4. The model auto-detects the language and returns segment-level timestamps and recognised text. Progress is shown so you can see how many minutes have been processed.
  5. Preview the cues in the timeline editor. Edit any line whose text needs cleanup; timestamps stay locked to the audio.
  6. Export the .srt file. Drop it into your video editor, or use Add Subtitles to Video to burn the cues directly into the video frames.
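The windowing in step 3 and the stitching of per-window results can be sketched as follows. This is a simplified Python illustration that assumes fixed, non-overlapping 30-second windows; the real engine may advance through the audio differently, and window_bounds and stitch are hypothetical names used only here.

```python
SAMPLE_RATE = 16_000   # 16 kHz mono, as in step 3
WINDOW_SECONDS = 30    # window length, as in step 3


def window_bounds(num_samples: int) -> list[tuple[int, int]]:
    """Return (start, end) sample indices for consecutive 30-second windows."""
    step = SAMPLE_RATE * WINDOW_SECONDS  # 480,000 samples per window
    return [(s, min(s + step, num_samples)) for s in range(0, num_samples, step)]


def stitch(windows):
    """Shift window-relative segment times onto the full-audio timeline.

    `windows` is a list of (window_start_sample, segments) pairs, where each
    segment is a (start_s, end_s, text) tuple measured from its window start.
    """
    timeline = []
    for start_sample, segments in windows:
        offset = start_sample / SAMPLE_RATE  # window start in seconds
        for start_s, end_s, text in segments:
            timeline.append((offset + start_s, offset + end_s, text))
    return timeline
```

A 70-second clip splits into two full windows and one 10-second remainder, and a segment that begins 0.5 s into the second window lands at 30.5 s on the final timeline.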

Common use cases

  • Subtitle a 10-minute YouTube tutorial in your own voice without paying per-minute transcription fees
  • Caption a customer-testimonial reel before uploading to Instagram or LinkedIn so the audio-off audience still gets the message
  • Generate an English transcript of a foreign-language interview clip for a journalism piece
  • Add accessibility captions to an internal training video before sharing it on a corporate intranet
  • Produce subtitle files for an indie short film without uploading the unreleased cut to a third-party SaaS
  • Caption a school assignment recording so a teacher can grade against a written transcript

FAQ

Which languages are supported?

The Whisper-tiny model recognises 99 languages out of the box and auto-detects the language of your audio. Accuracy is highest on English; Spanish, French, German, Italian, Portuguese, Hindi and Japanese also work well.

Does my video upload anywhere?

No. The audio is decoded with FFmpeg-WASM, fed to whisper.cpp running locally in your browser, and the .srt file is built in memory. Nothing leaves the device.

How big can the video be?

Files up to 200 MB. For larger files, run the source through Compress Video first or extract the audio with Video to MP3 to cut the working size.

How long does a 10-minute clip take?

Roughly 2–4 minutes on a modern laptop after the one-time 39 MB model download. Phones are slower — expect 4–8 minutes for the same clip.

Can I edit the subtitles before exporting?

Yes — every cue is editable in the preview pane and the timestamps stay locked to the source. After exporting, drop the .srt into your video editor or burn it in with Add Subtitles to Video.

How is this different from the captions YouTube generates?

YouTube’s auto-captions use a server-side speech model and only work if the video is uploaded to YouTube. The AI Subtitle Generator runs entirely on your device, gives you the .srt file directly so you can edit and use it anywhere, and works on any video file regardless of where you plan to host it. Accuracy is comparable on clean English speech; YouTube’s model is slightly stronger on heavily accented audio because it has access to far more compute at inference time.

Will the original video file be uploaded anywhere?

No. The video bytes are read into memory by JavaScript inside your browser tab, decoded by FFmpeg-WASM running locally, and the audio samples are handed to whisper.cpp running locally. The .srt file is built in memory and offered as a download. Open the Network tab in DevTools while the tool runs and you will only see the initial CDN requests for the engine + model — no requests carrying your video.

Can it transcribe multiple speakers separately?

No — the Whisper-tiny model is a transcriber, not a diarizer. Every spoken line is captured but the .srt does not say which speaker said which line. For interview-style content where speaker separation matters, you can manually annotate the cues afterwards or run the audio through a dedicated diarization tool first.

What if the language detection picks the wrong language?

You can override it with the language dropdown. Pick the dominant language of your audio; whisper.cpp will still recognise the occasional foreign word but it will treat the bulk of the audio as the language you selected. For genuinely bilingual content, transcribe twice (once per language) and merge the .srt files manually.
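If you would rather script the final merge step, the Python sketch below combines two .srt files into one. It assumes you have already deleted the wrong-language cues from each file by hand, so the remaining cues do not overlap; parse_srt and merge_srt are hypothetical helpers and not part of the tool.

```python
def parse_srt(text: str) -> list[tuple[str, str, str]]:
    """Return (start, end, text) tuples; timestamps stay as hh:mm:ss,mmm strings."""
    cues = []
    normalised = text.replace("\r\n", "\n").strip()
    for block in normalised.split("\n\n"):
        lines = block.strip().splitlines()
        if len(lines) < 3 or " --> " not in lines[1]:
            continue  # not a well-formed cue; skip it
        start, end = lines[1].split(" --> ")
        cues.append((start.strip(), end.strip(), "\n".join(lines[2:])))
    return cues


def merge_srt(first: str, second: str) -> str:
    """Combine cues from two files, order them by start time, and renumber."""
    # Zero-padded hh:mm:ss,mmm strings sort correctly as plain text.
    cues = sorted(parse_srt(first) + parse_srt(second))
    blocks = [
        f"{i}\n{start} --> {end}\n{text}\n"
        for i, (start, end, text) in enumerate(cues, 1)
    ]
    return "\n".join(blocks)
```

The cue numbers in the output are regenerated, so the original numbering of either input file does not matter.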

Why is the model only 39 MB? Is bigger better?

Whisper-tiny is the smallest variant in the family, with about 39 million parameters. The base, small, medium, and large variants have roughly 74 million, 244 million, 769 million and 1.55 billion parameters respectively, with each step up giving meaningfully better accuracy at the cost of a much heavier download and substantially slower inference. Tiny is the practical sweet spot for in-browser use; it gives 90%+ accuracy on clean English speech without making mobile users wait minutes for a model download.

Will it work on a phone?

Yes, on any modern smartphone browser. Phones are CPU-bound for inference so transcription runs roughly 2-3× slower than on a laptop. The model still loads and runs from the browser cache; you can put the phone to sleep mid-job and the tab will resume when you wake it up.

Can I transcribe a music video?

You can try, but speech recognition models are trained on speech and will struggle with vocals over instrumental backing. Isolate the vocals first with a vocal separation tool (the kind used to make karaoke tracks) to get a clean vocal track, then transcribe that.

Subtitle Editor

Open a .srt or .vtt subtitle file and shift timestamps, scale playback speed, find/replace text, strip styling, or convert between formats. Pairs with the AI Subtitle Generator for a complete subtitle workflow.

Compress Video

Reduce video file size while preserving quality. 100% browser-based — your video never leaves your device.

Convert Video

Convert videos between MP4, WebM, MOV, MKV, and animated GIF — entirely in your browser. Your file never leaves your device.

Video to GIF

Convert any video clip to an animated GIF entirely in your browser. Pick the start, length, frame rate, and width — your file is processed locally with FFmpeg WebAssembly and never uploaded.

Screen Recorder

Record your screen, a window, or a browser tab directly in your browser. Optionally include system audio and your microphone. Capture, preview, and download the video without installing any app — and without uploading anything.

Webcam Recorder

Record your webcam directly in your browser with optional microphone audio. Pick the resolution (480p, 720p, or 1080p), frame rate, and mirror mode, then capture and download the result without installing any app.

Screen + Webcam Recorder

Record your screen with your webcam composited into a picture-in-picture corner — perfect for tutorials, course videos, demos, and reaction recordings. Pick the camera position, size, and audio sources, then capture and download in your browser.

Video Slideshow Maker

Turn a stack of photos into an MP4 slideshow with per-slide durations, crossfades, and an optional soundtrack. Pick the resolution (up to 1080p), frame rate, and transitions, then download a single MP4 — all processed in your browser with FFmpeg WebAssembly.
