AI Audio Transcriber — Free Speech-to-Text Online
Transcribe any audio file to text in your browser. Whisper.cpp recognises 99 languages and exports a clean .txt or timestamped transcript without uploading the recording.
Supports MP3, WAV, M4A, AAC, FLAC, OGG and Opus, up to 200 MB
How it works
1. Drop an audio file. MP3, WAV, M4A, AAC, FLAC, OGG and Opus are accepted, up to 200 MB.
2. Pick the output format: plain text, timestamped text, or .srt subtitles.
3. Pick the language (or leave it on Auto-detect). Toggle "Translate to English" if you want the transcript in English regardless of the source language.
4. The audio is decoded to 16 kHz mono PCM with the WebAudio API, then transcribed by Whisper.cpp running entirely in your browser via WebAssembly.
5. The transcript downloads automatically. Open it in any text editor, drop it into a video editor, or feed it back into the Subtitle Editor for timing tweaks.
About AI Audio Transcriber
If you record voice memos, lectures, podcasts, customer interviews, or meeting audio, typing it all up by hand is usually the most painful part of the workflow. The AI Audio Transcriber removes that step. Drop in any common audio file (MP3, WAV, M4A, AAC, FLAC, OGG, Opus) and a copy of whisper.cpp running inside your browser tab returns a clean transcript ready to paste into your notes, your email, or your editor. The recording itself stays on your device; the only thing transferred over the network is the speech model, which is downloaded to your browser.
The transcriber and the AI Subtitle Generator share the same speech recognition engine, so the same accuracy expectations apply: 90%+ on clean English speech; strong results in Spanish, French, German, Italian, Portuguese, Hindi, Mandarin and Japanese; and graceful degradation when the audio has heavy accents, overlapping speakers, or significant background noise. The single biggest improvement you can give a noisy recording is a pre-pass through Audio Noise Reducer: cleaning up wind, room hum, or fan noise can lift accuracy by 5-10 percentage points on a marginal clip.
Three output options cover the common workflows. Plain text is the default — every recognised line on a separate line, no timestamps, ready to paste into Google Docs or Notion. Timestamped output prefixes each line with the segment start time in [hh:mm:ss] format so you can navigate back to the source audio for any sentence. SRT-style output (segment-level start and end timestamps) is useful when you want to align the transcript to a separate video track or feed it back into the AI Subtitle Generator's editor.
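The three formats differ only in how each recognised segment is rendered. The following is a minimal TypeScript sketch, assuming a hypothetical `Segment` shape with start and end times in seconds; the tool's internal data structures are not documented here, so every name below is illustrative.

```typescript
// Hypothetical segment shape; the tool's real internal types are not documented.
interface Segment {
  start: number; // seconds from the beginning of the audio
  end: number;   // seconds from the beginning of the audio
  text: string;
}

// Left-pads a number with zeros to the given width.
function pad(n: number, width: number = 2): string {
  let s = String(n);
  while (s.length < width) s = "0" + s;
  return s;
}

// Splits a time in seconds into hours, minutes, seconds and milliseconds.
function splitTime(seconds: number) {
  const totalMs = Math.round(seconds * 1000);
  return {
    h: Math.floor(totalMs / 3600000),
    m: Math.floor(totalMs / 60000) % 60,
    s: Math.floor(totalMs / 1000) % 60,
    ms: totalMs % 1000,
  };
}

// Plain text: one recognised segment per line, no timestamps.
function toPlainText(segments: Segment[]): string {
  return segments.map((seg) => seg.text.trim()).join("\n");
}

// Timestamped text: each line prefixed with the segment start as [hh:mm:ss].
function toTimestamped(segments: Segment[]): string {
  return segments
    .map((seg) => {
      const t = splitTime(seg.start);
      return `[${pad(t.h)}:${pad(t.m)}:${pad(t.s)}] ${seg.text.trim()}`;
    })
    .join("\n");
}

// SRT: numbered cues with "hh:mm:ss,mmm --> hh:mm:ss,mmm" start and end times.
function toSrt(segments: Segment[]): string {
  const stamp = (time: number): string => {
    const t = splitTime(time);
    return `${pad(t.h)}:${pad(t.m)}:${pad(t.s)},${pad(t.ms, 3)}`;
  };
  return segments
    .map((seg, i) => `${i + 1}\n${stamp(seg.start)} --> ${stamp(seg.end)}\n${seg.text.trim()}\n`)
    .join("\n");
}
```

For a segment that starts at 3725.25 seconds, `toTimestamped` emits the prefix `[01:02:05]` and `toSrt` emits the start time `01:02:05,250`.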
The tool has the same 200 MB file-size ceiling as the subtitle generator, which comfortably fits a 60-minute MP3 at 128 kbps (about 58 MB) or a 90-minute uncompressed WAV at 16 kHz (about 173 MB, assuming 16-bit mono). Longer recordings can be transcribed by splitting them in Audio Trimmer first and processing each chunk separately; the resulting transcripts can then be concatenated. As with every tool on Favtoo, nothing is uploaded: your recording is decoded by the WebAudio API and fed into whisper.cpp running locally, and the transcript is built in memory.
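The size figures follow from bitrate multiplied by duration. A quick check in TypeScript, assuming decimal megabytes (1 MB = 1,000,000 bytes) and, for the WAV case, 16-bit mono samples; the bit depth and channel count are assumptions, since the text only gives the sample rate.

```typescript
// Size in decimal MB of a constant-bitrate compressed file (container overhead ignored).
function compressedSizeMB(bitrateKbps: number, minutes: number): number {
  const bits = bitrateKbps * 1000 * minutes * 60;
  return bits / 8 / 1000000;
}

// Size in decimal MB of uncompressed PCM audio (the small WAV header is ignored).
function pcmSizeMB(sampleRateHz: number, bitDepth: number, channels: number, minutes: number): number {
  const bytes = sampleRateHz * (bitDepth / 8) * channels * minutes * 60;
  return bytes / 1000000;
}

console.log(compressedSizeMB(128, 60));   // 57.6
console.log(pcmSizeMB(16000, 16, 1, 90)); // 172.8
```

Under the same assumptions a stereo WAV would be twice the mono size (345.6 MB for 90 minutes) and would exceed the 200 MB ceiling.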
How it works
1. Drop an audio file onto the upload area. MP3, WAV, M4A, AAC, FLAC, OGG and Opus are accepted, up to 200 MB.
2. Pick the output format: plain text, timestamped text, or SRT-style. Plain text is the default for note-taking workflows.
3. On first use, the whisper.cpp WebAssembly engine and the Whisper-tiny model (~39 MB combined) download from the Favtoo CDN and are cached in your browser.
4. The audio is decoded by the WebAudio API, resampled to 16 kHz mono, and processed in 30-second windows by the speech model.
5. The transcript is assembled and shown in the preview pane. Every line is editable so you can fix any recognised word that came out wrong.
6. Export the result as a .txt or .srt file. The original audio file on your disk is untouched.
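The audio preparation in the steps above (downmix to mono, resample to 16 kHz, cut into 30-second windows) can be sketched on raw sample arrays. In a browser the decoding and resampling would typically be handled by the WebAudio API (for example `decodeAudioData` and `OfflineAudioContext`); the TypeScript below only illustrates the arithmetic so it can run anywhere. The function names are hypothetical, and linear interpolation is a deliberately simple stand-in, not the resampling method the tool or the browser actually uses.

```typescript
const TARGET_RATE = 16000;  // Hz, the rate named in the steps above
const WINDOW_SECONDS = 30;  // window length named in the steps above

// Averages all channels into a single mono channel.
function downmixToMono(channels: Float32Array[]): Float32Array {
  if (channels.length === 0) return new Float32Array(0);
  const length = channels[0].length;
  const mono = new Float32Array(length);
  for (let i = 0; i < length; i++) {
    let sum = 0;
    for (let c = 0; c < channels.length; c++) sum += channels[c][i];
    mono[i] = sum / channels.length;
  }
  return mono;
}

// Resamples by linear interpolation. This is a simplification: production
// resamplers low-pass filter before downsampling to avoid aliasing.
function resampleLinear(input: Float32Array, fromRate: number, toRate: number): Float32Array {
  if (fromRate === toRate) return new Float32Array(input); // plain copy
  const outLength = Math.floor((input.length * toRate) / fromRate);
  const out = new Float32Array(outLength);
  const step = fromRate / toRate;
  for (let i = 0; i < outLength; i++) {
    const pos = i * step;
    const left = Math.floor(pos);
    const right = Math.min(left + 1, input.length - 1);
    const frac = pos - left;
    out[i] = input[left] * (1 - frac) + input[right] * frac;
  }
  return out;
}

// Cuts the samples into fixed-length windows; in this sketch the last one may be shorter.
function splitIntoWindows(samples: Float32Array, sampleRate: number, windowSeconds: number): Float32Array[] {
  const windowSize = sampleRate * windowSeconds;
  const windows: Float32Array[] = [];
  for (let start = 0; start < samples.length; start += windowSize) {
    windows.push(samples.subarray(start, Math.min(start + windowSize, samples.length)));
  }
  return windows;
}

// Chains the three steps for already-decoded audio.
function prepareForModel(channels: Float32Array[], sourceRate: number): Float32Array[] {
  const mono = downmixToMono(channels);
  const resampled = resampleLinear(mono, sourceRate, TARGET_RATE);
  return splitIntoWindows(resampled, TARGET_RATE, WINDOW_SECONDS);
}
```

A 70-second recording at 16 kHz yields three windows in this sketch: two full 30-second windows of 480,000 samples and a final 10-second window of 160,000 samples.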
Common use cases
- Turn a 30-minute customer-discovery interview into a searchable text transcript for product research
- Transcribe a recorded board meeting so the minutes can be drafted from text instead of replayed audio
- Convert a lecture recording into study notes that can be highlighted, searched, and reorganised
- Generate a written script from an off-the-cuff voice memo so it can be polished into a blog post
- Make a podcast episode searchable by publishing a text version alongside the audio
- Pull quotable lines out of a journalism recording without scrubbing through the timeline
FAQ
Which formats can I transcribe?
MP3, WAV, M4A, AAC, FLAC, OGG and Opus are accepted directly. Other formats can be normalised first with Audio Converter.
Will my recording be uploaded?
No. The audio is decoded locally and fed to whisper.cpp running inside your browser tab. The transcript is generated in memory and downloaded to your device only.
How accurate is it?
Whisper-tiny reaches 90%+ accuracy on clean English audio. Heavy accents, background noise, overlapping speakers and music all reduce accuracy. Run the file through Audio Noise Reducer first if the recording is noisy.
Can I get timestamps?
Yes — toggle the "Include timestamps" option to export segment-level start and end times alongside each line.
How big can the audio be?
Up to 200 MB. A 60-minute MP3 at 128 kbps fits comfortably under that ceiling.
How accurate is it for technical jargon?
Whisper does not know domain-specific vocabulary that was rare in its training data — niche pharmaceutical names, internal product codenames, obscure programming terms, etc. It will usually substitute the closest common word, which is easy to fix in the preview editor before you export. For repeated jargon you can do a Find & Replace on the exported .txt afterwards.
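For jargon that is misrecognised the same way each time, the find-and-replace pass can be scripted. A minimal TypeScript sketch, assuming a hand-written glossary that maps each misrecognised phrase to the intended term; the function name and the glossary entries are hypothetical examples, not part of the tool.

```typescript
// Replaces every whole-word, case-insensitive occurrence of each glossary key
// with its corrected term. The glossary is supplied by the user.
function applyGlossary(transcript: string, glossary: Record<string, string>): string {
  let result = transcript;
  for (const wrong of Object.keys(glossary)) {
    // Escape regex metacharacters so the key is matched literally.
    const escaped = wrong.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
    const pattern = new RegExp(`\\b${escaped}\\b`, "gi");
    // A replacer function avoids special handling of "$" in the replacement text.
    result = result.replace(pattern, () => glossary[wrong]);
  }
  return result;
}

// Hypothetical corrections for a recording about developer tooling.
const corrected = applyGlossary(
  "Run cube control apply, then check Post gress.",
  { "cube control": "kubectl", "post gress": "Postgres" }
);
console.log(corrected); // Run kubectl apply, then check Postgres.
```

The replacement is inserted exactly as written in the glossary, so a corrected term at the start of a sentence is not re-capitalised automatically.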
Can it tell speakers apart?
No. Whisper.cpp is a transcription engine, not a speaker-diarization engine. Every line is recognised but the transcript does not say who said which line. If you need speaker separation, you can manually annotate during the editing pass — that is still substantially faster than typing the entire recording from scratch.
Will the audio file be uploaded?
No. The file is read into your browser tab, decoded by the WebAudio API, and processed by whisper.cpp running locally. The transcript is generated in memory and offered as a direct download. Closing the tab discards the audio and the transcript; nothing is logged on any server.
How long will my recording take to transcribe?
Roughly 0.2-0.4× the audio duration on a modern laptop, so a 30-minute recording finishes in 6-12 minutes. On phones the ratio climbs to 0.4-0.8× — expect 12-24 minutes for the same 30-minute clip. The first run also includes the one-time engine + model download (~39 MB combined).
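The estimate is a simple multiplication of the audio length by the quoted ratio. A small TypeScript sketch using the ranges from the answer above; the function and type names are illustrative, and real speed depends on the device.

```typescript
type Device = "laptop" | "phone";

// Ratios of processing time to audio duration, as quoted in the answer above.
const REAL_TIME_FACTOR: Record<Device, [number, number]> = {
  laptop: [0.2, 0.4],
  phone: [0.4, 0.8],
};

// Returns the [fastest, slowest] expected processing time in minutes,
// rounded to one decimal place. Excludes the one-time model download.
function estimateMinutes(audioMinutes: number, device: Device): [number, number] {
  const [low, high] = REAL_TIME_FACTOR[device];
  const round1 = (x: number): number => Math.round(x * 10) / 10;
  return [round1(audioMinutes * low), round1(audioMinutes * high)];
}

console.log(estimateMinutes(30, "laptop")); // [ 6, 12 ]
console.log(estimateMinutes(30, "phone"));  // [ 12, 24 ]
```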
What format should I record in for the best transcription quality?
Aim for a sample rate of 16 kHz or higher, mono, and a bitrate of at least 64 kbps if you must compress. Lossless formats like WAV and FLAC give the best accuracy because they avoid the codec artefacts that lossy formats such as MP3 introduce on consonants. If you are recording on a phone, the default Voice Memos format on iOS (M4A) and the default Recorder format on Android both work well.
Can it transcribe multiple files at once?
One at a time. The model holds the audio buffer plus the inference state in memory; running two transcriptions simultaneously would likely exhaust the tab's memory on most devices. Drop the next file in once the first transcript is downloaded. The model is already loaded, so each subsequent file skips the engine warm-up step.
How is this different from cloud transcription services?
The pipeline is functionally similar — both use a deep-learning speech model — but cloud services run inference on the provider’s servers, with their own pricing and retention policies. The AI Audio Transcriber runs the same family of model entirely on your device, with no per-minute charge and the audio never leaving the browser. The trade-off is that some cloud services use larger, slower-but-more-accurate models on dedicated GPUs.
Related tools
AI Subtitle Generator
Generate accurate .srt subtitles from any video. Whisper.cpp transcribes the audio entirely in your browser and outputs a timestamped subtitle file ready to drop into your editor.
Audio Trimmer
Trim any audio file to a precise start and end time. Outputs a lossless stream-copy by default (no quality loss, very fast) or re-encodes to MP3, WAV, OGG, or M4A. Files are processed entirely in your browser with FFmpeg WebAssembly.
Audio Noise Reducer
Reduce constant background noise (hum, hiss, fan whir, AC drone) using FFmpeg's spectral noise reduction filter (afftdn). Files are processed entirely in your browser with FFmpeg WebAssembly.
Audio Converter
Convert any audio file between MP3, WAV, OGG, FLAC, M4A, AAC, Opus, and WMA. Pick the output bitrate or quality. Files are processed entirely in your browser with FFmpeg WebAssembly — nothing is uploaded.
Audio Compressor
Apply dynamic range compression to even out loud and quiet parts of any audio file. Pick threshold and ratio. Great for taming peaks in podcasts, evening out vocal performances, or making music loud for casual listening. Runs in your browser with FFmpeg WebAssembly.
Audio Merger
Merge up to 12 audio files into one continuous track. Supports MP3, WAV, OGG, M4A, AAC, FLAC, Opus, AIFF and more. Optional loudness normalization to even out clip levels. Files are processed entirely in your browser with FFmpeg WebAssembly.