AI Audio Transcriber — Free Speech-to-Text Online

Transcribe any audio file to text in your browser. Whisper.cpp recognises 99 languages and exports a clean .txt or timestamped transcript without uploading the recording.

No sign up required · Files stay in your browser · 100% free

Supports MP3, WAV, M4A, FLAC, OGG, AAC, up to 200MB

How it works

  1. Drop an audio file. MP3, WAV, M4A, FLAC, OGG and AAC are supported, up to 200 MB.
  2. Pick the output format: plain text, timestamped text, or .srt subtitles.
  3. Pick the language (or leave on Auto-detect). Toggle "Translate to English" if you want the transcript in English regardless of source language.
  4. The audio is decoded to 16 kHz mono PCM with the WebAudio API, then transcribed by Whisper.cpp running entirely in your browser via WebAssembly.
  5. The transcript downloads automatically. Open it in any text editor, drop it into a video editor, or feed it back into the Subtitle Editor for timing tweaks.
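
Step 4 can be pictured with a small sketch. The page does not say how the tool resamples internally, so the code below is an assumption: it averages the channels into mono and resamples with plain linear interpolation (no anti-aliasing filter). The function names are hypothetical; in a real browser build, rendering through an OfflineAudioContext created at 16 kHz is a likely alternative.

```typescript
// Average all channels into a single mono channel.
function downmixToMono(channels: Float32Array[]): Float32Array {
  const length = channels.length > 0 ? channels[0].length : 0;
  const mono = new Float32Array(length);
  for (let i = 0; i < length; i++) {
    let sum = 0;
    for (let c = 0; c < channels.length; c++) {
      sum += channels[c][i];
    }
    mono[i] = sum / channels.length;
  }
  return mono;
}

// Linear-interpolation resampler. Illustrative only: it applies no
// low-pass filter, so it is not a production-quality downsampler.
function resampleLinear(input: Float32Array, fromRate: number, toRate: number): Float32Array {
  const outLength = Math.floor((input.length * toRate) / fromRate);
  const out = new Float32Array(outLength);
  const step = fromRate / toRate;
  for (let i = 0; i < outLength; i++) {
    const pos = i * step;
    const i0 = Math.floor(pos);
    const i1 = Math.min(i0 + 1, input.length - 1);
    const frac = pos - i0;
    out[i] = input[i0] * (1 - frac) + input[i1] * frac;
  }
  return out;
}

// Hypothetical helper: decoded channel data in, 16 kHz mono samples out.
function toWhisperInput(channels: Float32Array[], sampleRate: number): Float32Array {
  return resampleLinear(downmixToMono(channels), sampleRate, 16000);
}
```

One second of 48 kHz stereo passed to toWhisperInput comes back as 16,000 mono samples.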

About AI Audio Transcriber

If you record voice memos, lectures, podcasts, customer interviews, or meeting audio, the manual typing-it-up step is usually the most painful part of the workflow. The AI Audio Transcriber removes that step. Drop in any common audio file — MP3, WAV, M4A, AAC, FLAC, OGG, Opus — and a copy of whisper.cpp running inside your browser tab returns a clean transcript ready to paste into your notes, your email, or your editor. The recording itself stays on your device; only the speech model travels.

The transcriber and the AI Subtitle Generator share the same speech recognition engine, so the same accuracy expectations apply: 90%+ on clean English speech, strong results in Spanish, French, German, Italian, Portuguese, Hindi, Mandarin and Japanese, and graceful degradation when the audio has heavy accents, overlapping speakers, or significant background noise. The single biggest improvement you can give a noisy recording is a pre-pass through Audio Noise Reducer; cleaning up wind, room hum, or fan noise can lift accuracy by 5-10 percentage points on a marginal clip.

Three output options cover the common workflows. Plain text is the default — every recognised line on a separate line, no timestamps, ready to paste into Google Docs or Notion. Timestamped output prefixes each line with the segment start time in [hh:mm:ss] format so you can navigate back to the source audio for any sentence. SRT-style output (segment-level start and end timestamps) is useful when you want to align the transcript to a separate video track or feed it back into the AI Subtitle Generator's editor.
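
The three outputs differ only in how each recognised segment is rendered. Here is a minimal sketch, assuming each segment carries a start time, an end time (both in seconds) and its text. The Segment shape and the function names are hypothetical, not the tool's actual code; the [hh:mm:ss] prefix follows the description above and the SRT timing line follows the standard SubRip layout.

```typescript
// Hypothetical segment shape: times are in seconds.
interface Segment {
  start: number;
  end: number;
  text: string;
}

// Left-pad a non-negative integer with zeros.
function pad(n: number, width: number): string {
  let s = String(n);
  while (s.length < width) s = "0" + s;
  return s;
}

// Whole seconds rendered as hh:mm:ss.
function hms(seconds: number): string {
  const total = Math.floor(seconds);
  const h = Math.floor(total / 3600);
  const m = Math.floor((total % 3600) / 60);
  const s = total % 60;
  return pad(h, 2) + ":" + pad(m, 2) + ":" + pad(s, 2);
}

// SubRip time code: hh:mm:ss,mmm
function srtTime(seconds: number): string {
  const totalMs = Math.round(seconds * 1000);
  return hms(Math.floor(totalMs / 1000)) + "," + pad(totalMs % 1000, 3);
}

// Plain text: one recognised line per output line, no timestamps.
function toPlainText(segments: Segment[]): string {
  return segments.map((s) => s.text).join("\n");
}

// Timestamped text: each line prefixed with its segment start time.
function toTimestamped(segments: Segment[]): string {
  return segments.map((s) => "[" + hms(s.start) + "] " + s.text).join("\n");
}

// SRT-style: numbered blocks with start and end times.
function toSrt(segments: Segment[]): string {
  const blocks: string[] = [];
  for (let i = 0; i < segments.length; i++) {
    const s = segments[i];
    blocks.push(String(i + 1) + "\n" + srtTime(s.start) + " --> " + srtTime(s.end) + "\n" + s.text);
  }
  return blocks.join("\n\n");
}
```

Keeping the segments as data and formatting them only at export time means one transcription pass can serve all three formats.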

The tool has the same 200 MB upload ceiling as the subtitle generator, which comfortably fits a 60-minute MP3 at 128 kbps (about 58 MB) or a 90-minute uncompressed 16-bit mono WAV at 16 kHz (about 173 MB). Longer recordings can be transcribed by splitting them in Audio Trimmer first and processing each chunk separately; the resulting transcripts can then be concatenated. As with every tool on Favtoo, nothing is uploaded: your recording is decoded by the WebAudio API, fed into whisper.cpp running locally, and the transcript is built in memory.
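
The size figures above can be checked with a little arithmetic. The sketch below assumes 1 MB = 1,000,000 bytes, ignores container and header overhead, and treats the WAV example as 16-bit mono, which is the case in which a 90-minute file stays under the ceiling. The function names are hypothetical.

```typescript
// Approximate size of a constant-bitrate compressed file, in MB (10^6 bytes).
function compressedSizeMB(minutes: number, kbps: number): number {
  const bytesPerSecond = (kbps * 1000) / 8;
  return (bytesPerSecond * minutes * 60) / 1e6;
}

// Approximate size of uncompressed PCM audio, in MB (10^6 bytes).
function pcmSizeMB(minutes: number, sampleRateHz: number, channels: number, bitsPerSample: number): number {
  const bytesPerSecond = sampleRateHz * channels * (bitsPerSample / 8);
  return (bytesPerSecond * minutes * 60) / 1e6;
}

// True when the estimate is within the 200 MB ceiling.
function fitsUploadLimit(sizeMB: number): boolean {
  return sizeMB <= 200;
}

const mp3Hour = compressedSizeMB(60, 128);     // about 57.6 MB
const wavMono = pcmSizeMB(90, 16000, 1, 16);   // about 172.8 MB
const wavStereo = pcmSizeMB(90, 16000, 2, 16); // about 345.6 MB, over the ceiling
```

The stereo case shows why longer or multi-channel WAV recordings are better split or converted before transcription.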

How it works

  1. Drop an audio file onto the upload area. MP3, WAV, M4A, AAC, FLAC, OGG and Opus are accepted, up to 200 MB.
  2. Pick the output format: plain text, timestamped text, or SRT-style. Plain text is the default for note-taking workflows.
  3. On first use, the whisper.cpp WebAssembly engine and the Whisper-tiny model (~39 MB combined) download from the Favtoo CDN and are cached in your browser.
  4. The audio is decoded by the WebAudio API, resampled to 16 kHz mono, and processed in 30-second windows by the speech model.
  5. The transcript is assembled and shown in the preview pane. Every line is editable so you can fix any recognised word that came out wrong.
  6. Export the result as a .txt or .srt file. The original audio file on your disk is untouched.
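
The 30-second windowing in step 4 can be sketched as follows. This is a simplification under stated assumptions: fixed, non-overlapping windows with the final one zero-padded to full length. The real engine may advance its window differently (for example, based on predicted timestamps), and the names below are hypothetical.

```typescript
const SAMPLE_RATE = 16000;  // Hz, the rate the speech model expects
const WINDOW_SECONDS = 30;  // window length described in step 4

// Split 16 kHz mono samples into fixed 30-second windows.
// The last window is padded with silence (zeros) to full length.
function splitIntoWindows(samples: Float32Array): Float32Array[] {
  const windowSize = SAMPLE_RATE * WINDOW_SECONDS; // 480,000 samples
  const windows: Float32Array[] = [];
  for (let start = 0; start < samples.length; start += windowSize) {
    const end = Math.min(start + windowSize, samples.length);
    const w = new Float32Array(windowSize); // zero-filled by default
    w.set(samples.subarray(start, end));
    windows.push(w);
  }
  return windows;
}
```

A 70-second clip yields three windows: two full ones and a third holding 10 seconds of audio followed by 20 seconds of silence.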

Common use cases

  • Turn a 30-minute customer-discovery interview into a searchable text transcript for product research
  • Transcribe a recorded board meeting so the minutes can be drafted from text instead of replayed audio
  • Convert a lecture recording into study notes that can be highlighted, searched, and reorganised
  • Generate a written script from an off-the-cuff voice memo so it can be polished into a blog post
  • Make a podcast episode searchable by publishing a text version alongside the audio
  • Pull quotable lines out of a journalism recording without scrubbing through the timeline

FAQ

Which formats can I transcribe?

MP3, WAV, M4A, AAC, FLAC, OGG and Opus are accepted directly. Other formats can be normalised first with Audio Converter.

Will my recording be uploaded?

No. The audio is decoded locally and fed to whisper.cpp running inside your browser tab. The transcript is generated in memory and downloaded to your device only.

How accurate is it?

Whisper-tiny reaches 90%+ accuracy on clean English audio. Heavy accents, background noise, overlapping speakers and music all reduce accuracy. Run the file through Audio Noise Reducer first if the recording is noisy.

Can I get timestamps?

Yes — toggle the "Include timestamps" option to export segment-level start and end times alongside each line.

How big can the audio be?

Up to 200 MB. A 60-minute MP3 at 128 kbps fits comfortably under that ceiling.

How accurate is it for technical jargon?

Whisper does not know domain-specific vocabulary that was rare in its training data — niche pharmaceutical names, internal product codenames, obscure programming terms, etc. It will usually substitute the closest common word, which is easy to fix in the preview editor before you export. For repeated jargon you can do a Find & Replace on the exported .txt afterwards.
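
For repeated jargon, the Find & Replace pass can also be scripted. The sketch below is an assumption about how you might post-process an exported .txt yourself; it is not a feature of the tool. The glossary entries and the function names are hypothetical, and matching is case-insensitive on whole words or phrases.

```typescript
// Escape characters that have special meaning in a regular expression.
function escapeRegExp(s: string): string {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Replace each misrecognised phrase with its correct spelling.
// Keys are what the model wrote, values are what it should have written.
function applyGlossary(transcript: string, glossary: { [wrong: string]: string }): string {
  let result = transcript;
  for (const wrong in glossary) {
    if (!Object.prototype.hasOwnProperty.call(glossary, wrong)) continue;
    const correct = glossary[wrong];
    const pattern = new RegExp("\\b" + escapeRegExp(wrong) + "\\b", "gi");
    // A function replacer avoids "$" in the replacement being interpreted.
    result = result.replace(pattern, () => correct);
  }
  return result;
}

// Hypothetical example glossary.
const fixed = applyGlossary(
  "We deployed cooper netties on the Cooper Netties cluster.",
  { "cooper netties": "Kubernetes" }
);
```

Whole-word matching keeps a short entry such as "cat" from rewriting the inside of a longer word like "category".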

Can it tell speakers apart?

No. Whisper.cpp is a transcription engine, not a speaker-diarization engine. Every line is recognised but the transcript does not say who said which line. If you need speaker separation, you can manually annotate during the editing pass — that is still substantially faster than typing the entire recording from scratch.

Will the audio file be uploaded?

No. The file is read into your browser tab, decoded by the WebAudio API, and processed by whisper.cpp running locally. The transcript is generated in memory and offered as a direct download. Closing the tab clears every byte; nothing is logged on any server.

How long will my recording take to transcribe?

Roughly 0.2-0.4× the audio duration on a modern laptop, so a 30-minute recording finishes in 6-12 minutes. On phones the ratio climbs to 0.4-0.8× — expect 12-24 minutes for the same 30-minute clip. The first run also includes the one-time engine + model download (~39 MB combined).
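
The estimate is a simple multiplication of audio length by the quoted ratios. A minimal sketch, using only the ranges stated above; the function name and the device labels are hypothetical, and the one-time download is not included.

```typescript
type Device = "laptop" | "phone";

// Estimated processing time range in minutes, from the quoted ratios:
// 0.2-0.4x of the audio duration on a laptop, 0.4-0.8x on a phone.
function estimateTranscriptionMinutes(audioMinutes: number, device: Device): { min: number; max: number } {
  const ratio = device === "laptop" ? { low: 0.2, high: 0.4 } : { low: 0.4, high: 0.8 };
  return { min: audioMinutes * ratio.low, max: audioMinutes * ratio.high };
}

const onLaptop = estimateTranscriptionMinutes(30, "laptop"); // roughly 6 to 12 minutes
const onPhone = estimateTranscriptionMinutes(30, "phone");   // roughly 12 to 24 minutes
```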

What format should I record in for the best transcription quality?

Aim for 16 kHz or higher sample rate, mono, and at least 64 kbps if you must compress. Lossless formats like WAV and FLAC give the best accuracy because they have no MP3-style codec artefacts on consonants. If you are recording on a phone, the default Voice Memos format on iOS (M4A) and the default Recorder format on Android both work well.

Can it transcribe multiple files at once?

One at a time. The model holds the audio buffer plus the inference state in memory; running two transcriptions simultaneously would likely exhaust the tab's memory on most devices. Drop the next file in once the first transcript is downloaded. The model is already loaded, so each subsequent file skips the engine warm-up step.

How is this different from cloud transcription services?

The pipeline is functionally similar — both use a deep-learning speech model — but cloud services run inference on the provider’s servers, with their own pricing and retention policies. The AI Audio Transcriber runs the same family of model entirely on your device, with no per-minute charge and the audio never leaving the browser. The trade-off is that some cloud services use larger, slower-but-more-accurate models on dedicated GPUs.

Related tools

Compress Audio

Shrink any audio file to a smaller size by lowering the bitrate. Pick a target quality (96, 128, 192, 256, or 320 kbps) or output format (MP3, OGG, M4A) and the file is re-encoded right inside your browser using FFmpeg WebAssembly. Nothing is uploaded; your audio never leaves your device.

Convert Audio

Convert any audio file between MP3, WAV, OGG, FLAC, M4A, AAC, and Opus right in your browser. Pick the output format and (for lossy formats) the target bitrate. Everything runs locally with FFmpeg WebAssembly — your file is never uploaded and no account is required.

Audio Recorder

Record from your microphone directly in the browser. Pick quality (high, medium, low), toggle echo cancellation, noise suppression and auto-gain, then save to WebM/Opus or M4A/AAC. Audio is captured locally — nothing is uploaded.

Text to Speech

Type or paste text, pick a system voice, and listen instantly. Adjust speaking rate (0.5×–2×), pitch, and volume in real time. Uses your browser's built-in Web Speech API — no cloud TTS, no API keys, no costs.

Tone Generator

Generate a pure tone at any frequency from 20 Hz to 20 kHz. Pick a sine, square, triangle, or sawtooth waveform, choose duration, amplitude, and mono/stereo. Exports a 16-bit PCM WAV file at 44.1 kHz with built-in click-preventing fades.

Silence Generator

Generate a perfectly silent WAV file of any length from 1 second up to 1 hour. Pick mono or stereo, get a 16-bit PCM WAV at 44.1 kHz. Useful as padding between clips, intro silence, leader audio for video timing, or test material.

White Noise Generator

Generate white, pink, or brown noise as a 16-bit PCM WAV file. Pick noise type, duration up to 1 hour, amplitude, and mono/stereo. Useful for sleep, focus, masking distractions, audio testing, and as a backing layer for ambient music.

Metronome

A precise browser-based metronome powered by the Web Audio API. Set BPM from 30 to 300, choose a time signature, accent the first beat, and use tap-tempo to sync. Click timing is sample-accurate using lookahead scheduling — much steadier than typical JavaScript setInterval beats.
