OCR PDF — Make Scanned PDFs Searchable Online
Make a scanned PDF searchable. Each page is recognised with Tesseract OCR (compiled to WebAssembly) and an invisible text layer is overlaid so every word is selectable, copyable, and findable in any PDF viewer.
About OCR PDF (Make Scanned PDF Searchable)
OCR PDF turns a scanned, photographed, or otherwise image-based PDF into a searchable document where every word can be selected, copied, found with Cmd/Ctrl-F, and read aloud by a screen reader. Recognition is done by the open-source Tesseract engine, long sponsored by Google, compiled to WebAssembly and running on your device rather than on a server. The PDF you select is never sent to a server, never written to disk, never logged, and is released from memory when you close the tab.
Architecturally the tool is split into three stages. First, every page of the source PDF is inspected with the bundled pdfjs-dist library to see whether it already has a meaningful text layer. Pages that do — the typical case for born-digital PDFs exported from Word, Pages, Google Docs, or any modern reporting tool — are copied through unchanged so they keep their original vector quality and selectable text. Second, pages that lack a text layer (typical scans, photographs of receipts, image-only exports) are rasterised at roughly twice their native resolution, fed to Tesseract along with the user’s chosen language model, and returned as a list of recognised words plus pixel bounding boxes. Third, the open-source pdf-lib JavaScript library rebuilds the output PDF: each recognised word is painted on top of the rasterised page image with opacity zero, so visually nothing changes, but the underlying PDF text content stream now contains every word at the right location.
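The coordinate handoff between the second and third stages can be sketched as follows. This is a minimal illustration rather than the tool's actual code: it assumes the OCR step reports boxes in image pixels with a top-left origin, that the page was rasterised at a known `scale` (2 for the roughly 2× case described above), and that PDF space uses points with a bottom-left origin. The names `PixelBox`, `PdfPlacement`, and `pixelBoxToPdf` are hypothetical.

```typescript
// Pixel-space box as an OCR engine typically reports it: origin at the
// top-left of the rasterised image, y growing downward.
interface PixelBox { x0: number; y0: number; x1: number; y1: number; }

// PDF-space placement: origin at the bottom-left of the page, units in points.
interface PdfPlacement { x: number; y: number; width: number; height: number; }

// Convert one word's pixel box into PDF coordinates.
// `scale` is pixels per point used when rasterising (assumed 2 here).
function pixelBoxToPdf(box: PixelBox, pageHeightPt: number, scale: number): PdfPlacement {
  if (scale <= 0) throw new Error("scale must be positive");
  return {
    x: box.x0 / scale,
    // Flip the y axis: the bottom edge of the box (y1) becomes the baseline-ish y.
    y: pageHeightPt - box.y1 / scale,
    width: (box.x1 - box.x0) / scale,
    height: (box.y1 - box.y0) / scale,
  };
}
```

The resulting `x`, `y`, and `height` could then feed a text-drawing call such as pdf-lib's `page.drawText(text, { x, y, size, opacity: 0 })`. Using the box height as the font size is itself an approximation.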
Selectivity matters because OCR is the slow stage. A 200-page born-digital PDF with three scanned pages stuck in the middle finishes in seconds rather than minutes, because Tesseract runs on those three pages only instead of grinding through all 200. The unchanged pages keep their original vector fidelity instead of being rasterised down to a finite-DPI image. This is the same smart-skip strategy that the "Recognize Text" feature in commercial PDF editors uses; brute-force "rasterize then OCR everything" tools end up wasting compute and degrading already-perfect pages.
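The skip decision can be pictured as a simple filter over per-page text counts. The sketch below is an assumption-laden illustration: the tool's real heuristic for a "meaningful" text layer is not documented here, so the `MIN_TEXT_CHARS` threshold, the `PageInfo` shape, and the function name are all hypothetical. In a browser, the character count could come from summing the `str` lengths returned by pdfjs-dist's `page.getTextContent()`.

```typescript
// Summary of one page after inspection.
interface PageInfo { pageNumber: number; textCharCount: number; }

// Hypothetical threshold: pages with fewer extracted characters than this
// are treated as image-only. The real cutoff used by the tool is unknown.
const MIN_TEXT_CHARS = 20;

// Return the page numbers that should be rasterised and sent to OCR.
// All other pages would be copied through unchanged.
function pagesNeedingOcr(pages: PageInfo[], minChars: number = MIN_TEXT_CHARS): number[] {
  return pages
    .filter((p) => p.textCharCount < minChars)
    .map((p) => p.pageNumber);
}
```

For the 200-page example above, only the three scanned pages would come back from this function, so the slow stage runs three times instead of 200.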
Fifteen languages ship in the first release: English, Spanish, French, German, Portuguese, Italian, Dutch, Russian, Polish, Turkish, Arabic, Hindi, Japanese, Korean, and Simplified Chinese. Each language has its own statistical model that downloads on demand the first time you pick it, then caches in your browser so the next OCR job in that language starts immediately. Picking the right language matters more than people expect — Tesseract’s LSTM model is trained on per-language character shapes and word patterns, so an English-trained recogniser given Spanish text will mis-recognise accents and produce noticeably worse output than the Spanish model would.
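One way to picture the language picker is as a lookup from display name to Tesseract's standard language code, which then selects the model file to download. The codes below are the usual Tesseract traineddata identifiers; the base URL, the `.traineddata` file naming on the server, and the `modelUrl` helper are assumptions for illustration only.

```typescript
// Display name -> standard Tesseract language code.
const TESSERACT_CODES: { [name: string]: string } = {
  English: "eng", Spanish: "spa", French: "fra", German: "deu",
  Portuguese: "por", Italian: "ita", Dutch: "nld", Russian: "rus",
  Polish: "pol", Turkish: "tur", Arabic: "ara", Hindi: "hin",
  Japanese: "jpn", Korean: "kor", "Simplified Chinese": "chi_sim",
};

// Build the download URL for a language model.
// `baseUrl` is a placeholder; the real bucket layout is not documented here.
function modelUrl(language: string, baseUrl: string): string {
  if (!Object.prototype.hasOwnProperty.call(TESSERACT_CODES, language)) {
    throw new Error(`Unsupported language: ${language}`);
  }
  const code = TESSERACT_CODES[language];
  // Strip trailing slashes so the join never produces "//".
  return `${baseUrl.replace(/\/+$/, "")}/${code}.traineddata`;
}
```

Because each language maps to exactly one file, picking Spanish for a Spanish scan fetches only `spa.traineddata` and leaves the other fourteen models untouched.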
Accuracy on a clean 300 dpi scan in a printed Latin-script font runs in the 95–99 % range. Smartphone photos of paperwork — uneven lighting, slight angle, mixed fonts — typically come in around 88–95 %. Handwriting, very small print, low-contrast scans, and complex scripts (Arabic ligatures, dense CJK) sit lower. The output is always a normal PDF that opens identically in commercial PDF editors, Apple Preview, standard PDF readers, browser viewers, Kindle, and every other PDF reader; the only difference from the input is that the words are now indexed and selectable.
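The page does not say how these percentages are measured. A common convention, assumed here, is character-level accuracy: one minus the edit distance between the true text and the OCR output, divided by the length of the true text. The sketch below implements that definition; `editDistance` and `characterAccuracy` are illustrative names.

```typescript
// Levenshtein distance: minimum number of single-character insertions,
// deletions, and substitutions needed to turn `a` into `b`.
function editDistance(a: string, b: string): number {
  let prev: number[] = [];
  for (let j = 0; j <= b.length; j++) prev.push(j);
  for (let i = 1; i <= a.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a.charAt(i - 1) === b.charAt(j - 1) ? 0 : 1;
      curr.push(Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost));
    }
    prev = curr;
  }
  return prev[b.length];
}

// Character accuracy in the range 0..1, relative to the reference text.
function characterAccuracy(reference: string, recognised: string): number {
  if (reference.length === 0) return recognised.length === 0 ? 1 : 0;
  const errors = editDistance(reference, recognised);
  return Math.max(0, 1 - errors / reference.length);
}
```

Under this definition, 95–99 % on a page of 2,000 characters corresponds to roughly 20 to 100 wrong characters.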
Engine and model files are hosted on the same Cloudflare R2 bucket that serves our FFmpeg WebAssembly core. The Tesseract loader and WebAssembly module (~2.8 MB combined) are fetched from R2 on the first OCR run, the loader is cached via the Cache Storage API, and the WebAssembly module is reused from the browser’s HTTP cache on every subsequent run. A user who only ever OCRs English documents will only ever download the English model (~3 MB); a multilingual user will accumulate models in the cache as they pick new languages. We ship the integer-quantised "fast" Tesseract LSTM models, which are 3–5× smaller than the float "best" models with virtually identical accuracy on printed pages, so a single language download is typically 1–3 MB rather than the 8–15 MB the float models would cost. The same code runs in every modern browser that supports WebAssembly, with no third-party CDN fetches at runtime.
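The download-once behaviour is a cache-first lookup. The sketch below shows that pattern with the storage and network layers injected, so it runs outside a browser; it is not the site's actual loader. `ByteStore`, `Downloader`, and `loadCached` are hypothetical names. In a browser, a `ByteStore` could be backed by the Cache Storage API (`caches.open`, `cache.match`, `cache.put`).

```typescript
// Minimal async key/value store for downloaded files.
interface ByteStore {
  get(key: string): Promise<Uint8Array | undefined>;
  put(key: string, bytes: Uint8Array): Promise<void>;
}

// Fetches the bytes at a URL; injected so the sketch has no network dependency.
type Downloader = (url: string) => Promise<Uint8Array>;

// Cache-first load: return the stored copy if present, otherwise download
// once, store the result, and return it.
async function loadCached(
  url: string,
  store: ByteStore,
  download: Downloader
): Promise<Uint8Array> {
  const hit = await store.get(url);
  if (hit !== undefined) return hit; // cached by an earlier run
  const bytes = await download(url); // first run: go to the network
  await store.put(url, bytes);
  return bytes;
}
```

Keying the store by URL means each language model is cached independently, which matches the behaviour described above: an English-only user never pays for any other model.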
How it works
1. Drop a PDF onto the upload area. The tool accepts files up to 200 MB, anything from a one-page receipt to a several-hundred-page archived report.
2. Pick the language printed in your PDF from the dropdown. Each language has its own model that downloads once and caches for next time.
3. Hit Process. Pages that already have a text layer are copied through unchanged; only the scanned or image-only pages run through Tesseract.
4. A live progress bar shows the stage (engine load, language model load, recognising page X of N, saving) so you can tell the difference between "downloading the model" and "OCR is grinding".
5. Download the searchable PDF. Every word in the previously scan-only pages is now selectable, copyable, and findable with Cmd/Ctrl-F in any PDF viewer.
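The four progress stages named in the steps above map naturally onto a small tagged union. The sketch below is illustrative only: the `OcrProgress` type and the label wording are hypothetical, not the tool's actual UI strings.

```typescript
// One variant per stage the progress bar reports.
type OcrProgress =
  | { stage: "engine" }
  | { stage: "model"; language: string }
  | { stage: "recognising"; page: number; total: number }
  | { stage: "saving" };

// Turn a progress event into the text shown next to the bar.
function progressLabel(p: OcrProgress): string {
  switch (p.stage) {
    case "engine":
      return "Loading OCR engine";
    case "model":
      return `Loading ${p.language} language model`;
    case "recognising":
      return `Recognising page ${p.page} of ${p.total}`;
    case "saving":
      return "Saving searchable PDF";
  }
}
```

Keeping download stages separate from the recognition stage is what lets a user see whether a slow start is network time or OCR time.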
Common use cases
- Make a stack of scanned legal contracts searchable so paralegals can grep across them without re-typing
- Turn smartphone photos of business cards or receipts into copyable text for an expense report
- OCR archived university lecture handouts so students can search for keywords and topics
- Extract quotable text from a magazine page that was scanned and saved as a PDF
- Make a scanned medical record searchable for clinicians who need to find specific test results fast
- Prepare a vintage book scan for full-text indexing in a personal digital library
FAQ
How accurate is the OCR?
Accuracy on a clean 300 dpi scan in a printed font is in the 95–99 % range for English and the major Latin-script European languages, and lower for handwriting, very small type, and complex scripts (Arabic, CJK). The engine is the open-source Tesseract LSTM recogniser compiled to WebAssembly, running on your device rather than on a server.
Does it actually run in my browser?
Yes. The Tesseract WebAssembly core is fetched once from our Cloudflare R2 CDN, cached in your browser via the Cache Storage API, and runs entirely inside your tab from then on. The PDF you upload is never sent anywhere, never written to a server, and never logged. Closing the tab clears it from memory.
Why does the first run take a moment to start?
On the first OCR job ever (per device + browser), the Tesseract loader and WebAssembly module (~2.8 MB combined) and your chosen language model (~1–3 MB depending on language) need to download from our R2 bucket. They are cached after the first download, so every subsequent job in the same browser starts immediately. The open-source pdf-lib JavaScript library that writes the output PDF is bundled with the site, so it needs no separate download.
What languages are supported?
English, Spanish, French, German, Portuguese, Italian, Dutch, Russian, Polish, Turkish, Arabic, Hindi, Japanese, Korean, and Chinese (Simplified). Each language has its own model that downloads on demand the first time you pick it. For a multilingual scan, run the original file once per language and merge the results manually.
Will it re-OCR pages that already have text?
No. The tool inspects every page first and copies pages that already have a meaningful text layer through unchanged. Only pages that have no text — typical scans — are rasterised, OCR’d, and given a new invisible text layer. A 200-page born-digital PDF with three scanned pages in the middle finishes in seconds rather than minutes.
Why does it sometimes mis-spell words?
Tesseract is a statistical recogniser trained on millions of pages of printed text in each language. It produces best-guess characters for every glyph it sees. On clean 300 dpi typed pages in the right language, accuracy is typically 95–99 %. On low-resolution scans, photos with uneven lighting, unusual fonts, or text rotated more than a few degrees, accuracy drops. The cure is almost always a better source: re-scan at higher DPI, take the photo more straight-on, increase contrast, or pick a different language model if one was wrong.
Does the OCR also fix the visual quality of my scan?
No, and that is intentional. The tool keeps the original page image exactly as you uploaded it (just with an invisible text layer added on top) so you do not lose the visual fidelity of the source. If you also want to clean up a scan — straighten it, increase contrast, remove a colour cast — run it through an image processor first, then OCR the cleaned-up PDF for best recognition results.
How does this compare to a commercial PDF editor’s Recognize Text feature?
The pipeline is essentially the same: detect which pages need OCR, rasterise them, run the recognition engine, then add an invisible text layer with the recognised words. This tool uses Tesseract; a commercial PDF editor uses a different engine, but the result is comparable on clean scans. The commercial engine has a slight edge on hard cases (very low-resolution scans, heavy handwriting) and offers built-in auto-rotation. This tool is free, runs entirely on your device, and does not require any subscription, account, or upload.
Can I OCR a multilingual document?
Pick the dominant language for the first pass. Tesseract is forgiving: an English-trained model will still recognise a few Spanish words inside an English document with reasonable accuracy. For documents that are genuinely half-and-half, run the file through twice with each language and merge the results manually — a future version will support combining multiple languages in a single pass.
Will it run on my phone?
Yes, on any modern smartphone browser. OCR is CPU-bound and noticeably slower on phones than on a laptop: expect roughly 3–8 seconds per page on a recent iPhone or flagship Android, versus about a second per page on a desktop. The first job per language takes longer because the WebAssembly engine and language model need to download and warm up; subsequent jobs are quick because both are cached on the device.
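Per-page figures turn into a whole-job estimate by multiplication. The sketch below is a back-of-envelope helper, assuming OCR time dominates and scales linearly with the number of pages that actually need recognition; `SecondsRange` and `estimateMinutes` are hypothetical names.

```typescript
// A low/high pair, used for both seconds-per-page and total minutes.
interface SecondsRange { low: number; high: number; }

// Estimate total OCR time in minutes for the pages that need recognition.
// Pages copied through unchanged are assumed to cost nothing.
function estimateMinutes(pagesToOcr: number, secondsPerPage: SecondsRange): SecondsRange {
  if (pagesToOcr < 0) throw new Error("pagesToOcr must not be negative");
  return {
    low: (pagesToOcr * secondsPerPage.low) / 60,
    high: (pagesToOcr * secondsPerPage.high) / 60,
  };
}
```

With the phone figure of 3–8 seconds per page, `estimateMinutes(200, { low: 3, high: 8 })` gives a range of 10 to about 27 minutes for a fully scanned 200-page file.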
Why is the first OCR run slow?
On the first OCR ever (per device + browser), three things have to happen before recognition can start: the Tesseract loader and WebAssembly module (~2.8 MB combined) download from our Cloudflare R2 CDN, your chosen language model (~1–3 MB) downloads from the same bucket, and Tesseract spends a second or two initialising the in-memory page recogniser. After this first run, all three are cached in your browser, so every later job in the same browser starts almost instantly.
Can I process a 200-page scanned book?
Yes. The tool scales linearly with page count: there is no per-document limit other than the 200 MB upload cap. A 200-page scan typically takes 2–5 minutes on a desktop and 10–25 minutes on a phone. Desktop browsers generally keep the tab running in the background, so you can switch to other tabs and come back; mobile browsers may pause background tabs, so keep the page in the foreground on a phone. If your device runs out of memory on very long jobs, split the PDF into chunks first and OCR them separately, then merge the searchable outputs.
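If a long job does need splitting, the page ranges for each chunk are easy to compute up front. The sketch below only plans the ranges and does not touch any PDF; the chunk size is a free choice, and `PageChunk` and `chunkPages` are hypothetical names.

```typescript
// Inclusive, 1-based page range for one chunk.
interface PageChunk { firstPage: number; lastPage: number; }

// Split `totalPages` into consecutive chunks of at most `chunkSize` pages.
function chunkPages(totalPages: number, chunkSize: number): PageChunk[] {
  if (!Number.isInteger(totalPages) || totalPages < 0) {
    throw new Error("totalPages must be a non-negative integer");
  }
  if (!Number.isInteger(chunkSize) || chunkSize < 1) {
    throw new Error("chunkSize must be a positive integer");
  }
  const chunks: PageChunk[] = [];
  for (let first = 1; first <= totalPages; first += chunkSize) {
    chunks.push({
      firstPage: first,
      lastPage: Math.min(first + chunkSize - 1, totalPages),
    });
  }
  return chunks;
}
```

For a 200-page book split into chunks of 60, this yields four ranges: 1–60, 61–120, 121–180, and 181–200. Each range can be OCR'd as its own file and the outputs merged in order.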
Is it really uploaded nowhere?
Correct. The file is read into memory by JavaScript running in your tab, the OCR engine and language model run inside that same tab via WebAssembly, and the resulting PDF is offered as a direct download. No upload to a server, no API call to a third party, no log line containing your file. Closing the tab clears every byte from memory. The Tesseract core and language model files do come from our CDN on first run, but they are open-source data files identical for every user — they reveal nothing about your specific document.