Question 1

How accurate is the OCR?

Accepted Answer

Accuracy on a clean 300 dpi scan of printed text is typically in the 95–99 % range for English and the major Latin-script European languages, and lower for handwriting, very small type, and complex scripts (Arabic, CJK). The engine is Google’s Tesseract LSTM model compiled to WebAssembly: the same open-source engine used by many free OCR tools, just running on your device instead of on a server.

Question 2

Does it actually run in my browser?

Accepted Answer

Yes. The Tesseract WebAssembly core is fetched once from our Cloudflare R2 CDN, cached in your browser via the Cache Storage API, and runs entirely inside your tab from then on. The PDF you upload is never sent anywhere, never written to a server, and never logged. Closing the tab clears it from memory.
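The fetch-once, then-serve-from-cache behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the site's actual loader: `ByteStore`, `MemoryStore`, `Downloader`, and `loadAsset` are hypothetical names, and the store is an in-memory stand-in so the sketch runs outside a browser. In a browser the same flow would presumably map onto the Cache Storage API (`caches.open`, then `match` and `put` on the returned cache).

```typescript
// Hypothetical stand-in for a persistent browser cache. Assumption: the real
// tool uses the Cache Storage API; this in-memory version only shows the flow.
interface ByteStore {
  get(url: string): Promise<Uint8Array | undefined>;
  set(url: string, bytes: Uint8Array): Promise<void>;
}

class MemoryStore implements ByteStore {
  private items: { [url: string]: Uint8Array } = {};

  async get(url: string): Promise<Uint8Array | undefined> {
    return this.items[url];
  }

  async set(url: string, bytes: Uint8Array): Promise<void> {
    this.items[url] = bytes;
  }
}

// Hypothetical download function; in a browser this would wrap fetch().
type Downloader = (url: string) => Promise<Uint8Array>;

// Cache-first lookup: return the cached copy if present, otherwise download
// once, store the bytes, and return them.
async function loadAsset(
  url: string,
  store: ByteStore,
  download: Downloader
): Promise<Uint8Array> {
  const cached = await store.get(url);
  if (cached !== undefined) {
    return cached;
  }
  const bytes = await download(url);
  await store.set(url, bytes);
  return bytes;
}
```

Calling `loadAsset` twice with the same URL and store triggers the download only on the first call, which is why later jobs start without a network round trip.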

Question 3

Why does the first run take a moment to start?

Accepted Answer

On the first OCR job ever (per device + browser), the Tesseract loader and WebAssembly module (~2.8 MB combined) and your chosen language model (~1–3 MB depending on language) need to download from our R2 bucket. They are cached after the first download, so every subsequent job in the same browser starts immediately. The output is generated using the open-source pdf-lib JavaScript library, also bundled with the site.

Question 4

What languages are supported?

Accepted Answer

English, Spanish, French, German, Portuguese, Italian, Dutch, Russian, Polish, Turkish, Arabic, Hindi, Japanese, Korean, and Chinese (Simplified). Each language has its own model, which downloads on demand the first time you pick it. For a multilingual scan, run the original file through once per language, choosing a different language each time.
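For reference, the languages listed above correspond to the standard Tesseract trained-data codes shown in this sketch. The codes themselves are Tesseract's own; the lookup table and the `languageCode` helper are hypothetical, and it is an assumption that the tool keys its model files by these codes.

```typescript
// Standard Tesseract language codes for the supported languages.
// Assumption: the tool names its per-language model files after these codes.
const TESSERACT_CODES: { [language: string]: string } = {
  "English": "eng",
  "Spanish": "spa",
  "French": "fra",
  "German": "deu",
  "Portuguese": "por",
  "Italian": "ita",
  "Dutch": "nld",
  "Russian": "rus",
  "Polish": "pol",
  "Turkish": "tur",
  "Arabic": "ara",
  "Hindi": "hin",
  "Japanese": "jpn",
  "Korean": "kor",
  "Chinese (Simplified)": "chi_sim",
};

// Hypothetical helper: resolve a display name to its Tesseract code,
// failing loudly for anything outside the supported list.
function languageCode(name: string): string {
  const code = TESSERACT_CODES[name];
  if (code === undefined) {
    throw new Error(`Unsupported language: ${name}`);
  }
  return code;
}
```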

Question 5

Will it re-OCR pages that already have text?

Accepted Answer

No. The tool inspects every page first and copies pages that already have a meaningful text layer through unchanged. Only pages that have no text — typical scans — are rasterised, OCR’d, and given a new invisible text layer. A 200-page born-digital PDF with three scanned pages in the middle finishes in seconds rather than minutes.
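The page-by-page triage described above might look something like the sketch below. The source does not say what counts as a "meaningful" text layer, so the 20-character threshold is purely an assumption, as are the `PageInfo` type and the function names.

```typescript
// Hypothetical per-page summary: the text already extractable from the page.
interface PageInfo {
  pageNumber: number;
  extractedText: string;
}

// Assumption: a page with fewer than this many non-whitespace characters is
// treated as a scan. The real tool's threshold is not documented.
const MIN_MEANINGFUL_CHARS = 20;

// A page needs OCR when its existing text layer is empty or negligible.
function needsOcr(page: PageInfo, minChars: number = MIN_MEANINGFUL_CHARS): boolean {
  const visible = page.extractedText.replace(/\s+/g, "");
  return visible.length < minChars;
}

// Return the page numbers to rasterise and OCR; every other page is copied
// through unchanged.
function pagesToOcr(pages: PageInfo[]): number[] {
  return pages.filter((p) => needsOcr(p)).map((p) => p.pageNumber);
}
```

With this kind of check, a mostly born-digital file sends only its few scanned pages through the slow path, which is where the time saving comes from.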

Question 6

Why does it sometimes mis-spell words?

Accepted Answer

Tesseract is a statistical recogniser trained on millions of pages of printed text in each language. It produces best-guess characters for every glyph it sees. On clean 300 dpi typed pages in the right language, accuracy is typically 95–99 %. On low-resolution scans, photos with uneven lighting, unusual fonts, or text rotated more than a few degrees, accuracy drops. The cure is almost always a better source: re-scan at higher DPI, take the photo more straight-on, increase contrast, or pick a different language model if one was wrong.
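To check whether a scan is near the 300 dpi mark mentioned above, you can divide the image width in pixels by the physical page width in inches. The helper names below are hypothetical and the sketch is only a back-of-the-envelope check, not something the tool exposes.

```typescript
// Effective resolution of a scanned page: pixels across divided by inches across.
function effectiveDpi(pixelWidth: number, pageWidthInches: number): number {
  if (pixelWidth <= 0 || pageWidthInches <= 0) {
    throw new Error("pixelWidth and pageWidthInches must be positive");
  }
  return pixelWidth / pageWidthInches;
}

// True when the scan falls short of the target resolution (300 dpi by default).
function isBelowTargetDpi(
  pixelWidth: number,
  pageWidthInches: number,
  targetDpi: number = 300
): boolean {
  return effectiveDpi(pixelWidth, pageWidthInches) < targetDpi;
}

// A US Letter page (8.5 in wide) scanned 2550 px across is 300 dpi;
// the same page at 1275 px across is only 150 dpi.
const letterAt2550 = effectiveDpi(2550, 8.5);
const letterAt1275 = effectiveDpi(1275, 8.5);
```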

Question 7

Does the OCR also fix the visual quality of my scan?

Accepted Answer

No, and that is intentional. The tool keeps the original page image exactly as you uploaded it (just with an invisible text layer added on top) so you do not lose the visual fidelity of the source. If you also want to clean up a scan — straighten it, increase contrast, remove a colour cast — run it through an image processor first, then OCR the cleaned-up PDF for best recognition results.

Question 8

How does this compare to a commercial PDF editor’s Recognize Text feature?

Accepted Answer

The pipeline is essentially identical: detect which pages need OCR, rasterise them, run the OCR engine (Tesseract here; a commercial PDF editor uses a different engine, but the result is comparable on clean scans), then add an invisible text layer with the recognised words. A commercial PDF editor’s engine has a slight edge on extreme edge cases (very low-resolution scans, heavy handwriting) and offers built-in OCR auto-rotation; this tool is free, runs entirely on your device, and does not require any subscription, account, or upload.
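One detail of the final step, placing recognised words as an invisible text layer, is a coordinate conversion: OCR word boxes are measured in image pixels from the top-left corner, while PDF pages are measured in points (1/72 inch) from the bottom-left. The sketch below shows that conversion under stated assumptions: the page is unrotated, its origin is at (0, 0), and it was rasterised at a known DPI. The type and function names are hypothetical.

```typescript
// Word box as reported in image space: pixels, origin at the top-left.
interface PixelBox {
  x0: number; // left
  y0: number; // top
  x1: number; // right
  y1: number; // bottom
}

// Rectangle in PDF user space: points, origin at the bottom-left.
interface PdfRect {
  x: number;
  y: number;
  width: number;
  height: number;
}

// Convert a pixel box to PDF points. Assumes an unrotated page whose
// coordinate origin is (0, 0) and which was rendered at renderDpi.
function pixelBoxToPdfRect(
  box: PixelBox,
  renderDpi: number,
  pageHeightPt: number
): PdfRect {
  const toPoints = (px: number): number => (px * 72) / renderDpi;
  return {
    x: toPoints(box.x0),
    // Flip the vertical axis: the box's bottom edge becomes the rect's y.
    y: pageHeightPt - toPoints(box.y1),
    width: toPoints(box.x1 - box.x0),
    height: toPoints(box.y1 - box.y0),
  };
}
```

For example, on a US Letter page (792 pt tall) rendered at 300 dpi, a word box starting 300 px from the left edge lands 72 pt (one inch) from the left edge of the PDF page.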

Question 9

Can I OCR a multilingual document?

Accepted Answer

Yes. Pick the dominant language for the first pass. Tesseract is forgiving: an English-trained model will still recognise a few Spanish words inside an English document with reasonable accuracy. For documents that are genuinely half-and-half, run the original file through twice, once with each language, and merge the results manually. A future version will support combining multiple languages in a single pass.

Question 10

Will it run on my phone?

Accepted Answer

Yes, on any modern smartphone browser. OCR is CPU-bound and noticeably slower on phones than on a laptop — expect roughly 3–8 seconds per page on a recent iPhone or flagship Android, versus under a second per page on a desktop. The first job per language takes longer because the WebAssembly engine and language model need to download and warm up; subsequent jobs are quick because both are cached on the device.

Question 11

Why is the first OCR run slow?

Accepted Answer

On the first OCR ever (per device + browser), three things have to happen before recognition can start: the Tesseract loader and WebAssembly module (~2.8 MB combined) download from our Cloudflare R2 CDN, your chosen language model (~1–3 MB) downloads from the same bucket, and Tesseract spends a second or two initialising the in-memory page recogniser. After this first run, all three are cached in your browser, so every later job in the same browser starts almost instantly.
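As a rough sanity check on the sizes quoted above, the first-run download comes to about 3.8–5.8 MB in total (2.8 MB of engine plus 1–3 MB of language model). The sketch below turns that into a download-time estimate for a given connection speed. The function names are hypothetical, and the estimate ignores latency, CDN behaviour, and the one-to-two-second initialisation step.

```typescript
// Sizes quoted in the answer above, in megabytes.
const ENGINE_MB = 2.8;
const MODEL_MB_MIN = 1;
const MODEL_MB_MAX = 3;

// Transfer time in seconds for sizeMb megabytes at bandwidthMbps megabits
// per second (8 bits per byte). Ignores latency and protocol overhead.
function downloadSeconds(sizeMb: number, bandwidthMbps: number): number {
  if (bandwidthMbps <= 0) {
    throw new Error("bandwidthMbps must be positive");
  }
  return (sizeMb * 8) / bandwidthMbps;
}

// Best-case and worst-case first-run download time across language models.
function firstRunDownloadRange(
  bandwidthMbps: number
): { minSeconds: number; maxSeconds: number } {
  return {
    minSeconds: downloadSeconds(ENGINE_MB + MODEL_MB_MIN, bandwidthMbps),
    maxSeconds: downloadSeconds(ENGINE_MB + MODEL_MB_MAX, bandwidthMbps),
  };
}
```

On a 20 Mbps connection this works out to roughly 1.5 to 2.3 seconds of download before initialisation begins.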

Question 12

Can I process a 200-page scanned book?

Accepted Answer

Yes. Processing time scales linearly with page count, and there is no per-document limit other than the 200 MB file-size cap. A 200-page scan typically takes 2–5 minutes on a desktop and 10–25 minutes on a phone. Browsers generally keep the tab running in the background, so you can switch to other tabs and come back. If your device runs out of memory on very long jobs, split the PDF into chunks first and OCR them separately, then merge the searchable outputs.
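If you do need to split a long scan, the page ranges and a rough time budget can be worked out as in the sketch below. The helper names and the 50-page chunk size are hypothetical choices. The per-page timings are simply derived from the figures above (2–5 minutes for 200 pages is about 0.6–1.5 seconds per page on a desktop).

```typescript
// Inclusive, 1-based page range for one chunk.
interface PageRange {
  first: number;
  last: number;
}

// Split totalPages into consecutive chunks of at most chunkSize pages.
function splitIntoChunks(totalPages: number, chunkSize: number): PageRange[] {
  if (totalPages < 1 || chunkSize < 1) {
    throw new Error("totalPages and chunkSize must be at least 1");
  }
  const ranges: PageRange[] = [];
  for (let first = 1; first <= totalPages; first += chunkSize) {
    ranges.push({ first, last: Math.min(first + chunkSize - 1, totalPages) });
  }
  return ranges;
}

// Linear time estimate: pages multiplied by seconds per page, in minutes.
function estimateMinutes(pages: number, secondsPerPage: number): number {
  return (pages * secondsPerPage) / 60;
}

// A 200-page book in 50-page chunks gives four ranges:
// 1-50, 51-100, 101-150, 151-200.
const bookChunks = splitIntoChunks(200, 50);
```

Because the time estimate is linear, OCR'ing the four chunks separately should take about as long in total as one 200-page run; the benefit of splitting is lower peak memory use, not speed.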

Question 13

Is my file really never uploaded?

Accepted Answer

Correct. The file is read into memory by JavaScript running in your tab, the OCR engine and language model run inside that same tab via WebAssembly, and the resulting PDF is offered as a direct download. No upload to a server, no API call to a third party, no log line containing your file. Closing the tab clears every byte from memory. The Tesseract core and language model files do come from our CDN on first run, but they are open-source data files identical for every user — they reveal nothing about your specific document.

OCR PDF — Make Scanned PDFs Searchable Online

Related tools

About OCR PDF (Make Scanned PDF Searchable)

How it works

Common use cases

FAQ
