OCR Text Recognition — Privacy

Privacy model, limits, and safety notes for OCR Text Recognition.

OCR Text Recognition privacy model

This tool is classified as a heavy workload and runs in On-device mode. Current status: beta (fully functional). Release note: OCR quality depends on scan quality, and the first run downloads language data.

What this does

  • Applies the selected transformation to the document or exported output.
  • Keeps processing local to the browser when the tool is marked On-device.
  • Uses monthly local counters for usage quotas.
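
The monthly local counters mentioned above could be kept along the following lines. This is a minimal sketch, not the tool's actual implementation: the key format, the names KeyValueStore, MemoryStore, monthKey, readUsage, and tryConsume, and the use of UTC month boundaries are all assumptions made for illustration.

```typescript
// Hypothetical sketch: the key format, function names, and quota value are
// assumptions for illustration, not the tool's actual implementation.

/** Subset of the Web Storage API; in a browser, window.localStorage fits. */
interface KeyValueStore {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
}

/** In-memory stand-in so the sketch also runs outside a browser. */
class MemoryStore implements KeyValueStore {
  private data: { [key: string]: string } = {};

  getItem(key: string): string | null {
    const value = this.data[key];
    return typeof value === "string" ? value : null;
  }

  setItem(key: string, value: string): void {
    this.data[key] = value;
  }
}

/** Builds a per-month key such as "ocr-usage:2024-03" (UTC month). */
function monthKey(tool: string, now: Date): string {
  const m = now.getUTCMonth() + 1;
  const month = m < 10 ? "0" + m : String(m);
  return tool + "-usage:" + now.getUTCFullYear() + "-" + month;
}

/** Reads this month's count; missing or corrupt values count as zero. */
function readUsage(store: KeyValueStore, tool: string, now: Date): number {
  const raw = store.getItem(monthKey(tool, now));
  const parsed = raw === null ? 0 : parseInt(raw, 10);
  return isNaN(parsed) || parsed < 0 ? 0 : parsed;
}

/** Increments the counter and returns true, or returns false at the limit. */
function tryConsume(
  store: KeyValueStore,
  tool: string,
  limit: number,
  now: Date = new Date()
): boolean {
  const used = readUsage(store, tool, now);
  if (used >= limit) {
    return false;
  }
  store.setItem(monthKey(tool, now), String(used + 1));
  return true;
}
```

Because the month is part of the key, a new month starts from zero without any reset step, and nothing has to leave the device to enforce the quota.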

What this does not protect

  • It does not remove names or sensitive content visible in document text or images.
  • It does not guarantee legal anonymity or protect against a compromised endpoint.
  • For hybrid tools, privacy depends on explicit cloud opt-in when enabled.
  • Tesseract.js runs in a Web Worker. Each page consumes roughly 50-100 MB of RAM during processing. Documents over 50 pages may cause memory pressure on devices with less than 4 GB of free memory.
  • Handwritten text is poorly supported. Tesseract is designed for printed text. Expect less than 30% accuracy on handwriting.
  • Multi-column layouts are partially supported. Tesseract reads left to right by default and may interleave text from adjacent columns on complex layouts.
  • Tables are not preserved structurally. Cell contents are extracted as text, but row/column relationships are lost.
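
The per-page memory figures above can be turned into a rough pre-flight check. The sketch below is only an arithmetic illustration. It assumes the worst case, in which the memory for every page adds up linearly; the notes above do not say whether memory is released between pages. The names estimateOcrMemory and mayCauseMemoryPressure are hypothetical.

```typescript
// Per-page range taken from the notes above. The linear worst-case model
// and all names here are assumptions for illustration.
const PER_PAGE_MB_LOW = 50;
const PER_PAGE_MB_HIGH = 100;

interface MemoryEstimate {
  lowMB: number;
  highMB: number;
}

/** Estimates total RAM for a document, assuming per-page costs add up. */
function estimateOcrMemory(pageCount: number): MemoryEstimate {
  const isWholeNumber =
    isFinite(pageCount) && pageCount >= 0 && Math.floor(pageCount) === pageCount;
  if (!isWholeNumber) {
    throw new RangeError("pageCount must be a non-negative integer");
  }
  return {
    lowMB: pageCount * PER_PAGE_MB_LOW,
    highMB: pageCount * PER_PAGE_MB_HIGH,
  };
}

/** True when the high end of the estimate exceeds the free memory given. */
function mayCauseMemoryPressure(pageCount: number, freeMB: number): boolean {
  return estimateOcrMemory(pageCount).highMB > freeMB;
}
```

For a 50-page document this gives 2,500 to 5,000 MB, a range that straddles the 4 GB figure above, which is consistent with the warning for documents over 50 pages.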

Safe workflow defaults

  • Verify output manually before sharing.
  • Use security guidance at /security for higher-risk scenarios.
  • Keep original and transformed files separated to avoid accidental leaks.