OCR PDF API: When You Need It and When You Don't
A practical guide to PDF OCR: how to check if a PDF actually needs OCR, Tesseract vs cloud APIs, and when you should skip OCR entirely by generating PDFs with real text layers.
A developer on my team spent two days integrating a cloud OCR API into our document pipeline. Processing cost: about $200/month. Then I checked the actual PDFs flowing through the system. 94% of them were born-digital — they already had a text layer. We were paying to "recognize" text that was already there.
This happens constantly. OCR has become the default answer to "I need to extract text from PDFs," but most PDFs don't need it. Here's how to tell the difference, what to use when you actually need OCR, and how to avoid the problem entirely.
Native text vs. scanned PDFs: how to tell
A PDF with native text has the actual character data embedded. You can select text, copy it, search it. The text is stored as Unicode characters with positioning instructions.
A scanned PDF is a stack of images. Every "page" is a raster image (usually JPEG or CCITT fax) wrapped in a PDF container. There's no text data — just pixels. To get text out, you need OCR.
Then there's the hybrid: a scanned PDF that's already been OCR'd. It has an image layer (what you see) and an invisible text layer (what you can search). These don't need OCR again, but they look like scans at first glance.
Here's how to check programmatically:
```python
import fitz  # pip install pymupdf

def check_pdf_type(pdf_path: str) -> dict:
    """Determine if a PDF has native text, is a scan, or needs OCR."""
    doc = fitz.open(pdf_path)
    total_pages = len(doc)
    pages_with_text = 0
    pages_with_images = 0
    total_text_chars = 0

    for page in doc:
        text = page.get_text("text").strip()
        if len(text) > 50:
            pages_with_text += 1
            total_text_chars += len(text)
        if page.get_images(full=True):
            # Image-bearing pages are likely scans (or scans with an OCR text layer)
            pages_with_images += 1

    doc.close()

    text_ratio = pages_with_text / total_pages if total_pages > 0 else 0
    avg_chars = total_text_chars / total_pages if total_pages > 0 else 0

    if text_ratio > 0.8 and avg_chars > 200:
        pdf_type, needs_ocr = "native_text", False
    elif text_ratio > 0.3 and pages_with_images > total_pages * 0.5:
        pdf_type, needs_ocr = "hybrid_ocrd", False
    else:
        pdf_type, needs_ocr = "scanned", True

    return {
        "path": pdf_path,
        "type": pdf_type,
        "needs_ocr": needs_ocr,
        "total_pages": total_pages,
        "pages_with_text": pages_with_text,
        "pages_with_images": pages_with_images,
        "avg_chars_per_page": round(avg_chars),
    }

result = check_pdf_type("mystery_document.pdf")
print(f"Type: {result['type']}")
print(f"Needs OCR: {result['needs_ocr']}")
print(f"Avg chars/page: {result['avg_chars_per_page']}")
```
I run this check on every PDF before deciding whether to OCR. On a typical business document collection:
- ~70% are native text (Word/Google Docs exports, HTML-to-PDF)
- ~15% are scans that have already been OCR'd
- ~15% actually need OCR
That means roughly 85% of PDFs can be processed with simple text extraction. No OCR API, no compute cost, no waiting.
When you actually need OCR
You need OCR when:
- The PDF is a scan with no text layer (avg_chars_per_page near zero)
- The PDF was exported from a design tool that rasterizes text (Canva, some Figma exports)
- You're processing photos of documents (receipts, whiteboards, handwritten notes)
You don't need OCR when:
- The PDF was exported from Word, Excel, Google Docs, or any office suite
- The PDF was generated by an HTML-to-PDF tool
- The PDF already has a text layer (even if it's a scan with OCR applied)
- The text is selectable when you open it in a PDF viewer
Tesseract: free, local, good enough
Tesseract is the open-source OCR engine originally developed at HP in the 1980s, later sponsored by Google, and now community-maintained. After v4's major rewrite around an LSTM neural-network engine, its accuracy on printed text is solid.
Setup
```bash
# macOS
brew install tesseract

# Ubuntu/Debian
sudo apt install tesseract-ocr tesseract-ocr-eng

# Check version (want 5.x)
tesseract --version
```
Basic usage with pytesseract
```python
import fitz
import pytesseract
from PIL import Image
import io

def ocr_pdf_tesseract(pdf_path: str, dpi: int = 300) -> list[dict]:
    """OCR a scanned PDF using Tesseract."""
    doc = fitz.open(pdf_path)
    results = []

    for page_num, page in enumerate(doc):
        # Render the page to an image at the target DPI (PDF points are 72/inch)
        mat = fitz.Matrix(dpi / 72, dpi / 72)
        pix = page.get_pixmap(matrix=mat)
        img = Image.open(io.BytesIO(pix.tobytes("png")))

        # Run OCR
        text = pytesseract.image_to_string(img, lang="eng")
        results.append({
            "page": page_num + 1,
            "text": text.strip(),
            # Note: this runs Tesseract a second time on the same image
            "confidence": _get_confidence(img),
        })

    doc.close()
    return results

def _get_confidence(img: Image.Image) -> float:
    """Get average OCR confidence for an image."""
    data = pytesseract.image_to_data(img, output_type=pytesseract.Output.DICT)
    confidences = [int(c) for c in data["conf"] if int(c) > 0]
    return sum(confidences) / len(confidences) if confidences else 0.0

pages = ocr_pdf_tesseract("scanned_contract.pdf")
for p in pages:
    print(f"Page {p['page']}: {len(p['text'])} chars, confidence: {p['confidence']:.1f}%")
    print(p["text"][:200])
    print("---")
```
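The confidence score is worth acting on: rather than trusting every page blindly, I route low-confidence pages to manual review. A minimal sketch of that routing (the 80% threshold and the trusted/review split are my own choices, not anything Tesseract prescribes):

```python
def route_by_confidence(pages: list[dict], min_confidence: float = 80.0) -> dict:
    """Split OCR results into trusted pages and pages flagged for manual review."""
    trusted = [p for p in pages if p["confidence"] >= min_confidence]
    review = [p for p in pages if p["confidence"] < min_confidence]
    return {"trusted": trusted, "review": review}

# Hypothetical results from ocr_pdf_tesseract:
routed = route_by_confidence([
    {"page": 1, "confidence": 93.2},
    {"page": 2, "confidence": 61.5},
])
print(len(routed["trusted"]), len(routed["review"]))  # 1 1
```

The right threshold depends on your documents; clean office scans can safely use a higher bar than fax copies.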
Preprocessing for better accuracy
Raw scans often produce mediocre OCR results. A few preprocessing steps can push accuracy from 85% to 95%+:
```python
import cv2
import numpy as np
from PIL import Image

def preprocess_for_ocr(img: Image.Image) -> Image.Image:
    """Clean up a scanned image before OCR."""
    # Convert to OpenCV format
    img_array = np.array(img)

    # Convert to grayscale
    if len(img_array.shape) == 3:
        gray = cv2.cvtColor(img_array, cv2.COLOR_RGB2GRAY)
    else:
        gray = img_array

    # Denoise
    denoised = cv2.fastNlMeansDenoising(gray, h=10)

    # Adaptive threshold (handles uneven lighting from scans)
    binary = cv2.adaptiveThreshold(
        denoised, 255,
        cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
        cv2.THRESH_BINARY,
        blockSize=11,
        C=2,
    )

    # Deskew (fix slight rotation from scanning)
    coords = np.column_stack(np.where(binary < 128)).astype(np.float32)
    if len(coords) > 100:
        angle = cv2.minAreaRect(coords)[-1]
        # minAreaRect angles are in [-90, 0) before OpenCV 4.5 and [0, 90) after;
        # normalize both conventions to a small correction angle
        if angle < -45:
            angle = 90 + angle
        elif angle > 45:
            angle = angle - 90
        if abs(angle) > 0.5:
            h, w = binary.shape
            center = (w // 2, h // 2)
            M = cv2.getRotationMatrix2D(center, angle, 1.0)
            binary = cv2.warpAffine(
                binary, M, (w, h),
                flags=cv2.INTER_CUBIC,
                borderMode=cv2.BORDER_REPLICATE,
            )

    return Image.fromarray(binary)
```
Then use it in the pipeline:
```python
# In ocr_pdf_tesseract, before pytesseract.image_to_string:
img = preprocess_for_ocr(img)
text = pytesseract.image_to_string(img, lang="eng", config="--psm 6")
```
The --psm 6 flag tells Tesseract to assume a uniform block of text, which works better for most document pages than the default.
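For reference, the segmentation modes I reach for most often (descriptions paraphrased from Tesseract's own help output):

```text
--psm  3   Fully automatic page segmentation (the default)
--psm  4   Single column of text of variable sizes
--psm  6   Single uniform block of text
--psm 11   Sparse text: find as much text as possible, in no particular order
```

If a page mixes columns, captions, and tables, it's worth trying a couple of modes and comparing the output before settling on one.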
Tesseract accuracy on my test set of 200 scanned pages:
| Document type | Raw accuracy | With preprocessing |
|---|---|---|
| Clean office scans (300dpi) | 94% | 97% |
| Phone photos of documents | 78% | 88% |
| Old fax copies | 65% | 79% |
| Handwritten text | 30% | 35% |
Tesseract doesn't do handwriting. If you need that, you need a cloud API.
Cloud OCR APIs: when Tesseract isn't enough
Google Cloud Vision
Best accuracy I've tested, especially for multi-language documents and poor-quality scans.
```python
from google.cloud import vision

def ocr_google_vision(pdf_path: str) -> list[dict]:
    client = vision.ImageAnnotatorClient()

    with open(pdf_path, "rb") as f:
        content = f.read()

    # Synchronous file annotation handles small PDFs (up to 5 pages);
    # larger files need async_batch_annotate_files with GCS input/output
    input_config = vision.InputConfig(
        content=content,
        mime_type="application/pdf"
    )
    feature = vision.Feature(type_=vision.Feature.Type.DOCUMENT_TEXT_DETECTION)
    request = vision.AnnotateFileRequest(
        input_config=input_config,
        features=[feature],
        pages=[1, 2, 3, 4, 5]  # specify pages (max 5 per sync request)
    )

    response = client.batch_annotate_files(requests=[request])

    results = []
    for resp in response.responses:
        for page_resp in resp.responses:
            results.append({"text": page_resp.full_text_annotation.text})
    return results
```
Pricing: $1.50 per 1,000 pages. For occasional use, the free tier covers 1,000 pages/month.
AWS Textract
Best for structured documents (forms, tables) where you need field-level extraction, not just raw text.
```bash
# Quick test with the CLI
aws textract detect-document-text \
  --document '{"S3Object": {"Bucket": "my-bucket", "Name": "scan.pdf"}}' \
  --query 'Blocks[?BlockType==`LINE`].Text' \
  --output text
```
```python
import boto3

def ocr_textract(pdf_path: str) -> str:
    # Note: the synchronous API only accepts single-page PDFs; multi-page
    # documents need the async start_document_text_detection flow via S3
    client = boto3.client("textract")

    with open(pdf_path, "rb") as f:
        response = client.detect_document_text(
            Document={"Bytes": f.read()}
        )

    lines = [
        block["Text"]
        for block in response["Blocks"]
        if block["BlockType"] == "LINE"
    ]
    return "\n".join(lines)
```
Pricing: $1.50 per 1,000 pages for text detection. $15 per 1,000 pages for table/form extraction.
Comparison
| Service | Accuracy (clean scans) | Accuracy (poor quality) | Handwriting | Cost per 1K pages | Latency |
|---|---|---|---|---|---|
| Tesseract 5 | 94-97% | 65-88% | Poor | $0 | 1-3s/page |
| Google Vision | 98% | 92% | Good | $1.50 | 0.5-1s/page |
| AWS Textract | 97% | 90% | Fair | $1.50 | 1-2s/page |
| Azure Doc Intelligence | 97% | 91% | Good | $1.00 | 1-2s/page |
My decision tree:
- Budget is $0: Tesseract with preprocessing.
- Need high accuracy on mixed-quality scans: Google Vision.
- Need structured extraction (tables, forms): AWS Textract.
- Processing sensitive documents on-premise: Tesseract (or PaddleOCR if you need better accuracy without cloud).
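If you want that decision made explicitly in code (so the choice is logged and reviewable per document), it reduces to a few lines. The engine labels are mine; the priorities come straight from the list above:

```python
def pick_ocr_engine(budget_is_zero: bool = False, on_premise: bool = False,
                    needs_structure: bool = False, mixed_quality: bool = False) -> str:
    """Encode the decision tree above as an explicit, testable function."""
    if budget_is_zero or on_premise:
        return "tesseract"       # free and local; add preprocessing
    if needs_structure:
        return "aws-textract"    # field/table-level extraction
    if mixed_quality:
        return "google-vision"   # best accuracy on poor scans
    return "tesseract"           # default: free is good enough

print(pick_ocr_engine(on_premise=True))       # tesseract
print(pick_ocr_engine(needs_structure=True))  # aws-textract
```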
The hybrid pipeline
In production, I use a two-step pipeline that avoids unnecessary OCR:
```python
import fitz
import pytesseract
from PIL import Image
import io

def smart_extract(pdf_path: str, ocr_threshold: int = 100) -> dict:
    """Extract text from a PDF, using OCR only when needed."""
    doc = fitz.open(pdf_path)
    pages = []
    ocr_used = False

    for page_num, page in enumerate(doc):
        # Try native text extraction first
        text = page.get_text("text").strip()

        if len(text) > ocr_threshold:
            # Enough native text — no OCR needed
            pages.append({
                "page": page_num + 1,
                "text": text,
                "method": "native",
            })
        else:
            # Probably a scan — use OCR
            mat = fitz.Matrix(300 / 72, 300 / 72)
            pix = page.get_pixmap(matrix=mat)
            img = Image.open(io.BytesIO(pix.tobytes("png")))
            ocr_text = pytesseract.image_to_string(img, lang="eng", config="--psm 6")
            ocr_used = True
            pages.append({
                "page": page_num + 1,
                "text": ocr_text.strip(),
                "method": "ocr",
            })

    doc.close()
    return {
        "path": pdf_path,
        "total_pages": len(pages),
        "ocr_used": ocr_used,
        "pages": pages,
    }

result = smart_extract("mixed_document.pdf")
print(f"Pages: {result['total_pages']}, OCR used: {result['ocr_used']}")
native_count = sum(1 for p in result["pages"] if p["method"] == "native")
print(f"Native extraction: {native_count}/{result['total_pages']} pages")
```
This saves 70-85% of OCR processing for a typical document collection. On a 10,000-page batch, that's the difference between $15 in cloud API costs and $2.25.
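To run the same arithmetic for your own document mix, assuming the $1.50-per-1,000-pages rate quoted above (the function name and shape are mine):

```python
def ocr_cost_estimate(total_pages: int, native_ratio: float,
                      rate_per_1k: float = 1.50) -> dict:
    """Compare OCR-everything cost against the smart pipeline's cost."""
    naive = total_pages / 1000 * rate_per_1k
    smart = total_pages * (1 - native_ratio) / 1000 * rate_per_1k
    return {"naive": round(naive, 2), "smart": round(smart, 2),
            "saved": round(naive - smart, 2)}

print(ocr_cost_estimate(10_000, 0.85))  # {'naive': 15.0, 'smart': 2.25, 'saved': 12.75}
```

Run check_pdf_type over a sample of your corpus first to get a realistic native_ratio; the savings scale linearly with it.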
Batch processing with progress
For large batches, wrap it with progress tracking:
```python
# batch_extract.py: process a folder of PDFs with the smart pipeline
import os
from smart_extract import smart_extract  # the function above

pdf_dir = "./documents"
total_ocr = 0
total_native = 0

for name in sorted(os.listdir(pdf_dir)):
    if not name.lower().endswith(".pdf"):
        continue
    result = smart_extract(os.path.join(pdf_dir, name))
    for p in result["pages"]:
        if p["method"] == "ocr":
            total_ocr += 1
        else:
            total_native += 1
    print(f"{name}: {result['total_pages']} pages, OCR: {result['ocr_used']}")

print(f"Total: {total_native} native, {total_ocr} OCR")
print(f"OCR savings: {total_native / (total_native + total_ocr) * 100:.0f}%")
```
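If you want an actual progress bar rather than per-file prints, a dependency-free helper is enough (the format here is my own invention; tqdm is the nicer off-the-shelf option):

```python
import sys

def progress_bar(done: int, total: int, width: int = 30) -> str:
    """Render a text progress bar like '[#########....................] 12/40'."""
    filled = int(width * done / total)
    return f"[{'#' * filled}{'.' * (width - filled)}] {done}/{total}"

# Inside the batch loop, overwrite the same terminal line each iteration:
# sys.stdout.write("\r" + progress_bar(i + 1, total_files))
# sys.stdout.flush()

print(progress_bar(12, 40))
```

OCR is slow enough (seconds per page with Tesseract) that some visible progress indicator matters on multi-thousand-page batches.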
Skip OCR entirely: generate PDFs with real text
The entire OCR industry exists because people create PDFs without proper text layers. Scans, screenshots, rasterized design exports — they all throw away the text data and leave you with pixels.
If you're on the generating side of the equation, you can eliminate downstream OCR needs by producing PDFs with actual text. HTML-to-PDF rendering does this automatically. The browser engine renders your HTML, and the PDF output contains real, selectable, searchable text. Screen readers can parse it. Text extraction is instant and 100% accurate. No OCR needed, ever.
LightningPDF generates PDFs from HTML through a headless browser. The output has a proper text layer — every character is stored as Unicode, every heading is structured, every table cell is extractable. If someone downstream needs to OCR your PDFs, you've already failed at generation.
```bash
curl -X POST https://api.lightningpdf.dev/api/v1/pdf/generate \
  -H "X-API-Key: your-key" \
  -H "Content-Type: application/json" \
  -d '{
    "html": "<h1>Quarterly Report</h1><p>Revenue: $1.2M</p><table><tr><th>Product</th><th>Sales</th></tr><tr><td>Widget A</td><td>$450K</td></tr></table>"
  }'
```
That produces a PDF where "Quarterly Report" is a real heading, "$1.2M" is real text, and the table is a real table. Any extraction tool — PyMuPDF, pdfplumber, even pdftotext from the command line — will get perfect output. Zero OCR, zero ambiguity, zero cost.
Generate accessible PDFs from the start and the OCR question answers itself.
LightningPDF Team
Building fast, reliable PDF generation tools for developers.