OCR PDF API: When You Need It and When You Don't

A practical guide to PDF OCR: how to check if a PDF actually needs OCR, Tesseract vs cloud APIs, and when you should skip OCR entirely by generating PDFs with real text layers.

By LightningPDF Team · 5 min read

A developer on my team spent two days integrating a cloud OCR API into our document pipeline. Processing cost: about $200/month. Then I checked the actual PDFs flowing through the system. 94% of them were born-digital — they already had a text layer. We were paying to "recognize" text that was already there.

This happens constantly. OCR has become the default answer to "I need to extract text from PDFs," but most PDFs don't need it. Here's how to tell the difference, what to use when you actually need OCR, and how to avoid the problem entirely.

Native text vs. scanned PDFs: how to tell

A PDF with native text has the actual character data embedded. You can select text, copy it, search it. The text is stored as Unicode characters with positioning instructions.

A scanned PDF is a stack of images. Every "page" is a raster image (usually JPEG or CCITT fax) wrapped in a PDF container. There's no text data — just pixels. To get text out, you need OCR.

Then there's the hybrid: a scanned PDF that's already been OCR'd. It has an image layer (what you see) and an invisible text layer (what you can search). These don't need OCR again, but they look like scans at first glance.

Here's how to check programmatically:

import fitz  # pip install pymupdf

def check_pdf_type(pdf_path: str) -> dict:
    """Determine if a PDF has native text, is a scan, or needs OCR."""
    doc = fitz.open(pdf_path)

    total_pages = len(doc)
    pages_with_text = 0
    pages_with_images = 0
    total_text_chars = 0

    for page in doc:
        text = page.get_text("text").strip()
        images = page.get_images(full=True)

        if len(text) > 50:
            pages_with_text += 1
            total_text_chars += len(text)

        # A page with images but little text is likely a scan; page-level
        # counts are enough for the classification below.
        if images:
            pages_with_images += 1

    doc.close()

    text_ratio = pages_with_text / total_pages if total_pages > 0 else 0
    avg_chars = total_text_chars / total_pages if total_pages > 0 else 0

    if text_ratio > 0.8 and avg_chars > 200:
        pdf_type = "native_text"
        needs_ocr = False
    elif text_ratio > 0.3 and pages_with_images > total_pages * 0.5:
        pdf_type = "hybrid_ocrd"
        needs_ocr = False
    else:
        pdf_type = "scanned"
        needs_ocr = True

    return {
        "path": pdf_path,
        "type": pdf_type,
        "needs_ocr": needs_ocr,
        "total_pages": total_pages,
        "pages_with_text": pages_with_text,
        "pages_with_images": pages_with_images,
        "avg_chars_per_page": round(avg_chars),
    }

result = check_pdf_type("mystery_document.pdf")
print(f"Type: {result['type']}")
print(f"Needs OCR: {result['needs_ocr']}")
print(f"Avg chars/page: {result['avg_chars_per_page']}")

I run this check on every PDF before deciding whether to OCR. On a typical business document collection:

  • ~70% are native text (Word/Google Docs exports, HTML-to-PDF)
  • ~15% are scans that have already been OCR'd
  • ~15% actually need OCR

That means roughly 85% of PDFs can be processed with simple text extraction. No OCR API, no compute cost, no waiting.

When you actually need OCR

You need OCR when:

  • The PDF is a scan with no text layer (avg_chars_per_page near zero)
  • The PDF was exported from a design tool that rasterizes text (Canva, some Figma exports)
  • You're processing photos of documents (receipts, whiteboards, handwritten notes)

You don't need OCR when:

  • The PDF was exported from Word, Excel, Google Docs, or any office suite
  • The PDF was generated by an HTML-to-PDF tool
  • The PDF already has a text layer (even if it's a scan with OCR applied)
  • The text is selectable when you open it in a PDF viewer

Tesseract: free, local, good enough

Tesseract is an open-source OCR engine, originally developed at HP in the 1980s and long sponsored by Google. After a major rewrite around LSTM neural networks in v4/v5, its accuracy on printed text is solid.

Setup

# macOS
brew install tesseract

# Ubuntu/Debian
sudo apt install tesseract-ocr tesseract-ocr-eng

# Check version (want 5.x)
tesseract --version

Basic usage with pytesseract

import fitz
import pytesseract
from PIL import Image
import io

def ocr_pdf_tesseract(pdf_path: str, dpi: int = 300) -> list[dict]:
    """OCR a scanned PDF using Tesseract."""
    doc = fitz.open(pdf_path)
    results = []

    for page_num, page in enumerate(doc):
        # Render page to image
        mat = fitz.Matrix(dpi / 72, dpi / 72)
        pix = page.get_pixmap(matrix=mat)
        img = Image.open(io.BytesIO(pix.tobytes("png")))

        # Run OCR
        text = pytesseract.image_to_string(img, lang="eng")

        results.append({
            "page": page_num + 1,
            "text": text.strip(),
            "confidence": _get_confidence(img)
        })

    doc.close()
    return results

def _get_confidence(img: Image.Image) -> float:
    """Get average OCR confidence for an image."""
    data = pytesseract.image_to_data(img, output_type=pytesseract.Output.DICT)
    confidences = [int(c) for c in data["conf"] if int(c) > 0]
    return sum(confidences) / len(confidences) if confidences else 0.0

pages = ocr_pdf_tesseract("scanned_contract.pdf")
for p in pages:
    print(f"Page {p['page']}: {len(p['text'])} chars, confidence: {p['confidence']:.1f}%")
    print(p["text"][:200])
    print("---")
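The per-page confidence is useful for more than logging: you can route low-confidence pages to manual review. A small helper over the `ocr_pdf_tesseract` output (the 80% default is an arbitrary starting point, not a recommendation):

```python
def flag_low_confidence(pages: list[dict], threshold: float = 80.0) -> list[int]:
    """Return page numbers whose mean OCR confidence falls below threshold."""
    return [p["page"] for p in pages if p["confidence"] < threshold]
```

Tune the threshold against a sample of your own documents before trusting it.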

Preprocessing for better accuracy

Raw scans often produce mediocre OCR results. A few preprocessing steps can push accuracy from 85% to 95%+:

import cv2
import numpy as np

def preprocess_for_ocr(img: Image.Image) -> Image.Image:
    """Clean up a scanned image before OCR."""
    # Convert to OpenCV format
    img_array = np.array(img)

    # Convert to grayscale
    if len(img_array.shape) == 3:
        gray = cv2.cvtColor(img_array, cv2.COLOR_RGB2GRAY)
    else:
        gray = img_array

    # Denoise
    denoised = cv2.fastNlMeansDenoising(gray, h=10)

    # Adaptive threshold (handles uneven lighting from scans)
    binary = cv2.adaptiveThreshold(
        denoised, 255,
        cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
        cv2.THRESH_BINARY,
        blockSize=11,
        C=2
    )

    # Deskew (fix slight rotation from scanning). Note: minAreaRect needs
    # float32 points, and its angle convention changed in OpenCV 4.5
    # (now in [0, 90]), so map anything past 45 degrees back toward zero.
    coords = np.column_stack(np.where(binary < 128)).astype(np.float32)
    if len(coords) > 100:
        angle = cv2.minAreaRect(coords)[-1]
        if angle > 45:
            angle -= 90
        if abs(angle) > 0.5:
            h, w = binary.shape
            center = (w // 2, h // 2)
            M = cv2.getRotationMatrix2D(center, angle, 1.0)
            binary = cv2.warpAffine(
                binary, M, (w, h),
                flags=cv2.INTER_CUBIC,
                borderMode=cv2.BORDER_REPLICATE
            )

    return Image.fromarray(binary)

Then use it in the pipeline:

# In ocr_pdf_tesseract, before pytesseract.image_to_string:
img = preprocess_for_ocr(img)
text = pytesseract.image_to_string(img, lang="eng", config="--psm 6")

The --psm 6 flag tells Tesseract to assume a uniform block of text, which works better for most document pages than the default.
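The other segmentation modes are worth knowing. A small helper for building config strings (mode numbers are from `tesseract --help-psm`; the shorthand names are my own, not Tesseract's):

```python
# Common Tesseract page-segmentation modes, per `tesseract --help-psm`.
PSM = {
    "auto": 3,            # fully automatic page segmentation (the default)
    "single_column": 4,   # single column of text, variable sizes
    "uniform_block": 6,   # single uniform block of text
    "single_line": 7,     # treat the image as one text line
    "sparse_text": 11,    # find as much text as possible, in no order
}

def tess_config(psm: str = "uniform_block", oem: int = 1) -> str:
    """Build a pytesseract config string; --oem 1 selects the LSTM engine."""
    return f"--psm {PSM[psm]} --oem {oem}"
```

For receipts and screenshots, `sparse_text` (11) often beats the defaults; for book pages, `auto` (3) is usually fine.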

Tesseract accuracy on my test set of 200 scanned pages:

Document type                  Raw accuracy    With preprocessing
Clean office scans (300 dpi)   94%             97%
Phone photos of documents      78%             88%
Old fax copies                 65%             79%
Handwritten text               30%             35%

Tesseract doesn't do handwriting. If you need that, you need a cloud API.

Cloud OCR APIs: when Tesseract isn't enough

Google Cloud Vision

Best accuracy I've tested, especially for multi-language documents and poor-quality scans.

from google.cloud import vision

def ocr_google_vision(pdf_path: str) -> list[dict]:
    client = vision.ImageAnnotatorClient()

    with open(pdf_path, "rb") as f:
        content = f.read()

    # For PDFs, use async batch annotation
    input_config = vision.InputConfig(
        content=content,
        mime_type="application/pdf"
    )
    feature = vision.Feature(type_=vision.Feature.Type.DOCUMENT_TEXT_DETECTION)
    request = vision.AnnotateFileRequest(
        input_config=input_config,
        features=[feature],
        pages=[1, 2, 3, 4, 5]  # the sync batch API processes at most 5 pages per file
    )

    response = client.batch_annotate_files(requests=[request])

    results = []
    for resp in response.responses:
        for page_resp in resp.responses:
            text = page_resp.full_text_annotation.text
            results.append({"text": text})

    return results

Pricing: $1.50 per 1,000 pages. For occasional use, the free tier covers 1,000 pages/month.
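A quick estimator for the monthly bill under that pricing (the rate and free-tier size are the figures above; verify both against current pricing before budgeting):

```python
def vision_monthly_cost(pages: int, rate_per_1k: float = 1.50,
                        free_pages: int = 1000) -> float:
    """Estimated monthly Google Vision OCR cost after the free tier."""
    billable = max(0, pages - free_pages)
    return round(billable / 1000 * rate_per_1k, 2)
```

At 800 pages/month you pay nothing; at 10,000 pages/month you pay $13.50.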

AWS Textract

Best for structured documents (forms, tables) where you need field-level extraction, not just raw text.

# Quick test with the CLI
aws textract detect-document-text \
  --document '{"S3Object": {"Bucket": "my-bucket", "Name": "scan.pdf"}}' \
  --query 'Blocks[?BlockType==`LINE`].Text' \
  --output text

The same call from Python:

import boto3

def ocr_textract(pdf_path: str) -> str:
    client = boto3.client("textract")

    # Note: the synchronous API handles single-page documents; multi-page
    # PDFs need the async StartDocumentTextDetection/GetDocumentTextDetection calls.
    with open(pdf_path, "rb") as f:
        response = client.detect_document_text(
            Document={"Bytes": f.read()}
        )

    lines = []
    for block in response["Blocks"]:
        if block["BlockType"] == "LINE":
            lines.append(block["Text"])

    return "\n".join(lines)

Pricing: $1.50 per 1,000 pages for text detection. $15 per 1,000 pages for table/form extraction.

Comparison

Service                  Accuracy (clean scans)  Accuracy (poor quality)  Handwriting  Cost per 1K pages  Latency
Tesseract 5              94-97%                  65-88%                   Poor         $0                 1-3s/page
Google Vision            98%                     92%                      Good         $1.50              0.5-1s/page
AWS Textract             97%                     90%                      Fair         $1.50              1-2s/page
Azure Doc Intelligence   97%                     91%                      Good         $1.00              1-2s/page

My decision tree:

  • Budget is $0: Tesseract with preprocessing.
  • Need high accuracy on mixed-quality scans: Google Vision.
  • Need structured extraction (tables, forms): AWS Textract.
  • Processing sensitive documents on-premise: Tesseract (or PaddleOCR if you need better accuracy without cloud).

The hybrid pipeline

In production, I use a two-step pipeline that avoids unnecessary OCR:

import fitz
import pytesseract
from PIL import Image
import io
import json

def smart_extract(pdf_path: str, ocr_threshold: int = 100) -> dict:
    """Extract text from a PDF, using OCR only when needed."""
    doc = fitz.open(pdf_path)
    pages = []
    ocr_used = False

    for page_num, page in enumerate(doc):
        # Try native text extraction first
        text = page.get_text("text").strip()

        if len(text) > ocr_threshold:
            # Enough native text — no OCR needed
            pages.append({
                "page": page_num + 1,
                "text": text,
                "method": "native"
            })
        else:
            # Probably a scan — use OCR
            mat = fitz.Matrix(300 / 72, 300 / 72)
            pix = page.get_pixmap(matrix=mat)
            img = Image.open(io.BytesIO(pix.tobytes("png")))

            ocr_text = pytesseract.image_to_string(img, lang="eng", config="--psm 6")
            ocr_used = True

            pages.append({
                "page": page_num + 1,
                "text": ocr_text.strip(),
                "method": "ocr"
            })

    doc.close()

    return {
        "path": pdf_path,
        "total_pages": len(pages),
        "ocr_used": ocr_used,
        "pages": pages
    }

result = smart_extract("mixed_document.pdf")
print(f"Pages: {result['total_pages']}, OCR used: {result['ocr_used']}")

native_count = sum(1 for p in result["pages"] if p["method"] == "native")
print(f"Native extraction: {native_count}/{result['total_pages']} pages")

This saves 70-85% of OCR processing for a typical document collection. On a 10,000-page batch, that's the difference between $15 in cloud API costs and $2.25.
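The arithmetic, using the figures above:

```python
pages = 10_000
rate_per_1k = 1.50                                 # cloud OCR text detection
naive_cost = pages / 1000 * rate_per_1k            # OCR every page: $15.00
ocr_fraction = 0.15                                # only the true scans need OCR
smart_cost = pages * ocr_fraction / 1000 * rate_per_1k  # $2.25
```

The gap widens with volume, and the native-extraction pages come back faster too.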

Batch processing with progress

For large batches, wrap it with progress tracking:

# batch_extract.py — process a folder of PDFs with the smart pipeline
import os

from smart_extract import smart_extract  # the function above

pdf_dir = "./documents"
total_ocr = 0
total_native = 0

for name in sorted(os.listdir(pdf_dir)):
    if not name.endswith(".pdf"):
        continue
    result = smart_extract(os.path.join(pdf_dir, name))
    for p in result["pages"]:
        if p["method"] == "ocr":
            total_ocr += 1
        else:
            total_native += 1
    print(f"{name}: {result['total_pages']} pages, OCR: {result['ocr_used']}")

total = total_native + total_ocr
print(f"Total: {total_native} native, {total_ocr} OCR")
if total:
    print(f"OCR savings: {total_native / total * 100:.0f}%")

Skip OCR entirely: generate PDFs with real text

The entire OCR industry exists because people create PDFs without proper text layers. Scans, screenshots, rasterized design exports — they all throw away the text data and leave you with pixels.

If you're on the generating side of the equation, you can eliminate downstream OCR needs by producing PDFs with actual text. HTML-to-PDF rendering does this automatically. The browser engine renders your HTML, and the PDF output contains real, selectable, searchable text. Screen readers can parse it. Text extraction is instant and 100% accurate. No OCR needed, ever.

LightningPDF generates PDFs from HTML through a headless browser. The output has a proper text layer — every character is stored as Unicode, every heading is structured, every table cell is extractable. If someone downstream needs to OCR your PDFs, you've already failed at generation.

curl -X POST https://api.lightningpdf.dev/api/v1/pdf/generate \
  -H "X-API-Key: your-key" \
  -H "Content-Type: application/json" \
  -d '{
    "html": "<h1>Quarterly Report</h1><p>Revenue: $1.2M</p><table><tr><th>Product</th><th>Sales</th></tr><tr><td>Widget A</td><td>$450K</td></tr></table>"
  }'

That produces a PDF where "Quarterly Report" is a real heading, "$1.2M" is real text, and the table is a real table. Any extraction tool — PyMuPDF, pdfplumber, even pdftotext from the command line — will get perfect output. Zero OCR, zero ambiguity, zero cost.

Generate accessible PDFs from the start and the OCR question answers itself.


LightningPDF Team

Building fast, reliable PDF generation tools for developers.
