Best PDF Extraction APIs Compared: Textract vs Document AI vs the Rest

An honest comparison of AWS Textract, Google Document AI, Adobe PDF Extract, and open-source alternatives for PDF text extraction in 2026.

By LightningPDF Team · 5 min read

Last month I needed to extract invoice data from about 3,000 PDFs for a client project. Some were native text, some were scans, and a handful were those nightmare PDFs exported from Canva where the text is actually vector paths.

I tested six extraction services. Here's what happened.

The contenders

I picked the services developers actually talk about, not the ones that buy the most Google ads:

  • AWS Textract — Amazon's document analysis service
  • Google Document AI — Google's extraction platform
  • Adobe PDF Extract — Adobe's API (they did invent the format)
  • Azure Document Intelligence — Microsoft's offering
  • Kreuzberg — open-source, Rust-based, no cloud dependency
  • PyMuPDF — the old reliable open-source library

Price comparison

Let's start with what matters when you're processing thousands of documents:

Service                | Text extraction                 | Table extraction | Form extraction
AWS Textract           | $0.0015/page                    | $0.015/page      | $0.05/page
Google Document AI     | $0.001/page (OCR)               | $0.01/page       | $0.03-0.30/page
Adobe PDF Extract      | 500 free/month, then enterprise | included         | included
Azure Doc Intelligence | $0.001/page                     | $0.01/page       | $0.01/page
Kreuzberg              | Free (MIT)                      | Free             | Not supported
PyMuPDF                | Free (AGPL)                     | Basic            | Not supported

For my 3,000 invoices (about 5,000 pages total):

  • Textract: ~$75 for tables
  • Google: ~$50 for tables
  • Azure: ~$50 for tables
  • Adobe: Enterprise pricing (they wouldn't give me a number without a sales call)
  • Kreuzberg: $0
  • PyMuPDF: $0
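
Those estimates are just page count times the per-page table rate. Here's a minimal sketch of the arithmetic if you want to plug in your own volume. The rates are the ones from the pricing table above and may have changed since; the function and dictionary names are mine, not from any SDK.

```python
# Per-page table-extraction rates in USD, copied from the pricing table above.
# These are assumptions for illustration; check each vendor's pricing page.
TABLE_RATE_PER_PAGE = {
    "AWS Textract": 0.015,
    "Google Document AI": 0.01,
    "Azure Doc Intelligence": 0.01,
}


def estimate_table_cost(pages: int) -> dict[str, float]:
    """Return the estimated table-extraction cost per service for `pages` pages."""
    return {
        service: round(rate * pages, 2)
        for service, rate in TABLE_RATE_PER_PAGE.items()
    }


if __name__ == "__main__":
    for service, cost in estimate_table_cost(5000).items():
        print(f"{service}: ${cost:.2f}")
```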

What I actually tested

I grabbed 100 PDFs from the batch — a mix of native text invoices, scanned receipts, and those Canva exports — and ran them through each service.

AWS Textract

Setup was straightforward if you already live in AWS. Upload to S3, call the API, get back JSON blocks.

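import boto3

# Picks up credentials and region from your AWS environment.
client = boto3.client('textract')

# Synchronous call, so it only handles single-page documents;
# multi-page PDFs go through start_document_analysis instead.
response = client.analyze_document(
    Document={'S3Object': {'Bucket': 'my-bucket', 'Name': 'invoice.pdf'}},
    FeatureTypes=['TABLES']  # adding 'FORMS' raises the per-page price
)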

The good: Table detection was the best of the bunch. It correctly identified 94 out of 100 tables and maintained column alignment. OCR on scanned docs was solid.

The bad: The API returns a deeply nested JSON structure of "blocks" with relationships between them. Reconstructing a simple table from this requires about 50 lines of code to traverse the block tree. There's no "just give me the table as a CSV" option.
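
To give a feel for that traversal, here's a rough sketch of rebuilding tables from the block list. It assumes the response shape Textract documents for table analysis (TABLE blocks pointing at CELL blocks, which point at WORD blocks through CHILD relationships). It ignores merged cells and checkbox elements, so treat it as a starting point rather than a complete parser. The function names are mine.

```python
import csv
import io


def tables_from_blocks(response: dict) -> list[list[list[str]]]:
    """Rebuild every TABLE block as a list of rows, each a list of cell strings."""
    blocks = response.get("Blocks", [])
    by_id = {block["Id"]: block for block in blocks}

    def child_ids(block: dict) -> list[str]:
        # Collect the Ids of all CHILD relationships (the key may be absent).
        ids: list[str] = []
        for rel in block.get("Relationships") or []:
            if rel["Type"] == "CHILD":
                ids.extend(rel["Ids"])
        return ids

    def cell_text(cell: dict) -> str:
        # A cell's text is the space-joined text of its WORD children.
        words = [
            by_id[cid]["Text"]
            for cid in child_ids(cell)
            if cid in by_id and by_id[cid]["BlockType"] == "WORD"
        ]
        return " ".join(words)

    tables = []
    for block in blocks:
        if block["BlockType"] != "TABLE":
            continue
        cells = [
            by_id[cid]
            for cid in child_ids(block)
            if cid in by_id and by_id[cid]["BlockType"] == "CELL"
        ]
        if not cells:
            continue
        n_rows = max(cell["RowIndex"] for cell in cells)
        n_cols = max(cell["ColumnIndex"] for cell in cells)
        grid = [[""] * n_cols for _ in range(n_rows)]
        for cell in cells:
            # RowIndex and ColumnIndex are 1-based in the response.
            grid[cell["RowIndex"] - 1][cell["ColumnIndex"] - 1] = cell_text(cell)
        tables.append(grid)
    return tables


def table_to_csv(rows: list[list[str]]) -> str:
    """Serialize one rebuilt table as CSV text."""
    buffer = io.StringIO()
    csv.writer(buffer).writerows(rows)
    return buffer.getvalue()
```

With the `response` from the call above, `tables_from_blocks(response)` would give one grid per detected table, and `table_to_csv` turns a grid into the CSV the API doesn't offer directly.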

Pricing surprise: If you enable "FORMS" detection alongside "TABLES", the cost jumps from $0.015 to $0.065 per page. Read the pricing page carefully.

Google Document AI

Google has roughly 60 pre-trained "processors" for different document types. For general extraction, you use the OCR processor.

The good: Best OCR accuracy for handwritten text. If your PDFs include handwritten notes, Google wins. The layout analysis is also excellent — it understands columns, headers, and reading order better than Textract.

The bad: The API design is awkward. You need to create a "processor" in the console before you can make API calls. And the response format varies by processor type, so you can't just swap processors without changing your parsing code.

Adobe PDF Extract

You'd think the company that invented PDF would have the best extraction. They don't.

The good: The structured output is clean. You get paragraphs, headings, and tables in a well-organized JSON. For native text PDFs, the quality is excellent.

The bad: No OCR for scanned documents. If the PDF doesn't have a text layer, Adobe returns nothing. The free tier is 500 transactions/month, and after that they make you talk to sales. I never got a straight answer on pricing.

Azure Document Intelligence

Microsoft's offering. I almost skipped it because the branding changes every six months (it was "Form Recognizer" before), but developers on HN kept recommending it.

The good: Genuinely surprised. Table extraction was on par with Textract, the OCR was solid, and the pricing was the most transparent. The "prebuilt-layout" model handles most documents well without any training.

The bad: The Python SDK has some rough edges. Error messages could be better. But functionally, this was the strongest all-rounder.

Kreuzberg (open source)

The new player. Written in Rust, runs locally, supports 91 file formats. I installed it with pip (it has Python bindings):

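# extract_file is a coroutine in the Python bindings; extract_file_sync is the
# blocking variant, which is what you want in a plain script.
from kreuzberg import extract_file_sync

result = extract_file_sync("invoice.pdf")
print(result.content)  # extracted text as a single string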

The good: Blazing fast for native text PDFs. Processed my 100-PDF test batch in 8 seconds locally vs. 45+ seconds for the cloud APIs (network latency adds up). No API keys, no cloud dependency, no per-page costs.

The bad: OCR quality with Tesseract is noticeably worse than the cloud services. It extracted text from 96/100 PDFs, but the 4 failures were scanned receipts where Tesseract couldn't handle the low resolution. No table structure detection — you get raw text.

PyMuPDF

The workhorse that's been around forever:

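import fitz  # PyMuPDF's legacy import name; newer releases also accept `import pymupdf`

doc = fitz.open("invoice.pdf")
for page in doc:
    # get_text() with no arguments returns the page's plain text
    print(page.get_text())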

The good: Fastest for pure text extraction. Zero setup, zero cost, works offline.

The bad: No built-in OCR. Table detection is basic at best. No layout analysis. It gives you text in reading order (usually) and that's it.

My results table

Service       | Native text | Scanned docs       | Tables | Speed (100 PDFs)
Textract      | Great       | Great              | Best   | 52s
Google Doc AI | Great       | Best (handwriting) | Good   | 48s
Adobe Extract | Excellent   | No OCR             | Good   | 65s
Azure         | Great       | Great              | Great  | 45s
Kreuzberg     | Great       | OK (Tesseract)     | No     | 8s
PyMuPDF       | Great       | No OCR             | No     | 3s

The honest recommendation

If you need high accuracy on mixed document types (native + scanned), use Azure Document Intelligence. It's the best balance of quality, pricing, and API usability.

If you're all-in on AWS, Textract is fine. The table detection is best-in-class. Just budget for the parsing code to reconstruct tables from the block tree.

If you're processing native text PDFs at scale, skip the cloud entirely. In my test, Kreuzberg and PyMuPDF ran anywhere from about 6x to 20x faster than the cloud APIs (3-8s versus 45-65s for 100 PDFs), and they're free.

If most of your PDFs are handwritten or low-quality scans, Google Document AI. The OCR is genuinely a tier above the rest.
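
If you want those rules as something you can drop into a pipeline, here's a small sketch. The function name, the flags, and the order in which the rules are checked are my own assumptions; the recommendations themselves are the four above.

```python
def recommend_extractor(
    has_scans: bool,
    mostly_handwritten_or_low_quality: bool = False,
    all_in_on_aws: bool = False,
) -> str:
    """Map a rough description of a PDF batch to one of the recommendations above."""
    # Handwriting and poor scans are checked first (assumed precedence),
    # since OCR quality dominates everything else for those batches.
    if mostly_handwritten_or_low_quality:
        return "Google Document AI"
    # Pure native-text batches don't need a cloud service at all.
    if not has_scans:
        return "Kreuzberg or PyMuPDF"
    # Mixed native + scanned: stay on AWS if you're already there.
    if all_in_on_aws:
        return "AWS Textract"
    return "Azure Document Intelligence"


if __name__ == "__main__":
    print(recommend_extractor(has_scans=True))
    print(recommend_extractor(has_scans=False))
```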

The question you should actually be asking

Before you spend a week building an extraction pipeline, ask yourself: do you actually need to extract, or do you need to generate?

I see this pattern constantly: a developer builds a system that creates invoices as PDFs, stores them, and then later needs to pull data back out of those PDFs for reporting or analysis.

The data was yours all along. You didn't need extraction — you needed to keep the structured data alongside the PDF.

If you're generating PDFs (invoices, reports, contracts, certificates), consider building your pipeline so the structured data lives in your database and the PDF is just a rendered view:

Data (DB) ─── API call ──→ PDF (for humans)
     └────────────────────→ JSON/CSV (for machines)
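
Here's a minimal sketch of that shape in plain Python, assuming a simple invoice model. The class and function names are hypothetical, and the HTML is only a stand-in for whatever you hand to your PDF renderer; nothing here is a real LightningPDF API call.

```python
import json
from dataclasses import asdict, dataclass
from html import escape


@dataclass
class LineItem:
    description: str
    quantity: int
    unit_price_cents: int


@dataclass
class Invoice:
    number: str
    customer: str
    items: list[LineItem]

    @property
    def total_cents(self) -> int:
        return sum(item.quantity * item.unit_price_cents for item in self.items)


def to_json(invoice: Invoice) -> str:
    """Machine-readable copy: store this next to the PDF (or in your database)."""
    return json.dumps(asdict(invoice), sort_keys=True)


def to_html(invoice: Invoice) -> str:
    """Human-readable view: the input you would pass to an HTML-to-PDF renderer."""
    rows = "".join(
        f"<tr><td>{escape(item.description)}</td><td>{item.quantity}</td>"
        f"<td>{item.unit_price_cents / 100:.2f}</td></tr>"
        for item in invoice.items
    )
    return (
        f"<h1>Invoice {escape(invoice.number)}</h1>"
        f"<p>{escape(invoice.customer)}</p>"
        f"<table>{rows}</table>"
        f"<p>Total: {invoice.total_cents / 100:.2f}</p>"
    )
```

Both outputs come from the same `Invoice` object, so reporting reads the JSON and never has to parse the PDF.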

LightningPDF does the "data → PDF" step in one API call. You keep the data, we render the PDF. No extraction needed.

Try it in the playground — paste HTML, get a PDF. Or check the API docs if you prefer curl.


LightningPDF is a complete PDF API for developers — generate, merge, split, compress, protect, and more. 100 free PDFs/month, no credit card required.
