Kreuzberg vs PyMuPDF vs pdfplumber: Which PDF Parser Should You Use?
A head-to-head comparison of Kreuzberg, PyMuPDF, and pdfplumber for Python PDF parsing. Benchmarks, architecture differences, and code examples to help you pick the right tool.
Last week a coworker asked me which Python PDF parser to use and I answered with a question: "What's in the PDF?"
It's not a dodge. These three libraries — Kreuzberg, PyMuPDF, and pdfplumber — make fundamentally different architectural decisions. Picking the wrong one means either rewriting your pipeline in a month or tolerating extraction bugs that poison your data.
I ran all three against the same set of PDFs and measured speed, accuracy, and developer experience. Here's what I found.
Architecture: Rust FFI vs C binding vs pure Python
Understanding what's under the hood matters because it predicts where each library will break.
PyMuPDF wraps the MuPDF C library via SWIG bindings. MuPDF is maintained by Artifex (the Ghostscript people) and handles rendering, text extraction, and annotation in C. When you call page.get_text(), you're calling into compiled C code that walks the PDF's content stream directly. The Python layer is thin — mostly marshalling data between Python objects and C structs.
pdfplumber is built on top of pdfminer.six, which is pure Python. Every PDF operator — every Tm, Td, TJ — gets interpreted by Python code. pdfplumber adds a spatial analysis layer on top: it groups characters into words, words into lines, and lines into table cells by analyzing x/y coordinates. This is why it understands tables but why it's slow.
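That coordinate-based grouping is easy to picture with a toy sketch. The char dicts below mimic the shape of pdfplumber's `page.chars` entries (real entries carry many more keys), and the gap rule is my simplification of the idea, not pdfplumber's actual algorithm:

```python
# Toy sketch of coordinate-based word grouping — the idea behind
# pdfplumber's spatial layer, NOT its actual implementation.
chars = [
    {"text": "T", "x0": 10.0, "x1": 16.0},
    {"text": "o", "x0": 16.2, "x1": 21.8},
    {"text": "t", "x0": 22.0, "x1": 26.0},
    {"text": "a", "x0": 26.1, "x1": 31.5},
    {"text": "l", "x0": 31.6, "x1": 34.0},
    {"text": "$", "x0": 60.0, "x1": 66.0},  # big x-gap -> new word
    {"text": "4", "x0": 66.1, "x1": 72.0},
]

def group_words(chars, max_gap=3.0):
    """Merge adjacent characters into words when the horizontal gap is small."""
    words, current = [], chars[0]["text"]
    for prev, cur in zip(chars, chars[1:]):
        if cur["x0"] - prev["x1"] <= max_gap:
            current += cur["text"]
        else:
            words.append(current)
            current = cur["text"]
    words.append(current)
    return words

print(group_words(chars))  # ['Total', '$4']
```

Every one of those comparisons runs in interpreted Python for every character on every page, which is where the slowdown comes from.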
Kreuzberg takes a different approach. It's a Python package with a Rust core compiled via PyO3. The Rust layer handles the byte-level PDF parsing and text extraction, while Python handles the orchestration. For scanned PDFs, Kreuzberg shells out to Tesseract OCR automatically — you don't need to detect whether a PDF is scanned or native. It also supports DOCX, PPTX, HTML, and other formats through the same extract_file() call.
Here's what this means in practice:
| | PyMuPDF | pdfplumber | Kreuzberg |
|---|---|---|---|
| Core language | C (MuPDF) | Pure Python (pdfminer) | Rust (PyO3) |
| PDF parsing | MuPDF C library | pdfminer.six | Rust native parser |
| OCR | Optional (separate setup) | None | Tesseract (auto-detected) |
| Table extraction | Basic (text blocks) | Yes (spatial analysis) | No |
| License | AGPL-3.0 | MIT | MIT |
| Other formats | PDF only | PDF only | DOCX, PPTX, HTML, images, etc. |
Installation
All three install via pip, but with different complexity.
# PyMuPDF — ships prebuilt wheels, no system deps
pip install pymupdf
# pdfplumber — pure Python, no compilation
pip install pdfplumber
# Kreuzberg — needs Tesseract for OCR on scanned PDFs
pip install kreuzberg
# On Ubuntu/Debian:
sudo apt install tesseract-ocr
PyMuPDF and pdfplumber are zero-config. Kreuzberg works for native PDFs without Tesseract, but you'll want Tesseract installed for scanned documents, since automatic OCR is one of the main reasons to reach for Kreuzberg in the first place.
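Because Kreuzberg invokes the Tesseract binary at runtime, it's worth failing fast if it's missing rather than discovering that mid-batch. A stdlib-only check (the helper name is mine):

```python
import shutil

def tesseract_available() -> bool:
    """Return True if the tesseract binary is on PATH."""
    return shutil.which("tesseract") is not None

if not tesseract_available():
    print("Tesseract not found: OCR on scanned PDFs will be unavailable")
```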
The test: same PDF, three parsers
I used a 12-page financial report (native text, not scanned) with a mix of paragraphs, bullet lists, and two tables per page. Here's the extraction code for each.
PyMuPDF
import fitz
import time
start = time.perf_counter()
doc = fitz.open("financial-report.pdf")
text = ""
for page in doc:
    text += page.get_text()
elapsed = time.perf_counter() - start
print(f"PyMuPDF: {len(text):,} chars in {elapsed:.3f}s")
print(text[:300])
Output:
PyMuPDF: 48,231 chars in 0.018s
Q3 2025 Financial Summary
Revenue Overview
Total revenue for Q3 2025 reached $4.2M, representing a 23% increase
over Q2. Subscription revenue accounted for 78% of total revenue...
Fast. The text reads in natural order. But the tables come out as a flat stream of text — column headers and values are interleaved without structure.
pdfplumber
import pdfplumber
import time
start = time.perf_counter()
text = ""
tables_found = []
with pdfplumber.open("financial-report.pdf") as pdf:
    for page in pdf.pages:
        text += page.extract_text() or ""
        for table in page.extract_tables():
            tables_found.append(table)
elapsed = time.perf_counter() - start
print(f"pdfplumber: {len(text):,} chars in {elapsed:.3f}s")
print(f"Tables found: {len(tables_found)}")
for row in tables_found[0][:3]:
    print(row)
Output:
pdfplumber: 47,892 chars in 0.284s
Tables found: 24
['Category', 'Q2 2025', 'Q3 2025', 'Change']
['Subscription', '$2,890,000', '$3,276,000', '+13.4%']
['Usage-based', '$412,000', '$504,000', '+22.3%']
15x slower than PyMuPDF, but it returned the tables as structured data. Each cell is in the right column. If your pipeline needs tabular data, this difference matters more than the speed gap.
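Those list-of-lists tables get easier to work with as dicts keyed by column name. A short sketch using the sample rows from the output above (pdfplumber returns the raw rows; the header-to-dict step is up to you):

```python
# A pdfplumber table is a list of rows; treat the first row as the header
# and zip it against each data row. Sample rows from the output above.
table = [
    ["Category", "Q2 2025", "Q3 2025", "Change"],
    ["Subscription", "$2,890,000", "$3,276,000", "+13.4%"],
    ["Usage-based", "$412,000", "$504,000", "+22.3%"],
]

header, *rows = table
records = [dict(zip(header, row)) for row in rows]

print(records[0]["Q3 2025"])  # $3,276,000
```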
Kreuzberg
import asyncio
from kreuzberg import extract_file
import time
async def main():
    start = time.perf_counter()
    result = await extract_file("financial-report.pdf")
    elapsed = time.perf_counter() - start
    print(f"Kreuzberg: {len(result.content):,} chars in {elapsed:.3f}s")
    print(result.content[:300])

asyncio.run(main())
Output:
Kreuzberg: 47,105 chars in 0.042s
Q3 2025 Financial Summary
Revenue Overview
Total revenue for Q3 2025 reached $4.2M, representing a 23% increase
over Q2. Subscription revenue accounted for 78% of total revenue...
Kreuzberg's async API is clean. The Rust parser is fast — not quite PyMuPDF's C speed, but close. No table extraction, but the text quality is good and reading order is preserved.
Speed benchmarks
I ran each library against three document sets and measured wall-clock time. Machine: M2 MacBook Pro, 16GB RAM, Python 3.12.
Native text PDFs (100 documents, 847 pages total)
| Library | Total time | Pages/second |
|---|---|---|
| PyMuPDF | 0.94s | 901 |
| Kreuzberg | 2.1s | 403 |
| pdfplumber | 14.7s | 58 |
Scanned PDFs (20 documents, 63 pages total)
| Library | Total time | Pages/second |
|---|---|---|
| PyMuPDF | 0.08s* | 788* |
| Kreuzberg | 41.2s | 1.5 |
| pdfplumber | 0.31s* | 203* |
*PyMuPDF and pdfplumber extracted zero useful text from scanned PDFs because they don't do OCR. The times are fast because they're essentially returning empty strings. Kreuzberg is the only one that actually extracted text here, via Tesseract.
Mixed batch (50 documents, 312 pages, ~40% scanned)
| Library | Useful text extracted | Total time |
|---|---|---|
| PyMuPDF | 62% of pages | 0.38s |
| Kreuzberg | 98% of pages | 27.4s |
| pdfplumber | 61% of pages | 5.1s |
This is where Kreuzberg's automatic OCR detection pays off. It identified the scanned pages and ran Tesseract on them without any configuration. PyMuPDF and pdfplumber silently returned empty or garbage text for those pages.
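If you stay with PyMuPDF or pdfplumber, you can approximate that detection yourself: extract a page's text, treat a near-empty result as "probably scanned", and route those pages to your own OCR step. A minimal heuristic — the 20-character threshold is my arbitrary choice, not a standard:

```python
def looks_scanned(page_text: str, min_chars: int = 20) -> bool:
    """Heuristic: a page with a real text layer should yield at least
    a little text. The threshold is an arbitrary cutoff; tune it for
    your documents (headers/footers alone can exceed it)."""
    return len(page_text.strip()) < min_chars

print(looks_scanned(""))    # True: no text layer at all
print(looks_scanned("Q3 2025 Financial Summary\nRevenue Overview"))  # False
```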
Feature comparison
| Feature | PyMuPDF | pdfplumber | Kreuzberg |
|---|---|---|---|
| Text extraction | Excellent | Good | Good |
| Table extraction | Basic (text blocks) | Yes | No |
| OCR (scanned PDFs) | No (manual setup possible) | No | Yes (auto) |
| Image extraction | Yes | Yes | No |
| Annotation support | Yes | No | No |
| PDF writing/editing | Yes | No | No |
| Multi-format support | No | No | Yes (DOCX, PPTX, etc.) |
| Async API | No | No | Yes |
| Memory usage (100 pages) | ~45MB | ~120MB | ~60MB |
| Install size | ~30MB | ~2MB | ~15MB |
When to use each
Use PyMuPDF when:
- Speed is your top priority
- You're processing native text PDFs (not scanned)
- You need PDF editing or annotation features
- You're OK with AGPL licensing (or buy a commercial license)
# Quick script to extract text from a folder of native PDFs
import fitz
from pathlib import Path
for pdf_path in Path("invoices/").glob("*.pdf"):
    doc = fitz.open(str(pdf_path))
    text = "\n".join(page.get_text() for page in doc)
    txt_path = pdf_path.with_suffix(".txt")
    txt_path.write_text(text)
    print(f"{pdf_path.name}: {len(text)} chars")
Use pdfplumber when:
- You need to extract tables with column structure intact
- You're processing invoices, financial reports, or forms
- Speed isn't critical (batch jobs, not real-time)
- You need MIT licensing
# Extract all tables from a PDF into CSV files
import pdfplumber
import csv
with pdfplumber.open("report.pdf") as pdf:
    for i, page in enumerate(pdf.pages):
        for j, table in enumerate(page.extract_tables()):
            with open(f"table_p{i}_t{j}.csv", "w", newline="") as f:
                writer = csv.writer(f)
                writer.writerows(table)
Use Kreuzberg when:
- Your PDFs are a mix of scanned and native
- You need OCR without managing Tesseract yourself
- You're processing multiple file formats (not just PDF)
- You want an async pipeline
# Process a mixed folder of documents (PDFs, DOCX, images)
import asyncio
from kreuzberg import extract_file
from pathlib import Path
async def process_all():
    results = {}
    for path in Path("documents/").iterdir():
        if path.suffix.lower() in (".pdf", ".docx", ".pptx", ".png", ".jpg"):
            result = await extract_file(str(path))
            results[path.name] = result.content
            print(f"{path.name}: {len(result.content)} chars")
    return results

documents = asyncio.run(process_all())
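That loop awaits one file at a time. Since extract_file is a coroutine, you can overlap extractions with asyncio.gather and cap concurrency with a semaphore. A sketch of the pattern, with a stand-in coroutine where Kreuzberg's extract_file would go:

```python
import asyncio

async def extract_stub(path: str) -> str:
    # Stand-in for kreuzberg's extract_file(); returns fake content
    # so the pattern is runnable without the library installed.
    await asyncio.sleep(0)
    return f"contents of {path}"

async def extract_many(paths, limit: int = 8):
    """Run up to `limit` extractions concurrently."""
    sem = asyncio.Semaphore(limit)

    async def one(path):
        async with sem:  # cap concurrent parse/OCR work
            return path, await extract_stub(path)

    return dict(await asyncio.gather(*(one(p) for p in paths)))

results = asyncio.run(extract_many(["a.pdf", "b.docx", "c.png"]))
print(sorted(results))  # ['a.pdf', 'b.docx', 'c.png']
```

The semaphore matters once OCR is involved: Tesseract is CPU-heavy, and unbounded concurrency will oversubscribe the machine rather than speed things up.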
The AGPL question
PyMuPDF's AGPL license trips people up. If you distribute software that uses PyMuPDF, you need to either release your source code under AGPL or buy a commercial license from Artifex. For internal tools and scripts that never leave your company, AGPL is usually fine. For SaaS products, it's a legal conversation you should have.
Kreuzberg and pdfplumber are both MIT-licensed, so there are no copyleft obligations either way.
Combining parsers
Nothing stops you from using two libraries in the same pipeline. A practical pattern I've used:
import fitz
import pdfplumber
def extract_smart(pdf_path: str) -> dict:
    # Use PyMuPDF for fast text extraction
    doc = fitz.open(pdf_path)
    full_text = "\n".join(page.get_text() for page in doc)

    # Use pdfplumber only for pages that contain tables
    tables = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            page_tables = page.extract_tables()
            if page_tables:
                tables.extend(page_tables)

    return {"text": full_text, "tables": tables}
You get PyMuPDF's speed for text and pdfplumber's table detection where you need it.
On the generation side
Extracting data from existing PDFs is half the story. The other half is generating new PDFs from that data — rebuilt invoices, reformatted reports, merged documents.
If your pipeline extracts data from PDFs and then needs to produce new PDFs from the results, LightningPDF's API handles the generation side. Send HTML (with your extracted data templated in), get a PDF back. It pairs well with any of these three parsers — extract with whichever fits your PDFs, generate with an API call:
import requests
import fitz
# Extract data from source PDF
doc = fitz.open("raw-report.pdf")
extracted = "\n".join(page.get_text() for page in doc)
# Generate a clean, branded PDF from the extracted data
html = f"""
<h1>Reformatted Report</h1>
<div style="font-family: sans-serif; line-height: 1.6;">
{extracted.replace(chr(10), '<br>')}
</div>
"""
response = requests.post(
    "https://lightningpdf.dev/api/v1/pdf/generate",
    headers={"Authorization": "Bearer lpdf_your_key"},
    json={"html": html},
)
with open("clean-report.pdf", "wb") as f:
    f.write(response.content)
Extract with the right parser for your PDFs. Generate with an API that handles the rendering.
LightningPDF Team
Building fast, reliable PDF generation tools for developers.