Kreuzberg vs PyMuPDF vs pdfplumber: Which PDF Parser Should You Use?
A head-to-head comparison of Kreuzberg, PyMuPDF, and pdfplumber for Python PDF parsing. Benchmarks, architecture differences, and code examples to help you pick the right tool.
Last week a coworker asked me which Python PDF parser to use and I answered with a question: "What's in the PDF?"
It's not a dodge. These three libraries — Kreuzberg, PyMuPDF, and pdfplumber — make fundamentally different architectural decisions. Picking the wrong one means either rewriting your pipeline in a month or tolerating extraction bugs that poison your data.
I ran all three against the same set of PDFs and measured speed, accuracy, and developer experience. Here's what I found.
Architecture: Rust FFI vs C binding vs pure Python
Understanding what's under the hood matters because it predicts where each library will break.
PyMuPDF wraps the MuPDF C library via SWIG bindings. MuPDF is maintained by Artifex (the Ghostscript people) and handles rendering, text extraction, and annotation in C. When you call page.get_text(), you're calling into compiled C code that walks the PDF's content stream directly. The Python layer is thin — mostly marshalling data between Python objects and C structs.
pdfplumber is built on top of pdfminer.six, which is pure Python. Every PDF operator — every Tm, Td, TJ — gets interpreted by Python code. pdfplumber adds a spatial analysis layer on top: it groups characters into words, words into lines, and lines into table cells by analyzing x/y coordinates. This is why it understands tables but why it's slow.
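That coordinate-based grouping is easy to picture with a toy sketch. The char dicts below mimic the shape of pdfplumber's `page.chars` entries (real entries carry many more keys), and the gap rule is my simplification of the idea, not pdfplumber's actual algorithm:

```python
# Toy sketch of coordinate-based word grouping — the idea behind
# pdfplumber's spatial layer, NOT its actual implementation.
chars = [
    {"text": "T", "x0": 10.0, "x1": 16.0},
    {"text": "o", "x0": 16.2, "x1": 21.8},
    {"text": "t", "x0": 22.0, "x1": 26.0},
    {"text": "a", "x0": 26.1, "x1": 31.5},
    {"text": "l", "x0": 31.6, "x1": 34.0},
    {"text": "$", "x0": 60.0, "x1": 66.0},  # big x-gap -> new word
    {"text": "4", "x0": 66.1, "x1": 72.0},
]

def group_words(chars, max_gap=3.0):
    """Merge adjacent characters into words when the horizontal gap is small."""
    words, current = [], chars[0]["text"]
    for prev, cur in zip(chars, chars[1:]):
        if cur["x0"] - prev["x1"] <= max_gap:
            current += cur["text"]
        else:
            words.append(current)
            current = cur["text"]
    words.append(current)
    return words

print(group_words(chars))  # ['Total', '$4']
```

Every one of those comparisons runs in interpreted Python for every character on every page, which is where the slowdown comes from.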
Kreuzberg takes a different approach. It's a Python package with a Rust core compiled via PyO3. The Rust layer handles the byte-level PDF parsing and text extraction, while Python handles the orchestration. For scanned PDFs, Kreuzberg shells out to Tesseract OCR automatically — you don't need to detect whether a PDF is scanned or native. It also supports DOCX, PPTX, HTML, and other formats through the same extract_file() call.
Here's what this means in practice:
| | PyMuPDF | pdfplumber | Kreuzberg |
|---|---|---|---|
| Core language | C (MuPDF) | Pure Python (pdfminer) | Rust (PyO3) |
| PDF parsing | MuPDF C library | pdfminer.six | Rust native parser |
| OCR | Optional (separate setup) | None | Tesseract (auto-detected) |
| Table extraction | Basic (text blocks) | Yes (spatial analysis) | No |
| License | AGPL-3.0 | MIT | MIT |
| Other formats | PDF only | PDF only | DOCX, PPTX, HTML, images, etc. |
Installation
All three install via pip, but with different complexity.
# PyMuPDF — ships prebuilt wheels, no system deps
pip install pymupdf
# pdfplumber — pure Python, no compilation
pip install pdfplumber
# Kreuzberg — needs Tesseract for OCR on scanned PDFs
pip install kreuzberg
# On Ubuntu/Debian:
sudo apt install tesseract-ocr
PyMuPDF and pdfplumber are zero-config. Kreuzberg works for native PDFs without Tesseract, but you'll want Tesseract installed for scanned documents, since automatic OCR is one of the main reasons to reach for Kreuzberg in the first place.
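Because Kreuzberg invokes the Tesseract binary at runtime, it's worth failing fast if it's missing rather than discovering that mid-batch. A stdlib-only check (the helper name is mine):

```python
import shutil

def tesseract_available() -> bool:
    """Return True if the tesseract binary is on PATH."""
    return shutil.which("tesseract") is not None

if not tesseract_available():
    print("Tesseract not found: OCR on scanned PDFs will be unavailable")
```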
The test: same PDF, three parsers
I used a 12-page financial report (native text, not scanned) with a mix of paragraphs, bullet lists, and two tables per page. Here's the extraction code for each.
PyMuPDF
import fitz
import time
start = time.perf_counter()
doc = fitz.open("financial-report.pdf")
text = ""
for page in doc:
    text += page.get_text()
elapsed = time.perf_counter() - start
print(f"PyMuPDF: {len(text):,} chars in {elapsed:.3f}s")
print(text[:300])
Output:
PyMuPDF: 48,231 chars in 0.018s
Q3 2025 Financial Summary
Revenue Overview
Total revenue for Q3 2025 reached $4.2M, representing a 23% increase
over Q2. Subscription revenue accounted for 78% of total revenue...
Fast. The text reads in natural order. But the tables come out as a flat stream of text — column headers and values are interleaved without structure.
pdfplumber
import pdfplumber
import time
start = time.perf_counter()
text = ""
tables_found = []
with pdfplumber.open("financial-report.pdf") as pdf:
    for page in pdf.pages:
        text += page.extract_text() or ""
        for table in page.extract_tables():
            tables_found.append(table)
elapsed = time.perf_counter() - start
print(f"pdfplumber: {len(text):,} chars in {elapsed:.3f}s")
print(f"Tables found: {len(tables_found)}")
for row in tables_found[0][:3]:
    print(row)
Output:
pdfplumber: 47,892 chars in 0.284s
Tables found: 24
['Category', 'Q2 2025', 'Q3 2025', 'Change']
['Subscription', '$2,890,000', '$3,276,000', '+13.4%']
['Usage-based', '$412,000', '$504,000', '+22.3%']
15x slower than PyMuPDF, but it returned the tables as structured data. Each cell is in the right column. If your pipeline needs tabular data, this difference matters more than the speed gap.
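Those list-of-lists tables get easier to work with as dicts keyed by column name. A short sketch using the sample rows from the output above (pdfplumber returns the raw rows; the header-to-dict step is up to you):

```python
# A pdfplumber table is a list of rows; treat the first row as the header
# and zip it against each data row. Sample rows from the output above.
table = [
    ["Category", "Q2 2025", "Q3 2025", "Change"],
    ["Subscription", "$2,890,000", "$3,276,000", "+13.4%"],
    ["Usage-based", "$412,000", "$504,000", "+22.3%"],
]

header, *rows = table
records = [dict(zip(header, row)) for row in rows]

print(records[0]["Q3 2025"])  # $3,276,000
```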
Kreuzberg
import asyncio
from kreuzberg import extract_file
import time
async def main():
    start = time.perf_counter()
    result = await extract_file("financial-report.pdf")
    elapsed = time.perf_counter() - start
    print(f"Kreuzberg: {len(result.content):,} chars in {elapsed:.3f}s")
    print(result.content[:300])

asyncio.run(main())
Output:
Kreuzberg: 47,105 chars in 0.042s
Q3 2025 Financial Summary
Revenue Overview
Total revenue for Q3 2025 reached $4.2M, representing a 23% increase
over Q2. Subscription revenue accounted for 78% of total revenue...
Kreuzberg's async API is clean. The Rust parser is fast — not quite PyMuPDF's C speed, but close. No table extraction, but the text quality is good and reading order is preserved.
Speed benchmarks
I ran each library against three document sets and measured wall-clock time. Machine: M2 MacBook Pro, 16GB RAM, Python 3.12.
Native text PDFs (100 documents, 847 pages total)
| Library | Total time | Pages/second |
|---|---|---|
| PyMuPDF | 0.94s | 901 |
| Kreuzberg | 2.1s | 403 |
| pdfplumber | 14.7s | 58 |
Scanned PDFs (20 documents, 63 pages total)
| Library | Total time | Pages/second |
|---|---|---|
| PyMuPDF | 0.08s* | 788* |
| Kreuzberg | 41.2s | 1.5 |
| pdfplumber | 0.31s* | 203* |
*PyMuPDF and pdfplumber extracted zero useful text from scanned PDFs because they don't do OCR. The times are fast because they're essentially returning empty strings. Kreuzberg is the only one that actually extracted text here, via Tesseract.
Mixed batch (50 documents, 312 pages, ~40% scanned)
| Library | Useful text extracted | Total time |
|---|---|---|
| PyMuPDF | 62% of pages | 0.38s |
| Kreuzberg | 98% of pages | 27.4s |
| pdfplumber | 61% of pages | 5.1s |
This is where Kreuzberg's automatic OCR detection pays off. It identified the scanned pages and ran Tesseract on them without any configuration. PyMuPDF and pdfplumber silently returned empty or garbage text for those pages.
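If you stay with PyMuPDF or pdfplumber, you can approximate that detection yourself: extract a page's text, treat a near-empty result as "probably scanned", and route those pages to your own OCR step. A minimal heuristic — the 20-character threshold is my arbitrary choice, not a standard:

```python
def looks_scanned(page_text: str, min_chars: int = 20) -> bool:
    """Heuristic: a page with a real text layer should yield at least
    a little text. The threshold is an arbitrary cutoff; tune it for
    your documents (headers/footers alone can exceed it)."""
    return len(page_text.strip()) < min_chars

print(looks_scanned(""))    # True: no text layer at all
print(looks_scanned("Q3 2025 Financial Summary\nRevenue Overview"))  # False
```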
Feature comparison
| Feature | PyMuPDF | pdfplumber | Kreuzberg |
|---|---|---|---|
| Text extraction | Excellent | Good | Good |
| Table extraction | Basic (text blocks) | Yes | No |
| OCR (scanned PDFs) | No (manual setup possible) | No | Yes (auto) |
| Image extraction | Yes | Yes | No |
| Annotation support | Yes | No | No |
| PDF writing/editing | Yes | No | No |
| Multi-format support | No | No | Yes (DOCX, PPTX, etc.) |
| Async API | No | No | Yes |
| Memory usage (100 pages) | ~45MB | ~120MB | ~60MB |
| Install size | ~30MB | ~2MB | ~15MB |
When to use each
Use PyMuPDF when:
- Speed is your top priority
- You're processing native text PDFs (not scanned)
- You need PDF editing or annotation features
- You're OK with AGPL licensing (or buy a commercial license)
# Quick script to extract text from a folder of native PDFs
import fitz
from pathlib import Path
for pdf_path in Path("invoices/").glob("*.pdf"):
    doc = fitz.open(str(pdf_path))
    text = "\n".join(page.get_text() for page in doc)
    txt_path = pdf_path.with_suffix(".txt")
    txt_path.write_text(text)
    print(f"{pdf_path.name}: {len(text)} chars")
Use pdfplumber when:
- You need to extract tables with column structure intact
- You're processing invoices, financial reports, or forms
- Speed isn't critical (batch jobs, not real-time)
- You need MIT licensing
# Extract all tables from a PDF into CSV files
import pdfplumber
import csv
with pdfplumber.open("report.pdf") as pdf:
    for i, page in enumerate(pdf.pages):
        for j, table in enumerate(page.extract_tables()):
            with open(f"table_p{i}_t{j}.csv", "w", newline="") as f:
                writer = csv.writer(f)
                writer.writerows(table)
Use Kreuzberg when:
- Your PDFs are a mix of scanned and native
- You need OCR without managing Tesseract yourself
- You're processing multiple file formats (not just PDF)
- You want an async pipeline
# Process a mixed folder of documents (PDFs, DOCX, images)
import asyncio
from kreuzberg import extract_file
from pathlib import Path
async def process_all():
    results = {}
    for path in Path("documents/").iterdir():
        if path.suffix.lower() in (".pdf", ".docx", ".pptx", ".png", ".jpg"):
            result = await extract_file(str(path))
            results[path.name] = result.content
            print(f"{path.name}: {len(result.content)} chars")
    return results

documents = asyncio.run(process_all())
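That loop awaits one file at a time. Since extract_file is a coroutine, you can overlap extractions with asyncio.gather and cap concurrency with a semaphore. A sketch of the pattern, with a stand-in coroutine where Kreuzberg's extract_file would go:

```python
import asyncio

async def extract_stub(path: str) -> str:
    # Stand-in for kreuzberg's extract_file(); returns fake content
    # so the pattern is runnable without the library installed.
    await asyncio.sleep(0)
    return f"contents of {path}"

async def extract_many(paths, limit: int = 8):
    """Run up to `limit` extractions concurrently."""
    sem = asyncio.Semaphore(limit)

    async def one(path):
        async with sem:  # cap concurrent parse/OCR work
            return path, await extract_stub(path)

    return dict(await asyncio.gather(*(one(p) for p in paths)))

results = asyncio.run(extract_many(["a.pdf", "b.docx", "c.png"]))
print(sorted(results))  # ['a.pdf', 'b.docx', 'c.png']
```

The semaphore matters once OCR is involved: Tesseract is CPU-heavy, and unbounded concurrency will oversubscribe the machine rather than speed things up.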
The AGPL question
PyMuPDF's AGPL license trips people up. If you distribute software that uses PyMuPDF, you need to either release your source code under AGPL or buy a commercial license from Artifex. For internal tools and scripts that never leave your company, AGPL is usually fine. For SaaS products, it's a legal conversation you should have.
Kreuzberg and pdfplumber are both MIT-licensed, so there are no copyleft obligations either way.
Combining parsers
Nothing stops you from using two libraries in the same pipeline. A practical pattern I've used:
import fitz
import pdfplumber
def extract_smart(pdf_path: str) -> dict:
    # Use PyMuPDF for fast text extraction
    doc = fitz.open(pdf_path)
    full_text = "\n".join(page.get_text() for page in doc)

    # Use pdfplumber only for pages that contain tables
    tables = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            page_tables = page.extract_tables()
            if page_tables:
                tables.extend(page_tables)

    return {"text": full_text, "tables": tables}
You get PyMuPDF's speed for text and pdfplumber's table detection where you need it.
On the generation side
Extracting data from existing PDFs is half the story. The other half is generating new PDFs from that data — rebuilt invoices, reformatted reports, merged documents.
If your pipeline extracts data from PDFs and then needs to produce new PDFs from the results, LightningPDF's API handles the generation side. Send HTML (with your extracted data templated in), get a PDF back. It pairs well with any of these three parsers — extract with whichever fits your PDFs, generate with an API call:
import requests
import fitz
# Extract data from source PDF
doc = fitz.open("raw-report.pdf")
extracted = "\n".join(page.get_text() for page in doc)
# Generate a clean, branded PDF from the extracted data
html = f"""
<h1>Reformatted Report</h1>
<div style="font-family: sans-serif; line-height: 1.6;">
{extracted.replace(chr(10), '<br>')}
</div>
"""
response = requests.post(
    "https://lightningpdf.dev/api/v1/pdf/generate",
    headers={"Authorization": "Bearer lpdf_your_key"},
    json={"html": html},
)
with open("clean-report.pdf", "wb") as f:
    f.write(response.content)
Extract with the right parser for your PDFs. Generate with an API that handles the rendering.
LightningPDF Team
Building fast, reliable PDF generation tools for developers.