Kreuzberg vs PyMuPDF vs pdfplumber: Which PDF Parser Should You Use?

A head-to-head comparison of Kreuzberg, PyMuPDF, and pdfplumber for Python PDF parsing. Benchmarks, architecture differences, and code examples to help you pick the right tool.

By LightningPDF Team · 6 min read

Last week a coworker asked me which Python PDF parser to use and I answered with a question: "What's in the PDF?"

It's not a dodge. These three libraries — Kreuzberg, PyMuPDF, and pdfplumber — make fundamentally different architectural decisions. Picking the wrong one means either rewriting your pipeline in a month or tolerating extraction bugs that poison your data.

I ran all three against the same set of PDFs and measured speed, accuracy, and developer experience. Here's what I found.

Architecture: Rust FFI vs C binding vs pure Python

Understanding what's under the hood matters because it predicts where each library will break.

PyMuPDF wraps the MuPDF C library via SWIG bindings. MuPDF is maintained by Artifex (the Ghostscript people) and handles rendering, text extraction, and annotation in C. When you call page.get_text(), you're calling into compiled C code that walks the PDF's content stream directly. The Python layer is thin — mostly marshalling data between Python objects and C structs.

pdfplumber is built on top of pdfminer.six, which is pure Python. Every PDF operator — every Tm, Td, TJ — gets interpreted by Python code. pdfplumber adds a spatial analysis layer on top: it groups characters into words, words into lines, and lines into table cells by analyzing x/y coordinates. This is why it understands tables, and also why it's slow.
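pdfplumber exposes those per-character coordinates as page.chars (dicts with keys like text, x0, and top). The grouping idea can be sketched in pure Python — a toy illustration of the coordinate-clustering concept, not pdfplumber's actual algorithm:

```python
# Toy illustration of coordinate-based grouping (NOT pdfplumber's
# real code): cluster characters into lines by y position, then
# order each line left-to-right by x.
def group_chars_into_lines(chars, y_tolerance=2.0):
    """chars: list of dicts with 'text', 'x0', 'top' keys,
    mirroring the shape of pdfplumber's page.chars entries."""
    lines = []  # each entry: (representative_y, [chars in that line])
    for ch in sorted(chars, key=lambda c: (c["top"], c["x0"])):
        for y, bucket in lines:
            if abs(ch["top"] - y) <= y_tolerance:
                bucket.append(ch)
                break
        else:
            lines.append((ch["top"], [ch]))
    return [
        "".join(c["text"] for c in sorted(bucket, key=lambda c: c["x0"]))
        for _, bucket in lines
    ]

chars = [
    {"text": "H", "x0": 10, "top": 50.0},
    {"text": "i", "x0": 16, "top": 50.4},  # same visual line, tiny y jitter
    {"text": "o", "x0": 16, "top": 70.0},
    {"text": "k", "x0": 22, "top": 70.1},
]
print(group_chars_into_lines(chars))  # ['Hi', 'ok']
```

The y-tolerance is what absorbs the sub-pixel jitter real PDFs have; doing this for every character on every page, in Python, is where pdfplumber's time goes.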

Kreuzberg takes a different approach. It's a Python package with a Rust core compiled via PyO3. The Rust layer handles the byte-level PDF parsing and text extraction, while Python handles the orchestration. For scanned PDFs, Kreuzberg shells out to Tesseract OCR automatically — you don't need to detect whether a PDF is scanned or native. It also supports DOCX, PPTX, HTML, and other formats through a single extract_text() call.

Here's what this means in practice:

                  PyMuPDF                    pdfplumber              Kreuzberg
Core language     C (MuPDF)                  Pure Python (pdfminer)  Rust (PyO3)
PDF parsing       MuPDF C library            pdfminer.six            Rust native parser
OCR               Optional (separate setup)  None                    Tesseract (auto-detected)
Table extraction  Basic (text blocks)        Yes (spatial analysis)  No
License           AGPL-3.0                   MIT                     MIT
Other formats     PDF only                   PDF only                DOCX, PPTX, HTML, images, etc.

Installation

All three install via pip, but with different complexity.

# PyMuPDF — ships prebuilt wheels, no system deps
pip install pymupdf

# pdfplumber — pure Python, no compilation
pip install pdfplumber

# Kreuzberg — needs Tesseract for OCR on scanned PDFs
pip install kreuzberg
# On Ubuntu/Debian:
sudo apt install tesseract-ocr

PyMuPDF and pdfplumber are zero-config. Kreuzberg works for native PDFs without Tesseract, but you'll want Tesseract installed for scanned documents — which is the whole point of using Kreuzberg.
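A missing Tesseract binary only bites when the first scanned PDF arrives, so a cheap startup check is worth adding. This is a stdlib-only sketch (ocr_available is my own helper name, not a Kreuzberg API):

```python
import shutil

def ocr_available() -> bool:
    """Return True if a `tesseract` binary is on PATH.

    Run this once at startup so a missing OCR dependency fails
    loudly, instead of silently mid-batch on the first scanned PDF.
    """
    return shutil.which("tesseract") is not None

if not ocr_available():
    print("Warning: Tesseract not found - scanned PDFs will not be OCR'd")
```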

The test: same PDF, three parsers

I used a 12-page financial report (native text, not scanned) with a mix of paragraphs, bullet lists, and two tables per page. Here's the extraction code for each.

PyMuPDF

import fitz
import time

start = time.perf_counter()
doc = fitz.open("financial-report.pdf")
text = ""
for page in doc:
    text += page.get_text()
elapsed = time.perf_counter() - start

print(f"PyMuPDF: {len(text)} chars in {elapsed:.3f}s")
print(text[:300])

Output:

PyMuPDF: 48,231 chars in 0.018s
Q3 2025 Financial Summary
Revenue Overview
Total revenue for Q3 2025 reached $4.2M, representing a 23% increase
over Q2. Subscription revenue accounted for 78% of total revenue...

Fast. The text reads in natural order. But the tables come out as a flat stream of text — column headers and values are interleaved without structure.

pdfplumber

import pdfplumber
import time

start = time.perf_counter()
text = ""
tables_found = []

with pdfplumber.open("financial-report.pdf") as pdf:
    for page in pdf.pages:
        text += page.extract_text() or ""
        for table in page.extract_tables():
            tables_found.append(table)

elapsed = time.perf_counter() - start

print(f"pdfplumber: {len(text)} chars in {elapsed:.3f}s")
print(f"Tables found: {len(tables_found)}")
for row in tables_found[0][:3]:
    print(row)

Output:

pdfplumber: 47,892 chars in 0.284s
Tables found: 24
['Category', 'Q2 2025', 'Q3 2025', 'Change']
['Subscription', '$2,890,000', '$3,276,000', '+13.4%']
['Usage-based', '$412,000', '$504,000', '+22.3%']

Roughly 16x slower than PyMuPDF (0.284s vs 0.018s), but it returned the tables as structured data. Each cell is in the right column. If your pipeline needs tabular data, this difference matters more than the speed gap.
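pdfplumber hands back each table as a list of rows, with the header as the first row. Downstream code usually wants dicts keyed by column name; a small helper (rows_to_records is my own name, not a pdfplumber function) does the conversion:

```python
def rows_to_records(table):
    """Turn a pdfplumber-style table (list of rows, first row =
    header) into a list of dicts. Cells can be None in pdfplumber's
    output, so normalize them to empty strings."""
    header = [(h or "").strip() for h in table[0]]
    return [
        {key: (cell or "").strip() for key, cell in zip(header, row)}
        for row in table[1:]
    ]

table = [
    ["Category", "Q2 2025", "Q3 2025", "Change"],
    ["Subscription", "$2,890,000", "$3,276,000", "+13.4%"],
    ["Usage-based", "$412,000", "$504,000", "+22.3%"],
]
records = rows_to_records(table)
print(records[0]["Change"])  # +13.4%
```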

Kreuzberg

import asyncio
from kreuzberg import extract_file
import time

async def main():
    start = time.perf_counter()
    result = await extract_file("financial-report.pdf")
    elapsed = time.perf_counter() - start
    print(f"Kreuzberg: {len(result.content)} chars in {elapsed:.3f}s")
    print(result.content[:300])

asyncio.run(main())

Output:

Kreuzberg: 47,105 chars in 0.042s
Q3 2025 Financial Summary
Revenue Overview
Total revenue for Q3 2025 reached $4.2M, representing a 23% increase
over Q2. Subscription revenue accounted for 78% of total revenue...

Kreuzberg's async API is clean. The Rust parser is fast — not quite PyMuPDF's C speed, but close. No table extraction, but the text quality is good and reading order is preserved.

Speed benchmarks

I ran each library against three document sets and measured wall-clock time. Machine: M2 MacBook Pro, 16GB RAM, Python 3.12.
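For anyone who wants to reproduce numbers like these on their own corpus, the measurement loop is simple. A sketch of the harness I'd use (parse_fn is any callable taking a path; a stub stands in here so the snippet runs anywhere):

```python
import time

def benchmark(parse_fn, paths, total_pages):
    """Wall-clock a parser callable over a list of files and
    return (elapsed_seconds, pages_per_second)."""
    start = time.perf_counter()
    for path in paths:
        parse_fn(path)
    elapsed = time.perf_counter() - start
    return elapsed, total_pages / elapsed

# Stub parser so the harness is self-contained; in a real run,
# swap in e.g. a function that calls fitz.open(path) and get_text().
elapsed, pps = benchmark(lambda p: None, ["a.pdf", "b.pdf"], total_pages=12)
print(f"{elapsed:.3f}s, {pps:.0f} pages/s")
```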

Native text PDFs (100 documents, 847 pages total)

Library     Total time  Pages/second
PyMuPDF     0.94s       901
Kreuzberg   2.1s        403
pdfplumber  14.7s       58

Scanned PDFs (20 documents, 63 pages total)

Library     Total time  Pages/second
PyMuPDF     0.08s*      788*
Kreuzberg   41.2s       1.5
pdfplumber  0.31s*      203*

*PyMuPDF and pdfplumber extracted zero useful text from scanned PDFs because they don't do OCR. The times are fast because they're essentially returning empty strings. Kreuzberg is the only one that actually extracted text here, via Tesseract.

Mixed batch (50 documents, 312 pages, ~40% scanned)

Library     Useful text extracted  Total time
PyMuPDF     62% of pages           0.38s
Kreuzberg   98% of pages           27.4s
pdfplumber  61% of pages           5.1s

This is where Kreuzberg's automatic OCR detection pays off. It identified the scanned pages and ran Tesseract on them without any configuration. PyMuPDF and pdfplumber silently returned empty or garbage text for those pages.
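If you stay with PyMuPDF or pdfplumber, the manual equivalent of that detection is a per-page heuristic: a page whose text layer yields almost nothing is probably a scanned image. A minimal sketch (the 25-character threshold is arbitrary — tune it for your corpus):

```python
def looks_scanned(page_text: str, min_chars: int = 25) -> bool:
    """Heuristic scanned-page detector: if extracting the text
    layer produced almost no characters, the page is likely a
    scanned image that needs OCR rather than native text."""
    return len(page_text.strip()) < min_chars

print(looks_scanned(""))                                          # True
print(looks_scanned("Total revenue for Q3 2025 reached $4.2M."))  # False
```

Pages flagged this way can then be routed to an OCR step, which is essentially what Kreuzberg automates.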

Feature comparison

Feature                   PyMuPDF                     pdfplumber  Kreuzberg
Text extraction           Excellent                   Good        Good
Table extraction          Basic (text blocks)         Yes         No
OCR (scanned PDFs)        No (manual setup possible)  No          Yes (auto)
Image extraction          Yes                         Yes         No
Annotation support        Yes                         No          No
PDF writing/editing       Yes                         No          No
Multi-format support      No                          No          Yes (DOCX, PPTX, etc.)
Async API                 No                          No          Yes
Memory usage (100 pages)  ~45MB                       ~120MB      ~60MB
Install size              ~30MB                       ~2MB        ~15MB

When to use each

Use PyMuPDF when:

  • Speed is your top priority
  • You're processing native text PDFs (not scanned)
  • You need PDF editing or annotation features
  • You're OK with AGPL licensing (or buy a commercial license)
# Quick script to extract text from a folder of native PDFs
import fitz
from pathlib import Path

for pdf_path in Path("invoices/").glob("*.pdf"):
    doc = fitz.open(str(pdf_path))
    text = "\n".join(page.get_text() for page in doc)
    txt_path = pdf_path.with_suffix(".txt")
    txt_path.write_text(text)
    print(f"{pdf_path.name}: {len(text)} chars")

Use pdfplumber when:

  • You need to extract tables with column structure intact
  • You're processing invoices, financial reports, or forms
  • Speed isn't critical (batch jobs, not real-time)
  • You need MIT licensing
# Extract all tables from a PDF into CSV files
import pdfplumber
import csv

with pdfplumber.open("report.pdf") as pdf:
    for i, page in enumerate(pdf.pages):
        for j, table in enumerate(page.extract_tables()):
            with open(f"table_p{i}_t{j}.csv", "w", newline="") as f:
                writer = csv.writer(f)
                writer.writerows(table)

Use Kreuzberg when:

  • Your PDFs are a mix of scanned and native
  • You need OCR without managing Tesseract yourself
  • You're processing multiple file formats (not just PDF)
  • You want an async pipeline
# Process a mixed folder of documents (PDFs, DOCX, images)
import asyncio
from kreuzberg import extract_file
from pathlib import Path

async def process_all():
    results = {}
    for path in Path("documents/").iterdir():
        if path.suffix.lower() in (".pdf", ".docx", ".pptx", ".png", ".jpg"):
            result = await extract_file(str(path))
            results[path.name] = result.content
            print(f"{path.name}: {len(result.content)} chars")
    return results

documents = asyncio.run(process_all())

The AGPL question

PyMuPDF's AGPL license trips people up. If you distribute software that uses PyMuPDF, you need to either release your source code under AGPL or buy a commercial license from Artifex. For internal tools and scripts that never leave your company, AGPL is usually fine. For SaaS products, it's a legal conversation you should have.

Kreuzberg and pdfplumber are both MIT-licensed. No restrictions.

Combining parsers

Nothing stops you from using two libraries in the same pipeline. A practical pattern I've used:

import fitz
import pdfplumber

def extract_smart(pdf_path: str) -> dict:
    # Use PyMuPDF for fast text extraction
    doc = fitz.open(pdf_path)
    full_text = "\n".join(page.get_text() for page in doc)

    # Use pdfplumber only for pages that contain tables
    tables = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            page_tables = page.extract_tables()
            if page_tables:
                tables.extend(page_tables)

    return {"text": full_text, "tables": tables}

You get PyMuPDF's speed for text and pdfplumber's table detection where you need it.

On the generation side

Extracting data from existing PDFs is half the story. The other half is generating new PDFs from that data — rebuilt invoices, reformatted reports, merged documents.

If your pipeline extracts data from PDFs and then needs to produce new PDFs from the results, LightningPDF's API handles the generation side. Send HTML (with your extracted data templated in), get a PDF back. It pairs well with any of these three parsers — extract with whichever fits your PDFs, generate with an API call:

import html
import requests
import fitz

# Extract data from source PDF
doc = fitz.open("raw-report.pdf")
extracted = "\n".join(page.get_text() for page in doc)

# Escape the extracted text so a stray < or & can't break the markup,
# then generate a clean, branded PDF from it
body = html.escape(extracted).replace("\n", "<br>")
page_html = f"""
<h1>Reformatted Report</h1>
<div style="font-family: sans-serif; line-height: 1.6;">
  {body}
</div>
"""

response = requests.post(
    "https://lightningpdf.dev/api/v1/pdf/generate",
    headers={"Authorization": "Bearer lpdf_your_key"},
    json={"html": page_html}
)

with open("clean-report.pdf", "wb") as f:
    f.write(response.content)

Extract with the right parser for your PDFs. Generate with an API that handles the rendering.
