How to Extract Text from PDFs in Python (Without Losing Your Mind)

A practical guide to extracting text from PDFs in Python. Covers PyMuPDF, pdfplumber, and when you should skip extraction entirely and just generate a new PDF.

By LightningPDF Team · 5 min read

I've spent more hours than I'd like to admit trying to extract text from PDFs. It always starts the same way: "How hard can it be? It's just text in a document."

Then you open the PDF spec and realize it's 756 pages long, the "text" is actually a sequence of glyph positioning commands, and half your PDFs were scanned on a fax machine in 2003.

Here's what actually works in 2026, what doesn't, and when you should stop extracting text and just generate a new PDF from scratch.

The quick version

If you just need something that works for most PDFs:

import fitz  # PyMuPDF

doc = fitz.open("report.pdf")
text = ""
for page in doc:
    text += page.get_text()

print(text[:500])

Install it with pip install pymupdf. That's it. For 80% of PDFs — the ones with actual text layers, not scanned images — this is all you need.

The three Python libraries worth using

I've tested about a dozen Python PDF libraries over the years. Three are worth your time. The rest are either abandoned, painfully slow, or produce garbled output.

1. PyMuPDF (fitz) — the fast one

PyMuPDF wraps the MuPDF C library. It's fast, handles most PDFs correctly, and doesn't make you think about PDF internals.

import fitz

doc = fitz.open("invoice.pdf")

for page_num, page in enumerate(doc):
    text = page.get_text("text")
    print(f"--- Page {page_num + 1} ---")
    print(text)

When to use it: You need speed. You're processing hundreds of PDFs. You don't care about table structure.

When to skip it: You need to extract tables with their column alignment intact. PyMuPDF gives you the text but doesn't understand table semantics.

Speed: I've clocked it at around 50-100 pages per second on a typical laptop. A 200-page report takes about 2-3 seconds.

2. pdfplumber — the table one

pdfplumber is built on top of pdfminer and adds table detection. If your PDFs have tables and you need the data in rows and columns, this is the one.

import pdfplumber

with pdfplumber.open("financial-report.pdf") as pdf:
    for page in pdf.pages:
        # Plain text
        text = page.extract_text()

        # Tables as lists of lists
        tables = page.extract_tables()
        for table in tables:
            for row in table:
                print(row)

When to use it: Invoices, financial reports, anything with tabular data.

When to skip it: Speed matters. pdfplumber is 5-10x slower than PyMuPDF because it does a lot more work analyzing the page layout.
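extract_tables() hands you each table as a list of rows (lists of cell strings, with None for empty cells). A common next step is turning the first row into headers; here's a minimal sketch on a hand-made table standing in for real pdfplumber output:

```python
# Hand-made stand-in for what page.extract_tables() returns:
# each table is a list of rows, and empty cells come back as None.
table = [
    ["Item", "Qty", "Price"],
    ["Widget", "2", "9.50"],
    ["Gadget", None, "4.00"],
]

# Treat the first row as headers and build one dict per data row,
# normalizing None cells to empty strings.
header, *rows = table
records = [
    {key: (cell or "").strip() for key, cell in zip(header, row)}
    for row in rows
]
print(records)
```

From here, records drops straight into csv.DictWriter or a pandas DataFrame.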

3. Marker — the AI-powered one

Marker is the new kid. It uses deep learning models to convert PDFs to Markdown, handling complex layouts, headers, footnotes, and even equations. It's what you want for academic papers and complex reports.

pip install marker-pdf
marker_single input.pdf output/ --output_format markdown

When to use it: Complex layouts. Research papers. Anything where PyMuPDF gives you a jumbled mess of headers and footnotes mixed together.

When to skip it: You need real-time extraction (it's slow), you're processing thousands of PDFs (it needs a GPU for reasonable speed), or your PDFs are simple.

The extraction quality problem

Here's something nobody tells you upfront: the quality of text extraction depends almost entirely on how the PDF was created, not on which library you use.

Native text PDFs (created by Word, Chrome, LaTeX): All three libraries extract text accurately. PyMuPDF is fastest.

Scanned PDFs (images of text): You need OCR. None of the libraries above do this well on their own. You need Tesseract or a cloud OCR service.

PDFs with weird fonts (custom encoding, subset fonts): This is where things get ugly. The text might look fine visually but extract as Ǝǩǭ˩ǠˢǞǡ because the font uses a custom character mapping. No library handles all of these correctly.

PDFs exported from design tools (Illustrator, InDesign): Text is often converted to outlines (vector paths). There's no text to extract — it's literally just shapes. You need OCR even though the PDF isn't a scan.

When to stop extracting and just generate

I see a lot of developers doing this: they receive data (from an API, database, or form submission), generate a PDF, and then later need to extract that data back out of the PDF.

This is like writing a letter, photographing it, and then running OCR to read it back. You already have the data. You don't need extraction — you need to keep the data and the PDF separate.

If you're building a pipeline that goes:

data → generate PDF → store PDF → extract data from PDF

Just skip the last step. Store the original data alongside the PDF.
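Storing the two side by side can be as simple as writing a JSON file next to each PDF. A minimal sketch; the directory layout and the doc_id naming are just one possible convention, and the PDF bytes here are a placeholder:

```python
import json
from pathlib import Path

def store_document(record: dict, pdf_bytes: bytes, out_dir: Path, doc_id: str):
    """Write the machine-readable record next to the rendered PDF,
    so the data never has to be extracted back out of the PDF."""
    out_dir.mkdir(parents=True, exist_ok=True)
    (out_dir / f"{doc_id}.json").write_text(json.dumps(record))
    (out_dir / f"{doc_id}.pdf").write_bytes(pdf_bytes)

store_document(
    {"invoice": "2024-001", "total": 49.00},
    b"%PDF-1.4 ...",  # in practice: the bytes your generator returned
    Path("documents"),
    "invoice-2024-001",
)
```

When someone later asks "what was the total on that invoice?", you read the JSON, not the PDF.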

And if you need to generate the PDF in the first place, a simple API call is all it takes:

import requests

response = requests.post(
    "https://lightningpdf.dev/api/v1/pdf/download",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={
        "html": "<h1>Invoice #2024-001</h1><p>Total: $49.00</p>"
    }
)

with open("invoice.pdf", "wb") as f:
    f.write(response.content)

This generates a pixel-perfect PDF from HTML in under a second. No headless browser to manage, no Puppeteer memory leaks, no Playwright Docker images.

The practical decision tree

Here's how I decide which approach to use:

  1. Do you already have the source data? → Don't extract. Generate a new PDF with LightningPDF.
  2. Is the PDF a simple text document? → PyMuPDF. Fast and accurate.
  3. Does the PDF have tables you need? → pdfplumber. Slower but preserves structure.
  4. Is the PDF complex (academic, multi-column)? → Marker. Slow but handles layout.
  5. Is the PDF a scan? → You need OCR. Try Tesseract first, then Google Document AI if Tesseract isn't good enough.
  6. Is the text garbled after extraction? → The PDF has custom font encoding. You might be stuck. Try a different library, or if possible, re-request the document in a different format.
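For the garbled-output case, a rough programmatic check is to measure how much of the extracted text falls outside common printable characters. A heuristic sketch; the 20% threshold is arbitrary, and it only makes sense for documents you expect to be mostly ASCII:

```python
import string

def looks_garbled(text, threshold=0.2):
    """Flag text where an unusually large share of characters falls
    outside ASCII letters, digits, punctuation, and whitespace."""
    if not text:
        return False
    printable = set(string.printable)
    bad = sum(1 for ch in text if ch not in printable)
    return bad / len(text) > threshold

print(looks_garbled("Total due: $49.00"))  # False
print(looks_garbled("Ǝǩǭ˩ǠˢǞǡ Ǝǩǭ˩ǠˢǞǡ"))  # True
```

Wiring this into your pipeline lets you quarantine suspect extractions for manual review instead of silently storing gibberish.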

What I'd do today

If I were building a document pipeline from scratch in 2026:

  • Generation: LightningPDF API for creating PDFs from HTML/templates
  • Simple extraction: PyMuPDF for reading text out of existing PDFs
  • Table extraction: pdfplumber when I need tabular data
  • OCR: Google Document AI for scanned documents (it's pricey but accurate)
  • Storage: Keep the source data separate from the PDF so I never need to extract it again

The PDF format was designed for printing, not for data interchange. Every hour you spend fighting PDF extraction is an hour you could spend building your actual product. Use the right tool for each job and move on.


Need to generate PDFs from HTML, not extract from them? Try LightningPDF free — paste HTML, get a PDF back in milliseconds. 100 PDFs/month, no credit card required.

