Extract Tables from PDFs: 5 Methods That Actually Work

A hands-on comparison of five ways to extract tables from PDFs in Python: pdfplumber, Camelot, Tabula, AWS Textract, and manual regex. With code, benchmarks, and honest pros and cons.

By LightningPDF Team · 5 min read

I had a PDF with a 47-row financial table. I needed it in a DataFrame. "There must be a library for this," I thought. Four hours later I was manually adjusting column detection thresholds and questioning my career choices.

Table extraction from PDFs is hard because PDFs don't have tables. They have lines and text placed at specific coordinates. What looks like a table to your eyes is, to a PDF parser, just a bunch of text blocks that happen to be aligned. Some PDFs draw borders (which helps). Some don't (which means the parser has to guess where columns start and end based on whitespace).

Here are five approaches I've used in production, with code, benchmarks, and an honest assessment of each.

Test setup

I ran each method against three types of PDF tables:

  • Clean: 20-row table with visible borders, exported from Excel
  • Borderless: 35-row table with no grid lines, exported from Google Sheets
  • Messy: 47-row financial table with merged cells, subtotal rows, and footnotes

Accuracy = percentage of cells extracted correctly. Timing on an M2 MacBook Air.

Method 1: pdfplumber

pdfplumber is my default. It uses the positions of lines and text characters to detect table boundaries, and it works especially well when tables have visible borders.

import pdfplumber
import pandas as pd

def extract_tables_plumber(pdf_path: str) -> list[pd.DataFrame]:
    tables = []

    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            page_tables = page.extract_tables({
                "vertical_strategy": "lines",
                "horizontal_strategy": "lines",
                "snap_tolerance": 5,
            })

            for table in page_tables:
                # First row is usually headers
                if len(table) > 1:
                    df = pd.DataFrame(table[1:], columns=table[0])
                    # Clean up whitespace (DataFrame.map needs pandas >= 2.1;
                    # use .applymap on older versions)
                    df = df.map(lambda x: x.strip() if isinstance(x, str) else x)
                    tables.append(df)

    return tables

dfs = extract_tables_plumber("financial_report.pdf")
for i, df in enumerate(dfs):
    print(f"Table {i + 1}: {len(df)} rows x {len(df.columns)} columns")
    print(df.head())
    print()

For borderless tables, switch to text-based detection:

with pdfplumber.open(pdf_path) as pdf:
    page = pdf.pages[0]
    table = page.extract_table({
        "vertical_strategy": "text",
        "horizontal_strategy": "text",
        "min_words_vertical": 3,
        "min_words_horizontal": 1,
    })

Results:

PDF type          Accuracy  Time
Clean (bordered)  100%      0.3s
Borderless        85%       0.4s
Messy             60%       0.5s

Pros: Pure Python, no Java dependency, good API, handles most bordered tables perfectly.

Cons: Struggles with borderless tables and merged cells. The text strategy for column detection is hit-or-miss.

Install: pip install pdfplumber

Method 2: Camelot

Camelot is the library that takes table extraction seriously. It has two modes: lattice (for tables with lines) and stream (for tables without). The stream mode uses a clustering algorithm to detect column positions from text alignment.

import camelot

def extract_tables_camelot(pdf_path: str, pages: str = "all") -> list:
    # Try lattice first (bordered tables)
    tables = camelot.read_pdf(
        pdf_path,
        pages=pages,
        flavor="lattice",
        line_scale=40
    )

    if len(tables) == 0:
        # Fall back to stream (borderless tables)
        tables = camelot.read_pdf(
            pdf_path,
            pages=pages,
            flavor="stream",
            edge_tol=50,
            row_tol=10
        )

    results = []
    for table in tables:
        df = table.df
        # Use first row as header
        df.columns = df.iloc[0]
        df = df[1:].reset_index(drop=True)
        results.append({
            "dataframe": df,
            "accuracy": table.accuracy,
            "page": table.page,
        })

    return results

results = extract_tables_camelot("financial_report.pdf", pages="1-3")
for r in results:
    print(f"Page {r['page']}, accuracy: {r['accuracy']:.1f}%")
    print(r["dataframe"].head())

Camelot's killer feature is the accuracy score. It tells you how confident it is about the extraction. If accuracy drops below 80%, you know to review that table manually.
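As a sketch of how I use that score in a pipeline: Camelot exposes the numbers through each table's `parsing_report` dict, and a small helper can split results into "accept" and "review" buckets. The `triage_tables` name and the 80% threshold are my own conventions, not part of Camelot.

```python
def triage_tables(reports: list[dict], threshold: float = 80.0) -> tuple[list[int], list[int]]:
    """Split Camelot parsing reports into pages to accept automatically
    and pages to review by hand, based on the reported accuracy score."""
    accepted, review = [], []
    for report in reports:
        if report["accuracy"] >= threshold:
            accepted.append(report["page"])
        else:
            review.append(report["page"])
    return accepted, review

# In practice the reports come straight from Camelot:
# reports = [t.parsing_report for t in camelot.read_pdf("report.pdf", pages="all")]
accepted, review = triage_tables([
    {"accuracy": 99.2, "page": 1},
    {"accuracy": 64.5, "page": 2},
])
# accepted -> [1], review -> [2]
```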

You can also visualize what Camelot detected:

# Debug: see what Camelot thinks the table looks like
tables = camelot.read_pdf("report.pdf", pages="1", flavor="lattice")
if len(tables) > 0:
    camelot.plot(tables[0], kind="contour").show()

Results:

PDF type          Accuracy  Time
Clean (bordered)  100%      0.8s
Borderless        92%       1.2s
Messy             72%       1.5s

Pros: Best open-source option for borderless tables. The accuracy metric is genuinely useful. stream mode handles column detection well.

Cons: Requires Ghostscript installed (brew install ghostscript or apt install ghostscript). Slower than pdfplumber. The edge_tol and row_tol parameters need tuning per document type.

Install: pip install "camelot-py[cv]" ghostscript (the quotes keep zsh from expanding the brackets)

Method 3: Tabula (via tabula-py)

Tabula is a Java library with a Python wrapper. It's been around since 2012 and powers a lot of newsroom data extraction. If you've used the Tabula GUI app, this is the same engine.

import tabula

def extract_tables_tabula(pdf_path: str, pages: str = "all") -> list:
    # Tabula returns a list of DataFrames
    dfs = tabula.read_pdf(
        pdf_path,
        pages=pages,
        multiple_tables=True,
        lattice=True  # set False for borderless
    )

    return [df for df in dfs if len(df) > 0]

dfs = extract_tables_tabula("financial_report.pdf")
for i, df in enumerate(dfs):
    print(f"Table {i + 1}: {df.shape}")
    print(df.head())

For tricky tables, you can define the area to extract from:

# Coordinates: [top, left, bottom, right] in PDF points
dfs = tabula.read_pdf(
    "report.pdf",
    pages="1",
    area=[100, 50, 500, 550],  # restrict to this region
    columns=[200, 350, 450],    # force column boundaries
)

Results:

PDF type          Accuracy  Time
Clean (bordered)  98%       2.1s
Borderless        80%       2.4s
Messy             55%       3.0s

Pros: Battle-tested, good at bordered tables, the area and columns parameters give manual control when auto-detection fails.

Cons: Requires Java Runtime Environment (JRE). The JVM startup adds ~2 seconds to every run. Slower than pdfplumber and Camelot. The Python wrapper occasionally throws cryptic Java stack traces.

Install: pip install tabula-py (plus JRE 8+)

Method 4: AWS Textract

When open-source tools can't handle the complexity — scanned documents, complex merged cells, tables spanning multiple pages — Textract is the most reliable option I've found.

import boto3
import json

def extract_tables_textract(pdf_path: str) -> list[list[list[str]]]:
    """Extract tables using AWS Textract. Returns list of tables,
    each table is a list of rows, each row is a list of cell values."""
    client = boto3.client("textract", region_name="us-east-1")

    with open(pdf_path, "rb") as f:
        response = client.analyze_document(
            Document={"Bytes": f.read()},
            FeatureTypes=["TABLES"]
        )

    # Build block map
    blocks = {b["Id"]: b for b in response["Blocks"]}

    tables = []
    for block in response["Blocks"]:
        if block["BlockType"] != "TABLE":
            continue

        rows = {}
        for rel in block.get("Relationships", []):
            if rel["Type"] != "CHILD":
                continue
            for cell_id in rel["Ids"]:
                cell = blocks[cell_id]
                if cell["BlockType"] != "CELL":
                    continue

                row_idx = cell["RowIndex"]
                col_idx = cell["ColumnIndex"]

                # Get cell text
                cell_text = ""
                for cell_rel in cell.get("Relationships", []):
                    if cell_rel["Type"] == "CHILD":
                        for word_id in cell_rel["Ids"]:
                            word = blocks[word_id]
                            if word["BlockType"] == "WORD":
                                cell_text += word["Text"] + " "

                if row_idx not in rows:
                    rows[row_idx] = {}
                rows[row_idx][col_idx] = cell_text.strip()

        # Convert to list of lists
        if rows:
            max_col = max(max(cols.keys()) for cols in rows.values())
            table = []
            for row_idx in sorted(rows.keys()):
                row = [rows[row_idx].get(c, "") for c in range(1, max_col + 1)]
                table.append(row)
            tables.append(table)

    return tables

tables = extract_tables_textract("messy_financial.pdf")
for i, table in enumerate(tables):
    print(f"Table {i + 1}: {len(table)} rows x {len(table[0])} cols")
    for row in table[:3]:
        print(row)

Results:

PDF type          Accuracy  Time  Cost
Clean (bordered)  100%      1.5s  $0.015/page
Borderless        97%       1.5s  $0.015/page
Messy             93%       1.8s  $0.015/page

Pros: Handles everything — scans, borderless tables, merged cells, multi-page tables. The accuracy on messy documents is significantly better than any open-source tool.

Cons: Costs $0.015/page for table extraction. Requires AWS account. Synchronous API only handles single-page documents; multi-page PDFs need the async start_document_analysis flow. Network latency adds to processing time.

Method 5: Manual regex (the nuclear option)

Sometimes the table is so weird that no library can detect it. Maybe the "table" is actually just aligned text with no borders and inconsistent spacing. In those cases, I extract all text and parse it with regex.

import fitz
import re
import pandas as pd

def extract_table_regex(pdf_path: str, page_num: int = 0) -> pd.DataFrame:
    """Last resort: extract a table using regex patterns."""
    doc = fitz.open(pdf_path)
    page = doc[page_num]

    # Get text preserving some layout info
    text = page.get_text("text")
    lines = text.split("\n")

    # Example: financial table with format
    # "Revenue         1,234,567    1,456,789    18.0%"
    pattern = r'^([A-Za-z][\w\s&]+?)\s{2,}([\d,]+(?:\.\d+)?)\s+([\d,]+(?:\.\d+)?)\s+([\d.]+%?)$'

    rows = []
    for line in lines:
        match = re.match(pattern, line.strip())
        if match:
            rows.append({
                "item": match.group(1).strip(),
                "year_1": match.group(2).replace(",", ""),
                "year_2": match.group(3).replace(",", ""),
                "change": match.group(4),
            })

    doc.close()
    return pd.DataFrame(rows)

df = extract_table_regex("annual_report.pdf", page_num=3)
print(df)

For even harder cases, use positional text extraction:

def extract_by_position(pdf_path: str, page_num: int = 0) -> list[dict]:
    """Extract text with position info, then cluster into rows and columns."""
    doc = fitz.open(pdf_path)
    page = doc[page_num]

    words = page.get_text("words")  # list of (x0, y0, x1, y1, word, block, line, word_num)

    # Group by y-coordinate (rows)
    row_tolerance = 3  # points
    rows = {}
    for x0, y0, x1, y1, word, *_ in words:
        row_key = round(y0 / row_tolerance) * row_tolerance
        if row_key not in rows:
            rows[row_key] = []
        rows[row_key].append({"x": x0, "text": word})

    # Sort each row by x position
    result = []
    for y in sorted(rows.keys()):
        row_words = sorted(rows[y], key=lambda w: w["x"])
        result.append(row_words)

    doc.close()
    return result
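From there you still have to decide where the columns are. One workable heuristic (my own, with a guessed 20-point tolerance, not anything built into PyMuPDF) is to cluster the word x-positions across all rows into column bins, then drop each word's text into its (row, column) cell:

```python
def words_to_grid(rows: list[list[dict]], col_tolerance: float = 20.0) -> list[list[str]]:
    """Cluster word x-positions into column bins, then place each word's
    text into its (row, column) cell. rows is the output of
    extract_by_position: a list of rows of {"x": float, "text": str}."""
    # Collect candidate column start positions across all rows
    xs = sorted({w["x"] for row in rows for w in row})
    columns: list[float] = []
    for x in xs:
        if not columns or x - columns[-1] > col_tolerance:
            columns.append(x)

    def col_index(x: float) -> int:
        # Assign each word to the nearest detected column start
        return min(range(len(columns)), key=lambda i: abs(columns[i] - x))

    grid = []
    for row in rows:
        cells = [""] * len(columns)
        for w in row:
            i = col_index(w["x"])
            cells[i] = (cells[i] + " " + w["text"]).strip()
        grid.append(cells)
    return grid
```

The tolerance needs the same per-document tuning as everything else in this method, but it turns the raw word soup into something you can hand to pandas.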

Results: Totally depends on your regex quality. For one-off extractions where I have time to craft the pattern, I get 100% accuracy. For anything automated, I don't use this approach.

Pros: Works on literally anything if you're willing to write the pattern. No dependencies beyond PyMuPDF.

Cons: Fragile, not generalizable, time-consuming to write. Every new table layout needs new code.

Summary table

Method        Best for                  Accuracy range  Speed             Dependencies
pdfplumber    Bordered tables           60-100%         Fast              Pure Python
Camelot       Borderless tables         72-100%         Medium            Ghostscript
Tabula        Bordered + manual tuning  55-98%          Slow (JVM)        Java
AWS Textract  Scans, messy docs         93-100%         Medium (network)  AWS account
Manual regex  One-off weird tables      0-100%          Fast              PyMuPDF

My general workflow:

  1. Try pdfplumber first. If the table has borders, it usually works.
  2. If borderless, switch to Camelot with stream mode.
  3. If the document is scanned or truly messy, use Textract.
  4. If it's a one-off with a weird layout, write a regex.
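That decision tree is easy to wrap in a small dispatcher. A sketch, assuming each extractor is a callable that takes a path and returns a list of tables (raising or returning an empty list on failure); the `extract_with_fallback` helper and its name are mine:

```python
from typing import Callable, Optional

def extract_with_fallback(
    pdf_path: str,
    extractors: list[tuple[str, Callable[[str], list]]],
) -> tuple[Optional[str], list]:
    """Try each extractor in order; return the first non-empty result
    along with the name of the method that produced it."""
    for name, extract in extractors:
        try:
            tables = extract(pdf_path)
        except Exception:
            continue  # a failed method just means "try the next one"
        if tables:
            return name, tables
    return None, []

# Wire up the real methods in preference order, e.g.:
# method, tables = extract_with_fallback("report.pdf", [
#     ("pdfplumber", extract_tables_plumber),
#     ("camelot", extract_tables_camelot),
# ])
```

In a real pipeline I'd also log which method won per document; that tells you a lot about what your upstream PDFs actually look like.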

The upstream fix: generate table-friendly PDFs

After years of extracting tables, I've noticed a pattern. The hardest PDFs to parse are always the ones generated by tools that treat tables as visual elements rather than data structures.

When a PDF is generated from actual HTML <table> elements — with proper <thead>, <tbody>, <tr>, <td> tags — every extraction library performs well. The table has real structure in the PDF, not just visually aligned text.

If you're generating PDFs that other people (or systems) will need to extract tables from, use HTML-based PDF generation. LightningPDF renders HTML to PDF through a real browser engine, which means your <table> markup translates into a properly structured PDF. Headers stay as headers, cells stay in their rows and columns, and anyone downstream can extract the data with pdfplumber in three lines of code.

The best table extraction code is the code you don't have to debug.
