# Extract Tables from PDFs: 5 Methods That Actually Work
A hands-on comparison of five ways to extract tables from PDFs in Python: pdfplumber, Camelot, Tabula, AWS Textract, and manual regex. With code, benchmarks, and honest pros and cons.
I had a PDF with a 47-row financial table. I needed it in a DataFrame. "There must be a library for this," I thought. Four hours later I was manually adjusting column detection thresholds and questioning my career choices.
Table extraction from PDFs is hard because PDFs don't have tables. They have lines and text placed at specific coordinates. What looks like a table to your eyes is, to a PDF parser, just a bunch of text blocks that happen to be aligned. Some PDFs draw borders (which helps). Some don't (which means the parser has to guess where columns start and end based on whitespace).
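A toy illustration of what the parser actually sees (hard-coded word boxes standing in for real extraction output): a "column" is nothing more than an x-coordinate that recurs across rows.

```python
from collections import Counter

# Word boxes as a PDF parser sees them: (x0, y0, text) triples, no table structure.
words = [
    (72, 100, "Item"),    (200, 100, "Q1"),    (300, 100, "Q2"),
    (72, 115, "Revenue"), (200, 115, "1,200"), (300, 115, "1,450"),
    (72, 130, "Costs"),   (200, 130, "800"),   (300, 130, "910"),
]

# Column detection is guessing: an x0 that repeats across rows is probably a column start.
x_counts = Counter(x0 for x0, _, _ in words)
column_starts = sorted(x for x, count in x_counts.items() if count > 1)
print(column_starts)  # → [72, 200, 300]
```

Every library below is, at its core, a more sophisticated version of this guess.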
Here are five approaches I've used in production, with code, benchmarks, and an honest assessment of each.
## Test setup
I ran each method against three types of PDF tables:
- Clean: 20-row table with visible borders, exported from Excel
- Borderless: 35-row table with no grid lines, exported from Google Sheets
- Messy: 47-row financial table with merged cells, subtotal rows, and footnotes
Accuracy = percentage of cells extracted correctly. Timing on an M2 MacBook Air.
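Concretely, "percentage of cells extracted correctly" means comparing each extracted cell against a hand-checked ground truth at the same position. A minimal sketch of that scoring function:

```python
def cell_accuracy(extracted: list[list[str]], truth: list[list[str]]) -> float:
    """Fraction of ground-truth cells that appear, correctly, at the same position."""
    total = sum(len(row) for row in truth)
    correct = 0
    for r, truth_row in enumerate(truth):
        for c, cell in enumerate(truth_row):
            try:
                if extracted[r][c].strip() == cell.strip():
                    correct += 1
            except IndexError:
                pass  # a missing row or column counts as wrong
    return correct / total

print(cell_accuracy([["a", "b"], ["c", "x"]], [["a", "b"], ["c", "d"]]))  # → 0.75
```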
## Method 1: pdfplumber
pdfplumber is my default. It uses the positions of lines and text characters to detect table boundaries, and it works especially well when tables have visible borders.
```python
import pdfplumber
import pandas as pd

def extract_tables_plumber(pdf_path: str) -> list[pd.DataFrame]:
    tables = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            page_tables = page.extract_tables({
                "vertical_strategy": "lines",
                "horizontal_strategy": "lines",
                "snap_tolerance": 5,
            })
            for table in page_tables:
                # First row is usually headers
                if len(table) > 1:
                    df = pd.DataFrame(table[1:], columns=table[0])
                    # Clean up whitespace
                    df = df.map(lambda x: x.strip() if isinstance(x, str) else x)
                    tables.append(df)
    return tables

dfs = extract_tables_plumber("financial_report.pdf")
for i, df in enumerate(dfs):
    print(f"Table {i + 1}: {len(df)} rows x {len(df.columns)} columns")
    print(df.head())
    print()
```
For borderless tables, switch to text-based detection:
```python
with pdfplumber.open(pdf_path) as pdf:
    page = pdf.pages[0]
    table = page.extract_table({
        "vertical_strategy": "text",
        "horizontal_strategy": "text",
        "min_words_vertical": 3,
        "min_words_horizontal": 1,
    })
```
Results:
| PDF type | Accuracy | Time |
|---|---|---|
| Clean (bordered) | 100% | 0.3s |
| Borderless | 85% | 0.4s |
| Messy | 60% | 0.5s |
Pros: Pure Python, no Java dependency, good API, handles most bordered tables perfectly.
Cons: Struggles with borderless tables and merged cells. The `text` strategy for column detection is hit-or-miss.
Install: `pip install pdfplumber`
## Method 2: Camelot
Camelot is the library that takes table extraction seriously. It has two modes: `lattice` (for tables with lines) and `stream` (for tables without). The `stream` mode uses a clustering algorithm to detect column positions from text alignment.
```python
import camelot

def extract_tables_camelot(pdf_path: str, pages: str = "all") -> list:
    # Try lattice first (bordered tables)
    tables = camelot.read_pdf(
        pdf_path,
        pages=pages,
        flavor="lattice",
        line_scale=40,
    )
    if len(tables) == 0:
        # Fall back to stream (borderless tables)
        tables = camelot.read_pdf(
            pdf_path,
            pages=pages,
            flavor="stream",
            edge_tol=50,
            row_tol=10,
        )
    results = []
    for table in tables:
        df = table.df
        # Use first row as header
        df.columns = df.iloc[0]
        df = df[1:].reset_index(drop=True)
        results.append({
            "dataframe": df,
            "accuracy": table.accuracy,
            "page": table.page,
        })
    return results

results = extract_tables_camelot("financial_report.pdf", pages="1-3")
for r in results:
    print(f"Page {r['page']}, accuracy: {r['accuracy']:.1f}%")
    print(r["dataframe"].head())
```
Camelot's killer feature is the accuracy score. It tells you how confident it is about the extraction. If accuracy drops below 80%, you know to review that table manually.
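That threshold check is easy to automate. A sketch (the objects only need Camelot's per-table `accuracy` attribute):

```python
def triage(tables, threshold: float = 80.0):
    """Split extracted tables into auto-accepted and needs-manual-review
    buckets based on Camelot's per-table accuracy score."""
    good = [t for t in tables if t.accuracy >= threshold]
    review = [t for t in tables if t.accuracy < threshold]
    return good, review

# Usage with a camelot.read_pdf result like the one above:
# good, review = triage(camelot.read_pdf("report.pdf", pages="all", flavor="stream"))
```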
You can also visualize what Camelot detected:
```python
# Debug: see what Camelot thinks the table looks like
tables = camelot.read_pdf("report.pdf", pages="1", flavor="lattice")
if len(tables) > 0:
    camelot.plot(tables[0], kind="contour").show()
```
Results:
| PDF type | Accuracy | Time |
|---|---|---|
| Clean (bordered) | 100% | 0.8s |
| Borderless | 92% | 1.2s |
| Messy | 72% | 1.5s |
Pros: Best open-source option for borderless tables. The accuracy metric is genuinely useful. `stream` mode handles column detection well.
Cons: Requires Ghostscript installed (`brew install ghostscript` or `apt install ghostscript`). Slower than pdfplumber. The `edge_tol` and `row_tol` parameters need tuning per document type.
Install: `pip install "camelot-py[cv]" ghostscript` (plus the system Ghostscript binary, as above)
## Method 3: Tabula (via tabula-py)
Tabula is a Java library with a Python wrapper. It's been around since 2012 and powers a lot of newsroom data extraction. If you've used the Tabula GUI app, this is the same engine.
```python
import tabula

def extract_tables_tabula(pdf_path: str, pages: str = "all") -> list:
    # Tabula returns a list of DataFrames
    dfs = tabula.read_pdf(
        pdf_path,
        pages=pages,
        multiple_tables=True,
        lattice=True,  # set False for borderless
    )
    return [df for df in dfs if len(df) > 0]

dfs = extract_tables_tabula("financial_report.pdf")
for i, df in enumerate(dfs):
    print(f"Table {i + 1}: {df.shape}")
    print(df.head())
```
For tricky tables, you can define the area to extract from:
```python
# Coordinates: [top, left, bottom, right] in PDF points
dfs = tabula.read_pdf(
    "report.pdf",
    pages="1",
    guess=False,               # don't auto-detect; use area/columns as given
    area=[100, 50, 500, 550],  # restrict to this region
    columns=[200, 350, 450],   # force column boundaries
)
```
Results:
| PDF type | Accuracy | Time |
|---|---|---|
| Clean (bordered) | 98% | 2.1s |
| Borderless | 80% | 2.4s |
| Messy | 55% | 3.0s |
Pros: Battle-tested, good at bordered tables; the `area` and `columns` parameters give manual control when auto-detection fails.
Cons: Requires a Java Runtime Environment (JRE). The JVM startup adds ~2 seconds to every run. Slower than pdfplumber and Camelot. The Python wrapper occasionally throws cryptic Java stack traces.
Install: `pip install tabula-py` (plus JRE 8+)
## Method 4: AWS Textract
When open-source tools can't handle the complexity — scanned documents, complex merged cells, tables spanning multiple pages — Textract is the most reliable option I've found.
```python
import boto3

def extract_tables_textract(pdf_path: str) -> list[list[list[str]]]:
    """Extract tables using AWS Textract. Returns a list of tables;
    each table is a list of rows, each row is a list of cell values."""
    client = boto3.client("textract", region_name="us-east-1")
    with open(pdf_path, "rb") as f:
        response = client.analyze_document(
            Document={"Bytes": f.read()},
            FeatureTypes=["TABLES"],
        )
    # Build block map
    blocks = {b["Id"]: b for b in response["Blocks"]}
    tables = []
    for block in response["Blocks"]:
        if block["BlockType"] != "TABLE":
            continue
        rows = {}
        for rel in block.get("Relationships", []):
            if rel["Type"] != "CHILD":
                continue
            for cell_id in rel["Ids"]:
                cell = blocks[cell_id]
                if cell["BlockType"] != "CELL":
                    continue
                row_idx = cell["RowIndex"]
                col_idx = cell["ColumnIndex"]
                # Get cell text by following the cell's CHILD relationships to WORD blocks
                cell_text = ""
                for cell_rel in cell.get("Relationships", []):
                    if cell_rel["Type"] == "CHILD":
                        for word_id in cell_rel["Ids"]:
                            word = blocks[word_id]
                            if word["BlockType"] == "WORD":
                                cell_text += word["Text"] + " "
                if row_idx not in rows:
                    rows[row_idx] = {}
                rows[row_idx][col_idx] = cell_text.strip()
        # Convert to list of lists
        if rows:
            max_col = max(max(cols.keys()) for cols in rows.values())
            table = []
            for row_idx in sorted(rows.keys()):
                row = [rows[row_idx].get(c, "") for c in range(1, max_col + 1)]
                table.append(row)
            tables.append(table)
    return tables

tables = extract_tables_textract("messy_financial.pdf")
for i, table in enumerate(tables):
    print(f"Table {i + 1}: {len(table)} rows x {len(table[0])} cols")
    for row in table[:3]:
        print(row)
```
Results:
| PDF type | Accuracy | Time | Cost |
|---|---|---|---|
| Clean (bordered) | 100% | 1.5s | $0.015/page |
| Borderless | 97% | 1.5s | $0.015/page |
| Messy | 93% | 1.8s | $0.015/page |
Pros: Handles everything — scans, borderless tables, merged cells, multi-page tables. The accuracy on messy documents is significantly better than any open-source tool.
Cons: Costs $0.015/page for table extraction. Requires an AWS account. The synchronous API only handles single-page documents; multi-page PDFs need the async `start_document_analysis` flow. Network latency adds to processing time.
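For multi-page PDFs, the async flow is: upload to S3, call `start_document_analysis`, poll `get_document_analysis`, and follow `NextToken` pagination until the blocks run out. The pagination is the part that's easy to get wrong, so here is a sketch with the API call injected as a callable (so it can be tested without AWS):

```python
def collect_blocks(fetch_page) -> list[dict]:
    """Accumulate all Blocks from a paginated Textract response.

    fetch_page(next_token) should wrap client.get_document_analysis(
    JobId=..., NextToken=next_token) and return the response dict."""
    blocks, token = [], None
    while True:
        response = fetch_page(token)
        blocks.extend(response.get("Blocks", []))
        token = response.get("NextToken")
        if not token:
            return blocks

# Fake two-page response for illustration:
pages = {
    None: {"Blocks": [{"Id": "1"}], "NextToken": "t2"},
    "t2": {"Blocks": [{"Id": "2"}]},
}
print(len(collect_blocks(lambda t: pages[t])))  # → 2
```

Feed the accumulated blocks into the same table-assembly logic as the synchronous version above.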
## Method 5: Manual regex (the nuclear option)
Sometimes the table is so weird that no library can detect it. Maybe the "table" is actually just aligned text with no borders and inconsistent spacing. In those cases, I extract all text and parse it with regex.
```python
import fitz  # PyMuPDF
import re
import pandas as pd

def extract_table_regex(pdf_path: str, page_num: int = 0) -> pd.DataFrame:
    """Last resort: extract a table using regex patterns."""
    doc = fitz.open(pdf_path)
    page = doc[page_num]
    # Get text preserving some layout info
    text = page.get_text("text")
    lines = text.split("\n")
    # Example: financial table with lines like
    # "Revenue  1,234,567  1,456,789  18.0%"
    pattern = r'^([A-Za-z][\w\s&]+?)\s{2,}([\d,]+(?:\.\d+)?)\s+([\d,]+(?:\.\d+)?)\s+([\d.]+%?)$'
    rows = []
    for line in lines:
        match = re.match(pattern, line.strip())
        if match:
            rows.append({
                "item": match.group(1).strip(),
                "year_1": match.group(2).replace(",", ""),
                "year_2": match.group(3).replace(",", ""),
                "change": match.group(4),
            })
    doc.close()
    return pd.DataFrame(rows)

df = extract_table_regex("annual_report.pdf", page_num=3)
print(df)
```
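Before unleashing a pattern like this on a whole document, sanity-check it against one line you expect it to match:

```python
import re

pattern = r'^([A-Za-z][\w\s&]+?)\s{2,}([\d,]+(?:\.\d+)?)\s+([\d,]+(?:\.\d+)?)\s+([\d.]+%?)$'

m = re.match(pattern, "Revenue  1,234,567 1,456,789 18.0%")
print(m.group(1), m.group(2), m.group(4))  # → Revenue 1,234,567 18.0%
```

Note the `\s{2,}` after the item name: it relies on at least two spaces separating the label from the first number, which is exactly the kind of assumption that makes this approach fragile.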
For even harder cases, use positional text extraction:
```python
def extract_by_position(pdf_path: str, page_num: int = 0) -> list[list[dict]]:
    """Extract words with position info and cluster them into rows
    (columns still need a second pass)."""
    doc = fitz.open(pdf_path)
    page = doc[page_num]
    words = page.get_text("words")  # list of (x0, y0, x1, y1, word, block, line, word_num)
    # Group by y-coordinate (rows)
    row_tolerance = 3  # points
    rows = {}
    for x0, y0, x1, y1, word, *_ in words:
        row_key = round(y0 / row_tolerance) * row_tolerance
        if row_key not in rows:
            rows[row_key] = []
        rows[row_key].append({"x": x0, "text": word})
    # Sort each row by x position
    result = []
    for y in sorted(rows.keys()):
        row_words = sorted(rows[y], key=lambda w: w["x"])
        result.append(row_words)
    doc.close()
    return result
```
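The missing piece after `extract_by_position` is column detection: cluster the x-coordinates seen across all rows, then assign each word to the nearest cluster. A sketch of the clustering step (the `gap` value is a per-document tuning assumption):

```python
def cluster_columns(xs: list[float], gap: float = 15.0) -> list[float]:
    """Group x-coordinates into column-start clusters.

    Two x values belong to the same column if they are within `gap` points.
    Returns the mean x of each cluster."""
    clusters: list[list[float]] = []
    for x in sorted(xs):
        if clusters and x - clusters[-1][-1] <= gap:
            clusters[-1].append(x)
        else:
            clusters.append([x])
    return [sum(c) / len(c) for c in clusters]

xs = [72.0, 73.5, 71.8, 200.1, 199.6, 310.0]
print(cluster_columns(xs))  # three clusters, around x = 72, 200, and 310
```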
Results: Totally depends on your regex quality. For one-off extractions where I have time to craft the pattern, I get 100% accuracy. For anything automated, I don't use this approach.
Pros: Works on literally anything if you're willing to write the pattern. No dependencies beyond PyMuPDF.
Cons: Fragile, not generalizable, time-consuming to write. Every new table layout needs new code.
## Summary table
| Method | Best for | Accuracy range | Speed | Dependencies |
|---|---|---|---|---|
| pdfplumber | Bordered tables | 60-100% | Fast | Pure Python |
| Camelot | Borderless tables | 72-100% | Medium | Ghostscript |
| Tabula | Bordered + manual tuning | 55-98% | Slow (JVM) | Java |
| AWS Textract | Scans, messy docs | 93-100% | Medium (network) | AWS account |
| Manual regex | One-off weird tables | 0-100% | Fast | PyMuPDF |
My general workflow:
- Try pdfplumber first. If the table has borders, it usually works.
- If borderless, switch to Camelot with `stream` mode.
- If the document is scanned or truly messy, use Textract.
- If it's a one-off with a weird layout, write a regex.
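That decision tree can be encoded as a fallback chain. A sketch, with the extractor functions from this post passed in as callables:

```python
def extract_with_fallback(pdf_path: str, extractors) -> tuple[str, list]:
    """Try each (name, extractor) pair in order; return the first non-empty result."""
    for name, extract in extractors:
        try:
            tables = extract(pdf_path)
        except Exception:
            continue  # a parser crash just means: try the next method
        if tables:
            return name, tables
    return "none", []

# Usage with the functions defined earlier (Textract last, since it costs money):
# method, tables = extract_with_fallback("report.pdf", [
#     ("pdfplumber", extract_tables_plumber),
#     ("camelot", extract_tables_camelot),
#     ("textract", extract_tables_textract),
# ])
```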
## The upstream fix: generate table-friendly PDFs
After years of extracting tables, I've noticed a pattern. The hardest PDFs to parse are always the ones generated by tools that treat tables as visual elements rather than data structures.
When a PDF is generated from actual HTML `<table>` elements — with proper `<thead>`, `<tbody>`, `<tr>`, and `<td>` tags — every extraction library performs well. The table has real structure in the PDF, not just visually aligned text.
If you're generating PDFs that other people (or systems) will need to extract tables from, use HTML-based PDF generation. LightningPDF renders HTML to PDF through a real browser engine, which means your `<table>` markup translates into a properly structured PDF. Headers stay as headers, cells stay in their rows and columns, and anyone downstream can extract the data with pdfplumber in three lines of code.
The best table extraction code is the code you don't have to debug.
LightningPDF Team
Building fast, reliable PDF generation tools for developers.