PDF to JSON: How to Extract Structured Data from PDFs

Three practical approaches to extracting structured data from PDFs into JSON: regex on raw text, template-based extraction, and AI-powered extraction with code for each.

By LightningPDF Team · 4 min read

A client sent me 1,200 supplier invoices as PDFs and asked me to "just get the data into our system." Each PDF had the same basic info — vendor name, invoice number, line items, totals — but across about 15 different layouts. Some were two-column, some had tables, one vendor apparently designed their invoices in PowerPoint.

PDF to JSON extraction is one of those problems that sounds like it should have been solved ten years ago. It hasn't. But depending on your constraints, there are three approaches that actually work, and I've shipped production systems with all three.

Approach 1: Regex on extracted text

If your PDFs come from a known set of templates (say, invoices from 5 vendors you work with regularly), pattern matching on raw text is surprisingly effective. It's fast, free, and you don't need an internet connection.

Step 1: Extract the text

import fitz  # pip install pymupdf

def get_text(pdf_path: str) -> str:
    doc = fitz.open(pdf_path)
    text = ""
    for page in doc:
        text += page.get_text("text") + "\n"
    doc.close()
    return text

raw = get_text("invoice_acme.pdf")
print(raw[:500])

Step 2: Build patterns for your templates

import re
import json

def extract_invoice_acme(text: str) -> dict:
    """Extract structured data from Acme Corp invoice layout."""

    invoice_num = re.search(r'Invoice\s*#?\s*:?\s*(\w+-\d+)', text)
    date = re.search(r'Date\s*:?\s*(\d{1,2}[/-]\d{1,2}[/-]\d{2,4})', text)
    total = re.search(r'Total\s*:?\s*\$?([\d,]+\.\d{2})', text)

    # Line items: "Widget A    5    $10.00    $50.00"
    line_pattern = r'([A-Za-z][\w\s]{2,30}?)\s{2,}(\d+)\s+\$?([\d,]+\.\d{2})\s+\$?([\d,]+\.\d{2})'
    lines = re.findall(line_pattern, text)

    return {
        "vendor": "Acme Corp",
        "invoice_number": invoice_num.group(1) if invoice_num else None,
        "date": date.group(1) if date else None,
        "total": float(total.group(1).replace(",", "")) if total else None,
        "line_items": [
            {
                "description": desc.strip(),
                "quantity": int(qty),
                "unit_price": float(price.replace(",", "")),
                "amount": float(amt.replace(",", ""))
            }
            for desc, qty, price, amt in lines
        ]
    }

result = extract_invoice_acme(raw)
print(json.dumps(result, indent=2))

Output:

{
  "vendor": "Acme Corp",
  "invoice_number": "INV-20260312",
  "date": "03/12/2026",
  "total": 1250.00,
  "line_items": [
    {
      "description": "Widget A",
      "quantity": 5,
      "unit_price": 10.00,
      "amount": 50.00
    }
  ]
}

When this works

  • You have a small number of known PDF templates (under 20)
  • The templates don't change often
  • You need fast processing (this runs at thousands of pages per second)
  • You don't want to pay for an API

When it breaks

  • Vendor changes their invoice layout and your regex silently extracts wrong data
  • PDFs with inconsistent spacing (regex depends on whitespace patterns)
  • More than ~20 templates and the maintenance burden gets ugly

I've seen production systems with 50+ regex templates running for years. They work, but every new vendor means a day of writing and testing patterns. It's boring work and nobody wants to maintain it.
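One way to keep that maintenance bearable is a small dispatch registry: detect the vendor from a marker string in the text, then route to that vendor's extractor. This is a sketch of the pattern, not code from any particular production system — the `register` decorator and `extract_any` helper are my own naming:

```python
import re

# Hypothetical registry: marker text identifying a vendor -> extractor.
# Each extractor takes raw page text and returns a dict,
# like extract_invoice_acme above.
EXTRACTORS = {}

def register(marker: str):
    """Decorator that registers an extractor under a vendor marker."""
    def wrap(fn):
        EXTRACTORS[marker] = fn
        return fn
    return wrap

@register("Acme Corp")
def extract_acme(text: str) -> dict:
    m = re.search(r'Invoice\s*#?\s*:?\s*(\w+-\d+)', text)
    return {"vendor": "Acme Corp",
            "invoice_number": m.group(1) if m else None}

def extract_any(text: str) -> dict:
    """Pick the first extractor whose marker appears in the text."""
    for marker, fn in EXTRACTORS.items():
        if marker in text:
            return fn(text)
    raise ValueError("no template matched; route to manual review")
```

Adding a vendor is then one decorated function, and anything unrecognized fails loudly instead of extracting garbage.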

Approach 2: Template-based extraction with coordinates

If regex feels too fragile, you can extract text by position. Define regions on the page where specific fields appear, and grab whatever text falls in that box. This is how most commercial PDF extraction tools work under the hood.

import fitz

def extract_by_regions(pdf_path: str, regions: dict) -> dict:
    """Extract text from defined rectangular regions of a PDF.

    regions: {field_name: (x0, y0, x1, y1)} in points (72 points = 1 inch)
    """
    doc = fitz.open(pdf_path)
    page = doc[0]  # first page

    result = {}
    for field, rect in regions.items():
        area = fitz.Rect(rect)
        text = page.get_textbox(area).strip()
        result[field] = text

    doc.close()
    return result

# Define regions for a specific invoice template
# These coordinates come from measuring the PDF layout
acme_regions = {
    "invoice_number": (400, 80, 560, 100),   # top-right area
    "date":           (400, 105, 560, 125),
    "vendor_name":    (50, 80, 250, 110),     # top-left area
    "subtotal":       (450, 600, 560, 620),
    "tax":            (450, 625, 560, 645),
    "total":          (450, 650, 560, 670),
}

data = extract_by_regions("invoice_acme.pdf", acme_regions)
print(json.dumps(data, indent=2))

To find the right coordinates, I use this helper:

def debug_page_layout(pdf_path: str, page_num: int = 0):
    """Print all text blocks with their positions."""
    doc = fitz.open(pdf_path)
    page = doc[page_num]

    blocks = page.get_text("dict")["blocks"]
    for block in blocks:
        if "lines" in block:
            bbox = block["bbox"]
            text = ""
            for line in block["lines"]:
                for span in line["spans"]:
                    text += span["text"]
            if text.strip():
                print(f"  [{bbox[0]:.0f}, {bbox[1]:.0f}, {bbox[2]:.0f}, {bbox[3]:.0f}]  {text.strip()[:60]}")

    doc.close()

debug_page_layout("invoice_acme.pdf")

This prints every text block with its bounding box. Pick the coordinates that enclose the fields you want, add some padding, and you have a template.

The advantage over regex

Coordinate-based extraction doesn't care about text content. If the vendor changes "Invoice Number" to "Invoice No." or "Inv #", your extraction still works because you're reading from the same position on the page. It only breaks if they redesign the layout.
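You can make that failure mode loud instead of silent with an anchor check (my own convention, not a PyMuPDF feature): carve out one extra region over a label that should always be there, and refuse the template's output when the expected text is missing:

```python
def template_matches(extracted: dict, anchors: dict) -> bool:
    """Check that known label regions still contain the text we expect.

    extracted: output of extract_by_regions, field -> text
    anchors:   field -> substring that must appear in that field's region
    """
    return all(
        expected.lower() in extracted.get(field, "").lower()
        for field, expected in anchors.items()
    )

# e.g. add a region over the word "Invoice" itself to acme_regions,
# then require: template_matches(data, {"header_label": "Invoice"})
```

If the vendor redesigns the page, the anchor check fails and the document goes to review rather than into your database with shifted fields.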

Extracting tables by position

For line items in a table, combine positional extraction with line-by-line parsing:

def extract_table_region(pdf_path: str, table_rect: tuple, columns: list) -> list[dict]:
    """Extract a table from a defined region using column x-coordinates.

    columns: [(name, x_start, x_end), ...]
    """
    doc = fitz.open(pdf_path)
    page = doc[0]

    # Get all text in the table area with position info
    area = fitz.Rect(table_rect)
    blocks = page.get_text("dict", clip=area)["blocks"]

    # Group text spans by y-coordinate (same row)
    rows = {}
    for block in blocks:
        if "lines" not in block:
            continue
        for line in block["lines"]:
            for span in line["spans"]:
                y = round(span["bbox"][1], 0)  # round to group nearby text
                if y not in rows:
                    rows[y] = []
                rows[y].append({
                    "x": span["bbox"][0],
                    "text": span["text"].strip()
                })

    # Assign text to columns
    result = []
    for y in sorted(rows.keys()):
        row_data = {}
        for col_name, x_start, x_end in columns:
            cell_text = " ".join(
                s["text"] for s in rows[y]
                if x_start <= s["x"] < x_end
            )
            row_data[col_name] = cell_text
        if any(v for v in row_data.values()):
            result.append(row_data)

    doc.close()
    return result

# Define table area and column boundaries
line_items = extract_table_region(
    "invoice_acme.pdf",
    table_rect=(50, 250, 560, 580),
    columns=[
        ("description", 50, 250),
        ("quantity", 250, 320),
        ("unit_price", 320, 420),
        ("amount", 420, 560),
    ]
)

This is tedious to set up but reliable for stable templates. I've used this in production for processing bank statements where the layout hasn't changed in years.
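The rows come back as raw strings, so you still need a small post-processing step to get typed JSON. A sketch, assuming the column names used above; the quantity check doubles as a way to skip the header row:

```python
def typed_line_items(rows: list[dict]) -> list[dict]:
    """Convert string cells from extract_table_region into typed values.

    Rows whose quantity cell isn't numeric (e.g. the header row,
    or stray text caught by the region) are skipped.
    """
    def money(cell: str):
        cell = cell.replace("$", "").replace(",", "").strip()
        return float(cell) if cell else None

    items = []
    for row in rows:
        if not row.get("quantity", "").strip().isdigit():
            continue
        items.append({
            "description": row["description"].strip(),
            "quantity": int(row["quantity"]),
            "unit_price": money(row["unit_price"]),
            "amount": money(row["amount"]),
        })
    return items
```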

Approach 3: AI-powered extraction

When you have too many templates to maintain, or the PDFs are unstructured, throw an LLM at it. This is the approach that's gotten dramatically better in the last year.

Using GPT-4o with vision

The simplest approach — convert the PDF page to an image and ask the model to extract data:

import openai
import fitz
import base64
import json

def pdf_page_to_image(pdf_path: str, page_num: int = 0, dpi: int = 200) -> str:
    """Convert a PDF page to a base64-encoded PNG."""
    doc = fitz.open(pdf_path)
    page = doc[page_num]
    mat = fitz.Matrix(dpi / 72, dpi / 72)
    pix = page.get_pixmap(matrix=mat)
    img_bytes = pix.tobytes("png")
    doc.close()
    return base64.b64encode(img_bytes).decode()

def extract_with_vision(pdf_path: str, schema: dict) -> dict:
    """Extract structured data from a PDF using GPT-4o vision."""
    client = openai.OpenAI()
    img_b64 = pdf_page_to_image(pdf_path)

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": f"""Extract the following fields from this document image.
Return valid JSON matching this schema:
{json.dumps(schema, indent=2)}

Rules:
- Use null for fields you can't find
- Dates in ISO 8601 format (YYYY-MM-DD)
- Monetary values as floats, no currency symbols
- Return ONLY the JSON, no explanation"""
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/png;base64,{img_b64}",
                        "detail": "high"
                    }
                }
            ]
        }],
        temperature=0,
        max_tokens=2000
    )

    text = response.choices[0].message.content
    # Strip markdown code fences if present
    text = text.strip().removeprefix("```json").removesuffix("```").strip()
    return json.loads(text)

schema = {
    "vendor_name": "string",
    "invoice_number": "string",
    "date": "string (ISO 8601)",
    "line_items": [{"description": "string", "quantity": "int", "unit_price": "float", "amount": "float"}],
    "subtotal": "float",
    "tax": "float",
    "total": "float"
}

result = extract_with_vision("invoice_unknown_vendor.pdf", schema)
print(json.dumps(result, indent=2))

This handles virtually any layout. I've tested it on handwritten invoices, scanned documents from the 90s, and invoices in languages I don't speak. It works about 95% of the time.

Cost: about $0.01-0.03 per page at GPT-4o pricing. For 1,200 invoices, that's $12-36. Fast enough for batch processing — about 2-3 seconds per page.
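For the batch itself, wrap the extractor so one malformed response doesn't kill the run. A sketch — `extract_fn` is any of the extractors in this post (e.g. `extract_with_vision` with a fixed schema), and the error list is what you hand to a human later:

```python
import json

def run_batch(paths: list, extract_fn) -> tuple[list, list]:
    """Run extract_fn over many PDFs, collecting failures for later
    review instead of aborting the whole batch."""
    results, failures = [], []
    for path in paths:
        try:
            results.append({"file": path, "data": extract_fn(path)})
        except (json.JSONDecodeError, ValueError, KeyError) as e:
            failures.append({"file": path, "error": str(e)})
    return results, failures
```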

Using a local model for sensitive documents

If you can't send documents to OpenAI (healthcare, finance, legal), use a local model. Qwen2-VL or LLaVA work well enough for structured extraction:

# Using Ollama with a vision model
import ollama
import json

def extract_with_local_model(image_b64: str, schema: dict) -> dict:
    response = ollama.chat(
        model="llava:13b",
        messages=[{
            "role": "user",
            "content": f"""Extract data from this document image as JSON.
Schema: {json.dumps(schema)}
Return ONLY valid JSON.""",
            "images": [image_b64]
        }]
    )
    content = response["message"]["content"]
    # Strip markdown code fences if present (local models add them too)
    content = content.strip().removeprefix("```json").removesuffix("```").strip()
    return json.loads(content)

The accuracy drops to about 80-85% compared to GPT-4o, but it's free and private. For many use cases that's good enough, especially if you add a validation step.
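That validation step can be as simple as checking required fields and types against your schema before accepting the output — a sketch, not a full JSON Schema validator:

```python
def validate_extraction(data: dict, required: dict) -> list[str]:
    """Return a list of problems; an empty list means the output passed.

    required maps field name -> expected Python type,
    e.g. {"total": float, "vendor_name": str}.
    """
    problems = []
    for field, typ in required.items():
        value = data.get(field)
        if value is None:
            problems.append(f"missing: {field}")
        elif not isinstance(value, typ):
            problems.append(f"wrong type: {field} is {type(value).__name__}")
    return problems
```

Anything with a non-empty problem list gets retried or routed to a human.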

Choosing your approach

Criteria                     Regex            Template/coords   AI vision
Setup time                   Medium           High              Low
Maintenance                  High             Medium            None
Cost per page                ~$0              ~$0               $0.01-0.03
Accuracy (known templates)   95%+             98%+              95%
Accuracy (unknown layouts)   0%               0%                90-95%
Speed                        1000+ pages/sec  500+ pages/sec    2-3 sec/page

For a batch of 1,200 invoices from 15 vendors, I ended up using a hybrid: AI extraction for the first pass, then manual review of anything where the total didn't match the sum of line items. Caught about 40 errors out of 1,200 — all edge cases like handwritten corrections on printed invoices.
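That consistency check is a few lines — a sketch matching the JSON shape used in this post:

```python
def needs_review(invoice: dict, tolerance: float = 0.01) -> bool:
    """Flag an invoice for manual review when the stated total doesn't
    match the sum of line-item amounts (within a rounding tolerance)."""
    total = invoice.get("total")
    items = invoice.get("line_items") or []
    if total is None or not items:
        return True  # nothing to verify against -> review it
    line_sum = sum(item.get("amount") or 0.0 for item in items)
    return abs(line_sum - total) > tolerance
```

If your invoices carry tax, compare against the subtotal instead of the total.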

Or skip extraction entirely

Here's the thing that took me too long to figure out: half the time someone asks me to extract data from PDFs, the data started as structured data. Someone had JSON, or a database row, or a spreadsheet. They generated a PDF from it. Now I'm being asked to reverse that process.

If you control the PDF generation step, you can skip extraction entirely. Keep the structured data, generate the PDF as the output format when a human needs to read it.

LightningPDF does exactly this — you send JSON data and an HTML template, and you get a PDF back. The data never stops being structured. When you need to put it in a database, you already have the JSON. When a customer needs a pretty document, you generate the PDF on demand.

curl -X POST https://api.lightningpdf.dev/api/v1/pdf/generate \
  -H "X-API-Key: your-key" \
  -H "Content-Type: application/json" \
  -d '{
    "html": "<h1>Invoice INV-20260312</h1><table><tr><td>Widget A</td><td>$50.00</td></tr></table>"
  }'

The best extraction pipeline is the one you don't need to build.

