How to Parse PDFs for RAG Pipelines

A practical guide to parsing PDFs for retrieval-augmented generation. Covers chunking strategies, PyMuPDF vs Marker vs LlamaParse, and code for extracting and embedding PDF content.

By LightningPDF Team · 5 min read

Last week I built a RAG system over a client's internal knowledge base. 400 PDFs — policy documents, onboarding guides, engineering specs. My first attempt gave answers like "according to the document, section 3.2 references the applicable policy." Useless. The LLM was regurgitating PDF garbage because my parsing was garbage.

PDF parsing for RAG is its own special hell. You're not just extracting text — you're extracting meaning in chunks that a language model can actually use. Here's what I learned after three rewrites and a lot of wasted OpenAI credits.

Why PDF parsing is the bottleneck in every RAG system

Most RAG tutorials gloss over this part. They show you how to set up a vector store, call an embedding API, and wire up retrieval. The PDF parsing step gets one line: loader = PyPDFLoader("doc.pdf").

Then your system hallucinates because:

  • Headers and footers repeat on every page and contaminate chunks
  • Tables get flattened into nonsensical strings like "Revenue 2024 2025 Growth 1.2M 1.8M 50%"
  • Multi-column layouts merge into interleaved gibberish
  • Page breaks split sentences mid-thought
  • Figures and captions end up as orphaned text fragments

The quality of your RAG system is capped by the quality of your parsing. I've seen teams spend weeks tuning retrieval parameters when the real problem was that their chunks contained "Page 47 of 112 | Confidential" every 300 tokens.
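One way to catch repeated headers and footers without hard-coding their wording is to count how often each line shows up across pages. The sketch below is a minimal, standard-library illustration of that idea; the function names and the 60% threshold are my own assumptions, not part of any parser's API.

```python
import re
from collections import Counter


def normalize(line: str) -> str:
    """Collapse digits so 'Page 3 of 112' and 'Page 47 of 112' compare equal."""
    return re.sub(r"\d+", "#", line.strip().lower())


def strip_repeated_lines(pages: list[str], min_fraction: float = 0.6) -> list[str]:
    """Drop lines whose normalized form appears on at least min_fraction of pages."""
    counts: Counter[str] = Counter()
    for text in pages:
        # Use a set so each normalized line is counted once per page.
        counts.update({normalize(l) for l in text.splitlines() if l.strip()})

    # Never treat a line as boilerplate unless it appears on at least 2 pages.
    threshold = max(2, int(len(pages) * min_fraction))
    repeated = {key for key, n in counts.items() if n >= threshold}

    return [
        "\n".join(l for l in text.splitlines() if normalize(l) not in repeated)
        for text in pages
    ]
```

The digit normalization is a trade-off: it catches page counters, but it would also drop a line like "Total: 5" if the same label appeared on most pages with different numbers. Spot-check what gets removed before trusting it on a new corpus.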

The three parsers worth trying in 2026

1. PyMuPDF — fast, free, good enough for clean PDFs

If your PDFs are born-digital (exported from Word, Google Docs, or a PDF generation API), PyMuPDF handles them well. It's fast — about 50 pages per second on an M2 Mac — and the text comes out in reading order most of the time.

import fitz  # pip install pymupdf

def extract_text_pymupdf(pdf_path: str) -> list[dict]:
    """Extract text by page with metadata."""
    doc = fitz.open(pdf_path)
    pages = []
    for i, page in enumerate(doc):
        text = page.get_text("text")
        pages.append({
            "page": i + 1,
            "text": text.strip(),
            "char_count": len(text.strip())
        })
    return pages

pages = extract_text_pymupdf("knowledge_base/onboarding.pdf")
print(f"Extracted {len(pages)} pages, {sum(p['char_count'] for p in pages)} chars")

Where it falls apart: multi-column layouts, PDFs with embedded fonts that don't map to Unicode (surprisingly common in older academic papers), and anything scanned.
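A cheap guard is to inspect the extracted text and flag pages that look scanned or garbled, so they can be routed to a heavier parser. This is a rough heuristic sketch: the function names and thresholds are assumptions of mine, and it relies on the assumption that unmapped glyphs tend to surface as U+FFFD or private-use characters.

```python
import unicodedata


def suspicious_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that look like extraction debris."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 1.0
    bad = sum(
        1
        for ch in chars
        # U+FFFD is the replacement character; Co = private use,
        # Cc = control, Cn = unassigned.
        if ch == "\ufffd" or unicodedata.category(ch) in ("Co", "Cc", "Cn")
    )
    return bad / len(chars)


def needs_fallback(text: str, min_chars: int = 50, max_ratio: float = 0.05) -> bool:
    """True if the page is near-empty (likely scanned) or mostly unmappable glyphs."""
    return len(text.strip()) < min_chars or suspicious_ratio(text) > max_ratio
```

It will not detect multi-column interleaving, since that text is made of perfectly valid characters in the wrong order.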

2. Marker — layout-aware, handles complex documents

Marker is an open-source library from Vik Paruchuri (the guy behind Surya OCR). It uses a combination of ML models to understand document layout (columns, headers, tables, figures) before extracting text. The output is Markdown, which is great for RAG because it preserves structure.

# pip install marker-pdf
from marker.converters.pdf import PdfConverter
from marker.models import create_model_dict

models = create_model_dict()
converter = PdfConverter(artifact_dict=models)

result = converter("complex_report.pdf")
markdown_text = result.markdown

# Marker outputs clean markdown with headers, tables, etc.
print(markdown_text[:1000])

On a GPU, Marker processes about 5-8 pages per second. On CPU, expect 1-2 pages per second. That's roughly 6-10x slower than PyMuPDF on a GPU and 25-50x slower on CPU, but the quality difference on complex documents is significant.

The trade-off: Marker requires ~4GB of GPU memory for the layout models. If you're processing thousands of documents, you'll want a machine with a decent GPU or you'll be waiting a while.
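To make "waiting a while" concrete, here is a back-of-the-envelope estimate using the throughput figures quoted above. The 6,000-page corpus size is just an illustrative input, and real throughput will vary with hardware and document complexity.

```python
def estimated_minutes(total_pages: int, pages_per_second: float) -> float:
    """Wall-clock minutes to parse total_pages at a steady pages_per_second."""
    return total_pages / pages_per_second / 60


# Throughput figures as quoted in this article (pages per second).
RATES = {
    "PyMuPDF": 50,
    "Marker (GPU, fast end)": 8,
    "Marker (GPU, slow end)": 5,
    "Marker (CPU, fast end)": 2,
    "Marker (CPU, slow end)": 1,
}

if __name__ == "__main__":
    for name, rate in RATES.items():
        print(f"{name}: {estimated_minutes(6000, rate):.1f} min")
```

At these rates, a 6,000-page corpus takes about 2 minutes with PyMuPDF, 12.5 to 20 minutes with Marker on a GPU, and 50 to 100 minutes with Marker on CPU.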

3. LlamaParse — cloud-based, best quality, costs money

LlamaIndex's LlamaParse is a cloud API. You upload a PDF, it returns structured Markdown. Under the hood it uses multimodal models to understand the document, so it handles tables, figures, and weird layouts better than anything else I've tested.

from llama_parse import LlamaParse

parser = LlamaParse(
    api_key="llx-...",
    result_type="markdown",
    num_workers=4,
    language="en"
)

documents = parser.load_data("annual_report.pdf")

for doc in documents:
    print(doc.text[:500])
    print("---")

Pricing: 1,000 free pages/day, then $0.003/page. For my 400-document project (~6,000 pages), that's about $18. Not bad, but it adds up at scale.
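The $18 figure ignores the free allowance. Here is a small sketch of the arithmetic with the free tier included, using the prices quoted above as defaults (check current pricing before budgeting). It assumes you can spread the upload over several days and use the full free quota each day.

```python
def llamaparse_cost(
    total_pages: int,
    days: int = 1,
    free_pages_per_day: int = 1000,   # figure quoted in this article
    price_per_page: float = 0.003,    # figure quoted in this article
) -> float:
    """Estimated cost in dollars when the job is spread over `days` days."""
    free_pages = free_pages_per_day * days
    billable = max(0, total_pages - free_pages)
    return billable * price_per_page


if __name__ == "__main__":
    print(f"All in one day: ${llamaparse_cost(6000, days=1):.2f}")
    print(f"Spread over 6 days: ${llamaparse_cost(6000, days=6):.2f}")
```

Processing all 6,000 pages in one day comes to about $15, and trickling them through over six days fits inside the free tier entirely.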

The real advantage: LlamaParse handles tables remarkably well. Where PyMuPDF gives you "Q1 Q2 Q3 Q4 Revenue 1.2 1.4 1.5 1.8", LlamaParse gives you a proper Markdown table. That matters a lot when your users ask questions about specific numbers.
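Once a table arrives as Markdown, you can go one step further and turn each row into a self-describing line, so a chunk that contains only one row still carries its column names. This is a sketch of that idea, not something LlamaParse does for you. The helper names are made up, and it only handles simple pipe tables with no escaped pipes inside cells.

```python
def parse_markdown_table(table: str) -> list[dict[str, str]]:
    """Parse a simple Markdown pipe table into one dict per data row."""
    lines = [l.strip() for l in table.strip().splitlines() if l.strip()]

    def cells(line: str) -> list[str]:
        return [c.strip() for c in line.strip("|").split("|")]

    header = cells(lines[0])
    # lines[1] is the |---|---| separator row, so data starts at index 2.
    return [dict(zip(header, cells(line))) for line in lines[2:]]


def row_to_sentence(row: dict[str, str], key_column: str) -> str:
    """Render a row as 'Label: col = value; col = value' for embedding."""
    facts = "; ".join(f"{k} = {v}" for k, v in row.items() if k != key_column)
    return f"{row[key_column]}: {facts}"


if __name__ == "__main__":
    table = """
    | Metric | Q1 | Q2 | Q3 | Q4 |
    |---|---|---|---|---|
    | Revenue | 1.2 | 1.4 | 1.5 | 1.8 |
    """
    for row in parse_markdown_table(table):
        print(row_to_sentence(row, "Metric"))
```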

The chunking problem (this is where most RAG systems fail)

Extracting text is step one. Step two is splitting it into chunks for embedding. Get this wrong and your retrieval will be useless no matter how good your embeddings are.

Bad chunking: fixed-size splits

This is what most tutorials show:

# DON'T DO THIS for anything serious
def naive_chunk(text: str, chunk_size: int = 500) -> list[str]:
    words = text.split()
    chunks = []
    for i in range(0, len(words), chunk_size):
        chunks.append(" ".join(words[i:i + chunk_size]))
    return chunks

This splits mid-sentence, mid-paragraph, mid-table. A chunk might start with "million in Q3" and contain the first half of a policy statement. The embedding captures none of the actual meaning.

Better chunking: recursive with overlap

from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,      # characters, not tokens
    chunk_overlap=200,
    separators=["\n\n", "\n", ". ", " "],
    length_function=len
)

chunks = splitter.split_text(extracted_markdown)
print(f"Created {len(chunks)} chunks")
print(f"Avg chunk size: {sum(len(c) for c in chunks) / len(chunks):.0f} chars")

This tries to split on paragraph boundaries first, then line breaks, then sentences, then words. The 200-character overlap means context isn't lost at boundaries. For most use cases, this is the sweet spot.
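If the "try the coarsest separator first, fall back to finer ones" idea feels abstract, here is a stripped-down, standard-library sketch of it. This is not LangChain's implementation. It leaves out overlap entirely and exists only to show the recursion.

```python
def split_recursive(text: str, max_len: int, separators: list[str]) -> list[str]:
    """Split text into pieces of at most max_len chars, preferring coarse separators."""
    if len(text) <= max_len:
        return [text] if text.strip() else []
    if not separators:
        # Nothing left to split on: fall back to a hard cut.
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]

    sep, finer = separators[0], separators[1:]
    pieces: list[str] = []
    current = ""
    for part in text.split(sep):
        candidate = part if not current else current + sep + part
        if len(candidate) <= max_len:
            current = candidate          # keep packing parts into this piece
            continue
        if current:
            pieces.append(current)       # piece is full, emit it
            current = ""
        if len(part) <= max_len:
            current = part
        else:
            # This part alone is too big: retry with the next, finer separator.
            pieces.extend(split_recursive(part, max_len, finer))
    if current:
        pieces.append(current)
    return pieces


if __name__ == "__main__":
    sample = "Para one is short.\n\nPara two is also short.\n\n" + "word " * 50
    for piece in split_recursive(sample, 60, ["\n\n", "\n", ". ", " "]):
        print(len(piece), repr(piece[:30]))
```

The two short paragraphs stay together in one piece because they fit, while the long run of words falls all the way through to the space separator.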

Best chunking: section-aware with metadata

If your parser outputs Markdown (Marker and LlamaParse both do), you can chunk by section headers:

import re

from langchain_text_splitters import RecursiveCharacterTextSplitter

def chunk_by_sections(markdown: str, max_chunk_size: int = 1500) -> list[dict]:
    """Split markdown into chunks by headers, tagging each chunk with its nearest header."""
    sections = re.split(r'(^#{1,3}\s+.+$)', markdown, flags=re.MULTILINE)

    chunks = []
    current_header = ""
    current_text = ""

    for part in sections:
        if re.match(r'^#{1,3}\s+', part):
            # Save previous chunk if it has content
            if current_text.strip():
                chunks.append({
                    "header": current_header,
                    "text": current_text.strip(),
                    "char_count": len(current_text.strip())
                })
            current_header = part.strip()
            current_text = ""
        else:
            current_text += part

    # Don't forget the last section
    if current_text.strip():
        chunks.append({
            "header": current_header,
            "text": current_text.strip(),
            "char_count": len(current_text.strip())
        })

    # Split oversized chunks
    final_chunks = []
    for chunk in chunks:
        if chunk["char_count"] > max_chunk_size:
            sub_splitter = RecursiveCharacterTextSplitter(
                chunk_size=max_chunk_size,
                chunk_overlap=150
            )
            sub_texts = sub_splitter.split_text(chunk["text"])
            for sub in sub_texts:
                final_chunks.append({
                    "header": chunk["header"],
                    "text": sub,
                    "char_count": len(sub)
                })
        else:
            final_chunks.append(chunk)

    return final_chunks

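The header metadata is gold for retrieval, as long as it actually reaches the retriever: prepend the header to the chunk text before embedding, or index it as a filterable field. When a user asks "what is the refund policy?", a chunk tagged with ## Refund Policy will then score higher than a random paragraph that mentions refunds in passing.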
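A single header often isn't enough context on its own ("## Exceptions" under which policy?). One option is to track the full header path and prepend it to the text you embed. Here is a minimal sketch of that. The function name and the "path" and "embed_text" fields are my own naming, and it will misread "#" lines inside code fences as headers.

```python
import re

HEADER_RE = re.compile(r"^(#{1,3})\s+(.+?)\s*$")


def sections_with_breadcrumbs(markdown: str) -> list[dict]:
    """Split markdown by #/##/### headers, recording the full header path."""
    stack: list[tuple[int, str]] = []   # (level, title) of the open headers
    body: list[str] = []
    sections: list[dict] = []

    def flush() -> None:
        text = "\n".join(body).strip()
        if text:
            path = " > ".join(title for _, title in stack)
            sections.append({
                "path": path,
                "text": text,
                # What you would actually send to the embedding model.
                "embed_text": f"{path}\n\n{text}" if path else text,
            })
        body.clear()

    for line in markdown.splitlines():
        match = HEADER_RE.match(line)
        if match:
            flush()  # the body collected so far belongs to the previous header
            level = len(match.group(1))
            # Close headers at the same or a deeper level.
            while stack and stack[-1][0] >= level:
                stack.pop()
            stack.append((level, match.group(2)))
        else:
            body.append(line)
    flush()
    return sections
```

With this, a paragraph under "## Refund Policy" inside "# Policies" is embedded with "Policies > Refund Policy" on its first line.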

Full pipeline: PDF to embeddings

Here's a complete script that takes a folder of PDFs and produces chunks ready for a vector store:

import fitz
import os
import json
from langchain_text_splitters import RecursiveCharacterTextSplitter

def extract_and_chunk(pdf_dir: str, chunk_size: int = 1000) -> list[dict]:
    """Extract text from all PDFs in a directory and chunk them."""
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=200,
        separators=["\n\n", "\n", ". ", " "]
    )

    all_chunks = []

    for filename in sorted(os.listdir(pdf_dir)):  # sorted for reproducible chunk order
        if not filename.lower().endswith(".pdf"):  # also match .PDF
            continue

        filepath = os.path.join(pdf_dir, filename)
        doc = fitz.open(filepath)

        for page_num, page in enumerate(doc):
            text = page.get_text("text").strip()
            if len(text) < 50:  # skip near-empty pages
                continue

            # Remove common noise (crude: this also drops any body line containing "Confidential")
            lines = text.split("\n")
            cleaned = "\n".join(
                line for line in lines
                if not line.strip().startswith("Page ")
                and "Confidential" not in line
            )

            page_chunks = splitter.split_text(cleaned)

            for i, chunk_text in enumerate(page_chunks):
                all_chunks.append({
                    "id": f"{filename}:p{page_num + 1}:c{i}",
                    "source": filename,
                    "page": page_num + 1,
                    "text": chunk_text,
                    "char_count": len(chunk_text)
                })

        doc.close()

    return all_chunks

# Run it
chunks = extract_and_chunk("./knowledge_base")
print(f"Total chunks: {len(chunks)}")
# max(..., 1) avoids a ZeroDivisionError when the folder has no usable PDFs
print(f"Avg size: {sum(c['char_count'] for c in chunks) / max(len(chunks), 1):.0f} chars")

# Save for embedding
with open("chunks.json", "w") as f:
    json.dump(chunks, f, indent=2)
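One limitation of the script above: it chunks page by page, so a sentence that runs across a page break still gets cut in two. A possible workaround is to join the pages into one string, chunk that, and map each chunk back to a page by character offset. The helpers below are a standard-library sketch of the bookkeeping, and the names are hypothetical.

```python
from bisect import bisect_right


def join_pages(pages: list[str]) -> tuple[str, list[int]]:
    """Concatenate page texts with newlines; also return each page's start offset."""
    starts: list[int] = []
    pos = 0
    for text in pages:
        starts.append(pos)
        pos += len(text) + 1  # +1 for the newline that joins pages
    return "\n".join(pages), starts


def page_for_offset(starts: list[int], offset: int) -> int:
    """1-based page number that contains the given character offset."""
    return bisect_right(starts, offset)


if __name__ == "__main__":
    full_text, starts = join_pages(["abc", "defgh", "ij"])
    print(repr(full_text), starts)
    print(page_for_offset(starts, full_text.find("gh")))
```

To get each chunk's offset you can search for it with full_text.find(chunk_text), bearing in mind that repeated passages will match their first occurrence. I believe LangChain's splitters can also record offsets via add_start_index=True together with create_documents, but verify that against the version you have installed.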

Then embed with whatever provider you prefer:

import openai
import json

client = openai.OpenAI()

with open("chunks.json") as f:
    chunks = json.load(f)

# Batch embedding (OpenAI supports up to 2048 inputs per request)
batch_size = 100
all_embeddings = []

for i in range(0, len(chunks), batch_size):
    batch = chunks[i:i + batch_size]
    texts = [c["text"] for c in batch]

    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=texts
    )

    for j, emb in enumerate(response.data):
        chunks[i + j]["embedding"] = emb.embedding

    print(f"Embedded {min(i + batch_size, len(chunks))}/{len(chunks)}")

# Save with embeddings
with open("chunks_embedded.json", "w") as f:
    json.dump(chunks, f)

For 6,000 pages of documents, this produces about 15,000-20,000 chunks. Embedding with text-embedding-3-small costs roughly $0.40 total. The parsing is the expensive part — either in compute time (Marker) or API cost (LlamaParse).
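Before wiring up a vector store, it's worth sanity-checking retrieval directly against the saved embeddings. At 15,000-20,000 chunks, a brute-force cosine similarity scan is slow in pure Python but still workable for a spot check. This sketch assumes each chunk dict has an "embedding" list, as in the script above. The function names are mine, and the query embedding has to come from the same model as the chunks.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_embedding: list[float], chunks: list[dict], k: int = 3) -> list[dict]:
    """Return the k chunks whose embeddings are most similar to the query."""
    ranked = sorted(
        chunks,
        key=lambda c: cosine(query_embedding, c["embedding"]),
        reverse=True,
    )
    return ranked[:k]
```

If the top results for a handful of real questions are footers, table fragments, or half sentences, the problem is upstream in parsing or chunking, not in the retriever.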

Quick comparison

| Parser | Speed | Quality (clean PDFs) | Quality (complex) | Cost | GPU needed |
|---|---|---|---|---|---|
| PyMuPDF | ~50 pages/sec | 9/10 | 4/10 | Free (AGPL) | No |
| Marker | ~1-8 pages/sec (CPU to GPU) | 9/10 | 8/10 | Free (GPL) | Recommended |
| LlamaParse | ~5 pages/sec | 9/10 | 9/10 | $0.003/page | No |

My rule of thumb: if your documents are clean born-digital PDFs, PyMuPDF is plenty. If you're dealing with scans, multi-column layouts, or complex tables, start with Marker and move to LlamaParse if you need better table handling.
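If you're processing a mixed corpus, that rule of thumb can be written down as a tiny routing function. The flags are things you would have to detect or label yourself, and the function and parameter names are hypothetical. This only encodes the decision, not the detection.

```python
def choose_parser(
    scanned: bool = False,
    multi_column: bool = False,
    complex_tables: bool = False,
    marker_tables_good_enough: bool = True,
) -> str:
    """Pick a parser name following the rule of thumb in this article."""
    if not (scanned or multi_column or complex_tables):
        return "pymupdf"      # clean, born-digital PDF
    if complex_tables and not marker_tables_good_enough:
        return "llamaparse"   # escalate only when tables still come out wrong
    return "marker"
```

Starting everything on the cheapest parser and escalating only the documents that fail keeps both GPU time and API spend down.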

The part nobody talks about: PDF generation quality affects RAG quality

Here's something I wish I'd realized earlier. Half my RAG parsing problems came from how the PDFs were generated in the first place.

PDFs generated from HTML with a real rendering engine produce clean, extractable text. The text layer matches what you see. Headers are actual headers. Tables have proper row/column structure.

PDFs generated from screenshots, or tools that render text as vector paths, or drag-and-drop design tools? Those are parsing nightmares. The "text" is often a series of positioned glyphs with no logical reading order.

If you're building a system that generates PDFs that will later be indexed for RAG — reports, documentation, invoices — the generation step matters as much as the parsing step. Using an HTML-to-PDF approach (like LightningPDF) means your PDFs will have clean text layers with proper document structure. You can use semantic HTML — real <h1> tags, <table> elements, <p> tags — and those translate directly into a well-structured PDF that any parser can extract cleanly.
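As an illustration of what "semantic HTML" means in practice, here is a small sketch that builds a report page with a real heading, paragraph, and table header row. It only produces the HTML string. How you hand that to a renderer depends on the tool and is deliberately left out, and the function name is made up.

```python
from html import escape


def render_report(title: str, intro: str, headers: list[str], rows: list[list]) -> str:
    """Build a minimal semantic HTML document: h1, p, and a table with thead/tbody."""
    head = "".join(f"<th>{escape(h)}</th>" for h in headers)
    body = "".join(
        "<tr>" + "".join(f"<td>{escape(str(cell))}</td>" for cell in row) + "</tr>"
        for row in rows
    )
    return (
        "<!DOCTYPE html><html><body>"
        f"<h1>{escape(title)}</h1>"
        f"<p>{escape(intro)}</p>"
        f"<table><thead><tr>{head}</tr></thead><tbody>{body}</tbody></table>"
        "</body></html>"
    )


if __name__ == "__main__":
    print(render_report("Q3 Report", "Revenue grew.", ["Quarter", "Revenue"], [["Q3", "1.5M"]]))
```

The point is the contrast with absolutely positioned divs or text baked into images: here the heading is a heading and every cell sits in a row, which gives a downstream parser a much better chance of recovering the structure.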

It's a lot easier to generate a good PDF than to fix a bad one after the fact.

LightningPDF Team

Building fast, reliable PDF generation tools for developers.
