By Kashif Ullah · 10 min read
  • #ocr
  • #tesseract
  • #langchain
  • #pdf

Extracting Structured Data from PDFs with Tesseract and LangChain

A two-stage pipeline that turns scanned invoices, contracts, and forms into typed JSON your back office can actually use.

PDF extraction is one of those problems where you can either ship in a week or lose six months. The trap is reaching for a single tool — pure OCR is brittle, pure LLM is expensive and hallucinates, pure regex collapses on the second new vendor format. The fix is a two-stage pipeline, with each stage doing what it’s good at. I’ve built this pipeline for invoices, legal contracts, medical forms, and logistics manifests, and the architecture holds up across all of them.

Here’s the pipeline at a glance — the same diagram I sketch on a call when a client asks “wait, why two stages?”:

   PDF / scan
        │
        ▼
 ┌───────────────┐   has embedded
 │ text-layer?   │── text ──▶ skip OCR ─┐
 └──────┬────────┘                      │
        │ scanned                       │
        ▼                               │
 ┌───────────────┐                      │
 │ preprocess    │ deskew · threshold   │
 │ (OpenCV)      │ · upscale 400dpi     │
 └──────┬────────┘                      │
        ▼                               │
 ┌───────────────┐                      │
 │ Tesseract     │ image_to_data →      │
 │ + bbox/conf   │ words + coords       │
 └──────┬────────┘                      │
        └──────────────┬────────────────┘
                       ▼
              ┌──────────────────┐
              │ LLM normalize    │ Pydantic schema
              │ (do NOT infer)   │ + self-rating
              └────────┬─────────┘
                       ▼
            ┌─────────────────────┐
            │ confidence < 0.7 ?  │── yes ─▶ human review queue
            └────────┬────────────┘
                     │ no
                     ▼
                typed JSON

Why Single-Tool Approaches Fail

Before diving into the solution, it’s worth understanding why the obvious approaches fall short at scale.

Pure OCR (Tesseract alone, no post-processing) gives you raw text with no structure. You get “Invoice Total 1,234.56” as a flat string, and now you need regex or heuristics to figure out which number is the total, which is the invoice number, and which is the date. This works for one document format. It breaks the moment a second vendor sends invoices with a different layout.

Pure LLM (send the entire page image to GPT-4o or Claude) is magical in demos and painful in production. It’s 10–100× more expensive per page than OCR. It hallucinates values — I’ve seen it invent invoice numbers that look plausible but don’t exist in the document. And you give up your data-residency story, which matters for legal, medical, and financial documents.

Pure regex works until it doesn’t. Business documents are messy: inconsistent spacing, varying date formats, OCR artifacts that turn “0” into “O.” One client had invoices from 47 different vendors, each with a unique layout. Regex rules for all 47 would be an unmaintainable nightmare.

The two-stage pipeline solves this by letting OCR handle what it’s good at (reading pixels into text with spatial awareness) and letting the LLM handle what it’s good at (understanding messy text and mapping it to a schema).

Stage 1: OCR with Layout Awareness

Tesseract gets unfairly maligned. With the right page-segmentation mode and a little preprocessing, it’s accurate enough for 95% of typed business documents — and it runs locally, for free, with no API rate limits or per-page charges.

Preprocessing That Actually Matters

The quality of your OCR output depends almost entirely on image preprocessing. Three steps make the biggest difference:

  1. Deskew — even a 2-degree rotation kills accuracy on dense text. I use OpenCV’s minAreaRect on detected text regions to calculate the skew angle, then rotate the image to correct it.

  2. Adaptive thresholding — for scanned documents with uneven lighting or shadows. cv2.adaptiveThreshold with a Gaussian method and a block size tuned to your document’s font size converts the image to clean black-on-white text.

  3. Resolution upscaling — Tesseract needs at least 300 DPI for reliable results. For PDFs, I render pages at 400 DPI using pdf2image. For already-rasterized images, I upscale with Lanczos interpolation if the effective DPI is below 300.
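A minimal sketch of steps 2 and 3, assuming a grayscale uint8 image already loaded as a NumPy array. The helper names, the "twice the glyph height" block-size rule, and the constant C=15 are my own starting points to tune, not values from the pipeline described here. Deskew is left out on purpose: the angle convention returned by cv2.minAreaRect has changed between OpenCV releases, so that step needs checking against the installed version.

```python
def odd_block_size(font_height_px: int) -> int:
    """Block size for adaptive thresholding: must be odd and at least 3.

    Assumption: a window of roughly twice the glyph height is a reasonable
    first guess; tune it on real scans.
    """
    size = max(3, 2 * font_height_px)
    return size if size % 2 == 1 else size + 1


def upscale_factor(effective_dpi: float, target_dpi: float = 300.0) -> float:
    """Resize factor needed to reach target_dpi (1.0 means leave as is)."""
    if effective_dpi <= 0:
        raise ValueError("effective_dpi must be positive")
    return max(1.0, target_dpi / effective_dpi)


def preprocess(gray, effective_dpi: float, font_height_px: int = 12):
    """Upscale (Lanczos) and binarize a grayscale uint8 image."""
    import cv2  # imported lazily so the helpers above work without OpenCV

    factor = upscale_factor(effective_dpi)
    if factor > 1.0:
        gray = cv2.resize(gray, None, fx=factor, fy=factor,
                          interpolation=cv2.INTER_LANCZOS4)
        font_height_px = round(font_height_px * factor)

    # Gaussian-weighted local threshold copes with uneven lighting.
    return cv2.adaptiveThreshold(
        gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY,
        odd_block_size(font_height_px), 15,  # C=15 is a guess to tune
    )
```

For PDFs, pdf2image's convert_from_path(path, dpi=400) would produce the page images before this step, in which case the upscale branch is skipped.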

Use image_to_data, Not image_to_string

This is the single most important Tesseract tip I can give. pytesseract.image_to_string(image) gives you flat text. pytesseract.image_to_data(image, output_type=Output.DATAFRAME) gives you every word with its bounding box coordinates, confidence score, and block/paragraph/line grouping. The DATAFRAME output type requires pandas to be installed.

With bounding boxes, you can reconstruct tables by clustering words into rows and columns based on their Y and X coordinates. You can group line items. You can identify headers vs. values based on their position on the page. And critically, you know where on the page each value came from, which matters for the confidence scoring I’ll describe in Stage 2.
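Row reconstruction from bounding boxes can be sketched in a few lines. This assumes each word is a dict with text, left, top, width, and height keys (a shape I chose for illustration) and that an 8-pixel tolerance on the vertical center is enough to separate lines; both would need tuning per document type.

```python
from typing import Dict, List


def cluster_rows(words: List[Dict], y_tolerance: float = 8.0) -> List[List[Dict]]:
    """Group words into rows by vertical center, then order each row left to right."""

    def center_y(word: Dict) -> float:
        return word["top"] + word["height"] / 2

    rows: List[List[Dict]] = []
    last_y = None
    for word in sorted(words, key=center_y):
        y = center_y(word)
        # Start a new row when the jump from the previous word is too large.
        if last_y is None or y - last_y > y_tolerance:
            rows.append([])
        rows[-1].append(word)
        last_y = y
    return [sorted(row, key=lambda w: w["left"]) for row in rows]
```

Comparing against the previous word rather than a fixed grid lets slightly slanted lines stay together, at the cost of merging rows if line spacing drops below the tolerance.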

Here’s what image_to_data actually returns versus the flat-string version — the difference that makes everything downstream possible:

$ python ocr_demo.py invoice_0042.png
# image_to_string()  →  unusable flat text:
"ACME LTD Invoice 00042 Date 2026-03-01 Total 1,234.56"

# image_to_data()  →  structured, with coordinates + confidence:
 level  text         conf   left  top   width  height
 5      Invoice      96.4   412   88    74     19
 5      00042        91.1   494   88    52     19
 5      Date         95.8   412   118   38     18
 5      2026-03-01   88.2   460   118   96     18
 5      Total        97.0   412   642   46     20
 5      1,234.56     72.3   470   642   84     20   ← low conf, flag it

That 72.3 on the total is exactly the kind of field I auto-route to review — the number a human should eyeball before it hits the accounting system.
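A rough sketch of how word records like those above could be pulled out of pytesseract and screened. It uses Output.DICT so pandas is not needed. The function names and the cut-off of 75 are illustrative assumptions, not the thresholds used in the deployments described here.

```python
from typing import Dict, List


def words_from_data(data: Dict[str, list]) -> List[Dict]:
    """Turn pytesseract's column-oriented Output.DICT into one dict per word."""
    words = []
    for i, text in enumerate(data["text"]):
        conf = float(data["conf"][i])  # some versions return strings
        # Tesseract reports -1 for boxes that are not words (blocks, lines).
        if conf < 0 or not str(text).strip():
            continue
        words.append({
            "text": text,
            "conf": conf,
            "left": data["left"][i],
            "top": data["top"][i],
            "width": data["width"][i],
            "height": data["height"][i],
        })
    return words


def ocr_words(image_path: str) -> List[Dict]:
    """Run Tesseract on an image file and return word records."""
    import pytesseract  # lazy imports: only needed when OCR actually runs
    from PIL import Image
    from pytesseract import Output

    data = pytesseract.image_to_data(Image.open(image_path),
                                     output_type=Output.DICT)
    return words_from_data(data)


def low_confidence(words: List[Dict], threshold: float = 75.0) -> List[Dict]:
    """Words whose OCR confidence (0-100) falls below the threshold."""
    return [w for w in words if w["conf"] < threshold]
```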

Handling Tables

Tables are the hardest part of document extraction. My approach:

  1. Detect table boundaries using horizontal and vertical line detection with OpenCV morphological operations.
  2. Within each table cell, run Tesseract on the cropped region.
  3. Reconstruct the table as a list of dictionaries, using the first row as headers.
  4. If line detection fails (borderless tables), fall back to column clustering based on X-coordinate gaps between word groups.

This isn’t perfect — deeply nested tables and merged cells still cause problems — but it handles 90% of business documents correctly.
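The borderless-table fallback in step 4 can be approximated by merging the horizontal extents of all words in the table region and treating each merged span as a column. This is a sketch under my own assumptions: the word-dict shape and the 20-pixel minimum gap are illustrative, and a header that spans several columns would wrongly merge them.

```python
from typing import Dict, List, Tuple


def column_spans(words: List[Dict], min_gap: int = 20) -> List[Tuple[int, int]]:
    """Merge word x-extents; a gap of at least min_gap px starts a new column."""
    extents = sorted((w["left"], w["left"] + w["width"]) for w in words)
    merged: List[List[int]] = []
    for start, end in extents:
        if merged and start - merged[-1][1] < min_gap:
            merged[-1][1] = max(merged[-1][1], end)  # same column, widen it
        else:
            merged.append([start, end])
    return [(s, e) for s, e in merged]


def column_index(word: Dict, spans: List[Tuple[int, int]]) -> int:
    """Index of the column whose span contains the word's horizontal center."""
    center = word["left"] + word["width"] / 2
    for i, (start, end) in enumerate(spans):
        if start <= center <= end:
            return i
    raise ValueError("word does not fall inside any column span")
```

Combined with a row grouping on the Y axis, the (row, column) pair for each word is enough to rebuild the list-of-dictionaries form from step 3.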

Stage 2: LLM Normalization with a Schema

Once you have structured text with spatial metadata, the LLM’s job is small and well-defined: take messy OCR output and a target Pydantic schema, return validated JSON. I use LangChain’s with_structured_output() to enforce the schema at the model level.

The prompt template I use follows this pattern:

You are a document extraction system. Given the OCR text below,
extract the requested fields into the provided JSON schema.
Only extract values that are explicitly present in the text.
If a field is not found, set it to null.
Do not infer or calculate values.

The instruction “do not infer or calculate values” is critical. Without it, the LLM will helpfully compute subtotals, guess missing dates, and fill in fields based on what “makes sense.” That’s exactly the kind of hallucination that causes real problems in accounting and legal workflows.
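To make the schema side concrete, here is a hedged sketch of a target model and the LangChain call. The Invoice fields and the flat "_rating" naming are hypothetical choices of mine. The llm argument is assumed to be any LangChain chat model that supports with_structured_output and returns an instance of the Pydantic class passed to it.

```python
from datetime import date
from decimal import Decimal
from typing import Optional

from pydantic import BaseModel, Field

SYSTEM_PROMPT = (
    "You are a document extraction system. Given the OCR text below, "
    "extract the requested fields into the provided JSON schema. "
    "Only extract values that are explicitly present in the text. "
    "If a field is not found, set it to null. "
    "Do not infer or calculate values."
)


class Invoice(BaseModel):
    """Hypothetical target schema; field names are illustrative."""

    invoice_number: Optional[str] = None
    invoice_date: Optional[date] = None
    total: Optional[Decimal] = None
    # The model's own 1-5 confidence for each field above.
    invoice_number_rating: int = Field(..., ge=1, le=5)
    invoice_date_rating: int = Field(..., ge=1, le=5)
    total_rating: int = Field(..., ge=1, le=5)


def extract(llm, ocr_text: str) -> Invoice:
    """Ask the model for an Invoice; validation happens in the schema."""
    structured = llm.with_structured_output(Invoice)
    return structured.invoke([("system", SYSTEM_PROMPT), ("human", ocr_text)])
```

Keeping every value Optional gives the model a legal way to say "not found", which is what the prompt asks for instead of a guess.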

Per-Field Confidence Scoring

This is the feature that makes the system usable in real back offices. For every extracted field, I compute two confidence signals:

  1. OCR confidence — the average Tesseract confidence score for the words in the region where the value was found.
  2. LLM self-rating — I ask the model to rate its confidence on a 1–5 scale for each field as part of the structured output schema.

Tesseract reports confidence on a 0–100 scale and the self-rating is 1–5, so I rescale both to 0–1, average them, and surface any field below a threshold (typically 0.7) for human review. In practice, this means a human reviewer only needs to check 5–15% of fields instead of reviewing every document end-to-end. One client processing 500 invoices per day reduced their review time from 4 hours to 35 minutes using this approach.
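One way the two signals might be combined, as a sketch. The linear mapping of the 1–5 rating onto 0–1 is my assumption, since the exact rescaling is not spelled out above. The function names are hypothetical.

```python
from typing import Dict, List, Tuple


def combined_confidence(ocr_conf: float, self_rating: int) -> float:
    """Average of OCR confidence (0-100) and LLM self-rating (1-5), both on 0-1."""
    if not 1 <= self_rating <= 5:
        raise ValueError("self_rating must be between 1 and 5")
    ocr = min(max(ocr_conf, 0.0), 100.0) / 100.0
    llm = (self_rating - 1) / 4.0  # assumed mapping: 1 -> 0.0, 5 -> 1.0
    return (ocr + llm) / 2.0


def fields_for_review(fields: Dict[str, Tuple[float, int]],
                      threshold: float = 0.7) -> List[str]:
    """Names of fields whose combined confidence is below the threshold."""
    return sorted(
        name for name, (ocr_conf, rating) in fields.items()
        if combined_confidence(ocr_conf, rating) < threshold
    )
```

With this mapping, a field read at 72.3 OCR confidence that the model rates 3 out of 5 lands in the review queue, while one read at 91.1 and rated 5 passes.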

What About Pure Multimodal Models?

Sending a PDF page directly to GPT-4o or Gemini Vision works and is sometimes the right call — especially for low-volume, high-variance documents where building an OCR pipeline isn’t worth the investment.

But for production workloads, consider:

  • Cost: Processing 10,000 pages per month through GPT-4o Vision costs roughly $200–400. The same volume through Tesseract + a small LLM call for normalization costs $15–30.
  • Hallucination: Multimodal models still hallucinate values, especially numbers. The OCR-first approach gives you a ground-truth text layer to validate against.
  • Data residency: Tesseract runs entirely on your infrastructure. For HIPAA, SOC 2, or GDPR-regulated documents, this matters.
  • Latency: OCR + schema extraction takes 2–4 seconds per page. Vision API calls take 5–15 seconds and are subject to rate limits.

For high-volume work, the two-stage approach wins. For low-volume “we get a weird PDF once a week,” the multimodal route is fine.

To make the choice concrete: on my last invoicing project I chose Tesseract + a small normalization LLM over GPT-4o Vision, even though Vision would have meant less code to write. The client processed ~12,000 pages a month. Vision would have run roughly $300/month and shipped every page to a third party; my pipeline ran about $25/month and kept the document bytes on their own infrastructure, which their compliance team required. I traded a week of pipeline engineering for ~$3,300/year in savings and a clean data-residency story. If they’d been doing 200 pages a month, I’d have told them to just use Vision and skip the pipeline entirely — the engineering wouldn’t have paid for itself.
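The arithmetic behind that trade-off is simple enough to keep as a helper. The monthly figures are the ones quoted above. The engineering cost passed to break_even_months is a placeholder you would supply yourself, since no price is given for the week of work.

```python
def annual_savings(vision_monthly: float, pipeline_monthly: float) -> float:
    """Yearly difference between the two monthly running costs."""
    return (vision_monthly - pipeline_monthly) * 12


def break_even_months(engineering_cost: float,
                      vision_monthly: float,
                      pipeline_monthly: float) -> float:
    """Months until the one-off engineering cost is recovered."""
    monthly_saving = vision_monthly - pipeline_monthly
    if monthly_saving <= 0:
        return float("inf")  # the pipeline never pays for itself
    return engineering_cost / monthly_saving


print(annual_savings(300, 25))  # 3300, matching the figure above
```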


Production Checklist

Before deploying a document extraction pipeline, verify these items:

  1. Schema first. Define your Pydantic model before you write a line of OCR code. The schema drives everything: what you extract, what you validate, what you surface for review.
  2. Test set. Twenty real documents, hand-labeled with expected outputs. That’s your evaluation baseline. Run it on every code change.
  3. Confidence in, confidence out. Every field gets a score. Every document gets an aggregate score. Route low-confidence documents to human review automatically.
  4. Audit trail. Store the original page image, the raw OCR output, and the final extracted JSON together. You’ll need all three the first time someone asks “where did that number come from?”
  5. Format detection. Not every PDF needs OCR. Digital-native PDFs (created from Word, LaTeX, or HTML) have an embedded text layer. Check for it first with pdfplumber or PyMuPDF and skip OCR if the text layer is present. It’s faster and more accurate.
  6. Error handling. Some pages will be blank, upside-down, or in a language Tesseract doesn’t support. Detect these cases early and route them to a manual queue instead of producing garbage output.
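The format-detection check in item 5 could look roughly like this with PyMuPDF. Deciding per page rather than per document, and the 20-character minimum, are my assumptions; scanned PDFs sometimes carry a few stray characters of embedded text, so an emptiness check alone can mislead.

```python
from typing import List


def pages_needing_ocr(page_texts: List[str], min_chars: int = 20) -> List[int]:
    """Indexes of pages whose embedded text is too short to trust."""
    return [
        i for i, text in enumerate(page_texts)
        if len("".join(text.split())) < min_chars  # ignore whitespace
    ]


def read_text_layer(pdf_path: str) -> List[str]:
    """Embedded text of every page, via PyMuPDF."""
    import fitz  # PyMuPDF; imported lazily so the check above has no dependency

    with fitz.open(pdf_path) as doc:
        return [page.get_text() for page in doc]
```

Calling pages_needing_ocr(read_text_layer(path)) gives the list of pages to send down the OCR branch; everything else can go straight to normalization.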

Real-World Performance Numbers

Across three production deployments handling invoices, purchase orders, and shipping manifests:

  • Accuracy on key fields (invoice number, date, total, vendor name): 94–97% fully automated, 99.5%+ after human review of flagged fields
  • Processing speed: 2.8 seconds per page average (OCR + LLM extraction)
  • Human review rate: 8–12% of fields flagged, taking an average of 6 seconds per flagged field to verify
  • Cost per page: $0.003–0.008 depending on LLM choice (GPT-4o-mini vs. Claude Haiku)

Frequently Asked Questions

Can this pipeline handle handwritten documents?

Tesseract’s handwriting recognition is limited. For handwritten content, I use a specialized model like Google’s Cloud Vision API or a fine-tuned TrOCR model as a drop-in replacement for Tesseract in Stage 1. The rest of the pipeline — schema extraction, confidence scoring, human review — works the same way regardless of the OCR engine.

How do you handle multi-language documents?

Tesseract supports over 100 languages. I auto-detect the language using langdetect on the first page’s raw text, then pass the appropriate language pack to Tesseract. For documents that mix languages (English headers with Arabic or Urdu content), I run Tesseract in multi-language mode with both language packs loaded simultaneously.
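A small sketch of the mapping step, assuming langdetect's ISO 639-1 codes need translating to Tesseract's three-letter pack names. The table covers only the languages mentioned above, and the 0.2 probability floor is an arbitrary illustrative value.

```python
from typing import Iterable, List

# Only the languages mentioned in this post; extend as needed.
ISO_TO_TESSERACT = {"en": "eng", "ar": "ara", "ur": "urd"}


def tesseract_lang(iso_codes: Iterable[str], default: str = "eng") -> str:
    """Build Tesseract's lang string, e.g. 'eng+ara', from ISO 639-1 codes."""
    packs: List[str] = []
    for code in iso_codes:
        pack = ISO_TO_TESSERACT.get(code)
        if pack and pack not in packs:
            packs.append(pack)
    return "+".join(packs) if packs else default


def detect_codes(text: str, min_prob: float = 0.2) -> List[str]:
    """Languages langdetect considers plausible for the text."""
    from langdetect import detect_langs  # lazy import: optional dependency

    return [guess.lang for guess in detect_langs(text) if guess.prob >= min_prob]
```

The result is passed as the lang argument, for example pytesseract.image_to_data(image, lang=tesseract_lang(codes)). The matching language packs must be installed alongside Tesseract.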

What’s the minimum number of sample documents needed to build a pipeline?

You can build and deploy a working pipeline with as few as 10–15 sample documents from each format you need to handle. The schema-driven approach means you don’t need thousands of training examples — you need enough samples to define your Pydantic schema and build a representative test set.

How do you handle PDF forms with fillable fields?

Fillable PDF forms are actually easier than scanned documents. I extract the form field values directly using PyMuPDF or pdfrw, with no OCR needed. The LLM normalization step still applies if the field values need cleaning or type conversion, but the extraction itself is deterministic: you get exactly what is stored in each field, so the only errors left are the ones the person filling in the form made.

Does this work for tables that span multiple pages?

Yes, but it requires additional logic. I detect continuation tables by checking if a table at the top of a page has the same column structure as a table at the bottom of the previous page. If so, I merge them before sending to the LLM. This heuristic works well for standard business documents but may need tuning for unusually complex layouts.
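The continuation check might be sketched like this, with each table represented as a list of row dictionaries. Comparing header names assumes the header row repeats on the continuation page. When it does not, you would compare column counts or X positions instead.

```python
from typing import Dict, List

Table = List[Dict[str, str]]


def columns(table: Table) -> List[str]:
    """Column names of a table, taken from its first row."""
    return list(table[0].keys()) if table else []


def merge_continued_tables(pages: List[List[Table]]) -> List[Table]:
    """Merge a page's first table into the previous page's last table
    when both have the same columns. Tables are listed top to bottom."""
    merged: List[Table] = []
    prev_last = None  # last table of the previous page, already in merged
    for tables in pages:
        for i, table in enumerate(tables):
            continues = (i == 0 and prev_last is not None and table
                         and columns(table) == columns(prev_last))
            if continues:
                prev_last.extend(table)
            else:
                merged.append(list(table))
        # A page without tables breaks any continuation.
        prev_last = merged[-1] if tables else None
    return merged
```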


Need a document-extraction pipeline that’s yours and not a SaaS rental? I build those.
