2025 Sole developer

AI Data Extractor for Images & PDFs

Upload an image or PDF, name the fields you want, and receive validated structured output. Built on Tesseract OCR with an LLM normalization layer.

Python
Tesseract OCR
LangChain
FastAPI

View on GitHub →

The problem

Most document-extraction tools either dump raw text or lock you into a fixed schema. Real businesses need their own schema (invoice numbers, dates, totals, specific party names) and need it populated reliably.

The approach

Two-stage pipeline: a layout-aware Tesseract OCR pass produces structured text blocks, then a Pydantic-typed LangChain extraction chain coerces those blocks into the user-defined schema. Each field carries a confidence score so downstream code can flag uncertain values for review.
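
To make the per-field confidence idea concrete, here is a minimal sketch of one way it could work. It is not the project's actual scoring logic. It assumes the OCR stage yields words with a confidence between 0 and 1, and it scores each extracted field by the weakest OCR word its value depends on. The names `OcrWord`, `FieldResult`, `score_field`, and `REVIEW_THRESHOLD` are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical cutoff for illustration; not taken from the project.
REVIEW_THRESHOLD = 0.80


@dataclass
class OcrWord:
    text: str
    conf: float  # assumed range 0.0-1.0 (e.g. a 0-100 OCR word score rescaled)


@dataclass
class FieldResult:
    name: str
    value: Optional[str]
    confidence: float
    needs_review: bool


def score_field(name: str, value: Optional[str], words: List[OcrWord]) -> FieldResult:
    """Score a field by the lowest OCR confidence among the words in its value."""
    if value is None:
        # Nothing was extracted, so a human should look at it.
        return FieldResult(name, None, 0.0, True)

    # Keep the lowest confidence seen for each distinct word.
    lowest = {}
    for word in words:
        key = word.text.lower()
        lowest[key] = min(word.conf, lowest.get(key, 1.0))

    # Tokens the OCR never produced score 0.0: the second stage may have
    # reformatted or invented them, so they should not be trusted silently.
    confs = [lowest.get(token, 0.0) for token in value.lower().split()]
    confidence = min(confs) if confs else 0.0
    return FieldResult(name, value, confidence, confidence < REVIEW_THRESHOLD)


words = [
    OcrWord("Invoice", 0.98),
    OcrWord("INV-1042", 0.91),
    OcrWord("Total", 0.97),
    OcrWord("1,250.00", 0.62),
]
invoice_number = score_field("invoice_number", "INV-1042", words)
total = score_field("total", "1,250.00", words)
```

Taking the minimum rather than the mean is a deliberately conservative choice: one badly read digit in a total is enough to make the whole value worth a second look.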

Outcome

Handles invoices, receipts, scanned forms, and mixed-language PDFs.
Pydantic schemas double as the API contract and as the validator for the LLM’s output.
The FastAPI endpoint is the unit of deployment, so it drops into any back-office stack that can call an HTTP API.
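
The "schema as contract and validator" point can be sketched roughly as follows. This is an illustration under assumptions, not the code in the repository: the `Invoice` model, its field names, and the `parse_llm_output` helper are hypothetical, and the LLM call itself is left out, so the function only receives the raw JSON string the model would return.

```python
import json
from datetime import date
from decimal import Decimal
from typing import Optional

from pydantic import BaseModel, ValidationError


class Invoice(BaseModel):
    """Hypothetical user-defined schema; the field names are illustrative."""

    invoice_number: str
    issue_date: date
    total: Decimal
    vendor: Optional[str] = None


def parse_llm_output(raw: str) -> Optional[Invoice]:
    """Return a validated Invoice, or None if the output breaks the contract."""
    try:
        # Constructing the model runs Pydantic's type checks and coercion.
        return Invoice(**json.loads(raw))
    except (json.JSONDecodeError, TypeError, ValidationError):
        # Malformed JSON, a non-object payload, or missing/invalid fields.
        return None


good = parse_llm_output(
    '{"invoice_number": "INV-1042", "issue_date": "2025-03-14", "total": "1250.00"}'
)
incomplete = parse_llm_output('{"invoice_number": "INV-1042"}')
garbage = parse_llm_output("not json")
```

Because the same class describes the expected shape, it could also be passed as `response_model` on a FastAPI route, so the response the client sees and the check applied to the LLM output come from a single definition.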

Need something similar?

If this is the kind of problem you're working on, I can help.

Get in touch →