Kashif Ullah
2025 · Sole developer

AI Data Extractor for Images & PDFs

Upload an image or PDF, name the fields you want, and receive validated structured output. Built on Tesseract OCR with an LLM normalization layer.

The problem

Most document-extraction tools either dump raw text or lock you into a fixed schema. Real businesses need their own schema (invoice numbers, dates, totals, specific party names), and they need extraction against that schema to be reliable.

The approach

A two-stage pipeline: a layout-aware Tesseract OCR pass produces structured text blocks, then a Pydantic-typed LangChain extraction chain coerces those blocks into the user-defined schema. Each field carries its own confidence score, so downstream code can flag uncertain values for review.
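To make the review step concrete, here is a minimal sketch of how downstream code might use the per-field confidence scores. It is not the project's code: the `ExtractedField` shape, the `fields_needing_review` helper, the 0.8 threshold, and the sample values are all assumptions for illustration, and plain dataclasses stand in for the Pydantic models so the snippet runs on the standard library alone.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ExtractedField:
    """One schema field as returned by the extraction stage (hypothetical shape)."""
    value: str
    confidence: float  # assumed to lie in [0.0, 1.0]


def fields_needing_review(
    fields: Dict[str, ExtractedField], threshold: float = 0.8
) -> List[str]:
    """Return, in alphabetical order, the names of fields below the threshold."""
    return sorted(name for name, f in fields.items() if f.confidence < threshold)


# Illustrative values only, not real extractor output.
result = {
    "invoice_number": ExtractedField("INV-1042", 0.97),
    "total": ExtractedField("1,280.00", 0.62),
    "issue_date": ExtractedField("2025-03-14", 0.91),
}

print(fields_needing_review(result))  # ['total']
```

Keeping the threshold as a parameter lets each caller decide how cautious to be: a stricter back-office workflow could pass `threshold=0.95` and route more fields to a human.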

Outcome

  • Handles invoices, receipts, scanned forms, and mixed-language PDFs.
  • Pydantic schemas double as API contract and as the LLM’s output validator.
  • The unit of deployment is a FastAPI endpoint, so it can sit behind any back-office stack that speaks HTTP.
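The "schema as validator" idea can be sketched as follows. This is an illustration under assumptions, not the project's implementation: the `Invoice` model, its field names, and the `validate_llm_output` helper are hypothetical, and the payloads stand in for whatever the extraction chain returns.

```python
from datetime import date
from decimal import Decimal

from pydantic import BaseModel, ValidationError


class Invoice(BaseModel):
    """Hypothetical user-defined schema; field names are illustrative."""
    invoice_number: str
    issue_date: date
    total: Decimal


def validate_llm_output(payload: dict):
    """Return (invoice, None) on success or (None, error_text) on failure."""
    try:
        return Invoice(**payload), None
    except ValidationError as exc:
        return None, str(exc)


# A well-formed payload is coerced into typed values.
good, err = validate_llm_output(
    {"invoice_number": "INV-1042", "issue_date": "2025-03-14", "total": "1280.00"}
)

# A malformed date is rejected instead of passing through silently.
bad, bad_err = validate_llm_output(
    {"invoice_number": "INV-1042", "issue_date": "not a date", "total": "1280.00"}
)
```

The same class could also serve as the API contract, for example by passing it as the `response_model` of a FastAPI route, which is one way a single definition can cover both roles.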

Need something similar?

If this is the kind of problem you're working on, I can help.

Get in touch →