Created on 2025-01-26 22:41
Published on 2025-01-31 02:57
One of the first things businesses are realizing is that AI is a very useful tool for pulling relevant data out of all the messy documents and files they're drowning in. Think PDFs, invoices, contracts, spreadsheets, you name it. They want to understand their data and use it in their day-to-day operations.
AI tools are very good at analyzing text with skills we might think of as reading, understanding, and recording information. The problem is that AI tools make mistakes, and when they do, they tend to compound those errors as they complete their responses.
In a lot of real-world applications, these are tasks that might be delegated to less experienced staff, but they're important steps in processes where accuracy matters. For example:
An intern reading documents and preparing summaries or reports to support or plug into other workflows.
Maybe a staff tax accountant needs to read dozens of pages of financial documents to record a few key numbers and details about a certain transaction to make accurate entries.
Maybe you need the data from hundreds of images or PDFs in a structured format to analyze with Excel.
Or you want to have a centralized source of information about contract terms and requirements.
But we all know interns and staff-level folks aren’t always 100% accurate, and the same is true for AI tools.
Using Python and Pydantic to Control AI Output
This is where knowing a bit of Python and the packages available for it can really help control the output from AI tools. Specifically, I used Pydantic, a Python package with extremely useful data-validation capabilities, to turn messy receipt images into a cleanly formatted table of receipt data along with notes about validation errors.
That’s the reality people will face when trying to integrate AI into their existing workflows. If you want useful, consistent, and structured data from inconsistently formatted documents, you’ll need a way to make sure data provided by an AI tool matches your expectations of format and quality.
How does Pydantic help? You can see my code at the bottom of the article, but I used Pydantic to define a “Receipt” model—basically a blueprint that spells out what data fields and data types we expect (items, subtotal, taxes, etc.) from the AI extraction. When an AI tool extracts data from an image or PDF, Pydantic checks every field in the model to see if it matches our field definitions.
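To make that check concrete, here’s a minimal sketch, assuming a stripped-down model and a made-up AI response (the field names and values here are illustrative stand-ins, not the exact fields from my repo). If a field doesn’t match its definition, Pydantic raises a ValidationError that names the problem:

```python
from pydantic import BaseModel, ValidationError

# Stripped-down stand-in for the Receipt model; field names are illustrative.
class ReceiptSketch(BaseModel):
    merchant: str
    subtotal: float
    tax: float
    grand_total: float

# Pretend this dict is what the AI tool returned for one receipt image.
ai_output = {
    "merchant": "Corner Cafe",
    "subtotal": "12.50",            # a numeric string gets coerced to a float
    "tax": 1.03,
    "grand_total": "see attached",  # not a number, so validation fails
}

try:
    receipt = ReceiptSketch.model_validate(ai_output)
except ValidationError as exc:
    # The error report names each field that didn't match its definition.
    print(exc)
```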
Here’s the data table of summary receipt data created after using GPT-4o to extract the fields defined in my Receipt model. Green cells were accurate according to my review, and the red cells were errors.
It’s important to say that just because Pydantic validates the types of the data extracted by AI tools, it doesn’t mean you’re going to get 100% accurate data. The red cells in the screenshot show some extraction errors I identified. However, the package also lets you add custom validation that triggers according to criteria you set.
For example, you’ll notice a ‘validation_error’ column in the screenshot that is populated by validation logic I added to the process. In my code below, there’s a ‘model_validator’ that adds up the amounts extracted for subtotal, tax, fees, and discount and compares the result to what the model extracted as the grand_total.
If the amounts don’t agree, it attaches a note to the record to show what was expected and what was calculated. This is a streamlined way to identify records that need to be reviewed and corrected, with a customized note about the issue.
Why should anyone care? Pydantic ensures that as soon as new extracted data arrives from an AI tool, it’s checked against whatever standards you know should apply to that data. The main advantage of this kind of logic is that any anomalies pop up right away. It can focus manual reviews by flagging possible errors and gives a dependable structure for integrating data into spreadsheets or other data processes later on.
Here’s the code from the Pydantic model that ensures we get the data types and validation we need.
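The block below is a simplified sketch of that kind of model rather than the exact code from my repo (the field names and the one-cent tolerance are assumptions for illustration); the full version is at the link that follows.

```python
from typing import Optional
from pydantic import BaseModel, Field, model_validator

# Simplified sketch of a receipt model; see the linked repo for the real thing.
class Receipt(BaseModel):
    merchant: Optional[str] = None
    items: list[str] = Field(default_factory=list)
    subtotal: float = 0.0
    tax: float = 0.0
    fees: float = 0.0
    discount: float = 0.0
    grand_total: float = 0.0
    validation_error: Optional[str] = None

    @model_validator(mode="after")
    def check_totals(self) -> "Receipt":
        # Recalculate the total from the component amounts and compare it
        # to the grand_total the AI extracted.
        calculated = self.subtotal + self.tax + self.fees - self.discount
        if abs(calculated - self.grand_total) > 0.01:
            # Attach a note instead of raising, so the record still loads
            # and can be flagged for manual review.
            self.validation_error = (
                f"Totals disagree: extracted grand_total={self.grand_total:.2f}, "
                f"calculated={calculated:.2f}"
            )
        return self
```

Because the validator attaches a note instead of raising an error, every record still loads, so you can filter on ‘validation_error’ to focus manual review or send the validated records to a spreadsheet with model_dump().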
Here’s a link to the full code used for the receipt extraction: https://github.com/scottlabbe/GPT-4o_receipt_extraction