Created on 2024-12-30 14:56
Published on 2024-12-30 16:32
I conducted an experiment to gauge how well GPT-4o can read and extract information from 20 JPEG images of receipts. These receipts vary in complexity: some include only one item, while others contain 15–20 items.
I focused on extracting two categories of data:
Summary Attributes (occur once per receipt): store name, purchase date, total price, tax, payment method, etc.
Item Details (occur multiple times per receipt): item names, prices, and quantities.
The model performed best on summary data (like store name, date, and payment method) and struggled more with detailed item information (especially prices and totals).
Here are some observations from running the images through GPT-4o a few times.
Data quality proved to be the most critical factor in successful extraction. Dark images, wrinkled receipts, and paper folds significantly impacted accuracy. This was especially noticeable with angled receipts, where the spatial relationship between item names, quantities, and prices became distorted, making it difficult for the model to correctly match values across rows.
The receipt summary data (store name, payment method, date) consistently achieved higher accuracy than the detailed line items. This aligns with how large language models like GPT-4o fundamentally work: they excel at recognizing patterns in text and understanding context, which suits store names and payment methods that follow predictable formats. For example, the store name is almost always at the top of the receipt, while dates, totals, and payment methods are typically at the bottom. Extracting these fields does not require matching values across the rows of a receipt.
Despite the model's general struggles with detailed numerical data, it demonstrated an unexpected ability to integrate multiple tax amounts into a single, accurate total. For example, when presented with separate lines for different kinds of taxes, the model didn't just extract these as individual items but intelligently combined them into a single, correct tax amount.
This capability shows that while the model may struggle with line-by-line price extraction, it has a decent understanding of how different components relate to each other in the context of a receipt's overall structure. I found this strength with numbers an interesting contrast to its difficulty with the individual line item prices and quantities discussed above.
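One way to sanity-check a combined tax figure like this is to add up the extracted pieces and compare them with the extracted total. The sketch below is only an illustration: the function name `check_receipt`, the one-cent tolerance, the sample numbers, and the assumption that each extracted price is a unit price are all mine, not part of the experiment.

```python
from decimal import Decimal


def check_receipt(item_lines, tax_lines, total, tolerance=Decimal("0.01")):
    """Compare extracted line items and tax lines against the extracted total.

    item_lines: list of (quantity, unit_price) pairs (assumed layout)
    tax_lines:  list of individual tax amounts printed on the receipt
    total:      the total price extracted from the receipt
    """
    # Sum quantity * unit price over all line items.
    subtotal = sum((Decimal(qty) * price for qty, price in item_lines), Decimal("0"))
    # Combine the separate tax lines into a single tax amount.
    tax = sum(tax_lines, Decimal("0"))
    # Flag the receipt as consistent if the pieces add up to the total.
    consistent = abs(subtotal + tax - total) <= tolerance
    return {"subtotal": subtotal, "tax": tax, "consistent": consistent}


# Made-up example values, not taken from the Kaggle receipts.
items = [(2, Decimal("1.50")), (1, Decimal("4.25"))]
taxes = [Decimal("0.36"), Decimal("0.22")]
result = check_receipt(items, taxes, total=Decimal("7.83"))
print(result)
```

A receipt that fails this check is a good candidate for manual review, since the mismatch usually points at a misread price or quantity.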
I ran the images through a Python program I wrote that uses the GPT-4o API to extract the details from the images. The program uses the pydantic package to validate the data output by the model, which ensures that numbers, dates, and text are correctly formatted. I downloaded the receipt images from Kaggle (https://www.kaggle.com/datasets/trainingdatapro/ocr-receipts-text-detection) and did not resize or adjust them at all before the extraction.
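For readers who have not used pydantic, here is a minimal sketch of what such a validation step could look like. The class and field names (`Receipt`, `LineItem`, `store_name`, and so on) and the sample data are assumptions made for illustration, not the actual schema from my program, and the GPT-4o API call itself is left out.

```python
from datetime import date
from decimal import Decimal
from typing import List, Optional

from pydantic import BaseModel, Field, ValidationError


class LineItem(BaseModel):
    # Item details: these repeat once per purchased item.
    name: str
    quantity: int = 1
    price: Decimal


class Receipt(BaseModel):
    # Summary attributes: these occur once per receipt.
    store_name: str
    purchase_date: date
    total: Decimal
    tax: Optional[Decimal] = None
    payment_method: Optional[str] = None
    items: List[LineItem] = Field(default_factory=list)


# Hand-written stand-in for the model's parsed JSON output (hypothetical values).
raw = {
    "store_name": "Corner Market",
    "purchase_date": "2024-12-30",
    "total": "7.54",
    "tax": "0.56",
    "payment_method": "card",
    "items": [{"name": "Milk", "quantity": 2, "price": "3.49"}],
}

# Valid data is converted to typed values (date, Decimal, int).
receipt = Receipt(**raw)

# Badly formatted data is rejected instead of passing through silently.
bad = dict(raw, purchase_date="not a date")
try:
    Receipt(**bad)
    rejected = False
except ValidationError:
    rejected = True
```

The point of the schema is that a malformed date or a non-numeric price raises an error at extraction time rather than surfacing later in the analysis.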
#ArtificialIntelligence #AI #GPT4 #DataExtraction #ComputerVision #AIExperiments #AIAutomation #DataScience