Created on 2025-01-31 03:36
Published on 2025-02-05 03:45
PDFs are everywhere. In auditing and government administration roles, we frequently analyze reports, contracts, and legislative briefs delivered as PDFs. When you start using AI tools such as Large Language Models (LLMs) on those documents, their varied formatting can cause misleading interpretations or outright errors in the output.
These errors can lead to overall distrust in the models and apprehension about integrating AI tools into workflows. AI tools based on LLMs are text prediction models, and they can’t tell the difference between the body of a text and graphic elements like text boxes and other elements that provide context for human readers. It’s worth understanding how PDFs and AI tools interact with each other.
Meant for Humans, Not AI Tools: PDFs emphasize visual appearance. This makes them great for consistent viewing across devices or sharing widely with audiences like the public or the wider organization, but less ideal for accurate AI analysis.
Complex Layouts: PDFs contain multi-column text, embedded images, tables, headers, and footers, and many of these elements are arranged purely to be visually appealing. AI models struggle to parse visual elements and need a logical text structure to understand how sections relate to one another.
Scanned Documents: Some PDFs are literally images of text, such as scanned documents. Even with optical character recognition (OCR), the accuracy of converting those document images into actual text depends on the quality of the scan.
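One practical consequence of the scanned-document problem is that it helps to know which pages have no real text layer before handing a file to an AI tool. The sketch below is a rough heuristic, not an established method: it assumes you already have the extracted text for each page as a list of strings, and the `flag_probable_scans` name and the 50-character threshold are made up for illustration.

```python
def flag_probable_scans(page_texts, min_chars=50):
    """Return 1-based page numbers whose extracted text is suspiciously short.

    page_texts: one string of extracted text per page (assumed input).
    min_chars:  hypothetical threshold; tune it for your own documents.
    """
    flagged = []
    for page_number, text in enumerate(page_texts, start=1):
        # Ignore whitespace so blank lines do not count as content.
        visible = "".join(text.split())
        if len(visible) < min_chars:
            flagged.append(page_number)
    return flagged


# Invented sample: one normal page, one empty page, one page with only a page number.
pages = [
    "Overall conclusion: the agency met most of its reporting requirements this year.",
    "",
    "  \n 3 ",
]
print(flag_probable_scans(pages))  # [2, 3]
```

Pages flagged this way are candidates for OCR or manual review; a short page is not always a scan, so treat the result as a prompt to look, not a verdict.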
When we begin to think about using PDFs in workflows that include AI tools, we have to consider the best way to convert them into a format that AI interfaces handle better. Understanding this conversion helps us stay mindful of where AI tools might struggle to use our data in helpful and reliable ways.
In fact, many people use ChatGPT to interact with their documents without realizing there’s a hidden conversion happening in the background. The tool conceals this step, but before an AI model can read your PDF, the text must be extracted in a way that preserves its logical structure so the model can interpret it accurately.
One popular format for pairing with AI tools is Markdown, a simple, readable text format where styling is indicated with symbols (like * for italics). Instead of unseen formatting code (as in Word), Markdown lays out structure right in the text. This makes it easier for AI to grasp the true content, rather than wrestling with a PDF’s invisible layout coordinates.
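To show what "structure right in the text" means in practice, here is a small sketch that recovers a document outline from Markdown using nothing but the `#` heading markers. The `markdown_outline` function and the sample text are invented for this illustration, and the parser is deliberately simplistic (it does not skip fenced code blocks, for example).

```python
import re

# A Markdown heading is 1 to 6 '#' characters, a space, then the title.
HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def markdown_outline(md_text):
    """Return (level, title) pairs for every heading line in md_text."""
    outline = []
    for line in md_text.splitlines():
        match = HEADING.match(line)
        if match:
            outline.append((len(match.group(1)), match.group(2)))
    return outline


# Invented sample document.
sample = """# Audit Report
## Background
Some *emphasized* text in the body.
## Conclusion
"""
print(markdown_outline(sample))
# [(1, 'Audit Report'), (2, 'Background'), (2, 'Conclusion')]
```

Getting the same outline from a PDF would mean inferring headings from font sizes and page positions, which is exactly the guesswork a converter has to do on your behalf.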
Here are some examples of what you can expect to see when you convert a PDF to Markdown. I ran a quick experiment, converting a PDF that includes a lot of design, formatting, tables, and graphics with a Python package called PyMuPDF, to see whether anything was lost in the conversion to a more AI-friendly format.
Overall, the quality was better than I was expecting. It correctly distinguished text boxes from paragraphs, copied structured tables accurately, and preserved the emphasis on passages emphasized in the actual report. However, it completely ignored the complicated graphics we spent days designing and had mixed success recreating tables in the report.
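A conversion like this one could be scripted roughly as follows. This is a sketch under assumptions, not the exact code behind the experiment: it assumes the `pymupdf4llm` companion package (installed separately with `pip install pymupdf4llm`) and its `to_markdown` function, and the helper names and file paths are hypothetical.

```python
from pathlib import Path


def markdown_path_for(pdf_path):
    """Return the output path: same folder and name, with a .md extension."""
    return Path(pdf_path).with_suffix(".md")


def convert_pdf_to_markdown(pdf_path):
    """Convert one PDF to a Markdown file and return the new file's path."""
    # Imported here so the rest of the script loads even without the package.
    # Assumption: pymupdf4llm is the PyMuPDF add-on used for Markdown output.
    import pymupdf4llm

    md_text = pymupdf4llm.to_markdown(str(pdf_path))
    out_path = markdown_path_for(pdf_path)
    out_path.write_text(md_text, encoding="utf-8")
    return out_path
```

Calling `convert_pdf_to_markdown("report.pdf")` with a placeholder file name would write `report.md` next to the original. Open the two side by side afterwards, because graphics and complex tables are where the output is most likely to diverge from the source.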
Here's the PDF report.
Here's a comparison of the beginning of the report, where the conversion correctly captured the overall conclusion and the background information describing some of the context and key terms, all in logical order.
Report:
Converted Markdown File:
Heavily customized graphics created to explain a process or flow of data were completely omitted from the markdown file.
Report:
Converted Markdown File:
Here are some examples where the tables were recreated both accurately and inaccurately. The only significant difference I see between the two tables is that the inaccurate example contained multiple bulleted lists within a table. This is a great example of how LLMs could easily miss or misunderstand facts in your PDF and give poor or inaccurate responses simply because the model was fed messy data.
Report:
Converted Markdown File:
Report:
Converted Markdown File:
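Broken tables like the inaccurate example can sometimes be caught automatically before the Markdown reaches an AI tool. The sketch below is one possible sanity check, not part of the original experiment: it flags table rows whose cell count differs from the first row of the same table. The function names and the sample table are invented, and escaped pipe characters inside cells are not handled.

```python
def split_row(line):
    """Split a Markdown table row into trimmed cell strings."""
    return [cell.strip() for cell in line.strip().strip("|").split("|")]


def find_ragged_table_rows(md_text):
    """Return (line_number, cells_found, cells_expected) for mismatched rows."""
    problems = []
    expected = None  # cell count of the current table's first row
    for number, line in enumerate(md_text.splitlines(), start=1):
        if line.strip().startswith("|"):
            cells = split_row(line)
            if expected is None:
                expected = len(cells)
            elif len(cells) != expected:
                problems.append((number, len(cells), expected))
        else:
            # Any non-table line ends the current table.
            expected = None
    return problems


# Invented sample: a bulleted list inside a cell has split one row in two.
broken = (
    "| Finding | Status |\n"
    "| --- | --- |\n"
    "| Late filings |\n"
    "- item one | Open |\n"
)
print(find_ragged_table_rows(broken))  # [(3, 1, 2)]
```

A check like this only catches rows that changed shape. A table can keep the right number of cells and still have values in the wrong column, so a visual comparison against the original report is still worth doing.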
When integrating AI into PDF workflows, it is essential to understand how formatting affects the interpretation of information. Converting PDFs into structured, AI-friendly formats like Markdown is an important step in putting your data to work with AI tools. However, recognizing the limitations, such as missing graphics or misinterpreted table structures, remains crucial for avoiding inaccurate outputs and improving the reliability of AI tools in real-life workflows.