Created on 2025-03-25 13:33
Published on 2025-06-02 11:46
Every organization has documents on shared drives that are hard to find, hard to use, and often forgotten due to factors such as organizational silos, lack of awareness, insufficient metadata, or resistance to new technologies. Teams inherit shared folders of data from previous teams or team members. In many cases, new staff aren't given dedicated time to become familiar with this old data even though it might have been foundational to current team workflows.
During my time as a legislative auditor, I produced massive amounts of documentation used to plan audits, learn about an agency and their processes, develop findings and conclusions, and communicate results to a wide audience.
Just think of all the documents created in organizations that are potentially never used again after the immediate project or need is satisfied:
Final reports and recommendations
Compliance requirements documentation
Research reports and literature reviews
Process and procedure documents
Contracts and contractor monitoring reports
Annual reports and strategic plans
Interview transcripts and meeting minutes
Technical specifications and handbooks
Budget justifications and cost allocation methodologies
Training materials and risk assessments
These documents can represent hundreds or thousands of hours of work, yet it's difficult to leverage this knowledge base for future work because the information is trapped in files tucked away in old project folders, unknown to anyone not involved in creating them.
What if there was a better way to unlock the value in these document repositories? One popular technique to turn a set of files into an AI knowledge base is called Retrieval Augmented Generation, or RAG.
RAG converts a knowledge base (like a folder full of PDFs) into a searchable index whose contents can be retrieved and fed to a language model to generate responses to a query. The process has four steps:
Indexing - Documents are broken down into meaningful chunks and stored in a searchable database.
Retrieval - When a user poses a natural language question, the system searches the indexed documents for relevant information.
Augmentation - The retrieved content is combined with the user's query to enhance context.
Generation - The LLM generates responses informed by the retrieved documents.
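The four steps above can be sketched in a few dozen lines of Python. This is a minimal, self-contained illustration, not the notebook's actual code: the notebook uses OpenAI embeddings and gpt-4o-mini, while here a toy bag-of-words similarity stands in for embeddings, and the function names (`chunk`, `retrieve`, `augment`) are my own.

```python
from collections import Counter
import math

def chunk(text, size=40):
    # Indexing: break a document into fixed-size word windows.
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def vectorize(text):
    # Toy stand-in for an embedding: a bag-of-words term count.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, chunks, k=2):
    # Retrieval: rank stored chunks by similarity to the question.
    qv = vectorize(query)
    return sorted(chunks, key=lambda c: cosine(qv, vectorize(c)), reverse=True)[:k]

def augment(query, retrieved):
    # Augmentation: combine retrieved text with the user's question.
    context = "\n---\n".join(retrieved)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

report = ("The audit found that the managed care organization reported "
          "inaccurate financial data. Auditors recommended stronger monitoring "
          "of pharmacy benefit managers and contract remedies.")
chunks = chunk(report, size=12)
prompt = augment("What did the audit find?", retrieve("audit findings", chunks, k=1))
# Generation: `prompt` would now be sent to the LLM (e.g., gpt-4o-mini).
print(prompt)
```

Swapping the toy similarity for real embeddings and sending the final prompt to a chat model is all that separates this sketch from a working pipeline.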
The potential benefits of embracing AI frameworks like RAG are significant.
Access to Domain-Specific Knowledge - Incorporate up-to-date information from domain-specific databases or documents, ensuring responses are informed by the latest and most relevant data.
Harnessing the Untapped Value of Legacy Content - Effectively revitalize and utilize legacy documents that may have been underutilized due to their age, format, or lack of awareness.
I put together this Google Colab notebook to break down this process a little more for anyone who wants to try it out. The notebook should open with some PDFs included in a Reports folder. Feel free to put your own reports in there and change the questions based on what's included in them. One thing you will need is a paid OpenAI account and an API key to use the model.
https://colab.research.google.com/drive/11ZXW4WeTSGsvmIAF1epVhQ29-Yik28Cg?usp=sharing
The objective was straightforward: transform a set of static documents into an interactive knowledge base without requiring complex infrastructure. The last step in the notebook displays the model's response to the query along with the top sources retrieved to inform that response.
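That final step, answer plus supporting sources, might look roughly like the sketch below. The `answer` and `sources` values here are hypothetical stand-ins for real LLM output and retrieval metadata, not the notebook's actual function or data.

```python
def format_response(answer, sources):
    # Present the model's answer alongside the retrieved source documents,
    # so a reader can verify each claim against the underlying reports.
    lines = ["Response:", answer, "", "Top sources:"]
    lines += [f"  {i}. {src}" for i, src in enumerate(sources, start=1)]
    return "\n".join(lines)

print(format_response(
    "Common findings include inaccurate financial reporting by MCOs.",
    ["21-025.pdf (Blue Cross Blue Shield of Texas)",
     "20-008.pdf (Use of Remedies in Managed Care Contracts)"],
))
```

Surfacing sources next to the answer is what makes it practical to spot-check the model against reports you know well.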
For my test case, I used audit reports that I had helped create as a legislative auditor. In some cases, I wrote the report; in others, I was a team member performing testing. I focused on my own work because I wanted to easily spot any errors in the responses—a critical step in evaluating AI solutions before implementing them into workflows.
If you want to follow along with the example questions I set up, you'll need to follow the links below to download the reports and upload them to the notebook.
An Audit Report on Blue Cross Blue Shield of Texas, a Managed Care Organization - https://sao.texas.gov/Reports/Main/21-025.pdf
An Audit Report on The Health and Human Services Commission’s Use of Remedies in Managed Care Contracts - https://sao.texas.gov/reports/main/20-008.pdf
An Audit Report on Healthcare Services at the Juvenile Justice Department - https://sao.texas.gov/Reports/Main/23-027.pdf
An Audit Report on The Health and Human Services Commission’s Oversight of the Medical Transportation Program - https://sao.texas.gov/Reports/Main/22-021.pdf
An Audit Report on Cook Children’s Health Plan, A Managed Care Organization - https://sao.texas.gov/Reports/Main/22-036.pdf
With a tool like this, each team member can search through the collective knowledge of past work in their own way, new team members can have easy access to institutional knowledge, and teams can make more informed decisions about approaches and directions for new projects.
Example questions from the notebook:
What are common audit issues identified with managed care organizations?
Why is it important for states and managed care organizations to sufficiently monitor pharmacy benefit managers?
What is the process to ensure that managed care organizations submit accurate financial information to the state?
What areas have fared well in audits of managed care organizations?
What were the audit objectives for audit projects at the Juvenile Justice Department?
How does the state ensure medical transportation providers comply with state rules?
This notebook uses a small but powerful closed model, meaning the PDFs you upload are made available to OpenAI's gpt-4o-mini model. A solution like this is probably not appropriate for files that contain sensitive, confidential, or proprietary information.
This notebook also uses OpenAI both to create the searchable text index and to generate responses from your documents, so there will be a cost to using it, although the cost will be minimal for a small collection of PDFs.
Let me know if you have questions or ideas about this kind of tool framework.
#AI #RAG #RetrievalAugmentedGeneration #KnowledgeManagement #DocumentAI #AuditInnovation #LegislativeAudit