Created on 2025-02-23 13:37
Published on 2025-02-24 17:00
Soon, every organization is going to have to figure out how to add AI to their business processes. At the very least, they'll have to figure out how to use AI to make their workflows and employees more efficient and more capable.
There are significant obstacles to adopting AI tools in workflows — they hallucinate, they lack transparency and explainability, they are easy to confuse — not to mention that slight variations in the prompt can impact the output in unpredictable ways. If your business has risk management, compliance, or quality control processes, these obstacles might be large enough to slow down AI adoption or prevent useful solutions from even being considered at all.
When integrating AI tools into your workflows, you need clear answers to several key questions:
How accurate is the model on our data and various data formats?
What are the costs associated with using the model on our data?
How will we deal with irregular or malicious inputs?
How will we know when to switch to a new model?
Are there limitations on how the model can use our data?
The answers to these questions will drive how businesses approach using AI to improve their processes and augment their staff's ability to execute their tasks.
Without customized evaluation methods to answer these questions, you can't effectively assess which tools truly meet your needs versus those that just seem promising in controlled demonstrations. Evaluations allow you to choose the tool that best aligns with your specific use cases, data formats, and domain challenges, rather than relying on theoretical benchmarks.
Custom evaluations let you assess factors like accuracy, bias, speed, and resource consumption in a context that matters to your business. Maybe some tasks can work with a small, cheap, fast model, while other tasks require an expensive model that reasons before responding. Either way, this is information you need to gather and track.
Establishing baselines and continuously comparing new AI models against them allows you to track improvements over time. This means you can quickly identify and adopt new models that outperform current ones without relying on vibes or generic, abstract benchmarks. An additional benefit of having customized evaluations is being able to iterate and improve performance via prompt engineering and data integration, with actual evidence to point you in the right direction.
Evaluating different models on your data helps determine not only which models are most effective but also which offer the best return on investment. This is important information when considering automating tasks or deciding whether to build custom solutions versus leveraging existing platforms.
One popular way businesses are integrating AI into their organizations is to take a knowledge base of PDFs, reports, PowerPoint presentations, policies, and procedures, transform it into a common format that AI tools can easily read, connect that data to an AI model, and let the model search through it to respond to research queries, create reports, or analyze historical performance.
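As a rough sketch of that "common format" step, here's one way to flatten a PDF into plain text so it can be handed to a model. This is just an illustration, not the pipeline used later in this post; it assumes the pypdf library, and the file name is a placeholder.

```python
from pypdf import PdfReader


def pdf_to_text(path: str) -> str:
    """Flatten a PDF into plain text that can be included in a model prompt."""
    reader = PdfReader(path)
    # extract_text() can return None for image-only pages, so fall back to "".
    return "\n".join(page.extract_text() or "" for page in reader.pages)


# Placeholder file name for illustration.
report_text = pdf_to_text("report.pdf")
```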
In this example, I'm going to run a test to see how well, and at what cost, models can respond to questions about a single PDF report. I'll take one report I'm very familiar with, provide it to three different models, and evaluate the accuracy and cost of answering 10 questions.
Here's the report: https://sao.texas.gov/SAOReports/ReportNumber?id=21-025
I came up with 10 general questions about the report's details; they're included below with the model responses. When evaluating the model responses, the answer must include the exact wording from the report. For instance, question #2 asks: “What state government agency manages the program?” The correct answer is “Health and Human Services Commission.” A model response like “The program is managed by the Health and Human Services Commission (Commission) of Texas.” would be counted correct since it contains the required answer verbatim.
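That grading rule fits in a few lines of Python. This is a sketch of my interpretation of the rule (a case-insensitive substring check), not the exact function from the evaluation code linked at the end of the post:

```python
def is_correct(model_response: str, expected_answer: str) -> bool:
    """Mark a response correct only if it contains the expected wording verbatim."""
    # Case-insensitive substring check; the required phrase must appear as-is.
    return expected_answer.lower() in model_response.lower()


# Example from question #2:
print(is_correct(
    "The program is managed by the Health and Human Services Commission (Commission) of Texas.",
    "Health and Human Services Commission",
))  # True
```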
This kind of evaluation is really only practical with a programming language like Python and each model's API, which let you quickly score response accuracy and calculate how much each query costs to answer.
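To give a sense of what that looks like, here's a minimal sketch of querying one model and estimating the cost of a single question from the token counts the API returns. It assumes the OpenAI Python SDK; the `ask` helper and the per-token prices are placeholders, so check current pricing before trusting the numbers. The full evaluation code is linked at the end of the post.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Placeholder prices in dollars per token; confirm against current pricing.
PRICE_PER_INPUT_TOKEN = 0.15 / 1_000_000
PRICE_PER_OUTPUT_TOKEN = 0.60 / 1_000_000


def ask(question: str, report_text: str) -> tuple[str, float]:
    """Send one question (with the report as context) and return (answer, cost)."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Answer using only the report provided."},
            {"role": "user", "content": f"Report:\n{report_text}\n\nQuestion: {question}"},
        ],
    )
    usage = response.usage
    cost = (usage.prompt_tokens * PRICE_PER_INPUT_TOKEN
            + usage.completion_tokens * PRICE_PER_OUTPUT_TOKEN)
    return response.choices[0].message.content, cost
```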
gpt-4o-mini: A small, fast, cheap OpenAI model that OpenAI says is good for “focused tasks.”
gemini-2.0-flash: A larger model from Google, their “most capable” model and extremely cheap for the time being.
llama-7b: A very small, open-source model from Meta. Since Llama models are open source, they can be run locally (on the right hardware) without having to send requests to a model provider's servers (see the sketch below for one way to do that).
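For the local option, one possible setup (not the one used in this test) is to serve a Llama model through Ollama and call it from Python. The model tag below is an assumption and will depend on which Llama variant you pull.

```python
import ollama  # assumes the Ollama server is running locally with a Llama model pulled

response = ollama.chat(
    model="llama2:7b",  # hypothetical tag; use whichever Llama variant you pulled
    messages=[{"role": "user", "content": "Summarize the report's key findings."}],
)
print(response["message"]["content"])
```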
Here’s a summary table of the accuracy and cost metrics.
Here are the full results:
Overall, the Gemini model was the most accurate, which is maybe not surprising since it was the largest, most capable model I tested. It was also the cheapest of the three at the moment, making it a massive bargain.
If you review some of the errors in the test, like #4 or #8, you might notice they were marked incorrect because they didn't use the exact wording from the report, even though both the Gemini and GPT models got the point of the answer across in their responses. Also, #9 was counted incorrect because of formatting differences between the answer and the model response. Maybe this can be fixed with a better prompt, or maybe it's not a big deal for your use case; either way, it's all the more reason to develop your own tests that measure what matters for your use case.
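One way to handle the formatting issue, instead of (or alongside) prompt changes, is to normalize both strings before comparing them. This is a hypothetical, more lenient variant of the check, not the scoring used in the results above:

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()


def lenient_match(model_response: str, expected_answer: str) -> bool:
    # Formatting differences like "$1,234" vs. "$1234" no longer cause a miss.
    return normalize(expected_answer) in normalize(model_response)
```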
Here's the code used for the evaluation: https://github.com/scottlabbe/llm_extract_evaluation
In summary, rigorous testing is the only reliable way to determine if AI truly meets your business needs. This evaluation was a simple example of what an evaluation can look like, but as you refine your testing approach, you'll not only build a stronger case for AI integration but also ensure that each step of that integration delivers tangible value.
In an upcoming post, I'm going to iterate through a few versions of the prompt to see how much I can improve the accuracy of the model responses with more detailed instructions, examples, or validation.
#AI #AIEvaluation #LLM #TestingAI