AI / LLMs
Jan 7, 2026
14 min

Evaluating AI Systems

Making LLM-based applications 'behave'

By Adam Ingwersen

Deterministic AI

We've observed staggering improvement in the creativity, reliability and capabilities of AI applications, in particular LLMs, over the past few years. Much of this progress comes from improvements to the models themselves, larger training datasets and a significantly increased amount of compute consumed on training and inference. The cost per token has dropped sharply with scale and efficiency gains - but the number of floating-point operations required to train models has skyrocketed just as fast.

Stanford HAI AI Index Report 2025, page 56

For many reasons, then, it's practically infeasible for most endeavours to train models to a level of sophistication comparable to the leading LLM providers. The major LLM providers happily provide API access to their models, allowing small teams to utilise them. But why would anyone buy a product if it isn't able to do something beyond what these APIs already offer?

In many cases, the LLM applications that end up serving customers well are specialised / verticalised to specific use-cases, and are able to provide more rigid guardrails than the vanilla LLMs, creating an experience like Lovable, Harvey or Toast. These companies have obviously accomplished great distribution and refined their user experience, but they've also ensured that their systems are resilient and avoid spewing nonsense. In short, they have ensured that their AI product is systematically evaluated.

So what do we mean when we say evaluate?

Evals in a nutshell

Unlike traditional software, AI systems are non-deterministic - as you may have noticed when interacting with ChatGPT, asking the same question multiple times can yield different results. In many business applications this presents a challenge, especially when operating in regulated spaces, with customers or with other systems.

Evaluations (or evals) are a way to identify and mitigate these issues. By running a battery of tests on your AI system, you can identify potential issues before they become a problem and increase confidence that your system is performing as expected.

In a way, evals are like quality control for AI systems. Just as one would test a software product before releasing it, evals serve a similar purpose for AI applications. The format, though, tends to be a bit different - and the data one needs to collect is usually not available to the engineer building the system.
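The core loop is simpler than it sounds: run each test input through the system and score the output. A minimal sketch in Python - the `run_evals` and `exact_match` names, the dict-based test cases and the stubbed-out system are all illustrative assumptions, not a particular framework's API:

```python
# A minimal eval-loop sketch, assuming a hypothetical `system` callable
# that wraps your AI application. All names here are illustrative.

def exact_match(output: str, expected: str) -> bool:
    """Simplest possible check: normalised string equality."""
    return output.strip().lower() == expected.strip().lower()

def run_evals(system, test_cases) -> float:
    """Run every (input, expected) pair through the system and
    return the fraction that passes the check."""
    passed = sum(
        exact_match(system(case["input"]), case["expected"])
        for case in test_cases
    )
    return passed / len(test_cases)

# Usage with a stubbed-out deterministic "system":
cases = [
    {"input": "2 + 2", "expected": "4"},
    {"input": "Capital of France?", "expected": "Paris"},
]
stub = {"2 + 2": "4", "Capital of France?": "paris"}
print(run_evals(lambda q: stub[q], cases))  # → 1.0
```

In practice the scoring function is the hard part - exact string matching rarely survives contact with free-form LLM output - but the loop itself stays this simple.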

Golden datasets

Let's imagine that you're operating a company leveraging AI in the legal space to redline contracts. You want to ensure that the LLM your system is using doesn't just make things up. Prior to releasing a new feature or update, you want to be confident that the system performs reliably and to specification.

To do so, you first must define what good looks like. The way you do so is by creating what is usually called a "golden dataset": a set of input and expected output pairs that are verifiably correct (or incorrect), which you can then test against.

Creating good golden datasets is a non-trivial task, and requires a deep understanding of the domain and the AI system you're testing.

There are a few ways to obtain a golden dataset:

  1. Manually create a dataset from scratch for the task
     - A great start for many, but doesn't scale well
  2. Establish processes that continuously generate dataset entries from other business operations
     - A great option for companies that already have suitable processes in place, e.g. redlining at a legal firm
  3. Use existing internal datasets
     - A great option for companies that already have such datasets on hand
  4. Procure datasets from third parties such as Scale AI or Hugging Face
     - A great way to get started, but only if the dataset is appropriate for your use case
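However you obtain it, the dataset ends up as a collection of verified input/expected pairs. A common, tool-agnostic way to store one is JSONL, one record per line; the sketch below loads such a file and fails fast on malformed records. The field names and validation rule are illustrative assumptions, not a standard:

```python
# A golden dataset is just verified (input, expected) pairs. Storing it
# as JSONL keeps it diffable and easy to review. Field names below are
# illustrative, not a standard.
import json
import os
import tempfile

REQUIRED_FIELDS = {"input", "expected"}

def load_golden_dataset(path: str) -> list[dict]:
    """Load a JSONL golden dataset, failing fast on malformed records."""
    records = []
    with open(path) as f:
        for line_no, line in enumerate(f, start=1):
            record = json.loads(line)
            missing = REQUIRED_FIELDS - record.keys()
            if missing:
                raise ValueError(f"line {line_no}: missing fields {missing}")
            records.append(record)
    return records

# Demo: write two records and load them back.
path = os.path.join(tempfile.mkdtemp(), "golden.jsonl")
with open(path, "w") as f:
    f.write(json.dumps({"input": "Clause 4 draft", "expected": "Redline clause 4"}) + "\n")
    f.write(json.dumps({"input": "Clause 9 draft", "expected": "No change"}) + "\n")
print(len(load_golden_dataset(path)))  # → 2
```

Keeping the dataset in version control alongside the application code also means every change to "what good looks like" gets the same review scrutiny as a code change.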

Whatever you choose, this step is critical to the success of your evaluation process - otherwise you're flying blind.

The hidden complexity

We've established that non-deterministic AI systems can be monitored and tested with evaluations. But if you're building an application with access to multiple LLMs, custom system and user prompts, finetuned models and other layers, the complexity from a testing perspective increases sharply.

We can take a look at a simple, but realistic example to illustrate this.

Combinations = LLMs × System Prompts × Golden Data Pairs

In the case where you're supporting the 3 major LLM providers with, say, 2 model variations each (6 models in total), 10 system prompts, and 100 golden data pairs, you'd need 6 × 10 × 100 = 6,000 test runs - for this one feature / update.

Add document retrieval from multiple sources, query compression steps, and a few other layers, and the number of combinations skyrockets.
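The blow-up is easy to see by enumerating the matrix directly; every configuration axis you add multiplies the number of runs. A sketch with placeholder configuration names mirroring the counts above:

```python
# Sketch of how the test matrix grows. The counts mirror the example
# above: 3 providers x 2 model variants, 10 prompts, 100 golden pairs.
# All configuration names are placeholders.
from itertools import product

models = [f"model-{i}" for i in range(6)]
system_prompts = [f"prompt-{i}" for i in range(10)]
golden_pairs = [f"pair-{i}" for i in range(100)]

runs = list(product(models, system_prompts, golden_pairs))
print(len(runs))  # → 6000

# Add two retrieval sources and three query-compression settings:
retrieval_sources = ["source-a", "source-b"]
compression = ["none", "light", "aggressive"]
full_matrix = list(product(models, system_prompts, golden_pairs,
                           retrieval_sources, compression))
print(len(full_matrix))  # → 36000
```

Few teams run the full matrix on every change; the usual compromise is a small smoke-test subset per commit and the full sweep before a release.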

Understanding this complexity and planning for it is critical to the success of your evaluation process - and in turn the success of your AI application.

Metrics

To evaluate the results of these test runs, you then need to define a set of metrics to measure. Some common "standard" metrics include Faithfulness, Topic Adherence and Context Recall.

These are generic metrics that are great for indicating drift in performance and give an overview of the system's behavior. But most verticals or use-cases require additional, custom metrics to be defined.
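To make one of the standard metrics concrete, here is a deliberately naive sketch of a context-recall-style score: the fraction of ground-truth statements that appear verbatim in the retrieved context. Real eval frameworks typically use an LLM judge rather than substring matching, and the statements below are invented examples:

```python
# A naive context-recall-style metric: what fraction of ground-truth
# statements can be found verbatim in the retrieved context? Real
# frameworks use an LLM judge instead of substring matching; this
# only illustrates the idea.

def context_recall(ground_truth_statements: list[str], context: str) -> float:
    context_lower = context.lower()
    found = sum(1 for s in ground_truth_statements if s.lower() in context_lower)
    return found / len(ground_truth_statements)

# One of the two expected statements is present in the context:
print(context_recall(
    ["the notice period is 30 days", "governing law is Danish law"],
    "Clause 9: The notice period is 30 days from written notice.",
))  # → 0.5
```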

Let's again take the legal redlining example. Here, you'd want to define metrics that measure the system's ability to redline exactly the correct piece of text, ensuring that neither too much nor too little text is redlined. You'd also want to measure whether the redlines are consistent with the rest of the document, and whether the redlined text is easy to read and understand.

You'd thus want to create a metric - call it "Redline Accuracy" - which could be measured as the percentage of redlined text that is correct. Setting up such a metric in an eval suite, or as a custom step in your CI/CD pipeline, helps you track how the metric changes when you swap a system prompt or LLM model, or when you benchmark a finetuned model against its base model.
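One way such a metric could be implemented - and this is one possible reading of "percentage of redlined text that is correct", not a standard definition - is to compare the character spans the system redlined against human-verified spans:

```python
# A hypothetical "Redline Accuracy" metric (the name comes from the
# running example, not from any standard): the share of characters the
# system redlined that fall inside human-verified redline spans.
# Spans are (start, end) character offsets into the contract text.

def redline_accuracy(predicted: list[tuple[int, int]],
                     expected: list[tuple[int, int]]) -> float:
    predicted_chars: set[int] = set()
    for start, end in predicted:
        predicted_chars.update(range(start, end))
    expected_chars: set[int] = set()
    for start, end in expected:
        expected_chars.update(range(start, end))
    if not predicted_chars:
        return 0.0
    return len(predicted_chars & expected_chars) / len(predicted_chars)

# The system redlined characters 10-30; the golden answer is 10-25 only:
print(redline_accuracy([(10, 30)], [(10, 25)]))  # → 0.75
```

This particular formulation is a precision; pairing it with the mirror-image recall ("how much of the expected redline did the system cover?") would catch both over- and under-redlining.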

Learnings from the field

Evals are an often-overlooked topic among companies that aren't AI-native or lack strong AI engineering teams. But setting up good evaluation strategies and processes is one of the initiatives that will be critical to building a truly differentiated product experience - and it's one of the few things that can be considered intellectual property when building on top of LLMs.

Exploring this topic early and investing in it will help build confidence in your AI product, and help guide your team towards product improvements.
