AI Evaluations as they should be

The Hitchhiker's Guide to Evals

AI Evaluations are hard, but they don't have to be.

Who is this for?

I aim to provide a mildly technical guide to understanding AI Evaluations (Evals) here. This guide is a synthesis of many excellent eval-focused resources and my own domain knowledge from building evals at Meta & LOOT. Mastering the information here will give you enough knowledge to make you dangerous, at least when it comes to assessing AI models and/or AI-based product quality.

My expectation is that the average person coming to this site is working on an AI-based product or is interested in implementing evals better than vibes. To that end, I won't spend a lot of time explaining why evals are important and will instead focus on what it takes to build evals.

Each section outlined on this main page is discussed in depth on the linked sub-pages, so feel free to click into any area that interests you. One last note: all the content here is human-written, and I am notoriously bad at mentally modeling what you will know, so I have tried my best to make the site as AI-accessible as possible in hopes that you pass the URL to your favorite chatbot so it can explain any areas I may have glossed over.

TL;DR

  • 01 Good evals are hard but necessary.
  • 02 If you're at 100% on all your evals, you're probably doing it wrong.
  • 03 Look at your data.
  • 04 People are important.
  • 05 Your evals should be FAST: Falsifiable, Actionable, Specific, and Tractable.

01 · What are you measuring?

Evaluations

Evaluating the quality of your AI-based product presents an entirely different set of challenges than measuring traditional software. Where traditional software is deterministic, i.e., users will always arrive at the same place if they click the buttons in the same order, AI & agent-based products are inherently black boxes. In this new world, inputs and outputs are unbounded by default, creating a massive surface area for unexpected user & agent behaviors.

The core goal of Evals is to provide a window into understanding your model/product.

The best way to do that is by identifying what "good" and "bad" look like. Once you understand how your product is failing, it becomes much easier to fix. In practice, this is the difference between hearing feedback like "Your product sucks" versus knowing "The agent focused on hotel reservations fails when booking for parties of 4 or more". When you have that kind of information, you can fix the underlying issue and then loop through the eval process again to make the product even better.

While the loop is what you will practice day-to-day, this page is organized conceptually with each section building upon the last. If you want to jump ahead, feel free to click on a part of the loop that interests you.

Evals can be applied to a wide variety of characteristics of your model or agent. The common types of measured characteristics are:

  1. Model Behavior
  2. Reasoning
  3. Agentic Behavior
  4. Knowledge
  5. Safety
  6. CIA (Confidentiality, Integrity, Availability)

For the bulk of this site we will focus on agentic behavior since that is one of the most common use cases.

Go deeper: Evaluations

02 · What do good evals look like?

FAST Evals

Knowing generally what you want to measure doesn't tell you how to measure it. FAST evals solve this by defining a rubric for what a good judge should look like:

  • Falsifiable evals help us look only at problems that can be unambiguously proven wrong.
  • Actionable evals help us know what to change when the judge finds a failure mode.
  • Specific evals limit the measurement to a single property. Generic evals like "Quality" & "Relevance" fail this check since they could mean failures across a large swath of levers.
  • Tractable evals are economically viable judges where the severity or frequency of failures make a judge worth building.

Not all evals will pass all 4 gates all of the time. There are many edge cases where metrics must be created that are not immediately actionable (e.g., sentiment metrics) but are still useful for your overall understanding of your product.

Go deeper: FAST

03 · Can you trace what happened?

Observability

Before we can measure anything at all, we must log it. Unlike traditional deterministic software, agentic systems are unbounded in their inputs and outputs. Across longer time-horizons, this unbounded-ness can nullify the benefits of traditional event-based logging. Steps within spans and traces directly affect the steps that come after, so establishing the causal chain amongst your events is crucial to understanding your product.

Judges placed at the right part of the chain can help capture the effects of any changes you make long before it would be surfaced by your users.
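As a minimal sketch of what "establishing the causal chain" can mean in practice, assume each logged span carries a link to its parent. The `Span` fields and the `causal_chain` helper below are hypothetical names for illustration, not part of any particular tracing library.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]  # None marks the root of the trace
    name: str
    ok: bool


def causal_chain(spans: list[Span], span_id: str) -> list[str]:
    """Walk parent links from a span back to the root; return names root-first."""
    by_id = {s.span_id: s for s in spans}
    chain: list[str] = []
    seen: set[str] = set()  # guards against accidental cycles in logged data
    current = by_id.get(span_id)
    while current is not None and current.span_id not in seen:
        seen.add(current.span_id)
        chain.append(current.name)
        parent = current.parent_id
        current = by_id.get(parent) if parent is not None else None
    return list(reversed(chain))


# Hypothetical trace: the booking step fails, and we want to see what led to it.
spans = [
    Span("s1", None, "plan_trip", True),
    Span("s2", "s1", "search_hotels", True),
    Span("s3", "s2", "book_hotel", False),
]
failing = next(s for s in spans if not s.ok)
print(" -> ".join(causal_chain(spans, failing.span_id)))
```

With flat event logs and no parent links, the same question ("what led to this failure?") has to be answered by guessing from timestamps.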

Go deeper: Observability

04 · How does altitude affect our evals?

Measurement Grains

Understanding the level of granularity you want to measure is critical for determining how you are going to build your judge. There are 4 main levels of measurement:

  1. Sessions: The entirety of a multi-turn conversation between a user and the agent
  2. Traces: The transcript of a single user turn (also called a trajectory; it may contain multiple agent steps)
  3. Spans: Each step within a trace
  4. Events: Each action taken within a Span
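The four levels above nest inside one another, which can be sketched as plain data classes. This is an illustrative layout only; the class and field names are assumptions, and a real logging schema will carry far more metadata.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    action: str  # a single action taken within a span


@dataclass
class Span:
    name: str  # one step within a trace
    events: list[Event] = field(default_factory=list)


@dataclass
class Trace:
    user_message: str  # one user turn, possibly many agent steps
    spans: list[Span] = field(default_factory=list)


@dataclass
class Session:
    session_id: str  # the whole multi-turn conversation
    traces: list[Trace] = field(default_factory=list)


def grain_counts(session: Session) -> dict[str, int]:
    """Count how many units exist at each measurement grain."""
    spans = [sp for t in session.traces for sp in t.spans]
    events = [e for sp in spans for e in sp.events]
    return {
        "sessions": 1,
        "traces": len(session.traces),
        "spans": len(spans),
        "events": len(events),
    }


# Hypothetical two-turn hotel booking conversation.
session = Session("demo", [
    Trace("book a hotel", [
        Span("search", [Event("call search_hotels"), Event("read results")]),
        Span("book", [Event("call book_hotel")]),
    ]),
    Trace("add late checkout", [
        Span("modify", [Event("call update_booking")]),
    ]),
])
print(grain_counts(session))
```

The counts grow quickly as you move down the grains, which is one reason judges at the finer levels need to be cheap.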

There is a value vs. actionability tradeoff as you get more refined in your measurement grain. The value of improving the pass rates of judges at coarser levels like Sessions and Traces is readily apparent to end users, yet these judges are typically hard to act on. On the other hand, more granular judges that only look at one step in the agent's trajectory will be highly actionable but may not be as useful by themselves.

There are also cost and complexity tradeoffs as you refine your measurement grain. While the most granular levels (Spans & Events) can be cheaply checked with deterministic evals (e.g., "does this SQL execute?", "is this the correct tool call format?"), the coarser levels often require costly LLM-as-a-judge classification to capture all the potential nuances.
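The two deterministic checks mentioned above could look roughly like the sketch below. It assumes the agent emits SQLite-compatible SQL and JSON tool calls with `name` and `arguments` keys; both assumptions, the schema, and the function names are illustrative rather than a prescribed format.

```python
import json
import sqlite3


def sql_executes(query: str, schema: str) -> bool:
    """Return True if the query runs against an empty in-memory copy of the schema."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(schema)
        conn.execute(query)
        return True
    except (sqlite3.Error, sqlite3.Warning):
        return False
    finally:
        conn.close()


def is_valid_tool_call(raw: str, required_keys=("name", "arguments")) -> bool:
    """Return True if the raw string is a JSON object containing the required keys."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(call, dict) and all(key in call for key in required_keys)


# Hypothetical schema and agent outputs.
SCHEMA = "CREATE TABLE bookings (id INTEGER, party_size INTEGER);"
print(sql_executes("SELECT id FROM bookings WHERE party_size >= 4", SCHEMA))
print(sql_executes("SELECT id FROM reservations", SCHEMA))
print(is_valid_tool_call('{"name": "book_hotel", "arguments": {"party_size": 4}}'))
```

Checks like these only confirm that the output is well-formed, not that it is the right query or the right tool, which is why they pair naturally with coarser judges.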

Go deeper: Grains

05 · What is the point of evals?

Eval Goals

As Ben Hylak put it, each eval can "Benchmark-Maxx" or "Raise the floor" but not both at once. In more common language, this delineates between Capability Evals and Regression Evals.

  • Capability evals are built to be hillclimbed. In other words, these types of evals are meant to identify what "good" outcomes look like and define the metric that can be used to measure progress towards that goal. These types of evals will start low and should increase over time.
  • Regression evals are built to protect from known failure modes. These types of evals are typically built by identifying common errors in production then creating a judge to test for those errors after the underlying issue has been fixed. Regression evals should stay near 100%.

A common mistake I have seen incredibly smart people make at Meta is failing to explicitly denote which camp each of their judges belongs to. When regression and capability judges get mixed up, it can be hard to make sense of the state of your product.
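One way to make the camp explicit is to tag every judge with its goal and read its pass rate through that lens. The sketch below is illustrative: the `Goal` enum, the `interpret` helper, and the 0.95 and 0.99 thresholds are assumptions chosen for the example, not recommended values.

```python
from enum import Enum


class Goal(Enum):
    CAPABILITY = "capability"
    REGRESSION = "regression"


def interpret(goal: Goal, pass_rate: float, previous: float,
              regression_floor: float = 0.95, saturation: float = 0.99) -> str:
    """Read the same pass rate differently depending on the judge's goal."""
    if goal is Goal.REGRESSION:
        # Regression evals should stay near 100%, so any dip is worth a look.
        if pass_rate < regression_floor:
            return "alert: a known failure mode may be back"
        return "ok"
    # Capability evals start low and are meant to be hillclimbed.
    if pass_rate >= saturation:
        return "saturated: consider a harder eval"
    return "improving" if pass_rate > previous else "flat or declining"


print(interpret(Goal.REGRESSION, pass_rate=0.80, previous=0.99))
print(interpret(Goal.CAPABILITY, pass_rate=0.80, previous=0.60))
```

The same 80% is an alarm for one judge and good news for the other, which is exactly the ambiguity that untagged judges create.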

Go deeper: Goals

06 · How do I grade the outputs?

Scoring

There are 3 ways to score the outputs of a session, trace, span, or event. In order of increasing cost, they are Deterministic (Code-based graders), Probabilistic (ML-based graders), and Human (carbon-based graders).

  • Deterministic graders are fast, cheap, easy to debug, and reproducible. But they are often overly specific and work best at the most granular levels of measurement.
  • Probabilistic graders are more expensive than code-based graders. In return for the added cost, these graders are more flexible, more scalable, and can handle nuance. They typically require human calibration before implementation.
  • Human graders are usually the most accurate; however, they are limited in scalability and are the most expensive. Furthermore, not all human graders have the right exposure/context to accurately measure differences in agent outputs.

Most evaluation suites consist of some combination of all three types. Lean on deterministic graders whenever possible as these are the easiest to produce and check.
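One possible way to combine the three grader types is a cascade: run the cheap deterministic checks first and only escalate when they cannot decide. The sketch below is an assumption about how such a cascade might be wired, not a standard pattern from any framework. In particular, `stub_model_judge` is a keyword check standing in for a real probabilistic grader.

```python
from typing import Callable, Optional

# A grader returns True (pass), False (fail), or None (abstain and escalate).
Grader = Callable[[str], Optional[bool]]


def not_empty(output: str) -> Optional[bool]:
    """Deterministic: an empty answer always fails; otherwise abstain."""
    return False if not output.strip() else None


def within_length(output: str, max_chars: int = 500) -> Optional[bool]:
    """Deterministic: overly long answers fail; otherwise abstain."""
    return False if len(output) > max_chars else None


def stub_model_judge(output: str) -> Optional[bool]:
    """Placeholder for a probabilistic grader. A real one would call a model;
    this keyword check only keeps the example runnable."""
    return "confirmed" in output.lower()


def grade(output: str, graders: list[Grader]) -> tuple[Optional[bool], str]:
    """Return the first decisive verdict and the name of the grader that gave it."""
    for grader in graders:
        verdict = grader(output)
        if verdict is not None:
            return verdict, grader.__name__
    return None, "human_review"  # nothing decided, so hand off to a person


PIPELINE = [not_empty, within_length, stub_model_judge]
print(grade("", PIPELINE))
print(grade("Your booking is confirmed.", PIPELINE))
```

Recording which grader produced the verdict keeps failures easy to debug, since a deterministic failure points at a concrete rule.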

Go deeper: Scoring

07 · Which judges deserve to exist?

Building Judges

There are two schools of thought when it comes to identifying relevant judges: Top-Down and Bottom-Up opportunity identification. Each can be used throughout the eval creation process, but they vary in effectiveness. Bottom-Up (error analysis) is best used when you already have production traffic to read; Top-Down (spec-driven) is best used when you are building ahead of the data, before anything has shown you where it breaks.

Once you have identified the opportunities, you must decide the type of judge to build and the granularity that will give you the greatest leverage. This step combines the eval goal and measurement grain decisions from earlier: a goal x grain matrix shows the best positions for actionable judges, and plotting your whole suite on it can reveal gaps or weaknesses. Something I have seen teams do at Meta is build a very comprehensive suite, yet it is only comprehensive along one axis or another. Not all sections of the matrix need to be filled for every opportunity, and identifying which judges are necessary is more art than science, but using FAST as our rubric helps filter out the weak candidates.
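A rough sketch of the goal x grain matrix: tally how many judges sit in each cell and list the empty ones. The judge names and the dictionary layout are hypothetical, and an empty cell is a prompt for discussion rather than a defect, since not every cell needs a judge.

```python
from collections import Counter
from itertools import product

GOALS = ("capability", "regression")
GRAINS = ("session", "trace", "span", "event")


def coverage(judges: list[dict]) -> dict[tuple[str, str], int]:
    """Count judges per (goal, grain) cell, including cells with zero judges."""
    counts = Counter((j["goal"], j["grain"]) for j in judges)
    return {cell: counts.get(cell, 0) for cell in product(GOALS, GRAINS)}


def gaps(judges: list[dict]) -> list[tuple[str, str]]:
    """List the (goal, grain) cells that no judge currently covers."""
    return [cell for cell, n in coverage(judges).items() if n == 0]


# Hypothetical suite for a hotel booking agent.
judges = [
    {"name": "booking_completed", "goal": "capability", "grain": "session"},
    {"name": "party_size_respected", "goal": "regression", "grain": "trace"},
    {"name": "tool_call_format", "goal": "regression", "grain": "event"},
]
for goal, grain in gaps(judges):
    print(f"no {goal} judge at the {grain} grain")
```

A suite that looks comprehensive by judge count can still show a whole row or column of gaps here, which is the one-axis problem described above.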

Go deeper: Building Judges

08 · How do I actually build the judge?

Eval Design

Building the judge is as simple as synthesizing all the decisions you have made up until this point (Measurement grain, Eval goal, and Grader Type) then identifying how to wire the data from your traces to your evaluation harness.

There are a few more decisions you need to make before this runs automatically. Input design, scoring design, and harness x environment design are all prerequisites to building a good judge. Without these decisions in place, you will never know if you are measuring the agent, the model, the harness, or some messy combination of all three. In some cases you may end up with a weird variation of Wittgenstein's ruler, where you end up measuring the judge instead of the agent entirely.

Knowing where each of your evals stands is important as well. Eval cards capturing relevant information like the failure/success mode you are looking to measure, motivating traces, measurement unit, and pass criteria are useful for quickly re-orienting yourself when you come back to each judge as your eval suite expands.
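An eval card can be as small as a single record. The sketch below uses the four pieces of information listed above; the field names, the example values, and the trace IDs are placeholders invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class EvalCard:
    name: str
    target_mode: str        # the failure/success mode being measured
    measurement_unit: str   # e.g. "session", "trace", "span", or "event"
    pass_criteria: str
    motivating_traces: list[str] = field(default_factory=list)

    def summary(self) -> str:
        """One-line description for quickly re-orienting on a judge."""
        return (
            f"{self.name} [{self.measurement_unit}]: {self.target_mode} | "
            f"pass if {self.pass_criteria} | "
            f"{len(self.motivating_traces)} motivating trace(s)"
        )


# Hypothetical card built around the hotel reservation example.
card = EvalCard(
    name="large_party_booking",
    target_mode="booking fails for parties of 4 or more",
    measurement_unit="trace",
    pass_criteria="the booking tool call uses the requested party size",
    motivating_traces=["trace-placeholder-1", "trace-placeholder-2"],
)
print(card.summary())
```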

Go deeper: Eval Design

09 · Is this judge actually measuring what we want it to?

Judge Calibration

This is essentially the question of internal validity from experimentation reframed for Agents. Trusting that your judges are actually measuring what you need can help when determining whether to prioritize fixing your evals versus pivoting on a product decision. Documenting and communicating this kind of calibration is especially useful when you inevitably find a low-scoring regression judge. In cases where the judge is not trusted, the conversation typically devolves into a finger-pointing, loudest-voice competition where the agent builders believe the judge is wrong and the eval builders believe the product is wrong. Handled poorly, this can burn trust in the evals process completely.

What separates token burning from useful judges typically comes down to how much effort you are willing to put into calibrating your judges. Pressure-testing your created evals should be considered a hard and fast rule for any judge. This can be as simple as manually looking through a few traces or as complex as hiring an independent team to score hundreds of sessions. Documenting the level of rigor and scores can help communicate the level of trust in the judge much more precisely than a binary yes/no. In practice, the comprehensiveness of calibration should scale with the severity of the opportunity (e.g., a flagship capability should likely have more intense judges/pressure testing than a deterministic length checker).
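At the simple end of that spectrum, calibration can be a comparison between the judge's verdicts and human labels on the same traces. The sketch below computes raw agreement and Cohen's kappa (agreement corrected for chance) by hand. The label lists are made-up example data, and which score counts as "trusted" is left to you.

```python
def agreement(human: list[int], judge: list[int]) -> float:
    """Fraction of items where the judge and the human gave the same label."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and the same length")
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def cohens_kappa(human: list[int], judge: list[int]) -> float:
    """Agreement corrected for the agreement expected by chance alone."""
    observed = agreement(human, judge)
    n = len(human)
    labels = set(human) | set(judge)
    expected = sum((human.count(l) / n) * (judge.count(l) / n) for l in labels)
    if expected == 1.0:
        # Both raters used a single identical label; kappa is undefined,
        # so this sketch reports perfect agreement by convention.
        return 1.0
    return (observed - expected) / (1 - expected)


# Made-up labels for ten traces: 1 = pass, 0 = fail.
human_labels = [1, 1, 1, 0, 0, 1, 0, 1, 1, 0]
judge_labels = [1, 1, 0, 0, 0, 1, 1, 1, 1, 0]
print(f"agreement: {agreement(human_labels, judge_labels):.2f}")
print(f"kappa:     {cohens_kappa(human_labels, judge_labels):.2f}")
```

Reporting both numbers alongside the sample size communicates the level of rigor more precisely than a yes/no on whether the judge is trusted.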

Go deeper: Calibration

10 · How do I do this at scale?

Eval Infrastructure

Most evals start as a script run locally on someone's machine. Designing the proper eval infrastructure is highly contingent on the type of product that you are building. A simple chat app with input-output could theoretically stay as a simple script forever without much drag from not building a full suite. As your product and your eval harness get more complex, this becomes much less tenable.

The primary questions to keep in mind when building out your infrastructure are related to parity and information asymmetry:

  • Parity: Does my evaluation harness accurately reflect the agent harness as it currently stands today?
  • Information Asymmetry: Does my infrastructure insert knowledge that will make it a better judge of the output than simply re-running the agent again?

To borrow from OOP, your evaluation suite should consist of multiple instances of Class Eval which should take in [Judge(s), Dataset(s), Agent Harness] and should output Eval Cards. Creating a reproducible framework makes it easier to maintain and standardizes the judge building methodology. Without this, each judge adds non-trivial maintenance overhead.
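A minimal sketch of that shape, assuming judges are functions of (input, output) and the agent harness is a function from input to output. The `Eval` class here returns plain dictionaries as stand-ins for Eval Cards, and `toy_agent` and both judges are hypothetical placeholders.

```python
from dataclasses import dataclass
from typing import Callable

Judge = Callable[[str, str], bool]   # (input, output) -> pass/fail
AgentHarness = Callable[[str], str]  # input -> agent output


@dataclass
class Eval:
    name: str
    judges: list[Judge]
    dataset: list[str]
    agent_harness: AgentHarness

    def run(self) -> list[dict]:
        """Run the agent once per input, then score every output with every judge."""
        outputs = [(x, self.agent_harness(x)) for x in self.dataset]
        cards = []
        for judge in self.judges:
            passed = sum(judge(x, y) for x, y in outputs)
            cards.append({
                "eval": self.name,
                "judge": judge.__name__,
                "n": len(outputs),
                "pass_rate": passed / len(outputs) if outputs else 0.0,
            })
        return cards


def toy_agent(prompt: str) -> str:
    """Stand-in for the real agent harness."""
    return f"Booked: {prompt}"


def non_empty(inp: str, out: str) -> bool:
    return bool(out.strip())


def mentions_party_size(inp: str, out: str) -> bool:
    return "party of" in out


suite = Eval(
    name="hotel_booking",
    judges=[non_empty, mentions_party_size],
    dataset=["hotel for a party of 2", "hotel for a party of 6", "late checkout"],
    agent_harness=toy_agent,
)
for card in suite.run():
    print(card)
```

Passing the agent harness in as an argument, rather than re-implementing it inside the eval, is one way to keep the parity question above answerable.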

In addition to offline infrastructure, it is often useful to maintain online evals. These catch regressions earlier and can be influential in identifying the KPI impacts that judges may be able to serve as proxies for in the future.

If you have the resources, rolling your own eval infrastructure will likely be more useful in the long run. With full control over the entire process, you can customize it to your exact needs without any baggage. This is much more doable with high-powered long-running coding agents like Mythos/Fable that can be (mostly) trusted to keep things secure and up-to-date. That being said, there are many excellent frameworks out there that can serve as great starting points.

Go deeper: Infrastructure

11 · How do we communicate what we have learned?

Communicating Results

Much like traditional Data Analytics roles, finding the information is only half of the battle. The results of your Evals must be communicated at varying levels of nuance and length to stakeholders all over your organization. To simplify this, make use of dashboards and integrate alerting as early as possible. There is nothing worse than running into something failing in production that should have been caught by evals days ahead of time.

One failure mode that I have seen is a tendency for non-technical stakeholders to "reinvent evals from first principles". They tend to see product metrics as the golden truth and evals as an inferior substitute. Unfortunately, for LLM-based products, doing enough root cause analysis will require them to push for more and more complex metrics until the only solution becomes building LLM judges. In these cases it would have been simpler and more useful to start with the Eval framework we have established here.

Go deeper: Communicating

12 · How do we use evals to improve our product?

Iterations

Connecting everything back to the product is possibly the most important piece within this guide. Ideally, this should create a loop where feedback from the evals creates opportunities to improve the product, which creates opportunities to get more feedback from evals, and so on. This can look like experimentation, recursive self-improvement, or something as simple as automated prompt optimization.

The real magic happens when this loop gets tighter and faster. The goal here is to make meaningful improvements based on your eval outcomes every week if not every day. Deeply integrating evals into your product development lifecycle can help accelerate this loop even further.

Go deeper: Iteration

13 · What are the common fail-cases?

Foot-guns

A key idea here is that evals are really the easy-bake oven version of reinforcement learning. We build out harnesses, environments, and prompts, then find ways to score the behavior of the model when exposed to those conditions. But we just stop there. Whereas RL must find nuanced, complex technical tricks to suck that information through a straw in thousands of rollouts, eventually achieving meaningful updates at the model-weight level, we can simply update prompts, add tools, or start over at the harness level. Still, the close analogy means we can borrow a lot of the brilliance from RL to make our lives easier. It also means a lot of the footguns that apply to RL and traditional machine learning disciplines can apply here too. This section maps out many of the most common ones.

Go deeper: Foot-guns

Think it stuck?

A short quiz on everything above. No grades, no leaderboard, just a check on whether the mental model landed.

Take the quiz