01 · What are you measuring?
Evaluations
Evaluating the quality of your AI-based product presents an entirely different set of challenges from measuring traditional software. Traditional software is deterministic: users will always arrive at the same place if they click the buttons in the same order. AI and agent-based products, by contrast, are inherently black boxes. In this new world, inputs and outputs are unbounded by default, creating a massive surface area for unexpected user and agent behaviors.
The core goal of Evals is to give you a window into how your model or product actually behaves.
The best way to do that is by identifying what "good" and "bad" look like. Once you understand how your product is failing, it becomes much easier to fix. In practice, this is the difference between hearing feedback like "Your product sucks" and knowing "The agent focused on hotel reservations fails when booking for parties of 4 or more". When you have that kind of information, you can fix the underlying issue, then loop through the eval process again to make the product even better.
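To make the hotel-reservation example concrete, here is a minimal sketch of how slicing eval results by one attribute can turn vague feedback into a specific failure. The `EvalResult` schema, the `pass_rate_by_party_size` helper, and the result data are all hypothetical and made up for illustration; they are not taken from any particular eval framework.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class EvalResult:
    """One graded test case (hypothetical schema, for illustration only)."""
    party_size: int
    passed: bool


def pass_rate_by_party_size(results):
    """Group results by party size and return {party_size: pass_rate}."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for r in results:
        totals[r.party_size] += 1
        passes[r.party_size] += int(r.passed)
    return {size: passes[size] / totals[size] for size in sorted(totals)}


# Made-up results, chosen to mirror the hotel-booking example above.
results = [
    EvalResult(1, True), EvalResult(2, True), EvalResult(2, True),
    EvalResult(3, True), EvalResult(4, False), EvalResult(4, False),
    EvalResult(5, False), EvalResult(6, False),
]

rates = pass_rate_by_party_size(results)
for size, rate in rates.items():
    print(f"party of {size}: {rate:.0%} passed")
```

An overall pass rate on this data would only say that half the bookings fail. Breaking the same results down by party size shows where the failures cluster, which is the kind of information you can act on.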
While the loop is what you will practice day-to-day, this page is organized conceptually with each section building upon the last. If you want to jump ahead, feel free to click on a part of the loop that interests you.
Evals can be applied to a wide variety of characteristics of your model or agent. The common types of measured characteristics are:
- Model Behavior
- Reasoning
- Agentic Behavior
- Knowledge
- Safety
- CIA (Confidentiality, Integrity, Availability)
For the bulk of this site, we will focus on agentic behavior, since it is one of the most common use cases.