
Evaluations

If you can't measure whether your AI setup got better, you're not engineering. You're guessing.

Overview

An eval is a test suite for your AI work. You collect representative inputs for a task, run them through your prompt or setup, and judge the outputs against criteria you defined before you looked at the results. That last part matters. Defining “good” after seeing the output is rationalization, not evaluation.

Say you have a Claude Project that drafts customer release notes from Jira tickets. Your eval might be ten real tickets spanning bug fixes, features, and breaking changes. You run them through, then score each output: Did it catch the breaking change? Did it match the right tone? Did it hallucinate a feature that wasn’t in the ticket? Change the prompt, run the same ten tickets, compare. Now you know whether the change helped.
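To make that loop concrete, here is a minimal harness sketch in Python. The ticket list, the prompt template, and the `flags_breaking_change` check are hypothetical stand-ins for your own test cases and criteria; the only real API used is the Anthropic Python SDK's Messages endpoint, and the model name is a placeholder you'd swap for whichever model your Project uses.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

PROMPT = "Draft customer-facing release notes for this Jira ticket:\n\n{ticket}"

# Representative inputs, fixed before any outputs are inspected.
# Hypothetical tickets; yours would be ten real ones.
TICKETS = [
    {"text": "PROJ-101: Fix crash when exporting empty reports", "breaking": False},
    {"text": "PROJ-102: Remove deprecated v1 auth endpoint (BREAKING)", "breaking": True},
]

def draft(ticket_text: str) -> str:
    """Run one ticket through the current prompt."""
    msg = client.messages.create(
        model="claude-sonnet-4-20250514",  # assumption: substitute your model
        max_tokens=500,
        messages=[{"role": "user", "content": PROMPT.format(ticket=ticket_text)}],
    )
    return msg.content[0].text

def flags_breaking_change(output: str) -> bool:
    """One criterion, defined up front: breaking changes must be called out.
    Tone and hallucination checks would be further functions or human review."""
    return "breaking" in output.lower()

passed = 0
for t in TICKETS:
    out = draft(t["text"])
    if flags_breaking_change(out) or not t["breaking"]:
        passed += 1

print(f"passed {passed}/{len(TICKETS)}")
```

Run it once to get a baseline score, change the prompt, run it again on the same tickets, and the comparison is a pair of numbers rather than an impression.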

The practice scales with the stakes. A weekly status report draft? Spot-check three outputs after a prompt change. A customer-facing knowledge base? Build a repeatable suite with scoring rubrics — automated checks for format and factual grounding, human judgment for tone and nuance. The point isn’t rigor for its own sake. The point is that prompt engineering and context engineering become falsifiable. You make a claim (“this prompt handles edge cases better”), and the eval either confirms or embarrasses you.
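For the repeatable-suite end of that spectrum, the automated half of a rubric can be plain deterministic functions. A sketch, where both the format rule and the grounding heuristic are assumptions: the grounding check is a crude proxy, not a hallucination detector. It flags capitalized terms in the output that never appear in the source ticket, which catches invented feature names at the cost of some noise to triage.

```python
import re

def check_format(output: str) -> bool:
    """Format rule: every nonblank line is a bullet, none longer than 120 chars."""
    lines = [l for l in output.splitlines() if l.strip()]
    return all(l.startswith("- ") and len(l) <= 120 for l in lines)

def check_grounding(output: str, ticket: str) -> list[str]:
    """Return capitalized terms in the output that the ticket never mentions."""
    terms = set(re.findall(r"\b[A-Z][a-zA-Z0-9]+\b", output))
    return sorted(t for t in terms if t.lower() not in ticket.lower())

out = "- Removed the deprecated v1 auth endpoint (BREAKING)\n- Added DarkMode toggle"
ticket = "PROJ-102: Remove deprecated v1 auth endpoint (BREAKING)"
print(check_format(out))            # True
print(check_grounding(out, ticket)) # ['Added', 'DarkMode', 'Removed']
                                    # 'DarkMode' is the real hallucination;
                                    # the other two are noise to triage by hand
```

Checks like these run for free on every prompt change; tone and nuance still go to a human working from the same written rubric.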

This is what makes compounding setups actually compound. Each iteration produces evidence, not vibes.

Resources