Evaluation

How AI work is judged.

If you can't measure it, you can't improve it, and if you measure it wrong, you'll improve the wrong thing. Evaluation in agentic AI is harder than it looks: models are persuasive, output is voluminous, and subjective impressions are unreliable. These terms name the approaches that work, the failure modes that undermine them, and the traps that turn evaluation from a quality gate into a rubber stamp.

Structured Judgment

Evaluation where an AI judge scores work against explicit plans, tasks, evidence, and policies rather than subjective impressions.
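One way to picture the definition above is as a rubric in which every criterion must be backed by cited evidence before it counts. The sketch below is illustrative only: the `Criterion` and `Finding` types, the criterion names, and the weighting scheme are assumptions made for this example, not part of any particular framework. The judge's findings are supplied as plain data here; in practice they would come from an AI judge.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    name: str      # hypothetical label, e.g. "follows_plan"
    weight: float  # relative importance within the rubric


@dataclass(frozen=True)
class Finding:
    criterion: str  # which rubric criterion this finding addresses
    passed: bool    # the judge's verdict for that criterion
    evidence: str   # pointer to a log line, diff, or test result


def score(rubric, findings):
    """Weighted pass rate in [0, 1].

    A criterion earns its weight only if the judge marked it as passed
    AND cited non-empty evidence. Missing or unsupported findings earn 0.
    """
    by_name = {f.criterion: f for f in findings}
    total = sum(c.weight for c in rubric)
    earned = 0.0
    for c in rubric:
        f = by_name.get(c.name)
        if f is not None and f.passed and f.evidence.strip():
            earned += c.weight
    return earned / total if total else 0.0


# Assumed example rubric and findings, for illustration only.
rubric = [
    Criterion("follows_plan", 2.0),
    Criterion("tests_pass", 1.0),
    Criterion("policy_ok", 1.0),
]
findings = [
    Finding("follows_plan", True, "plan.md steps 1-4 match commit history"),
    Finding("tests_pass", True, ""),  # claimed pass, but no evidence cited
]
result = score(rubric, findings)
```

Here only `follows_plan` earns credit: `tests_pass` has no evidence and `policy_ok` was never assessed, so the unsupported claims cannot raise the score.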

Policy-Bound Evaluation

Evaluation scored against explicit plans, tasks, evidence, and policies.

Eval Drift

Gradual misalignment between what an evaluation measures and what actually matters.
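A rough way to watch for this kind of misalignment is to compare eval scores with whatever real-world outcome the eval is meant to predict, and to flag the eval when the two stop moving together. The sketch below assumes paired lists of scores and outcomes for the same pieces of work; the function names and the 0.5 correlation threshold are hypothetical choices for illustration, not established standards.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences.

    Returns 0.0 when either sequence has no variance, since a constant
    signal carries no information about the other.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def has_drifted(eval_scores, outcomes, min_corr=0.5):
    """Flag drift when eval scores no longer track outcomes.

    min_corr is an assumed threshold; a real deployment would need to
    calibrate it against its own historical data.
    """
    return pearson(eval_scores, outcomes) < min_corr


# Assumed data: eval scores paired with an observed outcome measure.
aligned = has_drifted([0.1, 0.5, 0.9], [1, 2, 3])   # scores track outcomes
inverted = has_drifted([0.1, 0.5, 0.9], [3, 2, 1])  # scores oppose outcomes
saturated = has_drifted([0.9, 0.9, 0.9], [1, 2, 3]) # every score identical
```

Running the check over successive time windows, rather than once, is what would reveal the gradual part of the drift.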

Judgment Bias

Systematic skew in AI evaluation due to unexamined assumptions, prompt framing, or training artifacts.

Rubric Rot

Decay in the relevance of evaluation criteria over time, until the eval no longer tests what it should.

Eval Capture

When an agent optimizes for passing evaluation rather than doing the actual work.

Eval Integrity

Evaluation that remains aligned with real-world outcomes over time — resistant to Eval Drift, Rubric Rot, and Eval Capture.