Every team building clinical AI eventually says some version of: "we test it." Then a clinical lead asks "how?" and the answer is "we ran ten cases by hand last sprint." That's not an evaluation framework — it's a vibe check. And it's the single most common reason clinical AI pilots stall.
Here's the framework I keep returning to. It's not novel; it's the careful, integrated version of components most teams have heard of but never wired together.
The four parts
1. Golden cases
A versioned set of input/output pairs that represent the canonical scenarios for your system. For each case:
- Input: the structured context (chart, prompt, scenario).
- Expected output shape: the schema the system should produce.
- Reference answer: the correct output, written and reviewed by a clinician.
- Acceptance criteria: what counts as "matches" — strict equality on structured fields, semantic equivalence on free-text fields.
- Failure modes: the specific things the case is testing for. ("Does it correctly not recommend X when Y?" "Does it cite the guideline?")
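Concretely, a golden case is just a small versioned record. A minimal sketch as a Python dataclass; the field names are illustrative, not a standard:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GoldenCase:
    """One versioned golden case. Field names are illustrative."""
    case_id: str                 # stable ID, referenced in PRs and dashboards
    scenario: str                # e.g. "discharge-summary"
    input_context: dict          # chart excerpt, prompt, scenario parameters
    expected_schema: dict        # schema the output must validate against
    reference_answer: dict       # clinician-written, clinician-reviewed output
    strict_fields: list[str]     # compared by exact equality
    semantic_fields: list[str]   # compared by the judge (semantic equivalence)
    failure_modes: list[str]     # what this case is specifically probing for
    reviewer: str                # named clinical reviewer on the last change
```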
Golden cases are versioned in your repo. Adding one is a PR. Removing one is a PR. Changing one's expected output is a PR with a clinical reviewer named.
Start small: 30–50 cases per scenario. The point isn't coverage in the test-suite sense; it's representation of the failure modes your clinicians actually care about.
2. The judge
For free-text outputs, equality doesn't work. You need a judge that decides whether the candidate output is acceptable given the reference.
The judge is itself a model call. That sounds circular, and it is, but it's a different call: a different model or version, with a prompt focused on evaluating rather than generating. And it has to be calibrated against human judgments on a hold-out set.
Two patterns:
- Reference-based: the judge sees the reference answer and the candidate, and rates similarity / acceptability.
- Reference-free: the judge sees only the input and the candidate, and rates against criteria. Use this for scenarios where there's no single "right" answer.
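A reference-based judge can be one tightly scoped model call. A sketch assuming the OpenAI Python SDK; the model name and prompt wording are placeholders, and forcing a PASS/FAIL output keeps parsing trivial:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are evaluating a clinical AI output.
Given a reference answer and a candidate answer, decide whether the
candidate is clinically acceptable as a substitute for the reference.
Answer with exactly one word: PASS or FAIL."""

def judge_reference_based(reference: str, candidate: str,
                          model: str = "gpt-4o") -> bool:
    """A separate model call with a separate prompt, focused on
    evaluating rather than generating."""
    resp = client.chat.completions.create(
        model=model,
        temperature=0,  # keep judging as deterministic as the API allows
        messages=[
            {"role": "system", "content": JUDGE_PROMPT},
            {"role": "user",
             "content": f"REFERENCE:\n{reference}\n\nCANDIDATE:\n{candidate}"},
        ],
    )
    return resp.choices[0].message.content.strip().upper().startswith("PASS")
```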
Both patterns need calibration. Run the judge against a set of human-judged outputs and tune the prompt until it agrees with the human labels on at least 85% of cases.
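Measuring that agreement is a few lines once you have labeled hold-outs. A sketch reusing the hypothetical judge above, where each hold-out item carries a human pass/fail label:

```python
def judge_agreement(holdout: list[dict]) -> float:
    """Fraction of hold-out cases where the judge matches the human label.

    Each item: {"reference": str, "candidate": str, "human_pass": bool}.
    Tune the judge prompt until this stays above your threshold (e.g. 0.85).
    """
    hits = sum(
        judge_reference_based(ex["reference"], ex["candidate"]) == ex["human_pass"]
        for ex in holdout
    )
    return hits / len(holdout)
```

One caution: check agreement separately on human-pass and human-fail cases. A judge that passes everything can look well calibrated on an imbalanced hold-out set.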
3. Drift monitoring
Run the golden case suite on a schedule. Daily for production systems; per-PR for code/prompt changes.
Track:
- Pass rate — overall and per-scenario.
- Per-case stability — which cases are flaky.
- Confidence drift — for scenarios with calibrated confidence outputs, are the distributions shifting?
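Per-case stability falls out of the run history. A sketch, assuming you store one pass/fail boolean per case per run (the storage shape here is an assumption):

```python
from collections import defaultdict

def flaky_cases(history: list[dict[str, bool]], window: int = 14) -> list[str]:
    """Cases that both passed and failed within the recent window of runs.

    `history` is a list of runs, oldest first; each run maps case_id -> passed.
    """
    recent = history[-window:]
    outcomes: dict[str, set] = defaultdict(set)
    for run in recent:
        for case_id, passed in run.items():
            outcomes[case_id].add(passed)
    return sorted(cid for cid, seen in outcomes.items() if seen == {True, False})
```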
Build a dashboard. Make the team look at it. The dashboard is a forcing function.
4. Production telemetry
Golden cases are not enough. You also need to see what's happening in production:
- Volume — calls per scenario per day.
- Distribution shifts — are the inputs changing? (This is often the real failure mode: the model didn't change, the inputs did.)
- Human override rate — how often does the clinician edit or reject the output?
- Time-to-review — how long does the human reviewer take? (A sudden spike means something changed.)
- Escalation rate — how often does the system route to a human?
These are the leading indicators. Pass rate on golden cases is the lagging indicator.
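None of this needs exotic tooling: one structured event per production call is enough to derive all five metrics. A sketch of the event shape (field names are illustrative, and `print` stands in for whatever log sink you use):

```python
import json
import time

def log_inference_event(scenario: str, *, escalated: bool,
                        clinician_edited: bool, review_seconds: float,
                        input_fingerprint: str) -> None:
    """Emit one telemetry event per production call.

    Volume, override rate, escalation rate, and time-to-review are all
    aggregations over these events; distribution shift is watched via
    `input_fingerprint` (e.g. a coarse bucket of input features).
    """
    print(json.dumps({
        "ts": time.time(),
        "scenario": scenario,
        "escalated": escalated,
        "clinician_edited": clinician_edited,
        "review_seconds": review_seconds,
        "input_fingerprint": input_fingerprint,
    }))
```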
The thing that makes the framework defensible
The framework above is not a guarantee of correctness. It's a system of evidence that you are detecting and responding to changes in behavior.
That distinction matters. Auditors don't expect you to prove the LLM is correct — that would be absurd, and they know it. They expect you to demonstrate:
- You have a methodology for assessing the system's outputs.
- The methodology is documented and versioned.
- The methodology has detected real issues.
- There's a process for responding when it does.
The four-part framework satisfies all four points. Most "we test it" answers don't satisfy any of them.
What to build first
If you're starting from zero, here's the order:
- Define the scenarios. Two or three to start. Resist the urge to do all of them.
- Write 30 golden cases per scenario with a clinical reviewer.
- Implement reference-based judging with calibration on 20 hold-out cases.
- Wire it into CI so every PR runs the suite (see the pytest sketch after this list).
- Add daily scheduled runs with a dashboard.
- Add production telemetry for the same scenarios.
- Then expand to more scenarios.
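The CI step can be as plain as a parametrized pytest. A sketch: `load_golden_cases` and `run_system` are hypothetical helpers standing in for your own repo's case loader and system-under-test, and it reuses the judge sketched earlier:

```python
import pytest

from eval_framework import load_golden_cases, run_system  # hypothetical module
from judge import judge_reference_based                   # the judge sketched above

CASES = load_golden_cases("golden_cases/")  # versioned alongside the code

@pytest.mark.parametrize("case", CASES, ids=lambda c: c.case_id)
def test_golden_case(case):
    output = run_system(case.input_context)  # assumed to return a dict
    # Strict fields: exact equality against the reference.
    for f in case.strict_fields:
        assert output[f] == case.reference_answer[f], f"strict field {f!r} diverged"
    # Free-text fields: the calibrated judge decides acceptability.
    for f in case.semantic_fields:
        assert judge_reference_based(case.reference_answer[f], output[f])
```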
This gets you to "defensible" in 4–6 weeks of focused work. It's the single best ROI in any clinical AI project I've worked on.
The teams that ship clinical AI to production are the ones that treat evaluation as load-bearing infrastructure, not as testing. If you're staring at "we tested it" and aren't sure what to build first, book a call and we can sketch a 30-day plan together.