Strategy · Jan 15, 2026 · 5 min read

Why clinical AI pilots stall — and the checklist that ships them

The same five things kill digital health AI pilots over and over: undefined success criteria, unbounded scope, no evaluation harness, no named governance owner, and a clinician audience that wasn't in the design room. Here is the checklist that catches all five before week one — and the fix for a pilot that has already stalled.

The pattern is so common it has become predictable. A digital health company decides to run a clinical AI pilot. The first month is energetic. The second month produces a demo that impresses the executive sponsor. The third month is quiet. By month four, the team is frustrated, the sponsor has moved on, and the clinician audience that was supposed to use the system has lost interest.

Across the engagements I have run and the ones I have watched from the outside, the same five things kill clinical AI pilots. Different companies, different conditions, different EHRs, different model providers, same five failures.

If your pilot is stalled and you have ten minutes, walk through the list below. If you are starting a pilot, the same list is the gate to pass before week one.

1. No success criteria written down before the pilot started

This is the most common failure and the one most resistant to retroactive fixes. The pilot was framed as "let's see what AI can do," not "let's see if AI can do X measurably better than the current process."

When success is undefined, every output is debatable. Every output is also defendable, which means no one can call the pilot a failure — but no one can call it a success either. The pilot drifts.

The fix has to happen before the pilot starts. One page. Specific. Signed by the executive sponsor.

The shape of a usable success criterion:

"By the end of week 8, the AI workflow will produce drafts that clinical reviewer A finds acceptable without edits in at least 70% of the 50 sampled cases. The current baseline for templated drafts is 40%."

Notice what is in there: a date, a measurable output, an evaluator, a sample size, a baseline. Notice what is not: "improves clinician experience," "reduces cognitive load," "saves time." Those are downstream metrics. Use them as the case for the pilot, not as the test of the pilot.
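One way to keep the criterion from drifting back into prose is to write it down as data the team can check against. The sketch below is only an illustration: the class name, field names, and the `passed` helper are hypothetical, and the numbers are the ones from the example sentence above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SuccessCriterion:
    """One pilot success criterion (hypothetical structure, not a standard)."""
    deadline_week: int   # "by the end of week 8"
    evaluator: str       # the named person who judges the drafts
    sample_size: int     # number of sampled cases
    target_pct: int      # percent of cases that must be accepted without edits
    baseline_pct: int    # percent accepted under the current process

    def required_accepts(self) -> int:
        # Smallest whole number of accepted cases that meets the target,
        # computed with integer arithmetic (ceiling division).
        return (self.target_pct * self.sample_size + 99) // 100

    def passed(self, accepted: int) -> bool:
        if not 0 <= accepted <= self.sample_size:
            raise ValueError("accepted must be between 0 and sample_size")
        return accepted >= self.required_accepts()


# The example criterion from the text, expressed as data.
criterion = SuccessCriterion(
    deadline_week=8,
    evaluator="clinical reviewer A",
    sample_size=50,
    target_pct=70,
    baseline_pct=40,
)

needed = criterion.required_accepts()  # 35 of 50 cases
```

With these numbers the bar is 35 accepted drafts out of 50, against a baseline of 20. Writing it this way makes the pass/fail call mechanical at week 8 instead of a discussion.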

2. Scope expanded mid-pilot

What started as one workflow became four. Each one half-built. None defensible. The team is exhausted and the demo deck has gotten worse, not better.

This is the failure mode that comes from success. The first workflow looks promising in week three, the sponsor asks "what about this other workflow," the team says yes, and the pilot goes off the rails.

The fix is to write the scope and the un-scope on the same page as the success criteria. Two columns:

In scope this pilot: one workflow, named.
Out of scope this pilot: every other workflow people have asked about, named explicitly.

The "out of scope" column is the load-bearing one. Naming things explicitly is what lets the team say no without making it personal.

3. No evaluation harness

Outputs are reviewed by feel. Disagreement about quality is unresolvable. Three people watching the same demo come to three different conclusions and there is no shared ground for the conversation.

This is the failure mode that is most fixable mid-pilot. Spending one week building a small evaluation harness around 30 to 50 golden cases — with a clinician sitting next to the engineer for the labeling — gives the team a shared evaluation surface that wasn't there before.

The harness does not need to be elaborate. A spreadsheet of input cases, an output column for each model version, a clinician-rated quality column, and a small script that runs the cases through the model on demand. That is enough for a first pilot. See evaluation frameworks for clinical AI for the version that scales.
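As a rough sketch of how small that script can be, the Python below runs a list of golden cases through a model function, adds an output column and an empty rating column per model version, and computes the share of rated cases marked acceptable. Everything named here is an assumption for illustration: `placeholder_model` stands in for whatever model call the team actually uses, and the column names and the "accept" / "needs_edits" labels are hypothetical.

```python
import csv
import io
from typing import Callable, Dict, List

Case = Dict[str, str]


def run_cases(cases: List[Case], model: Callable[[str], str], version: str) -> List[Case]:
    """Return copies of the cases with an output column and a blank rating column."""
    results = []
    for case in cases:
        row = dict(case)
        row[f"output_{version}"] = model(case["input"])
        row[f"rating_{version}"] = ""  # filled in later by the clinician
        results.append(row)
    return results


def acceptance_rate(rows: List[Case], version: str) -> float:
    """Share of rated cases marked 'accept'; unrated rows are skipped."""
    rated = [r[f"rating_{version}"] for r in rows if r[f"rating_{version}"]]
    if not rated:
        raise ValueError("no rated cases yet")
    return sum(1 for r in rated if r == "accept") / len(rated)


def to_csv(rows: List[Case]) -> str:
    """Serialize the rows so they can be opened as a spreadsheet."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def placeholder_model(text: str) -> str:
    # Hypothetical stand-in for the real model call.
    return f"DRAFT: {text}"


golden_cases = [
    {"case_id": "c1", "input": "follow-up visit summary"},
    {"case_id": "c2", "input": "medication refill request"},
]

rows = run_cases(golden_cases, placeholder_model, version="v1")
rows[0]["rating_v1"] = "accept"       # clinician ratings, entered by hand here
rows[1]["rating_v1"] = "needs_edits"
rate = acceptance_rate(rows, "v1")
```

Keeping one output and one rating column per model version means a new version adds columns rather than overwriting history, so the team can compare versions on the same cases.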

4. No named governance owner

When a question comes up about a borderline output — should the system handle this case, should this case go to a human, what does the team do with a model that gets this wrong in 1 out of 50 — no one knows whose call it is. Decisions do not get made.

The governance owner is the person who breaks ties on cases like this. They have a name. They have a fixed weekly slot. They are present in the design room. For a small pilot, this is usually the clinical lead or the medical director. For a larger one, it is whoever owns the AI Governance Charter — see SOC 2 and AI governance for that role.

Naming this person is a one-line decision. Skipping it is what causes the pilot to bleed weeks on undecided edge cases.

5. Clinicians weren't in the design room

They were invited to use the pilot. They were not in the room when the workflow was designed, the prompts were drafted, the success criteria were written, or the evaluation cases were picked. Their feedback comes late, feels like rework, and lands as resistance.

This is the failure mode that engineering teams are most likely to discount and the one most likely to kill their pilot. The clinician audience is not the user of the system — they are the system. Their judgment is what the AI is trying to assist. If they were not in the room when the assistance was designed, the assistance is going to feel off in ways that are hard to articulate and impossible to ignore.

The fix is structural. A clinician (or two — one MD, one nurse, depending on the workflow) is on the design team. Not as a reviewer. As a member. They get a fixed slot. They have decision authority on workflow design.

This costs money. It is the right money to spend.

The checklist, condensed

Before the pilot starts:

Success criteria written down. One page. Sponsor signs.
Scope locked to one workflow. Out-of-scope items named.
Small evaluation harness built. 30 to 50 golden cases. With a clinician.
Governance owner named. Fixed weekly slot.
Clinician on the design team from week one.

That is it. Five items. Each one takes hours to set up and saves weeks of drift.

Fixing a pilot that has already stalled

If your pilot is already stalled, the checklist still works. The order is just different:

Stop adding features. Whatever is in scope today is in scope. Anything else is a future pilot.
Write the success criteria for what is in front of you. Be honest. If the current outputs do not meet the bar a sponsor would have signed on, that's now a known gap, not a hidden one.
Build the eval harness this week. Even if it is rough. The harness ends the "is it good?" debate.
Name the governance owner. Today.
Get a clinician into the design conversation. Not the next demo. The design conversation.

The pilot either restarts with these in place or it ends cleanly. Either outcome is better than another quarter of drift.

If your team is staring at a stalled pilot and a cold-eyed second opinion would help, book a call. Sometimes the problem is none of the five above and it is worth knowing before you spend another quarter on the same approach. The full pilot-shipping framework is in the Healthcare AI Automation Playbook.

Tags: pilots, clinical AI, shipping, strategy, digital health
