
Compliance · Apr 22, 2026 · 4 min read

HIPAA-compliant AI automation: what auditors actually look for

The right starting point for a healthcare AI project isn't the model; it's the audit. Here's what actually shows up on a HIPAA control review for an LLM-powered workflow, and the design choices that make it pass cleanly.

Most teams designing healthcare AI start with the model. They pick Claude or GPT, write some prompts, get a demo working, and then — usually the week before a board meeting — someone asks the question that makes the room go quiet: "Wait. Does this thing pass a HIPAA audit?"

The answer at that point is almost always "not in its current form." Which is fine, because the model wasn't the part that needed to be audited anyway. The audit is about the system around the model. And that system is usually 70% of the work.

Here's what auditors actually look for in an LLM-powered healthcare workflow, in roughly the order they'll bring it up.

1. A signed BAA that names the right entities

Every vendor in the data path needs a Business Associate Agreement (BAA). That includes the model provider (Anthropic, OpenAI, Google, etc.), the cloud (AWS, Azure, GCP), any RAG or vector store, any orchestration layer, and any observability tool that touches prompts.

The mistake I see most often: teams have a BAA with their cloud, but their prompts are flowing to an inference API endpoint that isn't covered under that BAA. Anthropic's enterprise tiers cover this. OpenAI's enterprise tier covers this. The free / hobbyist tiers do not. If you can't show the auditor a BAA chain that follows the data, you fail the control.

Design move: before you write a prompt, list every system the prompt or its response will pass through. Each one needs a BAA on file or has to be excluded from the data path.
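One way to keep that list honest is to encode it as a manifest your build can check. Here's a minimal sketch, assuming a hypothetical system inventory and BAA registry; every system, vendor, and workflow name below is illustrative:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SystemInDataPath:
    name: str          # e.g. "model-inference-api"
    vendor: str        # who you'd need the BAA with
    baa_on_file: bool  # is a signed BAA covering THIS endpoint on record?

# Hypothetical data path for a chart-summarization workflow.
DATA_PATH = [
    SystemInDataPath("ehr-export", "Epic", baa_on_file=True),
    SystemInDataPath("vector-store", "Pinecone", baa_on_file=True),
    SystemInDataPath("model-inference-api", "Anthropic", baa_on_file=True),
    SystemInDataPath("prompt-tracing", "SomeObservabilityTool", baa_on_file=False),
]

def check_baa_chain(path: list[SystemInDataPath]) -> list[str]:
    """Return the systems PHI would reach without BAA coverage."""
    return [s.name for s in path if not s.baa_on_file]

uncovered = check_baa_chain(DATA_PATH)
if uncovered:
    # Either get a BAA signed, or take the system out of the data path
    # (e.g. redact prompts before they reach the tracing tool).
    raise RuntimeError(f"PHI data path includes uncovered systems: {uncovered}")
```

Run this in CI and the "we added an observability tool and forgot about the BAA" failure mode becomes a build break instead of an audit finding.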

2. PHI minimization at the prompt level

This is the one that catches well-engineered teams off guard. Even with BAAs in place, the principle of minimum necessary applies to what's in the prompt. If your model is summarizing a chart and you're sending the full encounter note to the API when you only need three fields, an auditor can flag that.

Design move: structure your skill files and prompts to pull the smallest data shape that does the job. Build a tokenization or de-identification layer when the task allows. Treat the prompt as a data export — because that's what it is.
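Here's a minimal sketch of what "smallest data shape" can look like in practice, with hypothetical field names. Note that a real de-identification layer should use a vetted method (Safe Harbor or expert determination), not the toy regex shown here:

```python
import re

# Hypothetical full encounter record as it comes out of the EHR.
encounter = {
    "patient_name": "Jane Doe",
    "mrn": "00123456",
    "dob": "1984-02-11",
    "chief_complaint": "shortness of breath",
    "assessment": "Likely asthma exacerbation. Patient reports ...",
    "plan": "Start albuterol inhaler, follow up in 2 weeks.",
}

# The task only needs three fields, so only three fields cross the boundary.
PROMPT_FIELDS = ("chief_complaint", "assessment", "plan")

def minimal_prompt_payload(record: dict) -> dict:
    """Project the record down to the minimum-necessary shape."""
    return {k: record[k] for k in PROMPT_FIELDS}

def crude_redact(text: str) -> str:
    """Illustrative only: masks date-like strings. A production pipeline
    should use an approved de-identification method, not ad-hoc regexes."""
    return re.sub(r"\d{4}-\d{2}-\d{2}", "[DATE]", text)

payload = {k: crude_redact(v) for k, v in minimal_prompt_payload(encounter).items()}
# `payload` is what goes to the API. Treat it as a data export, because it is one.
```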

3. Audit logging that survives a review

Every prompt + response pair touching PHI needs to be logged. Not just the model call — the user action that triggered it, the system context that was assembled, the output that was returned, and the human action taken on that output. Auditors want to be able to reconstruct any decision the system contributed to.

Design move: treat your prompt orchestration layer as the audit boundary. Every call goes through a logging middleware. The logs go to an immutable store with retention policies that match your existing PHI retention policy.
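Here's a minimal sketch of that middleware idea. The audit_sink is a stand-in for an append-only store; the human action taken on the output would be logged separately under the same context_id so the full decision can be reconstructed:

```python
import hashlib
import json
import time
from typing import Callable

def log_model_call(call_model: Callable[[str], str], audit_sink):
    """Wrap a model call so every invocation emits an audit record.
    `audit_sink` needs only an append() method; in production it should
    be an immutable store with PHI-grade retention."""
    def wrapped(prompt: str, *, user_id: str, context_id: str) -> str:
        response = call_model(prompt)
        record = {
            "ts": time.time(),
            "user_id": user_id,        # who triggered the call
            "context_id": context_id,  # what system context was assembled
            "prompt_sha256": hashlib.sha256(prompt.encode()).hexdigest(),
            "response_sha256": hashlib.sha256(response.encode()).hexdigest(),
            "prompt": prompt,          # full contents, retained per PHI policy
            "response": response,
        }
        audit_sink.append(json.dumps(record))
        return response
    return wrapped

# Demo wiring; a list stands in for the immutable store.
audit_log: list[str] = []
ask = log_model_call(lambda p: "Draft summary ...", audit_log)
ask("Summarize encounter ...", user_id="u-42", context_id="ctx-7")
```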

4. Access controls that match clinical roles

If the workflow surfaces patient data, the access model has to mirror the EHR. A clinical reviewer can see different patients than a medical assistant can. An admin can see metadata about what the system did, but not the contents.

Design move: don't reinvent access control. Federate to your EHR's RBAC if possible. If not, mirror the role structure exactly and document the mapping.
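When federation isn't available and you have to mirror the role structure, the mapping can be as plain as a table the code enforces. A sketch with illustrative roles and permission names:

```python
# Hypothetical mapping from EHR roles to what the AI workflow may expose.
# Patient-panel scoping (which patients a role can see) lives in the EHR
# and is checked separately; this table only gates resource types.
ROLE_PERMISSIONS: dict[str, set[str]] = {
    "clinician":         {"patient_content", "system_metadata"},
    "medical_assistant": {"patient_content"},
    "admin":             {"system_metadata"},  # what the system did, never contents
}

def can_view(role: str, resource: str) -> bool:
    """Unknown roles get nothing; fail closed."""
    return resource in ROLE_PERMISSIONS.get(role, set())

assert can_view("admin", "system_metadata")
assert not can_view("admin", "patient_content")
```

Whichever direction you go, the documented role mapping is what the auditor reads; the table above is just that document made executable.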

5. Output review and human-in-the-loop where it matters

This is where "AI governance" stops being abstract. Auditors want to see, for any output that influences a clinical decision, either (a) a human review step, (b) a structural reason a review isn't needed (e.g., this is a draft for a clinician to edit, never autonomous), or (c) a documented risk acceptance signed off by the right authority.

Design move: classify every output of your system into one of those three buckets, on paper, before the auditor arrives. The system should enforce the classification, not just describe it.
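Here's a minimal sketch of what "enforce the classification" can mean, with hypothetical output types. The point is structural: an unclassified output can't be released at all:

```python
from enum import Enum

class ReviewClass(Enum):
    HUMAN_REVIEW = "human_review"            # (a) a person must sign off
    STRUCTURALLY_SAFE = "structurally_safe"  # (b) e.g. a draft for a clinician to edit
    RISK_ACCEPTED = "risk_accepted"          # (c) documented, signed risk acceptance

# Hypothetical classification, decided on paper before the auditor arrives.
OUTPUT_CLASSES = {
    "chart_summary_draft": ReviewClass.STRUCTURALLY_SAFE,
    "discharge_instructions": ReviewClass.HUMAN_REVIEW,
    "scheduling_suggestion": ReviewClass.RISK_ACCEPTED,
}

def release(output_type: str, human_approved: bool = False) -> bool:
    """Gate every output through its classification. An output type
    missing from OUTPUT_CLASSES raises KeyError: fail closed."""
    cls = OUTPUT_CLASSES[output_type]
    if cls is ReviewClass.HUMAN_REVIEW:
        return human_approved
    return True
```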

6. An evaluation harness with golden cases

The single fastest way to lose an auditor's confidence is to say "we tested it" and not be able to point at a test suite. They don't expect 100% coverage. They expect a documented evaluation methodology, a versioned set of golden cases, and a record of what changed when.

Design move: ship the evaluation harness on day one. Treat it like the test suite for a regulated medical device — even when it isn't one. The discipline is what's load-bearing.
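A sketch of the smallest useful shape for such a harness: versioned cases plus a check each output must satisfy. The must_contain assertion style is illustrative; real checks are usually richer:

```python
from typing import Callable

# Versioned golden cases; keep this file in source control so
# "what changed when" has an answer.
GOLDEN_CASES = [
    {"id": "gc-001", "input": "Summarize: ...", "must_contain": "albuterol"},
    {"id": "gc-002", "input": "Summarize: ...", "must_contain": "follow up"},
]

def run_golden_cases(model_fn: Callable[[str], str], cases: list[dict]) -> dict:
    """Run every case and report per-case pass/fail plus an overall rate."""
    results = {c["id"]: c["must_contain"] in model_fn(c["input"]) for c in cases}
    return {"pass_rate": sum(results.values()) / len(results), "results": results}
```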

7. Drift monitoring and incident response

Models change. Prompts change. Data shifts. The auditor wants to see that you'll notice, and that you have a process when you do.

Design move: build a drift dashboard that runs golden cases on schedule. Define what a "production-affecting" change is. Write the incident-response playbook before the first incident.
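A sketch of the scheduled check, reusing the run_golden_cases harness and GOLDEN_CASES from the previous section. The thresholds and the open_incident hook are illustrative; the real values belong in your playbook:

```python
BASELINE_PASS_RATE = 0.95          # recorded when the system was approved
PRODUCTION_AFFECTING_DROP = 0.05   # defined before the first incident, not after

def open_incident(report: dict) -> None:
    """Stand-in for your paging/ticketing hook."""
    print(f"DRIFT INCIDENT: pass rate fell to {report['pass_rate']:.2f}")

def scheduled_drift_check(model_fn) -> None:
    # Run on a schedule (cron, Airflow, etc.) against the live model + prompts.
    report = run_golden_cases(model_fn, GOLDEN_CASES)
    if report["pass_rate"] < BASELINE_PASS_RATE - PRODUCTION_AFFECTING_DROP:
        open_incident(report)
```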

8. A documented model selection rationale

Why this model? Why this version? What's your plan if the provider deprecates it? An auditor isn't asking you to predict the future — they're asking you to demonstrate that you've thought about it.

Design move: include a short "model selection" section in your system documentation. Update it quarterly.
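The record doesn't need to be elaborate. A hypothetical example of what that section can capture; the model identifier, dates, and plan are all made up:

```python
MODEL_SELECTION = {
    "model": "example-model-v2",  # illustrative identifier, not a real product name
    "selected": "2026-01-15",
    "rationale": "Best golden-case pass rate among BAA-covered enterprise options.",
    "deprecation_plan": "Re-run golden cases on the successor; 30-day overlap.",
    "next_review": "2026-04-15",  # quarterly
}
```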


The pattern, if you haven't already noticed it: HIPAA compliance for AI is mostly a documentation and observability problem, not a model problem. The model is one component in a system that has to be auditable end-to-end.

The teams that ship AI in healthcare quickly are the ones that figure this out at the start. The teams that struggle are the ones that build the demo first and try to retrofit governance later.

If your team is in the "demo first, governance later" position right now and the audit is coming up, book a call. The retrofit is doable, but it goes faster with someone who's already done it.

HIPAA · AI governance · audit · PHI · BAA

Next step

Want me to build something like this for your team?

Thirty-minute call. We'll look at the workflow you most wish was already automated and decide if it's a fit.