We previously asked five leading AI models what they thought of UAMS AI Wayfinding. Their responses were generous and genuinely useful. But one question came up across all of them: how will you know if this actually works?
That is the right question, and it is the one we spent this month answering. This post walks through the measurement program UAMS Web Services built to track whether AI Wayfinding is actually improving how AI tools represent UAMS, and how we plan to respond when it is not.
Why measurement is hard
AI model behavior is a moving target. Models update constantly. Any observed improvement in how an AI tool describes UAMS could come from our Wayfinding files, from a model retraining, from third-party content changes, or from vendors tweaking their systems. No controlled experiment is possible.
Unlike search engines, AI tools do not publish performance dashboards. There is no “AI Search Console” that tells us how often UAMS is cited, or how accurately. The effectiveness of the Wayfinding convention also depends on whether AI vendors continue to honor it, which varies across tools and changes over time.
A measurement program that pretended to solve these problems would lose credibility quickly. We built one that names them openly.
How we will measure
The program combines three complementary practices, each capturing something the others cannot.
Monthly systematic scoring. On the first Monday of every month, we will send the same set of questions about UAMS to five AI tools: ChatGPT, Claude, Gemini, Perplexity, and Microsoft Copilot. Each response is scored on a simple three-point rubric (accurate, partial, inaccurate) and logged in a shared sheet; a sketch of what that log might look like follows the three practices below. The full cycle of twelve UAMS questions plus six abstracted peer-comparison questions, across five tools, runs about 90 to 110 minutes.
Peer comparison. Each month, one peer institution is sampled alongside UAMS. The rotation covers four peers across the year and gives UAMS leadership a genuine sense of where the institution stands relative to peers who either invested in AI readability earlier or did not invest at all.
Continuous incident reporting. A public-facing form lets anyone, whether UAMS staff, patients, or members of the public, report an AI answer that looks wrong. Web Services triages submissions weekly. Most result in content updates on UAMS pages; a few severe cases may be escalated through AI vendor feedback channels.
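To make the rubric concrete, here is a minimal sketch of how a single scored response and a per-tool monthly summary could be represented. The field names, the numeric score values, and the helper function are illustrative assumptions, not the actual UAMS measurement schema.

```python
# A minimal sketch of how one month's scoring log could be structured.
# Field names, Score values, and the tool list are illustrative only.
from dataclasses import dataclass
from datetime import date
from enum import Enum


class Score(Enum):
    ACCURATE = 2
    PARTIAL = 1
    INACCURATE = 0


TOOLS = ["ChatGPT", "Claude", "Gemini", "Perplexity", "Microsoft Copilot"]


@dataclass
class ScoredResponse:
    pull_date: date      # first Monday of the month
    tool: str            # one of TOOLS
    question_id: str     # e.g. a UAMS question or a peer-comparison question
    score: Score         # three-point rubric: accurate, partial, inaccurate
    notes: str = ""      # reviewer observations, possible incidents


def monthly_summary(responses: list[ScoredResponse]) -> dict[str, float]:
    """Average score per tool for one monthly pull (0.0 to 2.0)."""
    summary: dict[str, float] = {}
    for tool in TOOLS:
        scores = [r.score.value for r in responses if r.tool == tool]
        if scores:
            summary[tool] = sum(scores) / len(scores)
    return summary
```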
What we gave up on purpose
The most important design decision was what we chose not to build.
We could have automated the monthly scoring. Browser agents can click links, paste questions, and capture responses. But the measurement program’s validity depends on a human reading each AI response in the tool’s native interface. That reading is the measurement, not a byproduct of it. If an agent scrapes responses and a staffer scores text excerpts in a spreadsheet, the institutional knowledge and incident-detection value disappear. Automation was the easy answer. Reading responses yourself is the honest one.
This decision will be revisited in October 2026. By then, six months of operational data will tell us whether 90 minutes per month is sustainable and whether the human engagement is producing the expected value. If not, we adjust.
What happens with the data
Every monthly pull generates a new tab in a shared measurement sheet and a summary report committed to a private repository. Reports aggregate UAMS performance by tool, trend over time, and peer comparison. The first real report will land in early May 2026, covering April data.
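For the curious, a rough sketch of the rollup a monthly report could perform, building on the scored-response structure above. The function names and the plain-text report format are assumptions for illustration; the real reports live in the private repository.

```python
# Sketch of the monthly rollup: per-tool trend across pulls plus a
# UAMS-vs-peer gap for the month. Names and format are illustrative.
from collections import defaultdict


def trend_by_tool(monthly_summaries: dict[str, dict[str, float]]) -> dict[str, list[tuple[str, float]]]:
    """Reshape {month: {tool: avg}} into {tool: [(month, avg), ...]} sorted by month."""
    trend: dict[str, list[tuple[str, float]]] = defaultdict(list)
    for month in sorted(monthly_summaries):
        for tool, avg in monthly_summaries[month].items():
            trend[tool].append((month, avg))
    return dict(trend)


def render_report(month: str, uams: dict[str, float], peer: dict[str, float]) -> str:
    """Plain-text summary suitable for committing alongside the monthly data."""
    lines = [f"AI Wayfinding measurement report, {month}", ""]
    for tool in sorted(uams):
        gap = uams[tool] - peer.get(tool, uams[tool])
        lines.append(f"{tool}: UAMS avg {uams[tool]:.2f} (gap vs. sampled peer {gap:+.2f})")
    return "\n".join(lines)
```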
Quarterly, a deeper audit runs against the accumulated monthly data, using a purpose-built audit tool that identifies persistent misrepresentations and generates recommendations for the Wayfinding files. These quarterly reports go to Web Services leadership.
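One way to operationalize "persistent misrepresentation" is a question-and-tool pair scored inaccurate in every monthly pull of the quarter. The sketch below illustrates that idea using the types from the earlier example; it is not the purpose-built audit tool itself, and the definition of persistence is an assumption.

```python
# Sketch of a persistence check: flag (tool, question) pairs scored
# inaccurate in every monthly pull of the quarter. Builds on the
# ScoredResponse and Score types sketched earlier; purely illustrative.


def persistent_misses(quarter: dict[str, list[ScoredResponse]]) -> set[tuple[str, str]]:
    """Return (tool, question_id) pairs marked inaccurate in every month of the quarter."""
    per_month = [
        {(r.tool, r.question_id) for r in responses if r.score is Score.INACCURATE}
        for responses in quarter.values()
    ]
    return set.intersection(*per_month) if per_month else set()
```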
Annually, at the end of the first year, we synthesize everything into an evidence-based assessment: what changed in how AI tools represent UAMS over twelve months, what we can and cannot attribute to the Wayfinding initiative, and what we recommend for year two.
The honest expectation for year one
We are not going to prove causality. This program cannot separate the effect of the Wayfinding files from the effect of model retraining, content updates, or any of the other variables that move AI behavior month to month. Any institution that claims it can is overselling.
What we will have is a disciplined signal: twelve months of scored data across five AI tools, a catalog of specific misrepresentations identified and addressed, a peer comparison across four institutions, and a reasonable basis for telling UAMS leadership whether the initiative is earning its keep.
That is not a small outcome. Most academic medical centers are not yet measuring this at all. Many will launch AI readability initiatives over the next eighteen months without any way to tell whether they worked. UAMS is choosing to know.
For peers and staff
If you work at a peer academic medical center thinking about similar measurement, we are happy to share the methodology. If you are UAMS staff and have encountered an AI answer about UAMS that looks wrong, please use the report form. Every submission helps us find the gaps the monthly monitoring misses.
The full measurement plan and the question set are documented internally. Web Services is the team of record.