Responding to patient messages using LLMs

 

Study Design:

Using 100 simulated cancer patient scenarios paired with questions, researchers evaluated the impact of using an LLM (GPT-4) to draft responses to patient questions.
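
The summary does not describe the study's actual prompting protocol, but a minimal sketch of the general workflow, assuming the OpenAI Python SDK and an illustrative system prompt of our own wording, might look like the following; the draft is intended for clinician review, not for sending directly.

```python
# Minimal sketch: drafting a reply to a patient portal message with GPT-4.
# The prompt text and workflow are illustrative assumptions, not the study's
# actual protocol (which is not specified in this summary).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def draft_reply(patient_scenario: str, patient_question: str) -> str:
    """Return an LLM-drafted reply for a clinician to review and edit."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {
                "role": "system",
                "content": (
                    "You are drafting a reply to a patient portal message on "
                    "behalf of an oncology clinician. The draft will be "
                    "reviewed and edited by the clinician before sending."
                ),
            },
            {
                "role": "user",
                "content": (
                    f"Scenario: {patient_scenario}\n\n"
                    f"Patient question: {patient_question}"
                ),
            },
        ],
    )
    return response.choices[0].message.content
```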

Key Results:
Safety: 58% of LLM-generated responses were safe and usable without edits, although 7.7% posed safety risks if used unedited.

Reduced Workload: Clinicians reported subjective efficiency improvements with LLM assistance, which could reduce time spent on patient communications and potentially alleviate burnout.

Impact on Clinical Decision-Making:
The content of manual responses was significantly different from the content of LLM draft and LLM-assisted responses. LLM errors tended to arise not from incorrect biomedical factual knowledge, but from errors in clinical gestalt and in identifying the urgency of a situation.

Significance:
We found pre-clinical evidence of anchoring based on LLM recommendations, raising the question: is using an LLM to assist with documentation simple decision support, or will clinicians tend to take on the reasoning of the LLM? Although this was a simulation study, these early findings provide a safety signal indicating a need to thoroughly evaluate LLMs in their intended clinical contexts, reflecting the precise task and level of human oversight. Moving forward, more transparency from EHR vendors and institutions about prompting methods is urgently needed for such evaluations. LLM assistance is a promising avenue to reduce clinician workload, but it has implications that could have downstream effects on patient outcomes. This necessitates evaluating LLMs with the same rigor as any other software as a medical device.

 

AIM Investigators