An MIT study finds that non-clinical information in patient messages, such as typos, extra whitespace, or colorful language, can reduce the accuracy of large language models used to make treatment recommendations. The models were also consistently less accurate for female patients, even when all gender markers were removed from the text.
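To make the failure mode concrete, here is a minimal sketch of the kind of perturbation test the study describes: take a patient message, inject non-clinical noise (typos, extra whitespace), and check whether the model's recommendation drifts. The function names and the toy model are illustrative assumptions, not the study's actual harness; `recommend` stands in for whatever LLM call you would wrap.

```python
import random
import re


def add_typos(text: str, rate: float = 0.05, seed: int = 0) -> str:
    """Swap adjacent characters in a fraction of words to simulate typos."""
    rng = random.Random(seed)
    words = text.split(" ")
    for i, w in enumerate(words):
        if len(w) > 3 and rng.random() < rate:
            j = rng.randrange(len(w) - 1)
            words[i] = w[:j] + w[j + 1] + w[j] + w[j + 2:]
    return " ".join(words)


def add_whitespace(text: str, seed: int = 0) -> str:
    """Randomly widen the spaces between words."""
    rng = random.Random(seed)
    return re.sub(" ", lambda m: " " * rng.choice([1, 1, 2, 3]), text)


def compare_recommendations(message: str, recommend) -> dict:
    """Query the model on original and perturbed messages; report which variants
    still agree with the baseline recommendation.

    `recommend` is any callable mapping a patient message to a recommendation
    string (hypothetically, a wrapper around an LLM API call).
    """
    variants = {
        "original": message,
        "typos": add_typos(message),
        "whitespace": add_whitespace(message),
    }
    answers = {name: recommend(text) for name, text in variants.items()}
    baseline = answers["original"]
    return {name: ans == baseline for name, ans in answers.items()}


if __name__ == "__main__":
    # Dummy "model" for demonstration only: it recommends a visit solely when it
    # sees the exact single-spaced phrase "chest pain", so whitespace noise
    # flips its answer even though the clinical content is unchanged.
    def toy_model(msg: str) -> str:
        return "refer to clinic" if "chest pain" in msg else "manage at home"

    msg = "I have had chest pain and shortness of breath since yesterday."
    print(compare_recommendations(msg, toy_model))
```

A real evaluation would replace `toy_model` with the deployed model and score recommendations against a clinical ground truth rather than simple string equality, but the point stands: if surface noise changes the answer, the system is reacting to form, not content.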
LLMs are not Large Medical Expert Systems. They are Large Language Models, evaluated on how convincing their output is rather than on how accurate or useful it is.