ChatBIAS: Demographics Sway LLM Healthcare Recommendations, Study Shows


Large language models (LLMs) show promise in healthcare, but a new study reveals a concerning issue: these models may produce biased recommendations based on patients’ sociodemographic labels rather than clinical needs.

Researchers from the Icahn School of Medicine at Mount Sinai and the Mount Sinai Health System analyzed over 1.7 million outputs from nine LLMs across 1,000 emergency cases, each presented with varying demographic profiles. Published in Nature Medicine, the study found that cases labeled as Black, unhoused, or LGBTQIA+ were more frequently directed by the LLMs toward urgent care, invasive procedures, or mental health evaluations, sometimes at rates beyond what clinical guidelines would support. In contrast, high-income labels led to more advanced diagnostic imaging, while low- and middle-income patients often received fewer tests or less comprehensive care. These findings suggest that AI-driven healthcare systems may inadvertently perpetuate, rather than reduce, healthcare inequities.
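The article does not reproduce the study's pipeline, but the core design (the same clinical vignette presented under different demographic labels, then queried across several models) can be sketched roughly as below. The case template, label set, model names, and the `query_model` function are illustrative placeholders, not the researchers' actual code or data.

```python
"""Illustrative sketch of a counterfactual demographic-probing setup.

Not the study's code: the vignette, labels, models, and query_model
are placeholders used only to show the shape of such an experiment.
"""
from itertools import product

# Minimal stand-in for a clinical vignette; the study used 1,000 ER cases.
CASE_TEMPLATE = (
    "Patient description: {label} presenting with {complaint}. "
    "Question: What triage priority, diagnostic tests, treatment, "
    "and mental health assessment (if any) do you recommend?"
)

# Hypothetical sociodemographic labels and complaints for illustration.
LABELS = [
    "a Black unhoused man",
    "a high-income white woman",
    "a Black transgender woman",
    "a middle-income man",
]
COMPLAINTS = ["acute chest pain", "severe headache"]
MODELS = ["model-a", "model-b"]  # the study compared nine LLMs


def query_model(model_name: str, prompt: str) -> str:
    """Placeholder for a real LLM API call; wire to a provider of your choice."""
    raise NotImplementedError


def run_probe():
    """Query every (model, label, complaint) combination and collect outputs."""
    results = []
    for model, label, complaint in product(MODELS, LABELS, COMPLAINTS):
        prompt = CASE_TEMPLATE.format(label=label, complaint=complaint)
        try:
            answer = query_model(model, prompt)
        except NotImplementedError:
            answer = None  # no backend configured in this sketch
        results.append(
            {"model": model, "label": label, "complaint": complaint, "answer": answer}
        )
    return results


if __name__ == "__main__":
    for row in run_probe():
        print(row["model"], "|", row["label"], "|", row["complaint"])
```

Comparing how recommendations shift when only the demographic label changes, while the clinical details stay fixed, is what allows bias to be attributed to the label rather than to clinical need.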

LLMs and the growing influence of AI in patient care

LLMs are increasingly used in healthcare to assist with triage, diagnosis, treatment planning, and mental health assessments. Their ability to quickly process large volumes of data is seen as a way to improve efficiency in clinical settings. However, the Mount Sinai study raises concerns that these models may reflect and amplify existing societal biases, particularly in a field where disparities in healthcare outcomes are well-documented. For example, Black women face higher maternal mortality rates, LGBTQ individuals report poorer health outcomes, and low-income populations have less access to advanced diagnostics.

Since LLMs are trained on vast datasets containing human-generated data, they risk inheriting these biases, which could distort medical recommendations and perpetuate inequities in care.

Insights from a large-scale analysis

Led by Girish N. Nadkarni, MD, and Eyal Klang, MD, the study analyzed nine LLMs’ responses to four clinical questions: triage priority, diagnostic testing, treatment approaches, and mental health evaluation. While larger models like GPT-4o and Qwen-2-72B offered more nuanced outputs, they sometimes reinforced harmful biases. Smaller models like Phi-3.5-mini-instruct and Gemma-2-27B-it exhibited more variability and struggled with reliability in complex cases.

Despite variations in model performance, all LLMs showed a common trend: sociodemographic characteristics frequently influenced recommendations more than clinical guidelines or physician judgment. For instance, Black transgender women and unhoused individuals were more likely to receive mental health evaluations, with some cases showing a 43% increase in recommendations compared to other groups. Additionally, high-income patients were more often recommended advanced imaging, such as CT or MRI scans, while low- and middle-income patients were less likely to receive such tests.

Disparities in mental health recommendations

The study found one of the most pronounced disparities in mental health assessment recommendations. Patients labeled as Black transgender women, Black transgender men, or unhoused were significantly more likely to be referred for mental health evaluations, regardless of their clinical need. This trend points to the potential for AI models to perpetuate harmful stereotypes or assumptions that marginalized populations are more likely to require mental health care.

Treatment recommendations also varied based on sociodemographic labels. For example, unhoused patients, Black transgender individuals, and those labeled as Middle Eastern were more often recommended inpatient care, while middle-income and low-income white patients were less likely to receive such referrals. These trends suggest that demographic assumptions in AI models could lead to either over-treatment or under-treatment, depending on the patient’s identity.

Diagnostic testing and invasive procedure biases

The study also revealed disparities in the types of diagnostic tests and interventions recommended based on sociodemographic factors. High-income patients were more likely to be recommended advanced imaging like CT or MRI scans, while low- and middle-income patients received fewer tests. Similarly, patients labeled as unhoused or Black and unhoused were more frequently recommended invasive procedures, indicating a potential overuse of interventions for these groups.

These findings raise concerns that, while AI models aim to optimize decision-making, they may unintentionally reinforce healthcare disparities by prioritizing certain patient groups over others based on demographic labels.

Safeguarding ethical AI in healthcare

The researchers stress the need for ongoing audits and corrective measures to address the biases identified in the study. They advocate for equity-focused prompt engineering, continuous evaluations of AI outputs, and the inclusion of clinician oversight to prevent biased decision-making in healthcare.
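The article does not specify what such an audit would look like in practice, but a minimal version, assuming you have already tabulated how often a given recommendation is made under each demographic label, might compare rates with a simple statistical test. The counts, label names, and 0.05 threshold below are illustrative assumptions, not figures from the study.

```python
"""Rough sketch of a recommendation-rate audit across demographic labels.

The counts below are made-up placeholders, not data from the study.
"""
from scipy.stats import chi2_contingency

# Hypothetical counts: rows = demographic label, columns =
# (mental health evaluation recommended, not recommended).
counts = {
    "label_A": (143, 857),
    "label_B": (100, 900),
}

table = [list(v) for v in counts.values()]
chi2, p_value, dof, _ = chi2_contingency(table)

rates = {label: rec / (rec + no) for label, (rec, no) in counts.items()}
print("Recommendation rates by label:", rates)
print(f"chi2={chi2:.2f}, p={p_value:.4f}")

if p_value < 0.05:
    print("Rates differ across labels; flag these outputs for clinician review.")
```

Run periodically against fresh model outputs, a check like this can surface label-driven divergences before they reach patient care, which is the kind of continuous evaluation and clinician oversight the researchers call for.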

As AI systems become more integrated into healthcare, ensuring that these tools do not exacerbate existing inequities is essential. The study calls for increased attention to the ethical development of AI in healthcare and the implementation of safeguards to protect vulnerable populations from biased or inequitable care. The researchers emphasize the importance of balancing technological innovation with ethical responsibility to ensure that AI tools improve, rather than hinder, equitable access to healthcare.


