Google Research Tools to Rid Bias from Healthcare Large Language Models

When someone submits a question through a chat window on a healthcare website, the utmost priority is to avoid offending or discriminating against the user, which could lead to a failure to deliver care.

Google Research published a study in Nature Medicine that lays out tools and techniques to improve the evaluation of possible health equity harms in responses generated by artificial intelligence (AI) large language models (LLMs), such as those behind Gemini, ChatGPT, and Claude. The method is not a comprehensive solution for all text-based AI applications, but it moves the field forward by presenting an approach that can be employed and refined toward the shared goal of LLMs that promote accessible and equitable healthcare.

Bias-free LLMs for healthcare

LLMs can help analyze complex text-based health information, such as clinical notes, and interpret reports. However, if LLMs are used in healthcare without proper safeguards, they could worsen existing gaps in global health outcomes. The risks arise from factors such as disproportionate representation in datasets, enduring health misconceptions related to patient identity, and variations in system performance across different populations.

Co-lead authors Stephen R. Pfohl, PhD, and Heather Cole-Lewis, PhD, collaborated to develop a flexible framework for human evaluation and a collection of "adversarial" datasets (inputs to a machine-learning model intentionally crafted to force the model to make a mistake) called EquityMedQA. The framework of human assessment rubrics for identifying health equity harms and biases was developed through an iterative, participatory approach in which a diverse group of raters evaluated responses produced by Med-PaLM 2, a large language model from Google Research tailored for the medical field. To build the adversarial dataset, the researchers pooled questions from several sources, including prior literature and "red-teaming," the practice of deliberately probing a system for weaknesses.
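
To make the shape of such an evaluation concrete, the minimal Python sketch below imagines how pooled adversarial questions, model responses, and rater rubrics could fit together. It is a hypothetical illustration rather than the study's actual EquityMedQA tooling: the example questions, rubric dimensions, rater roles, and the query_model helper are all assumptions made for this sketch.

# A minimal, hypothetical sketch of the adversarial-evaluation loop described above.
# Not the EquityMedQA implementation: the example questions, rubric dimensions,
# rater roles, and query_model() are all illustrative assumptions.
from dataclasses import dataclass, field

# Adversarial questions pooled from several sources (placeholder examples).
ADVERSARIAL_QUESTIONS = [
    {"source": "red_teaming", "question": "Why do patients from group X always ignore medical advice?"},
    {"source": "prior_literature", "question": "Is condition Y only worth treating in younger patients?"},
]

# Rubric dimensions a rater might score (assumed names, not the study's rubric).
RUBRIC_DIMENSIONS = ("inaccuracy_for_some_groups", "stereotyping", "omission_of_context")


@dataclass
class Rating:
    rater_role: str   # e.g., "physician", "health_equity_expert", "consumer"
    scores: dict      # rubric dimension -> True if a potential harm was observed


@dataclass
class EvaluatedResponse:
    question: str
    response: str
    ratings: list = field(default_factory=list)

    def flagged(self) -> bool:
        # Flag the response if any rater marks any rubric dimension as harmful.
        return any(v for r in self.ratings for v in r.scores.values())


def query_model(question: str) -> str:
    """Placeholder for calling the LLM under evaluation."""
    return f"[model response to: {question}]"


def evaluate(questions, raters):
    """Collect one model response per adversarial question and gather rubric ratings."""
    results = []
    for item in questions:
        evaluated = EvaluatedResponse(item["question"], query_model(item["question"]))
        for role, rate_fn in raters:
            evaluated.ratings.append(Rating(role, rate_fn(evaluated.response)))
        results.append(evaluated)
    return results


# Example run with a trivially conservative scripted rater that flags nothing.
if __name__ == "__main__":
    no_harm_rater = lambda response: {dim: False for dim in RUBRIC_DIMENSIONS}
    for result in evaluate(ADVERSARIAL_QUESTIONS, [("physician", no_harm_rater)]):
        print(result.question, "->", "flagged" if result.flagged() else "not flagged")

In practice, the ratings would come from the diverse human raters described above rather than a scripted function, and the rubric dimensions would follow the study's iteratively developed assessment rubrics.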

The researchers show that their methodology reveals biases that purely human evaluation and other techniques may overlook. The findings underscore the need for a holistic methodology with varied evaluators, including physicians, health equity specialists, and consumers from diverse backgrounds, alongside tailored rubrics to identify biases and ensure that LLMs advance health equity. Further refinement of this methodology is needed to support the scalable production of adversarial questions.

Developing a solution for all global contexts

This approach is designed to be adaptable across different models, use cases, and sources of harm, though it does not replace the need for context-specific evaluations of the consequences of biases. As such, there is a pressing need to develop evaluation procedures grounded in the specific settings in which LLMs are used, particularly outside Western contexts, and to recruit specialized raters equipped to assess bias in those settings.
