LLMs Encode Clinical Knowledge :A Review

Last Updated on January 19, 2023 by Editorial Team

Author(s): Ronny Polle

Originally published on Towards AI.

LLMs Encode Clinical Knowledge: A Quick Review

Outline

Introduction
Contributions
Limitations
Conclusion
References

Introduction

In the field of medicine, language is an enabler of key interactions for and between clinicians, researchers, and patients. This provides opportunities for leveraging LLMs for modeling properties of textual data in the medical domain.

There is evidence that LLMs can act as implicit knowledge bases. The weights of these networks store information, resulting in information that is pliable and hence can be operated upon in a representation space. This phenomenon equips LLMs with an ability to form associations between stored information to produce meaningful insights. The unfortunate news is that this associative capability can lead to hallucinations as information stored in weights is unreliable. Hence, current AI models for applications to medicine and healthcare lack the ability to address significant gaps in effectively leveraging language as a tool for mediating real world clinical workflows. For instance, LLMs are found to have the potential to mirror misinformation,bias, and stereotypes in the corpus.

With the profound advances made by LLMs, AI systems are undergoing innovative repurposing and helping address limitations posed by predominantly single-task AI systems.

Contribution

Key contributions made in this study can be summarised across the following three axes;

Dataset benchmark
Framework for human evaluation
Modeling

Firstly, this paper introduces a dataset benchmark for medical question answering called MultiMedQA. This benchmark is a collection of six open-question-answering datasets — MedQA [jin2021disease], MedMCQA [pal2022medmcqa], PubMedQA [jin2019pubmedqa], LiveQA [abacha2017overview], MedicationQA [abacha2019bridging], and MMLU clinical topics [hendrycks2020measuring]). MedQA mirrors the US Medical License Exam (USMLE) style of questions. Furthermore, this benchmark is augmented by HealthSearchQA, the seventh dataset comprising curated commonly searched consumer health queries in English.

Secondly, a robust human-centric evaluation framework is proposed to address some of the current limitations with automated metrics for assessing long-form answer generation, such as the bilingual evaluation understanding metric (BLEU). Clinicians and lay users (non-experts) are captured in the evaluation of the models’ generative output. The clinician’s evaluation is run along twelve different evaluation axes whilst lay users are evaluated along two unique axes. These include — how well the models’ output agrees with scientific consensus; the possibility and the likelihood of harm; evidence of comprehension, reasoning, and retrieval ability; the presence of inappropriate, incorrect, or missing content; the possibility of bias in the answer; answer captures user intent and helpfulness of the answer.

Thirdly, this paper highlighted how well LLMs encode clinical knowledge with major architectural modifications. The authors built on the Pathway Language Model (PaLM) and Flan-PaLM family of LLMs. PaLM is a transformer model architecture composed of a decoder-only setup, with key features such as — SwiGLU activation function to substitute for standard activation functions(ReLU, Swish, GeLU), parallel transformer layers, multi-query attention mechanism, rotary position embeddings (ROPE) as a substitute for absolute or relative position embeddings, shared-input output embeddings, zero bias kernels, and layer normalizations, and use of a SentencePiece vocabulary.

Leveraging the PaLM baseline and instruction prompt tuning paradigms, the authors demonstrated the Flan-PaLM variant to gain superior performance across a suite of evaluation tasks over the baseline.

Moreover, given the key limitations of domain adaptation methods and end-to-end finetuning of the model using copious amounts of in-domain data, the authors successfully investigated prompting and prompt-tuning to aid Flan-PaLM in adapting to the medical domain. The instruction prompt tuning technique designed in this study incorporates soft-prompt learned by prompt-tuning as an initial prefix that is shared across multiple medical datasets, followed by a task-specific human-engineered prompt alongside the original question and/or context.

Limitations

Although promising, the dataset benchmark fails to cover multiple languages and excludes a larger variety of medical and scientific domains, hence partially reflecting real-world clinical workflows. Secondly, although Flan-PaLM was able to reach state-of-the-art performance on several medical questions and answering benchmarks, there are important gaps to be bridged in order for it to reach an expert clinician level on many clinically important axes. Important future directions proposed to help address this challenge include — establishing solid grounds for the models’ responses in authoritative medical sources and accounting for the time-varying nature of medical consensus, the ability to effectively quantify and communicate uncertainty to a generic user-in-the-loop, and multilingual support for responses.

Secondly, it is critical to conduct exhaustive work to improve the human evaluation framework. The pilot rating framework is not very exhaustive as it fails to capture important variations across diverse population groups. Also, the pool of clinicians and lay-users assessing the model responses is limited. Lastly, the evaluation framework failed to investigate the impact of variation in the clinician rater’s medical specialty, demographics, and geography.

Moreover, fairness and equity are underexplored in this study, especially the lacking understanding of how perturbations to demographic identifiers in prompts influence the model outputs. Also, the safety-critical and complex requirements of the medical domain pose an important question as to how the approach of sampling clinicians to participate in identifying the best-demonstration prompts examples and crafting few-shot prompts impacts the overall behavior of LLMs.

In conclusion, I am fascinated by the performance of LLMs demonstrated in this rigorous study. Not only does it exemplify a successful application and evaluation of LLMs in the medical context, but also, it demonstrates exciting directions for future research and improvements.

Thank you for reading 🙂

References

[1] Large Language Models (LLMs) encode clinical knowledge

[2] Galactica: A Large Language Model for Science

[3] PaLM : Scaling Language Modeling with Pathways

LLMs Encode Clinical Knowledge :A Review was originally published in Towards AI on Medium, where people are continuing the conversation by highlighting and responding to this story.

Join thousands of data leaders on the AI newsletter. Join over 80,000 subscribers and keep up to date with the latest developments in AI. From research to projects and ideas. If you are building an AI startup, an AI-related product, or a service, we invite you to consider becoming a sponsor.

Published via Towards AI

Frequently Used, Contextual References

Resources

Publication

LLMs Encode Clinical Knowledge :A Review

Author(s): Ronny Polle

LLMs Encode Clinical Knowledge: A Quick Review

Outline

Introduction

Contribution

Limitations

References

Feedback ↓ Cancel reply

Popular posts

Best Laptops for Deep Learning, Machine Learning (ML), and Data Science for 2023

Best Workstations for Deep Learning, Data Science, and Machine Learning (ML) for 2022

Descriptive Statistics for Data-driven Decision Making with Python

Best Machine Learning (ML) Books - Free and Paid - Editorial Recommendations for 2022

Best Data Science Books - Free and Paid - Editorial Recommendations for 2022

Updates

Recent Posts

LAI #66: Information Theory for People in a Hurry

🔎 Decoding LLM Pipeline — Step 1: Input Processing & Tokenization

Meta to Launch Its Own In-House AI Chip

I Built an AI Money Coach in Python — Here’s How You Can Too (Step-by-Step Guide!)

ChatGPT Now Works Natively in Xcode and VS Code

The World’s Leading AI and Technology Publication.

Company

CONTACT US

🔥 Recommended Articles 🔥

Cookie	Duration	Description
cookielawinfo-checkbox-analytics	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Analytics".
cookielawinfo-checkbox-functional	11 months	The cookie is set by GDPR cookie consent to record the user consent for the cookies in the category "Functional".
cookielawinfo-checkbox-necessary	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookies is used to store the user consent for the cookies in the category "Necessary".
cookielawinfo-checkbox-others	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Other.
cookielawinfo-checkbox-performance	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Performance".
viewed_cookie_policy	11 months	The cookie is set by the GDPR Cookie Consent plugin and is used to store whether or not user has consented to the use of cookies. It does not store any personal data.

Frequently Used, Contextual References

Resources

Publication

LLMs Encode Clinical Knowledge :A Review

Author(s): Ronny Polle

LLMs Encode Clinical Knowledge: A Quick Review

Outline

Introduction

Contribution

Limitations

References

Related posts

Feedback ↓ Cancel reply

Popular posts

Updates

Recent Posts

The World’s Leading AI and Technology Publication.

Company

CONTACT US

GDPR CCPA Statement

Subscribe to our AI newsletter!

🔥 Recommended Articles 🔥