
Hallucinations in Healthcare LLMs: Why They Happen and How to Prevent Them
Building Trustworthy Healthcare LLM Systems — Part 1

TL;DR
LLM hallucinations: AI-generated outputs that sound convincing but contain factual errors or fabricated information — posing serious safety risks in healthcare settings.
Three main types of hallucinations: factual errors (recommending antibiotics for viral infections), fabrications (inventing non-existent studies or guidelines), misinterpretations (drawing incorrect conclusions from real data).
Root causes of hallucinations: probabilistic generation, training that rewards fluency over factual accuracy, lack of real-time verification, and stale or biased data.
Mitigation approaches: Retrieval-Augmented Generation (RAG), domain-specific fine-tuning, advanced prompting, guardrails.
This series: build a hallucination-resistant pipeline for infectious disease knowledge, starting with a PubMed Central corpus.
Hallucinations in medical LLMs aren’t just bugs — they’re safety risks. This series walks through how to ground healthcare language models in real evidence, starting with infectious diseases.
Introduction
LLMs (large language models) are changing how we interact with medical knowledge — summarizing research, answering clinical questions, even offering second opinions. But they still hallucinate — and in medicine that’s a safety risk, not a quirk.
In medical domains, trust is non-negotiable. A hallucinated answer about infectious disease management (e.g., wrong antibiotic, incorrect diagnostic criteria) can directly impact patient safety, so grounding models in verifiable evidence is mandatory.
That’s why this blog series exists. Across four parts, it will show you how to build a hallucination-resistant workflow, step by step:
- Part 1 (this post): What hallucinations are, why they happen, and how to build a domain-specific corpus from open-access medical literature
- Part 2: Turn that corpus into a RAG pipeline
- Part 3: Add hallucination detection metrics
- Part 4: Put it all together and build a transparent interface to show users the evidence behind the LLM’s responses
What Are Hallucinations in LLMs?
Hallucinations are model-generated outputs that sound fluent and coherent but are not factually correct: they are convincing, yet often false, unverifiable, or entirely made up.
Why They Matter in Healthcare
These errors can have serious consequences in clinical settings, where they might lead to improper treatment recommendations. A wrong recommendation can be a matter of life or death, which is why it is critical to mitigate hallucinations by building transparent, evidence-based systems.

Main Types of Hallucinations
1. Factual Errors
Factual errors happen when LLMs make incorrect claims about verifiable facts. Using our infectious disease example, recommending antibiotics for influenza would be a type of factual error.
2. Fabrications
Fabrications involve LLMs inventing non-existent entities or information. In the context of healthcare, for example, these could be fictional research studies, medical guidelines that don’t exist or made-up technical concepts.
3. Misinterpretations
Misinterpretations happen when an LLM takes real information but misrepresents it or puts it in the wrong context. For example, a model might reference a study that exists but draw the wrong conclusions from it.
Why LLMs Hallucinate
Large language models hallucinate because they don’t truly understand facts the way humans do; they simply predict which words should come next based on patterns observed in their training data.
When these AI systems encounter unfamiliar topics or ambiguous questions, they don’t have the ability to say “I don’t know” and instead generate confident-sounding but potentially incorrect responses. This tendency stems from several factors:
- Their training prioritizes fluent, human-like text over factual caution
- They lack real-time access to verified information sources
- They have no inherent understanding of truth versus fiction.
- Conflicting information in training data can push the model to average contradictory sources.
The problem is compounded by limitations in training data that may contain outdated, biased, or inaccurate information, as well as the fundamental auto-regressive nature of how these models generate text one piece at a time.
How Can We Address Hallucinations?
There are various methods to mitigate or detect hallucinations.
Mitigation Strategies
- Fine-tuning with Domain-Specific Data: A main cause of hallucination is knowledge gaps in the model’s training data. Fine-tuning introduces domain-specific knowledge and can produce models that better understand specialized medical terminology and the nuances of clinical text.
- Retrieval-Augmented Generation (RAG): This method integrates external knowledge sources by retrieving relevant information before generating the answer. It grounds the model’s outputs in verified external sources instead of relying only on its training data. This is the method we will focus on in this series.
- Other noteworthy strategies: Advanced prompting methods such as Chain-of-Thought or few-shot prompting can mitigate hallucinations by guiding the model’s answer in the right direction, and rules-based guardrails that screen outputs before they reach users add another safety layer (a minimal sketch follows this list).
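To make the prompting and guardrail ideas concrete, here is a minimal sketch. The template wording, the build_prompt and passes_guardrail helpers, and the word-overlap heuristic are illustrative assumptions, not a production safety layer (Part 2 builds the actual RAG grounding):

GROUNDED_PROMPT = """You are a clinical assistant. Answer ONLY from the numbered sources below.
If the sources do not contain the answer, reply exactly: "I don't know."

Sources:
{sources}

Question: {question}
Answer:"""

def build_prompt(question: str, sources: list[str]) -> str:
    # Number the retrieved snippets and drop them into the grounded template.
    numbered = "\n".join(f"[{i + 1}] {s}" for i, s in enumerate(sources))
    return GROUNDED_PROMPT.format(sources=numbered, question=question)

def passes_guardrail(answer: str, sources: list[str]) -> bool:
    # Crude rules-based check: accept explicit refusals, otherwise require some
    # lexical overlap between the answer and the sources before showing it to users.
    if "i don't know" in answer.lower():
        return True
    source_text = " ".join(sources).lower()
    overlap = [w for w in answer.lower().split() if len(w) > 6 and w in source_text]
    return len(overlap) >= 3

# Example with a dummy source snippet
sources = ["Influenza is a viral infection; antibiotics are not indicated for uncomplicated cases."]
prompt = build_prompt("Should I prescribe antibiotics for influenza?", sources)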
Hallucination Detection
- Source-attribution scoring: This method compares the LLM’s answer to the retrieved documents to measure how much of the answer is grounded in the sources. Beyond identifying hallucinations, it also makes it possible to highlight the sources behind the answer, which helps build trust and transparency.
- Semantic Entropy Measurement: This method measures uncertainty about the meaning of generated responses and has been developed specifically to address the risk of hallucinations in safety-critical areas such as patient care.
- Consistency-Based Methods: These methods rely on a self-consistency check: prompt the model multiple times with the same query and compare the outputs; low agreement between answers is a hallucination signal (see the sketch after this list).
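As a taste of the consistency-based idea, here is a minimal sketch. ask_llm is a placeholder for whatever model client you use (it is not defined here), and the token-overlap agreement score is a deliberately simple stand-in; real implementations compare answers with semantic similarity or an entailment model:

def token_overlap(a: str, b: str) -> float:
    # Crude agreement score: Jaccard overlap of lowercased tokens.
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    return len(tokens_a & tokens_b) / max(len(tokens_a | tokens_b), 1)

def looks_consistent(ask_llm, query: str, n: int = 5, threshold: float = 0.5) -> bool:
    # Ask the same question n times and treat low agreement as a hallucination signal.
    # ask_llm is assumed to be a callable you provide: prompt string -> answer string.
    answers = [ask_llm(query) for _ in range(n)]
    scores = [token_overlap(answers[0], other) for other in answers[1:]]
    return sum(scores) / max(len(scores), 1) >= threshold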
If you’re interested in reading recent research on this topic, here are a few open-access papers worth reading:
- A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions
- Hallucinations and Key Information Extraction in Medical Texts: A Comprehensive Assessment of Open-Source Large Language Models
- High Rates of Fabricated and Inaccurate References in ChatGPT-Generated Medical Content
- Large language models encode clinical knowledge
Code Walkthrough: Downloading Medical Research from PubMed Central
To reduce hallucinations in healthcare LLMs, grounding them in reliable medical literature is critical. Let’s start by building a corpus from one of the best sources available: PubMed Central (PMC).
This script helps you automate the retrieval of open-access medical papers, making it easy to bootstrap a dataset tailored to your task (e.g., infectious diseases). Here’s how it works:
1. Setup and Environment
import requests
import xml.etree.ElementTree as ET
import json
import os, re, time
from dotenv import load_dotenv
load_dotenv()
api_key = os.getenv("NCBI_API_KEY")
email = os.getenv("EMAIL")
base_url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"
You’ll need to set your NCBI API key and email in a .env file. You can still call the NCBI API without an API key, but a key unlocks higher rate limits, and it’s free.
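For reference, the .env file only needs two lines; the values below are placeholders:

NCBI_API_KEY=your_ncbi_api_key_here
EMAIL=you@example.com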
2. Search PMC
Because we are interested in full texts for our knowledge base, we should only download open-access articles. To do so, we fetch the articles from PMC:
# 1. Search PMC
search_url = f"{base_url}esearch.fcgi"
search_params = {
    "db": "pmc",
    "term": query,
    "retmax": max_results,
    "retmode": "json",
    "api_key": api_key,
    "email": email
}
print("Searching PMC...")
search_resp = requests.get(search_url, params=search_params)
search_resp.raise_for_status()
ids = search_resp.json()["esearchresult"]["idlist"]
This code queries PMC with your search terms (for example “infectious diseases”) and returns a list of document identifiers (PMCIDs).
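For orientation, the JSON returned by esearch looks roughly like this (abridged, with made-up IDs for illustration):

{
  "esearchresult": {
    "count": "15342",
    "retmax": "100",
    "idlist": ["10223344", "10198765", "10154321"]
  }
}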
3. Fetch and Parse Articles
Now we can fetch the full texts using the PMCIDs:
# 2. Batch fetch
fetch_url = f"{base_url}efetch.fcgi"
for i in range(0, len(ids), batch_size):
    batch_ids = ids[i : i + batch_size]
    fetch_params = {
        "db": "pmc",
        "id": ",".join(batch_ids),
        "retmode": "xml",
        "api_key": api_key,
        "email": email,
    }
    time.sleep(delay)
    r = requests.get(fetch_url, params=fetch_params)
    r.raise_for_status()
The response is an XML document, so the final step is to parse it and build a dictionary with the relevant fields (pmcid, title, abstract, full_text, publication_date, authors):
root = ET.fromstring(r.content)
for idx, article in enumerate(root.findall(".//article")):
    # Extract article details
    article_data = {
        "pmcid": f"PMC{batch_ids[idx]}",
        "title": "",
        "abstract": "",
        "full_text": "",
        "publication_date": "",
        "authors": [],
    }
    # Extract title
    title_elem = article.find(".//article-title")
    if title_elem is not None:
        article_data["title"] = "".join(title_elem.itertext()).strip()
    # Extract abstract
    abstract_parts = article.findall(".//abstract//p")
    if abstract_parts:
        article_data["abstract"] = " ".join(
            "".join(p.itertext()).strip() for p in abstract_parts
        )
    # Extract publication date
    pub_date = article.find(".//pub-date")
    if pub_date is not None:
        year = pub_date.find("year")
        month = pub_date.find("month")
        day = pub_date.find("day")
        date_parts = []
        if year is not None:
            date_parts.append(year.text)
        if month is not None:
            date_parts.append(month.text)
        if day is not None:
            date_parts.append(day.text)
        article_data["publication_date"] = "-".join(date_parts)
    # Extract authors
    author_elems = article.findall(".//contrib[@contrib-type='author']")
    for author_elem in author_elems:
        surname = author_elem.find(".//surname")
        given_names = author_elem.find(".//given-names")
        author = {}
        if surname is not None:
            author["surname"] = surname.text
        if given_names is not None:
            author["given_names"] = given_names.text
        if author:
            article_data["authors"].append(author)
    # Extract full text (combining all paragraphs)
    body = article.find(".//body")
    if body is not None:
        paragraphs = body.findall(".//p")
        article_data["full_text"] = " ".join(
            "".join(p.itertext()).strip() for p in paragraphs
        )
Each article’s data can then be saved to a JSONL file that we will use in our next step: building our RAG system.
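Writing those records out is straightforward; here is a minimal sketch, assuming articles is a list holding the article_data dictionaries built above:

import json

# Append each parsed article as one JSON object per line (JSONL).
with open("pmc_articles.jsonl", "w") as f:
    for article_data in articles:
        f.write(json.dumps(article_data) + "\n")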
Let’s be mindful of licensing restrictions: open access means anyone can access and read the content, but it doesn’t mean the authors agreed to redistribution of their work.
This blog post and its code are intended for personal and educational use. If you use this function to build a dataset that will be redistributed or commercialized, you must comply with each article’s license agreement. To do so, let’s define a function that pulls the license data from the downloaded article:
def detect_cc_license(lic_elem):
    """
    Inspect <license> … </license> for Creative Commons URLs or keywords
    and return a normalised string such as 'cc-by', 'cc-by-nc', 'cc0', or 'other'.
    """
    if lic_elem is None:
        return "other"
    # 1) gather candidate strings: any ext-link href + full text
    candidates: list[str] = []
    for link in lic_elem.findall(".//ext-link[@ext-link-type='uri']"):
        href = link.get("{http://www.w3.org/1999/xlink}href") or link.get("href")
        if href:
            candidates.append(href.lower())
    candidates.append("".join(lic_elem.itertext()).lower())
    # 2) search for CC patterns
    for text in candidates:
        if "creativecommons.org" not in text and "publicdomain" not in text:
            continue
        # order matters (most restrictive first)
        if re.search(r"by[-_]nc[-_]nd", text):
            return "cc-by-nc-nd"
        if re.search(r"by[-_]nc[-_]sa", text):
            return "cc-by-nc-sa"
        if re.search(r"by[-_]nc", text):
            return "cc-by-nc"
        if re.search(r"by[-_]sa", text):
            return "cc-by-sa"
        if "/by/" in text:
            return "cc-by"
        if "publicdomain/zero" in text or "cc0" in text or "public domain" in text:
            return "cc0"
    return "other"
Here’s a short breakdown of what the license labels returned by this function mean:
- cc0: public domain dedication; no restrictions on reuse or redistribution.
- cc-by: reuse and redistribution allowed with attribution to the authors.
- cc-by-sa: attribution required; derivative works must be shared under the same license.
- cc-by-nc: attribution required; non-commercial use only.
- cc-by-nc-sa: non-commercial use only, with attribution and share-alike terms.
- cc-by-nc-nd: non-commercial use only, with attribution and no derivative works.
- other: no Creative Commons license detected; safest to treat redistribution as not permitted.
Here’s the full function for the PMC download:
def download_pmc_articles(query,
                          max_results=100,
                          batch_size=20,
                          delay=0.2,
                          allowed_licenses={"cc-by", "cc-by-sa", "cc0"},
                          out_file="pmc_articles.jsonl"):
    base_url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"
    # 1. Search PMC
    search_url = f"{base_url}esearch.fcgi"
    search_params = {
        "db": "pmc",
        "term": query,
        "retmax": max_results,
        "retmode": "json",
        "api_key": api_key,
        "email": email
    }
    print("Searching PMC...")
    search_resp = requests.get(search_url, params=search_params)
    search_resp.raise_for_status()
    ids = search_resp.json()["esearchresult"]["idlist"]
    # 2. Batch fetch
    fetch_url = f"{base_url}efetch.fcgi"
    skipped, saved = 0, 0
    with open(out_file, "w") as f:
        for i in range(0, len(ids), batch_size):
            batch_ids = ids[i:i+batch_size]
            fetch_params = {
                "db": "pmc",
                "id": ",".join(batch_ids),
                "retmode": "xml",
                "api_key": api_key,
                "email": email
            }
            time.sleep(delay)
            r = requests.get(fetch_url, params=fetch_params)
            r.raise_for_status()
            root = ET.fromstring(r.content)
            for idx, article in enumerate(root.findall(".//article")):
                # Check license
                license = detect_cc_license(article.find(".//license"))
                if license not in allowed_licenses:
                    skipped += 1
                    continue  # skip disallowed license
                # Extract article details
                article_data = {
                    "pmcid": f"PMC{batch_ids[idx]}",
                    "title": "",
                    "abstract": "",
                    "full_text": "",
                    "publication_date": "",
                    "authors": []
                }
                # Extract title
                title_elem = article.find(".//article-title")
                if title_elem is not None:
                    article_data["title"] = "".join(title_elem.itertext()).strip()
                # Extract abstract
                abstract_parts = article.findall(".//abstract//p")
                if abstract_parts:
                    article_data["abstract"] = " ".join("".join(p.itertext()).strip() for p in abstract_parts)
                # Extract publication date
                pub_date = article.find(".//pub-date")
                if pub_date is not None:
                    year = pub_date.find("year")
                    month = pub_date.find("month")
                    day = pub_date.find("day")
                    date_parts = []
                    if year is not None:
                        date_parts.append(year.text)
                    if month is not None:
                        date_parts.append(month.text)
                    if day is not None:
                        date_parts.append(day.text)
                    article_data["publication_date"] = "-".join(date_parts)
                # Extract authors
                author_elems = article.findall(".//contrib[@contrib-type='author']")
                for author_elem in author_elems:
                    surname = author_elem.find(".//surname")
                    given_names = author_elem.find(".//given-names")
                    author = {}
                    if surname is not None:
                        author["surname"] = surname.text
                    if given_names is not None:
                        author["given_names"] = given_names.text
                    if author:
                        article_data["authors"].append(author)
                # Extract full text (combining all paragraphs)
                body = article.find(".//body")
                if body is not None:
                    paragraphs = body.findall(".//p")
                    article_data["full_text"] = " ".join("".join(p.itertext()).strip() for p in paragraphs)
                f.write(json.dumps(article_data) + "\n")
                saved += 1
            print(f"Saved batch {i//batch_size + 1}")
    print(f"Downloaded {saved} articles to {out_file}, {skipped} articles removed by license filter")
Now you can call the function with your query to create your corpus (if needed, install the dependencies first with pip install python-dotenv requests). For example:
query = 'bacterial pneumonia treatment'
max_results = 500
batch_size = 50
download_pmc_articles(query, max_results, batch_size)
And that’s it! All your articles are now saved in a JSONL file, ready to be processed for RAG.
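As a quick sanity check, you can reload the file and inspect a record; a minimal sketch, assuming the default pmc_articles.jsonl output path:

import json

# Reload the corpus and print a quick summary of the first article.
with open("pmc_articles.jsonl") as f:
    articles = [json.loads(line) for line in f]

print(f"{len(articles)} articles in the corpus")
print(articles[0]["pmcid"], articles[0]["publication_date"])
print(articles[0]["title"])
print(articles[0]["abstract"][:300])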
What’s Next: Preparing the Data for RAG
In Part 2, we’ll take the domain-specific corpus you just built and use it to power a Retrieval-Augmented Generation (RAG) system — grounding your LLM in real evidence to reduce hallucinations and improve trust.