Lexical Bias in Clinical NLP Pipelines

Author(s): Rostislav Markov

Originally published on Towards AI.

Image by the author.

A hospitalization-prediction model scored 97% accuracy. Yet, simply rephrasing “inpatient” as “hospital admission” caused its confidence to collapse. Although the underlying clinical scenario didn’t change, the output did. That’s lexical bias: when a model reacts to specific tokens or words instead of understanding the underlying concept.

Introduction

Clinical NLP systems — fine-tuned transformers like DistilBERT or prompt-based pipelines built on GPT — process patient narratives for tasks ranging from diagnostics to outcome prediction. High test-set accuracy can mask a deeper fragility: an over-reliance on token-level shortcuts that fail to generalize across phrasing and context, both during training and at inference.

In this post, I’ll show:

  1. How small phrasing tweaks derail predictions
  2. Why both training and inference pipelines are vulnerable
  3. A concise robustness checklist to catch and fix these shortcuts

Controlled setup

Using the Synthea COVID-19 100K dataset, I generated templated narratives from demographics, comorbidities, encounter history, and medications:

Patient 7f697ae3-cba0-4801-92db-18c1169817d0 is approximately 50 years old, f. Diagnosed with COVID-19 on 2020-03-02. Comorbidities include: Childhood asthma, Acute bronchitis (disorder), Miscarriage in first trimester. Encounter types: wellness (55), ambulatory (22), outpatient (2), emergency (1), urgentcare (1). Medications prescribed: 120 ACTUAT Fluticasone propionate 0.044 MG/ACTUAT Metered Dose Inhaler, NDA020503 200 ACTUAT Albuterol 0.09 MG/ACTUAT Metered Dose Inhaler, diphenhydrAMINE Hydrochloride 25 MG Oral Tablet.
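As a minimal sketch, such a narrative can be assembled from Synthea's tabular exports roughly as follows. The column names (AGE and COVID_DATE in particular) are illustrative, derived fields rather than raw Synthea columns, so adapt them to your own preprocessing.

```python
# Rough sketch of the narrative templating step. Column names such as
# "AGE" and "COVID_DATE" are assumed/derived fields, not raw Synthea columns.
import pandas as pd

def build_narrative(patient: pd.Series, conditions: pd.DataFrame,
                    encounters: pd.DataFrame, medications: pd.DataFrame) -> str:
    """Assemble one templated patient narrative from tabular Synthea records."""
    counts = encounters["ENCOUNTERCLASS"].value_counts()
    encounter_str = ", ".join(f"{cls} ({n})" for cls, n in counts.items())
    return (
        f"Patient {patient['Id']} is approximately {patient['AGE']} years old, "
        f"{patient['GENDER']}. Diagnosed with COVID-19 on {patient['COVID_DATE']}. "
        f"Comorbidities include: {', '.join(conditions['DESCRIPTION'])}. "
        f"Encounter types: {encounter_str}. "
        f"Medications prescribed: {', '.join(medications['DESCRIPTION'])}."
    )
```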

Models predicted patient hospitalization within 14 days of a COVID-19 diagnosis. I compared several training and inference pipelines:

  • SBERT + Logistic Regression: ~94% accuracy
  • Fine-tuned DistilBERT: ~97% accuracy
  • GPT-4o (zero-shot: ~70%, few-shot: ~65%)
  • o3-mini (zero-shot: ~78%, few-shot: ~84%)
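For orientation, the SBERT + Logistic Regression baseline can be set up along these lines. This is a sketch: the SBERT checkpoint and the train/test split are illustrative choices, not the exact experimental setup.

```python
# SBERT + Logistic Regression baseline sketch (the SBERT checkpoint and the
# train/test split are illustrative choices, not the exact experimental setup).
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def train_baseline(narratives: list[str], labels: list[int]):
    encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed checkpoint
    X = encoder.encode(narratives)                     # one embedding per narrative
    X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.2, random_state=42)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    print("test accuracy:", accuracy_score(y_te, clf.predict(X_te)))
    return encoder, clf
```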

Predictions looked promising until I introduced simple paraphrase tests.

Same meaning, different results

DistilBERT showed high variance in prediction confidence for semantically equivalent phrases:

  • “inpatient (1)”: 99% (Yes)
  • “one inpatient stay”: 96% (Yes)
  • “admitted once to the hospital”: 72% (Yes)
  • “hospital admission: 1 time”: 7% (No)

Only when the exact word “inpatient” (and its tokens ['in', '##patient']) was present did the model confidently predict hospitalization. This is a lexical shortcut: treating the presence of a specific token as the semantic signal.
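A probe like this is easy to script. The sketch below assumes a locally saved fine-tuned DistilBERT checkpoint (the path is a placeholder) and that label index 1 means “hospitalized”:

```python
# Minimal-pair confidence probe against the fine-tuned DistilBERT classifier.
# MODEL_DIR is a placeholder path; label index 1 is assumed to mean "hospitalized".
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_DIR = "path/to/finetuned-distilbert"
tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_DIR).eval()

# Same clinical content, different surface phrasing.
template = "Encounter types: wellness (55), ambulatory (22), {}."
variants = [
    "inpatient (1)",
    "one inpatient stay",
    "admitted once to the hospital",
    "hospital admission: 1 time",
]

for phrase in variants:
    inputs = tokenizer(template.format(phrase), return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = torch.softmax(model(**inputs).logits, dim=-1)
    print(f"{phrase!r}: P(hospitalized) = {probs[0, 1].item():.2f}")
```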

Spurious correlations as lexical shortcuts

Embedding-based systems learn directly from token co-occurrence. To explore this, I used LIME (Local Interpretable Model-Agnostic Explanations) with the DistilBERT model fine-tuned on my training dataset. LIME perturbs input tokens and fits a local surrogate model to approximate the classifier’s behavior around a given input. For example, LIME found:

  • “inpatient” contributed +0.33 to the hospitalization prediction
  • The non-causal “molars” received nearly as much (+0.32)
  • The clinically significant “hypertension” was down-weighted (+0.02)

In my training dataset, unrelated conditions like “molars” and “appendectomy,” or medications like “amlodipine,” co-occurred in the narratives labeled as hospitalized. Such co-occurrences are incidental, but models can mistake co-occurrence for a causal signal.
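A minimal LIME check looks roughly like this. It reuses the tokenizer and model from the previous snippet, and the narrative string is a placeholder for one test-set example:

```python
# LIME check, reusing `tokenizer` and `model` from the previous snippet.
import torch
from lime.lime_text import LimeTextExplainer

def predict_proba(texts):
    """Batch scoring function in the shape LIME expects: list[str] -> (n, 2) array."""
    inputs = tokenizer(list(texts), return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        return torch.softmax(model(**inputs).logits, dim=-1).numpy()

narrative = "Patient X ... Encounter types: wellness (55), inpatient (1) ..."  # placeholder

explainer = LimeTextExplainer(class_names=["not hospitalized", "hospitalized"])
exp = explainer.explain_instance(narrative, predict_proba, num_features=10, labels=(1,))
print(exp.as_list(label=1))  # list of (word, weight) pairs for the positive class
```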

Next, I tested whether large language models — despite their vast pretraining — are also prone to lexical bias at inference.

Lexical anchoring in prompting

Class-bias swings

In zero-shot mode, GPT-4o consistently over-predicted hospitalization (positive recall 0.79, precision 0.64). Adjusting the temperature (0 to 0.7) barely changed this positive class bias.

Trying to correct it with few-shot prompts backfired at first: adding examples skewed towards negative cases made model responses overly conservative, missing most hospitalization cases (positive recall < 0.10).

This heavy swing in model responses underscores how brittle few-shot prompting can be in nuanced, domain-specific tasks. An unbalanced few-shot example set can introduce as many new errors as it fixes.

Token-level swings

Having identified class-bias swings in GPT-4o, I turned to o3-mini, which delivered more balanced responses (0.67/0.71 precision/recall on positives). Next, I used it to test whether model responses are sensitive to minimal-pair paraphrases. For instance, replacing “inpatient” with “admitted once to the hospital” flipped o3-mini’s response (here in zero-shot mode):

  1. “Encounter types: … inpatient (1).” ≈ Yes
  2. “Encounter types: … one inpatient stay.” ≈ Yes
  3. “Encounter types: … admitted once to the hospital.” ≈ No

Even in the model that best balanced class predictions, I observed signs of lexical bias.

These findings suggest that every prompting strategy, especially in agentic workflows, should be validated with minimal-pair probes and class-calibration checks. Few-shot prompting can mitigate bias, but only if examples are carefully balanced, lexically varied, and cover edge cases. Striking that balance is no easy feat.
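A minimal-pair probe for a prompting pipeline can be scripted in a few lines. The sketch below follows the article’s setup, but the prompt wording is illustrative rather than the exact prompt used in these experiments:

```python
# Minimal-pair probe for a prompting pipeline (prompt wording is illustrative).
from openai import OpenAI

client = OpenAI()

SYSTEM = ("You are reviewing a synthetic patient narrative. Answer only Yes or No: "
          "will this patient be hospitalized within 14 days of the COVID-19 diagnosis?")

def ask(narrative: str, model: str = "o3-mini") -> str:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "system", "content": SYSTEM},
                  {"role": "user", "content": narrative}],
    )
    return resp.choices[0].message.content.strip()

minimal_pairs = [
    "Encounter types: wellness (55), inpatient (1).",
    "Encounter types: wellness (55), one inpatient stay.",
    "Encounter types: wellness (55), admitted once to the hospital.",
]
for text in minimal_pairs:
    print(text, "->", ask(text))
```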

To test if surfacing the model’s reasoning steps could mitigate lexical bias, I used Chain-of-Thought (CoT) prompting, asking the model to show its work step by step before giving a final answer.

Chain-of-Thought helps

CoT prompting improved robustness by guiding the model through multi-step reasoning. For the paraphrased narrative “admitted once to the hospital,” CoT corrected o3-mini’s zero-shot “No” prediction. Here is a typical CoT output on that paraphrase (in zero-shot mode):

“ 1. The patient is 50 years old and male, which are demographic factors associated with a higher risk of severe COVID-19.
2. He has hypertension, a recognized comorbidity that elevates the risk of complications from COVID-19.
3. His encounter history indicates that he was admitted to the hospital once, demonstrating that he has required inpatient care before.
4. With these factors (age, sex, comorbidity, and previous hospitalization), the overall risk assessment suggests a higher likelihood of needing hospital care.

Yes”

By traversing multiple inference paths, CoT seems to dilute the impact of single-token shortcuts, reducing the likelihood of lexical bias. While useful as a proxy, CoT does not reveal the actual inner workings of the model’s hidden layers or entirely eliminate token-level shortcuts. For example, when I supplied the label along with the prompt, the model simply rationalized it and adjusted its CoT steps to justify the label. In other words, CoT is a window into what the model says it’s thinking.
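A CoT variant of the earlier probe needs only a different instruction. The wording below is illustrative, not the exact prompt used here:

```python
# Chain-of-Thought variant of the same probe (instruction wording is illustrative).
COT_SYSTEM = (
    "Reason step by step about the patient's age, sex, comorbidities, and encounter "
    "history. Number your reasoning steps, then give a final answer of Yes or No "
    "on the last line."
)

def ask_cot(narrative: str, model: str = "o3-mini") -> str:
    resp = client.chat.completions.create(   # reuses `client` from the previous snippet
        model=model,
        messages=[{"role": "system", "content": COT_SYSTEM},
                  {"role": "user", "content": narrative}],
    )
    return resp.choices[0].message.content
```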

Practical robustness checklist

Lexical bias is real but manageable — both during training and at inference. Here are techniques I recommend:

1. Paraphrase augmentation

Teach the model true semantics by diversifying your training/few-shot examples with varied phrasings of each concept — both positive and negative. In our case:

  • Positive examples: “hospital admission,” “ward stay,” “discharged then readmitted same day”
  • Negative controls: “inpatient for a prior fracture,” “no inpatient encounters”
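A simple augmentation pass might look like this; the variant list is illustrative and should be reviewed for clinical equivalence before use:

```python
# Paraphrase augmentation sketch: expand training narratives with phrasing
# variants of the hospitalization concept (variant lists are illustrative).
import random

POSITIVE_VARIANTS = [
    "one inpatient stay",
    "admitted once to the hospital",
    "hospital admission: 1 time",
    "ward stay: 1",
]

def augment(narrative: str, n_variants: int = 2) -> list[str]:
    """Return the original narrative plus paraphrased copies of 'inpatient (1)'."""
    out = [narrative]
    if "inpatient (1)" in narrative:
        for variant in random.sample(POSITIVE_VARIANTS, n_variants):
            out.append(narrative.replace("inpatient (1)", variant))
    return out
```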

2. Concept normalization

Preprocess data by replacing phrases like “inpatient” or “hospitalized” with normalized tags like <HOSP_ADMIT> to reduce lexical variance. Caution is warranted to avoid collapsing clinically distinct events. While structured vocabularies like SNOMED CT offer more precise and context-aware concepts, robustness still depends on data diversity.
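A lightweight normalization pass could look like this; the regex patterns are illustrative, and a production pipeline would map to a controlled vocabulary instead:

```python
# Concept-normalization sketch: collapse common hospitalization phrasings to a
# single tag before training/inference (patterns are illustrative; a production
# pipeline would map to SNOMED CT or another controlled vocabulary).
import re

HOSP_PATTERN = re.compile(
    r"\b(inpatient(?: stay)?|hospitali[sz](?:ed|ation)"
    r"|admitted (?:once )?to the hospital|hospital admission)\b",
    flags=re.IGNORECASE,
)

def normalize(narrative: str) -> str:
    return HOSP_PATTERN.sub("<HOSP_ADMIT>", narrative)

# normalize("admitted once to the hospital") -> "<HOSP_ADMIT>"
```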

3. Explainability checks

Independently of your data efforts, test for actual robustness. Use explainability checks to inspect which words are driving predictions. If irrelevant terms appear as top drivers, retrain with better examples or adjust preprocessing. Use minimal-pair or leave-one-out prompt tests to localize which words drive model answers.
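A leave-one-out probe can reuse the same scoring function as the LIME check (predict_proba from the earlier snippet, or any function with the same signature):

```python
# Leave-one-out probe: drop one word at a time and see how much the
# positive-class probability moves (reuses predict_proba from the LIME snippet).
def leave_one_out(narrative: str, top_k: int = 5):
    base = predict_proba([narrative])[0, 1]
    words = narrative.split()
    deltas = []
    for i, word in enumerate(words):
        reduced = " ".join(words[:i] + words[i + 1:])
        deltas.append((word, float(base - predict_proba([reduced])[0, 1])))
    # Largest absolute drops = words the prediction leans on most.
    return sorted(deltas, key=lambda d: abs(d[1]), reverse=True)[:top_k]
```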

4. Confidence alarms

Monitor model confidence or log-probability changes in production. Prompting pipelines offer a self-reported “confidence” signal as a lightweight robustness check. A sudden drop should trigger human review or fallback logic.
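For a classifier pipeline, a minimal confidence alarm might look like this; the thresholds are illustrative and should be calibrated on validation data:

```python
# Confidence-alarm sketch: flag low-confidence predictions for human review
# (thresholds are illustrative and should be calibrated on validation data).
LOW, HIGH = 0.35, 0.65  # "uncertain" band

def route(narrative: str) -> str:
    p = float(predict_proba([narrative])[0, 1])  # reuses predict_proba from above
    if LOW < p < HIGH:
        return f"REVIEW (p={p:.2f}): route to human reviewer or fallback logic"
    label = "hospitalized" if p >= HIGH else "not hospitalized"
    return f"AUTO (p={p:.2f}): {label}"
```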

5. Chain-of-thought prompting

‘Force’ multi-step reasoning to dilute single-token effects, then re-validate with the probes above to ensure CoT is not just post-hoc justification. This can improve stability and model confidence, and it makes model behavior easier to audit.

Conclusion

Lexical bias can derail predictions in clinical NLP pipelines. A robustness strategy with paraphrase probes, perturbation tests, and confidence alarms helps surface and mitigate shortcuts in model responses.

Don’t just optimize for model accuracy. Test for understanding.

Disclaimer

This work uses synthetic data and a simplified hospitalization prediction task to isolate lexical bias. The setup is not intended to simulate real-world clinical practice, nor is it a substitute for full clinical validation. Rather, it is used to highlight a class of generalization failures in NLP pipelines — in particular when phrasing, rather than meaning, drives model decisions.

While the controlled Synthea setup isolates lexical bias, real clinical notes contain idiosyncratic abbreviations, typos, and variable structure. Future work should validate these findings on de-identified electronic health record data (e.g., MIMIC-III) to confirm generalization.

Prompting pipelines deliver inference-time decisions driven by how strongly certain prompt tokens pull on the model’s learned word associations. In such pipelines, hospitalization is not a supervised label learned from our dataset; instead, the model generates a label proxy based on next-token likelihood.


Published via Towards AI



Note: Content contains the views of the contributing authors and not Towards AI.