Lexical Bias in Clinical NLP Pipelines

Author(s): Rostislav Markov

Originally published on Towards AI.

Image by the author.

A hospitalization-prediction model scored 97% accuracy. Yet, simply rephrasing “inpatient” as “hospital admission” caused its confidence to collapse. Although the underlying clinical scenario didn’t change, the output did. That’s lexical bias: when a model reacts to specific tokens or words instead of understanding the underlying concept.

Introduction

Clinical NLP systems — fine-tuned transformers like DistilBERT or prompt-based pipelines built on GPT — process patient narratives for tasks ranging from diagnostics to outcome prediction. High test-set accuracy can mask a deeper fragility: an over-reliance on token-level shortcuts that fail to generalize across phrasing and context, both during training and at inference.

In this post, I’ll show:

  1. How small phrasing tweaks derail predictions
  2. Why both training and inference pipelines are vulnerable
  3. A concise robustness checklist to catch and fix these shortcuts

Controlled setup

Using the Synthea COVID-19 100K dataset, I generated templated narratives from demographics, comorbidities, encounter history, and medications:

Patient 7f697ae3-cba0-4801-92db-18c1169817d0 is approximately 50 years old, f. Diagnosed with COVID-19 on 2020-03-02. Comorbidities include: Childhood asthma, Acute bronchitis (disorder), Miscarriage in first trimester. Encounter types: wellness (55), ambulatory (22), outpatient (2), emergency (1), urgentcare (1). Medications prescribed: 120 ACTUAT Fluticasone propionate 0.044 MG/ACTUAT Metered Dose Inhaler, NDA020503 200 ACTUAT Albuterol 0.09 MG/ACTUAT Metered Dose Inhaler, diphenhydrAMINE Hydrochloride 25 MG Oral Tablet.
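As a minimal sketch, such a narrative can be assembled from Synthea's tabular exports roughly as follows. The column names (AGE and COVID_DATE in particular) are illustrative, derived fields rather than raw Synthea columns, so adapt them to your own preprocessing.

```python
# Rough sketch of the narrative templating step. Column names such as
# "AGE" and "COVID_DATE" are assumed/derived fields, not raw Synthea columns.
import pandas as pd

def build_narrative(patient: pd.Series, conditions: pd.DataFrame,
                    encounters: pd.DataFrame, medications: pd.DataFrame) -> str:
    """Assemble one templated patient narrative from tabular Synthea records."""
    counts = encounters["ENCOUNTERCLASS"].value_counts()
    encounter_str = ", ".join(f"{cls} ({n})" for cls, n in counts.items())
    return (
        f"Patient {patient['Id']} is approximately {patient['AGE']} years old, "
        f"{patient['GENDER']}. Diagnosed with COVID-19 on {patient['COVID_DATE']}. "
        f"Comorbidities include: {', '.join(conditions['DESCRIPTION'])}. "
        f"Encounter types: {encounter_str}. "
        f"Medications prescribed: {', '.join(medications['DESCRIPTION'])}."
    )
```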

Models predicted patient hospitalization within 14 days of a COVID-19 diagnosis. I compared several training and inference pipelines:

  • SBERT + Logistic Regression: ~94% accuracy
  • Fine-tuned DistilBERT: ~97% accuracy
  • GPT-4o (zero-shot: ~70%, few-shot: ~65%)
  • o3-mini (zero-shot: ~78%, few-shot: ~84%)
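For orientation, the SBERT + Logistic Regression baseline can be set up along these lines. This is a sketch: the SBERT checkpoint and the train/test split are illustrative choices, not the exact experimental setup.

```python
# SBERT + Logistic Regression baseline sketch (the SBERT checkpoint and the
# train/test split are illustrative choices, not the exact experimental setup).
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def train_baseline(narratives: list[str], labels: list[int]):
    encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed checkpoint
    X = encoder.encode(narratives)                     # one embedding per narrative
    X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.2, random_state=42)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    print("test accuracy:", accuracy_score(y_te, clf.predict(X_te)))
    return encoder, clf
```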

Predictions looked promising until I introduced simple paraphrase tests.

Same meaning, different results

DistilBERT showed high variance in prediction confidence for semantically equivalent phrases:

  • “inpatient (1)”: 99% (Yes)
  • “one inpatient stay”: 96% (Yes)
  • “admitted once to the hospital”: 72% (Yes)
  • “hospital admission: 1 time”: 7% (No)

Only when the exact word “inpatient” (and its tokens ['in', '##patient']) was present did the model confidently predict hospitalization. This is a lexical shortcut: treating the presence of a specific token as the semantic signal.
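A probe like this is easy to script. The sketch below assumes a locally saved fine-tuned DistilBERT checkpoint (the path is a placeholder) and that label index 1 means “hospitalized”:

```python
# Minimal-pair confidence probe against the fine-tuned DistilBERT classifier.
# MODEL_DIR is a placeholder path; label index 1 is assumed to mean "hospitalized".
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_DIR = "path/to/finetuned-distilbert"
tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_DIR).eval()

# Same clinical content, different surface phrasing.
template = "Encounter types: wellness (55), ambulatory (22), {}."
variants = [
    "inpatient (1)",
    "one inpatient stay",
    "admitted once to the hospital",
    "hospital admission: 1 time",
]

for phrase in variants:
    inputs = tokenizer(template.format(phrase), return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = torch.softmax(model(**inputs).logits, dim=-1)
    print(f"{phrase!r}: P(hospitalized) = {probs[0, 1].item():.2f}")
```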

Spurious correlations as lexical shortcuts

Embedding-based systems learn directly from token co-occurrence. To explore this, I used LIME (Local Interpretable Model-Agnostic Explanations) with the DistilBERT model fine-tuned on my training dataset. LIME perturbs input tokens and fits a local surrogate model to approximate the classifier’s behavior around a given input. For example, LIME found:

  • “inpatient” contributed +0.33 to the hospitalization prediction
  • The non-causal “molars” received nearly as much (+0.32)
  • The clinically significant “hypertension” was down-weighted (+0.02)

In my training dataset, unrelated conditions like “molars” and “appendectomy,” or medications like “amlodipine,” co-occurred in the narratives labeled as hospitalized. Such co-occurrences are incidental, but models can mistake co-occurrence for a causal signal.
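A minimal LIME check looks roughly like this. It reuses the tokenizer and model from the previous snippet, and the narrative string is a placeholder for one test-set example:

```python
# LIME check, reusing `tokenizer` and `model` from the previous snippet.
import torch
from lime.lime_text import LimeTextExplainer

def predict_proba(texts):
    """Batch scoring function in the shape LIME expects: list[str] -> (n, 2) array."""
    inputs = tokenizer(list(texts), return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        return torch.softmax(model(**inputs).logits, dim=-1).numpy()

narrative = "Patient X ... Encounter types: wellness (55), inpatient (1) ..."  # placeholder

explainer = LimeTextExplainer(class_names=["not hospitalized", "hospitalized"])
exp = explainer.explain_instance(narrative, predict_proba, num_features=10, labels=(1,))
print(exp.as_list(label=1))  # list of (word, weight) pairs for the positive class
```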

Next, I tested whether large language models — despite their vast pretraining — are also prone to lexical bias at inference.

Lexical anchoring in prompting

Class-bias swings

In zero-shot mode, GPT-4o consistently over-predicted hospitalization (positive recall 0.79, precision 0.64). Adjusting the temperature (0 to 0.7) barely changed this positive class bias.

Trying to correct it with few-shot prompts backfired at first: adding examples skewed towards negative cases made model responses overly conservative, missing most hospitalization cases (positive recall < 0.10).

This heavy swing in model responses underscores how brittle few-shot prompting can be in nuanced, domain-specific tasks. An unbalanced few-shot example set can introduce as many new errors as it fixes.

Token-level swings

Having identified class-bias swings in GPT-4o, I turned to o3-mini, which delivered more balanced responses (0.67/0.71 precision/recall on positives). Next, I used it to test whether model responses are sensitive to minimal-pair paraphrases. For instance, replacing “inpatient” with “admitted once to the hospital” flipped o3-mini’s response (here in zero-shot mode):

  1. “Encounter types: … inpatient (1).” ≈ Yes
  2. “Encounter types: … one inpatient stay.” ≈ Yes
  3. “Encounter types: … admitted once to the hospital.” ≈ No

Even in the model that best balanced class predictions, I observed signs of lexical bias.

These findings suggest that every prompting strategy, especially in agentic workflows, should be validated with minimal-pair probes and class-calibration checks. Few-shot prompting can mitigate bias, but only if examples are carefully balanced, lexically varied, and cover edge cases. Striking that balance is no easy feat.
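A minimal-pair probe for a prompting pipeline can be scripted in a few lines. The sketch below follows the article’s setup, but the prompt wording is illustrative rather than the exact prompt used in these experiments:

```python
# Minimal-pair probe for a prompting pipeline (prompt wording is illustrative).
from openai import OpenAI

client = OpenAI()

SYSTEM = ("You are reviewing a synthetic patient narrative. Answer only Yes or No: "
          "will this patient be hospitalized within 14 days of the COVID-19 diagnosis?")

def ask(narrative: str, model: str = "o3-mini") -> str:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "system", "content": SYSTEM},
                  {"role": "user", "content": narrative}],
    )
    return resp.choices[0].message.content.strip()

minimal_pairs = [
    "Encounter types: wellness (55), inpatient (1).",
    "Encounter types: wellness (55), one inpatient stay.",
    "Encounter types: wellness (55), admitted once to the hospital.",
]
for text in minimal_pairs:
    print(text, "->", ask(text))
```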

To test if surfacing the model’s reasoning steps could mitigate lexical bias, I used Chain-of-Thought (CoT) prompting, asking the model to show its work step by step before giving a final answer.

Chain-of-Thought helps

CoT prompting improved robustness by guiding the model through multi-step reasoning. For the paraphrased narrative “admitted once to the hospital,” CoT corrected o3-mini’s zero-shot “No” prediction. Here is a typical CoT output on that paraphrase (in zero-shot mode):

“ 1. The patient is 50 years old and male, which are demographic factors associated with a higher risk of severe COVID-19.
2. He has hypertension, a recognized comorbidity that elevates the risk of complications from COVID-19.
3. His encounter history indicates that he was admitted to the hospital once, demonstrating that he has required inpatient care before.
4. With these factors (age, sex, comorbidity, and previous hospitalization), the overall risk assessment suggests a higher likelihood of needing hospital care.

Yes”

By traversing multiple inference paths, CoT seems to dilute the impact of single-token shortcuts, reducing the likelihood of lexical bias. While useful as a proxy, CoT does not reveal the actual inner workings of the model’s hidden layers or entirely eliminate token-level shortcuts. For example, when I supplied the label along with the prompt, the model simply rationalized it and adjusted its CoT steps to justify the label. In other words, CoT is a window into what the model says it’s thinking.
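A CoT variant of the earlier probe needs only a different instruction. The wording below is illustrative, not the exact prompt used here:

```python
# Chain-of-Thought variant of the same probe (instruction wording is illustrative).
COT_SYSTEM = (
    "Reason step by step about the patient's age, sex, comorbidities, and encounter "
    "history. Number your reasoning steps, then give a final answer of Yes or No "
    "on the last line."
)

def ask_cot(narrative: str, model: str = "o3-mini") -> str:
    resp = client.chat.completions.create(   # reuses `client` from the previous snippet
        model=model,
        messages=[{"role": "system", "content": COT_SYSTEM},
                  {"role": "user", "content": narrative}],
    )
    return resp.choices[0].message.content
```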

Practical robustness checklist

Lexical bias is real but manageable — both during training and at inference. Here are techniques I recommend:

1. Paraphrase augmentation

Teach the model true semantics by diversifying your training/few-shot examples with varied phrasings of each concept — both positive and negative. In our case:

  • Positive examples: “hospital admission,” “ward stay,” “discharged then readmitted same day”
  • Negative controls: “inpatient for a prior fracture,” “no inpatient encounters”
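A simple augmentation pass might look like this; the variant list is illustrative and should be reviewed for clinical equivalence before use:

```python
# Paraphrase augmentation sketch: expand training narratives with phrasing
# variants of the hospitalization concept (variant lists are illustrative).
import random

POSITIVE_VARIANTS = [
    "one inpatient stay",
    "admitted once to the hospital",
    "hospital admission: 1 time",
    "ward stay: 1",
]

def augment(narrative: str, n_variants: int = 2) -> list[str]:
    """Return the original narrative plus paraphrased copies of 'inpatient (1)'."""
    out = [narrative]
    if "inpatient (1)" in narrative:
        for variant in random.sample(POSITIVE_VARIANTS, n_variants):
            out.append(narrative.replace("inpatient (1)", variant))
    return out
```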

2. Concept normalization

Preprocess data by replacing phrases like “inpatient” or “hospitalized” with normalized tags like <HOSP_ADMIT> to reduce lexical variance. Caution is warranted to avoid collapsing clinically distinct events. While structured vocabularies like SNOMED CT offer more precise and context-aware concepts, robustness still depends on data diversity.
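A lightweight normalization pass could look like this; the regex patterns are illustrative, and a production pipeline would map to a controlled vocabulary instead:

```python
# Concept-normalization sketch: collapse common hospitalization phrasings to a
# single tag before training/inference (patterns are illustrative; a production
# pipeline would map to SNOMED CT or another controlled vocabulary).
import re

HOSP_PATTERN = re.compile(
    r"\b(inpatient(?: stay)?|hospitali[sz](?:ed|ation)"
    r"|admitted (?:once )?to the hospital|hospital admission)\b",
    flags=re.IGNORECASE,
)

def normalize(narrative: str) -> str:
    return HOSP_PATTERN.sub("<HOSP_ADMIT>", narrative)

# normalize("admitted once to the hospital") -> "<HOSP_ADMIT>"
```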

3. Explainability checks

Independently of your data efforts, test for actual robustness. Use explainability checks to inspect which words are driving predictions. If irrelevant terms appear as top drivers, retrain with better examples or adjust preprocessing. Use minimal-pair or leave-one-out prompt tests to localize which words drive model answers.
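A leave-one-out probe can reuse the same scoring function as the LIME check (predict_proba from the earlier snippet, or any function with the same signature):

```python
# Leave-one-out probe: drop one word at a time and see how much the
# positive-class probability moves (reuses predict_proba from the LIME snippet).
def leave_one_out(narrative: str, top_k: int = 5):
    base = predict_proba([narrative])[0, 1]
    words = narrative.split()
    deltas = []
    for i, word in enumerate(words):
        reduced = " ".join(words[:i] + words[i + 1:])
        deltas.append((word, float(base - predict_proba([reduced])[0, 1])))
    # Largest absolute drops = words the prediction leans on most.
    return sorted(deltas, key=lambda d: abs(d[1]), reverse=True)[:top_k]
```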

4. Confidence alarms

Monitor model confidence or log-probability changes in production. Prompting pipelines offer a self-reported “confidence” signal as a lightweight robustness check. A sudden drop should trigger human review or fallback logic.
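For a classifier pipeline, a minimal confidence alarm might look like this; the thresholds are illustrative and should be calibrated on validation data:

```python
# Confidence-alarm sketch: flag low-confidence predictions for human review
# (thresholds are illustrative and should be calibrated on validation data).
LOW, HIGH = 0.35, 0.65  # "uncertain" band

def route(narrative: str) -> str:
    p = float(predict_proba([narrative])[0, 1])  # reuses predict_proba from above
    if LOW < p < HIGH:
        return f"REVIEW (p={p:.2f}): route to human reviewer or fallback logic"
    label = "hospitalized" if p >= HIGH else "not hospitalized"
    return f"AUTO (p={p:.2f}): {label}"
```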

5. Chain-of-thought prompting

‘Force’ multi-step reasoning to dilute single-token effects, then re-validate with the probes above to ensure CoT is not just post-hoc justification. This can improve stability and model confidence, and it makes model behavior easier to audit.

Conclusion

Lexical bias can derail predictions in clinical NLP pipelines. A robustness strategy with paraphrase probes, perturbation tests, and confidence alarms helps surface and mitigate shortcuts in model responses.

Don’t just optimize for model accuracy. Test for understanding.

Disclaimer

This work uses synthetic data and a simplified hospitalization prediction task to isolate lexical bias. The setup is not intended to simulate real-world clinical practice, nor is it a substitute for full clinical validation. Rather, it is used to highlight a class of generalization failures in NLP pipelines — in particular when phrasing, rather than meaning, drives model decisions.

While the controlled Synthea setup isolates lexical bias, real clinical notes contain idiosyncratic abbreviations, typos, and variable structure. Future work should validate these findings on de-identified electronic health record data (e.g., MIMIC-III) to confirm generalization.

Prompting pipelines deliver inference-time decisions driven by how strongly certain prompt tokens pull on the model’s learned word associations. In such pipelines, hospitalization is not a supervised label learned from our dataset; instead, the model generates a label proxy based on next-token likelihood.


Published via Towards AI



Note: Content contains the views of the contributing authors and not Towards AI.