

Training AI to Predict Clinical Trial Outcomes: A 30% Improvement in 3 Hours

Last Updated on February 23, 2026 by Editorial Team

Author(s): 3rdSon

Originally published on Towards AI.


Predicting whether a clinical trial will succeed or fail is notoriously difficult. Even experienced pharmaceutical analysts struggle with accuracy rates much better than a coin flip. But what if we could train an AI model to learn from thousands of past trials and improve its predictions?

I recently built a dataset of 1,366 clinical trial predictions and fine-tuned an 8B parameter language model to predict trial outcomes. This resulted in a jump from 56% accuracy (barely better than guessing) to 73% accuracy, a 30% relative improvement. Here’s how I did it, what I learned, and why this matters for anyone working with prediction tasks.

The Challenge: Making Predictions from Historical Data

The pharmaceutical industry runs on uncertainty. When Eli Lilly announces a Phase 3 trial for a new obesity drug, analysts, investors, and competitors all ask the same question: Will it succeed?

Traditionally, answering this question requires:

  1. Deep domain expertise in pharmacology
  2. Knowledge of the company’s track record
  3. Understanding of regulatory pathways
  4. Access to clinical trial databases
  5. Lots of time to research each case

Even then, human experts achieve only modest accuracy. The question I wanted to answer was: could an AI model learn these patterns automatically from historical data?

Comparison chart — Baseline 56.3% vs Fine-tuned 73.3%

The Data Problem: Labels Are Expensive

The biggest hurdle in building prediction models is getting labeled training data. Hiring medical experts to label thousands of clinical trial outcomes would cost tens of thousands of dollars and take months.

This is where I discovered Lightning Rod’s approach to data generation. Instead of manual labeling, their SDK uses what they call the “Future-as-Label” methodology: the future outcome of a historical event becomes its label.

Here’s how it works:

  1. Find old news: Articles from 2023 about clinical trials starting
  2. Generate questions: “Will Novo Nordisk’s Phase 3 trial meet endpoints by Q4 2024?”
  3. Auto-label outcomes: Search recent news (late 2024/2025) to find what actually happened
  4. Build dataset: Pair questions with verified outcomes

No human labelers are needed: the Lightning Rod Python SDK finds the answers automatically by searching for what happened later.
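The core labeling rule can be sketched in a few lines. The function name and signature here are illustrative, not the SDK's actual API:

```python
from datetime import date

def future_as_label(question_deadline: date, outcome_date, outcome_success):
    """Assign a label to a forward-looking question using what actually
    happened later (the 'Future-as-Label' idea). Returns 1 for success
    by the deadline, 0 for failure or a miss, None if still unresolved."""
    if outcome_date is None or outcome_success is None:
        return None  # outcome not yet observable; drop from the dataset
    if outcome_success and outcome_date <= question_deadline:
        return 1  # endpoint met on time -> YES
    return 0      # failed, or succeeded only after the deadline -> NO

# e.g. a Phase 3 readout that hit its endpoints before the asked-about date
print(future_as_label(date(2024, 12, 31), date(2024, 11, 4), True))  # 1
```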

Building the Dataset: 1,366 Samples in 2 Minutes

Using Lightning Rod’s Python SDK, I generated the dataset with a simple pipeline:

from datetime import datetime

import lightningrod as lr
from lightningrod import (
    QuestionPipeline,
    NewsSeedGenerator,
    ForwardLookingQuestionGenerator,
    WebSearchLabeler,
)

pipeline = QuestionPipeline(
    # Seed from 2023-2024 news about clinical trials starting
    seed_generator=NewsSeedGenerator(
        start_date=datetime(2023, 1, 1),
        end_date=datetime(2024, 12, 31),
        search_query=["clinical trial Phase 3", "FDA approval"],
    ),
    # Turn each article into a binary, forward-looking question
    question_generator=ForwardLookingQuestionGenerator(
        instructions="Generate binary questions about trial outcomes",
        examples=[
            "Will Eli Lilly's obesity drug trial meet endpoints by Q4 2024?",
            "Will the FDA approve Drug X by June 2024?",
        ],
    ),
    # Label each question by searching later news; keep confident labels only
    labeler=WebSearchLabeler(confidence_threshold=0.7),
)

dataset = lr.transforms.run(pipeline, max_questions=2000)

The SDK pulled news articles about clinical trials, generated forward-looking questions, and then searched for later outcomes. In about 10 minutes of compute time, I had 1,882 questions, with 72.6% successfully labeled, giving me 1,366 high-quality training examples.
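The labeled-yield step amounts to a confidence filter over the generated rows. A minimal sketch of that post-processing, where the dict keys are my assumptions about the row format rather than the SDK's actual schema:

```python
# Keep only rows the labeler resolved with confidence >= 0.7,
# mirroring the WebSearchLabeler threshold used in the pipeline above.
rows = [
    {"question": "Will trial A meet endpoints by Q4 2024?", "label": 1, "confidence": 0.98},
    {"question": "Will drug B be approved by June 2024?", "label": 0, "confidence": 0.91},
    {"question": "Will trial C finish enrollment early?", "label": None, "confidence": 0.40},
]

labeled = [r for r in rows if r["label"] is not None and r["confidence"] >= 0.7]
yield_rate = len(labeled) / len(rows)
print(f"{len(labeled)} labeled rows, {yield_rate:.1%} yield")
```

On the real run, the same kind of filter took 1,882 generated questions down to the 1,366 that were successfully labeled (72.6%).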

Screenshot of sample dataset rows (question, answer, confidence, etc.) generated with the Lightning Rod Python SDK

Each example looked like this:

Question: "Will Novo Nordisk's CagriSema Phase 3 trial meet its 
primary endpoints by December 31, 2024?"

Answer: YES (1)
Confidence: 0.98

The labels weren’t guesses; they were verified facts from published trial results and FDA announcements.

The Experiment: Baseline vs Fine-Tuned Model

I split the data into training (85%) and test (15%) sets, then ran two experiments:
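The split itself is simple; a stdlib sketch (the article's actual split code isn't shown, so this is one reasonable way to do it):

```python
import random

# 85/15 random split of the 1,366 labeled examples.
examples = list(range(1366))      # stand-ins for (question, label) pairs
random.seed(42)                   # fix the shuffle for reproducibility
random.shuffle(examples)

cut = int(len(examples) * 0.85)   # 85% train
train, test = examples[:cut], examples[cut:]
print(len(train), len(test))      # 1161 205
```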

1. Baseline: Zero-Shot Prediction

First, I tested Llama-3–8B without any training. I gave it each question and asked it to predict 0 (failure) or 1 (success).

Result: 56.3% Accuracy

The model was essentially guessing, with a slight optimistic bias (it predicted “success” too often). I wasn’t surprised because the base model has no special knowledge of pharmaceutical industry patterns.
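A minimal harness for this kind of zero-shot baseline might look like the sketch below. `query_model` is a stub standing in for an actual Llama-3–8B call, and the prompt wording is my assumption, not the one used in the experiment:

```python
def build_prompt(question: str) -> str:
    return (
        "You are forecasting clinical trial outcomes.\n"
        f"Question: {question}\n"
        "Answer with a single digit: 1 for success, 0 for failure.\nAnswer:"
    )

def parse_prediction(completion: str) -> int:
    # Take the first 0/1 digit in the completion; default to the majority
    # class (success) if the model rambles -- one source of optimistic bias.
    for ch in completion:
        if ch in "01":
            return int(ch)
    return 1

def query_model(prompt: str) -> str:
    # Stub: a real run would call the LLM here (e.g. via transformers).
    return " 1 (the trial will likely succeed)"

pred = parse_prediction(query_model(build_prompt(
    "Will Novo Nordisk's CagriSema Phase 3 trial meet endpoints by Dec 2024?"
)))
print(pred)  # 1
```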

2. Fine-Tuning: Teaching the Patterns

Next, I fine-tuned the model using LoRA (Low-Rank Adaptation) on the training data. LoRA is a parameter-efficient method that adds small adapter layers instead of retraining the entire model.

The setup I used:

  • Model: Llama-3–8B with 4-bit quantization
  • Method: LoRA fine-tuning via the Unsloth library
  • Hardware: Free Google Colab T4 GPU
  • Training time: ~21 minutes (3 epochs)
  • Trainable parameters: Only 16M (0.2% of the model)
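The "0.2% of the model" figure can be sanity-checked with quick arithmetic. The toy model below assumes LoRA rank-16 adapters on four square 4096-dimensional attention projections in each of 32 layers; the real Llama-3–8B differs (grouped-query attention shrinks k/v, and adapters are often added to MLP projections too), but the estimate lands in the same ballpark as the quoted 16M:

```python
# LoRA with rank r adds r * (d_in + d_out) weights per adapted matrix.
rank, hidden, layers, matrices_per_layer = 16, 4096, 32, 4

trainable = layers * matrices_per_layer * rank * (hidden + hidden)
total = 8_000_000_000  # nominal parameter count of an 8B model

print(f"{trainable/1e6:.1f}M trainable ({trainable/total:.2%} of 8B)")
# -> 16.8M trainable (0.21% of 8B)
```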

Result: 73.3% Accuracy

The fine-tuned model correctly answered 151 out of 206 test questions, achieving 73.3% accuracy. This represents a 17-percentage-point improvement over the baseline, a 30% relative performance gain achieved in just 21 minutes of training. Notably, this was done using only 0.2% of the model’s parameters over 3 training epochs, demonstrating highly efficient improvement with minimal compute.
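The headline numbers follow directly from the counts:

```python
# Accuracy from correct-answer counts on the 206-question test set.
baseline = 116 / 206    # zero-shot
finetuned = 151 / 206   # after LoRA fine-tuning

abs_gain = (finetuned - baseline) * 100        # percentage points
rel_gain = (finetuned - baseline) / baseline   # relative improvement

print(f"{baseline:.1%} -> {finetuned:.1%}: "
      f"+{abs_gain:.1f} pts, {rel_gain:.0%} relative")
# -> 56.3% -> 73.3%: +17.0 pts, 30% relative
```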

Confusion Matrix Comparison of the Baseline Model and Fine-Tuned Model

What Did the Model Learn?

The most interesting part wasn’t just the numbers; it was understanding what patterns the model discovered in the data.

Pattern 1: Company Track Records Matter

The model learned that pharmaceutical companies have different success rates. Questions mentioning Eli Lilly, Novo Nordisk, or Merck were more likely to be “YES” (success), while smaller biotech startups showed higher failure rates.

This makes sense because established companies have more resources, experience, and proven track records. The model picked this up automatically from the data.

Pattern 2: Therapeutic Areas Have Different Success Rates

Obesity and diabetes drugs showed ~68% success rates in the training data, while oncology trials succeeded only ~48% of the time. The model learned these differences without being explicitly told.

Cancer is harder to treat. Metabolic diseases have clearer biomarkers. The model internalized these domain patterns solely from examples.

Pattern 3: Timeline Realism

One surprising discovery was that the model learned to spot unrealistic timelines.

Example questions the model corrected:

  • “Will [Small Biotech] complete Phase 3 in 6 months?” → Predicted NO (correctly)
  • “Will [Unproven Drug] get FDA approval in 3 months?” → Predicted NO (correctly)

The baseline model didn’t know that Phase 3 trials typically take 18–24 months. The fine-tuned version learned this pattern from the data.

Pattern 4: Better at Predicting Failures

The baseline model showed an optimistic bias, predicting “success” 63% of the time. The fine-tuned model was better calibrated at 52%, closer to the actual distribution.
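A quick way to check this kind of calibration is to compare each model's rate of "success" predictions against the label base rate. The prediction lists below are synthetic stand-ins sized to reproduce the quoted 63% vs. 52% rates, not the actual model outputs:

```python
def success_rate(preds):
    """Fraction of questions predicted as 'success' (label 1)."""
    return sum(preds) / len(preds)

baseline_preds = [1] * 63 + [0] * 37    # optimistic bias
finetuned_preds = [1] * 52 + [0] * 48   # closer to the true distribution

print(f"baseline: {success_rate(baseline_preds):.0%}, "
      f"fine-tuned: {success_rate(finetuned_preds):.0%}")
# -> baseline: 63%, fine-tuned: 52%
```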

More importantly, it learned to identify red flags like:

  • Aggressive timelines
  • Unproven mechanisms
  • Companies with poor track records
  • Challenging therapeutic areas

All five examples of “most improved” predictions involved the baseline incorrectly predicting success, while the fine-tuned model correctly predicted failure.

Why This Matters Beyond Clinical Trials

While this experiment focused on pharmaceutical trial outcomes, the real contribution goes beyond healthcare. It demonstrates a practical, repeatable workflow for building specialized prediction models from real-world data.


At its core, the approach is straightforward:

  1. Identify a prediction task with historical data
  2. Use temporal structure (what happened later becomes your label)
  3. Generate datasets automatically instead of manual labeling
  4. Fine-tune efficiently with LoRA on free GPUs
  5. Achieve meaningful improvements in hours, not months

This same approach could work for:

  • Product launch predictions: “Will Company X release Product Y by Date Z?”
  • Policy outcomes: “Will Bill ABC pass by Q2 2026?”
  • Market events: “Will Stock X reach price Y by month Z?”
  • Sports forecasting: “Will Team X make the playoffs this season?”

Any domain with historical news or announcements, clear and verifiable outcomes, and sufficient examples can benefit from Lightning Rod’s Future-as-Label methodology.

It’s not limited to clinical trials; it serves as a scalable template for temporal prediction tasks across industries.

Technical Implementation Details

For developers interested in reproducing this:

Dataset Generation:

  • Source: News articles via Lightning Rod SDK
  • Questions: 1,882 generated, 1,366 valid (72.6%)
  • Label confidence: Average 0.998, minimum 0.85
  • Time: ~3 minutes

Model Training:

  • Base model: Llama-3–8B
  • Method: LoRA (rank=16, alpha=16)
  • Quantization: 4-bit for memory efficiency
  • Epochs: 3
  • Batch size: 2 (effective 8 with gradient accumulation)
  • Hardware: Google Colab free tier (T4 GPU)
  • Time: ~21 minutes

Evaluation:

  • Test set: 206 questions (15% holdout)
  • Metric: Binary accuracy
  • Baseline: 56.3% (116/206)
  • Fine-tuned: 73.3% (151/206)
  • Improvement: +17.0 percentage points

The full code and dataset are available on GitHub, and the dataset is published on Hugging Face for anyone who wants to reproduce or build on this work.

Limitations and Future Work

This isn’t a perfect crystal ball. The model still struggles with:

  1. Novel drug mechanisms not seen in training
  2. Rare diseases with limited examples
  3. External factors (regulatory changes, manufacturing issues)
  4. Very recent trials without outcome data yet

The 73% accuracy is a meaningful improvement over guessing, but it’s not prophecy. Think of it as moving from “coin flip” to “informed probability estimate.” For context, even experienced pharmaceutical analysts struggle to achieve accuracy above 65–70% in predicting trial outcomes.

Future improvements could include:

  • Larger datasets (5,000+ examples)
  • Additional features (company financials, prior trial data)
  • Ensemble methods combining multiple models
  • Continuous updating as new outcomes emerge

Key Takeaways

Three lessons from this project:

1. Automated labeling scales: Manually labeling 1,366 examples would have taken weeks and cost thousands. Lightning Rod’s Future-as-Label approach did it in 3 minutes.

2. Small models can specialize: You don’t need GPT-4 or Claude for domain-specific tasks. An 8B model, fine-tuned on focused data, achieved 73% accuracy on a challenging prediction problem.

3. Historical data contains learnable patterns: Company track records, therapeutic area success rates, and timeline realism all emerged naturally from the training data. The model discovered what experts know from experience.

Try It Yourself

The tools I used are all publicly available:

  • Lightning Rod SDK: Open-source Python library for dataset generation
  • Pre-trained Model: Skip training, use the fine-tuned model directly via Hugging Face
  • Unsloth: Free library for efficient LoRA fine-tuning
  • Google Colab: Free GPU access for training
  • Hugging Face: Free dataset and model hosting

The barrier to entry for specialized AI models has never been lower. If you have a prediction task with historical data, you can build a custom model in a weekend.

The future of AI isn’t just giant general-purpose models; it’s also specialized models trained on focused, high-quality datasets for specific domains. This experiment is one example of what becomes possible when you combine automated data generation with efficient fine-tuning.

What prediction task would you build a model for?

The dataset is available on Hugging Face, and all code is on GitHub. Special thanks to Lightning Rod Labs for their SDK that made this project possible.

Resources:

Interested in building prediction datasets? Check out Lightning Rod or explore their examples on Hugging Face.


Published via Towards AI


Note: Article content contains the views of the contributing authors and not Towards AI.