

Cancer Research Needs Better Data

Last Updated on July 25, 2023 by Editorial Team

Author(s): Salvatore Raieli

Originally published on Towards AI.

We have many open questions, and we need data to answer them


Multiple sclerosis is a chronic disease that leads to demyelination of the central nervous system. Although millions of people are affected, its cause remains unknown. This year, a 20-year longitudinal study analyzed data from 10 million young adults in US military service. U.S. military personnel are screened for HIV and rescreened every two years. Researchers analyzed the residual serum from these screenings, more than 62 million samples, to test hypotheses about what triggers the onset of multiple sclerosis. They found that those who had contracted Epstein-Barr virus (EBV) infection were 32 times more likely to develop multiple sclerosis.

Such a database could also have been used to identify the causes of cancer. The problem is that cancer registries are often out of date, many individuals are not included, and the data are often incomplete.

Why do we need it, and what are the problems?

unorganized data in science. image by Sear Greyson at unsplash.com

We still have many unanswered questions:

  • Many cancers do not have a clear etiology; we do not know what genetic and environmental factors predispose to the onset of cancer.
  • Second, are there universal factors necessary for the development of cancer? Some researchers talk about the possibility that there is a set of necessary conditions for it to develop.
  • Also, it is unclear why some patients with the same cancer respond to treatment, and others do not.
  • Another question is how much diet affects the response to therapy: there are many conflicting studies (low-fat versus high-fat, obesity role).
  • Although we know that the leading cause of cancer death is metastasis, how it forms and the mechanisms behind metastasis are still unclear. In addition, we do not know why some cancers prefer to metastasize to some tissues more than others (the most accepted hypothesis is called “seed and soil”).
  • Many of the treatments that look promising in research fail in clinical trials (85% of clinical trials fail, despite years of research).
  • Not to mention that many researchers wonder how to treat a disease that continually evolves.
  • The tumor microenvironment facilitates the tumor; might it be possible to reprogram it against the tumor?
  • Many of the therapies today are specific to a particular type of tumor, but researchers wonder if there is a possible drug for all tumors (e.g., whether all tumors are vulnerable to the same type of pathway).
  • Finally, there are researchers wondering whether our knowledge of cancer is sufficient to find a cure.

breast cancer cells. image by National Cancer Institute at unsplash.com

As a recent article in Nature noted, this is also an issue of great social importance:

Right now, for example, it’s hard to determine how people from minority ethnic groups respond to therapies or which risk factors are unique to their cancers. In the United States, Black men are 50% more likely than white men to develop prostate cancer and are twice as likely to die of it. Without large, diverse data sets, we can’t identify unique, targetable genetic or molecular features or lifestyle factors that underlie increased cancer risk in this group and others.

As I discussed in a previous article, data quality heavily influences the results of machine learning models. Even the most sophisticated artificial intelligence model will not produce good results when trained on a bad dataset (garbage in, garbage out).

A critical analysis of your dataset

Stop finetuning your model: your model is already good, but not your data


The first problem is that cancer is a complex disease, and cancer cells evolve in response to therapy. For this reason, we need quality data that allow us to eliminate potential confounding factors and build reliable models.

There are cancer registries (the first ones were established in the early 1900s), and they generally collect information on demographics, diagnosis, tumor histology, treatment, and outcome. There are both general registries and registries focused on a specific type of cancer, and initiatives at both the hospital level and the national level. Often, however, these registries are not standardized. In addition, more and more information has been collected over the years, but errors and missing entries have also accumulated.
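To make the point about missing entries and non-standardized values concrete, here is a minimal sketch of a registry audit. Everything in it is an assumption made for illustration: the rows, the field names, and the list of codes treated as "missing" are hypothetical and do not come from any real registry.

```python
from collections import Counter

# Hypothetical registry rows; field names and values are illustrative only.
records = [
    {"patient_id": "P1", "diagnosis": "breast", "histology": "ductal",
     "treatment": "surgery", "outcome": "alive"},
    {"patient_id": "P2", "diagnosis": "prostate", "histology": "",
     "treatment": None, "outcome": "alive"},
    {"patient_id": "P3", "diagnosis": "Breast ", "histology": "lobular",
     "treatment": "chemo", "outcome": "unknown"},
]

# Strings treated as "missing" (an assumption; real registries use many codes).
MISSING_CODES = {"", "unknown", "n/a"}


def is_missing(value):
    """Return True if a field is absent or holds a placeholder code."""
    if value is None:
        return True
    return str(value).strip().lower() in MISSING_CODES


def missing_rates(rows):
    """Fraction of rows in which each field is missing."""
    fields = sorted({key for row in rows for key in row})
    return {
        field: sum(is_missing(row.get(field)) for row in rows) / len(rows)
        for field in fields
    }


def normalized_counts(rows, field):
    """Count values of a field after trimming whitespace and lowercasing."""
    return Counter(
        str(row[field]).strip().lower()
        for row in rows
        if not is_missing(row.get(field))
    )


rates = missing_rates(records)
diagnoses = normalized_counts(records, "diagnosis")

for field, rate in rates.items():
    print(f"{field}: {rate:.0%} missing")
print(dict(diagnoses))
```

A report like this does not fix anything by itself, but it shows where a registry is incomplete and where the same concept is spelled in several ways, which is usually the first step before any modeling.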

“Without a systematic way to start and keep data clean, bad data will happen.” — Donato Diorio

The second problem is that consent must be obtained. Collecting data and information requires complex bureaucracy and permission from various entities. Collecting samples and data is expensive and takes laborious organization. Physicians and researchers are not always willing to take on the burden, nor is the funding always there.

In addition, a patient will often see several specialists or go through several hospitals (perhaps for a second opinion or for a therapy available at a particular hospital). Fragmented healthcare is another obstacle to data collection. Even if these data are collected, they must then be aggregated and made homogeneous.
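As a rough illustration of what "aggregated and made homogeneous" can involve, the sketch below maps exports from two hospitals onto one shared schema and then groups them into a per-patient timeline. The hospitals, field names, and date formats are hypothetical, and the sketch assumes the two sites already share a patient identifier, which is often not the case in practice.

```python
from collections import defaultdict
from datetime import date, datetime

# Hypothetical exports: each hospital uses its own field names and date format.
hospital_a = [
    {"id": "P1", "dx_date": "2021-03-04", "tumor": "Breast"},
]
hospital_b = [
    {"patient": "P1", "date_of_visit": "15/06/2021",
     "cancer_type": "breast ", "therapy": "chemo"},
    {"patient": "P2", "date_of_visit": "02/01/2020",
     "cancer_type": "Prostate", "therapy": "surgery"},
]


def from_hospital_a(row):
    """Map a hospital A row (ISO dates) onto the shared schema."""
    return {
        "patient_id": row["id"],
        "date": date.fromisoformat(row["dx_date"]),
        "cancer_type": row["tumor"].strip().lower(),
        "therapy": None,  # hospital A does not record therapy in this sketch
        "source": "A",
    }


def from_hospital_b(row):
    """Map a hospital B row (day/month/year dates) onto the shared schema."""
    return {
        "patient_id": row["patient"],
        "date": datetime.strptime(row["date_of_visit"], "%d/%m/%Y").date(),
        "cancer_type": row["cancer_type"].strip().lower(),
        "therapy": row["therapy"],
        "source": "B",
    }


def build_timelines(events):
    """Group harmonized events by patient, ordered by date."""
    timelines = defaultdict(list)
    for event in sorted(events, key=lambda e: e["date"]):
        timelines[event["patient_id"]].append(event)
    return dict(timelines)


events = [from_hospital_a(r) for r in hospital_a]
events += [from_hospital_b(r) for r in hospital_b]
timelines = build_timelines(events)

for patient_id, history in timelines.items():
    print(patient_id, [(e["date"].isoformat(), e["source"]) for e in history])
```

Keeping one small adapter function per source is a deliberate choice here: each site's quirks stay in one place, and the rest of the pipeline only ever sees the shared schema.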

How do we solve it?

image by Luis Villasmil at unsplash.com

“Data that is loved tends to survive.” — Kurt Bollacker

TCGA (The Cancer Genome Atlas) is an example of a project that sought to characterize thousands of patients. The result is 2.5 petabytes of data covering more than ten thousand patients and 33 tumor types. This dataset has been used in more than a thousand studies (in both bioinformatics and artificial intelligence). In addition, it has been useful in understanding the benefit of new therapies. Other projects, such as the Cancer Dependency Map, aim to bring new insights into new therapies, and others are under study.

The ideal cancer registry would aggregate information from millions of consenting participants; include populations of different ancestry and socio-economic status; collect information going forwards from the time of cancer diagnosis — including imaging, tissue samples and genetic data; and capture participants’ histories by automatically linking to their complete medical records. With these detailed profiles, we could trace cancer diagnoses, health effects and risk of death back to potential risk factors. — source

Multimodal learning is a field that is now seeing rapid growth; more and more artificial intelligence models are capable of learning from heterogeneous data (images, medical notes, genomic data, tabular data, etc.). We therefore need similarly heterogeneous datasets to train such models. Some initiatives are underway in the UK (UK Biobank) and the U.S. (Count Me In), but they are still few.

In addition, on the privacy front, there are new, stricter regulations but also new technologies to anonymize patient data (blockchain, new artificial intelligence models for anonymization). Countries such as England and Denmark have also been looking at how they can centralize their health systems and thus patient data.
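The technologies mentioned above are beyond the scope of a short example, so the sketch below swaps in a much simpler and older technique: pseudonymization with a keyed hash (HMAC-SHA256). It is only meant to show the basic idea of sharing records without direct identifiers. The record, the field names, and the key are hypothetical, and pseudonymization alone does not amount to full anonymization, since the remaining fields may still allow re-identification.

```python
import hashlib
import hmac

# Fields removed outright before sharing (an illustrative choice).
DIRECT_IDENTIFIERS = {"name"}


def pseudonymize(patient_id: str, secret_key: bytes) -> str:
    """Replace an identifier with a keyed hash.

    The same ID and key always give the same pseudonym, so records can still
    be linked, but the original ID cannot be recomputed without the key.
    """
    digest = hmac.new(secret_key, patient_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()


def deidentify(record: dict, secret_key: bytes) -> dict:
    """Drop direct identifiers and pseudonymize the patient ID."""
    shared = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    shared["patient_id"] = pseudonymize(record["patient_id"], secret_key)
    return shared


# Demo key only; a real key would be generated randomly and stored securely.
key = b"demo-key-do-not-use-in-production"
record = {"patient_id": "P1", "name": "Jane Doe", "diagnosis": "breast"}

shared = deidentify(record, key)
print(shared)
```

Using a keyed hash rather than a plain hash matters: without the secret key, someone who knows the format of the original identifiers cannot simply hash all candidates and match them against the shared data.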

Moreover, both institutions and universities have realized the importance of maintaining both the databases and the code that has been developed (either by providing courses to researchers or by hiring specialized staff).

The pandemic itself has brought increased collaboration among various institutions. For example, more than 75 institutions have worked together in the National COVID Cohort Collaborative, with the aim of collecting clinical data from more than 6.5 million people with COVID-19 in the US.

Parting thoughts

image by Dominik Scythe at unsplash.com

“We are surrounded by data, but starved for insights.” — Jay Baer

Research needs quality data. Most clinical trials fail after years of research and billions in investment. One of the causes is our incomplete knowledge of the mechanisms of cancer onset and resistance. This is why data collection is essential, but collection alone is not enough: the data must also be curated, centralized, and available to the community.

Large studies such as TCGA have shown their value to the scientific community. They have been used to test hypotheses and develop increasingly accurate models and new therapies. Medicine and biology are increasingly entangled with data science, and one of the fundamental principles of data science is the quality of the source data.




Published via Towards AI
