
Analysis of a Synthetic Breast Cancer Dataset in R

Last Updated on January 7, 2023 by Editorial Team

Author(s): Dr. Marc Jacobs


Data Analysis

Time-to-event data fully explored

This post is about me analyzing a synthetic dataset containing 60k records of patients with breast cancer. The data are REAL in terms of structure, but no patient privacy is revealed. The site where you can request the data can be found here and is in Dutch. The great thing is that such datasets not only let us analyze patient data without knowing who the patients are, but also let us test new algorithms or methods on data with a clinical structure. Hence, whatever you develop on the synthetic data should carry over to (new) patient data.

So, let's get started. The dataset is big and there are many models that can be made. My goal was to explore the data and to build a clinical decision tool based on survival models.

I will show standard graphical exploration, missing data analysis, survival models, and refinements such as penalization.

I hope that this post will inspire more people to dive deeper into this dataset. What I made here is by no means an end-product, nor do I wish it to be. There is just too much there.

Let's get started!

First of all, I will import the data, explore it from a basic perspective, and then create new variables by refactoring them, combining them, or renaming them.
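The original code blocks did not survive the export, so here is a minimal sketch of what this step could look like. The file name is an assumption, and the column names (leeft = age, stadium = stage, diffgrad = differentiation grade, gedrag = tumor placement, incjr = inclusion year) follow the Dutch variable names used later in the post:

```r
library(dplyr)

# File name is an assumption; the data must be requested via the Dutch site.
df <- read.csv("synthetic_breast_cancer.csv")

# Refactor coded columns into factors so the survival models treat them
# as categorical variables.
df <- df %>%
  mutate(
    stadium  = factor(stadium),    # tumor stage
    diffgrad = factor(diffgrad),   # differentiation grade
    gedrag   = factor(gedrag),     # tumor placement
    incjr    = factor(incjr)       # inclusion year
  )

str(df)            # basic exploration
summary(df$leeft)  # leeft = age
```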

A lot of data is included, with many variables having a completion rate of 100%. Image: by author.

A visual confirmation of a lot of non-missing data. Image: by author.

As you can see, only five variables are really missing most of the time. There are various missing-data patterns, but nothing exciting. With 60k records, you can play it quite safe. Image: by author.
Further exploration of missingness based on certain categorical variables. Nothing really sticks out. Image: by author.

The missing data patterns do not really worry me, so it's time to look deeper into the data. There is a lot of it, and R does not like me using traditional plots such as points and lines with so many observations. Heatmaps are best here, but I cannot resist the urge to dive deeper into the data and still use good old scatter plots.
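A sketch of both steps, assuming the visdat/naniar packages for the missingness overview (any missing-data package will do) and hex bins as the heatmap-style alternative to 60k raw points; vit_stat_int is the survival-time column named later in the post:

```r
library(visdat)
library(naniar)
library(ggplot2)

vis_miss(df, warn_large_data = FALSE)  # heatmap-style missingness overview
gg_miss_var(df)                        # missing counts per variable
gg_miss_fct(df, fct = stadium)         # missingness split by a categorical variable

# With 60k records, hex bins beat raw scatter points.
ggplot(df, aes(x = leeft, y = vit_stat_int)) +
  geom_hex(bins = 40) +
  labs(x = "Age", y = "Survival time (days)")
```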

Different scatter plots, mainly looking at the relationship between survival time, events, and everything in between, such as age, sex, tumor size, stage, and lymph nodes investigated. They look nice, but do not really help that much. Image: by author.

Let's try something else, and a lot of it. As you can see, I did not go for the traditional scatterplot matrices, since they have no merit here: 75% of the variables included are categorical.

Data per inclusion year. Image: by author.
Survival time by inclusion year. Image: by author.
Survival time by a lot of variables, including the NA option in any of them. It is not so easy to find patterns, although they exist. Image: by author.
Survival time by inclusion year. To be honest, the inclusion year does not help. Although a nice marker for an increase in treatment efficacy, it does not offer much on top of the survival time itself. Image: by author.

The following graph tries to show the relationship between survival time, patient age, inclusion year, and a variable called gen_stat, which is a new combined variable covering the three hormonal markers: her2, er, and pr. If any of these markers was found, regardless of the score, it was labeled a one. Hence, a score of three on gen_stat means that a patient scored positive on all three markers.
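A sketch of how gen_stat could be built; the marker names follow the post, while the coding of a "positive" value is an assumption to adapt to the actual codebook:

```r
# 1 if the marker was found (regardless of score), 0 otherwise;
# the positive codes are assumed.
pos <- function(x) as.integer(x %in% c(1, "1", "positive"))

df$gen_stat <- pos(df$her2) + pos(df$er) + pos(df$pr)
table(df$gen_stat)  # 3 = positive on all three markers
```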

More interesting than those markers is the age of the patient. We will show later that age definitely has a cut-off point.

And some spline models of multiple different relationships. Some look really nice, others look funky, which can be quite natural behavior for a spline. Image: by author.

Let's move to the survival analysis itself. This is not the first time I have posted about conducting survival analysis; for a deeper introduction, please read here and here.
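A minimal Kaplan-Meier sketch with the survival and survminer packages; vit_stat_int is the survival time named later in the post, and vit_stat as the event indicator is an assumption:

```r
library(survival)
library(survminer)

# Overall curve
km_all <- survfit(Surv(vit_stat_int, vit_stat) ~ 1, data = df)
ggsurvplot(km_all, data = df, risk.table = TRUE)

# Split by a single categorical variable, e.g. tumor stage
km_stage <- survfit(Surv(vit_stat_int, vit_stat) ~ stadium, data = df)
ggsurvplot(km_stage, data = df, pval = TRUE)
```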

Looks very nice: a lot of data, a lot of events, but a curve that shows very, very little. Image: by author.

Clear to see is that the survival curves look good for the earliest included patients, but not so much for the later patients. Breast cancer is a disease with a very decent 5-year survival rate. In fact, many patients are still alive after 9 years. Image: by author.

Here we clearly see a statistical difference, but it is unclear if this is due to real differences in the hazard function or to differences in the amount of censoring. Image: by author.

Splitting survival curves by inclusion year is really not a good idea: inclusion year is just a proxy for the length of the survival curve. Image: by author.

Image: by author.

And then I started splitting by almost every imaginable variable included.

These plots clearly show that I cannot predict survival by a single categorical variable. Image: by author.

We have a lot of data to spend, but not a lot of variables, and only a few of them are continuous. For those that are continuous, let's have a look and see if we can find cut-points that yield completely different survival curves, or at least curves that have something interesting to say. Do note that this entire process is statistically driven: normally, biology should drive the hypotheses, and a first effort should be made to verify them via the data. Alas, let's go for the statistical approach and start with age.
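One plausible way to run this search (not necessarily the author's) is survminer's surv_cutpoint(), which picks the cut-point that maximizes a standardized log-rank statistic:

```r
library(survminer)

cut_age <- surv_cutpoint(df, time = "vit_stat_int", event = "vit_stat",
                         variables = "leeft")  # leeft = age
summary(cut_age)        # reports the optimal cut-point (age 72 in the post)
plot(cut_age, "leeft")

df_cut <- surv_categorize(cut_age)  # "low"/"high" version of age
fit_cut <- survfit(Surv(vit_stat_int, vit_stat) ~ leeft, data = df_cut)
ggsurvplot(fit_cut, data = df_cut, pval = TRUE)
```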

The cut-point is set at age 72, so we have higher and lower than 72. Image: by author.
That is a seriously interesting cut-point. Funnily enough, if you look at some of the earliest graphs, you could already see that ages around 70 yielded a steep slope in survival time. Image: by author.

Indeed, age 72 is very interesting. Image: by author.
And also here, age 72 is very interesting, but only in combination with certain co-predictors. Image: by author.

Image: by author.

Now, let's create some survival curves, where we only look above age 72.
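Continuing the sketch, restricting to patients above the cut-point is a simple subset:

```r
over72 <- subset(df, leeft > 72)

fit_gedrag <- survfit(Surv(vit_stat_int, vit_stat) ~ gedrag, data = over72)
ggsurvplot(fit_gedrag, data = over72, pval = TRUE)  # split by tumor placement
```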

Now the survival curves are starting to become much more interesting, since they represent the tail end of the bigger curve. This is where serious events start to take place. A difference in tumor placement can be found, but not in hormonal indicators. Image: by author.
And also differences between tumor sizes. Image: by author.

And now we see quite some differences. Image: by author.

Good, we have something for age. Let's see if we can also find something for the number of positive lymph nodes. This one is not so easy: usually quite a few lymph nodes are examined, but the majority are not positive.

As you can see, a difference can be found, but it shows up very, very early. I wonder how this will hold up later in the model. Image: by author.

And, last but not least, tumor size.

No significant result could be found looking at tumor size alone, but biologically I would guess it does make sense. So let's see. Image: by author.

Then I could not resist looking at inclusion year. It actually makes no sense, since it is a categorical factor, but let's take a look anyhow.

The cut-point is placed at 2015, which makes sense: the later patients just have not built up the necessary survival time yet, and censored data are not counted the same way as event data. Image: by author.

Okay, so I did some exploratory univariate survival curves. Let's advance and conduct a Cox Regression combining multiple variables.

As you can see, the total fraction of events is around 25%. This should provide us with enough power to find a signal if it exists. Image: by author.

Below is the code for the Cox regression containing splines, looking for the predictive power (concordance), and showing the influence of each of the variables included.
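The code itself did not survive the export; a sketch using penalized smoothing splines via survival::pspline(), with pos_lymph (positive lymph nodes) and tumor_size as assumed column names:

```r
library(survival)

cox_fit <- coxph(Surv(vit_stat_int, vit_stat) ~ pspline(leeft) +
                   pspline(pos_lymph) + pspline(tumor_size) +
                   stadium + gedrag + gen_stat + mari,
                 data = df)
summary(cox_fit)  # reports the concordance (c-statistic) alongside the terms
```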

The c-statistic comes out at 0.77; it is the survival-analysis equivalent of the area under the ROC curve. Image: by author.
And an impressive-looking Hazard Ratio plot. Not at all useful for communication, but it looks nice anyhow. Image: by author.

All of the above is useless if we do not first check the assumptions of the semi-parametric Cox regression. This means that the Hazard Ratios should be proportional and stable over time. I will also check if the splines fit the data well.
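The standard check is cox.zph(), which tests and plots the scaled Schoenfeld residuals:

```r
zp <- cox.zph(cox_fit)
zp        # per-term tests; a significant p-value flags a proportionality violation
plot(zp)  # scaled Schoenfeld residuals against time, one panel per term
```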

P-values are nice to use, but not really helpful. A significant result here means that the proportionality assumption cannot be kept. This is not good, as it is the most important assumption of a Cox regression. If violated, the Hazard Ratios are not stable and thus become a function of time. Image: by author.
That we do not have plots for the Schoenfeld residuals, which test the PH assumption, is worrying. Image: by author.

The deviance residuals are the standardized residuals of a Cox regression. You want them to be homogeneous.

This looks good, but also strange. Especially to the left. Image: by author.
Looks good as well. Image: by author.

I wanted to dig a bit deeper into the suitability of the splines. There is a function for that, but it does not accept splines. So, just to have a glimpse, let's take a look using cubic polynomials.
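The function alluded to is plausibly survminer's ggcoxfunctional(), which plots martingale residuals against transformations of a continuous predictor but indeed does not accept spline terms:

```r
library(survminer)

# Martingale residuals against linear, quadratic, and cubic terms for age;
# repeat for the other continuous predictors.
ggcoxfunctional(Surv(vit_stat_int, vit_stat) ~ leeft + I(leeft^2) + I(leeft^3),
                data = df)
```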

For all three, the martingale residuals are NOT linear. Looking at this, I am starting to believe that categorizing continuous data is not such a bad thing, even if it means that I may lose some predictive power. We already saw that the cut-points offer some explanation, although not for all. Image: by author.

So, in the next model, I deleted the spline for the number of positive lymph nodes. The variables probably need a transformation, but let's see how the model performs now by looking at the Variance Inflation Factor (VIF). The VIF provides an indicator of the cross-correlations of the variables.
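A sketch of the refit, now with the lymph-node count entering linearly, using the rms package so that vif() is available:

```r
library(rms)

dd <- datadist(df); options(datadist = "dd")

cph_fit <- cph(Surv(vit_stat_int, vit_stat) ~ rcs(leeft, 4) + pos_lymph +
                 rcs(tumor_size, 4) + stadium + gedrag + gen_stat + mari,
               data = df, x = TRUE, y = TRUE, surv = TRUE)
vif(cph_fit)  # values far above ~5-10 flag problematic cross-correlation
```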

Let's take a look at the variance inflation factor. One variable sticks out in particular: gedrag, which means the tumor location. We need to look closer at that. Image: by author.
And now we have plots for the Schoenfeld residuals. They look good! I do not want to go for p-values, because with 60k observations, even the slightest deviations will be seen as statistically significant. Image: by author.
And the residuals and linearity assumption. I can live with these plots. Image: by author.

A sensible next step would be to assess the predictive power of the model. Well, actually, I want to see if the model can discriminate between groups, so I made three risk profiles (I am not even sure if they make biological sense) and looked at what the model predicts.
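The profiles could look like this; the covariate values are purely illustrative, not taken from the post:

```r
# Illustrative low/medium/high risk profiles for the cph_fit sketched above
profiles <- data.frame(
  leeft      = c(45, 65, 80),
  pos_lymph  = c(0, 2, 8),
  tumor_size = c(10, 30, 60),
  stadium    = factor(c("1", "2", "4"), levels = levels(df$stadium)),
  gedrag     = factor(levels(df$gedrag)[1], levels = levels(df$gedrag)),
  gen_stat   = c(3, 1, 0),
  mari       = factor(levels(df$mari)[1], levels = levels(df$mari))
)

plot(survfit(cph_fit, newdata = profiles), col = 1:3,
     xlab = "Days", ylab = "Survival probability")
legend("bottomleft", c("low", "medium", "high"), col = 1:3, lty = 1)
```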

I am not sure if the risk profiles make sense, but they are for sure discriminating. Image: by author.

I already showed that to assess a model, you can look at the residuals and the concordance statistic. In the rms package, there are also some great tools to validate your model via bootstrapping at particular time points. Let's see how this turns out.
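A sketch with rms::validate() and rms::calibrate(); the evaluation time u must match the time.inc stored in the fit, and the B and m values are illustrative:

```r
# Refit, storing survival at the evaluation horizon (here 1000 days).
cph_val <- cph(Surv(vit_stat_int, vit_stat) ~ rcs(leeft, 4) + pos_lymph +
                 stadium + gedrag + gen_stat,
               data = df, x = TRUE, y = TRUE, surv = TRUE, time.inc = 1000)

validate(cph_val, B = 200)  # optimism-corrected discrimination indexes
cal <- calibrate(cph_val, u = 1000, B = 200, cmethod = "KM", m = 2000)
plot(cal)                   # repeat with time.inc/u at 2000 and 3000 days
```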

The bootstrapped calibration plots for 1000, 2000, and 3000 days. Not that great, but also not extremely bad; we can do better for sure. Image: by author.

In this day and age of machine learning, we can go and try some additional models. The first is actually an old one: the Nelson-Aalen estimator, the non-parametric sibling of the semi-parametric Cox regression. This approach focuses exclusively on the hazard function and will provide us with plots showing the influence of each predictor on the hazard function. Quite important information.
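The per-predictor hazard plots described here match Aalen's additive regression model, which is available as survival::aareg(); a sketch under that assumption:

```r
library(survival)

aa <- aareg(Surv(vit_stat_int, vit_stat) ~ leeft + stadium + gedrag +
              gen_stat + mari, data = df)
aa        # per-predictor tests
plot(aa)  # cumulative regression functions: a trend = influence on the hazard
```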

And here, the influence of each predictor on the hazard. What you are looking for is a trend. That the confidence interval widens at the end is normal, due to the amount of censoring going on. As you can see, some predictors seem to have little influence, but others definitely do. Image: by author.

Next up is true machine learning: the Random Forest. To me, this is just a tool to look at which variables are deemed important, not something to predict with. The reason is that a Random Forest very easily becomes unexplainable.

On the first try, I used the ranger package.
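A survival-forest sketch with ranger; for survival outcomes the fitted object carries the out-of-bag error and, in recent versions, the estimated survival matrices:

```r
library(ranger)
library(survival)

# Assumes df holds only the predictors plus the time/event columns.
rf <- ranger(Surv(vit_stat_int, vit_stat) ~ ., data = df,
             num.trees = 500, importance = "permutation")

rf$prediction.error                              # out-of-bag error (1 - C-index)
sort(rf$variable.importance, decreasing = TRUE)  # permutation importance

# Predicted survival curve for the first patient
plot(rf$unique.death.times, rf$survival[1, ], type = "l",
     xlab = "Days", ylab = "Predicted survival")
```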

The survival curve on the left, and the predicted survival curve on the right. Image: by author.
The variable importance metrics. As you can see, some are really important, whereas others are for sure not. Based on the p-values, one could include only age, mari, and tumor size. Image: by author.

Let's try another package: randomForestSRC. In addition to running a Random Forest for time-to-event data, it also provides nice ways to plot the data. The standard option will build 500 trees.
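A randomForestSRC sketch; plot() gives the error-rate and importance panels, plot.variable() the partial dependence, and (in recent package versions) get.tree() extracts a single tree while max.subtree() returns the minimal depths:

```r
library(randomForestSRC)

rf2 <- rfsrc(Surv(vit_stat_int, vit_stat) ~ ., data = df,
             ntree = 500, importance = TRUE)

plot(rf2)                          # error rate vs. trees + variable importance
plot.variable(rf2, xvar.names = "leeft",
              partial = TRUE)      # partial dependence for age
plot(get.tree(rf2, tree.id = 1))   # a single tree from the forest
max.subtree(rf2)$order[, 1]        # first-order minimal depth per variable
```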

The most important metrics, by far: the number of trees in relation to the error rate, and the variable importance plot. The Random Forest of this package agrees with the Random Forest of the previous ranger package. Image: by author.
The error rate plotted by tumor kind. Image: by author.
A single tree depicted from the Random Forest. Despite a tree being completely explainable, it becomes very unwieldy when you look at it this way. Image: by author.
Predicted survival probabilities. Image: by author.
Partial dependence plot for age and tumor kind. Image: by author.
Graphs showing the minimal depth of the variables included. The minimal depth of a variable in a tree equals the depth of the node that splits on that variable and is closest to the root of the tree. If it is low, then many observations are divided into groups on the basis of this variable. You can see how these values relate to the variable importance plot. Image: by author.
Plots showing the relationship between variables and survival as seen by the Random Forest model. Image: by author.
The interactive minimal depth: the importance of each variable in relation to the other variables included. Remember, the lower the better. Image: by author.

The Random Forest models above already clearly showed that some variables are not that interesting, whereas others are. Now, one of the best methods to look at variables and their importance is the L1 and L2 penalization pair: LASSO and Ridge regression. I have posted about these and other variable selection methods before.

These methods were applied to a subset of the data, since the entire dataset gave memory problems that I could not solve. As you can see, I only subsampled once; a better way would be to repeat the subsampling many, many times so you get bootstrapped estimates of importance. For now, we keep it easy and make a single subset of 10k.

First, we will ask the penalized package to profile the L1 and L2 parameters and then search for their optimum. Once we have that, we apply them both. By applying them both, we deploy what is called an elastic net.
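A sketch with the penalized package: optL1()/optL2() search for the optimal penalty weights via cross-validation, and supplying both to penalized() yields the elastic net. The single 10k subsample follows the post; the penalized formula is illustrative:

```r
library(penalized)
library(survival)

set.seed(1)
sub <- df[sample(nrow(df), 10000), ]  # single 10k subsample, as in the post

pen <- ~ leeft + stadium + gedrag + gen_stat + diffgrad  # illustrative terms

opt1 <- optL1(Surv(vit_stat_int, vit_stat), penalized = pen, data = sub, fold = 5)
opt2 <- optL2(Surv(vit_stat_int, vit_stat), penalized = pen, data = sub, fold = 5)

fit_en <- penalized(Surv(vit_stat_int, vit_stat), penalized = pen,
                    lambda1 = opt1$lambda, lambda2 = opt2$lambda, data = sub)
coefficients(fit_en, "all")  # zeroed coefficients = categories shrunk away
```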

Here you can see L1 and L2 at work via their solution paths. The difference between them is that L1 will delete variables by shrinking their coefficients to zero, whereas L2 will only adjust them. The L1 penalty takes a long time to come into play and in the end selects only a handful of variables. Image: by author.
Optimal L1 variable selection via shrinkage. Image: by author.
Optimal L2 coefficients. Image: by author.
The coefficients of the elastic net. As you can see, some categories of the factor variables have been removed completely. This is, for instance, the case for tumor stage (stadium) and hormonal receptors (gen_stat). Image: by author.
The prediction models based on the elastic net. Image: by author.
The predicted survival curves for the 10k included patients. The shape seems about right, but then again, the plot really does not help when you have so many IDs included. Image: by author.

Besides the penalized package, there is also the glmnet package. Let's give this one a try as well.
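A glmnet sketch; note that glmnet wants a numeric model matrix, and recent versions accept a Surv object as the response for family = "cox":

```r
library(glmnet)
library(survival)

X <- model.matrix(~ leeft + stadium + gedrag + gen_stat + incjr +
                    diffgrad + org_chir, data = sub)[, -1]  # drop the intercept
y <- Surv(sub$vit_stat_int, sub$vit_stat)

fit_l1 <- glmnet(X, y, family = "cox", alpha = 1)  # L1 solution path
plot(fit_l1, xvar = "lambda")

cv_l1 <- cv.glmnet(X, y, family = "cox", alpha = 1)  # lambda via cross-validation
coef(cv_l1, s = "lambda.1se")  # some variables drop out of the model entirely
```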

L1 via the normal procedure, and looking for the optimal lambda via cross-validation. Image: by author.
Applying the cross-validated lambda value, and seeing how inclusion year (incjr), differential grade (diffgrad), and surgery (org_chir) are deleted from the model. Image: by author.

The following part is a bit experimental for this particular dataset, since I do not really need it. Below you will see me fit parametric survival models, which try to bring survival curves to their end by modeling beyond the data. Now, modeling beyond the data via a (statistical) model is always tricky, but these kinds of models ARE accepted and often used in Health Technology Assessment procedures.
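A sketch: survreg() defaults to the Weibull distribution, and predicted quantiles extrapolate the curve beyond the observed follow-up; tumorsoort (tumor kind) is an assumed column name:

```r
library(survival)

wb <- survreg(Surv(vit_stat_int, vit_stat) ~ tumorsoort + stadium + leeft,
              data = df)  # dist = "weibull" is the default
summary(wb)

# Parametric survival curve for one covariate pattern, extrapolated beyond the data
p <- seq(0.01, 0.99, by = 0.01)
q <- predict(wb, newdata = df[1, ], type = "quantile", p = p)
plot(q, 1 - p, type = "l", xlab = "Days", ylab = "Survival probability")
```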

The survreg procedure automatically fits the Weibull model if nothing additional is specified. Image: by author.
The fit of the model, based on the time-to-event data (vit_stat_int), tumor kind, tumor stage (stadium), and age (leeft). Not really a good fit. Image: by author.
Conditional model. Image: by author.
A plot to check which distribution fits best. As you can see, the majority could not be fitted at all. Image: by author.

So, the parametric survival model did not really bring me anything, and perhaps the data are not well suited to this kind of modeling anyway. Survival in breast cancer is quite high, and a parametric survival model would extend far beyond the data at hand. Most likely too far. So, let's stick to what we have and seal the deal a bit.

Here, you see me building a nomogram, both static and dynamic, using the rms package and the DynNom package.
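A sketch of both: a static nomogram via rms::nomogram() on the earlier cph fit, and a dynamic Shiny-based one via DynNom on an equivalent coxph fit (covariates illustrative):

```r
library(rms)

surv_fun <- Survival(cph_fit)  # cph_fit from the earlier sketch (surv = TRUE)
nom <- nomogram(cph_fit,
                fun = list(function(x) surv_fun(1000, x),
                           function(x) surv_fun(3000, x)),
                funlabel = c("1000-day survival", "3000-day survival"))
plot(nom)

# Dynamic nomogram: DynNom launches a Shiny app around a coxph fit.
library(DynNom)
cox_simple <- coxph(Surv(vit_stat_int, vit_stat) ~ leeft + stadium + gedrag +
                      gen_stat + mari, data = df)
DynNom(cox_simple, data = df)
```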

Image: by author.

Good, so we have now moved through the most important aspects of modeling this dataset: visual exploration, univariate survival models, multivariable Cox regression models, model assessment and validation, penalization and variable importance, and parametric survival models. Now that I have shown what a sensible pathway towards a dynamic nomogram, usable in a clinical setting, could look like, let's rehash what we have learned and build a second tier of models.

It’s decision time. Let's make the model we want to have and focus a bit more. Based on the above, I will make the following choices:

  1. Females only → although breast cancer is a possibility in men, we have more data on females.
  2. Tumor kind 501300 only → the majority of the data is situated here.
  3. No inclusion year → it is just a proxy for survival.
  4. From the penalized models, only the following variables were deemed important: age, differential grade, tumor position, hormonal indicators, Mari-procedure, and tumor stage.
  5. From the Random Forest models, only the following variables were deemed important: age, Mari-procedure, tumor size, tumor stage, sentinel node procedure performed, surgery performed, number of positive lymph nodes found, differential grade, and hormonal receptors.

Let's try to re-verify the above by plotting some basic proportional tables. What I want to see is whether the variables included show the same proportion of events and non-events across their levels. If that is the case, the variable has limited appeal.

First, check the proportions within each row of the matrix for tumor kind, sex, lateralization, and inclusion year.
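These tables are one prop.table() call each; vit_stat (event), tumorsoort (tumor kind), and gesl (sex) remain assumed column names:

```r
# margin = 1: proportions within each row
prop.table(table(df$tumorsoort, df$vit_stat), margin = 1)
prop.table(table(df$gesl,       df$vit_stat), margin = 1)
prop.table(table(df$incjr,      df$vit_stat), margin = 1)
```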

Looking good. It is safe to focus only on females, and on the tumor kind with the most data and the lowest survival rate. Image: by author.
Lateralization is indeed not a differentiating variable. Image: by author.
The inclusion year and event proportion. The number of events is highest for the earliest inclusion years, which mirrors the survival time itself. Including it as a proxy would hurt the analysis, so we indeed kick it out. Image: by author.

Then, let's look at the variables deemed important.

They do not show a lot of differentiation either, to be honest. Well, at least the majority do not, while some, like tumor placement (gedrag), have a major impact. The problem with a 2x2 matrix is that it can really only show marginal effects, nothing conditional. Image: by author.

To ease the model a bit, I will refactor some of the variables included to reduce the number of factor levels. I will also cut two continuous variables into categories. Normally, this is really not a smart thing to do, as you will lose predictive power, but for the sake of the prediction model it might make sense.
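A sketch of the refactoring: filtering to the chosen subgroup, binning the two continuous variables (the age break follows the earlier cut-point; the size breaks are illustrative), and collapsing rare factor levels:

```r
library(dplyr)

df2 <- df %>%
  filter(gesl == "female", tumorsoort == 501300) %>%  # choices 1 and 2 above
  mutate(
    age_cat  = cut(leeft, breaks = c(-Inf, 72, Inf),
                   labels = c("<=72", ">72")),             # earlier cut-point
    size_cat = cut(tumor_size, breaks = c(-Inf, 20, 50, Inf),
                   labels = c("small", "medium", "large")),  # illustrative breaks
    diffgrad = forcats::fct_lump_n(diffgrad, n = 3)        # collapse rare levels
  )
```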

Image: by author.
Image: by author.

Now, let's rerun the models and see where we land. It will not be perfect, for sure, since this model was made without an oncologist or anybody else with deep content knowledge on how to treat breast cancer.
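The second-tier model can then be refitted on the reduced data, including the age-by-grade interaction mentioned below:

```r
dd2 <- datadist(df2); options(datadist = "dd2")

final <- cph(Surv(vit_stat_int, vit_stat) ~ age_cat * diffgrad + stadium +
               gedrag + mari + size_cat + gen_stat,
             data = df2, x = TRUE, y = TRUE, surv = TRUE)
final       # coefficients plus discrimination indexes
vif(final)  # re-check the variance inflation factors
```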

We should have enough events. Image: by author.
The model fit, containing one interaction between age and differential gradation. Image: by author.
The overall survival curve. Almost down to zero. Image: by author.
Variance Inflation Factors. For sure some high levels; this model probably still has way too many variables included. Image: by author.
Not too bad, but I would like to get above the magic threshold of 0.8. Let's see later how the calibration lines move. Image: by author.
Not too bad, not perfect either. Image: by author.
Hmm, the residuals look off at the end. Image: by author.
The calibration plot. There is this strange dent, which I am pretty sure is caused by age, since age is a very big predictor of survival time (both in the ability to deal with the cancer and plainly in the number of years left to live). Image: by author.
The nomogram, which attributes quite some weight to age, stage, tumor location, and whether the Mari procedure was done. The rest is not so interesting. Image: by author.
And the dynamic nomogram. Image: by author.
Here you see that every time I make a prediction, the curve is added, as well as the confidence interval and the table showing the risk profile. Image: by author.

All right, folks, this is it for now. I am surely not done, but I need some time to contemplate. I will also dedicate a post to showing you how to host the Shiny app.

If something is amiss with this post, or there is a mistake or you have a question, you know where to find me!

Enjoy!


Analysis of a Synthetic Breast Cancer Dataset in R was originally published in Towards AI on Medium.
