Survival Analysis with Python Tutorial — How, What, When, and Why
Last Updated on October 21, 2021 by Editorial Team
Author(s): Pratik Shukla
This article covers an extensive review with step-by-step explanations and code for how to perform statistical survival analysis used to investigate the time some event takes to occur, such as patient survival during the COVID-19 pandemic, the time to failure of engineering products, or even the time to closing a sale after an initial customer contact.
This tutorial’s code is available on Github and its full implementation on Google Colab.
📚 Check out our Monte Carlo Simulation Tutorial with Python 📚
Table of Contents:
- Survival Analysis Basics
- Kaplan-Meier fitter Theory with an Example.
- Nelson-Aalen fitter Theory with an Example.
- Kaplan-Meier fitter Based on Different Groups.
- Log-Rank-Test with an Example.
- Cox-Regression with an Example.
- Resources.
1. Survival Analysis Basics:
Survival analysis is nothing but a set of statistical approaches used to determine the time it takes for an event of interest to occur. We use survival analysis to study the time until some event of interest occurs. Time is usually measured in years, months, weeks, days, and other time measuring units. The event of interest could be anything of interest. It could be an actual death, a birth, a retirement, along with others.
For instance, how can Survival Analysis be useful to analyze the ongoing COVID-19 pandemic data?
- We can find the number of days until patients showed COVID-19 symptoms.
- We can find for which age group it is deadlier.
- We can find which treatment has the highest survival probability.
- We can find whether a person’s sex has a significant effect on their survival time?
- We can find the median number of days of survival for patients.
- We can find which factor has more impact on patients’ survival.
In this tutorial, we are going to perform a thorough analysis of patients with lung cancer. Do not worry if it seems complicated. Once we go through the logic behind it, we will have the ability to perform survival analysis on any data set. Exciting! Isn’t it?
Survival analysis is used in a variety of field such as:
- Cancer studies for patients survival time analyses.
- Sociology for “event-history analysis.”
- In Engineering for “failure-time analysis.”
- Time until product failure.
- Time until a warranty claim.
- Time until a process reaches a critical level.
- Time from initial sales contact to a sale.
- Time from employee hire to either termination or quit.
- Time from a salesperson hires to their first sale.
In cancer studies, typical research questions are:
- What is the impact of specific clinical characteristics on patient’s survival? For example, is there any difference between people who have higher blood sugar and those who do not?
- What is the probability that an individual survives a specific time (years, months, days)? For example, given a set of cancer patients, we will tell that if 300 days after a cancer diagnosis has been passed, then the probability of that person being alive at that time will be 0.7.
- Are there differences in survival between groups of patients? For example, Let’s say there are two groups of people diagnosed with cancer. Those two groups were given two different kinds of treatments. Our goal here will be to find out if there is a significant difference between the survival time for those two different groups based on the treatment they were given.
Objectives
In this tutorial, we will see the following methods of survival analysis in detail:
1) Kaplan-Meier plots to visualize survival curves.
2) Nelson-Aalen plots to visualize the cumulative hazard.
3) Log-Rank test to compare the survival curves of two or more groups
4) Cox-proportional hazards regression finds out the effect of different variables like age, sex, and weight on survival.
Fundamental concepts
We will start this tutorial by understanding some basic definitions and concepts related to survival analysis.
Survival time and type of events in cancer studies
Survival Time: It is usually referred to as an amount of time until when a subject is alive or actively participates in a survey.
There are three main types of events in survival analysis:
1) Relapse: Relapse is defined as a deterioration in the subject’s state of health after a temporary improvement.
2) Progression: Progression is defined as the process of developing or moving gradually towards a more advanced state. It basically means that the health of the subject under observation is improving.
3) Death: Death is defined as the destruction or permanent end of something. In our case, death will be our event of interest.
Censoring
As we discussed above, survival analysis focuses on the occurrence of an event of interest. The event of interest can be anything like birth, death, or retirement. However, there is still a possibility that the event we are interested in does not occur. Such observations are known as censored observations.
There are three types of censoring:
- Right Censoring: The subject under observation is still alive. In this case, we can not have our timing when our event of interest(death) occurs.
- Left Censoring: In this type of censoring, the event cannot be observed for some reason. It may also include the event that occurred before the experiment started, such as the number of days from birth when the kid started walking.
3) Interval Censoring: In this type of data censoring, we only have data for a specific interval, so it is possible that the event of interest does not occur during that time.
Censoring may occur in the following instances:
- A patient has not (yet) experienced the event of interest (death or relapse in our case) within a period.
- A patient is not followed anymore.
- If a patient moves to another city, then follow-up might not be possible for the hospital staff.
- We only have the data for a specific interval.
Survival and hazard functions
We generally use two related probabilities to analyze survival data for a subject.
1) Survival Function(S)
2) Hazard Function (H)
To find the survival probability of a subject, we will use the survival function S(t), the Kaplan-Meier Estimator. The survival function is defined as the probability that an individual (subject) survives from the time origin (diagnosis of a disease) to a specified future time t. Please note that the time can be in various forms like minutes, days, weeks, months, or years. For example, S(200)=0.7 means that after 200 days, a subject’s survival probability is 0.7. In many deadly diseases, the survival probability decreases as the period increases. If the subject is alive at the end of an experiment, then that data will be censored.
The hazard probability, denoted by H(t), is the probability that an individual (subject) who is under observation at a time t has an event (death) at that time. For example, If h(200) = 0.7 means that after 200 days or on the 200th day, the probability of being dead is 0.7. One thing to keep in mind here is that the hazard function gives us the cumulative probability. We will discuss this in detail later in this tutorial.
Notice that, in contrast to the survival function, which focuses on the survival of a subject, the hazard function gives us the probability of a subject being dead on a given time. We can note that higher survival probability and lower hazard probability is good for the subject’s health.
Let’s move forward to the cool coding part!
Data Description:
2. Kaplan-Meier Estimator Theory and Example
The Kaplan–Meier estimator is a non-parametric statistic used to estimate the survival function (probability of a person surviving) from the lifetime data. In medical research, it is often used to measure the fraction of patients living for a specific time after treatment or diagnosis. For example: Calculating the amount of time(year, month, day) a particular patient lived after he/she was diagnosed with cancer or his treatment starts. The estimator is named after Edward L. Kaplan and Paul Meier, who submitted similar manuscripts to the American Statistical Association Journal.
The probability of survival at time ti, which is denoted by S(ti), is calculated as follow:
We can also write the equation above in a simple form as follows:
For Example:
1) Survival probability at time t=1:
2) Survival probability at time t=2:
3) Survival probability at time t=3:
In a more generalized way, the probability of survival for a particular time is given by.
From the above equations, we can confidently say that.
Kaplan-Meier Estimator (Without any groups)
1) Import required libraries:
2) Read the data set:
3) Print the columns in our data set:
4) Get additional information about the dataset:
It gives us information about the data type of the columns along with their null-value counter. We need to remove the rows with a null value for some of the survival analysis methods.
5) Get statistical information about the dataset:
It gives us some statistical information like the total number of rows, mean, standard deviation, minimum value, 25th percentile, 50th percentile, 75th percentile, and maximum value for each column.
6) Find out sex distribution using histogram:
This gives us a general idea about how our data is distributed. In the following graph, we can see that around 139 values have a status of 1, and approximately 90 values have a status of 2, which means that there are 139 males and around 90 females in our dataset.
7) Create an object for Kaplan-Meier-Fitter:
8) Organize the data:
Now we need to organize our data. We will add a new column in our dataset that is called “dead.” It stores the data about whether a person that is a part of our experiment is dead or alive(based on the status value). If our status value is 1, then that person is alive, and if our status value is 2, then the person is dead. It is a crucial step for what we need to do in the next step as we are going to store our data in columns called censored and observed. Where observed data stores the value of dead persons in a specific timeline, and censored data stores the value of alive persons or persons that we are not going to investigate.
9) Fitting our data into an object:
Here our goal is to find the number of days a patient survived before they died. Our event of interest will be “death,” which is stored in the “dead” column. The first argument it takes is the timeline for our experiment.
10) Generate event table:
One of the most crucial methods of the kmf object is the “event_table.” It gives us various information for our survival analysis. Let’s have a look at it column-by-column.
a) event_at: It stores the value of the timeline for our dataset. i.e., when was the patient observed in our experiment or when was the experiment conducted. It can be several minutes, days, months, years, and others. In our case, it is going to be for many days. It stores the value of survival days for the subjects.
b) at_risk: It stores the number of current patients under observation. In the beginning, it will be the total number of patients we are going to observe in our experiment. If new patients are added at a particular time, then we have to increase their value accordingly. Therefore:
c) entrance: It stores the value of new patients in a given timeline. It is possible that while experimenting, other patients are also diagnosed with the disease. To account for that, we have the entrance column.
d) censored: Our ultimate goal is to find the survival probability for a patient. At the end of the experiment, if the person is still alive, we will add him/her to the censored category. We have already discussed the types of censoring.
e) observed: It stores the value of the number of subjects that died during the experiment. From a broad perspective, these are the people who met our event of interest.
f) removed: It stores the values of patients that are no longer part of our experiment. If a person dies or is censored, then he/she falls into this category. In short, it is an addition of the data in the observed and censored category.
11) Calculating the probability of survival for individual timelines:
Let’s first see the formula for calculating the survival of a particular person at a given time.
a) Survival probability at t=0 only:
b) Survival probability at t=5 only:
c) Survival probability at t=11 only:
Now what we found here is the probability for a specific time. What we want is the probability for the entire time for a patient. i.e., the probability of patient surviving all the rounds of the experiment.
In a nutshell, we want to find the probability of a person surviving all of the time he lived after diagnosis. What we just found is the probability of a particular experiment only.
Let us take a straightforward example to understand the concept of conditional probability. For instance, we have a total of 15 balls in a non-transparent box. Out of the 15 balls, we are seven black balls, five red balls, and three green balls. Here is a pictorial view for that.
a) Probability of choosing a red ball:
b) Probability of choosing the second red ball:
Since we’ve removed a ball that was red, the total number of red balls we have is 4, and the total number of balls we have is 14.
If our question is to find the probability of both the balls being red, we will multiply it, and that is precisely what we are going to do in survival analysis. We know that a patient has survived the 1st time interval, and we want to find the probability of him surviving the second time interval given that he has survived the 1st time interval. My point here is we do not want to find the probability of the second time interval only. We want the total probability of him surviving the entire period.
In our example, the probability of both balls being red is as following:
In survival analysis, we can write the formula as follows:
12) Finding survival probability:
We want to find the probability that a patient has survived through all the timeline till now. Now we need to find the actual survival probability for a patient.
a) Survival probability for t=0:
b) Survival probability for t=5:
c) Survival probability for t=11:
13) Predicting the probability:
Now the kmf object’s predict function does all of this work for us. However, it is always good practice to know the logic behind it.
14) Finding the survival probability for an array of the timeline:
15) Get survival probability for the whole timeline:
The survival probability for a patient at timeline 0 is 1. Grasping our thoughts, then we gather that the probability that a person dies on the 1st day of diagnosis is near equals to 0. So we can say that the survival probability is as high as possible. As the timeline increases, the probability of survival decreases for a patient.
16) Plot the graph:
17) The median number of survival days:
It provides the number of days where, on average, 50% of the patients survived.
From the code above, we can say that on average, a person lived 310 days after the day of diagnosis.
18) Survival probability with confidence interval:
19) Graph for survival probability with confidence interval:
Now all the information we have is for the survival of a person. Now we will see what is the probability for a person to die at a specific timeline. Here notice that a higher survival probability is suitable for a person, but higher cumulative density (probability of a person to die) is not so good!
20) Probability of a person dying:
Here the denominator value is subjected at risk in the previous row.
The formula for cumulative density:
a) Probability of a person dying at t=0:
b) Probability of a person dying at t=5:
c) Probability of a person dying at t=11:
Find the cumulative density:
d) Cumulative density at t=0:
e) Cumulative density at t=5:
f) Cumulative density at t=11:
21) Plot the graph for cumulative density:
Notice that, as the number of survival days increases the probability of a person dying increases.
22) The cumulative density with confidence interval:
23) Graph for cumulative density with a confidence interval:
24) Get cumulative density for a particular day:
25) The median time to an event:
We can get the amount of time remaining from the median survival time.
26) Graph for the median time to the event:
3. Estimating hazard rates using Nelson-Aalen
Hazard function H(t):
Until now, we discussed the Kaplan-Meier survival function. Using that, we can get the probability of the event of interest (death in our case) not occurring by that time. The survival functions are a great way to summarize and visualize the survival dataset; however, it is not the only way. We can visualize the aggregate information on survival using the Nelson-Aalen hazard function h(t). The hazard function h(t) gives us the probability that a subject under observation at time t has an event of interest (death) at that time. To get the information about the hazard function, we cannot transform the Kaplan-Meier estimator. For that, there is a proper nonparametric estimator of the cumulative hazard function:
Cumulative Hazard Function:
1) Import required libraries:
2) Create an object of Nelson-Aalen-Fitter:
3) Fitting the data:
4) Finding the cumulative hazard:
Here we’ll use the event table generated in the previous part to understand how the hazard function actually works.
Here is the formula to find the non-cumulative hazard probability at a specific time:
a) Finding the hazard probability at t=0:
b) Finding the hazard probability at t=5:
c) Finding the hazard probability at t=11:
d) Finding the cumulative hazard probability at t=0:
e) Finding the cumulative hazard probability at t=5:
f) Finding the cumulative hazard probability at t=11:
5) Plot the graph for cumulative hazard:
The cumulative hazard has a less clear understanding than the survival functions, but the hazard functions are based on more advanced survival analysis techniques.
6) Predict a value:
7) Cumulative hazard probability with confidence interval:
8) Graph for cumulative hazard probability with confidence interval:
9) Cumulative hazard vs. cumulative density:
4. Kaplan-Meier Estimator with groups
Until now, we saw how we could find the survival probability and hazard probability for all of our observations. Now it is time to perform some analysis on our data to determine whether there is any difference in survival probability if we divide our data into groups based on specific characteristics. Let’s divide our data into two groups based on sex: Male and Female. Our goal here is to check is there any significant difference in survival rate if we divide our dataset based on sex. Later in this tutorial, we will see on what basis do we divide the data into groups.
1) Import required libraries:
2) Read the dataset:
3) Organize our data:
4) Create two objects of Kaplan-Meier-Fitter():
5) Divide the data into groups:
6) Male data:
7) Female data:
8) Fit data into our objects:
9) Event table for the male group:
10) Event table for the female group:
11) Predicting survival probabilities:
Now we can predict the survival probability for both the groups.
12) Get the complete list of survival probabilities:
a) Survival probability for a male group:
b) Survival probability for the female group:
13) Plot the graph for survival probabilities:
Here we can notice that the probability of females surviving lung cancer is higher than that of males. Therefore, from this data, we can say that medical researchers should focus more on the factors that lead to male patients’ poor survival rates.
14) Get the cumulative density:
a) For the male group:
b) For the female group:
15) Plot the graph for cumulative density:
16) Hazard function:
17) Fit the data into our objects:
18) Cumulative hazard probability:
a) For the male group:
b) For the female group:
19) Plot the graph for cumulative hazard probability:
20) The median time to event for the male group:
21) The median time to event graph for the male group:
22) The median time to event for the female group:
23) The median time to event graph for the female group:
24) Survival probability with a confidence interval for the male group:
25) Survival probability graph with a confidence interval for the male group:
26) Survival probability with a confidence interval for the female group:
27) Survival probability graph with a confidence interval for the female group:
28) Comparison of cumulative density vs. cumulative hazard:
a) For the male group:
b) For the female group:
5. Log-Rank Test:
The log-rank test is a hypothesis test that is used to compare the survival distribution of two samples.
Goal: Our goal is to see if there is any significant difference between the groups being compared.
Null Hypothesis: The null hypothesis states that there is no significant difference between the groups being studied. If there is a significant difference between those groups, then we have to reject our null hypothesis.
How do we say that there is a significant difference?
A p-value between 0 and 1 denotes the statistical significance. The smaller the p-value, the more significant the statistical difference between groups being studied is. Notice that our goal is to find if there is any difference between the groups we are comparing. If yes, we can do more research on why there are lower survival chances for a particular group based on various information like their diet, lifestyle, and others.
Less than (5% = 0.05) P-value means there is a significant difference between the groups we compared. We can partition our groups based on their sex, age, race, treatment method, and others.
It’s a test to find out the value of P.
1) Get the variables for the Log-rank test:
2) Performing the Log-rank test:
3) Print the p-value:
We have compared the survival distributions of two different groups using the famous statistical method, the Log-rank test. Here we can notice that the p-value is 0.00131(<0.005) for our groups, which denotes that we have to reject the null hypothesis and admit that the survival function for both groups is significantly different. The p-values give us strong evidence that “sex” was associated with the number of survival days. In short, we can say that the “sex” of a person makes a significant difference in survival probability.
6. Cox-proportional hazard model
The cox-proportional hazard model is a regression model generally used by medical researchers to determine the relationship between the survival time of a subject and one or more predictor variables. In short, we want to find out how different parameters like age, sex, weight, height affects the survival time of a subject.
In the previous section, we saw Kaplan-Meier, Nelson-Aalen, and Log-Rank-Test. However, in that, we were only able to consider one variable at a time, and one more thing to notice is that we were performing operations only on categorical variables like sex, status, and others. It can not be used for non-categorical data like age, weight, or height. As a solution for that, we use the Cox proportional hazards regression analysis, which works for both quantitative predictors non-categorical variables and for categorical variables.
Why do we need it?
In medical research, we are generally considering more than one factor to diagnose a person’s health or survival time. i.e., we generally make use of their sex, age, blood pressure, and blood sugar to find out if there is any significant difference between those in different groups. For example, if we are grouping our data based on a person’s age, our goal will be to determine which age group has a higher survival chance. Is that children’s group, adult group, or old persons’ group? Now what we need to find is on what basis we make a group? To find that, we use cox regression and find the coefficients of different parameters. Let’s see how that works!
Basics of the Cox-proportional Hazard Method
The ultimate purpose of the cox-proportional hazard method is to notice how different factors in our dataset impact the event of interest.
Hazard function:
The value of exp(bi) is called the Hazard Ratio (HR). We will understand this by taking an example.
Let’s code:
1) Import required libraries:
2) Read the CSV file:
3) Delete rows that contain null values:
Next, we need to delete the rows which have null values. Our model cannot work on rows which has null values. If we do not preprocess our data, then we might get an error.
4) Create an object for the KapanMeierFitter:
5) Organize the data:
6) Fit the data into an object:
7) Generating the event table:
8) Get the required columns:
9) Fit the data and print the summary:
In the picture above, notice the p-value for each column in our dataset. Next, we know that p-value<0.05 is considered statistically significant. Here we can see that “sex” and “ph.ecog” have p-values less than 0.05. So we can say that while grouping our data for analysis, we should focus on dividing the data based on these two factors.
Here notice the p-value for “sex” is 0.01, and the Hazard Ratio(HR) is 0.57, which indicates a strong relationship between the patients’ sex and decreased risk of death. For example, holding the other covariates constant, being female (sex=2) reduces the hazard by a factor of 0.57, or 43%. That means that females have higher survival chances. Next, the p-value for ph.ecog is <0.005, and the Hazard Ratio(HR) is 2.09, which indicates a strong relationship between the ph.ecog value and the increased risk of death. Holding the other covariates constant, a higher value of ph.ecog is associated with poor survival. Here person with higher ph.ecog value has a 109% higher risk of death. So, in short, we can say that doctors should try to reduce the value of ph.ecog in patients by providing relevant medicines. Next, notice that the Hazard Ratio(HR) for Age is 1.01, suggesting only a 1% increase for a higher age group. So we can say that there is no significant difference between different age groups.
10) Check which factor affects the most from the graph:
In the following graph, we can notice the difference in “sex” and “ph.ecog” data.
11) Check our theory with real observations:
Let’s check our conclusions with real data from our observations.
In the above graph, we can see that person 23 has the highest chance of survival, while person 17 has the least chance of survival. By checking the main table, we can notice a significant change in the ph.ecog value. We can also see that people 21 and 23 have higher chances of survival as they have the least value of ph.ecog.
Okay, so this is it for this tutorial. Thank you for reading. Your feedback is always welcome.
This is a revamped version of the original article published on KDNuggets.
📚 If you enjoyed this piece, check out our tutorial on neural networks from scratch with Python code and math in detail.📚
DISCLAIMER: The views expressed in this article are those of the author(s) and do not represent the views of Carnegie Mellon University. These writings do not intend to be final products, yet rather a reflection of current thinking, along with being a catalyst for discussion and improvement.
Published via Towards AI
Citation
For attribution in academic contexts, please cite this work as:
Shukla, et al., “Survival Analysis with Python Tutorial - How, What, When, and Why”, Towards AI, 2020
BibTex citation:
@article{pratik_iriondo_2020, title={Survival Analysis with Python Tutorial - How, What, When, and Why}, url={https://towardsai.net/survival-analysis-with-python}, journal={Towards AI}, publisher={Towards AI Co.}, author={Pratik, Shukla}, editor={Iriondo, Roberto}, year={2020}, month={Sep} }
7. Resources
Google Colab Implementation
8. References:
[1] Lifelines Example, https://lifelines.readthedocs.io/en/latest/Examples.html
[2] Kaplan — Meier Estimator, Wikipedia, https://en.wikipedia.org/wiki/Kaplan%E2%80%93Meier_estimator
[3] Lifelines, Univariate NelsonAalenFilter, https://lifelines.readthedocs.io/en/latest/fitters/univariate/NelsonAalenFitter.html
[4] STHDA, Statistical Tools for High-throughput Data Analysis, http://www.sthda.com/english/wiki/cox-proportional-hazards-model