Survival Analysis with Python Tutorial — How, What, When, and Why

Author(s): Pratik Shukla

This article covers an extensive review with step-by-step explanations and code for how to perform statistical survival analysis used to investigate the time some event takes to occur, such as patient survival during the COVID-19 pandemic, the time to failure of engineering products, or even the time to closing a sale after an initial customer contact.

This tutorial’s code is available on Github and its full implementation on Google Colab.

📚 Check out our Monte Carlo Simulation Tutorial with Python 📚

1. Survival Analysis Basics:

Survival analysis is nothing but a set of statistical approaches used to determine the time it takes for an event of interest to occur. We use survival analysis to study the time until some event of interest occurs. Time is usually measured in years, months, weeks, days, and other time measuring units. The event of interest could be anything of interest. It could be an actual death, a birth, a retirement, along with others.

For instance, how can Survival Analysis be useful to analyze the ongoing COVID-19 pandemic data?

We can find the number of days until patients showed COVID-19 symptoms.
We can find for which age group it is deadlier.
We can find which treatment has the highest survival probability.
We can find whether a person’s sex has a significant effect on their survival time?
We can find the median number of days of survival for patients.
We can find which factor has more impact on patients’ survival.

In this tutorial, we are going to perform a thorough analysis of patients with lung cancer. Do not worry if it seems complicated. Once we go through the logic behind it, we will have the ability to perform survival analysis on any data set. Exciting! Isn’t it?

Survival analysis is used in a variety of field such as:

Cancer studies for patients survival time analyses.
Sociology for “event-history analysis.”
In Engineering for “failure-time analysis.”
Time until product failure.
Time until a warranty claim.
Time until a process reaches a critical level.
Time from initial sales contact to a sale.
Time from employee hire to either termination or quit.
Time from a salesperson hires to their first sale.

In cancer studies, typical research questions are:

What is the impact of specific clinical characteristics on patient’s survival? For example, is there any difference between people who have higher blood sugar and those who do not?
What is the probability that an individual survives a specific time (years, months, days)? For example, given a set of cancer patients, we will tell that if 300 days after a cancer diagnosis has been passed, then the probability of that person being alive at that time will be 0.7.
Are there differences in survival between groups of patients? For example, Let’s say there are two groups of people diagnosed with cancer. Those two groups were given two different kinds of treatments. Our goal here will be to find out if there is a significant difference between the survival time for those two different groups based on the treatment they were given.

Objectives

In this tutorial, we will see the following methods of survival analysis in detail:

1) Kaplan-Meier plots to visualize survival curves.

2) Nelson-Aalen plots to visualize the cumulative hazard.

3) Log-Rank test to compare the survival curves of two or more groups

4) Cox-proportional hazards regression finds out the effect of different variables like age, sex, and weight on survival.

Fundamental concepts

We will start this tutorial by understanding some basic definitions and concepts related to survival analysis.

Survival time and type of events in cancer studies

Survival Time: It is usually referred to as an amount of time until when a subject is alive or actively participates in a survey.

There are three main types of events in survival analysis:

1) Relapse: Relapse is defined as a deterioration in the subject’s state of health after a temporary improvement.

2) Progression: Progression is defined as the process of developing or moving gradually towards a more advanced state. It basically means that the health of the subject under observation is improving.

3) Death: Death is defined as the destruction or permanent end of something. In our case, death will be our event of interest.

Censoring

As we discussed above, survival analysis focuses on the occurrence of an event of interest. The event of interest can be anything like birth, death, or retirement. However, there is still a possibility that the event we are interested in does not occur. Such observations are known as censored observations.

There are three types of censoring:

Right Censoring: The subject under observation is still alive. In this case, we can not have our timing when our event of interest(death) occurs.
Left Censoring: In this type of censoring, the event cannot be observed for some reason. It may also include the event that occurred before the experiment started, such as the number of days from birth when the kid started walking.

3) Interval Censoring: In this type of data censoring, we only have data for a specific interval, so it is possible that the event of interest does not occur during that time.

Censoring may occur in the following instances:

A patient has not (yet) experienced the event of interest (death or relapse in our case) within a period.
A patient is not followed anymore.
If a patient moves to another city, then follow-up might not be possible for the hospital staff.
We only have the data for a specific interval.

Survival and hazard functions

We generally use two related probabilities to analyze survival data for a subject.

1) Survival Function(S)

2) Hazard Function (H)

To find the survival probability of a subject, we will use the survival function S(t), the Kaplan-Meier Estimator. The survival function is defined as the probability that an individual (subject) survives from the time origin (diagnosis of a disease) to a specified future time t. Please note that the time can be in various forms like minutes, days, weeks, months, or years. For example, S(200)=0.7 means that after 200 days, a subject’s survival probability is 0.7. In many deadly diseases, the survival probability decreases as the period increases. If the subject is alive at the end of an experiment, then that data will be censored.

The hazard probability, denoted by H(t), is the probability that an individual (subject) who is under observation at a time t has an event (death) at that time. For example, If h(200) = 0.7 means that after 200 days or on the 200th day, the probability of being dead is 0.7. One thing to keep in mind here is that the hazard function gives us the cumulative probability. We will discuss this in detail later in this tutorial.

Notice that, in contrast to the survival function, which focuses on the survival of a subject, the hazard function gives us the probability of a subject being dead on a given time. We can note that higher survival probability and lower hazard probability is good for the subject’s health.

Let’s move forward to the cool coding part!

Download the public dataset from the UPC.

Data Description:

2. Kaplan-Meier Estimator Theory and Example

The Kaplan–Meier estimator is a non-parametric statistic used to estimate the survival function (probability of a person surviving) from the lifetime data. In medical research, it is often used to measure the fraction of patients living for a specific time after treatment or diagnosis. For example: Calculating the amount of time(year, month, day) a particular patient lived after he/she was diagnosed with cancer or his treatment starts. The estimator is named after Edward L. Kaplan and Paul Meier, who submitted similar manuscripts to the American Statistical Association Journal.

The probability of survival at time ti, which is denoted by S(ti), is calculated as follow:

Figure 2: Formula to calculate the probability of survival at time ti. | Survival Analysis with Python Tutorial — Figure 2: Formula to calculate the probability of survival at time ti.

We can also write the equation above in a simple form as follows:

Figure 3: Probability of survival time in a simple form S(ti). | Survival Analysis with Python Tutorial — Figure 3: Probability of survival time in a simple form S(ti).

For Example:

1) Survival probability at time t=1:

Survival probability at time t=1 formula | Survival Analysis with Python Tutorial — Figure 4: Survival probability at time t=1 formula

2) Survival probability at time t=2:

Figure 5: Survival probability at time t=2 formula

3) Survival probability at time t=3:

Figure 6: Survival probability at time t=3 formula

In a more generalized way, the probability of survival for a particular time is given by.

Figure 7: Generalized formula for the probability of survival for a particular time.

From the above equations, we can confidently say that.

Figure 8: Expressing survival generalization.

Kaplan-Meier Estimator (Without any groups)

1) Import required libraries:

Figure 9: Importing pandas, numpty, matplotlib.pyplot, and lifelines | Survival Analysis with Python Tutorial — Figure 9: Importing pandas, numpty, matplotlib.pyplot, and lifelines.

2) Read the data set:

Figure 10: Reading the dataset. | Survival Analysis with Python Tutorial — Figure 10: Reading the dataset.

3) Print the columns in our data set:

Figure 11: Printing the data columns. | Survival Analysis with Python Tutorial — Figure 11: Printing the data columns.

4) Get additional information about the dataset:

It gives us information about the data type of the columns along with their null-value counter. We need to remove the rows with a null value for some of the survival analysis methods.

Figure 12: Additional info about our dataset. | Survival Analysis with Python Tutorial — Figure 12: Additional info about our dataset.

5) Get statistical information about the dataset:

It gives us some statistical information like the total number of rows, mean, standard deviation, minimum value, 25th percentile, 50th percentile, 75th percentile, and maximum value for each column.

Figure 13: Obtaining statistical info about our dataset. | Survival Analysis with Python Tutorial — Figure 13: Obtaining statistical info about our dataset.

6) Find out sex distribution using histogram:

This gives us a general idea about how our data is distributed. In the following graph, we can see that around 139 values have a status of 1, and approximately 90 values have a status of 2, which means that there are 139 males and around 90 females in our dataset.

Figure 14: Plotting histogram for the sex of the patient. | Survival Analysis with Python Tutorial — Figure 14: Plotting histogram for the sex of the patient.

7) Create an object for Kaplan-Meier-Fitter:

Figure 15: Creating an object for the Kaplan-Meier-Fitter. | Survival Analysis with Python Tutorial — Figure 15: Creating an object for the Kaplan-Meier-Fitter.

8) Organize the data:

Now we need to organize our data. We will add a new column in our dataset that is called “dead.” It stores the data about whether a person that is a part of our experiment is dead or alive(based on the status value). If our status value is 1, then that person is alive, and if our status value is 2, then the person is dead. It is a crucial step for what we need to do in the next step as we are going to store our data in columns called censored and observed. Where observed data stores the value of dead persons in a specific timeline, and censored data stores the value of alive persons or persons that we are not going to investigate.

Figure 16: Organizing our data. | Survival Analysis with Python Tutorial — Figure 16: Organizing our data.

9) Fitting our data into an object:

Here our goal is to find the number of days a patient survived before they died. Our event of interest will be “death,” which is stored in the “dead” column. The first argument it takes is the timeline for our experiment.

Figure 17: Fitting the parameter values in our object. | Survival Analysis with Python Tutorial — Figure 17: Fitting the parameter values in our object.

10) Generate event table:

One of the most crucial methods of the kmf object is the “event_table.” It gives us various information for our survival analysis. Let’s have a look at it column-by-column.

Figure 18: Printing the event table. | Survival Analysis with Python Tutorial — Figure 18: Printing the event table.

a) event_at: It stores the value of the timeline for our dataset. i.e., when was the patient observed in our experiment or when was the experiment conducted. It can be several minutes, days, months, years, and others. In our case, it is going to be for many days. It stores the value of survival days for the subjects.

b) at_risk: It stores the number of current patients under observation. In the beginning, it will be the total number of patients we are going to observe in our experiment. If new patients are added at a particular time, then we have to increase their value accordingly. Therefore:

Figure 19: Analyzing at_risk variable. | Survival Analysis with Python Tutorial — Figure 19: at_risk variable formula.

c) entrance: It stores the value of new patients in a given timeline. It is possible that while experimenting, other patients are also diagnosed with the disease. To account for that, we have the entrance column.

d) censored: Our ultimate goal is to find the survival probability for a patient. At the end of the experiment, if the person is still alive, we will add him/her to the censored category. We have already discussed the types of censoring.

e) observed: It stores the value of the number of subjects that died during the experiment. From a broad perspective, these are the people who met our event of interest.

f) removed: It stores the values of patients that are no longer part of our experiment. If a person dies or is censored, then he/she falls into this category. In short, it is an addition of the data in the observed and censored category.

Figure 20: removed variable formula. | Survival Analysis with Python Tutorial — Figure 20: removed variable formula.

11) Calculating the probability of survival for individual timelines:

Let’s first see the formula for calculating the survival of a particular person at a given time.

Figure 21: Calculating the probability of survival for individual timelines. | Survival Analysis with Python Tutorial — Figure 21: Calculating the probability of survival for individual timelines.

a) Survival probability at t=0 only:

Figure 22: Calculating the survival probability of t=0. | Survival Analysis with Python Tutorial — Figure 22: Calculating the survival probability of t=0.

Figure 23: Calculating the survival probability for a given time and t=0. | Survival Analysis with Python Tutorial — Figure 23: Calculating the survival probability for a given time and t=0.

b) Survival probability at t=5 only:

Figure 25: Calculating the survival probability for a given time. | Survival Analysis with Python Tutorial — Figure 25: Calculating the survival probability for a given time.

c) Survival probability at t=11 only:

Figure 26: Survival probability at t=11. | Survival Analysis with Python Tutorial — Figure 26: Survival probability at t=11.

Figure 27: Calculating the survival probability for a given time. | Survival Analysis with Python Tutorial — Figure 27: Calculating the survival probability for a given time.

Now what we found here is the probability for a specific time. What we want is the probability for the entire time for a patient. i.e., the probability of patient surviving all the rounds of the experiment.

In a nutshell, we want to find the probability of a person surviving all of the time he lived after diagnosis. What we just found is the probability of a particular experiment only.

Let us take a straightforward example to understand the concept of conditional probability. For instance, we have a total of 15 balls in a non-transparent box. Out of the 15 balls, we are seven black balls, five red balls, and three green balls. Here is a pictorial view for that.

Figure 28: Ball figure example. | Survival Analysis with Python Tutorial — Figure 28: Ball figure example.

a) Probability of choosing a red ball:

Figure 29: Probability of choosing a red ball. | Survival Analysis with Python Tutorial — Figure 29: Probability of choosing a red ball.

b) Probability of choosing the second red ball:

Since we’ve removed a ball that was red, the total number of red balls we have is 4, and the total number of balls we have is 14.

Figure 30: Probability of choosing the second red ball. | Survival Analysis with Python Tutorial — Figure 30: Probability of choosing the second red ball.

If our question is to find the probability of both the balls being red, we will multiply it, and that is precisely what we are going to do in survival analysis. We know that a patient has survived the 1st time interval, and we want to find the probability of him surviving the second time interval given that he has survived the 1st time interval. My point here is we do not want to find the probability of the second time interval only. We want the total probability of him surviving the entire period.

In our example, the probability of both balls being red is as following:

Figure 31: Calculating the probability of both balls being red. | Survival Analysis with Python Tutorial — Figure 31: Calculating the probability of both balls being red.

In survival analysis, we can write the formula as follows:

Figure 32: Calculating S(n). | Survival Analysis with Python Tutorial — Figure 32: Calculating S(n).

12) Finding survival probability:

We want to find the probability that a patient has survived through all the timeline till now. Now we need to find the actual survival probability for a patient.

a) Survival probability for t=0:

Figure 33: Calculating the survival probability for t=0. | Survival Analysis with Python Tutorial — Figure 33: Calculating the survival probability for t=0.

Figure 34: Calculating the actual survival probability at a given time. | Survival Analysis with Python Tutorial — Figure 34: Calculating the actual survival probability at a given time.

b) Survival probability for t=5:

Figure 35: Calculating the survival probability for t=5. | Survival Analysis with Python Tutorial — Figure 35: Calculating the survival probability for t=5.

c) Survival probability for t=11:

Figure 37: Calculating the survival probability for t=11. | Survival Analysis with Python Tutorial — Figure 37: Calculating the survival probability for t=11.

Figure 38: Calculating the actual survival probability at a given time. | Survival Analysis with Python Tutorial — Figure 38: Calculating the actual survival probability at a given time.

13) Predicting the probability:

Now the kmf object’s predict function does all of this work for us. However, it is always good practice to know the logic behind it.

Figure 39: Displaying the probability values the easy way. | Survival Analysis with Python Tutorial — Figure 39: Displaying the probability values the easy way.

14) Finding the survival probability for an array of the timeline:

Figure 40: Predicting the survival probability for an array of values. | Survival Analysis with Python Tutorial — Figure 40: Predicting the survival probability for an array of values.

15) Get survival probability for the whole timeline:

Figure 41: Getting the survival probability for the whole timeline. | Survival Analysis with Python Tutorial — Figure 41: Getting the survival probability for the whole timeline.

The survival probability for a patient at timeline 0 is 1. Grasping our thoughts, then we gather that the probability that a person dies on the 1st day of diagnosis is near equals to 0. So we can say that the survival probability is as high as possible. As the timeline increases, the probability of survival decreases for a patient.

16) Plot the graph:

Figure 42: Plotting the probability of survival. | Survival Analysis with Python Tutorial — Figure 42: Plotting the probability of survival.

17) The median number of survival days:

It provides the number of days where, on average, 50% of the patients survived.

Figure 43: Displaying the median number of days. | Survival Analysis with Python Tutorial — Figure 43: Displaying the median number of days.

From the code above, we can say that on average, a person lived 310 days after the day of diagnosis.

18) Survival probability with confidence interval:

Figure 44: Estimating the survival probability with a confidence interval. | Survival Analysis with Python Tutorial — Figure 44: Estimating the survival probability with a confidence interval.

19) Graph for survival probability with confidence interval:

Figure 45: Plotting the survival function with a confidence interval. | Survival Analysis with Python Tutorial — Figure 45: Plotting the survival function with a confidence interval.

Now all the information we have is for the survival of a person. Now we will see what is the probability for a person to die at a specific timeline. Here notice that a higher survival probability is suitable for a person, but higher cumulative density (probability of a person to die) is not so good!

20) Probability of a person dying:

Figure 46: Probability of a person dying. | Survival Analysis with Python Tutorial — Figure 46: Probability of a person dying.

Here the denominator value is subjected at risk in the previous row.

The formula for cumulative density:

a) Probability of a person dying at t=0:

Figure 48: Probability of a person dying at t=0. | Survival Analysis with Python Tutorial — Figure 48: Probability of a person dying at t=0.

b) Probability of a person dying at t=5:

Figure 49: Probability of a person dying at t=5. | Survival Analysis with Python Tutorial — Figure 49: Probability of a person dying at t=5.

c) Probability of a person dying at t=11:

Figure 50: Probability of a person dying at t=11. | Survival Analysis with Python Tutorial — Figure 50: Probability of a person dying at t=11.

Find the cumulative density:

d) Cumulative density at t=0:

Figure 51: Calculating the cumulative density at t=0. | Survival Analysis with Python Tutorial — Figure 51: Calculating the cumulative density at t=0.

e) Cumulative density at t=5:

f) Cumulative density at t=11:

Figure 53: Calculate the cumulative density at t=11. | Survival Analysis with Python Tutorial — Figure 53: Calculate the cumulative density at t=11.

Figure 54: Displaying the probability of a subject dying. | Survival Analysis with Python Tutorial — Figure 54: Displaying the probability of a subject dying.

21) Plot the graph for cumulative density:

Figure 55: Plotting the cumulative density. | Survival Analysis with Python Tutorial — Figure 55: Plotting the cumulative density.

Notice that, as the number of survival days increases the probability of a person dying increases.

22) The cumulative density with confidence interval:

Figure 56: Calculate the cumulative density with confidence interval. | Survival Analysis with Python Tutorial — Figure 56: Calculate the cumulative density with a confidence interval.

23) Graph for cumulative density with a confidence interval:

Figure 57: Plotting the cumulative density with a confidence interval. | Survival Analysis with Python Tutorial — Figure 57: Plotting the cumulative density with a confidence interval.

24) Get cumulative density for a particular day:

Figure 58: Find cumulative density at a specific time. | Survival Analysis with Python Tutorial — Figure 58: Find cumulative density at a specific time.

25) The median time to an event:

We can get the amount of time remaining from the median survival time.

Figure 59: Calculating the conditional median time to an event of interest.

26) Graph for the median time to the event:

Figure 60: Plotting the graph for the median time to an event. | Survival Analysis with Python Tutorial — Figure 60: Plotting the graph for the median time to an event.

3. Estimating hazard rates using Nelson-Aalen

Hazard function H(t):

Until now, we discussed the Kaplan-Meier survival function. Using that, we can get the probability of the event of interest (death in our case) not occurring by that time. The survival functions are a great way to summarize and visualize the survival dataset; however, it is not the only way. We can visualize the aggregate information on survival using the Nelson-Aalen hazard function h(t). The hazard function h(t) gives us the probability that a subject under observation at time t has an event of interest (death) at that time. To get the information about the hazard function, we cannot transform the Kaplan-Meier estimator. For that, there is a proper nonparametric estimator of the cumulative hazard function:

Cumulative Hazard Function:

1) Import required libraries:

Figure 62: Importing NelsonAalenFitter from lifelines. | Survival Analysis with Python Tutorial — Figure 62: Importing NelsonAalenFitter from lifelines.

2) Create an object of Nelson-Aalen-Fitter:

Figure 63: Creating an object of Nelson-Aalen-Fitter. | Survival Analysis with Python Tutorial — Figure 63: Creating an object of Nelson-Aalen-Fitter.

3) Fitting the data:

Figure 64: Fitting the data into the object. | Survival Analysis with Python Tutorial — Figure 64: Fitting the data into the object.

4) Finding the cumulative hazard:

Here we’ll use the event table generated in the previous part to understand how the hazard function actually works.

Figure 65: Finding the cumulative hazard. | Survival Analysis with Python Tutorial — Figure 65: Finding the cumulative hazard.

Here is the formula to find the non-cumulative hazard probability at a specific time:

Figure 66: Formula to calculate the non-cumulative hazard probability at a specific time.| Survival Analysis with Python Mach — Figure 66: Formula to calculate the non-cumulative hazard probability at a specific time.

a) Finding the hazard probability at t=0:

Figure 67: Finding the hazard probability at t=0.| Survival Analysis with Python Machine Learning Tutorial — Figure 67: Finding the hazard probability at t=0.

b) Finding the hazard probability at t=5:

Figure 68: Finding the hazard probability at t=5.| Survival Analysis with Python Machine Learning Tutorial — Figure 68: Finding the hazard probability at t=5.

c) Finding the hazard probability at t=11:

Figure 69: Finding the hazard probability at t-11.| Survival Analysis with Python Machine Learning Tutorial — Figure 69: Finding the hazard probability at t-11.

d) Finding the cumulative hazard probability at t=0:

Figure 70: Finding the cumulative hazard probability at t=0.| Survival Analysis with Python Machine Learning Tutorial — Figure 70: Finding the cumulative hazard probability at t=0.

e) Finding the cumulative hazard probability at t=5:

Figure 71: Finding the cumulative hazard probability at t=5.

f) Finding the cumulative hazard probability at t=11:

Figure 72: Finding the cumulative hazard probability at t=11.

Figure 73: Displaying the cumulative hazard.

5) Plot the graph for cumulative hazard:

Figure 74: Plot the graph for cumulative hazard.| Survival Analysis with Python Machine Learning Tutorial — Figure 74: Plot the graph for cumulative hazards.

The cumulative hazard has a less clear understanding than the survival functions, but the hazard functions are based on more advanced survival analysis techniques.

6) Predict a value:

Figure 75: Predicting the value of a certain point.| Survival Analysis with Python Machine Learning Tutorial — Figure 75: Predicting the value of a certain point.

7) Cumulative hazard probability with confidence interval:

Figure 76: Calculating the cumulative hazard probability with confidence interval. | Survival Analysis with Python Tutorial — Figure 76: Calculating the cumulative hazard probability with a confidence interval.

8) Graph for cumulative hazard probability with confidence interval:

Figure 77: Plotting the confidence interval.| Survival Analysis with Python Machine Learning Tutorial — Figure 77: Plotting the confidence interval.

9) Cumulative hazard vs. cumulative density:

Figure 78: Plotting the cumulative hazard and cumulative density| Survival Analysis with Python Machine Learning Tutorial — Figure 78: Plotting the cumulative hazard and cumulative density

4. Kaplan-Meier Estimator with groups

Until now, we saw how we could find the survival probability and hazard probability for all of our observations. Now it is time to perform some analysis on our data to determine whether there is any difference in survival probability if we divide our data into groups based on specific characteristics. Let’s divide our data into two groups based on sex: Male and Female. Our goal here is to check is there any significant difference in survival rate if we divide our dataset based on sex. Later in this tutorial, we will see on what basis do we divide the data into groups.

1) Import required libraries:

Figure 79: Importing pandas, numpy, matplotlib.pyplot and KaplanMeierFitter from lifelines in Python. — Figure 79: Importing pandas, numpy, matplotlib.pyplot, and KaplanMeierFitter from lifelines in Python.

2) Read the dataset:

Figure 80: Reading the dataset. | Survival Analysis with Python Machine Learning Tutorial — Figure 80: Reading the dataset.

3) Organize our data:

Figure 81: Organizing the data. | Survival Analysis with Python Machine Learning Tutorial — Figure 81: Organizing the data.

4) Create two objects of Kaplan-Meier-Fitter():

Figure 82: Creating the two objects of the Kaplan-Meier-Fitter. | Survival Analysis with Python Machine Learning Tutorial — Figure 82: Creating the two objects of the Kaplan-Meier-Fitter.

5) Divide the data into groups:

Figure 83: Dividing the data into groups. | Survival Analysis with Python Machine Learning Tutorial — Figure 83: Dividing the data into groups.

6) Male data:

Figure 84: Viewing the data of the male group. | Survival Analysis with Python Machine Learning Tutorial — Figure 84: Viewing the data of the male group.

7) Female data:

Figure 85: Displaying the data of the female group. | Survival Analysis with Python Machine Learning Tutorial — Figure 85: Displaying the data of the female group.

8) Fit data into our objects:

Figure 86: Fitting the male and female data into objects.

9) Event table for the male group:

Figure 87: Event table for the male group. | Survival Analysis with Python Machine Learning Tutorial — Figure 87: Event table for the male group.

10) Event table for the female group:

Figure 88: Event table for the female group. | Survival Analysis with Python Machine Learning Tutorial — Figure 88: Event table for the female group.

11) Predicting survival probabilities:

Now we can predict the survival probability for both the groups.

Figure 89: Predicting the value based on time. | Survival Analysis with Python Machine Learning Tutorial — Figure 89: Predicting the value based on time.

Figure 90: Predicting the value based on time. | Survival Analysis with Python Machine Learning Tutorial — Figure 90: Predicting the value based on time.

12) Get the complete list of survival probabilities:

a) Survival probability for a male group:

Figure 91: Get complete data of the survival function for the male group. | Survival Analysis with Python Machine Learning T — Figure 91: Get complete data of the survival function for the male group.

b) Survival probability for the female group:

Figure 92: Get complete data of the survival function for the male group. | Survival Analysis with Python Machine Learning Tu — Figure 92: Get complete data of the survival function for the male group.

13) Plot the graph for survival probabilities:

Figure 93: Plotting the graph for survival probabilities. | Survival Analysis with Python Machine Learning Tutorial — Figure 93: Plotting the graph for survival probabilities.

Here we can notice that the probability of females surviving lung cancer is higher than that of males. Therefore, from this data, we can say that medical researchers should focus more on the factors that lead to male patients’ poor survival rates.

14) Get the cumulative density:

a) For the male group:

Figure 94: Cumulative density for the male group. | Survival Analysis with Python Machine Learning Tutorial — Figure 94: Cumulative density for the male group.

b) For the female group:

Figure 95: Cumulative density for the female group. | Survival Analysis with Python Machine Learning Tutorial — Figure 95: Cumulative density for the female group.

15) Plot the graph for cumulative density:

Figure 96: Plotting the graph for cumulative density for both groups.

16) Hazard function:

Figure 97: Importing the NelsonAalenFitter.

17) Fit the data into our objects:

Figure 98: Fitting the data into our objects.

18) Cumulative hazard probability:

a) For the male group:

Figure 99: Cumulative hazard for the male group. | Survival Analysis with Python Machine Learning Tutorial — Figure 99: Cumulative hazard for the male group.

b) For the female group:

Figure 100: Cumulative hazard for the female group. | Survival Analysis with Python Machine Learning Tutorial — Figure 100: Cumulative hazard for the female group.

19) Plot the graph for cumulative hazard probability:

Figure 101: Plotting the graph for cumulative hazard. | Survival Analysis with Python Machine Learning Tutorial — Figure 101: Plotting the graph for cumulative hazard.

20) The median time to event for the male group:

Figure 102: Finding the conditional median to the event of interest. | Survival Analysis with Python Machine Learning Tutoria — Figure 102: Finding the conditional median to the event of interest.

21) The median time to event graph for the male group:

Figure 103: Plotting the conditional time to event for the male group. | Survival Analysis with Python Machine Learning Tutor — Figure 103: Plotting the conditional time to event for the male group.

22) The median time to event for the female group:

Figure 104: Finding the conditional median time to event of interest for the female group.

23) The median time to event graph for the female group:

Figure 105: Conditional median time left for event for the female group. — Figure 105: Conditional median time left for an event for the female group.

24) Survival probability with a confidence interval for the male group:

Figure 106: Calculating the survival probability with a confidence interval for the male group.

25) Survival probability graph with a confidence interval for the male group:

Figure 107: Confidence survival function and plot with a confidence interval for the male group.

26) Survival probability with a confidence interval for the female group:

Figure 108: Survival probability with a confidence interval for the female group.

27) Survival probability graph with a confidence interval for the female group:

Figure 109: Plotting the survival function with a confidence interval for the female group.

28) Comparison of cumulative density vs. cumulative hazard:

a) For the male group:

Figure 110: Plot the cumulative hazard and cumulative density.

b) For the female group:

Figure 111: Plotting the cumulative hazard and cumulative density.

5. Log-Rank Test:

The log-rank test is a hypothesis test that is used to compare the survival distribution of two samples.

Goal: Our goal is to see if there is any significant difference between the groups being compared.

Null Hypothesis: The null hypothesis states that there is no significant difference between the groups being studied. If there is a significant difference between those groups, then we have to reject our null hypothesis.

How do we say that there is a significant difference?

A p-value between 0 and 1 denotes the statistical significance. The smaller the p-value, the more significant the statistical difference between groups being studied is. Notice that our goal is to find if there is any difference between the groups we are comparing. If yes, we can do more research on why there are lower survival chances for a particular group based on various information like their diet, lifestyle, and others.

Less than (5% = 0.05) P-value means there is a significant difference between the groups we compared. We can partition our groups based on their sex, age, race, treatment method, and others.

It’s a test to find out the value of P.

1) Get the variables for the Log-rank test:

Figure 112: Defining the variables for the log-rank test. | Survival Analysis with Python Machine Learning Tutorial — Figure 112: Defining the variables for the log-rank test.

2) Performing the Log-rank test:

Figure 113: Performing the log-rank test.

3) Print the p-value:

We have compared the survival distributions of two different groups using the famous statistical method, the Log-rank test. Here we can notice that the p-value is 0.00131(<0.005) for our groups, which denotes that we have to reject the null hypothesis and admit that the survival function for both groups is significantly different. The p-values give us strong evidence that “sex” was associated with the number of survival days. In short, we can say that the “sex” of a person makes a significant difference in survival probability.

6. Cox-proportional hazard model

The cox-proportional hazard model is a regression model generally used by medical researchers to determine the relationship between the survival time of a subject and one or more predictor variables. In short, we want to find out how different parameters like age, sex, weight, height affects the survival time of a subject.

In the previous section, we saw Kaplan-Meier, Nelson-Aalen, and Log-Rank-Test. However, in that, we were only able to consider one variable at a time, and one more thing to notice is that we were performing operations only on categorical variables like sex, status, and others. It can not be used for non-categorical data like age, weight, or height. As a solution for that, we use the Cox proportional hazards regression analysis, which works for both quantitative predictors non-categorical variables and for categorical variables.

Why do we need it?

In medical research, we are generally considering more than one factor to diagnose a person’s health or survival time. i.e., we generally make use of their sex, age, blood pressure, and blood sugar to find out if there is any significant difference between those in different groups. For example, if we are grouping our data based on a person’s age, our goal will be to determine which age group has a higher survival chance. Is that children’s group, adult group, or old persons’ group? Now what we need to find is on what basis we make a group? To find that, we use cox regression and find the coefficients of different parameters. Let’s see how that works!

Basics of the Cox-proportional Hazard Method

The ultimate purpose of the cox-proportional hazard method is to notice how different factors in our dataset impact the event of interest.

Hazard function:

The value of exp(bi) is called the Hazard Ratio (HR). We will understand this by taking an example.

Figure 116: Defining the hazard ratio. | Survival Analysis with Python Machine Learning Tutorial — Figure 116: Defining the hazard ratio.

Let’s code:

1) Import required libraries:

Figure 117: Importing the required libraries with Python. | Survival Analysis with Python Machine Learning Tutorial — Figure 117: Importing the required libraries with Python.

2) Read the CSV file:

Figure 118: Read the CSV file. | Survival Analysis with Python Machine Learning Tutorial — Figure 118: Read the CSV file.

3) Delete rows that contain null values:

Next, we need to delete the rows which have null values. Our model cannot work on rows which has null values. If we do not preprocess our data, then we might get an error.

4) Create an object for the KapanMeierFitter:

Figure 120: Create an object for the KapanMeierFitter.

5) Organize the data:

6) Fit the data into an object:

Figure 122: Fitting the data into an object.

7) Generating the event table:

8) Get the required columns:

Figure 124: Get the required columns from the data.

9) Fit the data and print the summary:

Figure 125: Get the summary using CoxPHFitter.

In the picture above, notice the p-value for each column in our dataset. Next, we know that p-value<0.05 is considered statistically significant. Here we can see that “sex” and “ph.ecog” have p-values less than 0.05. So we can say that while grouping our data for analysis, we should focus on dividing the data based on these two factors.

Here notice the p-value for “sex” is 0.01, and the Hazard Ratio(HR) is 0.57, which indicates a strong relationship between the patients’ sex and decreased risk of death. For example, holding the other covariates constant, being female (sex=2) reduces the hazard by a factor of 0.57, or 43%. That means that females have higher survival chances. Next, the p-value for ph.ecog is <0.005, and the Hazard Ratio(HR) is 2.09, which indicates a strong relationship between the ph.ecog value and the increased risk of death. Holding the other covariates constant, a higher value of ph.ecog is associated with poor survival. Here person with higher ph.ecog value has a 109% higher risk of death. So, in short, we can say that doctors should try to reduce the value of ph.ecog in patients by providing relevant medicines. Next, notice that the Hazard Ratio(HR) for Age is 1.01, suggesting only a 1% increase for a higher age group. So we can say that there is no significant difference between different age groups.

10) Check which factor affects the most from the graph:

In the following graph, we can notice the difference in “sex” and “ph.ecog” data.

11) Check our theory with real observations:

Let’s check our conclusions with real data from our observations.

In the above graph, we can see that person 23 has the highest chance of survival, while person 17 has the least chance of survival. By checking the main table, we can notice a significant change in the ph.ecog value. We can also see that people 21 and 23 have higher chances of survival as they have the least value of ph.ecog.

Okay, so this is it for this tutorial. Thank you for reading. Your feedback is always welcome.

This is a revamped version of the original article published on KDNuggets.

📚 If you enjoyed this piece, check out our tutorial on neural networks from scratch with Python code and math in detail.📚

DISCLAIMER: The views expressed in this article are those of the author(s) and do not represent the views of Carnegie Mellon University. These writings do not intend to be final products, yet rather a reflection of current thinking, along with being a catalyst for discussion and improvement.

Published via Towards AI

Citation

For attribution in academic contexts, please cite this work as:

Shukla, et al., “Survival Analysis with Python Tutorial - How, What, When, and Why”, Towards AI, 2020

BibTex citation:

@article{pratik_iriondo_2020, 
 title={Survival Analysis with Python Tutorial - How, What, When, and Why}, 
 url={https://towardsai.net/survival-analysis-with-python}, 
 journal={Towards AI}, 
 publisher={Towards AI Co.}, 
 author={Pratik, Shukla}, 
 editor={Iriondo, Roberto}, 
 year={2020}, 
 month={Sep}
}

7. Resources

Google Colab Implementation

Github Repository

8. References:

[1] Lifelines Example, https://lifelines.readthedocs.io/en/latest/Examples.html

[2] Kaplan — Meier Estimator, Wikipedia, https://en.wikipedia.org/wiki/Kaplan%E2%80%93Meier_estimator

[3] Lifelines, Univariate NelsonAalenFilter, https://lifelines.readthedocs.io/en/latest/fitters/univariate/NelsonAalenFitter.html

[4] STHDA, Statistical Tools for High-throughput Data Analysis, http://www.sthda.com/english/wiki/cox-proportional-hazards-model

Cookie	Duration	Description
cookielawinfo-checkbox-analytics	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Analytics".
cookielawinfo-checkbox-functional	11 months	The cookie is set by GDPR cookie consent to record the user consent for the cookies in the category "Functional".
cookielawinfo-checkbox-necessary	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookies is used to store the user consent for the cookies in the category "Necessary".
cookielawinfo-checkbox-others	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Other.
cookielawinfo-checkbox-performance	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Performance".
viewed_cookie_policy	11 months	The cookie is set by the GDPR Cookie Consent plugin and is used to store whether or not user has consented to the use of cookies. It does not store any personal data.

Frequently Used, Contextual References

Resources

Publication

Survival Analysis with Python Tutorial — How, What, When, and Why

Table of Contents:

1. Survival Analysis Basics:

Objectives

Fundamental concepts

Survival time and type of events in cancer studies

Censoring

Survival and hazard functions

Let’s move forward to the cool coding part!

Data Description:

2. Kaplan-Meier Estimator Theory and Example

Kaplan-Meier Estimator (Without any groups)

3. Estimating hazard rates using Nelson-Aalen

Hazard function H(t):

4. Kaplan-Meier Estimator with groups

5. Log-Rank Test:

6. Cox-proportional hazard model

Why do we need it?

Basics of the Cox-proportional Hazard Method

Let’s code:

Citation

BibTex citation:

7. Resources

8. References:

Towards AI Team

Related posts

Feedback ↓ Cancel reply

Popular posts

Updates

Recent Posts

The World’s Leading AI and Technology Publication.

Company

CONTACT US

GDPR CCPA Statement

Subscribe to our AI newsletter!

🔥 Recommended Articles 🔥