Plotting Kaplan-Meier Survival Times Curves In R With ggplot2

Hi. This page will be about plotting Kaplan-Meier survival curves using R with the ggplot2 data visualization package. When it comes to survival times between two groups we are dealing with the statistical field of survival analysis. Survival analysis deals with time to event data. Events can include a patient being ill, bankruptcy, an employee leaving a company, a person exiting a clinical trial and more.

Sections

References

The addicts Dataset

Importing The Data

Taking A Look At The Data

Plotting Survival Curves Using Base R Graphics

Plotting Survival Curves Using ggplot2 and ggfortify

References

R Graphics Cookbook by Winston Chang (2012)

The link http://rpubs.com/sinhrks/plot_surv is useful for understanding ggfortify.

The dataset is from http://web1.sph.emory.edu/dkleinb/surv3.htm

The book that I use for understanding Survival Analysis is called Survival Analysis - A Self Learning Text (3rd Edition, 2012) by David G. Kleinbaum & Mitchel Klein. This book teaches the subject in an applied manner and it is suitable for non-statisticians who wish to study the subject. A slight problem is that the R coding section in this book uses base R graphics and does not mention ggplot2.

The addicts Dataset

This addicts dataset can be downloaded from the website http://web1.sph.emory.edu/dkleinb/allDatasets/surv2datasets/addicts.dta. This is a .dta file or a STATA file so the haven package in R is needed to deal with this file type.

This information is from the Survival Analysis - A Self Learning Text (3rd Edition, 2012). A 1991 Australian study by Caplehorn et al. compared two methadone clinics for heroin addicts. The patient’s survival time (in days) is the amount of time the patient spent at the clinic before dropping out.

In the addicts dataset, the variables are defined as:

ID - Patient ID

SURVT - The time in days until the patient dropped out of the clinic or was censored (missing information).

STATUS - 1 for patient dropped out of the clinic or censored; o otherwise

CLINIC - Methadone Treatment Clinic Number 1 or 2

PRISON - An indicator whether the patient had a prison record. 1 for yes, 0 for no

DOSE - Patient’s maximum methadone does (mg/day, continuous variable)

Importing The Data

In the bookSurvival Analysis - A Self Learning Text (3rd Edition), the addicts dataset is loaded from the C:\ drive in your computer. I propose that you can load this addicts dataset online under the link of http://web1.sph.emory.edu/dkleinb/surv3.htm. The only slight issue is that the file is a .dta file (for STATA users). To fix this, the haven package in R is used to deal with the .dta files.

If the haven package is not installed into R, you can install haven by typing in:

install.packages("haven")

Here is the code for importing the data.

# Survival Analysis - A Self Learning Text (3rd Edition)

# addicts dataset and Chapter One:

# Dataset can be found at http://web1.sph.emory.edu/dkleinb/surv3.htm.

library(haven) #For importing STATA files.
library(survival)

url <- "http://web1.sph.emory.edu/dkleinb/allDatasets/surv2datasets/addicts.dta"

# Import data and turn it into a data frame
addicts <- data.frame(read_dta(url))

The read_data() function is needed to read the .dta file. I then convert this into a data.frame and save it to the variable addicts.

Taking A Look AtThe Data

It is usually a good idea to preview the data to have an idea of what the data looks like and the type of information you are dealing with. The head() and tail() functions are used here to preview the data.

# A look at the addicts dataset:

head(addicts)

##   id clinic status survt prison dose
## 1  1      1      1   428      0   50
## 2  2      1      1   275      1   55
## 3  3      1      1   262      0   55
## 4  4      1      1   183      0   30
## 5  5      1      1   259      1   65
## 6  6      1      1   714      0   55

tail(addicts)

##      id clinic status survt prison dose
## 233 260      1      1    59      1   45
## 234 261      1      1    33      1   60
## 235 262      2      1   540      0   80
## 236 263      2      0   551      0   65
## 237 264      1      1    90      0   40
## 238 266      1      1    47      0   45

It may seem that the id column is redundant at first but if you look at the output from tail(addicts) you see that a few id numbers were skipped. We have 238 rows but the last id number is 266. Keep the id column and work with what we have.

For more information on the variables, the summary() and str() functions can be used.

# Summary and data structure:

summary(addicts)

##        id             clinic          status           survt       
##  Min.   :  1.00   Min.   :1.000   Min.   :0.0000   Min.   :   2.0  
##  1st Qu.: 65.25   1st Qu.:1.000   1st Qu.:0.0000   1st Qu.: 171.2  
##  Median :131.50   Median :1.000   Median :1.0000   Median : 367.5  
##  Mean   :134.13   Mean   :1.315   Mean   :0.6303   Mean   : 402.6  
##  3rd Qu.:205.75   3rd Qu.:2.000   3rd Qu.:1.0000   3rd Qu.: 585.5  
##  Max.   :266.00   Max.   :2.000   Max.   :1.0000   Max.   :1076.0  
##      prison            dose      
##  Min.   :0.0000   Min.   : 20.0  
##  1st Qu.:0.0000   1st Qu.: 50.0  
##  Median :0.0000   Median : 60.0  
##  Mean   :0.4664   Mean   : 60.4  
##  3rd Qu.:1.0000   3rd Qu.: 70.0  
##  Max.   :1.0000   Max.   :110.0

str(addicts)

## 'data.frame':    238 obs. of  6 variables:
##  $ id    : num  1 2 3 4 5 6 7 8 9 10 ...
##   ..- attr(*, "label")= chr "Subject ID"
##   ..- attr(*, "format.stata")= chr "%10.0g"
##  $ clinic: num  1 1 1 1 1 1 1 1 1 1 ...
##   ..- attr(*, "label")= chr "Coded 1 or 2"
##   ..- attr(*, "format.stata")= chr "%10.0g"
##  $ status: num  1 1 1 1 1 1 1 0 1 1 ...
##   ..- attr(*, "label")= chr "status (0=censored, 1=endpoint)"
##   ..- attr(*, "format.stata")= chr "%10.0g"
##  $ survt : num  428 275 262 183 259 714 438 796 892 393 ...
##   ..- attr(*, "label")= chr "survival time in days"
##   ..- attr(*, "format.stata")= chr "%10.0g"
##  $ prison: num  0 1 0 0 1 0 1 1 0 1 ...
##   ..- attr(*, "label")= chr "0=none, 1=prison record"
##   ..- attr(*, "format.stata")= chr "%10.0g"
##  $ dose  : num  50 55 55 30 65 55 65 60 50 65 ...
##   ..- attr(*, "label")= chr "methadone dose (mg/day)"
##   ..- attr(*, "format.stata")= chr "%10.0g"

In the str() output, all the variables are atomic. The variable clinic should be a factor and the rest of the variables should be numeric and not atomic.

# Modify data types for each variable in addicts 
# Clinic is a factor variable, everything else is numeric.

addicts$id <- as.numeric(addicts$id)
addicts$clinic <- as.factor(addicts$clinic)
addicts$status <- as.numeric(addicts$status)
addicts$survt <- as.numeric(addicts$survt)
addicts$prison <- as.numeric(addicts$prison)
addicts$dose <- as.numeric(addicts$dose)

I could verify the variable types by using str() again.

# Check data types:
str(addicts)

## 'data.frame':    238 obs. of  6 variables:
##  $ id    : num  1 2 3 4 5 6 7 8 9 10 ...
##  $ clinic: Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ status: num  1 1 1 1 1 1 1 0 1 1 ...
##  $ survt : num  428 275 262 183 259 714 438 796 892 393 ...
##  $ prison: num  0 1 0 0 1 0 1 1 0 1 ...
##  $ dose  : num  50 55 55 30 65 55 65 60 50 65 ...

The Surv() function gives a list of times (in days) until the patient has dropped out of the methadone clinic. Cases with the plus sign indicate censorship rather than the event of the patient dropping out.

# Survival times until event

Surv(time = addicts$survt, event = addicts$status == 1)

##   [1]  428   275   262   183   259   714   438   796+  892   393   161+
##  [12]  836   523   612   212   399   771   514   512   624   209   341 
##  [23]  299   826+  262   566+  368   302   602+  652   293   564+  394 
##  [34]  755   591   787+  739   550   837   612   581+  523   504   785 
##  [45]  774   560   160   482   518   683   147   563   646   899   857 
##  [56]  180   452   760   496   258   181   386   439+  563+  337   613+
##  [67]  192   405+  667   905+  247   821   821   517+  346+  294   244 
##  [78]   95   376   212    96   532   522   679   408+  840+  148+  168 
##  [89]  489   541+  205   475+  237   517   749   150   465   708   713+
## [100]  146+  450   555+  460    53+  122    35   532+  684+  769+  591+
## [111]  769+  609+  932+  932+  587+   26    72+  641+  367+  633+  661 
## [122]  232    13   563+  969+ 1052+  944+  881+  190    79   884+  170 
## [133]  286   358+  326+  769+  161   564+  268   611+  322  1076+    2+
## [144]  788+  575+  109   730+  790+  456+  231   143    86+ 1021+  684+
## [155]  878   216   808+  268   222+  683+  496+  389   126    17   350 
## [166]  531+  317+  461+   37   167   358    49   457   127     7    29 
## [177]   62   150+  223   129+  204+  129   581   176    30    41   543+
## [188]  210+  193   434   367   348    28+  337+  175+  149   546    84 
## [199]  283+  533   207   216    28+   67    62+  111+  257   136   342+
## [210]   41   531+   98+  145    50    53+  103+    2+  157    75    19 
## [221]   35   394+  117   175   180   314   480+  325+  280   204   366 
## [232]  531+   59    33   540   551+   90    47

An optional line of code is to look at the summary statistics of this Surv() function by using summary().

summary(Surv(time = addicts$survt, event = addicts$status == 1))

##       time            status      
##  Min.   :   2.0   Min.   :0.0000  
##  1st Qu.: 171.2   1st Qu.:0.0000  
##  Median : 367.5   Median :1.0000  
##  Mean   : 402.6   Mean   :0.6303  
##  3rd Qu.: 585.5   3rd Qu.:1.0000  
##  Max.   :1076.0   Max.   :1.0000

The shortest clinic staying time is 2 days and the longest time a patient stayed at a methadone clinic was 1076 days.

Plotting Survival Curves Using Base R Graphics

To start, a variable Y is created as the survival object in R. This Surv() function is the outcome variable for survfit() which will be used later. (This Surv() function is the same as in the previous section.)

# Chapter One: Estimating Survival Functions And Comparing Them Across Strata:

# Surv creates survival object which is the response variable

Y = Surv(addicts$survt, addicts$status == 1)

The survfit() function produces Kaplan-Meier survival estimates. It takes in our Surv() function indicated by Y. We stratify by clinic as we are comparing the two methadone clinics.

# Actual Plot:
# Stratify by CLINIC variable:

kmfit = survfit(Y ~ addicts$clinic)

The summary function of kmfit gives a table of times (in days), the number of patients in the study, the number of patients who dropped out at each time point, the associated standard errors, the lower and upper limits of the 95% confidence intervals for the survival estimates.

# Base R Plot:

# Book had c(0,100,200,300,400,500,600,700,800,900,1000), I use the seq function.

summary(kmfit, times = c(seq(0, 1000, by = 100)))

## Call: survfit(formula = Y ~ addicts$clinic)
## 
##                 addicts$clinic=1 
##  time n.risk n.event survival std.err lower 95% CI upper 95% CI
##     0    163       0   1.0000  0.0000      1.00000        1.000
##   100    137      20   0.8746  0.0262      0.82467        0.928
##   200    110      20   0.7420  0.0353      0.67601        0.814
##   300     87      20   0.6046  0.0399      0.53120        0.688
##   400     68      14   0.5025  0.0415      0.42741        0.591
##   500     53       9   0.4319  0.0418      0.35719        0.522
##   600     30      16   0.2951  0.0403      0.22570        0.386
##   700     20       8   0.2113  0.0383      0.14818        0.301
##   800     10       8   0.1268  0.0326      0.07660        0.210
##   900      1       7   0.0181  0.0172      0.00283        0.116
## 
##                 addicts$clinic=2 
##  time n.risk n.event survival std.err lower 95% CI upper 95% CI
##     0     75       0    1.000  0.0000        1.000        1.000
##   100     66       5    0.932  0.0294        0.876        0.991
##   200     58       7    0.832  0.0442        0.750        0.924
##   300     50       7    0.730  0.0530        0.633        0.842
##   400     43       3    0.685  0.0558        0.584        0.804
##   500     39       2    0.653  0.0577        0.549        0.776
##   600     27       1    0.634  0.0590        0.528        0.761
##   700     19       1    0.606  0.0625        0.495        0.742
##   800     11       1    0.575  0.0669        0.457        0.722
##   900      7       1    0.517  0.0812        0.380        0.703
##  1000      3       0    0.517  0.0812        0.380        0.703

# In CLINIC = 1, survival times stopped at 900 instead of 1000 as there is no subject
# at risk on day 1000.

# Could've also used 100*(0:10)

Here is the code and output for the Kaplan-Meier curves in base R graphics.

plot(kmfit, lty = c("solid", "dashed"), col = c("black", "grey"), xlab = "Survival Time In Days", ylab = "Survival Probabilities")

legend("topright", c("Clinic 1", "Clinic 2"), lty = c("solid", "dashed"), col = c("black", "grey"))

Plotting Survival Curves Using ggplot2 and ggfortify

The base R graphics version of the Kaplan-Meier survival curves is not visually appealing. With the help of the ggplot2 and ggfortify packages, nicer plots can be produced.

Here is the code and output for the Kaplan-Meier curves with ggplot2 and ggfortify.

## Trying to find ggplot version of the above plots:

# http://rpubs.com/sinhrks/plot_surv for ggfortify & survival curves
# http://rpubs.com/sinhrks/plot_surv for ggfortify basics
#

# Plotting Survival Curves Using ggplot2 and ggfortify:

library(ggplot2)
library(ggfortify)

model_fit <- survfit(Surv(survt, status) ~ clinic, data = addicts)

autoplot(model_fit) + 
 labs(x = "\n Survival Time (Days) ", y = "Survival Probabilities \n", 
 title = "Survival Times Of \n Methadone Patients \n") + 
 theme(plot.title = element_text(hjust = 0.5), 
 axis.title.x = element_text(face="bold", colour="#FF7A33", size = 12),
 axis.title.y = element_text(face="bold", colour="#FF7A33", size = 12),
 legend.title = element_text(face="bold", size = 10))

In this plot, the colours help the reader identify which curve goes with which clinic. The shaded bands represent the confidence intervals and each time point. The plus signs represent the censored cases at a given time point.

More patients stay in clinic 2 than in clinic 1 since the survival curve is higher than the curve for clinic 1. An investigation is recommended in determining on why a lot of the patients in clinic one leave. It could be the clinic, it could the selection of patients or something else not explained by the data.