
This function generates nonlinear bins for probability of survival data from user-supplied thresholds and divisors, following Napoli et al. (2017), Schroeder et al. (2019), and Kassar et al. (2016). It also calculates statistics for each bin, including the mean, standard deviation, total alive, total dead, count, and percentage.

Usage

nonlinear_bins(
  data,
  Ps_col,
  outcome_col,
  divisor1 = 5,
  divisor2 = 5,
  threshold_1 = 0.9,
  threshold_2 = 0.99
)

Arguments

data

A data.frame or tibble containing the probability of survival data for a set of patients.

Ps_col

The column in data containing the probability of survival values for a set of patients.

outcome_col

The name of the column containing the outcome data. It should be binary, with values indicating patient survival. A value of 1 should represent "alive" (survived), while 0 should represent "dead" (did not survive). Ensure the column contains only these two possible values.

divisor1

A parameter to control the width of the probability of survival range bins. Affects the creation of step sizes for the beginning of each bin range. Defaults to 5.

divisor2

A parameter to control the width of the probability of survival range bins. Affects the creation of step sizes for the beginning of each bin range. Defaults to 5.

threshold_1

A parameter to decide where data indices will begin to create step sizes. Defaults to 0.9.

threshold_2

A parameter to decide where data indices will end to create step sizes. Defaults to 0.99.
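
One way to read the four tuning arguments together is sketched below. This is an illustration under stated assumptions, not the package's source: `sketch_intervals()` is a hypothetical helper, and it assumes that step sizes are measured in numbers of sorted observations, that `divisor1` applies to the values up to `threshold_1`, and that `divisor2` applies to the values between `threshold_1` and `threshold_2`.

```r
# Hypothetical helper for illustration only; not part of the package.
sketch_intervals <- function(ps,
                             divisor1 = 5,
                             divisor2 = 5,
                             threshold_1 = 0.9,
                             threshold_2 = 0.99) {
  ps <- sort(ps)
  n <- length(ps)

  # The sketch needs at least one value above each threshold
  stopifnot(any(ps > threshold_1), any(ps > threshold_2))

  # Index of the first sorted value above each threshold
  first_above_1 <- which(ps > threshold_1)[1]
  first_above_2 <- which(ps > threshold_2)[1]

  # Assumed step sizes, counted in observations rather than probability units
  step1 <- max(1, round(first_above_1 / divisor1))
  step2 <- max(1, round((first_above_2 - first_above_1) / divisor2))

  # Indices at which each bin starts; the last index is moved to the maximum
  idx <- unique(c(
    seq(1, first_above_1, by = step1),
    seq(first_above_1, first_above_2, by = step2)
  ))
  idx[length(idx)] <- n

  # Bin boundaries on the probability scale
  unique(ps[idx])
}

# Usage on evenly spaced probabilities
ps <- (1:1000) / 1000
sketch_intervals(ps)
```

Under these assumptions, larger divisors give smaller steps and therefore more, narrower bins in the corresponding region.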

Value

A list with intervals and bin_stats objects:

  • intervals: A vector of start and end-points for the probability of survival bin ranges.

  • bin_stats: A tibble with columns bin_number, bin_start, bin_end, mean, sd, Pred_Survivors_b, Pred_Deaths_b, AntiS_b, AntiM_b, alive, dead, count, and percent.
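
To make the relationship between the two returned objects concrete, the sketch below derives a few of the per-bin columns from a vector of interval boundaries using base R. It is not the package's implementation: `sketch_bin_stats()` is a hypothetical helper, the left-closed bins (with a closed final bin) are an assumption, and `percent` is assumed here to be the bin's share of all rows expressed as a fraction.

```r
# Hypothetical helper for illustration only; not part of the package.
sketch_bin_stats <- function(ps, outcome, intervals) {
  n_bins <- length(intervals) - 1

  # Assign each value to a bin: [start, end), with the last bin closed
  bin <- cut(ps, breaks = intervals, right = FALSE,
             include.lowest = TRUE, labels = FALSE)
  bin <- factor(bin, levels = seq_len(n_bins))

  count <- as.vector(table(bin))
  alive <- as.vector(tapply(outcome, bin, sum))  # outcome: 1 = alive, 0 = dead

  data.frame(
    bin_number = seq_len(n_bins),
    bin_start  = intervals[-length(intervals)],
    bin_end    = intervals[-1],
    mean       = as.vector(tapply(ps, bin, mean)),
    sd         = as.vector(tapply(ps, bin, sd)),
    alive      = alive,
    dead       = count - alive,
    count      = count,
    percent    = count / sum(count)  # assumed to be a fraction of all rows
  )
}

# Usage with a tiny hand-made data set
sketch_bin_stats(
  ps        = c(0.1, 0.2, 0.3, 0.4),
  outcome   = c(0, 1, 1, 1),
  intervals = c(0.1, 0.3, 0.4)
)
```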

Author

Nicolas Foss, Ed.D., MS. Original paper and MATLAB code by Nicholas J. Napoli, Ph.D., MS.

Examples

# Generate example data with high negative skewness
set.seed(123)

# Parameters
n_patients <- 10000  # Total number of patients

# Skewed towards higher values
Ps <- plogis(rnorm(n_patients, mean = 2, sd = 1.5))

# Simulate survival outcomes based on Ps
survival_outcomes <- rbinom(n_patients,
                            size = 1,
                            prob = Ps)

# Create data frame
data <- data.frame(Ps = Ps, survival = survival_outcomes) |>
  dplyr::mutate(death = dplyr::if_else(survival == 1, 0, 1))

# Apply the nonlinear_bins function
results <- nonlinear_bins(data = data,
                          Ps_col = Ps,
                          outcome_col = survival,
                          divisor1 = 5,
                          divisor2 = 5,
                          threshold_1 = 0.9,
                          threshold_2 = 0.99)

# View results
results$intervals
#>  [1] 0.02257717 0.54234698 0.70154257 0.79581165 0.85714527 0.90005763
#>  [7] 0.92518915 0.94603830 0.96266743 0.97623957 0.99957866
results$bin_stats
#> # A tibble: 10 × 13
#>    bin_number bin_start bin_end  mean      sd Pred_Survivors_b Pred_Deaths_b
#>         <int>     <dbl>   <dbl> <dbl>   <dbl>            <dbl>         <dbl>
#>  1          1    0.0226   0.542 0.378 0.122               419.         692. 
#>  2          2    0.542    0.702 0.628 0.0458              698.         413. 
#>  3          3    0.702    0.796 0.752 0.0271              836.         275. 
#>  4          4    0.796    0.857 0.829 0.0173              921.         190. 
#>  5          5    0.857    0.900 0.879 0.0126              976.         134. 
#>  6          6    0.900    0.925 0.913 0.00723             735.          70.0
#>  7          7    0.925    0.946 0.936 0.00596             753.          51.8
#>  8          8    0.946    0.963 0.954 0.00473             768.          36.7
#>  9          9    0.963    0.976 0.970 0.00406             781.          24.4
#> 10         10    0.976    1.00  0.987 0.00621            1210.          16.2
#> # ℹ 6 more variables: AntiS_b <dbl>, AntiM_b <dbl>, alive <dbl>, dead <dbl>,
#> #   count <dbl>, percent <dbl>