hazardous.SurvivalBoost

class hazardous.SurvivalBoost(hard_zero_fraction=0.1, n_iter=100, learning_rate=0.05, max_leaf_nodes=31, max_depth=None, min_samples_leaf=50, show_progressbar=True, n_time_grid_steps=100, time_horizon=None, ipcw_strategy='alternating', n_iter_before_feedback=20, random_state=None, n_horizons_per_observation=3)

Cause-specific Cumulative Incidence Function (CIF) with GBDT [1].

This model estimates the cause-specific Cumulative Incidence Function (CIF) for each event of interest, as well as the survival function for any event, using a Gradient Boosting Decision Tree (GBDT) classifier. The CIF represents the probability of observing an event of a specific type before a given time.

The model handles survival analysis and competing risks data.

The cumulative incidence function (CIF) for each event type \(k\) at each time horizon \(t\) is defined as:

\[\hat{F}_k(t; x_i) \approx F_k(t; x_i) = \mathbb{P}(T \leq t, \Delta=k | X=x_i)\]

where \(T\) is a random variable for the uncensored time to first event, \(\Delta\) is a random variable taking values in \(\{1, \dots, K\}\) for the (uncensored) event type, and \(x_i\) is the feature vector of the \(i\)-th observation.

The (any event) Survival Function can be defined as:

\[S(t; x_i) = \mathbb{P}(T > t | X=x_i) = 1 - \mathbb{P}(T \leq t | X=x_i) = 1 - \sum_{k=1}^K \mathbb{P}(T \leq t, \Delta=k | X=x_i) = 1 - \sum_{k=1}^K F_k(t; x_i)\]
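The identity above can be checked numerically on uncensored toy data. The sketch below is only an illustration: it uses plain NumPy on made-up durations and event labels rather than the hazardous API, it ignores the conditioning on \(X\), and the helper name empirical_cif is hypothetical.

```python
import numpy as np


def empirical_cif(durations, events, event_id, times):
    """Empirical P(T <= t, Delta = event_id) on fully uncensored data."""
    durations = np.asarray(durations, dtype=float)
    events = np.asarray(events)
    times = np.asarray(times, dtype=float)
    # Indicator matrix of shape (n_samples, n_times).
    happened = (durations[:, None] <= times[None, :]) & (
        events[:, None] == event_id
    )
    return happened.mean(axis=0)


# Toy, fully observed competing-risks sample with K = 2 event types.
durations = np.array([1.0, 2.0, 2.5, 4.0, 5.0, 7.0])
events = np.array([1, 2, 1, 1, 2, 2])
times = np.array([0.0, 2.0, 4.0, 8.0])

cif_1 = empirical_cif(durations, events, 1, times)
cif_2 = empirical_cif(durations, events, 2, times)

# Any-event survival computed directly as P(T > t).
survival = (durations[:, None] > times[None, :]).mean(axis=0)

# S(t) = 1 - sum_k F_k(t) holds exactly on uncensored data.
assert np.allclose(survival, 1.0 - (cif_1 + cif_2))
```

With censored data the empirical frequencies above are biased, which is why the model relies on censoring weights during training.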

Under the hood, this class randomly samples reference time horizons, which are concatenated as an extra input column to train the underlying Histogram-based Gradient Boosting (HGB) classifier. At each boosting iteration, a new tree is trained on a copy of the original feature matrix \(X\), augmented with a new independent sample of time horizons. The number of time horizons sampled at each iteration is controlled by the n_horizons_per_observation parameter.
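The augmentation step can be pictured with the following simplified sketch. It is not the library's implementation: it assumes every observation is uncensored, draws horizons uniformly over the observed time range (an assumption), and applies the hard_zero_fraction parameter documented below by forcing a fraction of the horizons to exactly zero. The function name sample_augmented_batch is hypothetical.

```python
import numpy as np


def sample_augmented_batch(
    X, durations, events, n_horizons_per_observation, hard_zero_fraction, rng
):
    """Repeat each row, append a sampled horizon, and build class targets.

    Simplification: all rows are assumed uncensored (events >= 1).
    """
    X = np.asarray(X, dtype=float)
    durations = np.asarray(durations, dtype=float)
    events = np.asarray(events)

    X_rep = np.repeat(X, n_horizons_per_observation, axis=0)
    durations_rep = np.repeat(durations, n_horizons_per_observation)
    events_rep = np.repeat(events, n_horizons_per_observation)
    n_rows = X_rep.shape[0]

    # Assumption: horizons are drawn uniformly over the observed time range.
    horizons = rng.uniform(0.0, durations.max(), size=n_rows)

    # Force a fraction of the horizons to exactly zero.
    n_zeros = int(round(hard_zero_fraction * n_rows))
    zero_idx = rng.choice(n_rows, size=n_zeros, replace=False)
    horizons[zero_idx] = 0.0

    # Class 0: still event-free at the horizon.
    # Class k: event k has already happened at the horizon.
    targets = np.where(durations_rep <= horizons, events_rep, 0)

    # The horizon becomes an extra input column for the classifier.
    X_aug = np.column_stack([X_rep, horizons])
    return X_aug, targets


rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))
durations = np.array([1.0, 3.0, 2.0, 5.0, 4.0])
events = np.array([1, 2, 1, 2, 1])

X_aug, targets = sample_augmented_batch(
    X, durations, events,
    n_horizons_per_observation=4, hard_zero_fraction=0.1, rng=rng,
)
```

Calling the function again with the same generator yields a fresh, independent set of horizons, which mirrors the per-iteration resampling described above.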

To estimate the survival function and the CIF under censoring, the model uses an alternating optimization scheme. The censoring-adjusted incidence estimator is trained for a fixed number of iterations, then a feedback step updates it with the current model predictions. This feedback step is triggered every n_iter_before_feedback iterations.

Parameters:
hard_zero_fractionfloat, default=0.1

The fraction of observations that are assigned a time horizon set to exact zeros when doing one epoch of fitting. Increasing this value helps the model learn to predict 0 incidence at t=0 at the cost of reducing the effective sample size for the non-zero time horizons.

n_iterint, default=100

The number of boosting iterations.

learning_ratefloat, default=0.05

The learning rate, also known as shrinkage. This is used as a multiplicative factor for the leaf values. Use 1 for no shrinkage.

max_leaf_nodesint or None, default=31

The maximum number of leaves for each tree. Must be strictly greater than 1. If None, there is no maximum limit.

max_depthint or None, default=None

The maximum depth of each tree. The depth of a tree is the number of edges to go from the root to the deepest leaf. Depth isn’t constrained by default.

min_samples_leafint, default=50

The minimum number of samples per leaf.

show_progressbarbool, default=True

Whether to show a progress bar during the training process.

n_time_grid_stepsint, default=100

The number of time horizons to sample uniformly between the minimum and maximum observed event times. Note that the generated grid time_grid_ can be overridden in the method predict_cumulative_incidence and predict_survival_function by setting the parameter times.

time_horizonint or float, default=None

The time horizon at which to estimate the probabilities. If None, time_horizon must be specified when calling the method predict_proba.

ipcw_strategy{“alternating”, “kaplan-meier”}, default=”alternating”

The method used to estimate the Inverse Probability of Censoring Weighting (IPCW).

If “alternating”, two gradient boosting models are trained alternately, switching every n_iter_before_feedback iterations: one for the CIF and the any-event survival function, the other for the censoring distribution. This makes it possible to estimate IPCW conditionally on the covariates, without assuming that censoring is independent of the covariates.

If “kaplan-meier”, the censoring distribution is estimated with the Kaplan-Meier estimator, which is fitted only once at the beginning of the training process. This option is very fast but assumes that censoring is independent of the covariates.

n_iter_before_feedbackint, default=20

The number of boosting iterations after which training alternates to fit the Inverse Probability of Censoring Weighting (IPCW) estimator, before the updated weights are fed back to the incidence estimator.

random_stateint, RandomState instance or None, default=None

Controls the randomness of the uniform time sampler.

n_horizons_per_observationint, default=3

The number of time horizons to sample for each individual in the training set at each stochastic boosting iteration (epoch).
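To make the “kaplan-meier” option of ipcw_strategy more concrete, here is a minimal NumPy sketch of marginal censoring weights. It is a textbook construction under stated assumptions, not the code used by the library: the censoring survival \(G(t) = \mathbb{P}(C > t)\) is estimated by treating censored rows as the "events", and each uncensored row receives the weight \(1 / G(T_i^-)\). The function names are hypothetical, and the weighting actually applied during training may differ, for instance by depending on the sampled time horizon.

```python
import numpy as np


def kaplan_meier_censoring_survival(durations, events):
    """Kaplan-Meier estimate of G(t) = P(C > t), C being the censoring time.

    Censored rows (events == 0) play the role of "events" here.
    Returns the distinct times and G evaluated just after each of them.
    """
    durations = np.asarray(durations, dtype=float)
    censored = np.asarray(events) == 0
    unique_times = np.unique(durations)
    g = np.empty(unique_times.shape[0])
    current = 1.0
    for i, t in enumerate(unique_times):
        at_risk = np.sum(durations >= t)
        n_censored = np.sum((durations == t) & censored)
        current *= 1.0 - n_censored / at_risk
        g[i] = current
    return unique_times, g


def ipcw_weights(durations, events, eps=1e-8):
    """Weights 1 / G(T_i^-) for uncensored rows and 0 for censored rows."""
    durations = np.asarray(durations, dtype=float)
    events = np.asarray(events)
    unique_times, g = kaplan_meier_censoring_survival(durations, events)
    # G(t^-): value of the step function just before t (1.0 before the
    # first observed time).
    idx = np.searchsorted(unique_times, durations, side="left") - 1
    g_before = np.where(idx >= 0, g[np.clip(idx, 0, None)], 1.0)
    return np.where(events != 0, 1.0 / np.maximum(g_before, eps), 0.0)


durations = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
events = np.array([1, 0, 2, 0, 1])  # 0 marks a censored observation
weights = ipcw_weights(durations, events)
```

Because these weights are computed from the durations and censoring indicators alone, they are identical for all observations sharing the same duration, whatever their covariates. This is the independence assumption that the “alternating” strategy relaxes.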

Attributes:
estimator_HistGradientBoostingClassifier

The base estimator used to fit the CIF and survival function.

classes_ndarray of shape (n_classes,)

The events seen during training.

event_ids_ndarray of shape (n_classes,)

Numeric representation of classes_.

time_grid_ndarray of shape (n_time_grid_steps,)

The time horizons used to predict the survival function and the CIF.

weighted_targets_WeightedMultiClassTargetSampler

The weighted targets used to train the model.
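As an illustration of the time_grid_ attribute, the sketch below builds an evenly spaced grid between the smallest and largest observed event times. It assumes that "uniformly" in the description of n_time_grid_steps means evenly spaced and that censored durations are excluded; the helper name is hypothetical and the library may construct its grid differently.

```python
import numpy as np


def make_uniform_time_grid(durations, events, n_time_grid_steps=100):
    """Evenly spaced horizons spanning the observed (uncensored) event times."""
    durations = np.asarray(durations, dtype=float)
    events = np.asarray(events)
    observed = durations[events != 0]  # drop censored rows
    return np.linspace(observed.min(), observed.max(), n_time_grid_steps)


durations = np.array([2.0, 9.0, 4.0, 12.0, 7.0])
events = np.array([1, 0, 2, 0, 1])  # 0 marks a censored observation
time_grid = make_uniform_time_grid(durations, events, n_time_grid_steps=6)
```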

References

[1]

J. Alberge, V. Maladière, O. Grisel, J. Abécassis, G. Varoquaux, “Teaching Models To Survive: Proper Scoring Rule and Stochastic Optimization with Competing Risks”, 2024. https://arxiv.org/pdf/2406.14085

Examples

>>> from hazardous.data import make_synthetic_competing_weibull
>>> from sklearn.model_selection import train_test_split
>>> from hazardous import SurvivalBoost
>>> X, y = make_synthetic_competing_weibull(return_X_y=True, random_state=0)
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
>>> survival_booster = SurvivalBoost(
...     n_iter=3, show_progressbar=False, random_state=0
... ).fit(X_train, y_train)
>>> survival_pred = survival_booster.predict_survival_function(X_test)

fit(X, y, times=None)

Fit the model.

Parameters:
Xarray-like of shape (n_samples, n_features)

The input samples.

ydict, {array-like, dataframe} of shape (n_samples, 2)

The target values. If a dictionary, it must have keys “event” and “duration”. If a record array, it must have a dtype with two fields named “event” and “duration”. If a dataframe, it must have columns named “event” and “duration”. “event” is an integer array of shape (n_samples,) indicating which event was observed (0 means that the sample was censored). “duration” is a float array of shape (n_samples,) indicating the time of the first event or the time of censoring.

timesarray-like of shape (n_times,), default=None

The time horizons used to predict the survival function and the CIF. If None, the default time grid is computed from the observed event times in the training data.

Returns:
selfobject

Returns an instance of self.
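The layouts accepted for y in fit can be built as follows. This sketch uses made-up values and only NumPy; the dataframe layout is omitted to avoid an extra dependency, but it would carry the same two columns.

```python
import numpy as np

event = np.array([1, 0, 2, 1])  # 0 marks a censored observation
duration = np.array([3.2, 5.0, 1.7, 4.4])

# Option 1: a plain dictionary with the two required keys.
y_dict = {"event": event, "duration": duration}

# Option 2: a NumPy record (structured) array with the two named fields.
y_rec = np.empty(
    event.shape[0], dtype=[("event", np.int64), ("duration", np.float64)]
)
y_rec["event"] = event
y_rec["duration"] = duration
```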

get_metadata_routing()

Get metadata routing of this object.

Please check User Guide on how the routing mechanism works.

Returns:
routingMetadataRequest

A MetadataRequest encapsulating routing information.

get_params(deep=True)

Get parameters for this estimator.

Parameters:
deepbool, default=True

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:
paramsdict

Parameter names mapped to their values.

predict_cumulative_incidence(X, times=None)

Estimate conditional cumulative incidence function for each event type.

Please refer to the docstring of the class for the definitions of the conditional survival function and the event-specific cumulative incidence functions estimated by this method.

Parameters:
Xarray-like of shape (n_samples, n_features)

The feature vectors for each observation for which to estimate the cumulative incidence functions.

timesarray-like, default=None

The time horizons at which to estimate the probabilities. If None, this method uses the grid generated during fit based on the parameter n_time_grid_steps.

Returns:
predicted_curvesndarray of shape (n_samples, n_events + 1, n_times)

The estimated probabilities at different time horizons. The values at event index 0 are the estimated probabilities of staying event-free at the requested time horizons for each observation described by the matching row of X. The remaining event indices correspond to the estimated cumulative incidence (or probability) for each event type.
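The layout of the returned array can be explored with a placeholder of the same shape. The array below is random data standing in for the output of predict_cumulative_incidence, not an actual model prediction; it is generated so that probabilities sum to 1 over the event axis, as implied by the definitions in the class docstring.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_events, n_times = 4, 2, 5

# Placeholder with shape (n_samples, n_events + 1, n_times): random
# probabilities that sum to 1 over the event axis.
predicted_curves = np.moveaxis(
    rng.dirichlet(np.ones(n_events + 1), size=(n_samples, n_times)), -1, 1
)

# Event index 0: probability of staying event-free, (n_samples, n_times).
survival = predicted_curves[:, 0, :]

# Event index 1: cumulative incidence of the first event type.
cif_event_1 = predicted_curves[:, 1, :]

# Probability that any event has happened by each horizon.
total_incidence = predicted_curves[:, 1:, :].sum(axis=1)
```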

predict_proba(X, time_horizon=None)

Estimate the probability of all incidences for a specific time horizon.

Parameters:
Xarray-like of shape (n_samples, n_features)

The input samples.

time_horizonint or float, default=None

The time horizon at which to estimate the probabilities. If None, the time_horizon passed to the constructor is used.

Returns:
y_probandarray of shape (n_samples, n_events + 1)

The estimated probabilities at the given time horizon. The column indexed 0 stores the estimated probabilities of staying event-free at the requested time horizon for each observation described by the matching row of X. The remaining columns store the estimated cumulative incidence (or probability) for each event.

predict_survival_function(X, times=None)

Estimate the conditional any-event survival function.

Parameters:
Xarray-like of shape (n_samples, n_features)

The feature vectors for each observation for which to estimate the survival function.

timesarray-like, default=None

The time horizons at which to estimate the probabilities. If None, it uses the grid generated during fit based on the parameter n_time_grid_steps.

Returns:
predicted_curvesndarray of shape (n_samples, n_times)

The estimated probabilities of staying event-free at different time horizons.

score(X, y)

Return the negative mean IBS over each event of interest and the any-event survival.

This returns the negative of the mean Integrated Brier Score (IBS, a proper scoring rule), averaged over each competing event and the survival to any event. The sign is flipped so that higher values indicate a better model, which is consistent with the scoring convention of scikit-learn and makes it possible to use this class with scikit-learn model selection utilities such as GridSearchCV and RandomizedSearchCV.
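The sign convention can be illustrated with a simplified, uncensored version of the score. The sketch below is an assumption-laden illustration rather than the library's metric: it omits the censoring weights, covers only the any-event survival term, and integrates with a hand-written trapezoidal rule. The function name integrated_brier_score is hypothetical.

```python
import numpy as np


def integrated_brier_score(indicator, predicted, times):
    """Time-averaged Brier score on uncensored data (no IPCW correction)."""
    brier_per_time = ((indicator - predicted) ** 2).mean(axis=0)
    # Trapezoidal rule, normalized by the length of the time interval.
    area = np.sum(
        0.5 * (brier_per_time[1:] + brier_per_time[:-1]) * np.diff(times)
    )
    return area / (times[-1] - times[0])


durations = np.array([1.0, 2.0, 3.0, 4.0])
times = np.array([0.5, 1.5, 2.5, 3.5])

# Binary target: is each observation still event-free at each horizon?
event_free = (durations[:, None] > times[None, :]).astype(float)

perfect = event_free.copy()  # oracle predictions
uninformative = np.full_like(event_free, 0.5)  # constant 0.5 everywhere

# Negating the IBS makes "higher is better".
score_perfect = -integrated_brier_score(event_free, perfect, times)
score_uninformative = -integrated_brier_score(event_free, uninformative, times)
```

The oracle predictions reach the maximum score of 0, while the constant 0.5 predictions score -0.25, so ranking models by this value selects the one with the lowest IBS.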

Parameters:
Xarray-like of shape (n_samples, n_features)

The input samples.

ydict with keys “event” and “duration”

The target values. “event” is an integer array of shape (n_samples,) indicating which event was observed (0 means that the sample was censored). “duration” is a float array of shape (n_samples,) indicating the time of the first event or the time of censoring.

Returns:
scorefloat

The negative of the mean time-integrated Brier score (IBS).

TODO: implement time integrated NLL and use as the default for the
.score method to match the objective function used at fit time.
set_fit_request(*, times: bool | None | str = '$UNCHANGED$') -> SurvivalBoost

Request metadata passed to the fit method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to fit.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:
timesstr, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for times parameter in fit.

Returns:
selfobject

The updated object.

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:
**paramsdict

Estimator parameters.

Returns:
selfestimator instance

Estimator instance.

set_predict_proba_request(*, time_horizon: bool | None | str = '$UNCHANGED$') -> SurvivalBoost

Request metadata passed to the predict_proba method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to predict_proba if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to predict_proba.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:
time_horizonstr, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for time_horizon parameter in predict_proba.

Returns:
selfobject

The updated object.