hazardous.SurvivalBoost
Usage examples at the bottom of this page.
- class hazardous.SurvivalBoost(hard_zero_fraction=0.1, n_iter=100, learning_rate=0.05, max_leaf_nodes=31, max_depth=None, min_samples_leaf=50, show_progressbar=True, n_time_grid_steps=100, time_horizon=None, ipcw_strategy='alternating', n_iter_before_feedback=20, random_state=None, n_horizons_per_observation=3)
Cause-specific Cumulative Incidence Function (CIF) with GBDT [1].
This model estimates the cause-specific Cumulative Incidence Function (CIF) for each event of interest, as well as the survival function for any event, using a Gradient Boosting Decision Tree (GBDT) classifier. The CIF represents the probability of observing an event of a specific type before a given time.
The model handles survival analysis and competing risks data.
The cumulative incidence function (CIF) for each event type \(k\) at each time horizon \(t\) is defined as:
\[\hat{F}_k(t; x_i) \approx F_k(t; x_i) = \mathbb{P}(T \leq t, \Delta=k | X=x_i)\]
where \(T\) is a random variable for the uncensored time to first event, \(\Delta\) is a random variable over the \([1, K]\) domain for the (uncensored) event type, and \(x_i\) is the feature vector of the \(i\)-th observation.
The (any event) Survival Function can be defined as:
\[S(t; x_i) = \mathbb{P}(T > t | X=x_i) = 1 - \mathbb{P}(T \leq t | X=x_i) = 1 - \sum_{k=1}^K \mathbb{P}(T \leq t, \Delta=k | X=x_i) = 1 - \sum_{k=1}^K F_k(t; x_i)\]
Under the hood, this class randomly samples reference time horizons, which are concatenated as an extra input column to train the underlying Histogram-based Gradient Boosting (HGB) classifier. At each boosting iteration, a new tree is trained on a copy of the original feature matrix \(X\), augmented with a new independent sample of time horizons. The number of time horizons sampled at each iteration is controlled by the n_horizons_per_observation parameter (a minimal sketch of this augmentation is shown after the parameter list below).
To predict the survival function and the CIF, the model uses an alternating optimization approach. The censoring-adjusted incidence estimator is trained for a fixed number of iterations before the feedback loop is triggered. This feedback loop is initiated every n_iter_before_feedback iterations and updates the censoring-adjusted incidence estimator with the current model predictions.
- Parameters:
- hard_zero_fraction : float, default=0.1
The fraction of observations that are assigned a time horizon set to exactly zero during each fitting epoch. Increasing this value helps the model learn to predict an incidence of 0 at t=0, at the cost of reducing the effective sample size for the non-zero time horizons.
- n_iter : int, default=100
The number of boosting iterations.
- learning_rate : float, default=0.05
The learning rate, also known as shrinkage. This is used as a multiplicative factor for the leaf values. Use 1 for no shrinkage.
- max_leaf_nodes : int or None, default=31
The maximum number of leaves for each tree. Must be strictly greater than 1. If None, there is no maximum limit.
- max_depth : int or None, default=None
The maximum depth of each tree. The depth of a tree is the number of edges to go from the root to the deepest leaf. Depth isn’t constrained by default.
- min_samples_leaf : int, default=50
The minimum number of samples per leaf.
- show_progressbar : bool, default=True
Whether to show a progress bar during the training process.
- n_time_grid_steps : int, default=100
The number of time horizons to sample uniformly between the minimum and maximum observed event times. Note that the generated grid time_grid_ can be overridden in the methods predict_cumulative_incidence and predict_survival_function by setting the parameter times.
- time_horizon : int or float, default=None
The time horizon at which to estimate the probabilities. If None, the time_horizon should be specified when calling the method predict_proba.
- ipcw_strategy : {“alternating”, “kaplan-meier”}, default=“alternating”
The method used to estimate the Inverse Probability of Censoring Weighting (IPCW).
If “alternating”, the two instances of gradient boosting are trained alternately every n_iter_before_feedback iterations: one for the CIF + any-event survival function and the other for the censoring distribution. This makes it possible to estimate the IPCW conditionally on the covariates, without assuming independence between censoring and covariates.
If “kaplan-meier”, the censoring estimator is trained using the Kaplan-Meier estimator. This estimator is trained only once at the beginning of the training process. It is very fast but assumes that the censoring is independent of the covariates.
- n_iter_before_feedback : int, default=20
The number of boosting iterations after which training alternates to fit the Inverse Probability of Censoring Weighting (IPCW) estimator, before feeding the updated weights back to the incidence estimator.
- random_state : int, RandomState instance or None, default=None
Controls the randomness of the uniform time sampler.
- n_horizons_per_observation : int, default=3
The number of time horizons to sample for each individual in the training set at each stochastic boosting iteration (epoch).
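As a rough, hypothetical illustration of the time-horizon augmentation described above (this is not the library's internal code; the actual sampling distribution, hard-zero handling, and target construction differ), each observation can be thought of as being repeated n_horizons_per_observation times, with an independently sampled horizon appended as an extra input column:
>>> import numpy as np
>>> rng = np.random.default_rng(0)
>>> X = rng.normal(size=(5, 2))                  # 5 observations, 2 features
>>> durations = rng.uniform(1.0, 10.0, size=5)   # observed durations
>>> n_horizons = 3                               # n_horizons_per_observation
>>> # sample one batch of reference horizons per observation
>>> horizons = rng.uniform(0.0, durations.max(), size=(5, n_horizons))
>>> # repeat each row once per horizon and append the horizon as an extra column
>>> X_augmented = np.column_stack(
...     [np.repeat(X, n_horizons, axis=0), horizons.ravel()]
... )
>>> X_augmented.shape
(15, 3)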
- Attributes:
- estimator_ : HistGradientBoostingClassifier
The base estimator used to fit the CIF and survival function.
- classes_ : ndarray of shape (n_classes,)
The events seen during training.
- event_ids_ : ndarray of shape (n_classes,)
Numeric representation of classes_.
- time_grid_ : ndarray of shape (n_time_grid_steps,)
The time horizons used to predict the survival function and the CIF.
- weighted_targets_ : WeightedMultiClassTargetSampler
The weighted targets used to train the model.
References
[1] J. Alberge, V. Maladière, O. Grisel, J. Abécassis, G. Varoquaux, “Teaching Models To Survive: Proper Scoring Rule and Stochastic Optimization with Competing Risks”, 2024. https://arxiv.org/pdf/2406.14085
Examples
>>> from hazardous.data import make_synthetic_competing_weibull
>>> from sklearn.model_selection import train_test_split
>>> from hazardous import SurvivalBoost
>>> X, y = make_synthetic_competing_weibull(return_X_y=True, random_state=0)
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
>>> survival_booster = SurvivalBoost(
...     n_iter=3, show_progressbar=False, random_state=0
... ).fit(X_train, y_train)
>>> survival_pred = survival_booster.predict_survival_function(X_test)
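Continuing this example, the fitted time grid and the shape of the predicted curves can be inspected; the checks below are a sketch based on the shapes documented on this page, not verified doctest output:
>>> survival_booster.time_grid_.shape == (survival_booster.n_time_grid_steps,)
True
>>> survival_pred.shape == (X_test.shape[0], survival_booster.time_grid_.shape[0])
True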
- fit(X, y, times=None)
Fit the model.
- Parameters:
- X : array-like of shape (n_samples, n_features)
The input samples.
- y : dict, {array-like, dataframe} of shape (n_samples, 2)
The target values. If a dictionary, it must have keys “event” and “duration”. If a record array, it must have a dtype with two fields named “event” and “duration”. If a dataframe, it must have columns named “event” and “duration”. “event” is an integer array of shape (n_samples,) indicating which event was observed (0 means that the sample was censored). “duration” is a float array of shape (n_samples,) indicating the time of the first event or the time of censoring.
- times : array-like of shape (n_times,), default=None
The time horizons used to predict the survival function and the CIF. If None, the default time grid is computed from the observed event times in the training data.
- Returns:
- self : object
Returns an instance of self.
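As a sketch of the accepted target format (assuming pandas is available; the column values below are made up), y can be passed as a dataframe with “event” and “duration” columns:
>>> import pandas as pd
>>> y_demo = pd.DataFrame({
...     "event": [0, 1, 2, 1],          # 0 means the sample was censored
...     "duration": [5.0, 3.2, 7.5, 1.1],
... })
>>> sorted(y_demo.columns)
['duration', 'event']
An equivalent dict of arrays, with the same two keys, is also accepted.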
- get_metadata_routing()
Get metadata routing of this object.
Please check the User Guide on how the routing mechanism works.
- Returns:
- routing : MetadataRequest
A MetadataRequest encapsulating routing information.
- get_params(deep=True)
Get parameters for this estimator.
- Parameters:
- deep : bool, default=True
If True, will return the parameters for this estimator and contained subobjects that are estimators.
- Returns:
- params : dict
Parameter names mapped to their values.
- predict_cumulative_incidence(X, times=None)
Estimate conditional cumulative incidence function for each event type.
Please refer to the docstring of the class for the definitions of the conditional survival function and the event-specific cumulative incidence functions estimated by this method.
- Parameters:
- X : array-like of shape (n_samples, n_features)
The feature vectors for each observation for which to estimate the survival function.
- times : array-like, default=None
The time horizons at which to estimate the probabilities. If None, this method uses the grid generated during fit based on the parameter n_time_grid_steps.
- Returns:
- predicted_curves : ndarray of shape (n_samples, n_events + 1, n_times)
The estimated probabilities at different time horizons. The values at event index 0 are the estimated probabilities of staying event-free at the requested time horizons for each observation described by the matching row of X. The remaining event indices correspond to the estimated cumulative incidence (or probability) for each event type.
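Assuming survival_booster and X_test from the Examples section above, the returned array can be sliced as follows; this is a sketch based on the documented shapes, not verified doctest output:
>>> cif = survival_booster.predict_cumulative_incidence(X_test)
>>> any_event_survival = cif[:, 0, :]   # event index 0: probability of staying event-free
>>> first_event_cif = cif[:, 1, :]      # cumulative incidence of the first event type
>>> cif.shape[-1] == survival_booster.time_grid_.shape[0]
True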
- predict_proba(X, time_horizon=None)
Estimate the probability of all incidences for a specific time horizon.
- Parameters:
- X : array-like of shape (n_samples, n_features)
The input samples.
- time_horizon : int or float, default=None
The time horizon at which to estimate the probabilities. If None, the time_horizon passed to the constructor is used.
- Returns:
- y_proba : ndarray of shape (n_samples, n_events + 1)
The estimated probabilities at the given time horizon. The column indexed 0 stores the estimated probabilities of staying event-free at the requested time horizon for each observation described by the matching row of X. The remaining columns store the estimated cumulative incidence (or probability) for each event.
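A sketch of querying the incidence probabilities at a single time horizon, assuming survival_booster and X_test from the Examples section above (the horizon value 1.0 is arbitrary):
>>> y_proba = survival_booster.predict_proba(X_test, time_horizon=1.0)
>>> y_proba.shape[0] == X_test.shape[0]
True
>>> event_free_proba = y_proba[:, 0]    # column 0: probability of staying event-free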
- predict_survival_function(X, times=None)
Estimate the conditional any-event survival function.
- Parameters:
- X : array-like of shape (n_samples, n_features)
The feature vectors for each observation for which to estimate the survival function.
- times : array-like, default=None
The time horizons at which to estimate the probabilities. If None, it uses the grid generated during fit based on the parameter n_time_grid_steps.
- Returns:
- predicted_curves : ndarray of shape (n_samples, n_times)
The estimated probabilities of staying event-free at different time horizons.
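A sketch of evaluating the any-event survival curve on a custom time grid, reusing survival_booster and X_test from the Examples section above (the grid values are arbitrary):
>>> import numpy as np
>>> custom_times = np.linspace(0.0, float(survival_booster.time_grid_[-1]), num=25)
>>> surv = survival_booster.predict_survival_function(X_test, times=custom_times)
>>> surv.shape == (X_test.shape[0], custom_times.shape[0])
True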
- score(X, y)
Return the negative mean of the IBS for each event of interest and for the any-event survival.
This returns the negative of the mean of the Integrated Brier Score (IBS, a proper scoring rule) of each competing event, as well as the IBS of the survival to any event. The higher the value, the better the model. This sign convention is consistent with the scoring convention of scikit-learn and makes it possible to use this class with scikit-learn model selection utilities such as GridSearchCV and RandomizedSearchCV.
- Parameters:
- X : array-like of shape (n_samples, n_features)
The input samples.
- y : dict with keys “event” and “duration”
The target values. “event” is a boolean array of shape (n_samples,) indicating whether the event was observed or not. “duration” is a float array of shape (n_samples,) indicating the time of the event or the time of censoring.
- Returns:
- score : float
The negative of the time-integrated Brier score (IBS).
TODO: implement time-integrated NLL and use it as the default for the score method to match the objective function used at fit time.
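Because score follows scikit-learn's “higher is better” convention, the estimator can be plugged into standard model selection tools. A minimal sketch, assuming the X_train and y_train data from the Examples section above and an arbitrarily chosen parameter grid:
>>> from sklearn.model_selection import GridSearchCV
>>> from hazardous import SurvivalBoost
>>> param_grid = {"learning_rate": [0.03, 0.05], "max_leaf_nodes": [15, 31]}
>>> search = GridSearchCV(
...     SurvivalBoost(n_iter=3, show_progressbar=False, random_state=0),
...     param_grid,
...     cv=3,
... )
>>> search = search.fit(X_train, y_train)   # uses SurvivalBoost.score internally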
- set_fit_request(*, times: bool | None | str = '$UNCHANGED$') → SurvivalBoost
Request metadata passed to the fit method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see the User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to fit.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
Note: This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- Parameters:
- times : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED
Metadata routing for the times parameter in fit.
- Returns:
- self : object
The updated object.
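A hedged sketch of how this request could be used to route a custom time grid through a scikit-learn Pipeline (assuming a recent scikit-learn with metadata routing support and the X_train and y_train data from the Examples section above; the grid values are arbitrary):
>>> import numpy as np
>>> from sklearn import set_config
>>> from sklearn.pipeline import make_pipeline
>>> from sklearn.preprocessing import StandardScaler
>>> from hazardous import SurvivalBoost
>>> set_config(enable_metadata_routing=True)
>>> pipe = make_pipeline(
...     StandardScaler(),
...     SurvivalBoost(
...         n_iter=3, show_progressbar=False, random_state=0
...     ).set_fit_request(times=True),
... )
>>> custom_times = np.linspace(0.0, 10.0, num=50)
>>> pipe = pipe.fit(X_train, y_train, times=custom_times)   # times is routed to SurvivalBoost.fit
>>> set_config(enable_metadata_routing=False)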
- set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
- Parameters:
- **params : dict
Estimator parameters.
- Returns:
- self : estimator instance
Estimator instance.
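A short usage sketch, updating hyperparameters on the estimator from the Examples section above before refitting:
>>> survival_booster = survival_booster.set_params(n_iter=10, learning_rate=0.1)
>>> survival_booster.get_params()["n_iter"]
10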
- set_predict_proba_request(*, time_horizon: bool | None | str = '$UNCHANGED$') → SurvivalBoost
Request metadata passed to the predict_proba method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see the User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to predict_proba if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to predict_proba.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
Note: This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- Parameters:
- time_horizon : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED
Metadata routing for the time_horizon parameter in predict_proba.
- Returns:
- self : object
The updated object.