hazardous.data.load_seer

hazardous.data.load_seer(input_path, event_column_name='COD to site recode', duration_column_name='Survival months', events_of_interest=('Breast', 'Diseases of Heart'), censoring_labels=('Alive',), other_event_name='Other', survtrace_preprocessing=False, return_X_y=False)

Load the SEER dataset and optionally apply the same preprocessing as SurvTRACE.

The file is expected to be a txt file.

Parameters:
input_path : str or file_path

The path of the txt file.

events_of_interest : tuple of str or “all”, default=(“Breast”, “Diseases of Heart”)

If “all”, all event types are preserved. Otherwise, specify the labels of the events of interest to extract. All other events are collapsed into an “Other” event with a dedicated integer event code.

censoring_labels : tuple of str, default=(“Alive”,)

The label(s) used in the COD (cause of death) column that should be interpreted as censoring marker(s) in the original dataset.

other_event_name : str, default=“Other”

When events_of_interest is not “all”, this parameter controls the name of the collapsed competing event.

survtrace_preprocessing : bool, default=False

If set to True, apply the preprocessing steps used in SurvTRACE to ensure reproducibility of the results in its paper. Note that to fully replicate the preprocessing used by SurvTRACE, one would also need to recode the “Other” competing event as 0 to treat it as censoring.
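The interplay of events_of_interest, censoring_labels and other_event_name can be illustrated with a minimal, dependency-free sketch. It is not the library's implementation: the helper names, the toy cause-of-death values and the “censoring” placeholder at index 0 are assumptions made for illustration. It also shows the extra recoding step mentioned above, where the collapsed “Other” event is treated as censoring.

```python
def encode_events(
    cod_labels,
    events_of_interest=("Breast", "Diseases of Heart"),
    censoring_labels=("Alive",),
    other_event_name="Other",
):
    """Hypothetical helper: map cause-of-death labels to integer event codes."""
    # Assumed layout: index 0 stands for censoring, then the events of
    # interest in the given order, then the collapsed competing event.
    event_labels = ["censoring", *events_of_interest, other_event_name]
    other_code = len(event_labels) - 1
    codes = []
    for label in cod_labels:
        if label in censoring_labels:
            codes.append(0)
        elif label in events_of_interest:
            codes.append(events_of_interest.index(label) + 1)
        else:
            # Any other cause of death is collapsed into the "Other" event.
            codes.append(other_code)
    return codes, event_labels


def recode_other_as_censoring(codes, event_labels, other_event_name="Other"):
    """Hypothetical helper: treat the collapsed event as censoring (code 0)."""
    other_code = event_labels.index(other_event_name)
    return [0 if code == other_code else code for code in codes]


# Toy cause-of-death column (made-up values, not real SEER records).
cod = ["Alive", "Breast", "Stomach", "Diseases of Heart", "Alive"]
codes, event_labels = encode_events(cod)
print(codes)  # [0, 1, 3, 2, 0]
print(recode_other_as_censoring(codes, event_labels))  # [0, 1, 0, 2, 0]
```

With real data the same recoding would be applied to the “event” column of the returned target rather than to a plain list.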

Returns:
bunch_dataset : a Bunch object with the following attributes:
data : pandas.DataFrame of shape (n_samples, n_features)

The dataframe of features.

target : pandas.DataFrame of shape (n_samples, 2)

The two columns are named “event” and “duration”.

  • The “event” column holds the integer identifiers of the events of interest, or 0 for censoring. The meaning of each integer code is defined by its position in the event_labels list.

  • The “duration” column holds a numerical value for the event-free duration, expressed in months.

TODO: document what t0 means.

event_labels : list of str

The labels of the events.

original_data : pandas.DataFrame of shape (n_samples, 29)

The original data.
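As a rough illustration of how the returned attributes fit together, the sketch below decodes integer event codes through their position in event_labels and converts durations from months to years. It uses plain Python stand-ins rather than the actual Bunch and DataFrame objects; the rows and the “censoring” entry at index 0 are made-up assumptions.

```python
from collections import Counter

# Toy stand-ins for the event_labels and target attributes (values made up).
event_labels = ["censoring", "Breast", "Diseases of Heart", "Other"]
target = [
    {"event": 0, "duration": 60},
    {"event": 1, "duration": 12},
    {"event": 3, "duration": 30},
    {"event": 1, "duration": 48},
]

# Decode each integer code through its position in event_labels.
counts = Counter(event_labels[row["event"]] for row in target)

# Durations are expressed in months; convert to years for reporting.
durations_years = [row["duration"] / 12 for row in target]

print(dict(counts))
print(durations_years)
```

The same lookup applies to the “event” column of the real target dataframe, assuming the codes index into event_labels as described above.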