Usage#
Pre-requisites#
Data and compute environment#
This code targets the computing and data environment of the Greater Paris University Hospitals (AP-HP). The data must be in the OMOP format.
Install the package#
Install Poetry and Python 3.10. Then, inside the project folder, run the following command:
poetry install
Create the study population#
Three populations are available, corresponding to three predictive tasks: length-of-stay (LOS) interpolation, prognosis of the next diagnosis billing codes (grouped into 21 ICD-10 chapters), and prognosis of major adverse cardiovascular events (MACE).
The scripts for building each population are in medem.populations. To build the length-of-stay population, run:
poetry run python medem/populations/t1_los_population.py
The configurations for each population are in medem.experiences.configurations.py. Most importantly, the user specifies the database name. At loading time, the code looks for this database in the AP-HP Hive warehouse, under:
hdfs://bbsedsi/apps/hive/warehouse/bigdata/omop_exports_prod/hive/{database_name}.db/
The parameters that can be specified are:
"database_name": "cse210038_20220921_160214312112",
"cohort_name": "complete_hospitalization_los", # name
"study_start": parse("2017-01-01"), # start of the study period
"study_end": parse("2022-06-01"), # end of the study period. Outside of this range, all data is thrown away
"min_age_at_admission": 18, # minimum age at admission
"sup_quantile_visits": 0.95, # exclude patients having a number of visits above the resulting threshold number of visits per patient
"task_name": TASK_LOS_CATEGORICAL, # define the prognosis task
"los_categories": np.array([0, 7, np.inf]), # define the categories for the task,
"with_incomplete_hospitalization": True, # include also outpatient visits ?
"visits_w_billing_codes_only": True, # keep only visits with billing codes ?
"horizon_in_days": 30, # used for avoiding right-censoring
"event_tables": DEFAULT_EVENT_CONFIG, # what event tables to use as features
"test_size": 0.2,
"n_min_events": 10, # minimum number of events for a given medical code be included. Too rare events are thrown away.
The underlying functions used for all populations are medem.preprocessing.selection.py:select_population and medem.preprocessing.selection.py:create_outcome. Refer to these functions for the full details of the population flowcharts and task definitions.
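As an illustration of how bin edges such as los_categories could define a categorical length-of-stay target, here is a minimal sketch using np.digitize. It is not the body of create_outcome: the sample lengths of stay are made up, and the convention that each bin includes its lower edge and excludes its upper edge is an assumption.

```python
import numpy as np

# Bin edges in days, as in the population configuration.
los_categories = np.array([0, 7, np.inf])

# Hypothetical lengths of stay, in days.
los_in_days = np.array([0.5, 3.0, 7.0, 12.0])

# np.digitize returns i such that edges[i-1] <= x < edges[i];
# subtracting 1 gives 0-based category labels.
labels = np.digitize(los_in_days, los_categories) - 1
print(labels.tolist())  # -> [0, 0, 1, 1]
```

Under this assumed convention, a stay of exactly 7 days falls into the second category (7 days or longer).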
Run an experiment#
The scripts to benchmark different machine learning pipelines are in medem.experiences.setups. For example, to launch the benchmark on the length-of-stay task:
poetry run python medem/experiences/setups/los_prediction.py
The configurations for the experiments are in medem.experiences.configurations.py. The available parameters are:
CONFIG_LOS_ESTIMATION = {
"validation_size": 0.1, # validation size
"subtrain_size": [0.1, 0.5, 1] # size of the succesives effective train sets
"splitting_rs": list(range(5)), # random seeds
"estimator_config": ESTIMATORS_TASK_LOS, # list of estimators to benchmark
"featurizer_config": FEATURIZERS_TASK_LOS, # list of featurizers to benchmark
"randomsearch_scoring": "roc_auc", # scoring function for the random search
"randomsearch_n_iter": 10, # number of random search iterations
"randomsearch_rs": 0, # random seed for the random search
"n_min_events": 10, # minimum number of events for a given medical code be included. Too rare events are thrown away.
"colname_demographics": [
STATIC_FEATURES_DICT,
], # list of static features to include
"local_embeddings_params": {
"colname_concept": COLNAME_SOURCE_CODE,
"window_radius_in_days": 30,
"window_orientation": "center",
"backend": "pandas",
"d": N_COMPONENTS,
}, # parameters passed to the local embeddings pipeline if present in featurizer config
}
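To make the roles of validation_size, subtrain_size and splitting_rs more concrete, the sketch below enumerates one run per (subtrain size, seed) pair. It only illustrates the idea and is not the benchmark code: the patient identifiers are made up, and it assumes that the validation set is held out first and that subtrain_size is a fraction of the remaining training patients.

```python
import itertools
import random

# Subset of the experiment configuration shown above.
config = {
    "validation_size": 0.1,
    "subtrain_size": [0.1, 0.5, 1],
    "splitting_rs": list(range(5)),
}

train_ids = list(range(100))  # hypothetical training patient identifiers

runs = []
for subtrain_size, seed in itertools.product(
    config["subtrain_size"], config["splitting_rs"]
):
    # One reproducible shuffle per random seed.
    rng = random.Random(seed)
    shuffled = train_ids[:]
    rng.shuffle(shuffled)

    # Assumption: the validation set is held out before subsampling.
    n_validation = round(config["validation_size"] * len(shuffled))
    validation_ids = shuffled[:n_validation]
    remaining_ids = shuffled[n_validation:]

    # Assumption: subtrain_size is a fraction of the remaining training patients.
    n_subtrain = round(subtrain_size * len(remaining_ids))
    subtrain_ids = remaining_ids[:n_subtrain]

    runs.append(
        {
            "subtrain_size": subtrain_size,
            "seed": seed,
            "n_train": len(subtrain_ids),
            "n_validation": len(validation_ids),
        }
    )

print(len(runs))  # -> 15 (3 subtrain sizes x 5 seeds)
```

Each estimator and featurizer pair listed in the configuration would then be fitted once per run, which is why the number of seeds and subtrain sizes directly multiplies the compute cost.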
To run on the Slurm cluster of the AP-HP data warehouse, use the dedicated sbatch scripts. You may need to change some paths in these scripts, since they are specific to a given AP-HP user and project:
cd scripts/experiences/
mkdir logs
sbatch los_sbatch.sh
NB: all the code needed to run experiments with the transformer model is in a fork of the original cehr-bert transformer model.