Configuration setup

This document provides details about the configuration structures for the medpipe package. See the config-examples folder for examples.

Configuration file structure

Configuration files supply the variables and options used by the package. They are written in TOML. There are three main configuration files: one for the data (loading, preprocessing, etc.), one for the models (name, type, hyperparameters, etc.), and one for the logger (message, log location, etc.).

The configuration files are nested in folders and subfolders with the following structure:

├── data/
│   ├── features/
│   │   └── name_features-v1.toml
│   ├── general/
│   │   └── name_general-v0.toml
│   ├── preprocessing/
│   │   └── name_preprocessing-v0.toml
│   └── splitting/
│       └── name_splitting-v1.toml
├── model/
│   ├── calibrator/
│   │   └── name_calibrator-v1.toml
│   ├── hyperparameters/
│   │   └── name_hyperparameters-v1.toml
│   ├── imbalance/
│   │   └── name_imbalance-v0.toml
│   └── labels/
│       └── name_labels-va.toml
├── data_config.toml
├── log_config.toml
└── model_config.toml

Naming conventions

The top-level configuration files (data_config.toml, log_config.toml, and model_config.toml) do not follow any particular naming rules and can have any name.

The folders (data/ and model/) do not follow any particular naming rules and can have any name. However, the subfolder names must match those shown in the tree above, and they must be listed in the top-level configuration files.

The subfolder .toml files must be structured as name_subfolder-vX.toml, where X is an integer or a letter. The name variable is provided in the top-level configuration files, the subfolder variable must match the name of the subfolder, and X must match the version number provided in the top-level configuration. The labels version number is a letter, which separates the model parameters from the data parameters.
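The naming rule above can be sketched as a small helper. This is only an illustration: the function subfolder_config_path is hypothetical and is not part of medpipe, and "name" is the placeholder prefix used in the tree above.

```python
from pathlib import PurePosixPath


def subfolder_config_path(config_dir, subfolder, name, version, extension=".toml"):
    """Build <config_dir>/<subfolder>/<name>_<subfolder>-v<version><extension>."""
    filename = f"{name}_{subfolder}-v{version}{extension}"
    return PurePosixPath(config_dir) / subfolder / filename


# Integer version for a data subfolder, letter version for the labels subfolder.
features_path = subfolder_config_path("data", "features", "name", 1)
labels_path = subfolder_config_path("model", "labels", "name", "a")
print(features_path)  # data/features/name_features-v1.toml
print(labels_path)    # model/labels/name_labels-va.toml
```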

Data configuration

The data_config.toml file contains main parameters and a data table with the following parameters:

| Key | Type | Description |
| --- | --- | --- |
| version | string | Version number for the data configuration. |
| base_dir | string | Base directory location. |

NOTE: The version number is formatted as vX.Y where X and Y are integers. The version number contains as many numbers as there are subfolders. The numbers are parsed to fetch the correct file version in the subfolders.

Data table

These parameters are used to extract, load, and save the data. The table is named [data_parameters].

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| dir | string | N/A | Location of data configuration files from base_dir. |
| subfolders | list[string] | See note | List of subfolders to search for data configuration files. |
| name | string | N/A | Prefix name of the data configuration files. |
| extension | string | '.toml' | Extension of the data configuration files. |

NOTE: The subfolders must be ['general', 'features'] for the data parameters.

Data subfolders

Features

The features configuration files contain the list of features to extract from the database. The feature list is contained in a TOML table named [features].

| Key | Type | Description |
| --- | --- | --- |
| feature_list | list[string] | List of the features, including prediction labels, to extract from the data. |

General

The general configuration files contain main parameters, an I/O table, and a DB table with the following parameters:

| Key | Type | Description |
| --- | --- | --- |
| base_dir | string | Base directory location. |

I/O table

These parameters are used for saving and loading data. The table is named [io_parameters].

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| dir | string | N/A | Location of the data from base_dir. |
| name | string | N/A | Prefix name of the data file. |
| extension | string | '.csv' | Extension of the data file. |

DB table

These parameters are used for opening and extracting data from a database. The table is named [db_parameters].

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| dir | string | N/A | Location of the database from base_dir. |
| name | string | N/A | Prefix name of the database file. |
| table_name | string | 'main' | Name of the table to query. |
| extension | string | '.db' | Extension of the database file. |

Preprocessing

The preprocessing configuration files contain the preprocessing operations and the feature list on which to apply them. The variables are stored in tables with the following parameters:

Preprocessing table

| Key | Type | Description |
| --- | --- | --- |
| preprocess | bool | Flag to apply operations. |

NOTE: The v0 preprocessing file only contains the preprocessing table with the preprocess flag set to false.

The following tables can be removed if the corresponding preprocessing operations do not need to be performed.

Ordinal encoder table

These parameters are used for the ordinal encoder preprocessing operation. The table is named [preprocessing.ordinal_encoder].

| Key | Type | Description |
| --- | --- | --- |
| feature_list | list[string] | List of the features to encode. |

Standardise table

These parameters are used for the standardise preprocessing operation. The table is named [preprocessing.standardise].

| Key | Type | Description |
| --- | --- | --- |
| feature_list | list[string] | List of the features to standardise. |

Bin table

These parameters are used for the bin preprocessing operation. The table is named [preprocessing.bin].

| Key | Type | Description |
| --- | --- | --- |
| feature_list | list[string] | List of the features to bin. |

Splitting

The splitting configuration files contain the variables used for splitting the data for cross validation. The variables are contained in a TOML table named [split_variables].

| Key | Type | Description |
| --- | --- | --- |
| temporal_k_fold | bool | Flag to use group K-fold cross validation. |
| n_splits | int | Number of splits to create for the cross validation. |
| shuffle | bool | Flag to shuffle the groups. |
| random_state | int | Random seed used for repeatability. |
| group_name | string | Feature to use to create the groups. |

Model configuration

The model_config.toml file contains main parameters, a model table, a data table, an I/O table, and a fig table with the following parameters:

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| version | string | N/A | Version number for the model configuration. |
| predictor_type | string | 'hgb-c' | Predictor type to use. |
| base_dir | string | N/A | Base directory location. |

NOTE: The version number is formatted as vA.B.C.D-W.X.Y.Z where W is a letter and A, B, C, D, X, Y, and Z are integers. The A.B.C.D portion represents the data version number and the W.X.Y.Z the model version number. The version number contains as many symbols as there are subfolders. The symbols are parsed to fetch the correct file version in the subfolders.

Model table

These parameters are used to create the models. The table is named [model_parameters].

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| dir | string | N/A | Location of model configuration files from base_dir. |
| subfolders | list[string] | See note | List of subfolders to search for model configuration files. |
| name | string | N/A | Prefix name of the model configuration files. |
| extension | string | '.toml' | Extension of the model configuration files. |

NOTE: The subfolders must be ['labels', 'imbalance', 'hyperparameters', 'calibrator'] for the model parameters.

Data table

These parameters are used to load and preprocess the data. The table is named [data_parameters].

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| dir | string | N/A | Location of data configuration files from base_dir. |
| subfolders | list[string] | See note | List of subfolders to search for data configuration files. |
| name | string | N/A | Prefix name of the data configuration files. |
| extension | string | '.toml' | Extension of the data configuration files. |

NOTE: The subfolders must be ['general', 'features', 'splitting', 'preprocessing'] for the data parameters.
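The model version string can be split along the same lines. The sketch below assumes that the symbols appear in the same order as the two subfolder lists given in the notes above; that ordering is an assumption, and the helper parse_model_version is hypothetical.

```python
DATA_SUBFOLDERS = ["general", "features", "splitting", "preprocessing"]
MODEL_SUBFOLDERS = ["labels", "imbalance", "hyperparameters", "calibrator"]


def parse_model_version(version):
    """Split 'vA.B.C.D-W.X.Y.Z' into per-subfolder versions.

    Assumes symbols are listed in the same order as the subfolders.
    """
    data_part, model_part = version.removeprefix("v").split("-")
    data_symbols = data_part.split(".")
    model_symbols = model_part.split(".")
    if len(data_symbols) != len(DATA_SUBFOLDERS):
        raise ValueError("data version needs one symbol per data subfolder")
    if len(model_symbols) != len(MODEL_SUBFOLDERS):
        raise ValueError("model version needs one symbol per model subfolder")
    return (
        dict(zip(DATA_SUBFOLDERS, data_symbols)),
        dict(zip(MODEL_SUBFOLDERS, model_symbols)),
    )


data_versions, model_versions = parse_model_version("v0.1.1.0-a.0.1.1")
print(data_versions)
# {'general': '0', 'features': '1', 'splitting': '1', 'preprocessing': '0'}
print(model_versions)
# {'labels': 'a', 'imbalance': '0', 'hyperparameters': '1', 'calibrator': '1'}
```

Under this assumption, "v0.1.1.0-a.0.1.1" selects exactly the file versions shown in the folder tree at the top of this document.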

I/O table

These parameters are used for saving and loading models. The table is named [io_parameters].

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| dir | string | N/A | Location of models to save or load from base_dir. |
| name | string | N/A | Prefix name of the I/O configuration files. |
| extension | string | '.pkl' | Extension of the models to load or save. |

Fig table

These parameters are used for saving figures and results. The table is named [fig_parameters].

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| dir | string | N/A | Location to save figures from base_dir. |
| name | string | N/A | Prefix name of the fig configuration files. |
| extension | string | '.png' | Extension of the figure to save. |

Model subfolders

Calibrator

The calibrator configuration files contain the variables used to create the calibration layer. The calibrator model type is stored in the [calibrator] table.

| Key | Type | Description |
| --- | --- | --- |
| calibrator_type | string | Type of calibrator model. |

Hyperparameters for the calibrator can be passed by specifying the keys in the [calibrator.hyperparameters] table.

NOTE: The v1 file creates a logistic regression calibrator and the v2 file an isotonic regression calibrator.

Hyperparameters

The hyperparameters configuration files contain the hyperparameters for the predictor model that is used. The variables are contained in the [hyperparameters] table. The keys must match those for the model created.

Imbalance

The imbalance configuration files contain the variables used to address class imbalance in the data. They are organised into a weighting table and a sampler table with the following parameters:

Weighting table

These parameters are used to apply weights to the examples or classes. The table is named [weighting].

| Key | Type | Description |
| --- | --- | --- |
| weighting_fn | string | Weighting function to use. |

Sampler table

These parameters are used to sample the examples. The table is named [sampler].

| Key | Type | Description |
| --- | --- | --- |
| sampler_fn | string | Sampler function to use. |
| target_ratio | float | Target ratio between majority and minority examples to achieve. |
| hard_percent | float | Percent of examples deemed 'hard' to include. Only used with certain functions. |

Labels

The labels configuration files contain the list of prediction labels. The label list is contained in a TOML table named [labels].

| Key | Type | Description |
| --- | --- | --- |
| label_list | list[string] | List of the labels to predict. |

NOTE: The label list version number is not an integer but a letter.

Logger configuration

The log_config.toml file contains the following parameters for the logger:

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| version | string | N/A | Version number for the logger configuration. |
| base_dir | string | N/A | Base directory location. |
| log_dir | string | N/A | Location to save log files in. |
| log_message | string | "Exception raised in " | Message used by logger when writing to log file. |
| print_message | string | "[ERROR] An exception was raised.\n\nCheck logs " | Message used by logger when printing to terminal. |