Data Reference
This page documents the data sub-package.
medpipe.data.Preprocessor
Preprocessor class.
This class creates a Preprocessor to prepare data.
Preprocessor
Class that creates a Preprocessor.
Attributes:
| Name | Type | Description |
|---|---|---|
| preprocess | bool | Flag to preprocess data or not. |
| transform_seq | dict[str, dict[str, list[str]]] | Transformation sequence for the data. |
| logger | logging.Logger or None, default: None | Logger object to log prints. If None print to terminal. |
Methods:
| Name | Description |
|---|---|
| __init__ | Init method. |
| _clean_data | Cleans data in preparation for transformation. |
| fit_transform | Fits the operations and transforms the input data. |
| fit | Fits the operations based on input data. |
| transform | Transforms input data based on fitted operations. |
Source code in src/medpipe/data/Preprocessor.py
__init__(preprocessor_config, logger=None)
Initialise a Preprocessor class instance.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| preprocessor_config | dict[str, dict[str, list[str]]] | Configuration parameters for the preprocessor object. | required |
| logger | Logger or None | Logger object to log prints. If None print to terminal. | None |
Returns:
| Type | Description |
|---|---|
| None | Nothing is returned. |
Source code in src/medpipe/data/Preprocessor.py
fit(X)
Fits the operations based on input data.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| X | pd.DataFrame of shape (n_samples, n_features) | Data to clean. | required |
Returns:
| Type | Description |
|---|---|
| None | Nothing is returned. |
Source code in src/medpipe/data/Preprocessor.py
fit_transform(X)
Fits the operations and transforms the input data.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| X | pd.DataFrame of shape (n_samples, n_features) | Data to clean. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| data | pd.DataFrame of shape (n_samples, n_features) | Transformed data. |
Source code in src/medpipe/data/Preprocessor.py
transform(X)
Transforms input data based on fitted operations.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| X | pd.DataFrame of shape (n_samples, n_features) | Data to clean. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| data | pd.DataFrame of shape (n_samples, n_features) | Transformed data. |
Source code in src/medpipe/data/Preprocessor.py
medpipe.data.db
Database functions module.
This module provides functions to open, query, and save data from databases.
Functions:
- parquet_to_db: Converts a .parquet file to a .db file.
- extract_data_from_db: Queries a SQL .db to extract data.
extract_data_from_db(db_file, query)
Extracts data from a .db and saves it to a .csv file.
The parquet file
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| db_file | str | Path to the .db file. | required |
| query | str | Query to send to extract data. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| data | DataFrame | Extracted data from the database. |
Raises:
| Type | Description |
|---|---|
| TypeError | If db_file or query is not a str. |
| FileNotFoundError | If db_file does not exist. |
| IsADirectoryError | If db_file is not a file. |
| ValueError | If db_file extension is not a .sqlite3 file. |
Source code in src/medpipe/data/db.py
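The sketch below is not the library's source. It is a minimal, standard-library illustration of the validation order implied by the Raises table, followed by a query. The name `query_db` and the `suffix` parameter are hypothetical, and the sketch returns a list of row tuples rather than a DataFrame so that it needs no third-party packages. The expected extension is left configurable because this page mentions both .db and .sqlite3.

```python
import sqlite3
from pathlib import Path


def query_db(db_file, query, suffix=".db"):
    """Validate the path, run the query, and return all rows as tuples.

    Hypothetical stand-in for illustration; not medpipe's implementation.
    """
    if not isinstance(db_file, str) or not isinstance(query, str):
        raise TypeError("db_file and query must be str")
    path = Path(db_file)
    if not path.exists():
        raise FileNotFoundError(db_file)
    if path.is_dir():
        raise IsADirectoryError(db_file)
    if path.suffix != suffix:
        raise ValueError(f"expected a {suffix} file, got {path.suffix!r}")
    conn = sqlite3.connect(path)
    try:
        # fetchall() materialises the full result set in memory.
        return conn.execute(query).fetchall()
    finally:
        conn.close()
```

The checks run from cheapest to most specific, so a wrong argument type is reported before the file system is touched.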
parquet_to_db(parquet_file, db_file, table_name='main')
Converts a .parquet file to a .db file.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| parquet_file | str | File path to the .parquet file. | required |
| db_file | str | File path to the .db file. | required |
| table_name | default: 'main' | Name of the table to create in the SQL database. | 'main' |
Returns:
| Type | Description |
|---|---|
| None | Nothing is returned. |
Raises:
| Type | Description |
|---|---|
| TypeError | If parquet_file or db_file are not str. |
| FileNotFoundError | If parquet_file does not exist. |
| IsADirectoryError | If parquet_file or db_file are not a file. |
| ValueError | If parquet_file extension is not a .parquet file. |
| ValueError | If db_file extension is not a .sqlite3 file. |
Source code in src/medpipe/data/db.py
medpipe.data.preprocessing
Preprocessing functions module.
This module provides functions to preprocess data before training.
Functions:
- train_test_it: Creates a KFold iterator to split data into test and train sets.
- get_validation_idx: Removes some of the indices to create a validation set.
- convert_object_to_categorical: Converts object columns to categoricals.
- fit_preprocess_operations: Fits processing operations to data.
- bin_score: Bins the M3 score into 5 categories.
- extract_labels: Extracts prediction labels from data.
bin_score(data)
Bins the M3 score into 5 categories.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data | DataFrame | M3 score data. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| binned_data | DataFrame | Binned data. |
Source code in src/medpipe/data/preprocessing.py
convert_object_to_categorical(data)
Converts all object columns of a DataFrame to categoricals.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data | DataFrame | DataFrame to manipulate. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| processed_data | DataFrame | Processed DataFrame. |
Raises:
| Type | Description |
|---|---|
| TypeError | If data is not a pd.DataFrame. |
Source code in src/medpipe/data/preprocessing.py
extract_labels(data, labels)
Extracts the prediction labels from the training data.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data | DataFrame | DataFrame to manipulate. | required |
| labels | list(str) | List of labels to extract from the data. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| X | DataFrame | DataFrame containing the data. |
| y | array-like | Array containing the prediction labels. |
Raises:
| Type | Description |
|---|---|
| TypeError | If data is not a pd.DataFrame. |
| TypeError | If labels is not list(str). |
| KeyError | If a prediction label is not a valid key. |
Source code in src/medpipe/data/preprocessing.py
fit_preprocess_operations(data, preprocessing_dict)
Fits processing operations to data.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data | DataFrame | DataFrame to manipulate. | required |
| preprocessing_dict | dict[str, list[str]] | Dictionary of the operations and the features on which to operate. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| operation_dict | dict[] | Dictionary of the different preprocessing objects. |
Raises:
| Type | Description |
|---|---|
| TypeError | If data is not a pd.DataFrame. If features is not a list(str). |
| KeyError | If a feature is not a valid key. |
| ValueError | If preprocess is not a valid preprocessing function. |
Source code in src/medpipe/data/preprocessing.py
get_validation_idx(idx_list, groups=None, val_size=0.1)
Removes some of the indices to create a validation set.
If groups are provided, all the indices of the group with the largest value are selected as the validation set.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| idx_list | array(n_samples) | Indices of the set to split. | required |
| groups | Series(n_samples) or None | Groups to which the train indices belong. Must be numeric. | None |
| val_size | float | Size of the validation set if groups are None. | 0.1 |
Returns:
| Name | Type | Description |
|---|---|---|
| train_idx | array | Train indices. |
| val_idx | array | Validation indices. |
Source code in src/medpipe/data/preprocessing.py
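The sketch below is not the library's source. It illustrates the splitting rule described for get_validation_idx with plain Python lists: when groups are given, every index in the group with the largest value becomes validation. The function name `split_validation`, the `seed` parameter, and the random hold-out used when no groups are given are assumptions; the page only states that val_size sets the validation size in that case. Indices are assumed to be unique and aligned with groups.

```python
import random


def split_validation(idx_list, groups=None, val_size=0.1, seed=0):
    """Return (train_idx, val_idx). Hypothetical sketch, not medpipe's code."""
    idx_list = list(idx_list)
    if groups is not None:
        groups = list(groups)
        largest = max(groups)
        # Every index whose group equals the largest value goes to validation.
        val_idx = [i for i, g in zip(idx_list, groups) if g == largest]
        train_idx = [i for i, g in zip(idx_list, groups) if g != largest]
        return train_idx, val_idx
    # Assumed behaviour without groups: hold out a random fraction.
    rng = random.Random(seed)
    n_val = int(round(len(idx_list) * val_size))
    val_set = set(rng.sample(idx_list, n_val))
    train_idx = [i for i in idx_list if i not in val_set]
    val_idx = [i for i in idx_list if i in val_set]
    return train_idx, val_idx
```

With numeric groups such as years, the grouped branch holds out the most recent group, which fits a temporal split.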
train_test_it(temporal_k_fold=False, **kwargs)
Creates a KFold iterator to split data into test and train sets.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| temporal_k_fold | bool | If True, the data will be split using a group and a GroupKFold iterator is returned. | False |
| **kwargs | | Extra arguments for the StratifiedKFold or GroupKFold class. | {} |
Returns:
| Name | Type | Description |
|---|---|---|
| kfold_it | StratifiedKFold or GroupKFold | KFold iterator. |
Raises:
| Type | Description |
|---|---|
| ValueError | If n_splits is less than 2. |
Source code in src/medpipe/data/preprocessing.py
medpipe.data.weighting
Weighting functions module.
This module provides functions to create sample weights to address class imbalance.
Functions:
- inverse_frequency_multiclass_sample_weights: Create sample weights using the total number of samples over the number of positive and negative samples.
- inverse_frequency_single_sample_weights: Create sample weights using the inverse frequency of positive and negative samples.
- inverse_frequency_class_weights: Create class weights using inverse frequency of classes.
- negative_positive_ratio_sample_weights: Create sample weights using the ratio between negative and positive classes.
- negative_positive_ratio_class_weights: Create class weights using the ratio between negative and positive classes.
inverse_frequency_class_weights(labels)
Create class weights of the positive class using inverse frequency of the positive class.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| labels | array-like | Binary prediction labels of shape (n_samples, n_classes). | required |
Returns:
| Name | Type | Description |
|---|---|---|
| class_weights | array(n_classes) | Weight for each class. |
Raises:
| Type | Description |
|---|---|
| TypeError | If labels is not array-like. |
| ValueError | If labels is empty. |
| ZeroDivisionError | If there are no positive labels. |
Source code in src/medpipe/data/weighting.py
inverse_frequency_multiclass_sample_weights(labels)
Create sample weights using the total number of samples over the number of positive and negative samples.
Each class has its own set of weights for positive and negative examples based on the number of positive and negative examples in that class.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| labels | array-like | Binary prediction labels of shape (n_samples, n_classes). | required |
Returns:
| Name | Type | Description |
|---|---|---|
| sample_weights | array(n_samples, n_classes) | Weight for each sample. |
Raises:
| Type | Description |
|---|---|
| TypeError | If labels is not array-like. |
| ValueError | If labels is empty. |
| ZeroDivisionError | If there are no positive labels. |
Notes
For each class, the weights are calculated as: len(labels) / (pos_weight + neg_weight), where pos_weight is an array of shape (n_samples, n_classes) for the positive examples with the total number of positive samples in each class, and neg_weight is similar but for the negative examples.
Source code in src/medpipe/data/weighting.py
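The sketch below is not the library's source. It works through the formula in the Notes for inverse_frequency_multiclass_sample_weights using nested lists: for class c, a positive entry receives N / n_pos[c] and a negative entry receives N / n_neg[c], where N is the number of samples. The function name `multiclass_sample_weights` is hypothetical, and the explicit ZeroDivisionError check is an assumption that mirrors the Raises table.

```python
def multiclass_sample_weights(labels):
    """labels: list of rows, each a list of 0/1 values, one per class.

    Hypothetical sketch of the documented formula, not medpipe's code.
    """
    if not labels:
        raise ValueError("labels is empty")
    n_samples = len(labels)
    n_classes = len(labels[0])
    # Per-class counts of positive and negative examples.
    n_pos = [sum(row[c] for row in labels) for c in range(n_classes)]
    n_neg = [n_samples - p for p in n_pos]
    if any(p == 0 for p in n_pos):
        # Assumed to match the documented behaviour for missing positives.
        raise ZeroDivisionError("a class has no positive labels")
    return [
        [n_samples / (n_pos[c] if row[c] else n_neg[c]) for c in range(n_classes)]
        for row in labels
    ]
```

For four samples where class 0 has one positive and class 1 has two, the positive entry of class 0 is weighted 4 / 1 = 4.0, its negatives 4 / 3, and every entry of class 1 is weighted 4 / 2 = 2.0.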
inverse_frequency_single_sample_weights(labels)
Create sample weights using the inverse frequency of positive and negative samples.
One set of weights is created and used for each class based on the total number of positive and negative examples. Weights are normalised so that negative weights are 1.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| labels | array-like | Binary prediction labels of shape (n_samples, n_classes). | required |
Returns:
| Name | Type | Description |
|---|---|---|
| sample_weights | array(n_samples) | Weight for each sample. |
Raises:
| Type | Description |
|---|---|
| TypeError | If labels is not array-like. |
| ValueError | If labels is empty. |
| ZeroDivisionError | If there are no positive labels. |
Source code in src/medpipe/data/weighting.py
negative_positive_ratio_class_weights(labels)
Create class weights of the positive class using the ratio between the number of samples in the negative and positive classes.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| labels | array-like | Binary prediction labels of shape (n_samples, n_classes). | required |
Returns:
| Name | Type | Description |
|---|---|---|
| class_weights | array(n_classes) | Weight for each class. |
Raises:
| Type | Description |
|---|---|
| TypeError | If labels is not array-like. |
| ValueError | If labels is empty. |
| ZeroDivisionError | If there are no positive labels. |
Source code in src/medpipe/data/weighting.py
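The sketch below is not the library's source. It shows one plausible reading of the negative-to-positive ratio described for negative_positive_ratio_class_weights: the weight of class c is n_neg[c] / n_pos[c]. The function name `neg_pos_ratio_class_weights` is hypothetical, and labels are taken as nested lists of 0/1 values.

```python
def neg_pos_ratio_class_weights(labels):
    """Return one weight per class: negatives divided by positives.

    Hypothetical sketch, not medpipe's code.
    """
    if not labels:
        raise ValueError("labels is empty")
    n_samples = len(labels)
    n_classes = len(labels[0])
    weights = []
    for c in range(n_classes):
        n_pos = sum(row[c] for row in labels)
        n_neg = n_samples - n_pos
        # Python raises ZeroDivisionError here when a class has no positives.
        weights.append(n_neg / n_pos)
    return weights
```

A class with three negatives and one positive gets weight 3.0, so each positive example counts as much as three negatives.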
negative_positive_ratio_sample_weights(labels)
Create sample weights using the ratio between negative and positive samples.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| labels | array-like | Binary prediction labels of shape (n_samples, n_classes). | required |
Returns:
| Name | Type | Description |
|---|---|---|
| sample_weights | array(n_samples, n_classes) | Weight for each sample. |
Raises:
| Type | Description |
|---|---|
| TypeError | If labels is not array-like. |
| ValueError | If labels is empty. |
| ZeroDivisionError | If there are no positive labels. |
Source code in src/medpipe/data/weighting.py
medpipe.data.sampler
Sampler functions module.
This module provides functions to sample the data to address class imbalance.
Functions:
- data_sampler: Samples the data and labels to adjust the class imbalance.
- random_undersampler: Randomly select labels to achieve the target ratio between minority and majority classes by undersampling majority class.
- group_random_undersampler: Randomly select labels to achieve the target ratio between minority and majority classes in each group.
- random_oversampler: Randomly select labels to achieve the target ratio between minority and majority classes by oversampling minority class.
- group_random_oversampler: Randomly select labels to achieve the target ratio between minority and majority classes in each group.
- mean_dist_sampler: Computes the mean data sample of the majority class and uses the distance to it to select examples.
- group_mean_dist_sampler: Computes the mean data sample of the majority class in each group and uses the distance to it to select examples.
- smote: Oversample minority class using Synthetic Minority Over-Sampling Technique (SMOTE).
- group_smote: Oversample minority class using Synthetic Minority Over-Sampling Technique (SMOTE) in each group.
data_sampler(data, labels, target_ratio=0.25, sampler_fn='random_undersampler', groups=None, **kwargs)
Samples the data and labels to adjust the class imbalance.
The majority class is assumed to have a False or 0 label. The new set will have an imbalance equal to: IR * target_ratio, where IR is the current imbalance ratio.
If the target ratio is too small, the algorithm defaults to producing a balanced dataset.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data | DataFrame | Data to sample of shape (n_samples, n_features). | required |
| labels | array-like | Binary prediction labels of shape (n_samples, n_classes). | required |
| target_ratio | float | Target ratio between the minority and majority classes. | 0.25 |
| sampler_fn | str | Sampler function to use to sample the data. | "random_undersampler" |
| groups | Series or None | List containing groups for the group_sampler function. | None |
| **kwargs | | Extra arguments for the sampler functions. | {} |
Returns:
| Name | Type | Description |
|---|---|---|
| X | DataFrame | Sampled data. |
| y | array | Sampled labels. |
| groups | Series or None | Groups of the examples, None if not specified. |
Raises:
| Type | Description |
|---|---|
| TypeError | If labels is not array-like. |
| ValueError | If target_ratio is less than 0.0. |
Source code in src/medpipe/data/sampler.py
group_mean_dist_sampler(data, labels, target_ratio, groups, hard_percent=0.5)
Computes the mean data sample of the majority class in each group and uses the distance to it to select examples.
The examples are sorted based on their distance to the mean. The hardest examples are the ones that have the greatest distance to the mean and the easiest are the ones closest to the mean.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data | DataFrame | Data to sample of shape (n_samples, n_features). | required |
| labels | array-like | Binary prediction labels of shape (n_samples, n_classes). | required |
| target_ratio | float | Ratio of minority over majority classes to achieve. | required |
| groups | array-like | List of groups in which labels belong of shape (n_samples,). | required |
| hard_percent | float | Percentage of examples that are considered hard, between 0 and 1. If hard_percent is 0.5, half of the examples are chosen from the end of the sorted list and the other half from the beginning. | 0.5 |
Returns:
| Name | Type | Description |
|---|---|---|
| sample_idx | array(n_samples) | Index list of examples to achieve target ratio. |
Raises:
| Type | Description |
|---|---|
| TypeError | If labels is not array-like. |
| ValueError | If labels and group do not have the same dimension. |
Source code in src/medpipe/data/sampler.py
group_random_oversampler(labels, target_ratio, groups)
Randomly select labels to achieve the target ratio between minority and majority classes in each group.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| labels | array-like | Binary prediction labels of shape (n_samples, n_classes). | required |
| target_ratio | float | Ratio of minority over majority classes to achieve. | required |
| groups | array-like | List of groups in which labels belong of shape (n_samples,). | required |
Returns:
| Name | Type | Description |
|---|---|---|
| sample_idx | array(n_samples) | Index list of examples to achieve target ratio. |
Raises:
| Type | Description |
|---|---|
| TypeError | If labels is not array-like. |
| ValueError | If labels and group do not have the same dimension. If target_ratio is less than 0.0. |
Source code in src/medpipe/data/sampler.py
group_random_undersampler(labels, target_ratio, groups)
Randomly select labels to achieve the target ratio between minority and majority classes in each group.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| labels | array-like | Binary prediction labels of shape (n_samples, n_classes). | required |
| target_ratio | float | Ratio of minority over majority classes to achieve. | required |
| groups | array-like | List of groups in which labels belong of shape (n_samples,). | required |
Returns:
| Name | Type | Description |
|---|---|---|
| sample_idx | array(n_samples) | Index list of examples to achieve target ratio. |
Raises:
| Type | Description |
|---|---|
| TypeError | If labels is not array-like. |
| ValueError | If labels and group do not have the same dimension. If target_ratio is less than 0.0. |
Source code in src/medpipe/data/sampler.py
group_smote(data, labels, target_ratio, groups, k_neighbors)
Oversample minority class using Synthetic Minority Over-Sampling Technique (SMOTE) in each group.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data | DataFrame | Data to sample of shape (n_samples, n_features). | required |
| labels | array-like | Binary prediction labels of shape (n_samples, n_classes). | required |
| target_ratio | float | Ratio of minority over majority classes to achieve. | required |
| groups | array-like | List of groups in which labels belong of shape (n_samples,). | required |
| k_neighbors | int | Number of neighbors to use for SMOTE knn. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| X_gen | DataFrame | Generated data. |
| multilabels_gen | array | Generated labels. |
| groups_gen | array-like | Generated groups. |
Raises:
| Type | Description |
|---|---|
| TypeError | If labels is not array-like. |
| ValueError | If labels and group do not have the same dimension. |
Source code in src/medpipe/data/sampler.py
mean_dist_sampler(data, labels, target_ratio, hard_percent=0.5)
Computes the mean data sample of the majority class and uses the distance to it to select examples.
The examples are sorted based on their distance to the mean. The hardest examples are the ones that have the greatest distance to the mean and the easiest are the ones closest to the mean.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data | DataFrame | Data to sample of shape (n_samples, n_features). | required |
| labels | array-like | Binary prediction labels of shape (n_samples, n_classes). | required |
| target_ratio | float | Ratio of minority over majority classes to achieve. | required |
| hard_percent | float | Percentage of examples that are considered hard, between 0 and 1. If hard_percent is 0.5, half of the examples are chosen from the end of the sorted list and the other half from the beginning. | 0.5 |
Returns:
| Name | Type | Description |
|---|---|---|
| sample_idx | array(n_samples) | Index list of examples to achieve target ratio. |
Raises:
| Type | Description |
|---|---|
| TypeError | If labels is not array-like. |
| ValueError | If hard_percent is not between 0 and 1. If target_ratio is less than 0.0. |
Source code in src/medpipe/data/sampler.py
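The sketch below is not the library's source. It illustrates the selection rule described for mean_dist_sampler: majority examples are sorted by Euclidean distance to the majority mean, a hard_percent share is taken from the far end of that ordering, and the rest from the near end. Several points are assumptions: the name `mean_distance_select`, single-column 0/1 labels with 0 as the majority class, data given as a list of feature lists, keeping all minority rows, and rejecting a target_ratio of zero to avoid dividing by it.

```python
import math


def mean_distance_select(data, labels, target_ratio, hard_percent=0.5):
    """Return sorted indices of all minority rows plus selected majority rows.

    Hypothetical sketch, not medpipe's code.
    """
    if not 0.0 <= hard_percent <= 1.0:
        raise ValueError("hard_percent must be between 0 and 1")
    if target_ratio <= 0.0:
        raise ValueError("target_ratio must be positive")
    maj = [i for i, y in enumerate(labels) if not y]
    mino = [i for i, y in enumerate(labels) if y]
    n_features = len(data[0])
    mean = [sum(data[i][f] for i in maj) / len(maj) for f in range(n_features)]
    # Order majority indices from closest (easy) to farthest (hard).
    by_dist = sorted(maj, key=lambda i: math.dist(data[i], mean))
    # minority / majority == target_ratio, capped at the available majority rows.
    n_keep = min(len(maj), round(len(mino) / target_ratio))
    n_hard = round(n_keep * hard_percent)
    n_easy = n_keep - n_hard
    hard = by_dist[len(by_dist) - n_hard:]
    easy = by_dist[:n_easy]
    return sorted(mino + easy + hard)
```

With hard_percent set to 1.0 only the farthest majority examples are kept, and with 0.0 only the closest ones.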
random_oversampler(labels, target_ratio)
Randomly select labels to achieve the target ratio between minority and majority classes by oversampling minority class.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| labels | array-like | Binary prediction labels of shape (n_samples, n_classes). | required |
| target_ratio | float | Ratio of minority over majority classes to achieve. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| sample_idx | array(n_samples) | Index list of examples to achieve target ratio. |
Raises:
| Type | Description |
|---|---|
| TypeError | If labels is not array-like. |
| ValueError | If target_ratio is less than 0.0. |
Source code in src/medpipe/data/sampler.py
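The sketch below is not the library's source. It illustrates random oversampling as described for random_oversampler: every original row is kept and minority indices are drawn with replacement until minority / majority reaches target_ratio. The name `oversample_indices`, the `seed` parameter, and the use of single-column 0/1 labels (0 as majority) are assumptions.

```python
import random


def oversample_indices(labels, target_ratio, seed=0):
    """Return sorted indices with minority rows repeated to meet the ratio.

    Hypothetical sketch, not medpipe's code.
    """
    if target_ratio < 0.0:
        raise ValueError("target_ratio must not be negative")
    maj = [i for i, y in enumerate(labels) if not y]
    mino = [i for i, y in enumerate(labels) if y]
    # minority / majority == target_ratio  =>  minority == majority * target_ratio
    n_extra = max(0, round(len(maj) * target_ratio) - len(mino))
    rng = random.Random(seed)
    extra = rng.choices(mino, k=n_extra) if n_extra else []  # with replacement
    return sorted(maj + mino + extra)
```

Because the result is an index list, duplicated minority rows appear as repeated indices that can be used to index both data and labels.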
random_undersampler(labels, target_ratio)
Randomly select labels to achieve the target ratio between minority and majority classes by undersampling majority class.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| labels | array-like | Binary prediction labels of shape (n_samples, n_classes). | required |
| target_ratio | float | Ratio of minority over majority classes to achieve. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| sample_idx | array(n_samples) | Index list of examples to achieve target ratio. |
Raises:
| Type | Description |
|---|---|
| TypeError | If labels is not array-like. |
| ValueError | If target_ratio is less than 0.0. |
Source code in src/medpipe/data/sampler.py
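The sketch below is not the library's source. It illustrates random undersampling as described for random_undersampler: all minority rows are kept and a random subset of majority rows is drawn without replacement so that minority / majority approaches target_ratio. The name `undersample_indices`, the `seed` parameter, single-column 0/1 labels (0 as majority), and rejecting a ratio of zero are assumptions.

```python
import random


def undersample_indices(labels, target_ratio, seed=0):
    """Return sorted indices of all minority rows and sampled majority rows.

    Hypothetical sketch, not medpipe's code.
    """
    if target_ratio <= 0.0:
        raise ValueError("target_ratio must be positive")
    maj = [i for i, y in enumerate(labels) if not y]
    mino = [i for i, y in enumerate(labels) if y]
    # minority / majority == target_ratio  =>  majority == minority / target_ratio
    n_maj = min(len(maj), round(len(mino) / target_ratio))
    rng = random.Random(seed)
    return sorted(mino + rng.sample(maj, n_maj))
```

When the requested ratio would need more majority rows than exist, the cap keeps every majority row and the data is returned unchanged.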
smote(data, labels, target_ratio, k_neighbors)
Oversample minority class using Synthetic Minority Over-Sampling Technique (SMOTE).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data | DataFrame | Data to sample of shape (n_samples, n_features). | required |
| labels | array-like | Binary prediction labels of shape (n_samples, n_classes). | required |
| target_ratio | float | Ratio of minority over majority classes to achieve. | required |
| k_neighbors | int | Number of neighbors to use for SMOTE knn. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| X_gen | DataFrame | Generated data. |
| multilabels_gen | array | Generated labels. |
Raises:
| Type | Description |
|---|---|
| TypeError | If labels is not array-like. |
| ValueError | If target_ratio is less than 0.0. |