Pipeline Reference

This page documents the pipeline sub-package.


medpipe.pipeline.Pipeline

Pipeline class.

This class creates a Pipeline that prepares data and fits a predictor and a calibrator.

Pipeline

Class that creates a Pipeline.

Attributes:

version : str
    Version number.
label_list : list[str]
    List of labels to predict.
n_labels : int
    Number of labels to predict.
predictor_type : str
    Model type of the predictor.
calibrator_type : str
    Model type of the calibrator.
preprocessor_config : dict[str, attr]
    Configuration dictionary for the preprocessor.
predictor_config : dict[str, attr]
    Configuration dictionary for the predictor.
calibrator_config : dict[str, attr]
    Configuration dictionary for the calibrator.
preprocessor : Preprocessor
    Data preprocessor object.
predictor : dict[label, Predictor]
    Dictionary of Predictor instances, one per label.
calibrator : dict[label, Calibrator]
    Dictionary of Calibrator instances, one per label.
predictor_probabilities : dict[label, dict[int, array]]
    Dictionary of predicted probabilities for each predictor.
    The keys are the labels and the values map each fold number
    to the predicted probabilities of that label's predictor on that fold.
calibrator_probabilities : dict[label, dict[int, array]]
    Dictionary of predicted probabilities for each calibrator.
    The keys are the labels and the values map each fold number
    to the predicted probabilities of that label's calibrator on that fold.
logger : logging.Logger or None, default: None
    Logger object to log prints. If None, print to terminal.

Methods:

__init__
    Init method.
fit_preprocessor
    Fits the preprocessor operations based on input data.
transform
    Transforms input data based on preprocessor fitted operations.
fit_transform
    Fits the preprocessor operations and transforms the input data.
get_test_data
    Splits input data into train and test sets.
fit_model
    Fits the predictor or calibrator model on the provided dataset.
test_model
    Tests the predictor or calibrator model on the provided dataset.
run
    Run pipeline with input data.
_train_models
    Trains the predictor and calibrator models.
predict_proba
    Predicts probabilities from predictor or calibrator based on input data.
predict
    Predicts labels from predictor or calibrator based on input data.
_predictor_pred_wrapper
    Wrapper function to create predictions with the predictor.
_calibrator_pred_wrapper
    Wrapper function to create predictions with the calibrator.
_sample_data
    Samples the data based on configuration.
_weight_data
    Gets the weights for the data based on configuration.
_get_calibrator_data
    Get the calibrator data based on the calibrator type.
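The class is driven by a nested configuration dictionary. The sketch below shows only the four top-level keys the constructor reads directly (`version`, `predictor_type`, `model_parameters`, `data_parameters`); the contents of the nested dictionaries are resolved per version by `get_configuration`, so everything inside them here is an illustrative assumption, not the real schema.

```python
# Hypothetical pipeline_config sketch. The top-level keys match the
# constructor; the values and nested contents are placeholder assumptions.
pipeline_config = {
    "version": "1.0",             # split into data and model versions
    "predictor_type": "xgboost",  # model type of the predictor (assumed value)
    "model_parameters": {},       # versioned predictor configuration
    "data_parameters": {},        # versioned preprocessor configuration
}

# Usage (sketch): pipe = Pipeline(pipeline_config); pipe.run(X)
```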

Source code in src/medpipe/pipeline/Pipeline.py
class Pipeline:
    """
    Class that creates a Pipeline.

    Attributes
    ----------
    version : str
        Version number.
    label_list : list[str]
        List of labels to predict.
    n_labels : int
        Number of labels to predict.
    predictor_type : str
        Model type of the predictor.
    calibrator_type : str
        Model type of the calibrator.
    preprocessor_config : dict[str, attr]
        Configuration dictionary for the preprocessor.
    predictor_config : dict[str, attr]
        Configuration dictionary for the predictor.
    calibrator_config : dict[str, attr]
        Configuration dictionary for the calibrator.
    preprocessor : Preprocessor
        Data preprocessor object.
    predictor : dict[label, Predictor]
        Dictionary of Predictor instances, one per label.
    calibrator : dict[label, Calibrator]
        Dictionary of Calibrator instances for each label.
    predictor_probabilities : dict[label, dict[int, array]]
        Dictionary of predicted probabilities for each predictor.
        The keys are the labels and the values map each fold number
        to the predicted probabilities of that label's predictor
        on that fold.
    calibrator_probabilities : dict[label, dict[int, array]]
        Dictionary of predicted probabilities for each calibrator.
        The keys are the labels and the values map each fold number
        to the predicted probabilities of that label's calibrator
        on that fold.
    logger : logging.Logger or None, default: None
        Logger object to log prints. If None print to terminal.

    Methods
    -------
    __init__(pipeline_config={}, logger=None)
        Init method.
    fit_preprocessor(X)
        Fits the preprocessor operations based on input data.
    transform(X)
        Transforms input data based on preprocessor fitted operations.
    fit_transform(X)
        Fits the preprocessor operations and transforms the input data.
    fit_model(X, y, model, **kwargs)
        Fits the predictor or calibrator model on the provided dataset.
    test_model(X, y, model, label_list, key=None)
        Tests the predictor or calibrator model on the provided dataset.
    run(X)
        Run pipeline with input data.
    _train_models(X_train, y_train, X_cal, y_cal, label, **kwargs)
        Trains the predictor and calibrator models.
    predict_proba(X)
        Predicts probabilities from predictor or calibrator based on input data.
    predict(X)
        Predicts labels from predictor or calibrator based on input data.
    _predictor_pred_wrapper(X, label, prediction_type)
        Wrapper function to create predictions with the predictor.
    _calibrator_pred_wrapper(X, label, prediction_type)
        Wrapper function to create predictions with the calibrator.
    _sample_data(X, y, groups)
        Samples the data based on configuration.
    _weight_data(y)
        Gets the weights for the data based on configuration.
    _get_calibrator_data(X, label)
        Get the calibrator data based on the calibrator type.
    """

    def __init__(self, pipeline_config={}, logger=None):
        """
        Initialise a Pipeline class instance.

        Parameters
        ----------
        pipeline_config : dict[str, parameters]
            Configuration parameters for the pipeline object.
        logger : logging.Logger or None, default: None
            Logger object to log prints. If None print to terminal.

        Returns
        -------
        None
            Nothing is returned.

        """
        self.version = pipeline_config["version"]
        self.predictor_type = pipeline_config["predictor_type"]
        self.logger = logger
        self.predictor_probabilities = (
            {}
        )  # Empty dict for predictor predicted probabilities
        self.calibrator_probabilities = (
            {}
        )  # Empty dict for calibrator predicted probabilities

        print_message("Setting up Pipeline", self.logger, SCRIPT_NAME)

        # Get the different configuration dictionaries
        data_version, model_version = split_version_number(pipeline_config["version"])

        # Get predictor configuration parameters
        self.predictor_config = get_configuration(
            pipeline_config["model_parameters"],
            model_version,
        )

        # Get data configuration parameters
        self.preprocessor_config = get_configuration(
            pipeline_config["data_parameters"],
            data_version,
        )

        # Get the calibrator configuration parameters from the predictor config
        self.calibrator_type = self.predictor_config["calibrator"]["calibrator_type"]
        self.calibrator_config = self.predictor_config["calibrator"]

        # Define variables needed to initialise other objects
        self.label_list = self.predictor_config["labels"]["label_list"]
        n_features = len(self.preprocessor_config["features"]["feature_list"]) - len(
            self.label_list
        )

        if self.preprocessor_config["split_variables"]["group_name"]:
            # Remove group name if using GroupKFold
            n_features -= 1
        self.n_labels = len(self.label_list)

        self.predictor = {}
        self.calibrator = {}

        for label in self.label_list:
            self.predictor[label] = Predictor(
                self.predictor_type,
                hyperparameters=self.predictor_config["hyperparameters"],
                logger=self.logger,
            )

            self.predictor_probabilities[label] = {}
            if self.calibrator_type != "":
                # Only if a calibrator type is provided
                self.calibrator[label] = Calibrator(
                    self.calibrator_type,
                    hyperparameters=self.calibrator_config["hyperparameters"],
                    logger=self.logger,
                )
                self.calibrator_probabilities[label] = {}
        self.preprocessor = Preprocessor(
            self.preprocessor_config["preprocessing"], logger=self.logger
        )

    def fit_preprocessor(self, X):
        """
        Fits the preprocessor operations based on input data.

        Parameters
        ----------
        X : pd.DataFrame of shape (n_samples, n_features)
            Data to clean.

        Returns
        -------
        None
            Nothing is returned.

        """
        self.preprocessor.fit(X)

    def transform(self, X):
        """
        Transforms input data based on preprocessor fitted operations.

        Parameters
        ----------
        X : pd.DataFrame of shape (n_samples, n_features)
            Data to transform.

        Returns
        -------
        data : pd.DataFrame of shape (n_samples, n_features)
            Transformed data.

        """
        return self.preprocessor.transform(X)

    def fit_transform(self, X):
        """
        Fits the preprocessor operations and transforms the input data.

        Parameters
        ----------
        X : pd.DataFrame of shape (n_samples, n_features)
            Data to clean.

        Returns
        -------
        data : pd.DataFrame of shape (n_samples, n_features)
            Transformed data.

        """
        return self.preprocessor.fit_transform(X)

    def get_test_data(self, X):
        """
        Returns train and test data based on input data.

        Parameters
        ----------
        X : pd.DataFrame of shape (n_samples, n_features)
            Data to split.

        Returns
        -------
        X_train : pd.DataFrame of shape (n_samples, n_features)
            Train set.
        X_test : pd.DataFrame of shape (n_samples, n_features)
            Test set.

        """
        split_vars = self.preprocessor_config["split_variables"]

        if split_vars["group_name"]:
            train_idx, test_idx = get_validation_idx(
                arange(len(X), dtype=int), X[split_vars["group_name"]]
            )
            X_test = X.iloc[test_idx]
            X_test = X_test.drop(split_vars["group_name"], axis=1)

        else:
            # No groups just get 10 percent of the data
            train_idx, test_idx = get_validation_idx(arange(len(X), dtype=int))
            X_test = X.iloc[test_idx]

        X_train = X.iloc[train_idx]

        return X_train, X_test

    def fit_model(self, X, y, model, label, **kwargs):
        """
        Fits the predictor or calibrator model on the provided dataset.

        Parameters
        ----------
        X : pd.DataFrame of shape (n_samples, n_features)
            Training data.
        y : array-like of shape (n_samples, self.n_labels)
            Prediction labels.
        model : {"predictor", "calibrator"}
            Model to fit.
        label : str
            Label associated with the model to use.
        **kwargs
            Extra arguments for fitting the models.

        Returns
        -------
        None
            Nothing is returned.

        Raises
        ------
        ValueError
            If model is not "predictor" or "calibrator".

        """
        match model:
            case "predictor":
                self.predictor[label].fit(X, y, **kwargs)
            case "calibrator":
                self.calibrator[label].fit(X, y, **kwargs)
            case _:
                raise ValueError(
                    f"Model should be predictor or calibrator, but got {model}"
                )

    def test_model(self, X, y, model, label):
        """
        Tests the predictor or calibrator model on the provided dataset.

        Parameters
        ----------
        X : pd.DataFrame of shape (n_samples, n_features)
            Test data.
        y : array-like of shape (n_samples,)
            Prediction labels.
        model : {"predictor", "calibrator"}
            Model to test.
        label : str
            Label associated with the model to use.

        Returns
        -------
        None
            Nothing is returned.

        Raises
        ------
        ValueError
            If model is not "predictor" or "calibrator".

        """
        match model:
            case "predictor":
                message = "Uncalibrated metrics"

            case "calibrator":
                message = "Calibrated metrics"

            case _:
                raise ValueError(
                    f"Model should be predictor or calibrator, but got {model}"
                )

        metric_dict = test_model(
            y,
            self.predict(X, label_list=label, model_type=model),
            array(self.predict_proba(X, label_list=label, model_type=model)),
        )
        print_message(message, self.logger, SCRIPT_NAME)
        print_metrics(metric_dict, [label], self.logger)

    def run(self, X):
        """
        Run pipeline with input data.

        Parameters
        ----------
        X : pd.DataFrame of shape (n_samples, n_features)
            Input data.

        Returns
        -------
        None
            Nothing is returned.

        """
        if self.preprocessor.operations:
            # If operations are already set then simply transform the data
            data = self.transform(X)
        else:
            # Fit and transform
            data = self.fit_transform(X)

        group_name = self.preprocessor_config["split_variables"]["group_name"]
        weights = None
        X, y = extract_labels(data, self.label_list)  # Get prediction labels from data

        if group_name:
            groups = data[group_name]  # Get the groups for splitting
        else:
            groups = None

        # Create independent calibration set if calibrator is specified
        X_cal = []
        y_cal = []

        if self.calibrator_type != "":
            train_idx, val_idx = get_validation_idx(arange(len(y)), groups)
            X_cal = X.iloc[val_idx]
            y_cal = y[val_idx]
            X = X.iloc[train_idx]
            y = y[train_idx]

            if group_name:
                groups = groups.iloc[train_idx]
                X_cal = X_cal.drop(groups.name, axis=1)  # Remove groups in calibration

        kfold_it = train_test_it(**self.preprocessor_config["split_variables"])
        n_folds = kfold_it.get_n_splits(X, y[:, 0], groups=groups)

        for i, (train_idx, test_idx) in enumerate(
            kfold_it.split(X, y[:, 0], groups=groups)
        ):
            if group_name:
                X_fold = X.drop(groups.name, axis=1)
                fold = int(
                    groups.iloc[test_idx[0]]
                )  # Use the test year as the fold number
                fold_groups = groups.iloc[train_idx]
                fold_message = f"  Fold number {fold} ({i+1}/{n_folds})"
            else:
                X_fold = X
                fold = i
                fold_groups = None
                fold_message = f"  Fold number {fold+1}/{n_folds}"

            # Create the different data sets
            X_train = X_fold.iloc[train_idx]
            y_train = y[train_idx]
            X_test = X_fold.iloc[test_idx]
            y_test = y[test_idx]

            for j, label in enumerate(self.label_list):
                # Sample and weight data if needed
                X_train_i, y_train_i, _ = self._sample_data(
                    X_train, expand_dims(y_train[:, j], 1), fold_groups
                )
                weights = self._weight_data(y_train_i)

                print_message(
                    f"Current metric: {self.label_list[j]}", self.logger, SCRIPT_NAME
                )
                print_message(fold_message, self.logger, SCRIPT_NAME)
                print_message(
                    f"  Train set size: {len(X_train_i)} examples",
                    self.logger,
                    SCRIPT_NAME,
                )
                print_message(
                    f"  Calibration set size: {len(X_cal)} examples",
                    self.logger,
                    SCRIPT_NAME,
                )
                print_message(
                    f"  Test set size: {len(X_test)} examples", self.logger, SCRIPT_NAME
                )

                if self.calibrator_type != "":
                    self._train_models(
                        X_train_i,
                        y_train_i,
                        label,
                        X_cal,
                        y_cal[:, j],
                        **{"weights": weights},
                    )

                    # Test, save probabilities, and reset calibrator
                    self.test_model(X_test, y_test[:, j].squeeze(), "calibrator", label)
                    self.calibrator_probabilities[label][fold] = get_positive_proba(
                        self.predict_proba(X_test, label, model_type="calibrator")
                    )
                    self.calibrator[label]._set_model(quiet=True)

                else:
                    # Train only predictor if no calibrator specified
                    self._train_models(
                        X_train_i, y_train_i, label, **{"weights": weights}
                    )

                # Test predictor on test set
                self.test_model(X_test, y_test[:, j].squeeze(), "predictor", label)

                # Save positive class predicted probabilities
                self.predictor_probabilities[label][fold] = get_positive_proba(
                    self.predict_proba(X_test, label, model_type="predictor")
                )

                # Reset predictor without printing
                self.predictor[label]._set_model(quiet=True)

        # Train final model on complete training set
        print_message("  Final training on all examples", self.logger, SCRIPT_NAME)
        if group_name:
            # Drop group names for final dataset
            X = X.drop(groups.name, axis=1)

        for k, label in enumerate(self.label_list):
            X_train, y_train, _ = self._sample_data(X, expand_dims(y[:, k], 1), groups)
            weights = self._weight_data(y_train)

            print_message(
                f"Current metric: {self.label_list[k]}", self.logger, SCRIPT_NAME
            )
            print_message(
                f"  Train set size: {len(X_train)} examples",
                self.logger,
                SCRIPT_NAME,
            )

            if self.calibrator_type != "":
                self._train_models(
                    X_train, y_train, label, X_cal, y_cal[:, k], **{"weights": weights}
                )
            else:
                self._train_models(X_train, y_train, label, **{"weights": weights})

    def _train_models(self, X_train, y_train, label, X_cal=[], y_cal=[], **kwargs):
        """
        Trains the predictor and calibrator models.

        The calibrator is trained only if X_cal and y_cal are specified.

        Parameters
        ----------
        X_train : pd.DataFrame of shape (n_samples, n_features)
            Train data for the predictor.
        y_train : np.array of shape (n_samples,)
            Train labels for the predictor.
        label : str
            Label associated with the model to train.
        X_cal : pd.DataFrame of shape (n_samples, n_features), default: []
            Calibration data for the calibrator.
        y_cal : np.array of shape (n_samples,), default: []
            Calibration labels for the calibrator.
        **kwargs
            Extra arguments for fitting the predictor.

        Returns
        -------
        None
            Nothing is returned.

        """
        # Fit predictor on train set
        self.fit_model(X_train, y_train, "predictor", label, **kwargs)

        # Fit calibrator on validation set
        if self.calibrator_type != "":
            self.fit_model(
                self._get_calibrator_data(X_cal, label),
                y_cal,
                "calibrator",
                label,
            )

    def predict_proba(self, X, label_list="all", model_type="predictor"):
        """
        Predicts probabilities from predictor or calibrator based on input data.

        Parameters
        ----------
        X : pd.DataFrame of shape (n_samples, n_features)
            Data to make predictions on.
        label_list : str or list[str], default: "all"
            Label or list of labels associated with the model to use.
            If all, all models are used.
        model_type : {"predictor", "calibrator"}, default: "predictor"
            Model to use.

        Returns
        -------
        probabilities : np.array of shape (n_samples, 2)
            Predicted probabilities.

        Raises
        ------
        ValueError
            If model is not "predictor" or "calibrator".
        TypeError
            If label_list is not str or list.

        """
        match model_type:
            case "predictor":
                pred_fn = self._predictor_pred_wrapper
            case "calibrator":
                pred_fn = self._calibrator_pred_wrapper
            case _:
                raise ValueError(
                    f"Model should be predictor or calibrator, but got {model_type}"
                )

        if type(label_list) is type(""):
            if label_list == "all":
                # Convert to list of all labels
                label_list = self.label_list
            else:
                # Single label
                return pred_fn(X, label_list, "predict_proba")

        if type(label_list) is not type([]):
            raise TypeError(
                f"Label list should be str or list, but got {type(label_list)}"
            )

        probabilities = []
        for label in label_list:
            # Loop over all labels to get probabilities for each model
            pred_probas = pred_fn(X, label, "predict_proba")
            if type(pred_probas) is type([]):
                # Account for potential multilabel
                probabilities += pred_probas
            else:
                probabilities.append(pred_probas)
        return probabilities

    def predict(self, X, label_list="all", model_type="predictor"):
        """
        Predicts labels from predictor or calibrator based on input data.

        Parameters
        ----------
        X : pd.DataFrame of shape (n_samples, n_features)
            Data to make predictions on.
        label_list : str or list[str], default: "all"
            Label or list of labels associated with the model to use.
            If all, all models are used.
        model_type : {"predictor", "calibrator"}, default: "predictor"
            Model to use.

        Returns
        -------
        labels : array-like of shape (n_samples,)
            Predicted labels.

        Raises
        ------
        ValueError
            If model_type is not "predictor" or "calibrator".
        TypeError
            If label_list is not str or list.

        """
        match model_type:
            case "predictor":
                pred_fn = self._predictor_pred_wrapper
            case "calibrator":
                pred_fn = self._calibrator_pred_wrapper
            case _:
                raise ValueError(
                    f"Model should be predictor or calibrator, but got {model_type}"
                )

        if type(label_list) is type(""):
            if label_list == "all":
                # Convert to list of all labels
                label_list = self.label_list
            else:
                # Single label
                return pred_fn(X, label_list, "predict")

        if type(label_list) is not type([]):
            raise TypeError(
                f"Label list should be str or list, but got {type(label_list)}"
            )

        labels = []
        for _label in label_list:
            # Loop over all labels to get labels for each model
            pred_labels = pred_fn(X, _label, "predict")
            if type(pred_labels) is type([]):
                # Account for potential multilabel
                labels += pred_labels
            else:
                labels.append(pred_labels)
        return labels

    def _predictor_pred_wrapper(self, X, label, prediction_type):
        """
        Wrapper function to create predictions with the predictor.

        Parameters
        ----------
        X : pd.DataFrame of shape (n_samples, n_features)
            Data to make predictions on.
        label : str
            Label associated with the model to use.
        prediction_type : {"predict", "predict_proba"}
            Prediction function to use.

        Returns
        -------
        estimates : np.array(n_samples,) or np.array(n_samples, 2)
            Labels or probabilities estimated from model based on prediction_type.

        Raises
        ------
        ValueError
            If prediction_type is not "predict" or "predict_proba".


        """
        match prediction_type:
            case "predict":
                return self.predictor[label].predict(X)
            case "predict_proba":
                return self.predictor[label].predict_proba(X)
            case _:
                raise ValueError(
                    "Prediction type should be predict or predict_proba, "
                    f"but got {prediction_type}"
                )

    def _calibrator_pred_wrapper(self, X, label, prediction_type):
        """
        Wrapper function to create predictions with the calibrator.

        Parameters
        ----------
        X : pd.DataFrame of shape (n_samples, n_features)
            Data to make predictions on.
        label : str
            Label associated with the model to use.
        prediction_type : {"predict", "predict_proba"}
            Prediction function to use.

        Returns
        -------
        estimates : np.array(n_samples,) or np.array(n_samples, 2)
            Labels or probabilities estimated from model based on prediction_type.

        Raises
        ------
        ValueError
            If prediction_type is not "predict" or "predict_proba".

        """
        match prediction_type:
            case "predict":
                return self.calibrator[label].predict(
                    self._get_calibrator_data(X, label)
                )
            case "predict_proba":
                return self.calibrator[label].predict_proba(
                    self._get_calibrator_data(X, label)
                )
            case _:
                raise ValueError(
                    "Prediction type should be predict or predict_proba, "
                    f"but got {prediction_type}"
                )

    def _sample_data(self, X, y, groups):
        """
        Samples the data based on configuration.

        Parameters
        ----------
        X : pd.DataFrame of shape (n_samples, n_features)
            Data to sample.
        y : np.array of shape (n_samples,)
            Labels to sample.
        groups : pd.Series of shape (n_samples,) or None
            Groups of the examples, None if not specified.

        Returns
        -------
        X_sampled : pd.DataFrame of shape (n_sampled_samples, n_features)
            Sampled data.
        y_sampled : np.array of shape (n_sampled_samples,)
            Sampled labels.
        groups_sampled : pd.Series of shape (n_sampled_samples,) or None
            Groups of the examples, None if not specified.

        """
        sampler_fn = self.predictor_config["sampler"]["sampler_fn"]

        if sampler_fn:
            return data_sampler(X, y, groups=groups, **self.predictor_config["sampler"])

        return X, y, groups

    def _weight_data(self, y):
        """
        Gets the weights for the data based on configuration.

        Parameters
        ----------
        y : np.array of shape (n_samples, self.n_labels)
            Labels needed for creation of weights.

        Returns
        -------
        weights : np.array of shape (n_samples,), or None
            Sample or class weights based on labels.
            Sample weights are of shape (n_samples,).
            Class weights are of shape (1).
            None if no weighting function is provided.

        """
        weighting_fn = self.predictor_config["weighting"]["weighting_fn"]

        if weighting_fn:
            return getattr(weight, weighting_fn)(y)

        return ones(y.shape[0])

    def _get_calibrator_data(self, X, label):
        """
        Get the calibrator data based on the calibrator type.

        Parameters
        ----------
        X : pd.DataFrame of shape (n_samples, n_features)
            Data to transform for calibrator.
        label : str
            Label associated with the model to use.

        Returns
        -------
        calibrator_data : np.array of shape (n_samples,) or (n_samples, 2)
            Data for calibrator model based on its type.

        """
        calibrator_data = get_positive_proba(
            self.predict_proba(X, label, model_type="predictor")
        )

        if self.calibrator_type == "isotonic" and calibrator_data.shape[1] == 2:
            # Only provide positive probabilities
            calibrator_data = calibrator_data[:, 1]

        return calibrator_data
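The weighting hook above dispatches to a function named in the configuration and otherwise falls back to uniform weights. As a minimal, self-contained sketch, one common choice of `weighting_fn` is inverse-frequency sample weighting (the name `balanced_sample_weights` is hypothetical, not part of medpipe):

```python
import numpy as np

def balanced_sample_weights(y):
    """Inverse-frequency sample weights; uniform labels give weights of 1."""
    classes, counts = np.unique(y, return_counts=True)
    per_class = {c: len(y) / (len(classes) * n) for c, n in zip(classes, counts)}
    return np.array([per_class[v] for v in np.ravel(y)])

y = np.array([0, 0, 0, 1])
weights = balanced_sample_weights(y)
# The minority class is up-weighted relative to the majority class,
# and the weights average to 1 over the dataset.
```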
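`_get_calibrator_data` reduces the predictor's output to whatever shape the calibrator expects; isotonic regression fits a one-dimensional score, so only the positive-class column is kept. A self-contained sketch of that shaping step (illustrative only; in the pipeline the input comes from `predict_proba` and `get_positive_proba`):

```python
import numpy as np

def shape_for_calibrator(probas, calibrator_type):
    """Keep only the positive-class column when the calibrator is isotonic."""
    if calibrator_type == "isotonic" and probas.ndim == 2 and probas.shape[1] == 2:
        return probas[:, 1]  # 1-D scores for isotonic regression
    return probas  # other calibrators can take the full (n_samples, 2) array

probas = np.array([[0.8, 0.2], [0.3, 0.7]])
iso_input = shape_for_calibrator(probas, "isotonic")    # shape (2,)
other_input = shape_for_calibrator(probas, "sigmoid")   # shape (2, 2)
```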

__init__(pipeline_config={}, logger=None)

Initialise a Pipeline class instance.

Parameters:

pipeline_config : dict[str, parameters], default: {}
    Configuration parameters for the pipeline object.
logger : logging.Logger or None, default: None
    Logger object to log prints. If None print to terminal.

Returns:

None
    Nothing is returned.

Source code in src/medpipe/pipeline/Pipeline.py
def __init__(self, pipeline_config={}, logger=None):
    """
    Initialise a Pipeline class instance.

    Parameters
    ----------
    pipeline_config : dict[str, parameters]
        Configuration parameters for the pipeline object.
    logger : logging.Logger or None, default: None
        Logger object to log prints. If None print to terminal.

    Returns
    -------
    None
        Nothing is returned.

    """
    self.version = pipeline_config["version"]
    self.predictor_type = pipeline_config["predictor_type"]
    self.logger = logger
    # Empty dicts for predictor and calibrator predicted probabilities
    self.predictor_probabilities = {}
    self.calibrator_probabilities = {}

    print_message("Setting up Pipeline", self.logger, SCRIPT_NAME)

    # Get the different configuration dictionaries
    data_version, model_version = split_version_number(pipeline_config["version"])

    # Get predictor configuration parameters
    self.predictor_config = get_configuration(
        pipeline_config["model_parameters"],
        model_version,
    )

    # Get data configuration parameters
    self.preprocessor_config = get_configuration(
        pipeline_config["data_parameters"],
        data_version,
    )

    # Get the calibrator configuration parameters from the predictor config
    self.calibrator_type = self.predictor_config["calibrator"]["calibrator_type"]
    self.calibrator_config = self.predictor_config["calibrator"]

    # Define variables needed to initialise other objects
    self.label_list = self.predictor_config["labels"]["label_list"]
    n_features = len(self.preprocessor_config["features"]["feature_list"]) - len(
        self.label_list
    )

    if self.preprocessor_config["split_variables"]["group_name"]:
        # Remove group name if using GroupKFold
        n_features -= 1
    self.n_labels = len(self.label_list)

    self.predictor = {}
    self.calibrator = {}

    for label in self.label_list:
        self.predictor[label] = Predictor(
            self.predictor_type,
            hyperparameters=self.predictor_config["hyperparameters"],
            logger=self.logger,
        )

        self.predictor_probabilities[label] = {}
        if self.calibrator_type != "":
            # Only if a calibrator type is provided
            self.calibrator[label] = Calibrator(
                self.calibrator_type,
                hyperparameters=self.calibrator_config["hyperparameters"],
                logger=self.logger,
            )
            self.calibrator_probabilities[label] = {}
    self.preprocessor = Preprocessor(
        self.preprocessor_config["preprocessing"], logger=self.logger
    )

fit_model(X, y, model, label, **kwargs)

Fits the predictor or calibrator model on the provided dataset.

Parameters:

X : pd.DataFrame of shape (n_samples, n_features)
    Training data.
y : array-like of shape (n_samples, self.n_labels)
    Prediction labels.
model : {"predictor", "calibrator"}
    Model to fit.
label : str
    Label associated with the model to use.
**kwargs
    Extra arguments for fitting the models.

Returns:

None
    Nothing is returned.

Raises:

ValueError
    If model is not "predictor" or "calibrator".

Source code in src/medpipe/pipeline/Pipeline.py
def fit_model(self, X, y, model, label, **kwargs):
    """
    Fits the predictor or calibrator model on the provided dataset.

    Parameters
    ----------
    X : pd.DataFrame of shape (n_samples, n_features)
        Training data.
    y : array-like of shape (n_samples, self.n_labels)
        Prediction labels.
    model : {"predictor", "calibrator"}
        Model to fit.
    label : str
        Label associated with the model to use.
    **kwargs
        Extra arguments for fitting the models.

    Returns
    -------
    None
        Nothing is returned.

    Raises
    ------
    ValueError
        If model is not "predictor" or "calibrator".

    """
    match model:
        case "predictor":
            self.predictor[label].fit(X, y, **kwargs)
        case "calibrator":
            self.calibrator[label].fit(X, y, **kwargs)
        case _:
            raise ValueError(
                f"Model should be predictor or calibrator, but got {model}"
            )

fit_preprocessor(X)

Fits the preprocessor operations based on input data.

Parameters:

X : pd.DataFrame of shape (n_samples, n_features)
    Data to clean.

Returns:

None
    Nothing is returned.

Source code in src/medpipe/pipeline/Pipeline.py
def fit_preprocessor(self, X):
    """
    Fits the preprocessor operations based on input data.

    Parameters
    ----------
    X : pd.DataFrame of shape (n_samples, n_features)
        Data to clean.

    Returns
    -------
    None
        Nothing is returned.

    """
    self.preprocessor.fit(X)

fit_transform(X)

Fits the preprocessor operations and transforms the input data.

Parameters:

X : pd.DataFrame of shape (n_samples, n_features)
    Data to clean.

Returns:

data : pd.DataFrame of shape (n_samples, n_features)
    Transformed data.

Source code in src/medpipe/pipeline/Pipeline.py
def fit_transform(self, X):
    """
    Fits the preprocessor operations and transforms the input data.

    Parameters
    ----------
    X : pd.DataFrame of shape (n_samples, n_features)
        Data to clean.

    Returns
    -------
    data : pd.DataFrame of shape (n_samples, n_features)
        Transformed data.

    """
    return self.preprocessor.fit_transform(X)

get_test_data(X)

Returns train and test data based on input data.

Parameters:

X : pd.DataFrame of shape (n_samples, n_features)
    Data to split.

Returns:

X_train : pd.DataFrame of shape (n_samples, n_features)
    Train set.
X_test : pd.DataFrame of shape (n_samples, n_features)
    Test set.

Source code in src/medpipe/pipeline/Pipeline.py
def get_test_data(self, X):
    """
    Returns train and test data based on input data.

    Parameters
    ----------
    X : pd.DataFrame of shape (n_samples, n_features)
        Data to split.

    Returns
    -------
    X_train : pd.DataFrame of shape (n_samples, n_features)
        Train set.
    X_test : pd.DataFrame of shape (n_samples, n_features)
        Test set.

    """
    split_vars = self.preprocessor_config["split_variables"]

    if split_vars["group_name"]:
        train_idx, test_idx = get_validation_idx(
            arange(len(X), dtype=int), X[split_vars["group_name"]]
        )
        X_test = X.iloc[test_idx]
        X_test = X_test.drop(split_vars["group_name"], axis=1)

    else:
        # No groups: just hold out 10 percent of the data
        train_idx, test_idx = get_validation_idx(arange(len(X), dtype=int))
        X_test = X.iloc[test_idx]

    X_train = X.iloc[train_idx]

    return X_train, X_test
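The group-aware branch can be pictured with a toy frame: one whole group is held out as the test set, and the group column is then dropped from the test features only. A sketch under stated assumptions (the real indices come from `get_validation_idx`; the year-valued group column is hypothetical):

```python
import pandas as pd

df = pd.DataFrame(
    {"feat": range(6), "year": [2019, 2019, 2020, 2020, 2021, 2021]}
)

# Hold out every example of one group (here the latest year) as the test set
test_mask = df["year"] == df["year"].max()
X_test = df[test_mask].drop(columns="year")  # group column removed from test set
X_train = df[~test_mask]                     # train set keeps the group column
```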

predict(X, label_list='all', model_type='predictor')

Predicts labels from predictor or calibrator based on input data.

Parameters:

X : pd.DataFrame of shape (n_samples, n_features)
    Data to make predictions on.
label_list : str or list[str], default: "all"
    Label or list of labels associated with the model to use.
    If "all", all models are used.
model_type : {"predictor", "calibrator"}, default: "predictor"
    Model to use.

Returns:

labels : array-like of shape (n_samples,)
    Predicted labels.

Raises:

ValueError
    If model_type is not "predictor" or "calibrator".
TypeError
    If label_list is not str or list.

Source code in src/medpipe/pipeline/Pipeline.py
def predict(self, X, label_list="all", model_type="predictor"):
    """
    Predicts labels from predictor or calibrator based on input data.

    Parameters
    ----------
    X : pd.DataFrame of shape (n_samples, n_features)
        Data to make predictions on.
    label_list : str or list[str], default: "all"
        Label or list of labels associated with the model to use.
        If all, all models are used.
    model_type : {"predictor", "calibrator"}, default: "predictor"
        Model to use.

    Returns
    -------
    labels : array-like of shape (n_samples,)
        Predicted labels.

    Raises
    ------
    ValueError
        If model_type is not "predictor" or "calibrator".
    TypeError
        If label_list is not str or list.

    """
    match model_type:
        case "predictor":
            pred_fn = self._predictor_pred_wrapper
        case "calibrator":
            pred_fn = self._calibrator_pred_wrapper
        case _:
            raise ValueError(
                f"Model should be predictor or calibrator, but got {model_type}"
            )

    if isinstance(label_list, str):
        if label_list == "all":
            # Convert to list of all labels
            label_list = self.label_list
        else:
            # Single label
            return pred_fn(X, label_list, "predict")

    if not isinstance(label_list, list):
        raise TypeError(
            f"Label list should be str or list, but got {type(label_list)}"
        )

    labels = []
    for _label in label_list:
        # Loop over all labels to get labels for each model
        pred_labels = pred_fn(X, _label, "predict")
        if isinstance(pred_labels, list):
            # Account for potential multilabel
            labels += pred_labels
        else:
            labels.append(pred_labels)
    return labels
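The `label_list` handling above follows the same pattern in both `predict` and `predict_proba`: a bare string is either the sentinel `"all"` or a single label name, and anything else must be a list. A compact, self-contained sketch of that dispatch (the label names used below are hypothetical examples, not medpipe defaults):

```python
def resolve_labels(label_list, all_labels):
    # "all" expands to every configured label; a plain string is one label
    if isinstance(label_list, str):
        return all_labels if label_list == "all" else [label_list]
    if not isinstance(label_list, list):
        raise TypeError(
            f"Label list should be str or list, but got {type(label_list)}"
        )
    return label_list

resolve_labels("all", ["mortality", "readmission"])        # both labels
resolve_labels("mortality", ["mortality", "readmission"])  # just one
```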

predict_proba(X, label_list='all', model_type='predictor')

Predicts probabilities from predictor or calibrator based on input data.

Parameters:

X : pd.DataFrame of shape (n_samples, n_features)
    Data to make predictions on.
label_list : str or list[str], default: "all"
    Label or list of labels associated with the model to use.
    If "all", all models are used.
model_type : {"predictor", "calibrator"}, default: "predictor"
    Model to use.

Returns:

probabilities : np.array of shape (n_samples, 2)
    Predicted probabilities.

Raises:

ValueError
    If model_type is not "predictor" or "calibrator".
TypeError
    If label_list is not str or list.

Source code in src/medpipe/pipeline/Pipeline.py
def predict_proba(self, X, label_list="all", model_type="predictor"):
    """
    Predicts probabilities from predictor or calibrator based on input data.

    Parameters
    ----------
    X : pd.DataFrame of shape (n_samples, n_features)
        Data to make predictions on.
    label_list : str or list[str], default: "all"
        Label or list of labels associated with the model to use.
        If all, all models are used.
    model_type : {"predictor", "calibrator"}, default: "predictor"
        Model to use.

    Returns
    -------
    probabilities : np.array of shape (n_samples, 2)
        Predicted probabilities.

    Raises
    ------
    ValueError
        If model_type is not "predictor" or "calibrator".
    TypeError
        If label_list is not str or list.

    """
    match model_type:
        case "predictor":
            pred_fn = self._predictor_pred_wrapper
        case "calibrator":
            pred_fn = self._calibrator_pred_wrapper
        case _:
            raise ValueError(
                f"Model should be predictor or calibrator, but got {model_type}"
            )

    if isinstance(label_list, str):
        if label_list == "all":
            # Convert to list of all labels
            label_list = self.label_list
        else:
            # Single label
            return pred_fn(X, label_list, "predict_proba")

    if not isinstance(label_list, list):
        raise TypeError(
            f"Label list should be str or list, but got {type(label_list)}"
        )

    probabilities = []
    for label in label_list:
        # Loop over all labels to get probabilities for each model
        pred_probas = pred_fn(X, label, "predict_proba")
        if isinstance(pred_probas, list):
            # Account for potential multilabel
            probabilities += pred_probas
        else:
            probabilities.append(pred_probas)
    return probabilities

run(X)

Run pipeline with input data.

Parameters:

X : pd.DataFrame of shape (n_samples, n_features)
    Training data.

Returns:

None
    Nothing is returned.

Source code in src/medpipe/pipeline/Pipeline.py
def run(self, X):
    """
    Run pipeline with input data.

    Parameters
    ----------
    X : pd.DataFrame of shape (n_samples, n_features)
        Training data.

    Returns
    -------
    None
        Nothing is returned.

    """
    if self.preprocessor.operations:
        # If operations are already set then simply transform the data
        data = self.transform(X)
    else:
        # Fit and transform
        data = self.fit_transform(X)

    group_name = self.preprocessor_config["split_variables"]["group_name"]
    weights = None
    X, y = extract_labels(data, self.label_list)  # Get prediction labels from data

    if group_name:
        groups = data[group_name]  # Get the groups for splitting
    else:
        groups = None

    # Create independent calibration set if calibrator is specified
    X_cal = []
    y_cal = []

    if self.calibrator_type != "":
        train_idx, val_idx = get_validation_idx(arange(len(y)), groups)
        X_cal = X.iloc[val_idx]
        y_cal = y[val_idx]
        X = X.iloc[train_idx]
        y = y[train_idx]

        if group_name:
            groups = groups.iloc[train_idx]
            X_cal = X_cal.drop(groups.name, axis=1)  # Remove groups in calibration

    kfold_it = train_test_it(**self.preprocessor_config["split_variables"])
    n_folds = kfold_it.get_n_splits(X, y[:, 0], groups=groups)

    for i, (train_idx, test_idx) in enumerate(
        kfold_it.split(X, y[:, 0], groups=groups)
    ):
        if group_name:
            X_fold = X.drop(groups.name, axis=1)
            fold = int(
                groups.iloc[test_idx[0]]
            )  # Use the test year as the fold number
            fold_groups = groups.iloc[train_idx]
            fold_message = f"  Fold number {fold} ({i+1}/{n_folds})"
        else:
            X_fold = X
            fold = i
            fold_groups = None
            fold_message = f"  Fold number {fold+1}/{n_folds}"

        # Create the different data sets
        X_train = X_fold.iloc[train_idx]
        y_train = y[train_idx]
        X_test = X_fold.iloc[test_idx]
        y_test = y[test_idx]

        for j, label in enumerate(self.label_list):
            # Sample and weight data if needed
            X_train_i, y_train_i, _ = self._sample_data(
                X_train, expand_dims(y_train[:, j], 1), fold_groups
            )
            weights = self._weight_data(y_train_i)

            print_message(
                f"Current metric: {self.label_list[j]}", self.logger, SCRIPT_NAME
            )
            print_message(fold_message, self.logger, SCRIPT_NAME)
            print_message(
                f"  Train set size: {len(X_train_i)} examples",
                self.logger,
                SCRIPT_NAME,
            )
            print_message(
                f"  Calibration set size: {len(X_cal)} examples",
                self.logger,
                SCRIPT_NAME,
            )
            print_message(
                f"  Test set size: {len(X_test)} examples", self.logger, SCRIPT_NAME
            )

            if self.calibrator_type != "":
                self._train_models(
                    X_train_i,
                    y_train_i,
                    label,
                    X_cal,
                    y_cal[:, j],
                    weights=weights,
                )

                # Test, save probabilities, and reset calibrator
                self.test_model(X_test, y_test[:, j].squeeze(), "calibrator", label)
                self.calibrator_probabilities[label][fold] = get_positive_proba(
                    self.predict_proba(X_test, label, model_type="calibrator")
                )
                self.calibrator[label]._set_model(quiet=True)

            else:
                # Train only predictor if no calibrator specified
                self._train_models(
                    X_train_i, y_train_i, label, weights=weights
                )

            # Test predictor on test set
            self.test_model(X_test, y_test[:, j].squeeze(), "predictor", label)

            # Save positive class predicted probabilities
            self.predictor_probabilities[label][fold] = get_positive_proba(
                self.predict_proba(X_test, label, model_type="predictor")
            )

            # Reset predictor without printing
            self.predictor[label]._set_model(quiet=True)

    # Train final model on complete training set
    print_message("  Final training on all examples", self.logger, SCRIPT_NAME)
    if group_name:
        # Drop group names for final dataset
        X = X.drop(groups.name, axis=1)

    for k, label in enumerate(self.label_list):
        X_train, y_train, _ = self._sample_data(X, expand_dims(y[:, k], 1), groups)
        weights = self._weight_data(y_train)

        print_message(
            f"Current metric: {self.label_list[k]}", self.logger, SCRIPT_NAME
        )
        print_message(
            f"  Train set size: {len(X_train)} examples",
            self.logger,
            SCRIPT_NAME,
        )

        if self.calibrator_type != "":
            self._train_models(
                X_train, y_train, label, X_cal, y_cal[:, k], weights=weights
            )
        else:
            self._train_models(X_train, y_train, label, weights=weights)
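Numerically, `run` first carves off an independent calibration split and then cross-validates on the remainder only, so the calibrator is never fitted on data a predictor fold trains on. A minimal NumPy sketch of that data flow (the 10 percent calibration fraction and 5 folds are illustrative assumptions; medpipe derives both from its configuration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = rng.integers(0, 2, size=100)

# 1) Independent calibration holdout (assumed 10 percent)
idx = rng.permutation(len(y))
n_cal = len(y) // 10
cal_idx, train_idx = idx[:n_cal], idx[n_cal:]
X_cal, y_cal = X[cal_idx], y[cal_idx]
X_train, y_train = X[train_idx], y[train_idx]

# 2) K-fold CV on the training portion only; the calibration set never
#    enters a training fold, so calibration stays out-of-sample
folds = np.array_split(np.arange(len(y_train)), 5)
for test_fold in folds:
    train_mask = np.ones(len(y_train), dtype=bool)
    train_mask[test_fold] = False
    # fit predictor on (X_train[train_mask], y_train[train_mask]),
    # fit calibrator on (X_cal, y_cal), evaluate on X_train[test_fold]
```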

test_model(X, y, model, label)

Tests the predictor or calibrator model on the provided dataset.

Parameters:

X : pd.DataFrame of shape (n_samples, n_features)
    Test data.
y : array-like of shape (n_samples,)
    Prediction labels.
model : {"predictor", "calibrator"}
    Model to test.
label : str
    Label associated with the model to use.

Returns:

None
    Nothing is returned.

Raises:

ValueError
    If model is not "predictor" or "calibrator".

Source code in src/medpipe/pipeline/Pipeline.py
def test_model(self, X, y, model, label):
    """
    Tests the predictor or calibrator model on the provided dataset.

    Parameters
    ----------
    X : pd.DataFrame of shape (n_samples, n_features)
        Test data.
    y : array-like of shape (n_samples,)
        Prediction labels.
    model : {"predictor", "calibrator"}
        Model to test.
    label : str
        Label associated with the model to use.

    Returns
    -------
    None
        Nothing is returned.

    Raises
    ------
    ValueError
        If model is not "predictor" or "calibrator".

    """
    match model:
        case "predictor":
            message = "Uncalibrated metrics"

        case "calibrator":
            message = "Calibrated metrics"

        case _:
            raise ValueError(
                f"Model should be predictor or calibrator, but got {model}"
            )

    metric_dict = test_model(
        y,
        self.predict(X, label_list=label, model_type=model),
        array(self.predict_proba(X, label_list=label, model_type=model)),
    )
    print_message(message, self.logger, SCRIPT_NAME)
    print_metrics(metric_dict, [label], self.logger)

transform(X)

Transforms input data based on preprocessor fitted operations.

Parameters:

X : pd.DataFrame of shape (n_samples, n_features)
    Data to clean.

Returns:

data : pd.DataFrame of shape (n_samples, n_features)
    Transformed data.

Source code in src/medpipe/pipeline/Pipeline.py
def transform(self, X):
    """
    Transforms input data based on preprocessor fitted operations.

    Parameters
    ----------
    X : pd.DataFrame of shape (n_samples, n_features)
        Data to clean.

    Returns
    -------
    data : pd.DataFrame of shape (n_samples, n_features)
        Transformed data.

    """
    return self.preprocessor.transform(X)