ML utility metrics measure how useful the synthetic data is for machine learning tasks, i.e. how effective it is as training data for ML models.
The ML utility metrics allow users to specify a learning task (classification or regression) and train a pool of models to quantify the information loss that results from using a synthetic dataset to train such a model.
The ML utility metrics module relies on the model garden module to evaluate the performance of the selected models on the provided dataset.
Please note that the models are trained with scikit-learn's default settings (e.g. the default number of training iterations), which may differ depending on the model.
For a more thorough comparison, select the desired models from those available in scikit-learn, as demonstrated in the example below. Selecting a narrow pool of models also makes the computation faster.
compute_utility_metrics_class()
compute_utility_metrics_regr()
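A minimal sketch of how a restricted pool of scikit-learn classifiers might be passed to the classification function is shown below; the import path, the argument order, and the `classifiers` keyword are assumptions for illustration, not the documented signature.

```python
# Hypothetical usage sketch: the import path, the argument order, and the
# `classifiers` keyword are assumptions for illustration, not the documented API.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

from sure.utility import compute_utility_metrics_class  # assumed import path

# Stand-ins for the real dataset (split into train/test) and a synthetic counterpart.
X_real, y_real = make_classification(n_samples=2000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X_real, y_real, random_state=0)
X_synth, y_synth = make_classification(n_samples=1500, n_features=10, random_state=1)

# Restrict the pool to a couple of scikit-learn classifiers to speed up the computation.
models = [LogisticRegression(max_iter=1000), RandomForestClassifier(n_estimators=100)]

report = compute_utility_metrics_class(
    X_train, X_synth, X_test,   # real train, synthetic train, real test features (assumed order)
    y_train, y_synth, y_test,   # corresponding labels (assumed order)
    classifiers=models,         # assumed keyword for the custom model pool
)
```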
It is possible to train the models of this module on the synthetic dataset and use the real dataset for testing.
This approach makes it possible to assess the synthetic dataset's ML utility by directly comparing the performance of models trained on the synthetic data and tested on the real data with that of models trained on the real data.
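As a rough sketch of this train-on-synthetic, test-on-real arrangement with the regression function (again, the import path, argument order, and `regressors` keyword are assumptions rather than the documented signature):

```python
# Hypothetical sketch: pooled models are trained on synthetic data and evaluated on the
# real hold-out; import path, argument order, and `regressors` are assumptions.
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

from sure.utility import compute_utility_metrics_regr  # assumed import path

X_real, y_real = make_regression(n_samples=2000, n_features=8, noise=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X_real, y_real, random_state=0)
X_synth, y_synth = make_regression(n_samples=1500, n_features=8, noise=0.1, random_state=1)

report = compute_utility_metrics_regr(
    X_train, X_synth, X_test,                  # models trained on X_synth are tested on the real X_test
    y_train, y_synth, y_test,
    regressors=[GradientBoostingRegressor()],  # assumed keyword for the custom model pool
)
```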
In the following example: