1. Introduction

The SURE package is an open-source Python library for assessing the utility and privacy of tabular and time-series synthetic datasets.

The SURE library consists of several Python modules that can be imported into any Python script once the library is installed.

Modules overview

The SURE library features the following modules:

Preprocessor
Statistical similarity metrics
Model garden
ML utility metrics
Distance metrics
Privacy attack sandbox
Report generator

Preprocessor

The preprocessor module converts the input datasets into the standard structure used by all subsequent modules. It is built on the Polars library, which makes this step significantly faster than it would be with other data processing libraries.

Utility

The statistical similarity metrics, the ML utility metrics and the model garden modules constitute the data utility evaluation part.

The statistical similarity module and the distance metrics module take the pre-processed datasets as input and assess the statistical similarity between the datasets and how much the content of the synthetic dataset differs from that of the original dataset. In particular, the statistical similarity metrics module uses the real and synthetic input datasets to assess how close the two are in terms of statistical properties, such as feature means, variances and correlations.

The machine learning utility metrics module executes a classification or regression task on the given dataset with multiple machine learning models, returning the performance metrics of each of the models tested on the given task and dataset.
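The idea of running one task with several models and collecting a metric for each can be sketched as follows. This is a dependency-free illustration, not the SURE implementation: the two toy models (a majority-class baseline and a 1-nearest-neighbour classifier), the `evaluate` helper, the train/test split and the data are all assumptions, and a real evaluation would plug in estimators from a machine learning library instead.

```python
def majority_class(train_X, train_y):
    """Baseline model: always predict the most frequent training label."""
    most_common = max(set(train_y), key=train_y.count)
    return lambda x: most_common


def nearest_neighbour(train_X, train_y):
    """1-nearest-neighbour model using squared Euclidean distance."""
    def predict(x):
        distances = [sum((a - b) ** 2 for a, b in zip(x, row)) for row in train_X]
        return train_y[distances.index(min(distances))]
    return predict


def evaluate(models, train_X, train_y, test_X, test_y):
    """Fit every model on the training set and report its accuracy on the test set."""
    results = {}
    for name, fit in models.items():
        predict = fit(train_X, train_y)
        correct = sum(predict(x) == y for x, y in zip(test_X, test_y))
        results[name] = correct / len(test_y)
    return results


# Hypothetical binary classification data with two numeric features.
train_X = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [3.0, 3.0], [3.2, 2.9]]
train_y = [0, 0, 0, 1, 1]
test_X = [[1.1, 1.0], [2.9, 3.1], [3.1, 3.0], [0.8, 0.9]]
test_y = [0, 1, 1, 0]

models = {"majority_class": majority_class, "nearest_neighbour": nearest_neighbour}
scores = evaluate(models, train_X, train_y, test_X, test_y)
print(scores)
```

Returning one score per model name mirrors the behaviour described above, where the module reports the performance of each model tested on the task.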

Privacy

The distance metrics and the privacy attack sandbox make up the synthetic data privacy assessment modules.
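As an illustration of a distance-based privacy check, the sketch below computes, for each synthetic record, the Euclidean distance to its closest real record. This "distance to closest record" measure is a common choice in synthetic data evaluation, but it is an assumption here and is not confirmed as the metric SURE implements; the function name and the toy records are hypothetical.

```python
import math


def distance_to_closest_record(synthetic_rows, real_rows):
    """For each synthetic row, return the Euclidean distance to its nearest real row."""
    return [min(math.dist(s, r) for r in real_rows) for s in synthetic_rows]


# Hypothetical numeric records; the first synthetic row copies a real row exactly.
real_rows = [(0.0, 0.0), (3.0, 4.0), (10.0, 10.0)]
synthetic_rows = [(0.0, 0.0), (3.0, 0.0), (6.0, 8.0)]

dcr = distance_to_closest_record(synthetic_rows, real_rows)
exact_copies = sum(d == 0.0 for d in dcr)

print([round(d, 3) for d in dcr])
print("synthetic rows identical to a real row:", exact_copies)
```

A distance of zero means a synthetic record reproduces a real one exactly, which is the kind of leakage a privacy assessment aims to detect.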