This module includes a testing environment where a privacy attack is simulated to assess the resilience of the synthetic data against re-identification and information leakage. For example, in a mobility context an attacker might aim to identify user trajectories or visited places, while in a medical setting they might be interested in learning whether a certain individual belongs to a patient database. The privacy attack considered here is the membership inference attack (MIA).

The implementation of such attacks also depends on strong assumptions about the background knowledge available to the attacker, who might know, for instance, a set of demographic attributes of an individual or certain behavioural patterns.

Within this module, the user can define the prior information available to the attacker regarding the dataset and simulate the outcome of a Membership Inference Attack.

The function adversary_dataset() generates a dataset composed of 50% of rows from the original training dataset (the one used to train the generator that produced the synthetic dataset under test) and 50% from the validation dataset, which is part of the original dataset but was not used to train the synthetic data generator. The function merges half of the first dataset with half of the second and then samples a fraction of the result; this sample fraction can be set as one of the arguments of the function.

The adversary dataset produced will contain an additional column named “privacy_test_is_training”, consisting of boolean values indicating whether each record was part of the training set or not. This column can be used as the label column for the MIA.

This adversary dataset represents the information that the simulated attacker has at its disposal for performing the membership inference attack.
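
As an illustration of the procedure described above, the sketch below builds a comparable adversary dataset with pandas. The helper name build_adversary_dataset, its arguments, and the split logic shown here are hypothetical and do not reflect the actual signature or internals of adversary_dataset().

import pandas as pd

def build_adversary_dataset(train_df: pd.DataFrame,
                            valid_df: pd.DataFrame,
                            sample_fraction: float = 0.2,
                            random_state: int = 42) -> pd.DataFrame:
    # Take half of the training rows and half of the validation rows
    train_half = train_df.sample(frac=0.5, random_state=random_state).copy()
    valid_half = valid_df.sample(frac=0.5, random_state=random_state).copy()

    # Label each record according to its origin: True if it came from the training set
    train_half["privacy_test_is_training"] = True
    valid_half["privacy_test_is_training"] = False

    # Merge the two halves and keep only a fraction of the merged result
    merged = pd.concat([train_half, valid_half], ignore_index=True)
    return merged.sample(frac=sample_fraction, random_state=random_state).reset_index(drop=True)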

The function membership_inference_test() performs the membership inference attack itself. It computes the distance to closest record (DCR) between the adversary dataset, either generated with adversary_dataset() or provided by the user, and the synthetic dataset. Each distance is then compared against thresholds defined from the quantiles of the DCR distribution, and the corresponding record is classified as belonging to the training dataset or not. The adversary's guesses are compared to the ground truth labels to derive the accuracy of the attack.
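
To make the procedure concrete, the sketch below implements the same idea with a plain Euclidean DCR on numeric columns. The helper name simple_membership_inference, the single-quantile threshold, and the use of SciPy are assumptions made for illustration and do not reflect the actual internals of membership_inference_test().

import numpy as np
from scipy.spatial.distance import cdist

def simple_membership_inference(adversary_df, synthetic_df, ground_truth, quantile=0.5):
    # Assumes both datasets contain only numeric feature columns
    features = adversary_df.drop(columns=["privacy_test_is_training"], errors="ignore")

    # Distance to closest record (DCR): for each adversary record, the distance
    # to the nearest record in the synthetic dataset
    distances = cdist(features.to_numpy(), synthetic_df.to_numpy())
    dcr = distances.min(axis=1)

    # Records whose DCR falls below the chosen quantile of the DCR distribution
    # are guessed to be members of the training set
    threshold = np.quantile(dcr, quantile)
    guesses = dcr <= threshold

    # Compare the adversary guesses with the ground truth labels to obtain the attack accuracy
    accuracy = (guesses == ground_truth.to_numpy()).mean()
    return guesses, accuracy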

APIs


adversary_dataset()

membership_inference_test()

Examples


It is possible to specify the path where the computed information will be saved as a JSON file.

# Generates the adversary dataset
adv_data = adversary_dataset(real_data_preprocessed, valid_data_preprocessed)

# The function adversary_dataset adds a column "privacy_test_is_training" to the adversary dataset, 
# indicating whether the record was part of the training set or not
adv_guesses_ground_truth = adv_data["privacy_test_is_training"] 

# Start the MIA simulation
MIA = membership_inference_test(adv_data,
                                synth_data_preprocessed,
                                adv_guesses_ground_truth,
                                path_to_json="path/to/json")