Problem Description
In #772, we're adding a new metric called EqualizedOddsImprovement that allows us to measure whether the synthetic data exhibits more fairness than the real data. Along with this metric, we'd like to create a tutorial notebook that shows how to use it and what kind of effect it has.
Notebook Description
This notebook can make use of the sdv library in order to create synthetic data. It should go through the following steps:
- Take the adult dataset from the single-table demo datasets and split it into a training set and a test set. Keep in mind that both sets should contain every combination of the prediction target and the sensitive attribute:
- The prediction target column is income, where a positive result is income='>50K'.
- The sensitive attribute for this dataset is the sex column. That is to say, we do not want the classifier to make its prediction based on the reported sex.
- Train an SDV synthesizer (e.g. TVAESynthesizer) on the training set from the first step.
- Sample synthetic data from the synthesizer. Then run the EqualizedOddsImprovement metric on the real vs. synthetic data to see what the results are.
- Now use conditional sampling to try to remove biases. That is to say, sample all 4 combinations of target and sensitive attribute in equal proportions:
- 25% data with income='>50K' and sex='Female'
- 25% data with income='>50K' and sex='Male'
- 25% data with income='<50K' and sex='Female'
- 25% data with income='<50K' and sex='Male'
- Test the conditionally sampled synthetic data against the real data using the EqualizedOddsImprovement metric to see if it has improved.
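As a rough illustration of the split and conditional-sampling steps above, here is a minimal sketch. The helper names (stratified_split, balanced_conditions, sample_balanced) are hypothetical and not part of SDV or SDMetrics. The first two helpers use only the standard library and operate on rows as plain dicts; in the notebook itself the same idea would more likely be done on a DataFrame. The last helper assumes SDV 1.x, where sdv.sampling.Condition and sample_from_conditions are available. The label strings are copied from this issue and should be checked against the actual values in the dataset.

```python
import itertools
import random
from collections import defaultdict

TARGET, SENSITIVE = 'income', 'sex'  # column names as given in this issue


def stratified_split(rows, keys=(TARGET, SENSITIVE), test_fraction=0.25, seed=0):
    """Split dict rows so every (target, sensitive) combination is in both sets."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[k] for k in keys)].append(row)

    train, test = [], []
    for combo, members in groups.items():
        if len(members) < 2:
            raise ValueError(f'Need at least 2 rows for combination {combo}')
        rng.shuffle(members)
        # Keep at least one row of this combination on each side of the split.
        n_test = min(max(1, round(len(members) * test_fraction)), len(members) - 1)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test


def balanced_conditions(target_values, sensitive_values, total_rows):
    """Describe one equally sized condition per target/sensitive combination."""
    combos = list(itertools.product(target_values, sensitive_values))
    rows_each = total_rows // len(combos)
    return [
        {'num_rows': rows_each, 'column_values': {TARGET: t, SENSITIVE: s}}
        for t, s in combos
    ]


def sample_balanced(synthesizer, specs):
    """Turn the specs into SDV conditions and sample (assumes SDV 1.x)."""
    from sdv.sampling import Condition  # imported lazily; requires sdv

    conditions = [
        Condition(num_rows=spec['num_rows'], column_values=spec['column_values'])
        for spec in specs
    ]
    return synthesizer.sample_from_conditions(conditions=conditions)


# Four combinations at 25% each, for an assumed total of 1000 synthetic rows.
specs = balanced_conditions(['>50K', '<50K'], ['Female', 'Male'], total_rows=1000)
```

With a fitted synthesizer, sample_balanced(synthesizer, specs) would return the balanced synthetic data to pass to the metric in the final step.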
Expected behavior
Create a notebook that follows the above steps and includes an explanation for each one.
The notebook can be added to the SDMetrics/resources folder here. (Please remove the existing visualization in that folder, as it is not needed anymore.)