This is a benchmark for evaluating how well AI agents can perform fair, data-driven decision-making.
The benchmark consists of several tasks.
A fairnessBench task is defined as follows: given a dataset and a very simple training script that uses a logistic regression model, how well can an LLM agent improve the training script to achieve high fairness metrics?
Here we describe the steps necessary to run the experiment. Once you are able to run fairnessBench on your own machine, you can start adding your own tasks and your own LLMs to benchmark.
To run this benchmark, start by setting up a suitable Python environment. Then clone the fairnessBench repo inside your designated folder, install the requirements, and install fairnessBench itself:
git clone https://github.com/ml4sts/fairnessBench.git
pip install -r fairnessBench/requirements.txt
pip install -e fairnessBench
Warning
Logs and workspace files can take up a lot of disk space. Please create a log directory on a volume with sufficient free space.
The path to your log directory will be used as the environment variable LOG_PATH in the following step.
- On your terminal, run `export LOG_PATH=<path_to_log_dir>`
- Pick a task or list of tasks to run from fairnessBench/benchmarks/tasks.json
- Pick the LLMs to benchmark (make sure the required API keys are in the root directory of the app)
- Follow the steps in each script to declare the tasks to run and evaluate and the LLMs to use
- Run `bash run_experiment.sh` to run the tasks with an LLM agent
- Run `bash baseline.sh` to run the baseline for the same tasks
- Run `bash eval.sh` on the agent runs and on the baseline to compare results
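Before launching the scripts above, it can help to confirm that LOG_PATH is set and points at a directory with enough free space. The following is a minimal preflight sketch and not part of fairnessBench: the function name `check_log_path` and the 10 GB default threshold are hypothetical choices for illustration.

```python
import os
import shutil
from pathlib import Path


def check_log_path(min_free_gb=10.0):
    """Return (log_dir, free_gb, has_enough_space) for the LOG_PATH directory.

    The 10 GB default is an arbitrary illustrative threshold, not a
    fairnessBench requirement.
    """
    log_path = os.environ.get("LOG_PATH")
    if not log_path:
        raise RuntimeError(
            "LOG_PATH is not set; run: export LOG_PATH=<path_to_log_dir>"
        )

    log_dir = Path(log_path)
    if not log_dir.is_dir():
        raise RuntimeError(f"LOG_PATH does not point to a directory: {log_dir}")

    # shutil.disk_usage reports total, used, and free bytes for the volume.
    free_gb = shutil.disk_usage(log_dir).free / 1024**3
    return log_dir, free_gb, free_gb >= min_free_gb
```

Calling `check_log_path()` in the same shell session where LOG_PATH was exported returns the resolved directory, the free space in GB, and whether it meets the chosen threshold.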
Note
Some tasks are time-consuming, especially when using local LLMs. Make sure you have sufficient compute resources available,
and if you are in a SLURM environment, follow the provided template batch_script_template.sh to schedule your job properly.
- Log_dir: A directory name in `LOG_PATH` for the environment to keep the logs
- All tasks: list of tasks to be run with the agent
- Models: The large language models that will power the agent
- Available options are:
- Paid: claude-2.1, gpt-4-0125-preview, gpt-4o-mini, gpt-4o, claude-3-7-sonnet-20250219, claude-3-5-haiku-20241022, claude-3-opus-20240229
- Local: gemini-pro, llama, qwen, granite
- Local models will be downloaded into your default cache unless you point to another location with `export HF_HOME=<path_to_model>`
- edit_script_model & fast_llm: LLMs used for smaller actions such as editing a script or summarizing a long observation; these can optionally differ from the main agent models
The contents of the eval.sh script
- Log_dir: Directory where the LLM agent placed the experiment logs
- json_folder: Directory to place results in
- All tasks: list of tasks to be evaluated on
- Models: Models to evaluate on the tasks above
The baseline.sh script provides a standardized way to run a benchmark task or a list of benchmark tasks and produce baseline results to compare with the agent results.
- Log_dir: A directory name in `LOG_PATH` for the environment to keep the logs
- All tasks: list of tasks to be evaluated on
We use the following group fairness metrics to capture disparities, assess differences in true positive rates, quantify misclassification disparities, and examine disparities in false negatives across groups:
Independence measures whether the prediction and the demographic group are independent.
- Disparate Impact
- Statistical Parity Difference
- Error Rate Difference
- Error Rate Ratio
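The four metrics listed above can be sketched in plain Python. This is an illustrative sketch, not fairnessBench's own implementation: the function names, the encoding `group == 1` for the privileged group, the "unprivileged minus privileged" direction for differences, and the toy data are all assumptions made for the example.

```python
def rate(values):
    """Fraction of 1s in a list of 0/1 values; 0.0 for an empty list."""
    return sum(values) / len(values) if values else 0.0


def independence_metrics(y_true, y_pred, group):
    """Compare selection rates and error rates between two groups.

    Assumption: group[i] == 1 marks the privileged group, 0 the unprivileged one.
    """
    rows = list(zip(y_true, y_pred, group))
    pred_priv = [p for _, p, g in rows if g == 1]
    pred_unpriv = [p for _, p, g in rows if g == 0]
    err_priv = [int(t != p) for t, p, g in rows if g == 1]
    err_unpriv = [int(t != p) for t, p, g in rows if g == 0]

    sel_priv, sel_unpriv = rate(pred_priv), rate(pred_unpriv)
    er_priv, er_unpriv = rate(err_priv), rate(err_unpriv)
    return {
        # Ratios are undefined when the privileged-group rate is zero.
        "disparate_impact": sel_unpriv / sel_priv if sel_priv else float("nan"),
        "statistical_parity_difference": sel_unpriv - sel_priv,
        "error_rate_difference": er_unpriv - er_priv,
        "error_rate_ratio": er_unpriv / er_priv if er_priv else float("nan"),
    }


# Toy data, invented for illustration only.
y_true = [1, 0, 1, 0, 1, 0, 1, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
group = [1, 1, 1, 1, 0, 0, 0, 0]

for name, value in independence_metrics(y_true, y_pred, group).items():
    print(f"{name}: {value:.3f}")
```

On this toy data the privileged group is selected 75% of the time and the unprivileged group 25%, so the selection-based metrics show a gap even though both groups have the same error rate. Under these conventions, a ratio of 1 and a difference of 0 indicate parity.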
Separation measures whether the prediction and the demographic group are independent, conditioned on the ground truth.
- Equal Opportunity Difference
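Equal Opportunity Difference compares true positive rates across groups. The sketch below is illustrative and not taken from fairnessBench: the function names, the encoding `group == 1` for the privileged group, the "unprivileged minus privileged" direction, and the toy data are assumptions.

```python
def true_positive_rate(y_true, y_pred):
    """TP / (TP + FN): share of actual positives that were predicted positive."""
    preds_for_positives = [p for t, p in zip(y_true, y_pred) if t == 1]
    if not preds_for_positives:
        return float("nan")  # undefined when the group has no actual positives
    return sum(preds_for_positives) / len(preds_for_positives)


def equal_opportunity_difference(y_true, y_pred, group):
    """TPR(unprivileged) - TPR(privileged); 0 means equal opportunity.

    Assumption: group[i] == 1 marks the privileged group, 0 the unprivileged one.
    """
    def tpr_for(flag):
        pairs = [(t, p) for t, p, g in zip(y_true, y_pred, group) if g == flag]
        return true_positive_rate([t for t, _ in pairs], [p for _, p in pairs])

    return tpr_for(0) - tpr_for(1)


# Toy data, invented for illustration only.
y_true = [1, 0, 1, 0, 1, 0, 1, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
group = [1, 1, 1, 1, 0, 0, 0, 0]

print(equal_opportunity_difference(y_true, y_pred, group))
```

Here the privileged group's actual positives are all predicted positive (TPR 1.0), while only half of the unprivileged group's are (TPR 0.5), giving a difference of -0.5.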
Sufficiency measures whether the ground truth is independent of the demographic variables, conditioned on the prediction.
- False Omission Rate Difference
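The false omission rate is the share of negative predictions that are actually positive, FN / (FN + TN). The sketch below computes its difference across groups. It is illustrative and not fairnessBench's implementation: the function names, the encoding `group == 1` for the privileged group, the "unprivileged minus privileged" direction, and the toy data are assumptions.

```python
def false_omission_rate(y_true, y_pred):
    """FN / (FN + TN): share of predicted negatives that are actually positive."""
    truths_for_negatives = [t for t, p in zip(y_true, y_pred) if p == 0]
    if not truths_for_negatives:
        return float("nan")  # undefined when nothing was predicted negative
    return sum(truths_for_negatives) / len(truths_for_negatives)


def false_omission_rate_difference(y_true, y_pred, group):
    """FOR(unprivileged) - FOR(privileged); 0 means parity.

    Assumption: group[i] == 1 marks the privileged group, 0 the unprivileged one.
    """
    def for_group(flag):
        pairs = [(t, p) for t, p, g in zip(y_true, y_pred, group) if g == flag]
        return false_omission_rate([t for t, _ in pairs], [p for _, p in pairs])

    return for_group(0) - for_group(1)


# Toy data, invented for illustration only.
y_true = [1, 0, 1, 0, 1, 0, 1, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
group = [1, 1, 1, 1, 0, 0, 0, 0]

print(false_omission_rate_difference(y_true, y_pred, group))
```

In this toy data, one of the three negative predictions for the unprivileged group is wrong, while the single negative prediction for the privileged group is correct, so the difference is 1/3.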
We use a variety of open-source and paid (proprietary) LLMs.
- Meta's Llama-3.3-70B (open source)
- Alibaba's Qwen-2.5-72B (open source)
- OpenAI's GPT-4o (proprietary)
- Anthropic's Claude 3.7 Sonnet (proprietary)
The script automates:
- Benchmark execution
- Experiment configuration
- Model/task setup
- Logging and output organization
- Reproducible baseline runs
It is intended for:
- Researchers reproducing paper results
- Users evaluating new models against benchmark baselines
- Developers validating changes to the framework
Run eval.sh with a list of tasks. eval.sh runs eval.py, which in turn runs the different levels of evaluation:
- The task-specific eval.py (to evaluate accuracy and fairness metrics)
- A Flake8 evaluation of the training script generated by the agent, covering its Python AST and its use of fairness libraries
- From the fairnessbench_analysis directory, run explode_results.py (make sure to set the result paths to the folder that eval.sh wrote its output to) to prepare CSV files with all the collected results
- Use the other Python scripts in the analysis folder to generate the plots
- Task-specific: environment files for the task, the baseline train.py, and the dataset files.
- Benchmarking infrastructure: code needed to run the benchmark overall, scoring, etc. (`environment.py`, `run.py`, `eval-<type>.py`)
- Agent: agent tools, agent prompts, etc.