Fairness Bench

This benchmark evaluates how well AI systems can carry out fair, data-driven decision-making.

The benchmark consists of several tasks.

A fairnessBench task is defined as follows: given a dataset and a very simple training script that uses a logistic regression model, how well can an LLM agent improve the training script to achieve high fairness metrics?

Instructions for running the benchmark:

Here we describe the steps necessary to run the experiment. Once you are able to run fairnessBench on your own machine, you can start adding your own tasks and your own LLMs to benchmark.

Setup Python Environment

To run this benchmark, start by setting up a suitable Python environment. Then clone the fairnessBench repo inside your designated folder, install the requirements, and install fairnessBench itself:

git clone https://github.com/ml4sts/fairnessBench.git
pip install -r fairnessBench/requirements.txt
pip install -e fairnessBench

Warning

Logs and workspace files can take up a lot of disk space. Please create a log directory on a volume with sufficient free space. The path to this log directory is used as the environment variable LOG_PATH in the following step.
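
As a minimal sketch of the check described in the warning above, the following Python helper creates a log directory and reports how much free space its volume has. The function name prepare_log_dir and the min_free_gb threshold are hypothetical and chosen only for illustration; fairnessBench itself does not prescribe a minimum size.

```python
import os
import shutil
from pathlib import Path


def prepare_log_dir(path, min_free_gb=10.0):
    """Create the log directory if needed and check its free disk space.

    min_free_gb is an illustrative threshold, not a fairnessBench requirement.
    Returns (resolved_path, free_space_in_GiB, has_enough_space).
    """
    log_dir = Path(os.path.expanduser(path)).resolve()
    log_dir.mkdir(parents=True, exist_ok=True)
    # disk_usage reports on the volume that holds log_dir
    free_gb = shutil.disk_usage(log_dir).free / 1024**3
    return log_dir, free_gb, free_gb >= min_free_gb
```

The resolved path returned by the helper is what you would then pass to export LOG_PATH=<path_to_log_dir>.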

What Makes A Single Run

1. In your terminal, run export LOG_PATH=<path_to_log_dir>
2. Pick a task or list of tasks to run from fairnessBench/benchmarks/tasks.json
3. Pick the LLMs to benchmark (make sure the required API keys are in the root directory of the app)
4. Follow the steps in each script to declare the tasks to run and evaluate and the LLMs to use
5. Run bash run_experiment.sh to run the tasks with an LLM agent
6. Run bash baseline.sh to run the baseline for the same tasks
7. Run bash eval.sh on the agent runs and on the baseline to compare results

Note

Some tasks are time-consuming, especially when using local LLMs. Make sure you have sufficient compute resources available. If you are in a SLURM environment, follow the provided template batch_script_template.sh to schedule your job properly.

run_experiment.sh

Log_dir: A directory name in LOG_PATH for the environment to keep the logs
All tasks: list of tasks to be run with the agent
Models: The large language models that will power the agent
- Available options are:
  - Paid: claude-2.1, gpt-4-0125-preview, gpt-4o-mini, gpt-4o, claude-3-7-sonnet-20250219, claude-3-5-haiku-20241022, claude-3-opus-20240229
  - Local: gemini-pro, llama, qwen, granite
- Local models are downloaded into your default cache unless you point to an existing location with export HF_HOME=<path_to_model>
edit_script_model & fast_llm: LLMs used for smaller actions such as editing a script or summarizing a long observation. These can optionally differ from the main agent models.

Eval

The eval.sh script defines the following settings:

Log_dir: Directory where the LLM agent placed the experiment logs
json_folder: Directory to place results in
All tasks: list of tasks to be evaluated on
Models: Models to evaluate on the tasks above

Baseline

The baseline.sh script provides a standardized way to run one or more benchmark tasks and produce the baseline results that agent results are compared against.

Log_dir: A directory name in LOG_PATH for the environment to keep the logs
All tasks: list of tasks to be evaluated on

Fairness Metrics

We use the following group fairness metrics to capture disparities in predictions, differences in true positive rates, misclassification disparities, and disparities in false negatives across groups:

Independence

Measures whether the prediction and demographic group are independent.

Disparate Impact
Statistical Parity Difference
Error Rate Difference
Error Rate Ratio
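
To make the four metrics listed above concrete, here is a minimal pure-Python sketch for binary labels and a binary protected attribute. It assumes the common "unprivileged minus (or over) privileged" convention, as used for example by AIF360; the benchmark's own eval scripts may compute or orient these values differently. The function names and toy data are illustrative only.

```python
def _rate(flags):
    """Fraction of True values in an iterable; raises if it is empty."""
    flags = list(flags)
    if not flags:
        raise ValueError("group has no samples")
    return sum(flags) / len(flags)


def independence_metrics(y_true, y_pred, group, privileged=1):
    """Compare the unprivileged group against the privileged group.

    Assumed convention: differences are unprivileged - privileged,
    ratios are unprivileged / privileged.
    """
    rows = list(zip(y_true, y_pred, group))
    priv = [(t, p) for t, p, g in rows if g == privileged]
    unpriv = [(t, p) for t, p, g in rows if g != privileged]

    # Selection rate: share of positive predictions in each group
    sel_priv = _rate(p == 1 for _, p in priv)
    sel_unpriv = _rate(p == 1 for _, p in unpriv)
    # Error rate: share of misclassified samples in each group
    err_priv = _rate(t != p for t, p in priv)
    err_unpriv = _rate(t != p for t, p in unpriv)

    return {
        "statistical_parity_difference": sel_unpriv - sel_priv,
        "disparate_impact": sel_unpriv / sel_priv if sel_priv else float("nan"),
        "error_rate_difference": err_unpriv - err_priv,
        "error_rate_ratio": err_unpriv / err_priv if err_priv else float("nan"),
    }


# Toy data: first four samples are privileged (1), last four unprivileged (0)
y_true = [1, 0, 1, 0, 1, 0, 1, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
group = [1, 1, 1, 1, 0, 0, 0, 0]
metrics = independence_metrics(y_true, y_pred, group)
```

On this toy data the privileged group receives positive predictions 75% of the time and the unprivileged group 25%, so the statistical parity difference is -0.5 and the disparate impact is 1/3, while both groups have the same error rate (difference 0, ratio 1).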

Separation

Separation measures whether the prediction and the demographic group are independent, conditioned on the ground truth.

Equal Opportunity Difference
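
As an illustration, equal opportunity difference can be sketched as the gap in true positive rates between groups. The code below assumes binary labels and the "unprivileged minus privileged" convention; the function names are hypothetical and the benchmark's own implementation may differ.

```python
def true_positive_rate(y_true, y_pred):
    """TPR = TP / (TP + FN), computed over samples whose true label is 1."""
    preds_for_positives = [p for t, p in zip(y_true, y_pred) if t == 1]
    if not preds_for_positives:
        raise ValueError("no positive samples in this group")
    hits = sum(1 for p in preds_for_positives if p == 1)
    return hits / len(preds_for_positives)


def equal_opportunity_difference(y_true, y_pred, group, privileged=1):
    """TPR of the unprivileged group minus TPR of the privileged group."""
    rows = list(zip(y_true, y_pred, group))
    priv = [(t, p) for t, p, g in rows if g == privileged]
    unpriv = [(t, p) for t, p, g in rows if g != privileged]
    tpr_priv = true_positive_rate(*zip(*priv))
    tpr_unpriv = true_positive_rate(*zip(*unpriv))
    return tpr_unpriv - tpr_priv


# Toy data: first four samples are privileged (1), last four unprivileged (0)
y_true = [1, 0, 1, 0, 1, 0, 1, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
group = [1, 1, 1, 1, 0, 0, 0, 0]
eod = equal_opportunity_difference(y_true, y_pred, group)
```

Here the privileged group has a true positive rate of 1.0 and the unprivileged group 0.5, giving a difference of -0.5. A value of 0 would mean that qualified members of both groups are recognized at the same rate.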

Sufficiency

Sufficiency measures whether the ground truth is independent of the demographic group, conditioned on the prediction.

False Omission Rate Difference
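
A minimal sketch of this metric, assuming binary labels: the false omission rate is the share of negative predictions that are actually positive, FN / (FN + TN), and the difference is taken as unprivileged minus privileged. The names and the sign convention are assumptions for illustration, not taken from the benchmark code.

```python
def false_omission_rate(y_true, y_pred):
    """FOR = FN / (FN + TN), computed over samples predicted as 0."""
    truths_for_negatives = [t for t, p in zip(y_true, y_pred) if p == 0]
    if not truths_for_negatives:
        raise ValueError("no negative predictions in this group")
    misses = sum(1 for t in truths_for_negatives if t == 1)
    return misses / len(truths_for_negatives)


def false_omission_rate_difference(y_true, y_pred, group, privileged=1):
    """FOR of the unprivileged group minus FOR of the privileged group."""
    rows = list(zip(y_true, y_pred, group))
    priv = [(t, p) for t, p, g in rows if g == privileged]
    unpriv = [(t, p) for t, p, g in rows if g != privileged]
    for_priv = false_omission_rate(*zip(*priv))
    for_unpriv = false_omission_rate(*zip(*unpriv))
    return for_unpriv - for_priv


# Toy data: first four samples are privileged (1), last four unprivileged (0)
y_true = [1, 0, 1, 0, 1, 0, 1, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
group = [1, 1, 1, 1, 0, 0, 0, 0]
ford = false_omission_rate_difference(y_true, y_pred, group)
```

In the toy data, the single negative prediction in the privileged group is correct (rate 0), while one of the three negative predictions in the unprivileged group is wrong (rate 1/3), so the difference is 1/3.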

Different LLM models used for agent

We use a variety of open-source and proprietary LLMs:

Meta's Llama-3.3-70B (open source)
Alibaba's Qwen-2.5-72B (open source)
OpenAI's GPT-4o (proprietary)
Anthropic's Claude 3.7 Sonnet (proprietary)

Purpose

The script automates:

Benchmark execution
Experiment configuration
Model/task setup
Logging and output organization
Reproducible baseline runs

It is intended for:

Researchers reproducing paper results
Users evaluating new models against benchmark baselines
Developers validating changes to the framework

What does eval do?

Run eval.sh with a list of tasks. eval.sh runs eval.py, which in turn runs the different levels of evaluation:

The task-specific eval.py (to evaluate accuracy and fairness metrics)

A Flake8 eval that checks the training script generated by the agent, covering its Python AST and some use of fairness libraries
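
To give a rough idea of what an AST-based check for fairness library use could look like, here is a small sketch built on Python's standard ast module. It is not the benchmark's actual Flake8 eval: the function name fairness_imports and the set of library names are assumptions made for illustration.

```python
import ast

# Assumed list of fairness libraries to look for; adjust as needed.
FAIRNESS_LIBRARIES = {"aif360", "fairlearn"}


def fairness_imports(source):
    """Return the fairness libraries imported anywhere in a Python source string.

    Raises SyntaxError if the source cannot be parsed.
    """
    tree = ast.parse(source)
    found = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            # node.module is None for relative imports such as "from . import x"
            names = [node.module or ""]
        else:
            continue
        for name in names:
            top_level = name.split(".")[0]
            if top_level in FAIRNESS_LIBRARIES:
                found.add(top_level)
    return found
```

Because ast.parse fails on invalid code, the same call also doubles as a basic check that the agent-generated script is syntactically valid Python.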

Reading eval results

From the fairnessbench_analysis directory, run explode_results.py to prepare CSV files with all the collected results (make sure to set the result paths to the folder that eval.sh wrote its output to)
Use the other Python scripts in the analysis folder to generate the plots

Roles

Task-specific: environment files for the task, the baseline train.py, and the dataset files.
Benchmarking infrastructure: code needed to run the benchmark overall, score results, etc. (environment.py, run.py, eval-<type>.py)
Agent: agent tools, agent prompts, etc
