This repository contains an advanced framework for iterative document ranking with Large Language Models (LLMs), including novel approaches for self-refinement and critic-based feedback mechanisms.
The project is built on top of the rank_llm library, whose codebase we have further adapted. It implements multiple ranking strategies in ranking/:
- Multi-Pass Ranking: Iterative listwise reranking with LLMs
- Iterative Self-Refinement: Multiple ranking iterations in which the model refines its own output by first generating feedback on the given ranking and then producing a new ranking based on this feedback.
- Critic-Enhanced Ranking: Dual-model approach with a separate critic model providing structured or unstructured feedback.
- Ground Truth-Guided Refinement: Critic model enhanced with relevance judgments for testing feedback potential.
Built on top of and extending the RankLLM library, the framework supports:
- Models: RankZephyr, RankVicuna, Qwen, Gemma
- Inference Backend: vLLM
- YAML Configuration: Unified configuration system for all experiments
- Config Modes: Pre-configured setups for different ranking strategies:
  - ranking/config/reranking/: Single-pass ranking configurations
  - ranking/config/self_refinement/: Iterative self-refinement setups
  - ranking/config/refinement_with_critic/: Dual-model critic-based configurations
  - zephyr/: Zephyr model configurations
- Reproducibility: Configuration hashing ensures experiment tracking
**Important**: To get the hash for a config, use get_config_hash.py; a conceptual sketch of the hashing idea follows this list. For more information on this function and examples, refer to README_CONFIG_HASH.md.
- Datasets: Amazon Shopping Queries (plus a subset of hard queries, called test-data), FutureQueryEval, and TREC DL19
- Query Difficulty Analysis: Tools for identifying and analyzing challenging queries in data-analysis/
- Distributed Inference: SageMaker_GPU integration for large-scale experiments
- S3 Integration: Automatic data upload/download and result synchronization
- Batch Processing: Submit and evaluate multiple experiments efficiently
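The idea behind the configuration hash can be sketched as follows. This is an illustrative assumption (SHA-256 over a canonical, key-sorted JSON dump of the parsed YAML), not the actual implementation of get_config_hash.py; see README_CONFIG_HASH.md for the real behavior.

```python
# Illustrative only: hash a canonical (key-sorted) JSON dump of the YAML config so
# that identical experiment settings always map to the same identifier.
# This is an assumption, not the actual get_config_hash.py implementation.
import hashlib
import json

import yaml  # PyYAML


def config_hash(config_path: str) -> str:
    with open(config_path) as f:
        config = yaml.safe_load(f)
    canonical = json.dumps(config, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


print(config_hash("ranking/config/reranking/dl19-default.yml"))
```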
Copy the example environment file and configure your settings:
cp .env.example .env
# Edit .env with your configuration

Environment Variables:
- RANKING_S3_BUCKET: Your S3 bucket name (optional, defaults to local storage)
- RANKING_S3_BASE_PATH: Base path within S3 bucket (default: "ranking/data")
- RANKING_LOCAL_DATA_PATH: Local data directory (default: "./data")
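These settings resolve roughly as in the sketch below, which assumes plain os.environ lookups; the project's actual loader (e.g., via python-dotenv) may differ.

```python
# A sketch of how the storage settings resolve, assuming plain environment lookups
# (the project's actual loader may differ, e.g. it may use python-dotenv).
import os

s3_bucket = os.environ.get("RANKING_S3_BUCKET")  # None => local storage only
s3_base_path = os.environ.get("RANKING_S3_BASE_PATH", "ranking/data")
local_data_path = os.environ.get("RANKING_LOCAL_DATA_PATH", "./data")

print(s3_bucket, s3_base_path, local_data_path)
```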
Pull and run the Docker image. Log in first:
aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin <AWS_ACCOUNT_ID>.dkr.ecr.us-east-1.amazonaws.com
If you also want to build the image, you additionally need to log in to the following account to pull the base image:
aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin 763104351884.dkr.ecr.us-east-1.amazonaws.com
Then build:
docker build -t <AWS_ACCOUNT_ID>.dkr.ecr.us-east-1.amazonaws.com/tczin/llm-rankers:ranking-sagemaker_gpu72 -f Dockerfile.ranking.sagemaker_gpu --push .
Then run the container:
docker run --gpus all --shm-size 4GB -v path/to/LLMRanker/src/RankingWithLLMs/:/workspace -it <AWS_ACCOUNT_ID>.dkr.ecr.us-east-1.amazonaws.com/tczin/llm-rankers:ranking-sagemaker_gpu72 bash
Then connect to the container from VSCode. If you are using sagemaker_gpu, make sure that the submit_to_sagemaker_gpu.py script uses the right docker image.
Recreate the main results table by launching jobs for all config files:
python batch_launch_sagemaker_gpu.py --models qwen gemma --dry-run
python ranking/cli.py --config ranking/config/reranking/dl19-default.yml
python ranking/cli.py --config ranking/config/self_refinement/dl19-default.yml
python ranking/cli.py --config ranking/config/refinement_with_critic/dl19_config.yml
python ranking/cli.py --config ranking/config/refinement_with_critic/dl19_config_with_groundtruth.yml

Example configuration file:
```yaml
# Model and ranking parameters
reranking:
  enabled: true
  model: "zephyr"  # or "qwen", "gemma", etc.
  batch_size: 4

# Iterative ranking settings
iterative:
  enabled: true
  iterations: 3
  dual_model: true  # Enable critic model
  critic_model: "gemma"
  convergence_threshold: 0.1

# Dataset configuration
data:
  dataset: "dl19"
  sample: false  # Use full dataset

# Evaluation
evaluation:
  enabled: true

# Prompt configuration
prompts:
  template: "llm_prompt_with_structured_feedback"
```

Input Documents
↓
[First Round Ranker] (e.g., RankZephyr) (Always uses rank_zephyr_template prompt)
↓
Initial Ranking
↓
[Iterative Refinement Loop]
├─→ [Critic Model] (optional)
│ ↓
│ Feedback
│ ↓
├─→ [Ranker Model]
│ ↓
├─→ Refined Ranking
↓
Final Ranking
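In code, the loop above corresponds roughly to the following sketch. The helper names (iterative_rerank, first_round_ranker, refine, critic) are hypothetical and do not reflect the actual rank_refiner.py API.

```python
# A rough sketch of the refinement loop shown above; helper names are hypothetical.
# The first-round ranker produces the initial ranking, then each iteration
# optionally consults a critic before the ranker refines its output.
from typing import Callable, List, Optional


def iterative_rerank(
    documents: List[str],
    first_round_ranker: Callable[[List[str]], List[str]],
    refine: Callable[[List[str], Optional[str]], List[str]],
    critic: Optional[Callable[[List[str]], str]] = None,
    iterations: int = 3,
) -> List[str]:
    ranking = first_round_ranker(documents)             # initial ranking
    for _ in range(iterations):
        feedback = critic(ranking) if critic else None  # optional critic feedback
        ranking = refine(ranking, feedback)             # ranker refines its output
    return ranking


# Dummy usage with placeholder callables:
final = iterative_rerank(
    ["d1", "d2", "d3"],
    first_round_ranker=lambda docs: sorted(docs),
    refine=lambda ranking, feedback: ranking,
    iterations=2,
)
print(final)
```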
- Core iterative ranking engine
- Support for single and dual-model modes (the latter uses a separate critic model)
- Optional ranking history tracking
- Dynamic prompt loading and template management
- Support for different prompts per iteration (see the sketch after this list)
- Parameterized prompt generation
- YAML-based experiment configuration
- Configuration validation and hashing
- Template inheritance and overrides
- Dataset loaders for multiple benchmarks
- Preprocessing pipelines
- S3 integration for data management
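The per-iteration prompt selection mentioned above can be illustrated with the sketch below. The per_iteration key and the resolve_prompt_template helper are hypothetical; the actual prompt_manager.py behavior may differ.

```python
# A minimal sketch of per-iteration prompt selection with fallback to the default
# template; the "per_iteration" key and helper are hypothetical, not the actual
# prompt_manager.py behavior.
from typing import Any, Dict


def resolve_prompt_template(config: Dict[str, Any], iteration: int) -> str:
    prompts_cfg = config.get("prompts", {})
    overrides = prompts_cfg.get("per_iteration", {})  # e.g. {1: "rank_zephyr_template"}
    return overrides.get(iteration, prompts_cfg.get("template", "rank_zephyr_template"))


config = {
    "prompts": {
        "template": "llm_prompt_with_structured_feedback",
        "per_iteration": {1: "rank_zephyr_template"},
    }
}
print(resolve_prompt_template(config, 1))  # rank_zephyr_template
print(resolve_prompt_template(config, 2))  # llm_prompt_with_structured_feedback
```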
For structured critic model feedback, use prompt: llm_prompt_with_structured_feedback and critic_prompt: structured_critic_analysis.
The critic model then provides a structured analysis that includes:
- Ranking Quality Assessment: Overall evaluation of current ranking
- Specific Issues: Identification of misranked documents
- Improvement Suggestions: Concrete recommendations for refinement
- Confidence Scores: Uncertainty quantification
Example critic output structure:
```json
{
  "overall_quality": "moderate",
  "specific_issues": [
    "Document [3] ranked too high given limited relevance",
    "Document [7] should rank higher due to comprehensive coverage"
  ],
  "suggestions": [
    "Reorder documents [3] and [7]",
    "Consider moving [5] higher for better topical alignment"
  ]
}
```

When ground truth relevance judgments are available:
- Critic receives relevance labels for each document
- Provides more accurate feedback for training analysis
- Enables evaluation of critic quality
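For illustration, relevance labels could be attached to the documents shown to the critic along the lines of the sketch below. The format_docs_for_critic helper is hypothetical, not the repository's prompt-building code.

```python
# A hypothetical sketch of attaching ground-truth relevance labels (qrels) to the
# documents shown to the critic; not the repository's actual prompt-building code.
from typing import Dict, List


def format_docs_for_critic(ranking: List[str], qrels: Dict[str, int]) -> str:
    lines = []
    for position, doc_id in enumerate(ranking, start=1):
        label = qrels.get(doc_id, 0)  # unjudged documents default to 0
        lines.append(f"[{position}] doc={doc_id} relevance={label}")
    return "\n".join(lines)


print(format_docs_for_critic(["d7", "d3", "d5"], {"d7": 3, "d3": 0}))
```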
Motivation: Prevent ranking oscillation by:
- Tracking last N iterations of rankings
- Including historical context in prompts
- Detecting convergence patterns
We found that including history harms performance.
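For completeness, the tracking mechanism itself can be sketched as follows. The HistoryTracker class is hypothetical and not the API of ranking_history_tracker.py.

```python
# A hypothetical sketch of tracking the last N rankings and rendering them as
# historical context for a prompt; not the actual ranking_history_tracker.py API.
from collections import deque
from typing import Deque, List


class HistoryTracker:
    def __init__(self, max_iterations: int = 3) -> None:
        self.history: Deque[List[str]] = deque(maxlen=max_iterations)

    def add(self, ranking: List[str]) -> None:
        self.history.append(list(ranking))

    def as_prompt_context(self) -> str:
        """Render the tracked rankings oldest-to-newest for inclusion in a prompt."""
        return "\n".join(
            f"Iteration {i + 1}: " + " > ".join(r) for i, r in enumerate(self.history)
        )


tracker = HistoryTracker(max_iterations=2)
tracker.add(["d1", "d2", "d3"])
tracker.add(["d2", "d1", "d3"])
print(tracker.as_prompt_context())
```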
For large-scale experiments:
# Submit batch of experiments to SageMaker_GPU
python batch_launch_sagemaker_gpu.py --config-dir ranking/config/refinement_with_critic --models qwen gemma
# Monitor and evaluate results
python batch_evaluate_sagemaker_gpu_runs.py --run-ids <id1> <id2> <id3>

The launch script submits all configs in the folder as jobs, creating configs for qwen and gemma on the fly. The evaluation script calculates the configs' hashes and retrieves all matching runs.
Tools for analyzing query complexity and difficulty:
Identifies query types based on linguistic features:
- Complex queries with multiple constraints
- Vague queries with subjective terms
- Comparative queries
- Queries with negations
- Well-formed sentence queries
python data-analysis/pre-retrieval-analysis/query_difficulty_lexical_analysis.py

Uses NLP and statistical features:
- Embedding similarity between queries and documents
- Entropy and diversity metrics
- TF-IDF analysis and KL divergence
- Clustering and visualization
python data-analysis/pre-retrieval-analysis/query_difficulty_statistical_analysis.py

Annotate queries with difficulty levels using LLMs:
python data-analysis/pre-retrieval-analysis/query_difficulty_llm_analysis.py

You can find the annotations in this S3 bucket.
- NDCG@k: Normalized Discounted Cumulative Gain
- Jaccard Distance: Measures the similarity between two rankings based on set overlap
- FRBO: Order-sensitive variation of Jaccard (for iteration analysis)
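For reference, the Jaccard distance between the top-k of two rankings can be computed as in the sketch below (illustrative only, not the repository's evaluation code).

```python
# A sketch of the Jaccard distance between the top-k of two rankings,
# 1 - |A ∩ B| / |A ∪ B|; not the repository's evaluation code.
from typing import List


def jaccard_distance(ranking_a: List[str], ranking_b: List[str], k: int = 10) -> float:
    top_a, top_b = set(ranking_a[:k]), set(ranking_b[:k])
    if not top_a and not top_b:
        return 0.0
    return 1.0 - len(top_a & top_b) / len(top_a | top_b)


print(jaccard_distance(["d1", "d2", "d3"], ["d2", "d3", "d4"], k=3))  # 0.5
```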
# Analyze ranking changes across iterations
python results-analysis/print_ndcg_per_iteration.py
# Analyze feedback quality impact
python results-analysis/analyze_feedback_quality_impact.py
# Generate formatted tables for publication
python results-analysis/generate_formatted_latex_table.py

See README_OUTPUTS_STRUCTURE.md for detailed information about output organization.
For data analysis and experimentation:
# Build the Docker image
docker build -f data-analysis-environment/Dockerfile -t data-exploration:latest .
# Start the environment
./data-analysis-environment/start_environment.sh
# Check status
./data-analysis-environment/check_docker_kernel.sh
# Stop when done
./data-analysis-environment/stop_docker_kernel.sh

- Iterative Ranking Guide - Detailed guide on iterative ranking
- Batch Launch Guide - Distributed inference on SageMaker_GPU
- Batch Evaluation Guide - Evaluating distributed runs
- Outputs Structure - Understanding output organization
- RankLLM Library - Core reranking library documentation
RankingWithLLMs/
├── ranking/ # Main ranking module
│ ├── cli.py # Configuration-based CLI
│ ├── ranking/ # Core ranking implementations
│ │ ├── rank_refiner.py # Iterative ranking engine
│ │ ├── prompt_manager.py # Prompt management
│ │ └── ranking_history_tracker.py
│ ├── config/ # Experiment configurations
│ │ ├── reranking/ # Single-pass configs
│ │ ├── self_refinement/ # Self-refinement configs
│ │ ├── refinement_with_critic/ # Critic configs
│ │ └── zephyr/ # Model-specific configs
│ ├── iterative_prompts/ # Prompt templates
│ ├── data_processing/ # Dataset loaders
│ └── evaluation/ # Evaluation metrics
├── rank_llm/ # Core reranking library (submodule)
│ └── src/rank_llm/
│ ├── rerank/ # Reranker implementations
│ └── data.py # Data structures
├── results-analysis/ # Analysis scripts
├── data-analysis/ # Query difficulty analysis
├── batch_launch_sagemaker_gpu.py # Run large-scale experiments
└── batch_evaluate_sagemaker_gpu_runs.py
This work is built on the RankLLM library and extends it with novel iterative refinement and critic-based feedback mechanisms.