Ranking with LLMs

This repository contains an advanced framework for iterative document ranking with Large Language Models (LLMs), including novel approaches for self-refinement and critic-based feedback mechanisms.

Overview

The project is built on top of the rank_llm library, which we have further adapted. It implements multiple ranking strategies in ranking/:

  1. Multi-Pass Ranking: Iterative listwise reranking with LLMs
  2. Iterative Self-Refinement: Multiple ranking iterations in which the model refines its own output by first generating feedback on the current ranking and then producing a new ranking based on that feedback.
  3. Critic-Enhanced Ranking: Dual-model approach in which a separate critic model provides structured or unstructured feedback.
  4. Ground Truth-Guided Refinement: Critic model enhanced with relevance judgments, used to test the potential of feedback.

Model Support

The framework builds on and extends the RankLLM library, supporting:

  • Models: RankZephyr, RankVicuna, Qwen, Gemma
  • Inference Backend: vLLM

Configuration-Based Workflow

  • YAML Configuration: Unified configuration system for all experiments
  • Config Modes: Pre-configured setups for different ranking strategies:
    • ranking/config/reranking/: Single-pass ranking configurations
    • ranking/config/self_refinement/: Iterative self-refinement setups
    • ranking/config/refinement_with_critic/: Dual-model critic-based configurations
    • ranking/config/zephyr/: Zephyr model configurations
  • Reproducibility: Configuration hashing ensures consistent experiment tracking. Important: To get the hash for a config, use get_config_hash.py; for more information and examples, refer to README_CONFIG_HASH.md. A minimal illustration of the idea follows below.
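The exact hashing scheme lives in get_config_hash.py and README_CONFIG_HASH.md; the snippet below is only a rough illustration of the idea, assuming a hash over the canonicalized YAML contents, and does not reproduce the actual implementation.

import hashlib
import json
import yaml

# Illustrative only: hash the canonicalized contents of a YAML config so that
# logically identical configs map to the same identifier.
def illustrative_config_hash(config_path: str) -> str:
    with open(config_path) as f:
        config = yaml.safe_load(f)
    canonical = json.dumps(config, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]

print(illustrative_config_hash("ranking/config/reranking/dl19-default.yml"))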

Dataset Support

  • Datasets: Amazon Shopping Queries (plus a subset of hard queries, called test-data), FutureQueryEval, and TREC DL19
  • Query Difficulty Analysis: Tools for identifying and analyzing challenging queries in data-analysis/.

Infrastructure

  • Distributed Inference: SageMaker_GPU integration for large-scale experiments
  • S3 Integration: Automatic data upload/download and result synchronization
  • Batch Processing: Submit and evaluate multiple experiments efficiently

Quick Start

Configuration

Copy the example environment file and configure your settings:

cp .env.example .env
# Edit .env with your configuration

Environment Variables:

  • RANKING_S3_BUCKET: Your S3 bucket name (optional, defaults to local storage)
  • RANKING_S3_BASE_PATH: Base path within S3 bucket (default: "ranking/data")
  • RANKING_LOCAL_DATA_PATH: Local data directory (default: "./data")
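To make the behavior concrete, here is a small sketch (not the project's actual loading code) of how these variables could be read, with S3 falling back to local storage when no bucket is set:

import os

# Sketch only: resolve storage settings from the environment variables above.
s3_bucket = os.getenv("RANKING_S3_BUCKET")                        # optional
s3_base_path = os.getenv("RANKING_S3_BASE_PATH", "ranking/data")  # default
local_data_path = os.getenv("RANKING_LOCAL_DATA_PATH", "./data")  # default

if s3_bucket:
    print(f"Syncing data with s3://{s3_bucket}/{s3_base_path}")
else:
    print(f"Using local storage at {local_data_path}")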

Installation

Pull and run the Docker image. Log in first:

aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin <AWS_ACCOUNT_ID>.dkr.ecr.us-east-1.amazonaws.com

If you also want to build the image, you need to log into this account, too, in order to pull the base image:

aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin 763104351884.dkr.ecr.us-east-1.amazonaws.com

Then build:

docker build -t <AWS_ACCOUNT_ID>.dkr.ecr.us-east-1.amazonaws.com/tczin/llm-rankers:ranking-sagemaker_gpu72 -f Dockerfile.ranking.sagemaker_gpu --push .

Then run the container:

docker run --gpus all --shm-size 4GB -v path/to/LLMRanker/src/RankingWithLLMs/:/workspace -it <AWS_ACCOUNT_ID>.dkr.ecr.us-east-1.amazonaws.com/tczin/llm-rankers:ranking-sagemaker_gpu72 bash

Then connect to the container from VSCode. If you are using sagemaker_gpu, make sure that the submit_to_sagemaker_gpu.py script uses the right Docker image.

Basic Usage

1. Multi-Pass Reranking

Recreate the main results table by launching jobs for all config files, or run a single configuration directly:

python batch_launch_sagemaker_gpu.py --model qwen gemma --dry-run
python ranking/cli.py --config ranking/config/reranking/dl19-default.yml

2. Iterative Self-Refinement

python ranking/cli.py --config ranking/config/self_refinement/dl19-default.yml

3. Critic-Enhanced Ranking

python ranking/cli.py --config ranking/config/refinement_with_critic/dl19_config.yml

4. Ground Truth-Guided Analysis

python ranking/cli.py --config ranking/config/refinement_with_critic/dl19_config_with_groundtruth.yml

Configuration Structure

Example configuration file:

# Model and ranking parameters
reranking:
  enabled: true
  model: "zephyr"  # or "qwen", "gemma", etc.
  batch_size: 4

# Iterative ranking settings
iterative:
  enabled: true
  iterations: 3
  dual_model: true  # Enable critic model
  critic_model: "gemma"
  convergence_threshold: 0.1
  
# Dataset configuration
data:
  dataset: "dl19"
  sample: false  # Use full dataset
  
# Evaluation
evaluation:
  enabled: true
  
# Prompt configuration
prompts:
  template: "llm_prompt_with_structured_feedback"

Architecture

Ranking Pipeline

Input Documents
    ↓
[First Round Ranker] (e.g., RankZephyr; always uses the rank_zephyr_template prompt)
    ↓
Initial Ranking
    ↓
[Iterative Refinement Loop]
  ├─→ [Critic Model] (optional)
  │     ↓
  │   Feedback
  │     ↓
  ├─→ [Ranker Model]
  │     ↓
  └─→ Refined Ranking
    ↓
Final Ranking
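The control flow can be summarized in pseudocode; this is a simplified sketch, and the function and method names are illustrative rather than the RankRefiner API:

# Simplified sketch of the pipeline above; names are illustrative.
def iterative_rerank(query, documents, ranker, critic=None, iterations=3):
    ranking = ranker.first_pass(query, documents)       # e.g. a RankZephyr listwise pass
    for _ in range(iterations):
        if critic is not None:
            feedback = critic.critique(query, ranking)  # dual-model (critic) mode
        else:
            feedback = ranker.critique(query, ranking)  # self-refinement mode
        ranking = ranker.refine(query, ranking, feedback)
    return ranking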

Key Components

1. RankRefiner (ranking/ranking/rank_refiner.py)

  • Core iterative ranking engine
  • Supports single- and dual-model modes (the latter uses a separate critic model)
  • Optional ranking history tracking

2. Prompt Management (ranking/ranking/prompt_manager.py)

  • Dynamic prompt loading and template management
  • Support for different prompts per iteration
  • Parameterized prompt generation

3. Configuration System (ranking/config/)

  • YAML-based experiment configuration
  • Configuration validation and hashing
  • Template inheritance and overrides

4. Data Processing (ranking/data_processing/)

  • Dataset loaders for multiple benchmarks
  • Preprocessing pipelines
  • S3 integration for data management

Advanced Features

Structured Critic Model Feedback

For structured critic feedback, use prompt: llm_prompt_with_structured_feedback and critic_prompt: structured_critic_analysis. The critic model then provides a structured analysis including:

  • Ranking Quality Assessment: Overall evaluation of current ranking
  • Specific Issues: Identification of misranked documents
  • Improvement Suggestions: Concrete recommendations for refinement
  • Confidence Scores: Uncertainty quantification

Example critic output structure:

{
  "overall_quality": "moderate",
  "specific_issues": [
    "Document [3] ranked too high given limited relevance",
    "Document [7] should rank higher due to comprehensive coverage"
  ],
  "suggestions": [
    "Reorder documents [3] and [7]",
    "Consider moving [5] higher for better topical alignment"
  ]
}
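Because the feedback is plain JSON, downstream code can parse it directly before injecting it into the next ranking prompt; a minimal sketch using the field names from the example above:

import json

critic_output = """
{
  "overall_quality": "moderate",
  "specific_issues": ["Document [3] ranked too high given limited relevance"],
  "suggestions": ["Reorder documents [3] and [7]"]
}
"""

# Parse the structured feedback before passing it to the ranker prompt.
feedback = json.loads(critic_output)
print("Quality:", feedback["overall_quality"])
for issue in feedback["specific_issues"]:
    print("Issue:", issue)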

Ground Truth-Guided Learning

When ground truth relevance judgments are available:

  • Critic receives relevance labels for each document
  • Provides more accurate feedback for training analysis
  • Enables evaluation of critic quality
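As a sketch of the idea (not the actual prompt construction), the relevance judgments can be attached to each candidate before it is shown to the critic; the document IDs and grades here are hypothetical:

# Sketch only: attach relevance judgments (qrels) to candidates for the critic.
qrels = {"D1": 3, "D2": 0, "D3": 2}    # hypothetical docid -> relevance grade
ranking = ["D2", "D1", "D3"]           # current ranking to be critiqued

critic_input = [
    {"rank": i + 1, "docid": docid, "relevance": qrels.get(docid, 0)}
    for i, docid in enumerate(ranking)
]
# An informed critic should flag D2 (grade 0) being ranked above D1 (grade 3).
print(critic_input)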

Historical Context

Motivation: Prevent ranking oscillation by:

  • Tracking last N iterations of rankings
  • Including historical context in prompts
  • Detecting convergence patterns

In our experiments, however, including history harmed performance.
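For intuition, here is a minimal sketch of such a tracker (illustrative only, not the ranking_history_tracker.py implementation):

from collections import deque

# Sketch only: keep the last N rankings and report a trivial convergence signal.
class HistoryTracker:
    def __init__(self, max_history: int = 3):
        self.history = deque(maxlen=max_history)

    def add(self, ranking):
        self.history.append(list(ranking))

    def converged(self) -> bool:
        # Converged if the two most recent rankings are identical.
        return len(self.history) >= 2 and self.history[-1] == self.history[-2]

tracker = HistoryTracker()
tracker.add(["D1", "D3", "D2"])
tracker.add(["D1", "D3", "D2"])
print(tracker.converged())  # True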

Distributed Inference

For large-scale experiments:

# Submit batch of experiments to SageMaker_GPU
python batch_launch_sagemaker_gpu.py --config-dir ranking/config/refinement_with_critic --models qwen gemma

# Monitor and evaluate results
python batch_evaluate_sagemaker_gpu_runs.py --run-ids <id1> <id2> <id3>

This launches all configs in the folder as jobs, creating configs for qwen and gemma on-the-fly. The evaluation script calculates the configs' hashes and retrieves all matching runs.

Query Difficulty Analysis

Tools for analyzing query complexity and difficulty:

Lexical Analysis

Identifies query types based on linguistic features:

  • Complex queries with multiple constraints
  • Vague queries with subjective terms
  • Comparative queries
  • Queries with negations
  • Well-formed sentence queries
python data-analysis/pre-retrieval-analysis/query_difficulty_lexical_analysis.py
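As an illustration of the kind of surface features involved (these patterns are examples, not the script's actual rules):

import re

# Sketch only: flag query types with simple lexical patterns.
def flag_query(query: str) -> dict:
    q = query.lower()
    return {
        "negation": bool(re.search(r"\b(not|without|except|no)\b", q)),
        "comparative": bool(re.search(r"\b(vs|versus|better than|compared to)\b", q)),
        "vague": bool(re.search(r"\b(best|good|nice|cheap)\b", q)),
        "multi_constraint": q.count(" and ") + q.count(",") >= 2,
    }

print(flag_query("best waterproof jacket without hood, lightweight and under 100 dollars"))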

Statistical Analysis

Uses NLP and statistical features:

  • Embedding similarity between queries and documents
  • Entropy and diversity metrics
  • TF-IDF analysis and KL divergence
  • Clustering and visualization
python data-analysis/pre-retrieval-analysis/query_difficulty_statistical_analysis.py
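As a rough illustration of the statistical side (not the script itself), one can score a query by the flatness of its TF-IDF similarity profile over candidate documents:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Sketch only: flat (high-entropy) similarity profiles hint at harder queries.
query = "iterative listwise reranking with llms"
docs = [
    "reranking documents with large language models",
    "a recipe for sourdough bread",
    "listwise learning to rank",
]

matrix = TfidfVectorizer().fit_transform([query] + docs)
sims = cosine_similarity(matrix[0], matrix[1:]).ravel()

probs = sims / sims.sum() if sims.sum() > 0 else np.ones_like(sims) / len(sims)
entropy = -np.sum(probs * np.log(probs + 1e-12))
print(sims, entropy)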

LLM-Based Annotation

Annotate queries with difficulty levels using LLMs:

python data-analysis/pre-retrieval-analysis/query_difficulty_llm_analysis.py

You can find the annotations in this S3 bucket.

Evaluation and Analysis

Metrics

  • NDCG@k: Normalized Discounted Cumulative Gain
  • Jaccard Distance: Measures the overlap between the document sets of two rankings (a minimal sketch follows below).
  • FRBO: Order-sensitive variation of Jaccard, used for iteration-to-iteration analysis.
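For intuition, here is a minimal sketch of the Jaccard distance between the document sets of two rankings (FRBO additionally weights agreement by rank position):

# Sketch only: Jaccard distance between the document sets of two rankings.
def jaccard_distance(ranking_a, ranking_b):
    a, b = set(ranking_a), set(ranking_b)
    return 1.0 - len(a & b) / len(a | b)

print(jaccard_distance(["D1", "D2", "D3"], ["D2", "D3", "D4"]))  # 0.5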

Per-Iteration Analysis

# Analyze ranking changes across iterations
python results-analysis/print_ndcg_per_iteration.py

# Analyze feedback quality impact
python results-analysis/analyze_feedback_quality_impact.py

# Generate formatted tables for publication
python results-analysis/generate_formatted_latex_table.py

Results Structure

See README_OUTPUTS_STRUCTURE.md for detailed information about output organization.

Docker Environment

For data analysis and experimentation:

# Build the Docker image
docker build -f data-analysis-environment/Dockerfile -t data-exploration:latest .

# Start the environment
./data-analysis-environment/start_environment.sh

# Check status
./data-analysis-environment/check_docker_kernel.sh

# Stop when done
./data-analysis-environment/stop_docker_kernel.sh

Documentation

Project Structure

RankingWithLLMs/
├── ranking/                    # Main ranking module
│   ├── cli.py                 # Configuration-based CLI
│   ├── ranking/               # Core ranking implementations
│   │   ├── rank_refiner.py   # Iterative ranking engine
│   │   ├── prompt_manager.py # Prompt management
│   │   └── ranking_history_tracker.py
│   ├── config/                # Experiment configurations
│   │   ├── reranking/        # Single-pass configs
│   │   ├── self_refinement/  # Self-refinement configs
│   │   ├── refinement_with_critic/  # Critic configs
│   │   └── zephyr/           # Model-specific configs
│   ├── iterative_prompts/    # Prompt templates
│   ├── data_processing/      # Dataset loaders
│   └── evaluation/           # Evaluation metrics
├── rank_llm/                  # Core reranking library (submodule)
│   └── src/rank_llm/
│       ├── rerank/           # Reranker implementations
│       └── data.py           # Data structures
├── results-analysis/          # Analysis scripts
├── data-analysis/            # Query difficulty analysis
├── batch_launch_sagemaker_gpu.py # Run large-scale experiments
└── batch_evaluate_sagemaker_gpu_runs.py

Acknowledgments

This work is built on the RankLLM library and extends it with novel iterative refinement and critic-based feedback mechanisms.
