This repository contains an advanced framework for iterative document ranking with Large Language Models (LLMs), including novel approaches for self-refinement and critic-based feedback mechanisms.
The project is built on top of the rank_llm library, whose codebase we have further adapted. It implements multiple ranking strategies in ranking/:
- Multi-Pass Ranking: Iterative listwise reranking with LLMs
- Iterative Self-Refinement: Multiple ranking iterations in which the model refines its own output by first generating feedback on the given ranking and then producing a new ranking based on this feedback.
- Critic-Enhanced Ranking: Dual-model approach with a separate critic model providing structured or unstructured feedback.
- Ground Truth-Guided Refinement: Critic model enhanced with relevance judgments for testing feedback potential.
Built on top of and extending the RankLLM library, the framework supports:
- Models: RankZephyr, RankVicuna, Qwen, Gemma
- Inference Backend: vLLM
- YAML Configuration: Unified configuration system for all experiments
- Config Modes: Pre-configured setups for different ranking strategies:
  - ranking/config/reranking/: Single-pass ranking configurations
  - ranking/config/self_refinement/: Iterative self-refinement setups
  - ranking/config/refinement_with_critic/: Dual-model critic-based configurations
  - zephyr/: Zephyr model configurations
- Reproducibility: Configuration hashing ensures experiment tracking
**Important**: To get the hash for a config, use get_config_hash.py; a conceptual sketch of the hashing idea follows this list. For more information on this function and examples, refer to README_CONFIG_HASH.md.
- Datasets: Amazon Shopping Queries (plus a subset of hard queries, called test-data), FutureQueryEval, and TREC DL19
- Query Difficulty Analysis: Tools for identifying and analyzing challenging queries in data-analysis/
- Distributed Inference: SageMaker_GPU integration for large-scale experiments
- S3 Integration: Automatic data upload/download and result synchronization
- Batch Processing: Submit and evaluate multiple experiments efficiently
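The idea behind the configuration hash can be sketched as follows. This is an illustrative assumption (SHA-256 over a canonical, key-sorted JSON dump of the parsed YAML), not the actual implementation of get_config_hash.py; see README_CONFIG_HASH.md for the real behavior.

```python
# Illustrative only: hash a canonical (key-sorted) JSON dump of the YAML config so
# that identical experiment settings always map to the same identifier.
# This is an assumption, not the actual get_config_hash.py implementation.
import hashlib
import json

import yaml  # PyYAML


def config_hash(config_path: str) -> str:
    with open(config_path) as f:
        config = yaml.safe_load(f)
    canonical = json.dumps(config, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


print(config_hash("ranking/config/reranking/dl19-default.yml"))
```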
Copy the example environment file and configure your settings:
cp .env.example .env
# Edit .env with your configuration

Environment Variables:
- RANKING_S3_BUCKET: Your S3 bucket name (optional, defaults to local storage)
- RANKING_S3_BASE_PATH: Base path within S3 bucket (default: "ranking/data")
- RANKING_LOCAL_DATA_PATH: Local data directory (default: "./data")
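These settings resolve roughly as in the sketch below, which assumes plain os.environ lookups; the project's actual loader (e.g., via python-dotenv) may differ.

```python
# A sketch of how the storage settings resolve, assuming plain environment lookups
# (the project's actual loader may differ, e.g. it may use python-dotenv).
import os

s3_bucket = os.environ.get("RANKING_S3_BUCKET")  # None => local storage only
s3_base_path = os.environ.get("RANKING_S3_BASE_PATH", "ranking/data")
local_data_path = os.environ.get("RANKING_LOCAL_DATA_PATH", "./data")

print(s3_bucket, s3_base_path, local_data_path)
```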
Pull and run the Docker image. Log in first:
aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin <AWS_ACCOUNT_ID>.dkr.ecr.us-east-1.amazonaws.com
If you also want to build the image, you additionally need to log in to the following account to pull the base image:
aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin 763104351884.dkr.ecr.us-east-1.amazonaws.com
Then build:
docker build -t <AWS_ACCOUNT_ID>.dkr.ecr.us-east-1.amazonaws.com/tczin/llm-rankers:ranking-sagemaker_gpu72 -f Dockerfile.ranking.sagemaker_gpu --push .
Then run the container:
docker run --gpus all --shm-size 4GB -v path/to/LLMRanker/src/RankingWithLLMs/:/workspace -it <AWS_ACCOUNT_ID>.dkr.ecr.us-east-1.amazonaws.com/tczin/llm-rankers:ranking-sagemaker_gpu72 bash
Then connect to the container from VSCode. If you are using sagemaker_gpu, make sure that the submit_to_sagemaker_gpu.py script uses the right docker image.
Recreate the main results table by launching jobs for all config files:
python batch_launch_sagemaker_gpu.py --models qwen gemma --dry-run
python ranking/cli.py --config ranking/config/reranking/dl19-default.yml
python ranking/cli.py --config ranking/config/self_refinement/dl19-default.yml
python ranking/cli.py --config ranking/config/refinement_with_critic/dl19_config.yml
python ranking/cli.py --config ranking/config/refinement_with_critic/dl19_config_with_groundtruth.yml

Example configuration file:
```yaml
# Model and ranking parameters
reranking:
  enabled: true
  model: "zephyr"  # or "qwen", "gemma", etc.
  batch_size: 4

# Iterative ranking settings
iterative:
  enabled: true
  iterations: 3
  dual_model: true  # Enable critic model
  critic_model: "gemma"
  convergence_threshold: 0.1

# Dataset configuration
data:
  dataset: "dl19"
  sample: false  # Use full dataset

# Evaluation
evaluation:
  enabled: true

# Prompt configuration
prompts:
  template: "llm_prompt_with_structured_feedback"
```

Input Documents
↓
[First Round Ranker] (e.g., RankZephyr) (Always uses rank_zephyr_template prompt)
↓
Initial Ranking
↓
[Iterative Refinement Loop]
├─→ [Critic Model] (optional)
│ ↓
│ Feedback
│ ↓
├─→ [Ranker Model]
│ ↓
├─→ Refined Ranking
↓
Final Ranking
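In code, the loop above corresponds roughly to the following sketch. The helper names (iterative_rerank, first_round_ranker, refine, critic) are hypothetical and do not reflect the actual rank_refiner.py API.

```python
# A rough sketch of the refinement loop shown above; helper names are hypothetical.
# The first-round ranker produces the initial ranking, then each iteration
# optionally consults a critic before the ranker refines its output.
from typing import Callable, List, Optional


def iterative_rerank(
    documents: List[str],
    first_round_ranker: Callable[[List[str]], List[str]],
    refine: Callable[[List[str], Optional[str]], List[str]],
    critic: Optional[Callable[[List[str]], str]] = None,
    iterations: int = 3,
) -> List[str]:
    ranking = first_round_ranker(documents)             # initial ranking
    for _ in range(iterations):
        feedback = critic(ranking) if critic else None  # optional critic feedback
        ranking = refine(ranking, feedback)             # ranker refines its output
    return ranking


# Dummy usage with placeholder callables:
final = iterative_rerank(
    ["d1", "d2", "d3"],
    first_round_ranker=lambda docs: sorted(docs),
    refine=lambda ranking, feedback: ranking,
    iterations=2,
)
print(final)
```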
- Core iterative ranking engine
- Support for single and dual-model modes (the latter uses a separate critic model)
- Optional ranking history tracking
- Dynamic prompt loading and template management
- Support for different prompts per iteration (see the sketch after this list)
- Parameterized prompt generation
- YAML-based experiment configuration
- Configuration validation and hashing
- Template inheritance and overrides
- Dataset loaders for multiple benchmarks
- Preprocessing pipelines
- S3 integration for data management
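The per-iteration prompt selection mentioned above can be illustrated with the sketch below. The per_iteration key and the resolve_prompt_template helper are hypothetical; the actual prompt_manager.py behavior may differ.

```python
# A minimal sketch of per-iteration prompt selection with fallback to the default
# template; the "per_iteration" key and helper are hypothetical, not the actual
# prompt_manager.py behavior.
from typing import Any, Dict


def resolve_prompt_template(config: Dict[str, Any], iteration: int) -> str:
    prompts_cfg = config.get("prompts", {})
    overrides = prompts_cfg.get("per_iteration", {})  # e.g. {1: "rank_zephyr_template"}
    return overrides.get(iteration, prompts_cfg.get("template", "rank_zephyr_template"))


config = {
    "prompts": {
        "template": "llm_prompt_with_structured_feedback",
        "per_iteration": {1: "rank_zephyr_template"},
    }
}
print(resolve_prompt_template(config, 1))  # rank_zephyr_template
print(resolve_prompt_template(config, 2))  # llm_prompt_with_structured_feedback
```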
For structured critic model feedback, use prompt: llm_prompt_with_structured_feedback and critic_prompt: structured_critic_analysis.
The critic model then provides a structured analysis that includes:
- Ranking Quality Assessment: Overall evaluation of current ranking
- Specific Issues: Identification of misranked documents
- Improvement Suggestions: Concrete recommendations for refinement
- Confidence Scores: Uncertainty quantification
Example critic output structure:
```json
{
  "overall_quality": "moderate",
  "specific_issues": [
    "Document [3] ranked too high given limited relevance",
    "Document [7] should rank higher due to comprehensive coverage"
  ],
  "suggestions": [
    "Reorder documents [3] and [7]",
    "Consider moving [5] higher for better topical alignment"
  ]
}
```

When ground truth relevance judgments are available:
- Critic receives relevance labels for each document
- Provides more accurate feedback for training analysis
- Enables evaluation of critic quality
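For illustration, relevance labels could be attached to the documents shown to the critic along the lines of the sketch below. The format_docs_for_critic helper is hypothetical, not the repository's prompt-building code.

```python
# A hypothetical sketch of attaching ground-truth relevance labels (qrels) to the
# documents shown to the critic; not the repository's actual prompt-building code.
from typing import Dict, List


def format_docs_for_critic(ranking: List[str], qrels: Dict[str, int]) -> str:
    lines = []
    for position, doc_id in enumerate(ranking, start=1):
        label = qrels.get(doc_id, 0)  # unjudged documents default to 0
        lines.append(f"[{position}] doc={doc_id} relevance={label}")
    return "\n".join(lines)


print(format_docs_for_critic(["d7", "d3", "d5"], {"d7": 3, "d3": 0}))
```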
Motivation: Prevent ranking oscillation by:
- Tracking last N iterations of rankings
- Including historical context in prompts
- Detecting convergence patterns
We found that including history harms performance.
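For completeness, the tracking mechanism itself can be sketched as follows. The HistoryTracker class is hypothetical and not the API of ranking_history_tracker.py.

```python
# A hypothetical sketch of tracking the last N rankings and rendering them as
# historical context for a prompt; not the actual ranking_history_tracker.py API.
from collections import deque
from typing import Deque, List


class HistoryTracker:
    def __init__(self, max_iterations: int = 3) -> None:
        self.history: Deque[List[str]] = deque(maxlen=max_iterations)

    def add(self, ranking: List[str]) -> None:
        self.history.append(list(ranking))

    def as_prompt_context(self) -> str:
        """Render the tracked rankings oldest-to-newest for inclusion in a prompt."""
        return "\n".join(
            f"Iteration {i + 1}: " + " > ".join(r) for i, r in enumerate(self.history)
        )


tracker = HistoryTracker(max_iterations=2)
tracker.add(["d1", "d2", "d3"])
tracker.add(["d2", "d1", "d3"])
print(tracker.as_prompt_context())
```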
For large-scale experiments:
# Submit batch of experiments to SageMaker_GPU
python batch_launch_sagemaker_gpu.py --config-dir ranking/config/refinement_with_critic --models qwen gemma
# Monitor and evaluate results
python batch_evaluate_sagemaker_gpu_runs.py --run-ids <id1> <id2> <id3>

The launch script submits all configs in the folder as jobs, creating configs for qwen and gemma on the fly. The evaluation script calculates the configs' hashes and retrieves all matching runs.
Tools for analyzing query complexity and difficulty:
Identifies query types based on linguistic features:
- Complex queries with multiple constraints
- Vague queries with subjective terms
- Comparative queries
- Queries with negations
- Well-formed sentence queries
python data-analysis/pre-retrieval-analysis/query_difficulty_lexical_analysis.py

Uses NLP and statistical features:
- Embedding similarity between queries and documents
- Entropy and diversity metrics
- TF-IDF analysis and KL divergence
- Clustering and visualization
python data-analysis/pre-retrieval-analysis/query_difficulty_statistical_analysis.py

Annotate queries with difficulty levels using LLMs:
python data-analysis/pre-retrieval-analysis/query_difficulty_llm_analysis.py

You can find the annotations in this S3 bucket.
- NDCG@k: Normalized Discounted Cumulative Gain
- Jaccard Distance: Measures the similarity between two rankings based on set overlap
- FRBO: Order-sensitive variation of Jaccard (for iteration analysis)
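For reference, the Jaccard distance between the top-k of two rankings can be computed as in the sketch below (illustrative only, not the repository's evaluation code).

```python
# A sketch of the Jaccard distance between the top-k of two rankings,
# 1 - |A ∩ B| / |A ∪ B|; not the repository's evaluation code.
from typing import List


def jaccard_distance(ranking_a: List[str], ranking_b: List[str], k: int = 10) -> float:
    top_a, top_b = set(ranking_a[:k]), set(ranking_b[:k])
    if not top_a and not top_b:
        return 0.0
    return 1.0 - len(top_a & top_b) / len(top_a | top_b)


print(jaccard_distance(["d1", "d2", "d3"], ["d2", "d3", "d4"], k=3))  # 0.5
```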
# Analyze ranking changes across iterations
python results-analysis/print_ndcg_per_iteration.py
# Analyze feedback quality impact
python results-analysis/analyze_feedback_quality_impact.py
# Generate formatted tables for publication
python results-analysis/generate_formatted_latex_table.py

See README_OUTPUTS_STRUCTURE.md for detailed information about output organization.
For data analysis and experimentation:
# Build the Docker image
docker build -f data-analysis-environment/Dockerfile -t data-exploration:latest .
# Start the environment
./data-analysis-environment/start_environment.sh
# Check status
./data-analysis-environment/check_docker_kernel.sh
# Stop when done
./data-analysis-environment/stop_docker_kernel.sh

- Iterative Ranking Guide - Detailed guide on iterative ranking
- Batch Launch Guide - Distributed inference on SageMaker_GPU
- Batch Evaluation Guide - Evaluating distributed runs
- Outputs Structure - Understanding output organization
- RankLLM Library - Core reranking library documentation
RankingWithLLMs/
├── ranking/ # Main ranking module
│ ├── cli.py # Configuration-based CLI
│ ├── ranking/ # Core ranking implementations
│ │ ├── rank_refiner.py # Iterative ranking engine
│ │ ├── prompt_manager.py # Prompt management
│ │ └── ranking_history_tracker.py
│ ├── config/ # Experiment configurations
│ │ ├── reranking/ # Single-pass configs
│ │ ├── self_refinement/ # Self-refinement configs
│ │ ├── refinement_with_critic/ # Critic configs
│ │ └── zephyr/ # Model-specific configs
│ ├── iterative_prompts/ # Prompt templates
│ ├── data_processing/ # Dataset loaders
│ └── evaluation/ # Evaluation metrics
├── rank_llm/ # Core reranking library (submodule)
│ └── src/rank_llm/
│ ├── rerank/ # Reranker implementations
│ └── data.py # Data structures
├── results-analysis/ # Analysis scripts
├── data-analysis/ # Query difficulty analysis
├── batch_launch_sagemaker_gpu.py # Run large-scale experiments
└── batch_evaluate_sagemaker_gpu_runs.py
This work is built on the RankLLM library and extends it with novel iterative refinement and critic-based feedback mechanisms.