Released model performance lower than reported in the paper #91

I evaluated the released Hugging Face checkpoint `agentrl/ReSearch-Qwen-7B-Instruct`, but my results are much lower than those reported in the paper. My results:
Bamboogle: {'em': 0.16, 'f1': 0.20129523809523808, 'acc': 0.176, 'precision': 0.21292307692307694, 'recall': 0.19866666666666669}

Table 2 of the ReSearch paper reports:
Bamboogle (test set) EM: 42.40
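
To make the gap explicit, here is a small sketch that puts both numbers on the same scale. It assumes the metrics printed by `run_eval.py` are fractions in [0, 1] and the paper's table uses percentages; that reading is my assumption, not something the README states. The helper name `to_percent` is only for illustration.

```python
# Metrics copied from my Bamboogle run (assumed to be fractions in [0, 1]).
my_metrics = {
    "em": 0.16,
    "f1": 0.20129523809523808,
    "acc": 0.176,
    "precision": 0.21292307692307694,
    "recall": 0.19866666666666669,
}

# EM from Table 2 of the paper (assumed to be a percentage).
paper_em_percent = 42.40


def to_percent(fraction: float) -> float:
    """Convert a fraction in [0, 1] to a percentage rounded to 2 decimals."""
    return round(fraction * 100, 2)


my_em_percent = to_percent(my_metrics["em"])
gap = round(paper_em_percent - my_em_percent, 2)

print(f"my EM: {my_em_percent}, paper EM: {paper_em_percent}, gap: {gap} points")
```

Under that assumption, my EM is well under half of the reported value, so this does not look like run-to-run noise.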

Here is the command I ran (following the guidance in the README):
```
python run_eval.py \
    --config_path eval_config.yaml \
    --method_name re-call \
    --data_dir /home/a14-hliu/hl542/ReCall/data/ \
    --dataset_name bamboogle \
    --split test \
    --save_dir /home/a14-hliu/hl542/ReCall/eval_results/re-call_qwen3-7b-instruct \
    --save_note re-call_qwen3-7b_ins \
    --sgl_remote_url http://127.0.0.1:8083 \
    --remote_retriever_url http://127.0.0.1:8082 \
    --generator_model /home/a14-hliu/.cache/huggingface/hub/models--agentrl--ReSearch-Qwen-7B-Instruct/snapshots/f0787566dce64b1363746137aca5dd432ac48b9e \
    --sandbox_url http://127.0.0.1:8081
```
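
One thing that can be ruled out before looking at the model itself is an unreachable service. The sketch below only checks that something accepts TCP connections on the three local ports from the command above; it assumes nothing about the HTTP routes each service exposes, and it does not verify that the retriever returns sensible passages. The function name `is_port_open` and the `SERVICES` mapping are hypothetical, for illustration only.

```python
import socket

# Host/port pairs taken from the flags in the command above.
SERVICES = {
    "--sandbox_url": ("127.0.0.1", 8081),
    "--remote_retriever_url": ("127.0.0.1", 8082),
    "--sgl_remote_url": ("127.0.0.1", 8083),
}


def is_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    for flag, (host, port) in SERVICES.items():
        status = "reachable" if is_port_open(host, port) else "NOT reachable"
        print(f"{flag} -> {host}:{port} is {status}")
```

A port that is open but backed by the wrong index or corpus would still pass this check, so it only narrows the problem down.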
For retrieval, I used the materials listed in the README:

[E5-base-v2](https://huggingface.co/intfloat/e5-base-v2) 
[wiki18_100w_e5_index.zip](https://www.modelscope.cn/datasets/hhjinjiajie/FlashRAG_Dataset/file/view/master?id=47985&status=2&fileName=retrieval_corpus%252Fwiki18_100w_e5_index.zip) 
[wiki18_100w.zip](https://huggingface.co/datasets/RUC-NLPIR/FlashRAG_datasets/blob/main/retrieval-corpus/wiki18_100w.zip) 

Could you please advise what I might be missing? Thank you.


