Released model performance lower than reported in the paper #91

@Hyfred

I used the released Hugging Face checkpoint agentrl/ReSearch-Qwen-7B-Instruct, but my evaluation results are much lower than those reported in the paper. My results:
Bamboogle: {'em': 0.16, 'f1': 0.20129523809523808, 'acc': 0.176, 'precision': 0.21292307692307694, 'recall': 0.19866666666666669}

Table 2 of the ReSearch paper reports:
Bamboogle (testset) EM: 42.40
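
To compare the two on the same scale: the metrics printed by the evaluation run are fractions in [0, 1], while the paper's figure appears to be a percentage. A minimal sketch of the conversion, assuming the 42.40 in Table 2 is a percentage (the variable and function names here are illustrative, not part of the repository):

```python
# Metrics as printed by the evaluation run above (fractions in [0, 1]).
my_results = {
    "em": 0.16,
    "f1": 0.20129523809523808,
    "acc": 0.176,
    "precision": 0.21292307692307694,
    "recall": 0.19866666666666669,
}

# Bamboogle EM from Table 2 of the paper; ASSUMED to be a percentage.
paper_em_percent = 42.40


def to_percent(metrics):
    """Scale fractional metrics to percentages, rounded to two decimals."""
    return {name: round(value * 100, 2) for name, value in metrics.items()}


my_percent = to_percent(my_results)
em_gap = round(paper_em_percent - my_percent["em"], 2)

print(my_percent)
print(f"EM gap vs. paper: {em_gap} points")
```

Under that assumption, my EM is 16 versus 42.40 in the paper, so the gap is real and not just a difference in units.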

Here is the command I ran (following the guidance in the README):

python run_eval.py \
    --config_path eval_config.yaml \
    --method_name re-call \
    --data_dir /home/a14-hliu/hl542/ReCall/data/ \
    --dataset_name bamboogle \
    --split test \
    --save_dir /home/a14-hliu/hl542/ReCall/eval_results/re-call_qwen3-7b-instruct \
    --save_note re-call_qwen3-7b_ins \
    --sgl_remote_url http://127.0.0.1:8083 \
    --remote_retriever_url http://127.0.0.1:8082 \
    --generator_model /home/a14-hliu/.cache/huggingface/hub/models--agentrl--ReSearch-Qwen-7B-Instruct/snapshots/f0787566dce64b1363746137aca5dd432ac48b9e \
    --sandbox_url http://127.0.0.1:8081
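
As a basic sanity check before evaluating, the three local services referenced in the command can be probed to confirm that something is listening on each port. The sketch below uses only the Python standard library and the URLs from the command above. It checks TCP reachability only; it does not verify that each service answers requests correctly, and the helper names are hypothetical rather than part of the repository.

```python
import socket
from urllib.parse import urlparse

# URLs copied from the run_eval.py command above.
SERVICE_URLS = {
    "sgl_remote_url": "http://127.0.0.1:8083",
    "remote_retriever_url": "http://127.0.0.1:8082",
    "sandbox_url": "http://127.0.0.1:8081",
}


def parse_host_port(url):
    """Extract (host, port) from a URL such as http://127.0.0.1:8083."""
    parsed = urlparse(url)
    return parsed.hostname, parsed.port


def is_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    for name, url in SERVICE_URLS.items():
        host, port = parse_host_port(url)
        status = "reachable" if is_reachable(host, port) else "NOT reachable"
        print(f"{name} ({url}): {status}")
```

If any of the three reports as not reachable, the corresponding server (generator, retriever, or sandbox) was likely not running during the evaluation.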

I also followed the README instructions for retrieval materials:

E5-base-v2
wiki18_100w_e5_index.zip
wiki18_100w.zip

Could you please advise what I might be missing? Thank you.
