Support configurability of FIM tokens on SantaCoder #313

@Jay-Roberts

Currently the SantaCoder and StarCoder FIM tasks have hard-coded FIM tokens; however, other models may use different FIM tokens. For example, to use Qwen2.5-Coder models one might have to create a dedicated subclass:

class SantaCoderQwen25CoderFIM(SantaCoderFIM):
    DATASET_PATH = "bigcode/santacoder-fim-task"

    def __init__(self):
        fim_prefix = "<|fim_prefix|>"
        fim_middle = "<|fim_middle|>"
        fim_suffix = "<|fim_suffix|>"
        stop_words = ["<|endoftext|>", "<|filename|>"]
        super().__init__(
            stop_words=stop_words,
            requires_execution=False,
            fim_prefix=fim_prefix,
            fim_middle=fim_middle,
            fim_suffix=fim_suffix,
        )

Allowing the FIM parameters to be passed on the command line (e.g. --fim_tokens and --stop_words), similarly to --instruction_tokens for HumanEval, would let this task remain a single class and support future FIM models without a new subclass for each one.
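
To make the proposal concrete, here is a minimal sketch of how the two flags might be parsed. It is an illustration under assumptions, not the harness's actual code: `--fim_tokens` and `--stop_words` are the hypothetical flags proposed above, the comma-separated `prefix,middle,suffix` ordering is an assumed convention modelled on `--instruction_tokens`, and `build_fim_kwargs` and `parse_args` are made-up helper names.

```python
import argparse
from typing import Dict, List, Optional, Union


def build_fim_kwargs(
    fim_tokens: Optional[str], stop_words: Optional[str]
) -> Dict[str, Union[str, List[str]]]:
    """Turn comma-separated CLI strings into task constructor keyword arguments.

    Options left as None are omitted, so the task's own defaults still apply.
    """
    kwargs: Dict[str, Union[str, List[str]]] = {}
    if fim_tokens is not None:
        parts = [tok.strip() for tok in fim_tokens.split(",")]
        # Assumed convention: exactly three tokens, in prefix,middle,suffix order.
        if len(parts) != 3 or not all(parts):
            raise ValueError(
                "--fim_tokens expects exactly three comma-separated tokens: "
                "prefix,middle,suffix"
            )
        kwargs["fim_prefix"], kwargs["fim_middle"], kwargs["fim_suffix"] = parts
    if stop_words is not None:
        # Drop empty entries such as those produced by a trailing comma.
        kwargs["stop_words"] = [w.strip() for w in stop_words.split(",") if w.strip()]
    return kwargs


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    parser = argparse.ArgumentParser()
    # Both flags are hypothetical; they are the ones proposed in this issue.
    parser.add_argument("--fim_tokens", type=str, default=None)
    parser.add_argument("--stop_words", type=str, default=None)
    return parser.parse_args(argv)
```

With something like this in place, the Qwen2.5-Coder case would reduce to passing `--fim_tokens "<|fim_prefix|>,<|fim_middle|>,<|fim_suffix|>" --stop_words "<|endoftext|>,<|filename|>"` and forwarding the resulting dictionary to the task constructor, assuming it accepts the same keyword arguments used in the subclass above.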
