
Rename submission modes: benchmark→private, leaderboard→public#413

Open
msaroufim wants to merge 6 commits into main from rename-benchmark-leaderboard-to-private-public

Conversation

@msaroufim msaroufim commented Feb 2, 2026

This renames the user-facing submission modes for clarity:

  • BENCHMARK → PRIVATE (run benchmarks without affecting leaderboard ranking)
  • LEADERBOARD → PUBLIC (official submission to the public leaderboard)

Also adds SECRET mode for internal secret validation runs.

Updates Discord commands: /benchmark → /private, /ranked → /public

Review with gpu-mode/popcorn-cli#33 and gpu-mode/reference-kernels#100
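
The renamed modes can be pictured as a small enum. This is a hypothetical sketch based on the PR description above, not the actual contents of src/libkernelbot/consts.py; the real string values and comments may differ:

```python
from enum import Enum

class SubmissionMode(Enum):
    """Sketch of the renamed submission modes (hypothetical)."""
    TEST = "test"        # run correctness tests only
    PRIVATE = "private"  # formerly BENCHMARK: benchmarks without leaderboard ranking
    PROFILE = "profile"  # profiling runs
    PUBLIC = "public"    # formerly LEADERBOARD: official ranked submission
    SECRET = "secret"    # new: internal validation runs with a secret seed
```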

Copilot AI review requested due to automatic review settings February 2, 2026 01:59

Copilot AI left a comment


Pull request overview

This PR renames user-facing submission modes from BENCHMARK/LEADERBOARD to PRIVATE/PUBLIC, introduces a new SECRET mode for internal validation, and propagates these changes through the backend, launch pipeline, reports, Discord commands, and tests.

Changes:

  • Update SubmissionMode enum and evaluation pipeline to use TEST, PRIVATE, PROFILE, PUBLIC, and SECRET, and adjust timeouts and run orchestration accordingly.
  • Adapt reporting, backend scoring/DB persistence, GitHub/Modal launchers, API validation, and Discord cogs to the new mode names and semantics (PRIVATE = non‑ranked benchmarks, PUBLIC/SECRET = ranked).
  • Update tests to use the new mode names and updated report strings, plus new run keys like "private" and "public" instead of "benchmark" and "leaderboard".
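
The ranked/non-ranked split described above amounts to a two-way classification of run keys. A minimal sketch, assuming modes are carried as their lowercase string values (the real pipeline code may structure this differently):

```python
# Run keys as they appear in a result dict, per the summary above.
# PRIVATE produces benchmark timings but never a rank;
# PUBLIC and SECRET both count as ranked runs.
RANKED_KEYS = {"public", "secret"}

def is_ranked(run_key: str) -> bool:
    return run_key in RANKED_KEYS
```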

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 3 comments.

Per-file summary:

  • tests/test_task.py: Swaps SubmissionMode.BENCHMARK for SubmissionMode.PRIVATE in task-config tests to reflect the new non-ranked benchmark mode.
  • tests/test_modal.py: Updates Modal launcher tests to use SubmissionMode.PRIVATE/PUBLIC and to expect "private" run keys and “ranked submission” terminology.
  • tests/test_github.py: Adjusts GitHub launcher tests to use SubmissionMode.PUBLIC and to assert against "private" run keys in results.
  • tests/test_backend.py: Switches backend tests to PUBLIC/SECRET modes and the new short-report wording; still expects "leaderboard" in configs/DB runs, which no longer matches the implementation.
  • src/libkernelbot/submission.py: Extends compute_score to accept a mode_key so scores can be computed from "public" or "secret" runs instead of the hard-coded "leaderboard" key.
  • src/libkernelbot/run_eval.py: Reworks evaluation modes so PRIVATE replaces the bare benchmark mode and PUBLIC/SECRET replace leaderboard, updating how tests, benchmarks, ranked runs, and timeouts are orchestrated.
  • src/libkernelbot/report.py: Changes reports to treat "private" as the benchmark run key and "public"/"secret" as ranked runs, and updates all user-facing messages from “Leaderboard” to “Ranked submission”.
  • src/libkernelbot/launchers/github.py: Maps GitHub timeouts to the new enum values, using the benchmark timeout for PRIVATE and the ranked timeout for PUBLIC.
  • src/libkernelbot/consts.py: Redefines SubmissionMode to TEST, PRIVATE, PROFILE, PUBLIC, and SECRET and updates the docstring to document the new semantics.
  • src/libkernelbot/backend.py: Updates submit_full, submit_leaderboard, and handle_submission to drive PUBLIC and SECRET runs correctly, compute scores from the appropriate run key, and adjust short reports and secrecy behavior; currently also (incorrectly) ranks PRIVATE runs.
  • src/kernelbot/cogs/verify_run_cog.py: Adjusts verification slash commands to use PRIVATE/PUBLIC and to default to PUBLIC when no mode is specified.
  • src/kernelbot/cogs/leaderboard_cog.py: Renames the Discord commands /benchmark to /private and /ranked to /public and wires them to the new submission modes.
  • src/kernelbot/api/api_utils.py: Restricts API-allowed submission modes to TEST, PRIVATE, PROFILE, and PUBLIC, matching the new public interface.
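
The compute_score change in src/libkernelbot/submission.py can be illustrated with a sketch. The function name and the mode_key parameter come from the summary above; the result structure and the mean_time_ms field are assumptions for illustration only:

```python
def compute_score(result: dict, mode_key: str = "public"):
    """Compute a score from the run stored under mode_key.

    Previously the run key was hard-coded to "leaderboard"; now the
    caller passes "public" or "secret". Returns None when no such run
    exists (e.g. for a PRIVATE-only result).
    """
    run = result.get(mode_key)
    if run is None:
        return None
    return run["mean_time_ms"]  # hypothetical scoring field
```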


Update test data keys and expected values:
- test_report.py: Change "benchmark"/"leaderboard" keys to "private"/"public"
- test_submission.py: Update compute_score test to use "public" key
- test_backend.py: Update mode values and mock data keys
- Add 'secret' key to mock launcher runs so SECRET mode can find its result
- Fix second run's expected mode from 'public' to 'secret'

Set GITHUB_BRANCH env var to use the PR's source branch instead of
falling back to main. Uses github.head_ref for PRs, github.ref_name
for direct pushes.

Use side_effect to return different FullResult for each call:
- First call (PUBLIC mode) returns {"public": eval_result}
- Second call (SECRET mode) returns {"secret": eval_result}

This prevents the backend from storing all keys from both calls.
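
The side_effect technique from the commit message above can be sketched with unittest.mock. The launcher object and its run_submission method name are hypothetical stand-ins for the real mock setup in the tests:

```python
from unittest.mock import MagicMock

eval_result = object()  # stand-in for a real evaluation result

launcher = MagicMock()
# Each call consumes the next item from side_effect, so the PUBLIC-mode
# call sees only {"public": ...} and the SECRET-mode call sees only
# {"secret": ...}, instead of both calls sharing one return value.
launcher.run_submission.side_effect = [
    {"public": eval_result},
    {"secret": eval_result},
]

first = launcher.run_submission()
second = launcher.run_submission()
```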
github-actions bot commented Feb 2, 2026

Coverage report

[Per-file coverage table for src/libkernelbot/backend.py, consts.py, report.py, submission.py, and utils.py]
This report was generated by python-coverage-comment-action

PRIVATE mode runs return timing info but should not affect leaderboard
ranking. This change restricts score computation to only PUBLIC and
SECRET modes, ensuring PRIVATE submissions have score=None as intended.
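
The fix described above amounts to a guard before score computation. A minimal sketch, assuming modes are carried as their string values and scores live in a per-mode runs dict (hypothetical names):

```python
def score_for(mode: str, runs: dict):
    """Only ranked modes produce a score; PRIVATE stays at None by design."""
    if mode not in ("public", "secret"):
        return None  # PRIVATE/TEST/PROFILE: timing info only, no rank
    run = runs.get(mode)
    return run.get("score") if run else None
```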
@msaroufim

@claude are you working

@msaroufim

@claude wake up

@msaroufim

@codex wake up


Copilot AI left a comment


Pull request overview

Copilot reviewed 16 out of 16 changed files in this pull request and generated 2 comments.



-SubmissionMode.BENCHMARK.value: config.get("benchmark_timeout"),
-SubmissionMode.LEADERBOARD.value: config.get("ranked_timeout"),
+SubmissionMode.PRIVATE.value: config.get("benchmark_timeout"),
+SubmissionMode.PUBLIC.value: config.get("ranked_timeout"),
Copilot AI Feb 8, 2026


get_timeout() maps test/private/public but not the new secret mode. When config['mode'] == 'secret', this falls back to DEFAULT_GITHUB_TIMEOUT_MINUTES, which can cause secret validation runs to time out (or use an unintended timeout). Add an explicit mapping for SubmissionMode.SECRET (likely to ranked_timeout, same as PUBLIC).

Suggested change
-SubmissionMode.PUBLIC.value: config.get("ranked_timeout"),
+SubmissionMode.PUBLIC.value: config.get("ranked_timeout"),
+SubmissionMode.SECRET.value: config.get("ranked_timeout"),

Copilot uses AI. Check for mistakes.
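
Applying the suggestion, the completed mapping might look like the sketch below. The config keys and enum string values are taken from the review comment; the function signature, the test_timeout key, and the default value are assumptions:

```python
DEFAULT_GITHUB_TIMEOUT_MINUTES = 10  # assumed default

def get_timeout(config: dict) -> int:
    timeouts = {
        "test": config.get("test_timeout"),
        "private": config.get("benchmark_timeout"),
        "public": config.get("ranked_timeout"),
        # The fix: map SECRET explicitly so secret runs never silently
        # fall back to the default timeout.
        "secret": config.get("ranked_timeout"),
    }
    value = timeouts.get(config["mode"])
    return value if value is not None else DEFAULT_GITHUB_TIMEOUT_MINUTES
```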
Comment on lines +89 to +92
Public: Official submission to the leaderboard. This first runs public tests, then a
repeated invocation of a single benchmark. If all tests pass, the submission is evaluated
and ranked on the public leaderboard.
Secret: Internal mode for running the full evaluation flow with a secret seed. This is used
Copilot AI Feb 8, 2026


The SubmissionMode docstring says PUBLIC runs are a “repeated invocation of a single benchmark”, but the evaluation code selects either the last benchmark or all benchmarks depending on ranking_by (e.g., MEAN/GEOM use multiple benchmarks). Update the docstring to reflect that PUBLIC/SECRET may run multiple benchmarks depending on the ranking criterion, to avoid misleading users/operators.

Suggested change
-Public: Official submission to the leaderboard. This first runs public tests, then a
-repeated invocation of a single benchmark. If all tests pass, the submission is evaluated
-and ranked on the public leaderboard.
-Secret: Internal mode for running the full evaluation flow with a secret seed. This is used
+Public: Official submission to the leaderboard. This first runs public tests, then runs the
+benchmark suite according to the selected ranking criterion (for example, using only the
+last benchmark or aggregating over all benchmarks). If all tests pass, the submission is
+evaluated and ranked on the public leaderboard.
+Secret: Internal mode for running the full evaluation flow with a secret seed. This mirrors
+the public evaluation (including the benchmark selection/aggregation strategy) and is used

@msaroufim

@claude any feedback

@claude
claude bot commented Feb 8, 2026

Claude encountered an error — View job

I'll analyze this and get back to you.

@msaroufim

@claude can you please review habibi

@claude
claude bot commented Feb 8, 2026

Claude encountered an error — View job

I'll analyze this and get back to you.

@msaroufim

@claude one more time

@claude
claude bot commented Feb 8, 2026

Claude encountered an error — View job

I'll analyze this and get back to you.

@msaroufim

@claude you ready???

@claude
claude bot commented Feb 8, 2026

Claude encountered an error — View job

I'll analyze this and get back to you.

@msaroufim

@claude hello

@claude
claude bot commented Feb 8, 2026

Claude encountered an error — View job

I'll analyze this and get back to you.
