feat(monitor): unify explorer/trainer wandb logging into one run #590

Open
MengsD wants to merge 1 commit into agentscope-ai:main from MengsD:feat/unified-wandb-run


@MengsD MengsD commented Jun 24, 2026

Collaborator

When monitor.monitor_type is wandb and monitor.shared_run is True, explorer and trainer log their metrics to the same run; each role's metrics are prefixed with {role}/ and plotted against that role's own {role}/step axis.

Description


Background

Closes #556

Why this design

A wandb run has a single global _step — a per-run counter assigned in arrival order. With two roles writing one shared run, their rows interleave, so each role lands on a gapped _step (e.g. explorer 0,2,4…, trainer 1,3,5…) and can't show a clean 1..N. The only fix is to give each metric its own x-axis via define_metric(key, step_metric="{role}/step").
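The gapped _step can be made concrete with a toy simulation in plain Python. It does not call wandb; the `interleave` helper and the strictly alternating arrival order are illustrative assumptions. It assigns a run-wide counter in arrival order and, next to it, the per-role counter that a {role}/step axis would carry.

```python
from collections import defaultdict
from itertools import count


def interleave(arrivals):
    """Assign a run-wide step in arrival order plus a per-role step.

    `arrivals` lists role names in the order their rows reach the run.
    This only models the behaviour described above; it is not wandb code.
    """
    global_step = count(0)                          # shared by every writer
    role_step = defaultdict(lambda: count(1))       # one counter per role
    rows = []
    for role in arrivals:
        rows.append({
            "role": role,
            "_step": next(global_step),
            f"{role}/step": next(role_step[role]),
        })
    return rows


rows = interleave(["explorer", "trainer"] * 3)
explorer_global = [r["_step"] for r in rows if r["role"] == "explorer"]
trainer_global = [r["_step"] for r in rows if r["role"] == "trainer"]
explorer_own = [r["explorer/step"] for r in rows if r["role"] == "explorer"]

print(explorer_global)  # [0, 2, 4]
print(trainer_global)   # [1, 3, 5]
print(explorer_own)     # [1, 2, 3]
```

In a real run the arrival order is not strictly alternating, so the gaps on _step are irregular, while the per-role counter stays contiguous.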

Three constraints then shaped the implementation:

  • No glob. wandb's define_metric glob doesn't cross / (issue #9549), and a bare * can't map the two roles to different axes — so bindings must be declared per full key, and keys appear dynamically during the run.
  • Multi-process. explorer / trainer / launcher are separate processes, so writing one run requires wandb shared mode (one primary = the launcher, secondaries = the actors).
  • Config is overwritten, not merged. In shared mode each writer's config sync replaces the run's metric defs (last writer wins). So a single declarer is unstable: if only the primary declares, the secondaries clobber it (the live axis falls back to the gapped _step); if each secondary declares only its own role, the two secondaries clobber each other.
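The third constraint is the least intuitive, so here is a toy model of it in plain Python. `SharedRunModel`, `axis_for`, and the metric names explorer/reward and trainer/loss are hypothetical; the model only encodes the "last writer wins" behaviour described above and says nothing about how wandb implements it.

```python
class SharedRunModel:
    """Toy shared run whose metric definitions are replaced, not merged."""

    def __init__(self):
        self.metric_defs = {}

    def sync(self, writer_defs):
        # Replace semantics: the last writer's definitions become the run's.
        self.metric_defs = dict(writer_defs)


def axis_for(run, key):
    # A key without a binding falls back to the global _step axis.
    return run.metric_defs.get(key, "_step")


explorer_only = {"explorer/reward": "explorer/step"}
trainer_only = {"trainer/loss": "trainer/step"}
full_set = {**explorer_only, **trainer_only}

# Each secondary declares only its own role: the later sync drops the earlier one.
run = SharedRunModel()
run.sync(explorer_only)
run.sync(trainer_only)
partial_result = axis_for(run, "explorer/reward")

# Every writer carries the full set: sync order no longer matters.
run = SharedRunModel()
run.sync(full_set)
run.sync(full_set)
full_result = axis_for(run, "explorer/reward")

print(partial_result)  # _step
print(full_result)     # explorer/step
```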

Approach: make every writer carry the full binding set, so no sync can drop anything.

  • Secondaries declare the full set on each flush (their own keys + all roles' keys read from shared binding files) → axis is correct during the run.
  • The launcher (primary) declares the full set once before finalize → axis is correct after the run (the primary's config wins at finalize).
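A minimal sketch of how a writer might assemble and declare the full binding set. The file layout (one JSON list of keys per role), the helper names, and the metric names are assumptions for illustration, not the PR's actual code. `run` is any object with a `define_metric(name, step_metric=...)` method, which is the shape of a wandb run; a small recorder stands in for it here so the example runs without wandb.

```python
import json
import tempfile
from pathlib import Path


def publish_bindings(binding_dir, role, keys):
    """Write this role's metric keys to <binding_dir>/<role>.json (assumed layout)."""
    binding_dir.mkdir(parents=True, exist_ok=True)
    tmp = binding_dir / f"{role}.json.tmp"
    tmp.write_text(json.dumps(sorted(keys)))
    tmp.replace(binding_dir / f"{role}.json")  # rename over the final name


def load_full_bindings(binding_dir):
    """Read every role's file and map each full key to that role's step axis."""
    bindings = {}
    for path in sorted(binding_dir.glob("*.json")):
        role = path.stem
        for key in json.loads(path.read_text()):
            bindings[key] = f"{role}/step"
    return bindings


def declare_all(run, bindings):
    """Declare each step axis, then bind every full key to its axis (no globs)."""
    for step_key in sorted(set(bindings.values())):
        run.define_metric(step_key)
    for key, step_key in sorted(bindings.items()):
        run.define_metric(key, step_metric=step_key)


class RecordingRun:
    """Stand-in for a wandb run that only records define_metric calls."""

    def __init__(self):
        self.calls = []

    def define_metric(self, name, step_metric=None):
        self.calls.append((name, step_metric))


with tempfile.TemporaryDirectory() as tmp_dir:
    binding_dir = Path(tmp_dir) / "monitor"
    publish_bindings(binding_dir, "explorer", ["explorer/reward"])
    publish_bindings(binding_dir, "trainer", ["trainer/loss"])
    bindings = load_full_bindings(binding_dir)

run = RecordingRun()
declare_all(run, bindings)
print(bindings)
```

Calling `declare_all` on every flush is redundant by design: under the overwrite behaviour described above, each writer's sync then re-asserts all bindings rather than only its own.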

Changes made

  • Add monitor.shared_run (opt-in, default False) and an internal monitor.run_id to MonitorConfig.
  • The launcher creates one primary shared run and injects run_id; explorer/trainer join it as secondary writers (wandb shared mode).
  • Each role's metrics are prefixed {role}/ and bound (via define_metric) to their own {role}/step axis, so explorer and trainer each show a clean 1..N x-axis within the same run — both during the run and after it finishes.
  • The config validator downgrades shared_run to False for backends that don't support it (only wandb does); shared_run=False keeps the original behavior (separate runs) byte-for-byte.
  • Note: the per-role step bindings are exchanged via files under checkpoint_job_dir/monitor/, so multi-node runs require that dir on shared storage (e.g., NAS).
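The validator downgrade can be sketched as follows. This is a hypothetical reduction, not the repository's MonitorConfig: the field defaults other than shared_run=False, the `SHARED_RUN_BACKENDS` set, and the `validate_monitor` name are assumptions made for the example.

```python
import logging
from dataclasses import dataclass
from typing import Optional

logger = logging.getLogger(__name__)

# Assumption: only the wandb backend supports a shared run.
SHARED_RUN_BACKENDS = {"wandb"}


@dataclass
class MonitorConfig:
    monitor_type: str = "tensorboard"  # assumed default, for illustration only
    shared_run: bool = False           # opt-in
    run_id: Optional[str] = None       # internal: injected by the launcher


def validate_monitor(cfg: MonitorConfig) -> MonitorConfig:
    """Downgrade shared_run when the backend cannot share a run."""
    if cfg.shared_run and cfg.monitor_type not in SHARED_RUN_BACKENDS:
        logger.warning(
            "shared_run is not supported by monitor_type=%s; falling back to False",
            cfg.monitor_type,
        )
        cfg.shared_run = False
    return cfg


wandb_cfg = validate_monitor(MonitorConfig(monitor_type="wandb", shared_run=True))
other_cfg = validate_monitor(MonitorConfig(monitor_type="tensorboard", shared_run=True))
print(wandb_cfg.shared_run, other_cfg.shared_run)  # True False
```

Downgrading with a warning rather than raising keeps existing configs for other backends working when the flag is set by mistake.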

How to test

  1. In an example config (e.g. examples/gsm8k-quick.yaml) set:
    monitor:
      monitor_type: wandb
      shared_run: true
  2. Run a job in both mode (explorer and trainer together): trinity run --config examples/gsm8k-quick.yaml.
  3. Open the wandb run and confirm: explorer and trainer metrics live in one run; each role's panels use its own {role}/step axis showing 1..N (verified both mid-run and after finish).
  4. With shared_run: false (default) or a non-wandb backend, behavior is unchanged (two separate runs).

Checklist

Please check the following items before code is ready to be reviewed.

  • Code has passed all tests
  • Docstrings have been added/updated in Google Style
  • Documentation has been updated
  • Code is ready for review


MengsD commented Jun 24, 2026

Collaborator Author

/unittest-diff

@github-actions


unittest: Run #1780

| Tests 📝 | Passed ✅ | Failed ❌ | Skipped ⏭️ | Pending ⏳ | Other ❓ | Flaky 🍂 | Duration ⏱️ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 203 | 201 | 0 | 2 | 0 | 0 | 0 | 39m 46s |

🎉 All tests passed!



Development

Successfully merging this pull request may close these issues.

[monitor] Unify explorer and trainer wandb logging into a single run
