Skip to content

vLLM OpenAI API server supports recording experience data#591

Draft
pan-x-c wants to merge 29 commits into
agentscope-ai:mainfrom
pan-x-c:feature/model_self_record_experience
Draft

vLLM OpenAI API server supports recording experience data#591
pan-x-c wants to merge 29 commits into
agentscope-ai:mainfrom
pan-x-c:feature/model_self_record_experience

Conversation

@pan-x-c

@pan-x-c pan-x-c commented Jun 24, 2026

Copy link
Copy Markdown
Collaborator

Description

As the title says

Checklist

Please check the following items before code is ready to be reviewed.

  • Code has passed all tests
  • Docstrings have been added/updated in Google Style
  • Documentation has been updated
  • Code is ready for review

@pan-x-c pan-x-c marked this pull request as draft June 24, 2026 12:26
pan-x-c and others added 28 commits June 25, 2026 11:15
…th legacy)

Refactor experience production so heavy data (tokens/logprobs/routed_experts)
no longer rides runner->scheduler->coordinator as serialized bytes. The vLLM
recorder now captures it in-process into a MemoryStore keyed by task_id, and
the coordinator pulls it at finalize time via /records/consume_task. Runners
ship only a small reward map. Both paths coexist behind
`explorer.use_recorded_experience` (default off = legacy).

Recording module (trinity/common/models/vllm_patch/recording/):
- store: drop SqlStore; MemoryStore.update_reward_by_task_id stamps
  reward/run/task on a whole task-id group, pops and returns it (the
  in-memory replacement for the SQL HistoryRecorder join).
- recorder: track in-flight record tasks; add flush() (await pending +
  queue.join) so a consume sees a quiesced store; honor skip_recording_ctx.
- models: build_experience emits one Experience per completion (n>1) with
  info["sample_index"]; eid.suffix=request_id kept for traceability.
- context: add skip_recording_ctx; task_id already flows via api_key
  (RecordingIdentityMiddleware) and now also via VLLMModel.chat (Ray entry).
- query: POST /records/consume_task (flush -> update_reward_by_task_id ->
  serialize_many); drop the SqlStore 503 branch.
- config/server: remove RecordingConfig entirely; the logprob width is a
  recorder-internal constant (we store only the chosen token, which vLLM
  force-includes at logprobs=1). No static config threaded through launch.

task_id propagation (Ray entry, same contextvar as the HTTP middleware):
- vllm_model: chat/generate accept task_id_key, set task_id_ctx around
  _generate_internal; logprobs sets skip_recording_ctx (auxiliary forward).
- model: ModelWrapper.chat/chat_async forward task_id_key; SGLang.chat
  accepts-and-ignores it (recording is vLLM-only).

Coordinator + runner + workflow:
- rollout_coordinator: _resolve_rank_urls (ray.get_actor per engine) and a
  recording-mode finalize that fans out /records/consume_task per engine,
  deserializes, and feeds objects to the pipeline (no re-serialization).
- experience_pipeline: process_experiences(exps) public object entry.
- workflow_runner: recording mode returns a pickled reward map keyed by the
  per-sample task_id_key the workflow stamped; legacy path unchanged.
- workflow: SimpleWorkflow/AsyncSimpleWorkflow run a per-sample n=1 loop in
  recording mode (distinct task_id_key per sample == reward unit for GRPO),
  legacy n=repeat_times single-call path unchanged.
- config: ExplorerConfig.use_recorded_experience flag.

SQL path removal (MemoryStore only):
- delete proxy/recorder.py (HistoryRecorder) and proxy_test.py; proxy
  service/app drop /feedback, /commit, record_feedback, submit_experiences,
  ready_experiences (keep allocate_model + weight sync); allocator no longer
  fills record_db_url; drop InferenceModelConfig.record_db_url and the dead
  ExplorerConfig.db_url field; RecordingConfig deleted.

Serve-mode external reward reporting is intentionally left unimplemented this
version (proxy /feedback//commit removed); the affected serve integration
tests (TestServeWithTrainer, ServeTest) are skipped with a pointer to the
recording refactor plan. convert_messages_to_experience redirect (multi-turn)
is deferred with TODOs at its call sites.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…-x-c/Trinity-RFT into feature/model_self_record_experience
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant