vMCP terminates MCP session on every legitimate access-token refresh



## Summary

The `hijackPreventionDecorator` in `pkg/vmcp/session/internal/security/security.go` binds an MCP session to an HMAC hash of the **raw bearer-token bytes** at session creation, and rejects any subsequent request whose token bytes don't hash to the same value. This treats every legitimate OAuth2 refresh-token grant (RFC 6749 §6) as a session hijack attempt — the new AT has different bytes, the hash mismatches, the session is terminated with `ErrUnauthorizedCaller` ("caller identity does not match session owner").

The MCP streamable-HTTP transport terminates sessions via HTTP 404, never via token rotation, and requires `Authorization` on every request "even if they are part of the same logical session." Conformant clients keep the same `Mcp-Session-Id` across token refreshes — and trigger this bug every time their AT rotates.

## Reproduction

Reproduced end-to-end on a kind cluster with the in-repo reproducer at `deploy/session-debug/`. Key config in `03-vmcp-debug.yaml`:

```yaml
tokenLifespans:
  accessTokenLifespan: "6m"
  refreshTokenLifespan: "1h"
```

The reproducer script `deploy/session-debug/06-repro-session-bind.py` is self-contained:
1. DCR-registers a client at the vMCP authserver
2. Drives a PKCE auth-code flow (one browser interaction), captures `AT₁ + RT₁`
3. MCP `initialize` with `AT₁` → captures `Mcp-Session-Id` (session is now bound to `hash(AT₁)`)
4. MCP `tools/list` → 200 (sanity)
5. `POST /oauth/token` with `grant_type=refresh_token` → `AT₂`. Confirmed: **same `sub`, same `tsid`, different bytes**.
6. MCP `tools/call <real-tool>` with `AT₂` and the same `Mcp-Session-Id` → bug fires.

Observed in the response body (HTTP 200, JSON-RPC `result.isError=true`):

```json
{"jsonrpc":"2.0","id":3,"result":{"content":[
  {"type":"text","text":"Unauthorized: caller identity does not match session owner"}
],"isError":true}}
```

And in the vMCP logs:

```
WARN  token validation failed: token hash mismatch     reason=token_hash_mismatch
WARN  caller authorization failed, terminating session
      session_id=<sid>  tool=debug-fetch_fetch
      error="caller identity does not match session owner"
```

The session is removed from storage; the client must re-establish.

## Code site

- `pkg/vmcp/session/internal/security/security.go:65` — `hashToken` HMACs the raw bearer-token bytes.
- `pkg/vmcp/session/internal/security/security.go:107` — `validateCaller` performs the comparison.
- `pkg/vmcp/session/internal/security/security.go:145` — the sole AT-hash comparison site.
- `pkg/vmcp/session/internal/security/security.go:207-260` — `RestoreHijackPrevention` (cross-pod state restore, #4405) has the same flaw.
- `pkg/vmcp/session/factory.go:525` — construction site (`PreventSessionHijacking`).
- `pkg/vmcp/server/sessionmanager/factory.go:252-262` — the **invocation** site: the per-tool SDK handler closure captures the decorated session and calls `capturedSess.CallTool`, which is where `validateCaller` actually runs and where the session-termination side-effect is triggered.

The decorator overrides `CallTool`, `ReadResource`, `GetPrompt`. Initialization and listing are unaffected, which makes the failure intermittent from a UX perspective: the client looks healthy until the user actually invokes a tool.

Note: `pkg/vmcp/server/adapter/handler_factory.go:97` calls `backendClient.CallTool` directly (no session decorator in the path) — this is a parallel handler factory used in non-session-managed flows, not the path under discussion. Mentioned to disambiguate.

## Why this is a bug, not a defensible design

1. **MCP transport semantics**: streamable-HTTP transport (`docs/arch/03-transport-architecture.md:141,497,515`) treats `Mcp-Session-Id` as transport state independent of the `Authorization` header — the spec terminates sessions via HTTP 404, never via token rotation, and requires auth on every request even within a session. Returning `Unauthorized: caller identity does not match session owner` on a refreshed AT is not a defined session-termination signal.
2. **OAuth conformance**: RFC 6749 §6 + the OAuth Security BCP (RFC 9700 §4.14) require short ATs and refresh-token rotation as the default deployment model. Binding session validity to AT bytes is incompatible with that model.
3. **Wrong primitive**: if real sender-constraint is required, the standards are DPoP (RFC 9449) and mTLS (RFC 8705), both of which **explicitly preserve the binding across refresh** (RFC 9449 §5: refresh token MUST be bound to the same key; RFC 8705 §4: refresh token SHOULD be bound to the same certificate). Hashing token bytes provides neither sender-constraint nor refresh-safety.
4. **Marginal protection given up is small**: the only attack the current check blocks is "attacker obtained a session ID but not the user's token, and has their own valid token for a different identity." In practice, session ID and bearer token travel in the same HTTP request over the same TLS connection; capturing one almost always means capturing both. Authorization decisions and upstream-token selection are keyed off JWT claims per-request (`pkg/auth/token.go:1203`, `authRoundTripper.Authenticate`), not session-bound state.

## Proposed fix

Bind sessions to the **`tsid` claim** in the signed JWT instead of raw token bytes.

- `tsid` is set by the embedded authserver at `pkg/authserver/server/session/session.go:57,116-123`, lives in the **signed JWT payload** (unforgeable without the AS signing key), and is stable across refresh because fosite reuses the persisted authorize-session on `refresh_token` grants (`pkg/authserver/server/handlers/token.go:19-28`). Downstream code at `pkg/auth/token.go:570-573,1184-1199` already treats it as the stable auth-session identifier (used to look up upstream tokens).
- The reproducer in this issue confirmed `same tsid` and `same sub` across a `refresh_token` exchange while `same bytes = False`.
- Replace `boundTokenHash` with `boundSessionID` (the `tsid` value).
- Update both `PreventSessionHijacking` and `RestoreHijackPrevention`.
- Drop the per-session salt + HMAC machinery (no longer needed; `tsid` is already an unforgeable signed claim).
- **Fallback for non-embedded-AS deployments** (JWT from an external IdP without `tsid`): bind on `iss + sub` per OIDC Core §2 (`sub` is issuer-scoped, not global, so `iss` is required to disambiguate across multi-IdP deployments). Document explicitly. Use `client_id` only for audit logging, not for matching.
- **Keep** the anonymous-session-upgrade prevention at `security.go:111-119`; it does not depend on the token-bytes hash and survives the fix.

## Migration

The persisted session metadata schema changes (`vmcp.token.hash` / `vmcp.token.salt` → `vmcp.session.tsid`). Options for existing sessions on upgrade:

- **Invalidate on read**: sessions without the new key are rejected, forcing one re-auth at deploy time. Simple.
- **Read both, write new**: tolerate old-format sessions until they expire naturally. Smoother rollout.

## Related

- #3867 (closed) — origin of the design.
- #4405, #4067, #4464 — hardening PRs that didn't change the comparison input.
- #3877 — proactive per-request identity resolution; related but distinct.


---

## Update — implementation plan & design refinement

After filing this issue we (a) verified the reproduction end-to-end on a kind cluster using `deploy/session-debug/06-repro-session-bind.py` (full script self-contained: DCR registration, PKCE auth, MCP `initialize`, `refresh_token` grant, `tools/call` with refreshed AT — the response body contains the exact `caller identity does not match session owner` text), and (b) consulted a three-expert panel (Go, OAuth, MCP) on the two open design points. Both came back unanimous.

### Design change: drop the two-tier proposal, bind on `iss + sub` only

The issue originally proposed `tsid` claim (primary) with `iss + sub` (fallback). Investigation revealed two problems with the two-tier approach:

1. **`tsid` is filtered out of `*auth.Identity.Claims` by `filterInternalClaims()` at `pkg/auth/context.go:96-111`** for a documented security reason ("exposing tsid widens the attack surface if a webhook receiver is compromised"). Using `tsid` at the security decorator would require adding a `TokenSessionID` field to `*auth.Identity` and populating it from raw claims pre-filter — touching three pkg/auth files for one consumer (violates `.claude/rules/vmcp-anti-patterns.md` #8).
2. **External IdPs (Keycloak, Entra, Okta) don't emit `tsid`.** For any non-embedded-AS deployment the "fallback" IS the primary path. A two-tier design that's "fallback-only" in production isn't worth the complexity.

`iss + sub` alone is threat-model adequate. The check defends against "attacker with stolen `Mcp-Session-Id` + their own token for a different identity" — `iss + sub` defeats that completely. The granularity `tsid` would buy (concurrent OAuth sessions for the *same* user) is same-user-vs-self, not a hijack, and is not in any MCP client's observed behavior.

### Migration: invalidate on read, not rebind-on-first-use

The issue presented "invalidate on read" and "read both, write new" as equivalent options. The panel was unanimous that **read-both-write-new is unsafe**: the rebind-on-first-use window re-creates the exact attack the check exists to prevent — an attacker with a stolen session-id and their own valid token could claim an orphaned legacy session during the window. Invalidate-on-read fails closed: legacy-format sessions are rejected on restore, the client gets the same `result.isError=true` MCP error it already handles by re-initializing, one forced re-auth per active session at deploy.

### Code-site clarification

The decorator's invocation point is `pkg/vmcp/server/sessionmanager/factory.go:252-262` (the per-tool SDK handler closure that captures the decorated session and calls `capturedSess.CallTool`). This is where the bug's session-termination side effect is triggered and where the regression test will exercise the fix. `pkg/vmcp/server/adapter/handler_factory.go:97` is a parallel (non-session-managed) path that bypasses the decorator and is not under discussion here.

### Plan document

Full implementation plan with commit-level breakdown lives in `vmcp-session-binding-fix-plan.md` in the repo root (not committed; intended to drive the PR).


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

vMCP terminates MCP session on every legitimate access-token refresh #5306

Summary

Reproduction

Code site

Why this is a bug, not a defensible design

Proposed fix

Migration

Related

Update — implementation plan & design refinement

Design change: drop the two-tier proposal, bind on `iss + sub` only

Migration: invalidate on read, not rebind-on-first-use

Code-site clarification

Plan document

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

vMCP terminates MCP session on every legitimate access-token refresh #5306

Description

Summary

Reproduction

Code site

Why this is a bug, not a defensible design

Proposed fix

Migration

Related

Update — implementation plan & design refinement

Design change: drop the two-tier proposal, bind on iss + sub only

Migration: invalidate on read, not rebind-on-first-use

Code-site clarification

Plan document

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions

Design change: drop the two-tier proposal, bind on `iss + sub` only