Skip to content

vMCP terminates MCP session on every legitimate access-token refresh #5306

@jhrozek

Description

@jhrozek

Summary

The hijackPreventionDecorator in pkg/vmcp/session/internal/security/security.go binds an MCP session to an HMAC hash of the raw bearer-token bytes at session creation, and rejects any subsequent request whose token bytes don't hash to the same value. This treats every legitimate OAuth2 refresh-token grant (RFC 6749 §6) as a session hijack attempt — the new AT has different bytes, the hash mismatches, the session is terminated with ErrUnauthorizedCaller ("caller identity does not match session owner").

The MCP streamable-HTTP transport terminates sessions via HTTP 404, never via token rotation, and requires Authorization on every request "even if they are part of the same logical session." Conformant clients keep the same Mcp-Session-Id across token refreshes — and trigger this bug every time their AT rotates.

Reproduction

Reproduced end-to-end on a kind cluster with the in-repo reproducer at deploy/session-debug/. Key config in 03-vmcp-debug.yaml:

tokenLifespans:
  accessTokenLifespan: "6m"
  refreshTokenLifespan: "1h"

The reproducer script deploy/session-debug/06-repro-session-bind.py is self-contained:

  1. DCR-registers a client at the vMCP authserver
  2. Drives a PKCE auth-code flow (one browser interaction), captures AT₁ + RT₁
  3. MCP initialize with AT₁ → captures Mcp-Session-Id (session is now bound to hash(AT₁))
  4. MCP tools/list → 200 (sanity)
  5. POST /oauth/token with grant_type=refresh_tokenAT₂. Confirmed: same sub, same tsid, different bytes.
  6. MCP tools/call <real-tool> with AT₂ and the same Mcp-Session-Id → bug fires.

Observed in the response body (HTTP 200, JSON-RPC result.isError=true):

{"jsonrpc":"2.0","id":3,"result":{"content":[
  {"type":"text","text":"Unauthorized: caller identity does not match session owner"}
],"isError":true}}

And in the vMCP logs:

WARN  token validation failed: token hash mismatch     reason=token_hash_mismatch
WARN  caller authorization failed, terminating session
      session_id=<sid>  tool=debug-fetch_fetch
      error="caller identity does not match session owner"

The session is removed from storage; the client must re-establish.

Code site

  • pkg/vmcp/session/internal/security/security.go:65hashToken HMACs the raw bearer-token bytes.
  • pkg/vmcp/session/internal/security/security.go:107validateCaller performs the comparison.
  • pkg/vmcp/session/internal/security/security.go:145 — the sole AT-hash comparison site.
  • pkg/vmcp/session/internal/security/security.go:207-260RestoreHijackPrevention (cross-pod state restore, Add RestoreHijackPrevention and RestoreSession interface stub #4405) has the same flaw.
  • pkg/vmcp/session/factory.go:525 — construction site (PreventSessionHijacking).
  • pkg/vmcp/server/sessionmanager/factory.go:252-262 — the invocation site: the per-tool SDK handler closure captures the decorated session and calls capturedSess.CallTool, which is where validateCaller actually runs and where the session-termination side-effect is triggered.

The decorator overrides CallTool, ReadResource, GetPrompt. Initialization and listing are unaffected, which makes the failure intermittent from a UX perspective: the client looks healthy until the user actually invokes a tool.

Note: pkg/vmcp/server/adapter/handler_factory.go:97 calls backendClient.CallTool directly (no session decorator in the path) — this is a parallel handler factory used in non-session-managed flows, not the path under discussion. Mentioned to disambiguate.

Why this is a bug, not a defensible design

  1. MCP transport semantics: streamable-HTTP transport (docs/arch/03-transport-architecture.md:141,497,515) treats Mcp-Session-Id as transport state independent of the Authorization header — the spec terminates sessions via HTTP 404, never via token rotation, and requires auth on every request even within a session. Returning Unauthorized: caller identity does not match session owner on a refreshed AT is not a defined session-termination signal.
  2. OAuth conformance: RFC 6749 §6 + the OAuth Security BCP (RFC 9700 §4.14) require short ATs and refresh-token rotation as the default deployment model. Binding session validity to AT bytes is incompatible with that model.
  3. Wrong primitive: if real sender-constraint is required, the standards are DPoP (RFC 9449) and mTLS (RFC 8705), both of which explicitly preserve the binding across refresh (RFC 9449 §5: refresh token MUST be bound to the same key; RFC 8705 §4: refresh token SHOULD be bound to the same certificate). Hashing token bytes provides neither sender-constraint nor refresh-safety.
  4. Marginal protection given up is small: the only attack the current check blocks is "attacker obtained a session ID but not the user's token, and has their own valid token for a different identity." In practice, session ID and bearer token travel in the same HTTP request over the same TLS connection; capturing one almost always means capturing both. Authorization decisions and upstream-token selection are keyed off JWT claims per-request (pkg/auth/token.go:1203, authRoundTripper.Authenticate), not session-bound state.

Proposed fix

Bind sessions to the tsid claim in the signed JWT instead of raw token bytes.

  • tsid is set by the embedded authserver at pkg/authserver/server/session/session.go:57,116-123, lives in the signed JWT payload (unforgeable without the AS signing key), and is stable across refresh because fosite reuses the persisted authorize-session on refresh_token grants (pkg/authserver/server/handlers/token.go:19-28). Downstream code at pkg/auth/token.go:570-573,1184-1199 already treats it as the stable auth-session identifier (used to look up upstream tokens).
  • The reproducer in this issue confirmed same tsid and same sub across a refresh_token exchange while same bytes = False.
  • Replace boundTokenHash with boundSessionID (the tsid value).
  • Update both PreventSessionHijacking and RestoreHijackPrevention.
  • Drop the per-session salt + HMAC machinery (no longer needed; tsid is already an unforgeable signed claim).
  • Fallback for non-embedded-AS deployments (JWT from an external IdP without tsid): bind on iss + sub per OIDC Core §2 (sub is issuer-scoped, not global, so iss is required to disambiguate across multi-IdP deployments). Document explicitly. Use client_id only for audit logging, not for matching.
  • Keep the anonymous-session-upgrade prevention at security.go:111-119; it does not depend on the token-bytes hash and survives the fix.

Migration

The persisted session metadata schema changes (vmcp.token.hash / vmcp.token.saltvmcp.session.tsid). Options for existing sessions on upgrade:

  • Invalidate on read: sessions without the new key are rejected, forcing one re-auth at deploy time. Simple.
  • Read both, write new: tolerate old-format sessions until they expire naturally. Smoother rollout.

Related


Update — implementation plan & design refinement

After filing this issue we (a) verified the reproduction end-to-end on a kind cluster using deploy/session-debug/06-repro-session-bind.py (full script self-contained: DCR registration, PKCE auth, MCP initialize, refresh_token grant, tools/call with refreshed AT — the response body contains the exact caller identity does not match session owner text), and (b) consulted a three-expert panel (Go, OAuth, MCP) on the two open design points. Both came back unanimous.

Design change: drop the two-tier proposal, bind on iss + sub only

The issue originally proposed tsid claim (primary) with iss + sub (fallback). Investigation revealed two problems with the two-tier approach:

  1. tsid is filtered out of *auth.Identity.Claims by filterInternalClaims() at pkg/auth/context.go:96-111 for a documented security reason ("exposing tsid widens the attack surface if a webhook receiver is compromised"). Using tsid at the security decorator would require adding a TokenSessionID field to *auth.Identity and populating it from raw claims pre-filter — touching three pkg/auth files for one consumer (violates .claude/rules/vmcp-anti-patterns.md Start simple hardcoded registry #8).
  2. External IdPs (Keycloak, Entra, Okta) don't emit tsid. For any non-embedded-AS deployment the "fallback" IS the primary path. A two-tier design that's "fallback-only" in production isn't worth the complexity.

iss + sub alone is threat-model adequate. The check defends against "attacker with stolen Mcp-Session-Id + their own token for a different identity" — iss + sub defeats that completely. The granularity tsid would buy (concurrent OAuth sessions for the same user) is same-user-vs-self, not a hijack, and is not in any MCP client's observed behavior.

Migration: invalidate on read, not rebind-on-first-use

The issue presented "invalidate on read" and "read both, write new" as equivalent options. The panel was unanimous that read-both-write-new is unsafe: the rebind-on-first-use window re-creates the exact attack the check exists to prevent — an attacker with a stolen session-id and their own valid token could claim an orphaned legacy session during the window. Invalidate-on-read fails closed: legacy-format sessions are rejected on restore, the client gets the same result.isError=true MCP error it already handles by re-initializing, one forced re-auth per active session at deploy.

Code-site clarification

The decorator's invocation point is pkg/vmcp/server/sessionmanager/factory.go:252-262 (the per-tool SDK handler closure that captures the decorated session and calls capturedSess.CallTool). This is where the bug's session-termination side effect is triggered and where the regression test will exercise the fix. pkg/vmcp/server/adapter/handler_factory.go:97 is a parallel (non-session-managed) path that bypasses the decorator and is not under discussion here.

Plan document

Full implementation plan with commit-level breakdown lives in vmcp-session-binding-fix-plan.md in the repo root (not committed; intended to drive the PR).

Metadata

Metadata

Assignees

Labels

authenticationbugSomething isn't workingvmcpVirtual MCP Server related issues

Type

No type
No fields configured for issues without a type.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions