feat(vmcp): plumb operational.timeouts.default into request timeouts #5205

Open

thomasandal wants to merge 1 commit into stacklok:main from thomasandal:tt/plumb-backend-request-timeout

Conversation

@thomasandal

Problem

The CRD field VirtualMCPServer.spec.config.operational.timeouts.default is documented as "the default timeout for backend requests" but is currently dead config. It's defined (pkg/vmcp/config/config.go:524), defaulted (pkg/vmcp/config/defaults.go:54) and validated (pkg/vmcp/config/validator.go:367), but never read by any request-path code.

The actual ceiling is hardcoded in two places (sketched in code after the list):

  1. Outbound — vmcp → backend MCPServer:
    pkg/vmcp/session/internal/backend/mcp_session.go:35

    defaultBackendRequestTimeout = 30 * time.Second

    Applied at lines 298–304 to the streamable-http http.Client.Timeout and the mark3labs SDK's WithHTTPTimeout.

  2. Inbound — vmcp's own http.Server:
    pkg/vmcp/server/server.go:53,59

    defaultReadTimeout  = 30 * time.Second
    defaultWriteTimeout = 30 * time.Second

    Applied at lines 696–697 in Server.Start.
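
For orientation, here is a minimal sketch of how those two constants cap a request today. The constructor names are illustrative, and the real code also wires the mark3labs SDK's WithHTTPTimeout; only the net/http pieces are shown:

```go
package sketch

import (
	"net/http"
	"time"
)

// Both ceilings are hardcoded constants today, independent of whatever
// operational.timeouts.default says on the CR.
const (
	defaultBackendRequestTimeout = 30 * time.Second // mcp_session.go:35
	defaultReadTimeout           = 30 * time.Second // server.go:53
	defaultWriteTimeout          = 30 * time.Second // server.go:59
)

// Outbound layer: the HTTP client used for streamable-http backends gives
// every backend request a hard 30s budget.
func newBackendHTTPClient() *http.Client {
	return &http.Client{Timeout: defaultBackendRequestTimeout}
}

// Inbound layer: vmcp's own listener. WriteTimeout also cuts off responses
// that take longer than 30s to produce, which is what nginx reports as 502.
func newVMCPServer(addr string, h http.Handler) *http.Server {
	return &http.Server{
		Addr:         addr,
		Handler:      h,
		ReadTimeout:  defaultReadTimeout,
		WriteTimeout: defaultWriteTimeout,
	}
}
```

Because the ceilings sit on two independent layers, lifting only one of them is not enough, which matches the bisection result below.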

User-visible symptom: any tool call that takes >30s upstream returns 502 from any nginx in front of vmcp, even with operational.timeouts.default: 120s set on the CR. Heavy ClickHouse aggregations, deep Perplexity research, and large GitHub searches are the common cases.

I confirmed both layers fire by bisecting in production: 28s queries succeed; 35s queries 502'd until BOTH timeouts were lifted. mcp-clickhouse pod logs show the upstream completing the query in all cases — vmcp hangs up before the response can be relayed.

Change

Plumb operational.timeouts.default (already a time.Duration-castable custom Duration type) through to both layers; a simplified sketch of the plumbing follows the list:

  • pkg/vmcp/session/internal/backend/mcp_session.go — NewHTTPConnector(registry, requestTimeout) and createMCPClient(..., requestTimeout) accept a time.Duration. 0 falls back to the existing 30s constant. SSE backends are unaffected.
  • pkg/vmcp/session/factory.go — new WithBackendRequestTimeout(d) factory option, mirrors the existing WithBackendInitTimeout shape. NewSessionFactory pre-applies opts to a sentinel struct so the connector can be constructed with the configured timeout (option funcs are pure setters, double-application is idempotent).
  • pkg/vmcp/server/server.go — new Config.RequestTimeout. When > 0, it overrides both ReadTimeout and WriteTimeout. SSE GET routes still clear their per-request write deadline via transportmiddleware.WriteTimeout ("net/http: no way of manipulating timeouts in Handler", golang/go#16100), so SSE behavior is unchanged.
  • pkg/vmcp/cli/serve.go — reads cfg.Operational.Timeouts.Default, passes it to createSessionFactory (forwarded via the new option), and also sets serverCfg.RequestTimeout. Nil-safe — defaults are preserved when Operational or Timeouts is unset.
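
A minimal sketch of that plumbing. The option and config field names match those introduced above; the helper newBackendHTTPClient and the surrounding types are simplified assumptions rather than the exact code:

```go
package sketch

import (
	"net/http"
	"time"
)

const defaultBackendRequestTimeout = 30 * time.Second

// Outbound: a zero timeout means "keep the existing 30s constant", so
// callers that pass nothing see no behavior change.
func newBackendHTTPClient(requestTimeout time.Duration) *http.Client {
	if requestTimeout <= 0 {
		requestTimeout = defaultBackendRequestTimeout
	}
	return &http.Client{Timeout: requestTimeout}
}

// Factory option mirroring WithBackendInitTimeout: a pure setter, so
// pre-applying options twice (the sentinel-struct trick) is idempotent.
type factoryOptions struct {
	backendRequestTimeout time.Duration
}

type Option func(*factoryOptions)

func WithBackendRequestTimeout(d time.Duration) Option {
	return func(o *factoryOptions) { o.backendRequestTimeout = d }
}

// Inbound: when RequestTimeout > 0 it overrides both server timeouts.
type Config struct {
	ReadTimeout    time.Duration // defaults to 30s
	WriteTimeout   time.Duration // defaults to 30s
	RequestTimeout time.Duration // from operational.timeouts.default
}

func (c Config) effectiveTimeouts() (read, write time.Duration) {
	read, write = c.ReadTimeout, c.WriteTimeout
	if c.RequestTimeout > 0 {
		read, write = c.RequestTimeout, c.RequestTimeout
	}
	return read, write
}
```

The zero-value fallback is what keeps existing callers, and the tests that pass 0 for the new argument, on the old 30s behavior.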

Tests in pkg/vmcp/session/internal/backend/mcp_session_test.go and pkg/vmcp/cli/serve_test.go pass 0 for the new arg to preserve the prior behavior.

Test plan

  • go build ./... — clean.
  • go test ./pkg/vmcp/session/... ./pkg/vmcp/cli/... ./pkg/vmcp/server/... — all green.
  • Built a ko image from this branch and ran it in production for ~30 minutes against a homelab cluster: 100 fresh-connection probes returned 401 (no 5xx regressions); the previously-502'ing 35s ClickHouse query now returns 200 in 35s. The remaining ceiling at ~60s is the operator-side streamable_proxy.defaultRequestTimeout (already env-overridable via TOOLHIVE_PROXY_REQUEST_TIMEOUT).

Notes

  • Backwards-compatible: the new requestTimeout/RequestTimeout parameters default to 0, which preserves the existing 30s constants. Existing callers that don't pass anything see no change.
  • Doesn't address the operator-side pkg/transport/proxy/streamable/streamable_proxy.go:35 60s default (already overridable via TOOLHIVE_PROXY_REQUEST_TIMEOUT env on the proxy container). That's a separate hop for a separate PR if anyone wants symmetry.

…imeout

The CRD field `VirtualMCPServer.spec.config.operational.timeouts.default`
exists and is documented as "the default timeout for backend requests"
but is dead config in v0.26.1: it's parsed (config.go:524), defaulted
(defaults.go:54) and validated (validator.go:367) but never read by the
request-path code. The actual ceiling is hardcoded at
`pkg/vmcp/session/internal/backend/mcp_session.go:35`
(`defaultBackendRequestTimeout = 30 * time.Second`) and applied on
lines 298-304 to both the streamable-HTTP `http.Client.Timeout` and
the mark3labs SDK's `WithHTTPTimeout`. Tool calls that take >30s
upstream — long ClickHouse aggregations, deep Perplexity research,
heavy GitHub searches — surface as 502 from any nginx in front of
vmcp because vmcp closes the backend HTTP client before the response
arrives.

Plumb the existing config through:
- `NewHTTPConnector` and `createMCPClient` accept a `requestTimeout
  time.Duration`; 0 falls back to the existing 30s constant so callers
  that pass nothing keep the old behavior.
- New `session.WithBackendRequestTimeout` factory option mirrors the
  existing `WithBackendInitTimeout`.
- `NewSessionFactory` pre-applies opts to a sentinel struct so the
  connector can be constructed with the configured timeout (option
  funcs are pure setters, so double-application is idempotent).
- `cmd/vmcp/cli/serve.go` reads `cfg.Operational.Timeouts.Default` and
  passes it to `createSessionFactory`, which forwards it via the new
  option. Nil-safe — falls back to 30s when `Operational` or
  `Timeouts` is unset, preserving upstream defaults.

SSE backends are unaffected: their stream lifetime is unbounded by
design and `http.Client.Timeout` is intentionally omitted on that
code path.
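
A minimal sketch of that distinction, assuming a simple transport-type switch (the branch shape is illustrative, not the actual code path):

```go
package sketch

import (
	"net/http"
	"time"
)

// Only streamable-http backends get a client-level Timeout; the SSE code
// path deliberately leaves it unset because the stream may stay open
// indefinitely and a fixed ceiling would kill it mid-stream.
func newBackendClient(transportType string, requestTimeout time.Duration) *http.Client {
	if transportType == "sse" {
		return &http.Client{} // no Timeout: stream lifetime is unbounded by design
	}
	if requestTimeout <= 0 {
		requestTimeout = 30 * time.Second // existing default, kept when nothing is configured
	}
	return &http.Client{Timeout: requestTimeout}
}
```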

All existing tests pass; updated four call sites in
mcp_session_test.go and serve_test.go to pass 0 for the new timeout
arg (preserves prior behavior).