Skip to content

Retry OAuth token refresh on infrastructure 4xx#5170

Open
gkatz2 wants to merge 1 commit intostacklok:mainfrom
gkatz2:fix/retry-4xx-token-refresh-5169
Open

Retry OAuth token refresh on infrastructure 4xx#5170
gkatz2 wants to merge 1 commit intostacklok:mainfrom
gkatz2:fix/retry-4xx-token-refresh-5169

Conversation

@gkatz2
Copy link
Copy Markdown
Contributor

@gkatz2 gkatz2 commented May 3, 2026

Summary

  • The OAuth token refresh endpoint returns 4xx for infrastructure events (WAF/firewall blocks, rate limits, transient bad-config deploys) as well as for OAuth protocol failures. Today every 4xx is treated as a permanent auth failure, killing the workload until manual re-authentication. A momentary VPN flap that routes a refresh request through a non-allowlisted IP is enough to leave the workload dead.
  • Treat 4xx responses that lack a structured RFC 6749 error code as transient, so they enter the existing retry loop added in Retry OAuth token refresh on server errors #4513. Only 4xx with a populated error code (invalid_grant, invalid_client, etc.) remains permanent. 429 is always transient per HTTP standard.

Fixes #5169

Type of change

  • Bug fix

Test plan

  • Unit tests (task test)
  • Linting (task lint-fix)
  • Manual testing (describe below)

Added TestIsTransientNetworkError_AgainstRealOAuth2Library, which exercises the classification function end-to-end through golang.org/x/oauth2.Config.TokenSource against an httptest.Server returning each response shape (HTML 403, HTML 401, JSON invalid_grant, JSON invalid_client, 429, 503). This pins the assumption that RetrieveError.ErrorCode is populated only for parseable RFC 6749 error responses, so a future oauth2 upgrade that changes that behavior would surface here, not in production.

API Compatibility

  • This PR does not break the v1beta1 API, OR the api-break-allowed label is applied and the migration guidance is described above.

Does this introduce a user-facing change?

Remote MCP servers with OAuth authentication now survive transient 4xx token-endpoint events (WAF blocks, rate limits, brief upstream config errors) instead of permanently going dark on the first occurrence.

Special notes for reviewers

Classification rule

golang.org/x/oauth2 populates RetrieveError.ErrorCode only when the response body is parseable JSON containing an RFC 6749 error field. An empty ErrorCode therefore signals an infrastructure-level response (HTML page from a WAF, CDN, or reverse proxy), not an OAuth protocol failure. The new classifyOAuthRetrieveError helper applies this rule:

  • 5xx: transient (existing behavior, from Retry OAuth token refresh on server errors #4513)
  • 429: transient regardless of body (HTTP standard)
  • 4xx with empty ErrorCode: transient (infrastructure error)
  • 4xx with populated ErrorCode: permanent (OAuth protocol failure)

Scope

Classification only — no changes to retry parameters, monitor lifecycle, or workload state machine. Outages that span longer than a single refresh attempt (e.g. a multi-hour VPN outage that crosses a token expiry boundary) still result in the workload being marked unauthenticated; that's a separable architectural change requiring a new "transiently failing" state and is not part of this fix.

Interaction with #5044

PR #5044 (in flight) emits a DCR remediation hint when classification returns permanent 4xx. This change reduces false triggers of that hint: the hint will fire for invalid_grant/invalid_client (where DCR creds genuinely may be stale) but no longer for WAF blocks (where the hint would have been misleading). Pure mechanical merge conflict on the same file, no semantic conflict.

Generated with Claude Code

The OAuth token refresh endpoint returns 4xx for infrastructure
events (WAF/firewall blocks, rate limits, transient bad-config
deploys) as well as for OAuth protocol failures. Today every 4xx
is treated as a permanent auth failure, killing the workload
until manual re-authentication. A momentary VPN flap that routes
a refresh request through a non-allowlisted IP is enough to
leave the workload dead.

Treat 4xx responses that lack a structured RFC 6749 error code
as transient; only 4xx with a populated error code
(invalid_grant, invalid_client, etc.) remains permanent. 429 is
always transient per HTTP standard.

Fixes stacklok#5169

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Greg Katz <gkatz@indeed.com>
@github-actions github-actions Bot added the size/S Small PR: 100-299 lines changed label May 3, 2026
@codecov
Copy link
Copy Markdown

codecov Bot commented May 3, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 67.53%. Comparing base (8c90184) to head (9de8e4f).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #5170      +/-   ##
==========================================
- Coverage   67.53%   67.53%   -0.01%     
==========================================
  Files         601      601              
  Lines       61093    61109      +16     
==========================================
+ Hits        41262    41267       +5     
- Misses      16714    16725      +11     
  Partials     3117     3117              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Copy Markdown
Contributor

@jhrozek jhrozek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Classification logic is RFC 6749 / 6585 compliant, the real-library contract test is the right way to defend against future golang.org/x/oauth2 upgrades changing ErrorCode semantics, and singleflight already prevents the cross-replica refresh-rotation race from getting worse. PR scope is tight (classification only, ~206 LOC).

Two non-blocking comments below — both are suggestions for follow-up scope and a small test-design refinement, not anything that should hold this up.

return true
}

if statusCode == http.StatusTooManyRequests {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

429 is now transient, but Retry-After is ignored — the existing 10s → 2m exponential backoff applies regardless of what the server suggests. Real providers (GitHub, Google) routinely set the header, so a Retry-After: 600 will get hit ~5–30 times before the 5-minute cap stops things, instead of waiting the requested cooldown. RFC 6585 §4 says it's MAY, so ignoring it is technically compliant.

Honoring it cleanly would mean threading a "min wait" hint from the classifier into the retry loop, which is a bigger change than this PR's classification-only scope. Would you want to file a follow-up issue (or follow-up PR) for it?

// RetrieveError.ErrorCode for a given response shape. This test pins that
// assumption so a future oauth2 upgrade that changes ErrorCode population
// would surface here, not in production.
func TestIsTransientNetworkError_AgainstRealOAuth2Library(t *testing.T) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All six subtests here mirror cases already covered by the synthetic table in TestMonitoredTokenSource_BackgroundMonitor_ErrorClassification (400 invalid_grant, 429 empty body, 503, HTML 401/403, etc.).

The doc comment frames the real value well — pinning the golang.org/x/oauth2 ErrorCode-population contract so a future library upgrade that changes that behavior fails here, not in production. That's worth keeping. But as written it reads like overlapping unit coverage, which .claude/rules/testing.md flags ("Avoid overlapping test cases exercising the same code path" / "Tests must only test code in the package under test").

Two ways to make the intent visible:

  1. Trim to just the cases where the synthetic helpers could diverge from real library output — i.e. the HTML/WAF shapes (where the empty-ErrorCode premise is non-obvious), plus an edge case like a form-encoded error body. Drop the JSON invalid_grant/invalid_client cases that the synthetic table already pins.
  2. Move to a contract_test.go with a header comment noting its role is dependency pinning, not unit coverage of the package's own logic.

Either works — preference?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size/S Small PR: 100-299 lines changed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Token refresh treats transient 4xx as permanent auth failures

2 participants