Retry OAuth token refresh on infrastructure 4xx#5170
Retry OAuth token refresh on infrastructure 4xx#5170gkatz2 wants to merge 1 commit intostacklok:mainfrom
Conversation
The OAuth token refresh endpoint returns 4xx for infrastructure events (WAF/firewall blocks, rate limits, transient bad-config deploys) as well as for OAuth protocol failures. Today every 4xx is treated as a permanent auth failure, killing the workload until manual re-authentication. A momentary VPN flap that routes a refresh request through a non-allowlisted IP is enough to leave the workload dead. Treat 4xx responses that lack a structured RFC 6749 error code as transient; only 4xx with a populated error code (invalid_grant, invalid_client, etc.) remains permanent. 429 is always transient per HTTP standard. Fixes stacklok#5169 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Greg Katz <gkatz@indeed.com>
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #5170 +/- ##
==========================================
- Coverage 67.53% 67.53% -0.01%
==========================================
Files 601 601
Lines 61093 61109 +16
==========================================
+ Hits 41262 41267 +5
- Misses 16714 16725 +11
Partials 3117 3117 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
jhrozek
left a comment
There was a problem hiding this comment.
LGTM. Classification logic is RFC 6749 / 6585 compliant, the real-library contract test is the right way to defend against future golang.org/x/oauth2 upgrades changing ErrorCode semantics, and singleflight already prevents the cross-replica refresh-rotation race from getting worse. PR scope is tight (classification only, ~206 LOC).
Two non-blocking comments below — both are suggestions for follow-up scope and a small test-design refinement, not anything that should hold this up.
| return true | ||
| } | ||
|
|
||
| if statusCode == http.StatusTooManyRequests { |
There was a problem hiding this comment.
429 is now transient, but Retry-After is ignored — the existing 10s → 2m exponential backoff applies regardless of what the server suggests. Real providers (GitHub, Google) routinely set the header, so a Retry-After: 600 will get hit ~5–30 times before the 5-minute cap stops things, instead of waiting the requested cooldown. RFC 6585 §4 says it's MAY, so ignoring it is technically compliant.
Honoring it cleanly would mean threading a "min wait" hint from the classifier into the retry loop, which is a bigger change than this PR's classification-only scope. Would you want to file a follow-up issue (or follow-up PR) for it?
| // RetrieveError.ErrorCode for a given response shape. This test pins that | ||
| // assumption so a future oauth2 upgrade that changes ErrorCode population | ||
| // would surface here, not in production. | ||
| func TestIsTransientNetworkError_AgainstRealOAuth2Library(t *testing.T) { |
There was a problem hiding this comment.
All six subtests here mirror cases already covered by the synthetic table in TestMonitoredTokenSource_BackgroundMonitor_ErrorClassification (400 invalid_grant, 429 empty body, 503, HTML 401/403, etc.).
The doc comment frames the real value well — pinning the golang.org/x/oauth2 ErrorCode-population contract so a future library upgrade that changes that behavior fails here, not in production. That's worth keeping. But as written it reads like overlapping unit coverage, which .claude/rules/testing.md flags ("Avoid overlapping test cases exercising the same code path" / "Tests must only test code in the package under test").
Two ways to make the intent visible:
- Trim to just the cases where the synthetic helpers could diverge from real library output — i.e. the HTML/WAF shapes (where the empty-
ErrorCodepremise is non-obvious), plus an edge case like a form-encoded error body. Drop the JSONinvalid_grant/invalid_clientcases that the synthetic table already pins. - Move to a
contract_test.gowith a header comment noting its role is dependency pinning, not unit coverage of the package's own logic.
Either works — preference?
Summary
invalid_grant,invalid_client, etc.) remains permanent. 429 is always transient per HTTP standard.Fixes #5169
Type of change
Test plan
task test)task lint-fix)Added
TestIsTransientNetworkError_AgainstRealOAuth2Library, which exercises the classification function end-to-end throughgolang.org/x/oauth2.Config.TokenSourceagainst anhttptest.Serverreturning each response shape (HTML 403, HTML 401, JSONinvalid_grant, JSONinvalid_client, 429, 503). This pins the assumption thatRetrieveError.ErrorCodeis populated only for parseable RFC 6749 error responses, so a future oauth2 upgrade that changes that behavior would surface here, not in production.API Compatibility
v1beta1API, OR theapi-break-allowedlabel is applied and the migration guidance is described above.Does this introduce a user-facing change?
Remote MCP servers with OAuth authentication now survive transient 4xx token-endpoint events (WAF blocks, rate limits, brief upstream config errors) instead of permanently going dark on the first occurrence.
Special notes for reviewers
Classification rule
golang.org/x/oauth2populatesRetrieveError.ErrorCodeonly when the response body is parseable JSON containing an RFC 6749errorfield. An emptyErrorCodetherefore signals an infrastructure-level response (HTML page from a WAF, CDN, or reverse proxy), not an OAuth protocol failure. The newclassifyOAuthRetrieveErrorhelper applies this rule:ErrorCode: transient (infrastructure error)ErrorCode: permanent (OAuth protocol failure)Scope
Classification only — no changes to retry parameters, monitor lifecycle, or workload state machine. Outages that span longer than a single refresh attempt (e.g. a multi-hour VPN outage that crosses a token expiry boundary) still result in the workload being marked unauthenticated; that's a separable architectural change requiring a new "transiently failing" state and is not part of this fix.
Interaction with #5044
PR #5044 (in flight) emits a DCR remediation hint when classification returns permanent 4xx. This change reduces false triggers of that hint: the hint will fire for
invalid_grant/invalid_client(where DCR creds genuinely may be stale) but no longer for WAF blocks (where the hint would have been misleading). Pure mechanical merge conflict on the same file, no semantic conflict.Generated with Claude Code