Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
117 changes: 79 additions & 38 deletions pkg/auth/monitored_token_source.go
Original file line number Diff line number Diff line change
Expand Up @@ -283,11 +283,11 @@ func (mts *MonitoredTokenSource) Stopped() <-chan struct{} {
return mts.stopped
}

// Token retrieves a token, retrying with exponential backoff on transient errors
// (see isTransientNetworkError for the full list). On non-transient errors
// (OAuth 4xx, TLS failures) it marks the workload as unauthenticated and returns
// immediately. Context cancellation (workload removal) stops the retry without
// marking the workload as unauthenticated.
// Token retrieves a token, retrying with exponential backoff on transient
// errors and marking the workload as unauthenticated on non-transient errors.
// See isTransientNetworkError for the classification rule. Context
// cancellation (workload removal) stops the retry without marking the
// workload as unauthenticated.
//
// Concurrent callers are deduplicated via singleflight so that only one retry
// loop runs at a time during transient failures.
Expand Down Expand Up @@ -382,13 +382,20 @@ func (mts *MonitoredTokenSource) onTick() (bool, time.Duration) {
return false, wait
}

// isTransientNetworkError reports whether err represents a transient condition
// (DNS failure, TCP transport error, timeout, OAuth server 5xx, unparsable
// token response) that is likely to resolve on its own.
// isTransientNetworkError reports whether err represents a transient
// condition that is likely to resolve on its own. The categories are:
//
// OAuth2 client-level auth failures (invalid_grant, 401, 400) and TLS errors
// (certificate verification, handshake failure) are NOT considered transient and
// return false so the workload is marked unauthenticated immediately.
// - Network-level failures: DNS lookup errors, TCP transport errors,
// timeouts.
// - OAuth token-endpoint responses classified as transient by
// isTransientRetrieveError.
// - Unparsable token responses on a 2xx status (typically an HTML page
// from a load balancer or CDN).
//
// All other errors return false, causing the workload to be marked
// unauthenticated. TLS failures (certificate verification, handshake
// failure) are intentionally non-transient even though they surface
// through net.OpError like transport-level errors.
//
// The function is side-effect free; callers that want to emit a DCR
// remediation hint on a permanent 4xx must do so themselves at the
Expand All @@ -400,17 +407,8 @@ func isTransientNetworkError(err error) bool {
return false
}

// OAuth HTTP-level errors: 5xx (Bad Gateway, Service Unavailable, Gateway
// Timeout) are transient server-side issues that typically resolve on their
// own. 4xx errors (invalid_grant, invalid_client) are permanent auth failures.
if retrieveErr, ok := errors.AsType[*oauth2.RetrieveError](err); ok {
if retrieveErr.Response != nil && retrieveErr.Response.StatusCode >= 500 {
slog.Debug("treating OAuth server error as transient",
"status_code", retrieveErr.Response.StatusCode,
)
return true
}
return false
return isTransientRetrieveError(retrieveErr)
}

// Non-JSON responses from the OAuth server (e.g. load balancer HTML pages).
Expand Down Expand Up @@ -445,33 +443,76 @@ func isTransientNetworkError(err error) bool {
}

// isPermanentTokenEndpointError reports whether err is an *oauth2.RetrieveError
// whose status implies the cached client credentials are themselves the
// problem — specifically 400 (invalid_grant / invalid_client), 401, or
// 403. Used at state-transition boundaries to decide whether to emit a
// DCR/CIMD remediation hint alongside the unauthentication.
// whose response carries a structured RFC 6749 'error' code, implying the
// OAuth server itself rendered a verdict on the cached credentials
// (invalid_grant, invalid_client, etc.). Used at state-transition
// boundaries to decide whether to emit a DCR/CIMD remediation hint
// alongside the unauthentication.
//
// Other 4xx codes are intentionally NOT treated as permanent here even
// though isTransientNetworkError classifies the whole RetrieveError
// branch as non-transient. 408 (Request Timeout) and 429 (Too Many
// Requests) are typically transient back-pressure that the operator
// cannot remediate by deleting cached credentials; firing the
// "delete the cached credentials and restart" Warn on those would
// mislead operators chasing a transient hiccup. The narrower allowlist
// keeps the remediation hint truthful.
// This is the strict inverse of isTransientRetrieveError on the
// *oauth2.RetrieveError branch: a response is "permanent" iff the
// classifier would NOT call it transient. Concretely, the Warn fires
// only when ErrorCode is populated. 4xx responses without an OAuth
// error code (HTML pages from a WAF, CDN, or reverse proxy) — like
// 5xx, 429, 408, and nil-Response shapes — are treated as
// non-permanent because we have no OAuth-protocol verdict to act on.
// Recommending the user delete cached credentials based on a non-
// spec-compliant response would frequently mislead operators whose
// real problem is upstream of the OAuth server.
func isPermanentTokenEndpointError(err error) bool {
retrieveErr, ok := errors.AsType[*oauth2.RetrieveError](err)
if !ok {
if !ok || retrieveErr.Response == nil {
return false
}
return !isTransientRetrieveError(retrieveErr)
}

// isTransientRetrieveError reports whether an *oauth2.RetrieveError should
// be treated as transient. The classification rules are:
//
// - nil Response: non-transient. There is no signal to act on, so we fall
// through to the unauthenticated path rather than retry blindly.
// - 5xx status: transient (server-side issue, likely to resolve).
// - 429 Too Many Requests: transient regardless of body (HTTP standard).
// - 4xx with an empty ErrorCode: transient. The oauth2 library populates
// ErrorCode from the RFC 6749 'error' field in a JSON response body. An
// empty ErrorCode means the response was not a parseable OAuth error —
// typically an HTML page from a WAF, CDN, or reverse proxy that
// intercepted the request before it reached the OAuth server. These
// infrastructure errors (Cloudflare blocks, residential-IP allowlist
// misses, transient bad-config deploys) commonly resolve on their own.
// - 4xx with a populated ErrorCode: permanent. The OAuth server returned
// a structured error code (invalid_grant, invalid_client, etc.) telling
// us specifically what's wrong; retrying won't help.
func isTransientRetrieveError(retrieveErr *oauth2.RetrieveError) bool {
if retrieveErr.Response == nil {
return false
}
switch retrieveErr.Response.StatusCode {
case http.StatusBadRequest, http.StatusUnauthorized, http.StatusForbidden:
statusCode := retrieveErr.Response.StatusCode

if statusCode >= 500 {
slog.Debug("treating OAuth server error as transient",
"status_code", statusCode,
)
return true
default:
return false
}

if statusCode == http.StatusTooManyRequests {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

429 is now transient, but Retry-After is ignored — the existing 10s → 2m exponential backoff applies regardless of what the server suggests. Real providers (GitHub, Google) routinely set the header, so a Retry-After: 600 will get hit ~5–30 times before the 5-minute cap stops things, instead of waiting the requested cooldown. RFC 6585 §4 says it's MAY, so ignoring it is technically compliant.

Honoring it cleanly would mean threading a "min wait" hint from the classifier into the retry loop, which is a bigger change than this PR's classification-only scope. Would you want to file a follow-up issue (or follow-up PR) for it?

slog.Debug("treating OAuth rate-limit response as transient",
"status_code", statusCode,
"error_code", retrieveErr.ErrorCode,
)
return true
}

if retrieveErr.ErrorCode == "" {
slog.Debug("treating OAuth 4xx without error code as transient",
"status_code", statusCode,
)
return true
}

return false
}

// isOAuthParseError detects errors from the oauth2 library that indicate the
Expand Down
Loading
Loading