Skip to content

fix: implement LRU eviction for SSL context cache and unify OAuth error handling#4241

Open
bogdanmariusc10 wants to merge 2 commits intomainfrom
4239-bug-ssl-context-cache-thrash-missing-test-assertion-and-oauth-error-handling-inconsistencies
Open

fix: implement LRU eviction for SSL context cache and unify OAuth error handling#4241
bogdanmariusc10 wants to merge 2 commits intomainfrom
4239-bug-ssl-context-cache-thrash-missing-test-assertion-and-oauth-error-handling-inconsistencies

Conversation

@bogdanmariusc10
Copy link
Copy Markdown
Collaborator

🔗 Related Issue

Closes #4239


📝 Summary

This PR fixes two critical bugs affecting production deployments:

  1. SSL Context Cache Thrash (HIGH): The SSL context cache was performing a full wipe when reaching capacity (100 entries), causing severe performance degradation in deployments with >100 gateways using custom CA certificates. Now implements proper LRU (Least Recently Used) eviction.

  2. OAuth Error Handling Inconsistency (MEDIUM): The refresh_token method used manual status code checking while other OAuth flows (_client_credentials_flow, _password_flow) used raise_for_status(). This inconsistency has been unified for better maintainability.


🏷️ Type of Change

  • Bug fix
  • Feature / Enhancement
  • Documentation
  • Refactor
  • Chore (deps, CI, tooling)
  • Other (describe below)

🔧 Changes Made

Issue 1: SSL Context Cache LRU Eviction

File: mcpgateway/utils/ssl_context_cache.py

  • Imported OrderedDict from collections for LRU tracking
  • Changed cache data structures from dict to OrderedDict
  • Added move_to_end() calls on cache hits to mark entries as recently used
  • Replaced full cache wipe with popitem(last=False) to remove only the oldest entry
  • Added graceful handling for empty timestamps dict during eviction

Before (Cache Thrash):

if len(_ssl_context_cache) > _SSL_CONTEXT_CACHE_MAX_SIZE:
    _ssl_context_cache.clear()  # Wiped ALL 100+ entries

After (LRU Eviction):

if len(_ssl_context_cache) >= _SSL_CONTEXT_CACHE_MAX_SIZE:
    _ssl_context_cache.popitem(last=False)  # Remove oldest only
    if _ssl_context_cache_timestamps:
        _ssl_context_cache_timestamps.popitem(last=False)

Issue 2: OAuth Error Handling Consistency

File: mcpgateway/services/oauth_manager.py

  • Standardized refresh_token to use response.raise_for_status() like other OAuth flows
  • Preserved special handling for 400/401 errors (invalid refresh tokens) via HTTPStatusError exception handling
  • Maintained retry logic with exponential backoff for transient errors

Before (Manual Status Check):

if response.status_code == 200:
    token_response = response.json()
    # ...
else:
    raise OAuthError(...)

After (Consistent raise_for_status):

response.raise_for_status()
token_response = response.json()
# ...
except httpx.HTTPStatusError as e:
    if e.response.status_code in [400, 401]:
        raise OAuthError(f"Refresh token invalid or expired: {e.response.text}")

Test Updates

Files: tests/unit/mcpgateway/utils/test_ssl_context_cache.py, tests/unit/mcpgateway/services/test_oauth_manager.py, tests/unit/mcpgateway/services/test_oauth_manager_pkce.py

  • Updated SSL cache tests to expect LRU behavior (100 entries maintained, oldest evicted)
  • Updated OAuth tests to mock raise_for_status() with HTTPStatusError exceptions
  • Added Mock import to test_oauth_manager_pkce.py
  • Adjusted error message regex patterns to match new error handling format

🧪 Verification

Check Command Status
Lint suite make lint Pass ✅
Unit tests make test Pass ✅
Coverage ≥ 80% make coverage Pass ✅

✅ Checklist

  • Code formatted (make black isort pre-commit)
  • Tests added/updated for changes
  • Documentation updated (if applicable)
  • No secrets or credentials committed

📓 Notes

Impact Assessment

Issue 1 - SSL Cache (HIGH SEVERITY):

  • Before: Every 101st unique CA certificate cleared all 100 cached SSL contexts
  • After: Only the least recently used entry is evicted, maintaining cache efficiency
  • Benefit: Eliminates cache thrash in high-traffic deployments with many gateways

Issue 2 - OAuth Consistency (MEDIUM SEVERITY):

  • Before: Three different error handling patterns across OAuth flows
  • After: Unified raise_for_status() pattern with specialized exception handling
  • Benefit: Improved code maintainability and predictable error behavior

Testing Strategy

All existing tests updated to reflect new behavior:

  • SSL cache tests verify LRU eviction (oldest entry removed, 100 entries maintained)
  • OAuth tests mock HTTPStatusError exceptions for status code validation
  • Special handling for 400/401 refresh token errors preserved and tested

Backward Compatibility

Fully backward compatible - No breaking changes to public APIs or behavior, only internal implementation improvements.

…or handling

Issue 1 - SSL Context Cache Thrash (HIGH):
- Replace full cache wipe with LRU eviction using OrderedDict
- Move cache entries to end on access to mark as recently used
- Remove only oldest entry when cache reaches capacity (100 entries)
- Prevents cache thrash in deployments with >100 gateways using custom CA certs
- Gracefully handle empty timestamps dict during eviction

Issue 2 - OAuth Error Handling Consistency (MEDIUM):
- Standardize refresh_token to use raise_for_status() like other OAuth flows
- Preserve special handling for 400/401 errors (invalid refresh tokens)
- Maintain retry logic with exponential backoff for transient errors
- Improves code maintainability and predictability

Test Updates:
- Update SSL cache tests to expect LRU behavior (100 entries maintained)
- Update OAuth tests to mock raise_for_status() with HTTPStatusError
- Add Mock import to test_oauth_manager_pkce.py
- Adjust error message regex patterns for new error handling

Fixes cache thrash under high traffic and unifies OAuth error patterns.

Signed-off-by: Bogdan-Marius-Catanus <bogdan-marius.catanus@ibm.com>
@bogdanmariusc10 bogdanmariusc10 added this to the Release 1.0.0 milestone Apr 16, 2026
@bogdanmariusc10 bogdanmariusc10 added SHOULD P2: Important but not vital; high-value items that are not crucial for the immediate release api REST API Related item labels Apr 16, 2026
- Remove unused typing.Any and typing.Dict imports (ruff F401)
- Fix mypy type errors for _SSL_CONTEXT_CACHE_TTL variable
- Add type annotation for ca_certificate parameter to accept str | bytes
- Remove trailing whitespace from test file (pre-commit hook)

Resolves mypy assignment and arg-type errors while maintaining backward compatibility.

Signed-off-by: Bogdan-Marius-Catanus <bogdan-marius.catanus@ibm.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

api REST API Related item SHOULD P2: Important but not vital; high-value items that are not crucial for the immediate release

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG]: SSL context cache thrash, missing test assertion and OAuth error handling inconsistencies

1 participant