Skip to content

feat: entropy-based secret detection for exception code variables#692

Merged
ablaszkiewicz merged 3 commits into
mainfrom
code-variables-entropy-detection
Jun 23, 2026
Merged

feat: entropy-based secret detection for exception code variables#692
ablaszkiewicz merged 3 commits into
mainfrom
code-variables-entropy-detection

Conversation

@ablaszkiewicz

@ablaszkiewicz ablaszkiewicz commented Jun 23, 2026

Copy link
Copy Markdown
Contributor

Currently we detect secrets based on keys and values having common secret phrases like api-key, password, etc...

@hpouillot got an amazing idea to do an entropy based secrets detection. This PR does that and extends our detection with popular secrets format like sk-ant, ey..., gh_..., glpat_....

There is a problem however - sometimes high entropy strings are classified as secrets. We do our best to detect genuine high entropy non-secrets and stop the entropy detection process if it's:

  • UUID,
  • file or URL path,
  • something [in these brackets] or in {these brackets}.

There are much more cases and rules - it's in the PR.

I also tightened our search limits based on real world exceptions.

Did a benchmark on 3 real exceptions. This computes how long does it take to capture variables from frames

Case Before (master) (ms) After (ms) Δ (ms)
eventA 0.062 0.083 +0.021 (+34%)
eventB 0.040 0.053 +0.013 (+31%)
eventC 0.040 0.053 +0.012 (+31%)

@greptile-apps

greptile-apps Bot commented Jun 23, 2026

Copy link
Copy Markdown
Contributor

Comments Outside Diff (3)

  1. posthog/test/test_code_variables.py, line 684-688 (link)

    P2 _is_high_entropy_secret called outside its documented pre-condition

    _is_high_entropy_secret is documented as "Assumes len(value) >= _SECRET_MIN_LENGTH" and the only call site in production code respects this (the length guard in _looks_like_secret runs first). The test calls _is_high_entropy_secret("xK9#mP2$") — an 8-character string — directly, bypassing that guard. The function happens to return False because the entropy of 8 unique characters is ≈ 3.0 bits/char (below the 3.8 threshold), but the test is exercising behaviour that lies outside the function's contract. A future maintainer who tightens the generic gate could inadvertently shift the function's short-string behaviour without any failing tests.

    Consider testing the cheap-exit via the public _looks_like_secret instead, which is the only intended caller and enforces the length precondition explicitly.

  2. posthog/test/test_code_variables.py, line 726-744 (link)

    P2 No test for the per-context override of detect_secrets

    The per-context toggle set_code_variables_detect_secrets_context is a new public API entry point wired through contexts.py and client.py, but TestSecretDetection only exercises the client-level option (code_variables_detect_secrets=False in make_client). A test that wraps a capture in a with posthog.new_context(): set_code_variables_detect_secrets_context(False) block and asserts the secret leaks through would close the remaining gap and mirror the coverage pattern the other context-level toggles already have.

    Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

  3. posthog/test/test_code_variables.py, line 679-692 (link)

    P2 Duplicate coverage of already-parameterised values

    test_uuids_and_object_ids_are_never_flagged and test_pure_hex_is_treated_as_an_id_not_a_secret assert on exact string values (the UUID, Mongo ObjectId, and md5 hex digest) that are already present in NON_SECRETS and exercised by the parameterised test_non_secrets_are_not_flagged. This violates the OnceAndOnlyOnce rule and the team preference for parameterised tests over dedicated single-case tests. Either remove these methods and let the parameterised suite carry the coverage, or pull these "must-never-flag" entries into a named sub-list with their own parameterised test and remove the duplicates from NON_SECRETS.

    Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Reviews (1): Last reviewed commit: "feat: entropy-based secret detection for..." | Re-trigger Greptile

@github-actions

github-actions Bot commented Jun 23, 2026

Copy link
Copy Markdown
Contributor

posthog-python Compliance Report

Date: 2026-06-23 16:33:49 UTC
Duration: 530093ms

✅ All Tests Passed!

45/45 tests passed


Capture Tests

29/29 tests passed

View Details
Test Status Duration
Format Validation.Event Has Required Fields 516ms
Format Validation.Event Has Uuid 10007ms
Format Validation.Event Has Lib Properties 10007ms
Format Validation.Distinct Id Is String 10007ms
Format Validation.Token Is Present 10007ms
Format Validation.Custom Properties Preserved 10008ms
Format Validation.Event Has Timestamp 10007ms
Retry Behavior.Retries On 503 18020ms
Retry Behavior.Does Not Retry On 400 12003ms
Retry Behavior.Does Not Retry On 401 10009ms
Retry Behavior.Respects Retry After Header 16013ms
Retry Behavior.Implements Backoff 30018ms
Retry Behavior.Retries On 500 13009ms
Retry Behavior.Retries On 502 16011ms
Retry Behavior.Retries On 504 16011ms
Retry Behavior.Max Retries Respected 30017ms
Deduplication.Generates Unique Uuids 7002ms
Deduplication.Preserves Uuid On Retry 16017ms
Deduplication.Preserves Uuid And Timestamp On Retry 23019ms
Deduplication.Preserves Uuid And Timestamp On Batch Retry 16005ms
Deduplication.No Duplicate Events In Batch 10003ms
Deduplication.Different Events Have Different Uuids 10006ms
Compression.Sends Gzip When Enabled 10009ms
Batch Format.Uses Proper Batch Structure 10007ms
Batch Format.Flush With No Events Sends Nothing 5005ms
Batch Format.Multiple Events Batched Together 10005ms
Error Handling.Does Not Retry On 403 12009ms
Error Handling.Does Not Retry On 413 10007ms
Error Handling.Retries On 408 14014ms

Feature_Flags Tests

16/16 tests passed

View Details
Test Status Duration
Request Payload.Request With Person Properties Device Id 9502ms
Request Payload.Flags Request Uses V2 Query Param 10007ms
Request Payload.Flags Request Hits Flags Path Not Decide 10006ms
Request Payload.Flags Request Omits Authorization Header 10007ms
Request Payload.Token In Flags Body Matches Init 10007ms
Request Payload.Groups Round Trip 10007ms
Request Payload.Groups Default To Empty Object 10006ms
Request Payload.Person Properties Distinct Id Auto Populated When Caller Omits It 10007ms
Request Payload.Disable Geoip False Propagates As Geoip Disable False 10007ms
Request Payload.Disable Geoip Omitted Defaults To False 10006ms
Request Payload.Flag Keys To Evaluate Contains Only Requested Key 10007ms
Request Lifecycle.No Flags Request On Init Alone 5004ms
Request Lifecycle.No Flags Request On Normal Capture 10507ms
Request Lifecycle.Two Flag Calls Produce Two Remote Requests 9512ms
Request Lifecycle.Mock Response Value Is Returned To Caller 10001ms
Side Effect Events.Get Feature Flag Captures Feature Flag Called Event 10509ms

@ablaszkiewicz ablaszkiewicz force-pushed the code-variables-entropy-detection branch from a245e6e to d206ea9 Compare June 23, 2026 11:51
Add a last-resort entropy-based detector that redacts high-entropy,
secret-looking values (API keys, tokens, strong passwords) sitting in
innocuously-named code variables, after the existing name-pattern and
URL-credential checks.

- Known vendor key formats (OpenAI, Anthropic, AWS, Stripe, GitHub,
  GitLab, Slack, Google, JWT, PEM private keys) are matched directly.
- Structured identifiers (UUIDs, Mongo ObjectIds, hashes), object reprs,
  file paths and URLs are never flagged.
- Exposed as the `code_variables_detect_secrets` option (default True)
  with a per-context override, threaded through client/contexts.
- Tighten the masking size caps to keep capture cost bounded.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@ablaszkiewicz ablaszkiewicz force-pushed the code-variables-entropy-detection branch from d206ea9 to 615244d Compare June 23, 2026 12:07

# Synthetic, format-correct fakes (no real credentials). Vendor keys are assembled from
# prefix + body so no complete secret literal lives in source (which trips secret scanners).
def _key(prefix, body):

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had to split secrets into two parts. Otherwise GitHub wasn't happy

@ablaszkiewicz ablaszkiewicz marked this pull request as ready for review June 23, 2026 12:08
@ablaszkiewicz ablaszkiewicz requested a review from a team as a code owner June 23, 2026 12:08
@ablaszkiewicz ablaszkiewicz requested review from a team, cat-ph and hpouillot June 23, 2026 12:08
@greptile-apps

greptile-apps Bot commented Jun 23, 2026

Copy link
Copy Markdown
Contributor

Reviews (2): Last reviewed commit: "feat: entropy-based secret detection for..." | Re-trigger Greptile

@cat-ph cat-ph left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

love this!! 🚀🚀

Comment thread posthog/exception_utils.py
An opaque object whose __repr__ returns a bare high-entropy token bypassed
detection, since _safe_repr only checked the name/keyword mask. Run
_looks_like_secret on the rendered repr too; normal reprs (with parens/brackets/
angle-brackets) are rejected by the repr-punctuation guard, so only bare-token
reprs are caught.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@marandaneto

marandaneto commented Jun 23, 2026

Copy link
Copy Markdown
Member

i've used https://github.com/Yelp/detect-secrets/tree/master/detect_secrets/plugins (see regexes) in the past, not sure if its up to date tho

Add high-confidence, distinctive-prefix patterns adapted from the gitleaks /
detect-secrets rule sets: Hugging Face, Google OAuth, DigitalOcean, Square,
Grafana, Twilio, SendGrid, Mailgun, Mailchimp, npm, PyPI, Databricks, Doppler,
Postman, Linear, Notion, Shopify, New Relic, and more GitLab/Slack variants.

The entropy gate already catches most high-entropy keys generically; the known
list mainly adds deterministic guarantees and covers low-character-class formats
(e.g. AWS AKIA) the entropy gate rejects. Runs only on the fallback path, so no
measurable perf change.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@ablaszkiewicz

Copy link
Copy Markdown
Contributor Author

Thanks @marandaneto I borrowed some stuff from them 8)

@ablaszkiewicz ablaszkiewicz merged commit 42ff4ca into main Jun 23, 2026
31 checks passed
@ablaszkiewicz ablaszkiewicz deleted the code-variables-entropy-detection branch June 23, 2026 17:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants