feat: entropy-based secret detection for exception code variables #692
posthog-python Compliance Report
Date: 2026-06-23 16:33:49 UTC
✅ All Tests Passed! 45/45 tests passed
Capture Tests: ✅ 29/29 tests passed
Feature_Flags Tests: ✅ 16/16 tests passed
a245e6e to d206ea9
Add a last-resort entropy-based detector that redacts high-entropy, secret-looking values (API keys, tokens, strong passwords) sitting in innocuously-named code variables, after the existing name-pattern and URL-credential checks.

- Known vendor key formats (OpenAI, Anthropic, AWS, Stripe, GitHub, GitLab, Slack, Google, JWT, PEM private keys) are matched directly.
- Structured identifiers (UUIDs, Mongo ObjectIds, hashes), object reprs, file paths and URLs are never flagged.
- Exposed as the `code_variables_detect_secrets` option (default True) with a per-context override, threaded through client/contexts.
- Tighten the masking size caps to keep capture cost bounded.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
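To make the idea in this commit message concrete, here is a minimal sketch of an entropy gate with a few structural exclusions. It is not the code from this PR: the function names, the length and entropy thresholds, and the exact exclusion rules are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical thresholds, picked for illustration only (not the PR's values).
MIN_LENGTH = 20
MIN_ENTROPY_BITS = 3.5

# Structured identifiers that are high-entropy but not secrets.
UUID_RE = re.compile(r"^[0-9a-fA-F]{8}(-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}$")
HEX_RE = re.compile(r"^[0-9a-fA-F]+$")  # covers hashes and Mongo ObjectIds

# Characters that suggest a path, URL, sentence, or object repr.
STRUCTURAL_CHARS = " /\\:()[]<>"


def shannon_entropy(value: str) -> float:
    """Return the Shannon entropy of `value` in bits per character."""
    if not value:
        return 0.0
    counts = Counter(value)
    total = len(value)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def looks_like_secret(value: str) -> bool:
    """Assumed heuristic: long, unstructured, high-entropy strings are suspicious."""
    if len(value) < MIN_LENGTH:
        return False
    if UUID_RE.match(value) or HEX_RE.match(value):
        return False
    if any(ch in value for ch in STRUCTURAL_CHARS):
        return False
    return shannon_entropy(value) >= MIN_ENTROPY_BITS
```

In this sketch the cheap structural checks run before the entropy computation, so most ordinary values are rejected without counting characters at all.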
d206ea9 to 615244d
# Synthetic, format-correct fakes (no real credentials). Vendor keys are assembled from
# prefix + body so no complete secret literal lives in source (which trips secret scanners).
def _key(prefix, body):
I had to split the secrets into two parts; otherwise GitHub wasn't happy.
Reviews (2): Last reviewed commit: "feat: entropy-based secret detection for..."
An opaque object whose `__repr__` returns a bare high-entropy token bypassed detection, since `_safe_repr` only checked the name/keyword mask. Run `_looks_like_secret` on the rendered repr too; normal reprs (with parens/brackets/angle-brackets) are rejected by the repr-punctuation guard, so only bare-token reprs are caught.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
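The bypass and the fix described in this commit can be illustrated with a small sketch. Everything here is a stand-in: `render`, `toy_looks_like_secret`, `OpaqueToken`, and the `REDACTED` placeholder are hypothetical names, and the toy detector only approximates the real `_looks_like_secret`, whose signature is not shown in this conversation.

```python
REPR_PUNCTUATION = set("()[]{}<>")
REDACTED = "<redacted>"  # hypothetical placeholder string


class OpaqueToken:
    """An object whose repr is a bare token, with no class name or brackets."""

    def __init__(self, raw):
        self._raw = raw

    def __repr__(self):
        return self._raw


def toy_looks_like_secret(text):
    # Repr-punctuation guard: ordinary reprs such as "[1, 2]" or
    # "<Foo object at 0x...>" are never treated as secrets.
    if any(ch in REPR_PUNCTUATION for ch in text):
        return False
    # Stand-in for the entropy gate: long and mixes three character classes.
    return (
        len(text) >= 20
        and any(c.islower() for c in text)
        and any(c.isupper() for c in text)
        and any(c.isdigit() for c in text)
    )


def render(value):
    # Check the rendered repr itself, not only the variable name.
    rendered = repr(value)
    return REDACTED if toy_looks_like_secret(rendered) else rendered
```

With this shape, `render([1, 2, 3])` and `render(object())` pass through untouched because of the punctuation guard, while an `OpaqueToken` wrapping a token-like string is redacted.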
I've used https://github.com/Yelp/detect-secrets/tree/master/detect_secrets/plugins (see the regexes) in the past; not sure if it's up to date, though.
Add high-confidence, distinctive-prefix patterns adapted from the gitleaks / detect-secrets rule sets: Hugging Face, Google OAuth, DigitalOcean, Square, Grafana, Twilio, SendGrid, Mailgun, Mailchimp, npm, PyPI, Databricks, Doppler, Postman, Linear, Notion, Shopify, New Relic, and more GitLab/Slack variants.

The entropy gate already catches most high-entropy keys generically; the known list mainly adds deterministic guarantees and covers low-character-class formats (e.g. AWS AKIA) the entropy gate rejects. Runs only on the fallback path, so no measurable perf change.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
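A distinctive-prefix check of this kind might be sketched as follows. The two patterns are simplified versions of widely published rules for AWS access key IDs and GitHub personal access tokens; the dictionary name, the function name, and the choice of `fullmatch` are assumptions, and the PR's actual pattern list is larger.

```python
import re

# Simplified, illustrative patterns (not the PR's actual list).
KNOWN_PATTERNS = {
    "aws_access_key_id": re.compile(r"AKIA[0-9A-Z]{16}"),
    "github_pat": re.compile(r"ghp_[0-9A-Za-z]{36}"),
}


def match_known_format(value):
    """Return the name of the first matching vendor format, or None."""
    for name, pattern in KNOWN_PATTERNS.items():
        if pattern.fullmatch(value):
            return name
    return None
```

An AKIA-style key uses only uppercase letters and digits, which is the kind of low-character-class format the commit message says the entropy gate rejects, so a deterministic prefix match is what catches it.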
Thanks @marandaneto, I borrowed some stuff from them 8)
Currently we detect secrets based on keys and values containing common secret phrases like `api-key`, `password`, etc.

@hpouillot had a great idea: entropy-based secret detection. This PR implements it and extends our detection with popular secret formats like `sk-ant`, `ey...`, `gh_...`, `glpat_...`.

There is a problem, however: sometimes high-entropy strings that are not secrets get classified as secrets. We do our best to recognize genuine high-entropy non-secrets and stop the entropy detection process for them, for example structured identifiers (UUIDs, Mongo ObjectIds, hashes), object reprs, file paths and URLs.

There are many more cases and rules; they're in the PR.

I also tightened our search limits based on real-world exceptions.

I ran a benchmark on 3 real exceptions. It measures how long it takes to capture variables from frames.