Add AWS Comprehend PII redactor for JS SDK #562
CLA Assistant Lite bot: All contributors have signed the CLA ✍️ ✅
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: be6959bcce
    redactedText.slice(0, entity.BeginOffset) +
    replacement +
    redactedText.slice(entity.EndOffset)
Convert Comprehend code-point offsets before slicing
When the prompt contains supplementary Unicode characters before the PII span, this slices at the wrong positions because Comprehend reports offsets in UTF-8 code points while JavaScript slice and text.length use UTF-16 code units. For example, an emoji earlier in the text shifts every following AWS offset by one code unit, so the redactor can leave part of an email/phone number unredacted or remove adjacent non-PII text. Convert code-point offsets to JS string indices before validating and slicing.
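A minimal sketch of the conversion the review asks for, assuming the offsets count Unicode code points as described above. The helper names `codePointOffsetToUtf16Index` and `redactSpan` are hypothetical and are not taken from the PR; the actual fix may be structured differently.

```typescript
// Hypothetical helper for illustration; not the PR's implementation.
// Walks the string once, counting code points, and returns the matching
// UTF-16 code unit index that String.prototype.slice expects.
function codePointOffsetToUtf16Index(text: string, codePointOffset: number): number {
  let index = 0; // position in UTF-16 code units
  let seen = 0; // code points consumed so far
  while (seen < codePointOffset && index < text.length) {
    const unit = text.charCodeAt(index);
    const next = index + 1 < text.length ? text.charCodeAt(index + 1) : 0;
    // A high surrogate followed by a low surrogate encodes one code point.
    const isSurrogatePair =
      unit >= 0xd800 && unit <= 0xdbff && next >= 0xdc00 && next <= 0xdfff;
    index += isSurrogatePair ? 2 : 1;
    seen += 1;
  }
  return index;
}

// Hypothetical wrapper: converts both offsets before slicing.
function redactSpan(text: string, begin: number, end: number, replacement: string): string {
  const start = codePointOffsetToUtf16Index(text, begin);
  const stop = codePointOffsetToUtf16Index(text, end);
  return text.slice(0, start) + replacement + text.slice(stop);
}

// The email starts at code point 7 but at UTF-16 index 8 because of the emoji.
const sample = "😀 mail bob@example.com";
const redacted = redactSpan(sample, 7, 22, "[EMAIL]");
console.log(redacted); // prints "😀 mail [EMAIL]"
```

Slicing `sample` with the raw offsets 7 and 22 instead would yield `😀 mail[EMAIL]m`: the space before the address is removed and the final `m` is left unredacted.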
I have read the Arakoo CLA Document and I hereby sign the CLA
recheck
Fixed the Unicode offset issue in dd70025 by converting Comprehend code point offsets to JavaScript UTF-16 code unit indices before slicing. Added an emoji regression test. Local verification: awsComprehendRedactor.test.ts now passes 6/6 tests, and npx tsc -p tsconfig.json --noEmit passes.
Implements the AWS Comprehend PII redaction utility requested in #290 for the JavaScript SDK.
/claim #290
What changed
- AwsComprehendRedactor under the AI package.
- DetectPiiEntities to detect PII spans.
- @arakoodev/edgechains.js/ai.
- examples/aws-comprehend-redaction.
Verification
Results locally:
- awsComprehendRedactor.test.ts: 6 tests passed.
- tsc --noEmit: passed.
- tsc -b: passed.
- js: passed.
- CLAAssistant: passed.
- npm run build: not run successfully on Windows because the existing script uses Unix rm -rf.
Note
The real AWS call requires AWS credentials and region configured in the environment. Tests mock the Comprehend client so CI does not need AWS credentials.
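One way to keep tests free of AWS credentials is to put the detection call behind a small interface and pass a fake in tests. The sketch below assumes that approach; `PiiDetector`, `applyRedactions`, `redactWith`, and `fakeDetector` are hypothetical names, not the PR's API. Only the `Type`, `BeginOffset`, and `EndOffset` field names mirror the entity fields discussed in the review.

```typescript
// Hypothetical shapes for illustration; not the PR's actual types.
interface PiiEntity {
  Type: string;
  BeginOffset: number; // assumed to count code points
  EndOffset: number;
}

interface PiiDetector {
  detect(text: string): Promise<PiiEntity[]>;
}

// Pure function: applies replacements on an array of code points, so the
// offsets line up even when the text contains emoji.
function applyRedactions(text: string, entities: PiiEntity[]): string {
  const chars = Array.from(text); // one element per code point
  // Work from the end so earlier offsets are not shifted by replacements.
  const sorted = [...entities].sort((a, b) => b.BeginOffset - a.BeginOffset);
  for (const e of sorted) {
    chars.splice(e.BeginOffset, e.EndOffset - e.BeginOffset, `[${e.Type}]`);
  }
  return chars.join("");
}

async function redactWith(detector: PiiDetector, text: string): Promise<string> {
  return applyRedactions(text, await detector.detect(text));
}

// Fake detector standing in for the real service call in tests.
const fakeDetector: PiiDetector = {
  async detect(_text: string): Promise<PiiEntity[]> {
    return [{ Type: "EMAIL", BeginOffset: 9, EndOffset: 24 }];
  },
};

redactWith(fakeDetector, "Contact: bob@example.com today").then((out) =>
  console.log(out) // prints "Contact: [EMAIL] today"
);
```

In production the same interface could wrap the real Comprehend client, which is where credentials and region would be needed.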