Add prod migration runbook for DynamoDB GSI projection changes #110

@scoropeza

Follow-up from PR #88 — surfaced during the Cedar HITL merge when adding matching_rule_ids to the TaskApprovalsTable.user_id-status-index GSI projection.

Functional description

ABCA's TaskApprovalsTable exposes a user_id-status-index Global Secondary Index (GSI) used by GET /tasks/pending to return per-user pending approvals in a single query. When a new feature needs another attribute available on that GSI (e.g. matching_rule_ids for the Cedar HITL UX), DynamoDB rejects in-place updates to the GSI's nonKeyAttributes. The only way forward is to delete the GSI and recreate it, and because our CDK construct replaces the backing table when this happens, that currently means destroy + redeploy.

Today this works fine in dev (the affected stack rebuilds cleanly), but there is no documented prod-safe procedure for adding a field to a GSI projection on a table holding live customer data. Any team taking ABCA past dev hits this the first time they evolve the schema.

User-visible impact:

  • An operator running cdk deploy on a stack with live data sees a CloudFormation rollback with "Cannot update GSI's properties other than Provisioned Throughput and Contributor Insights Specification."
  • There is no documented recovery path; the only way through (in dev) was to rename the construct id (from TaskApprovalsTable to TaskApprovalsTableV2), which forces a table replacement and drops every pending approval row.
  • A team running this in prod would either lose data or have to write the migration tooling from scratch under time pressure.
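
For the detection step, a runbook script could scan stack events for the message quoted above. The following is a minimal sketch only: the StackEventLike shape and the function name are hypothetical, and it assumes the failure surfaces as an UPDATE_FAILED resource event whose status reason contains that text.

```typescript
// Message text copied from the rollback described in this issue.
const GSI_IMMUTABLE_MESSAGE =
  "Cannot update GSI's properties other than Provisioned Throughput and Contributor Insights Specification";

// Hypothetical, minimal view of a CloudFormation stack event.
interface StackEventLike {
  logicalResourceId: string;
  resourceStatus: string;
  resourceStatusReason?: string;
}

// Returns the logical ids of resources whose update failed with the
// GSI-immutability message (assumption: the reason embeds the message verbatim).
function findGsiProjectionFailures(events: StackEventLike[]): string[] {
  return events
    .filter((e) => e.resourceStatus === "UPDATE_FAILED")
    .filter((e) => (e.resourceStatusReason ?? "").includes(GSI_IMMUTABLE_MESSAGE))
    .map((e) => e.logicalResourceId);
}
```

Matching on a substring rather than the whole reason is deliberate, since CloudFormation may wrap the service message in extra text.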

Technical context

Where the constraint lives: AWS DynamoDB API. UpdateTable accepts GSI updates only for provisioned throughput and contributor insights — the projection list is fixed at GSI creation time.
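
To make the constraint concrete, here is a small sketch of a pre-deploy check that compares the deployed and desired definition of one GSI and reports whether the change can be applied in place. GsiSpec and classifyGsiChange are hypothetical names, throughput settings are deliberately not modeled, and the key and attribute names in the usage note are illustrative assumptions.

```typescript
// Hypothetical, simplified description of a GSI (throughput is not modeled).
interface GsiSpec {
  indexName: string;
  keySchema: string[]; // partition key first, then the optional sort key
  projectionType: "ALL" | "KEYS_ONLY" | "INCLUDE";
  nonKeyAttributes: string[]; // only meaningful when projectionType is INCLUDE
}

type ChangeKind = "none" | "requires-recreate";

// True when both arrays contain the same attribute names, ignoring order.
function sameAttributes(a: string[], b: string[]): boolean {
  const setA = new Set(a);
  const setB = new Set(b);
  return a.every((x) => setB.has(x)) && b.every((x) => setA.has(x));
}

// Key schema and projection are fixed when the index is created, so any
// difference in them means the index has to be deleted and recreated.
function classifyGsiChange(deployed: GsiSpec, desired: GsiSpec): ChangeKind {
  const keysChanged = deployed.keySchema.join("|") !== desired.keySchema.join("|");
  const typeChanged = deployed.projectionType !== desired.projectionType;
  const attrsChanged =
    desired.projectionType === "INCLUDE" &&
    !sameAttributes(deployed.nonKeyAttributes, desired.nonKeyAttributes);
  return keysChanged || typeChanged || attrsChanged ? "requires-recreate" : "none";
}
```

Under these assumptions, adding matching_rule_ids to an INCLUDE projection classifies as "requires-recreate", which is the case that produced the rollback in PR #88.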

Where it bit us in PR #88:

  • cdk/src/constructs/task-approvals-table.ts: the V2 suffix on the construct id is the dev workaround; commit c149993 (initial) and the rename to V2 (during PR #88) are the audit trail.
  • cdk/test/constructs/task-approvals-table.test.ts: uses Match.arrayWith([..., 'matching_rule_ids']) to lock the projection list. This catches future projection drift but doesn't help with the migration itself.

Two viable patterns for prod:

  1. Dual-index shadow pattern: ship a new GSI alongside the old one (e.g. user_id-status-v2-index), backfill it asynchronously, switch reads via a feature flag, then remove the old GSI in a follow-up release. Zero data loss; operator visibility into progress; safe to abort. ~1 week of work to template + document.
  2. Drain-and-rebuild: pause writes via a scheduled rule, snapshot the table to S3, recreate the table+GSI, restore. Faster (~1 hour for our table sizes) but requires a write outage and a tested restore procedure.
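
For option 1, the "switch reads via a feature flag" step might look like the sketch below. It is an illustration under assumptions: chooseReadIndex and IndexState are hypothetical, and the status values mirror the IndexStatus and Backfilling fields that DescribeTable reports for a GSI. The read path falls back to the old index unless the flag is on and the new index is fully built, which is what keeps the procedure safe to abort.

```typescript
// Status values as reported for a GSI by DescribeTable.
type IndexStatus = "CREATING" | "UPDATING" | "DELETING" | "ACTIVE";

// Hypothetical snapshot of the new index's state, e.g. read from DescribeTable.
interface IndexState {
  indexName: string;
  indexStatus: IndexStatus;
  backfilling: boolean;
}

// Picks the index GET /tasks/pending should query. The old index stays the
// default; the new one is used only when the flag is on AND it is ready.
function chooseReadIndex(
  useNewIndexFlag: boolean,
  oldIndexName: string,
  newIndex: IndexState | undefined,
): string {
  if (!useNewIndexFlag || newIndex === undefined) {
    return oldIndexName;
  }
  const ready = newIndex.indexStatus === "ACTIVE" && !newIndex.backfilling;
  return ready ? newIndex.indexName : oldIndexName;
}
```

Turning the flag off returns reads to the old index immediately, so the old GSI should only be removed after the new one has served reads for a full release.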

Other ABCA tables with GSIs that may face the same evolution: TaskTable.status-index, TaskEventsTable (no GSI today but planned per design §10).

Proposed options

Recommend option 1 (dual-index shadow) as the canonical procedure, with option 2 documented as the "small table / acceptable outage" alternative. Land as:

  • A runbook under docs/operations/ (new directory) titled "Migrating DynamoDB GSI projections."
  • A reusable construct cdk/src/constructs/dual-index-table.ts that wraps the dual-index ship + backfill pattern, parameterized by table name + index name + new attribute list.
  • A worked example doing the migration the runbook describes against TaskApprovalsTable on the backgroundagent-dev stack so reviewers can see the procedure end-to-end.
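
As a starting point for discussion, the parameters of the proposed construct could be validated up front, before any CDK resources are synthesized. Everything below is a hypothetical sketch rather than the final interface: DualIndexProps and validateDualIndexProps do not exist yet, and the name pattern reflects DynamoDB's documented index-name rules (3 to 255 characters from a-z, A-Z, 0-9, underscore, dot and hyphen).

```typescript
// Hypothetical props for the proposed dual-index construct.
interface DualIndexProps {
  tableName: string;
  oldIndexName: string;
  newIndexName: string;
  newNonKeyAttributes: string[];
}

// DynamoDB index names: 3-255 characters from [a-zA-Z0-9_.-].
const INDEX_NAME_PATTERN = /^[a-zA-Z0-9_.-]{3,255}$/;

// Returns a list of human-readable problems; an empty list means the props
// look usable for shipping the new index alongside the old one.
function validateDualIndexProps(props: DualIndexProps): string[] {
  const errors: string[] = [];
  if (!INDEX_NAME_PATTERN.test(props.newIndexName)) {
    errors.push("newIndexName is not a valid DynamoDB index name");
  }
  if (props.newIndexName === props.oldIndexName) {
    errors.push("newIndexName must differ from oldIndexName");
  }
  if (props.newNonKeyAttributes.length === 0) {
    errors.push("newNonKeyAttributes must list at least one attribute");
  }
  if (new Set(props.newNonKeyAttributes).size !== props.newNonKeyAttributes.length) {
    errors.push("newNonKeyAttributes contains duplicates");
  }
  return errors;
}
```

Returning a list instead of throwing on the first problem lets an operator fix every issue in one pass.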

Acceptance criteria

  • docs/operations/dynamodb-gsi-migration.md exists and is linked from docs/design/ARCHITECTURE.md
  • At least one ABCA construct uses the documented pattern as an example (it does not need to be PR #88's table; pick the one with the lowest blast radius)
  • Runbook covers: detection (what error you see), preconditions, step-by-step (with aws CLI commands and CDK changes), rollback, and how to verify success
  • Construct unit tests pin the projection list with Match.arrayWith (already done for TaskApprovalsTable; confirm for any new construct)
  • Cross-link PR #88 and issue #X (this one) from the runbook header

Out of scope

  • Migrating the LIVE backgroundagent-dev stack; the runbook can be tested against an ephemeral stack.
  • Adding migration tooling for non-DDB resources (S3 lifecycle, Lambda env-var changes, etc.) — separate concerns.
  • Auto-migration on deploy; this should be an operator-driven procedure, not implicit.
