Set log retention + RemovalPolicy.RETAIN on CloudWatch log groups #119

@scoropeza

Description

Follow-up from PR #88 — surfaced when discussing the log group rename / migration story for future production adopters; ABCA today has no production users so the migration scenario doesn't apply, but the underlying log-retention policy is worth setting now while it's cheap.

Functional description

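ABCA's CDK constructs currently leave CloudWatch log groups with no retention cap (NEVER) and with removal policy DESTROY. These are not uniformly the aws-cdk-lib defaults, so the audit below should confirm the effective values per construct: an explicit logs.LogGroup defaults to two-year retention and RemovalPolicy.RETAIN, while a Lambda function's implicit log group is created by the Lambda service outside the stack and never expires. Where the NEVER + DESTROY combination applies, it has two failure modes that show up only when ABCA gets used past its current reference-application stage: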

  1. cdk destroy deletes log groups along with the stack. For dev iteration, fine — short-lived stacks, throwaway logs. For a future production adopter, this means the first cdk destroy (intentional or accidental) silently deletes every log of every agent run that ever happened. Forensic data, security audit trail, "what did the agent run last week?" — all gone.

  2. No retention cap. Log groups accumulate indefinitely. CloudWatch charges per GB stored. Long-running ABCA stacks eventually pay for years of agent stdout that nobody will ever read.
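
To make the second failure mode concrete, here is a rough sketch of how stored volume grows with and without a retention cap. The function name and the workload numbers are hypothetical, and the model ignores compression and pricing, so treat it as an order-of-magnitude illustration only.

```typescript
// Volume held in a log group after `daysRunning` days, in GB.
// retentionDays === undefined models "never expire".
function storedGb(dailyIngestGb: number, daysRunning: number, retentionDays?: number): number {
  const windowDays =
    retentionDays === undefined ? daysRunning : Math.min(daysRunning, retentionDays);
  return dailyIngestGb * windowDays;
}

// Hypothetical workload: 0.5 GB of agent stdout per day, stack alive for two years.
const uncappedGb = storedGb(0.5, 730);   // keeps growing with the age of the stack
const cappedGb = storedGb(0.5, 730, 30); // flat once the stack is older than 30 days
console.log({ uncappedGb, cappedGb });
```

With a cap, the billed volume stops depending on how long the stack has been alive, which is the property this issue is after.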

Both fixes are 1-line changes per construct. Both should land before the first production adopter goes live, not after — because retrofitting changes how live log groups behave (potentially destroying data) and creates the exact "circular migration problem" we want to avoid.

ABCA is currently a reference application — there are no production adopters today, so no immediate user-visible problem. Filing this as a "ship before the first prod use" issue so it doesn't decay.

Why this is its own issue, not a code change:

  • Ship-it-now is tempting (it's small) but the audit needs to be deliberate: every LogGroup construct, every implicit log group from aws-lambda constructs, every AgentCore-managed log group needs to be considered separately. Some have policy reasons to NOT retain (e.g. ephemeral CI logs).
  • Open as an issue so the decision conversation happens publicly and the choices are documented in the issue thread for future maintainers.

Technical context

Current state (audit needed for a complete list):

  • cdk/src/constructs/task-orchestrator.ts — Lambda log group, default DESTROY + NEVER.
  • cdk/src/constructs/fanout-consumer.ts — Lambda log group, default DESTROY + NEVER.
  • cdk/src/constructs/approval-metrics-publisher-consumer.ts — Lambda log group, default DESTROY + NEVER.
  • cdk/src/handlers/* — every Lambda construct has an implicit log group.
  • cdk/src/stacks/agent.ts — AgentCore Runtime log group (managed by AgentCore service, may not be directly settable via CDK).
  • cdk/src/constructs/agent-vpc.ts — VPC Flow Logs may have their own log group.

The two changes per construct:

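// Assumed imports (aws-cdk-lib v2):
import { RemovalPolicy } from 'aws-cdk-lib';
import * as lambda from 'aws-cdk-lib/aws-lambda';
import * as logs from 'aws-cdk-lib/aws-logs';

// Before:
new lambda.Function(this, 'Fn', { ... });

// After. Recent aws-cdk-lib releases deprecate logRetention in favor of
// passing an explicit log group via the logGroup prop.
const fn = new lambda.Function(this, 'Fn', {
  logRetention: logs.RetentionDays.ONE_MONTH,  // or longer for audit-relevant logs
  ...
});

// Or for explicit log groups:
const lg = new logs.LogGroup(this, 'Lg', {
  retention: logs.RetentionDays.ONE_MONTH,
  removalPolicy: RemovalPolicy.RETAIN,
});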

Retention period choice:

  • 30 days: a practical minimum for security incident response (security reviews commonly expect at least 30 days of access logs).
  • 90 days: common compliance baseline.
  • 365 days: if the team treats agent runs as audit-relevant.

Recommend 30 days as the ABCA reference default, document in AGENTS.md / CLAUDE.md that adopters should override per their compliance requirements.
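
One possible shape for "reference default plus adopter override", sketched without any CDK types. The constant, the interface, and the construct IDs are hypothetical names for illustration, not existing ABCA code.

```typescript
// Reference default proposed in this issue.
const REFERENCE_RETENTION_DAYS = 30;

// Hypothetical per-construct overrides an adopter could supply, keyed by construct ID.
interface RetentionOverrides {
  [constructId: string]: number;
}

// Returns the adopter's override when present, otherwise the reference default.
function retentionDaysFor(constructId: string, overrides: RetentionOverrides = {}): number {
  return overrides[constructId] ?? REFERENCE_RETENTION_DAYS;
}

// Example: an adopter that treats orchestrator logs as audit-relevant.
const adopterOverrides: RetentionOverrides = { TaskOrchestrator: 365 };
console.log(retentionDaysFor('TaskOrchestrator', adopterOverrides)); // override applies
console.log(retentionDaysFor('FanoutConsumer', adopterOverrides));   // falls back to the default
```

Keeping the default in one place would also give the documentation a single value to point at.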

Removal policy choice:

  • RETAIN — log groups survive cdk destroy. Operator manually cleans up if they want zero residue.
  • DESTROY — log groups vanish with the stack. Cleanup is automatic; data is lost.

Recommend RETAIN for ABCA reference. The cost of orphans is "operator runs aws logs delete-log-group once per stack" — bounded and recoverable. The cost of unintended DESTROY is unbounded data loss.

Why this issue and not just a PR:

  • The retention period is a policy choice that should be discussed publicly.
  • Some constructs may have reasons to opt out (CI-only log groups, smoke-test scaffolding).
  • Future adopters reading the codebase should be able to find the rationale.

Proposed approach

  1. Audit phase (~30 min): identify every log group resource in cdk/src/. Output: a table in this issue thread of (construct, current state, proposed state, rationale).
  2. Decision (in-issue discussion): confirm the retention default (30/90/365 days) and the removal policy (RETAIN vs DESTROY per construct).
  3. Implementation (~30 min): make the changes per the audit table. Tests should stay green: the CDK construct tests don't pin retention today, so it is worth adding Match.objectLike({ RetentionInDays: ... }) assertions for the major constructs to prevent silent regressions.
  4. Documentation (~15 min): add a "Log retention policy" section to AGENTS.md / CLAUDE.md explaining the chosen defaults and how to override.
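
The audit phase could be partly automated by scanning a synthesized template (for example the JSON that cdk synth writes) for AWS::Logs::LogGroup resources. The sketch below is a minimal version under that assumption; the type and function names are hypothetical. It only sees log groups declared in the template, so implicit Lambda log groups and service-managed groups still need a manual pass.

```typescript
// Minimal shape of a CloudFormation template: only the fields this sketch reads.
interface CfnResource {
  Type: string;
  Properties?: Record<string, unknown>;
  DeletionPolicy?: string;
}
interface CfnTemplate {
  Resources?: Record<string, CfnResource>;
}

interface AuditRow {
  logicalId: string;
  retentionInDays: number | 'unset';
  deletionPolicy: string;
}

// One row per log group declared in the template.
function auditLogGroups(template: CfnTemplate): AuditRow[] {
  return Object.entries(template.Resources ?? {})
    .filter(([, resource]) => resource.Type === 'AWS::Logs::LogGroup')
    .map(([logicalId, resource]): AuditRow => {
      const days = resource.Properties?.RetentionInDays;
      return {
        logicalId,
        retentionInDays: typeof days === 'number' ? days : 'unset',
        // Without an explicit DeletionPolicy, CloudFormation deletes the resource with the stack.
        deletionPolicy: resource.DeletionPolicy ?? 'Delete',
      };
    });
}

// Hypothetical template fragment, standing in for real cdk synth output.
const sampleTemplate: CfnTemplate = {
  Resources: {
    PinnedLogGroup: {
      Type: 'AWS::Logs::LogGroup',
      Properties: { RetentionInDays: 30 },
      DeletionPolicy: 'Retain',
    },
    UnpinnedLogGroup: { Type: 'AWS::Logs::LogGroup' },
    SomeFunction: { Type: 'AWS::Lambda::Function' },
  },
};
console.table(auditLogGroups(sampleTemplate));
```

The resulting rows map onto the "current state" column of the audit table; the proposed state and rationale still have to be filled in by hand.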

Acceptance criteria

  • Every CDK-managed log group has an explicit retention set (none falls back to never-expire without a documented rationale)
  • Every CDK-managed log group has an explicit removalPolicy set (default RETAIN unless rationale)
  • Construct tests pin the retention value with Match.objectLike({ RetentionInDays: <value> })
  • AGENTS.md / CLAUDE.md documents the chosen defaults under an "ABCA conventions" or "Operations" section
  • If any construct opts out of RETAIN, the rationale is in a code comment
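
A sketch of what the pinning assertions could look like with aws-cdk-lib/assertions. It assumes aws-cdk-lib v2 is installed and uses a throwaway stack with a hypothetical name; in the repo the same assertions would sit inside the existing construct tests, against the real constructs.

```typescript
import { App, RemovalPolicy, Stack } from 'aws-cdk-lib';
import { Match, Template } from 'aws-cdk-lib/assertions';
import * as logs from 'aws-cdk-lib/aws-logs';

const app = new App();
const stack = new Stack(app, 'RetentionTestStack'); // hypothetical stack name

new logs.LogGroup(stack, 'Lg', {
  retention: logs.RetentionDays.ONE_MONTH,
  removalPolicy: RemovalPolicy.RETAIN,
});

const template = Template.fromStack(stack);

// Pins the retention value: throws if it is changed or removed.
template.hasResourceProperties(
  'AWS::Logs::LogGroup',
  Match.objectLike({ RetentionInDays: 30 }),
);

// The removal policy is a resource attribute rather than a property,
// so it is checked with hasResource instead of hasResourceProperties.
template.hasResource('AWS::Logs::LogGroup', {
  DeletionPolicy: 'Retain',
  UpdateReplacePolicy: 'Retain',
});
```

Note that RetentionInDays alone does not cover the removal policy, so the second assertion is what guards the RETAIN criterion.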

Out of scope

  • Migrating already-deployed stacks to the new policy (not applicable today; ABCA is a reference application).
  • Cross-region log replication.
  • Encryption-at-rest configuration on log groups (a separate security-hardening issue, if there is appetite).
  • Defining a CI sweep that auto-deletes orphaned RETAIN groups across test accounts (operational tooling; separate concern).
