Fix poison pill deadlock: Make retry heap unbounded by the-mann · Pull Request #2014 · aws/amazon-cloudwatch-agent

the-mann · 2026-02-11T21:00:46Z

Summary

Fix deadlock when failing log groups exceed concurrency limit by making the retry heap unbounded.

Problem

With bounded retry heap (size = concurrency):

Retry heap size = concurrency (e.g., 2)
When failing log groups (10) > heap size (2)
Heap fills with failed batches
Workers block trying to push more failed batches
System deadlocks - no progress possible
Allowed log groups get starved

Solution

Make retry heap unbounded:

Remove maxSize constraint and semaphore
Push() is now non-blocking
Failed batches queue up without blocking workers
Allowed log groups continue publishing normally

Changes

Remove maxSize and semaphore from retryHeap struct
Make Push() non-blocking (no semaphore wait)
Remove semaphore release from PopReady()
Replace sync.Cond halt/resume with channel-based approach + mutex to prevent shutdown deadlock
Add TestQueueStopWhileHalted to verify no shutdown deadlock
Add state callback tests for retry, expiry, and shutdown scenarios
Add poison_pill_test.go with comprehensive poison pill scenario tests
Clean up test assertions and remove redundant tests

Test Results

Before Fix: Test deadlocked after 30s with all goroutines blocked

After Fix: ✅ Test PASSES

Allowed success=5 (all batches published)
Denied attempts=110
Heap size=28 (grew beyond concurrency limit of 2)
No deadlock or starvation

Integration Test

Integration test run: https://github.com/aws/amazon-cloudwatch-agent/actions/runs/21973935546

New integration test added in amazon-cloudwatch-agent-test:

Test directory: test/cloudwatchlogs_concurrency
Config: concurrency=2, force_flush_interval=5s
Validates: 1 allowed + 10 denied log groups
Expected: Allowed log group continues publishing despite denied groups

Manual Memory Test

1. Memory Stabilizes, Not Growing Unbounded ✅

Initial spike: +1.6 MB in first 3 minutes (161s)
Stabilization: Memory drops to +1.1 MB and remains stable
Final 2 minutes: No growth despite 5,000 more events

2. Garbage Collection Active

Memory actually decreased from 125,244 KB (161s) to 124,816 KB (247s), showing Go's garbage collector is reclaiming memory from processed batches.

3. Retry Heap Bounded by Target Count

With 10 denied log groups:

Each has 1 queue → max 1 batch in retry heap per target
Circuit breaker halts queues after failure → prevents new batch creation
Retry heap stabilizes at ~10 batches (one per denied log group)
Memory growth is from batch metadata, not unbounded accumulation

4. No Memory Leak

0.9% growth over 5 minutes with 29,000 events
Stable memory in final 2 minutes despite continued event ingestion
Well below threshold: <1% growth vs 50% leak threshold

Conclusion

✅ No memory leak: 0.9% growth over 5 minutes with 29,000 events
✅ Memory stabilizes: Peaks at 1.6 MB, drops to 1.1 MB, remains stable
✅ Garbage collection works: Memory decreases after peak
✅ Retry heap bounded: Limited by number of failing targets (~10 batches)
✅ Production ready: Safe for long-running deployments with persistent failures

Related PRs

fix(pusher): Prevent batch expiration recalculation on retry #1988 - Original PR (DRAFT, targeting sender-block-on-failure)
Fix poison pill bug: Make retry heap unbounded #2012 - Duplicate targeting sender-block-on-failure
feat(pusher): Add concurrent log publishing with circuit breaker and retry heap #2023 - Draft PR targeting main (full feature set)

…ock-on-failure # Conflicts: # plugins/outputs/cloudwatchlogs/cloudwatchlogs.go # plugins/outputs/cloudwatchlogs/internal/pusher/batch.go # plugins/outputs/cloudwatchlogs/internal/pusher/batch_test.go # plugins/outputs/cloudwatchlogs/internal/pusher/pool_test.go # plugins/outputs/cloudwatchlogs/internal/pusher/pusher.go # plugins/outputs/cloudwatchlogs/internal/pusher/pusher_test.go # plugins/outputs/cloudwatchlogs/internal/pusher/queue_test.go # plugins/outputs/cloudwatchlogs/internal/pusher/retryheap.go # plugins/outputs/cloudwatchlogs/internal/pusher/retryheap_test.go # plugins/outputs/cloudwatchlogs/internal/pusher/sender.go # plugins/outputs/cloudwatchlogs/internal/pusher/sender_test.go

…nder-block-on-failure # Conflicts: # plugins/outputs/cloudwatchlogs/internal/pusher/pool_test.go # plugins/outputs/cloudwatchlogs/internal/pusher/retryheap_test.go

- Add mutex protection to Stop() method to prevent race conditions - Add stopped flag checks in Push() to prevent pushing after Stop() - Ensure Push() checks stopped flag both before and after acquiring semaphore - Fix TestRetryHeapStopTwice to verify correct behavior

- Add TestRetryHeapProcessorExpiredBatchShouldResume to demonstrate bug - When a batch expires after 14 days, RetryHeapProcessor calls updateState() but not done(), leaving circuit breaker permanently closed - Target remains blocked forever even though bad batch was dropped - Test currently fails, demonstrating the bug from PR comment

Verifies that startTime and expireAfter are only set once on first call and remain unchanged on subsequent calls, ensuring the 14-day expiration is measured from the first send attempt, not from each retry.

Concurrency is now determined by whether workerPool and retryHeap are provided, making the explicit concurrency parameter redundant. 🤖 Assisted by AI

Add comprehensive recovery tests validating: 1. Permission granted during retry - system recovers and publishes logs 2. System restart during retry - resumes correctly with preserved metadata 3. Multiple targets - healthy targets unaffected by failing target Tests validate circuit breaker behavior, retry heap functionality, and proper isolation between targets during permission failures. Addresses CWQS-3192 (P1 requirement) 🤖 Assisted by AI

Add test_os_filter and test_dir_filter inputs to allow running specific tests on specific OS platforms. Filters use jq to filter generated test matrices before execution. Usage: -f test_os_filter=al2023 (run only on al2023) -f test_dir_filter=./test/cloudwatchlogs (run only cloudwatchlogs) When filters are omitted, all tests run (default behavior).

…THUB_OUTPUT errors

…default' into sender-block-on-failure

…move_maxretryduration # Conflicts: # .github/workflows/test-artifacts.yml

plugins/outputs/cloudwatchlogs/internal/pusher/queue_test.go

plugins/outputs/cloudwatchlogs/internal/pusher/retryheap.go

plugins/outputs/cloudwatchlogs/internal/pusher/pool_test.go

plugins/outputs/cloudwatchlogs/internal/pusher/queue.go

plugins/outputs/cloudwatchlogs/internal/pusher/retryheap.go

plugins/outputs/cloudwatchlogs/internal/pusher/queue_test.go

plugins/outputs/cloudwatchlogs/internal/pusher/circuitbreaker_test.go

plugins/outputs/cloudwatchlogs/internal/pusher/retryheap_expiry_test.go

The retry heap is now unbounded, so maxSize is no longer used. 🤖 Assisted by AI

batch.done() already calls updateState() internally, so the explicit call is unnecessary. 🤖 Assisted by AI

Test had no assertions and was not validating any behavior. 🤖 Assisted by AI

🤖 Assisted by AI

Variable was set but never checked in the test. 🤖 Assisted by AI

Circuit breaker should always block after exactly 1 send attempt, not "at most 1". 🤖 Assisted by AI

The dummyBatch was not connected to the queue's circuit breaker, so calling done() on it had no effect. Simplified test to only verify halt behavior. 🤖 Assisted by AI

- Replace sync.Cond with channel-based halt/resume to prevent shutdown deadlock (waitIfHalted now selects on haltCh and stopCh) - Add mutex to halt/resume/waitIfHalted for thread safety - Add TestQueueStopWhileHalted to verify no shutdown deadlock - Add TestQueueHaltResume with proper resume assertions - Clean up verbose test comments and weak assertions - Remove orphaned TestQueueResumeOnBatchExpiry comment 🤖 Assisted by AI

Verify state file management during retry, expiry, and shutdown: - Successful retry persists file offsets via state callbacks - Expired batch (14d) still persists offsets to prevent re-read - Clean shutdown does not persist state for unprocessed batches 🤖 Assisted by AI

- Fix TestRetryHeapProcessorSendsBatch: add events to batch, verify PutLogEvents is called and done callback fires (was testing empty batch) - Fix TestRetryHeapProcessorExpiredBatch: set expireAfter field so isExpired() actually returns true, verify done() is called - Fix race in TestRetryHeapProcessorSendsBatch: use atomic.Bool - Reduce TestRetryHeap_UnboundedPush sleep from 3s to 100ms 🤖 Assisted by AI

…adlock

…Groups TestPoisonPillScenario already covers the same scenario (10 denied + 1 allowed with low concurrency). The bounded heap no longer exists so the 'smaller than' framing is no longer meaningful. 🤖 Assisted by AI

🤖 Assisted by AI

jefchien

I think the tests could be improved, but functionally, it looks good to me.

jefchien · 2026-02-14T00:27:18Z

plugins/outputs/cloudwatchlogs/internal/pusher/queue_test.go

+	// Trigger resume by calling the success callback directly
+	queueImpl.resume()


nit: Comment isn't quite right, but it accomplishes the same thing.

jefchien · 2026-02-14T00:33:07Z

plugins/outputs/cloudwatchlogs/internal/pusher/queue_test.go

+	// Add second event - should be queued but not sent due to halt
+	q.AddEvent(newStubLogEvent("second message", time.Now()))
+
+	// Verify only one send happened (queue is halted)
+	assert.Equal(t, int32(1), sendCount.Load(), "Should have only one send due to halt")


This doesn't wait long enough for the second batch to have been created and attempted to flush, so even without the halting logic this could be true. The flush interval is 10ms, so we should wait at least that amount of time before checking the sendCount and resuming.

jefchien · 2026-02-14T00:39:58Z

plugins/outputs/cloudwatchlogs/internal/pusher/retryheap_recovery_test.go

+func stringPtr(s string) *string {
+	return &s
+}
+
+func int64Ptr(i int64) *int64 {
+	return &i
+}


nit: If we need these functions, we typically use the functions provided by AWS SDK (e.g. aws.String("string")). Can see them in other tests/code in the repo.

jefchien · 2026-02-14T00:41:44Z

plugins/outputs/cloudwatchlogs/internal/pusher/retryheap.go

@@ -139,7 +134,7 @@ type RetryHeapProcessor struct {
 func NewRetryHeapProcessor(retryHeap RetryHeap, workerPool WorkerPool, service cloudWatchLogsService, targetManager TargetManager, logger telegraf.Logger, maxRetryDuration time.Duration, retryer *retryer.LogThrottleRetryer) *RetryHeapProcessor {


maxRetryDuration was removed everywhere else. This is no longer used.

jefchien · 2026-02-14T00:50:53Z

plugins/outputs/cloudwatchlogs/internal/pusher/retryheap_recovery_test.go

+
+// TestRecoveryAfterSystemRestart validates that when the system restarts with
+// retry ongoing, it resumes correctly by loading state and continuing retries.
+func TestRecoveryAfterSystemRestart(t *testing.T) {


I'm not seeing the restart or anything to do with the state file.

jefchien · 2026-02-14T00:53:57Z

plugins/outputs/cloudwatchlogs/internal/pusher/poison_pill_test.go

+	// Process batches continuously
+	processorDone := make(chan struct{})
+	go func() {
+		ticker := time.NewTicker(20 * time.Millisecond)
+		defer ticker.Stop()
+		for {
+			select {
+			case <-processorDone:
+				return
+			case <-ticker.C:
+				processor.processReadyMessages()
+			}
+		}
+	}()


nit: Isn't this what the retry heap processor is for? Is there a reason we aren't using processor.Start()?

jefchien · 2026-02-14T00:55:12Z

plugins/outputs/cloudwatchlogs/internal/pusher/poison_pill_test.go

+// TestPoisonPillScenario validates that when 10 denied + 1 allowed log groups
+// share a worker pool with concurrency=2, the allowed log group continues
+// publishing without being starved by failed retries.
+func TestPoisonPillScenario(t *testing.T) {


So this only tests the retry heap? Is that intentional?

jefchien · 2026-02-14T01:01:51Z

plugins/outputs/cloudwatchlogs/internal/pusher/retryheap_recovery_test.go

+
+// TestRecoveryWithMultipleTargets validates that when one target has permission
+// issues, other healthy targets continue publishing successfully.
+func TestRecoveryWithMultipleTargets(t *testing.T) {


What is this testing differently than TestSingleDeniedLogGroup? Don't see which part of this is recovery.

jefchien · 2026-02-14T01:03:32Z

plugins/outputs/cloudwatchlogs/internal/pusher/state_callback_test.go

+
+// TestRetryHeapSuccessCallsStateCallback verifies that when a batch succeeds
+// on retry through the heap, state callbacks fire to persist file offsets.
+func TestRetryHeapSuccessCallsStateCallback(t *testing.T) {


This one and TestRetryHeapProcessorSendsBatch are pretty similar and a bit redundant.

jefchien · 2026-02-14T01:05:05Z

plugins/outputs/cloudwatchlogs/internal/pusher/retryheap_expiry_test.go

+
+// TestRetryHeapProcessorExpiredBatchShouldResume verifies that expired batches
+// resume the circuit breaker, preventing the target from being permanently blocked.
+func TestRetryHeapProcessorExpiredBatchShouldResume(t *testing.T) {


This, TestRetryHeapProcessorExpiredBatch, and TestRetryHeapExpiryCallsStateCallback are pretty similar as well. Could have all of the assertions in a single test. Would help improve maintainability and test run time.

agarakan and others added 30 commits December 30, 2025 05:00

introduce retry metadata to batch struct

6b231dc

Remove unused reset method

8521373

add unit tests for retryMetadata

7244af8

fix lint

d2f21e1

Introduce retryHeap and retryHeapProcessor

66186a7

Exchange pushch for semaphor to enformce heap size and blocking

83224b4

Add conditional logic to sender to call batch.Fail() during concurrency

7cfc794

Add unit tests

b4ffd7a

Instantiate RetryHeap and RetryHeapProcessor if concurrency enabled

0e4b0bc

Add unit tests for retryheap instantiation

9c1332a

Update sender to reference retryHeap to call push on fail

dddb691

Add unit tests for sender logic

02bc5c6

Implement halt on target logic

ef7d627

lint

309f904

lint

d9296a6

Merge remote-tracking branch 'origin/sender-block-on-failure' into se…

a5621fc

…nder-block-on-failure # Conflicts: # plugins/outputs/cloudwatchlogs/internal/pusher/pool_test.go # plugins/outputs/cloudwatchlogs/internal/pusher/retryheap_test.go

fix tests

7051a0c

lx

de410f1

Remove configurable maxRetryTimeout in favor of default hardcoded value

f4c7620

Update tests for removed retryDuration parameter

c791abd

Add test for initializeStartTime idempotency

4d798e2

Verifies that startTime and expireAfter are only set once on first call and remain unchanged on subsequent calls, ensuring the 14-day expiration is measured from the first send attempt, not from each retry.

refactor(pusher): Remove unused concurrency parameter from NewPusher

cdf1651

Concurrency is now determined by whether workerPool and retryHeap are provided, making the explicit concurrency parameter redundant. 🤖 Assisted by AI

Merge workflow filtering from sender-block-on-failure

a47cc27

fix: Use compact JSON output in apply_filters to prevent multiline GI…

8e20ddf

…THUB_OUTPUT errors

the-mann added 6 commits February 11, 2026 15:25

Merge remote-tracking branch 'origin/main' into sender-block-on-failure

78c947b

Remove test filtering feature (moved to separate PR)

60b6f49

Trigger PR diff refresh

1b1973b

Merge remote-tracking branch 'origin/enable-multithreaded-logging-by-…

8a4960f

…default' into sender-block-on-failure

Merge remote-tracking branch 'origin/sender-block-on-failure' into re…

bd7147b

…move_maxretryduration # Conflicts: # .github/workflows/test-artifacts.yml

Revert gitignore changes (remove agent-sops)

a814482

the-mann requested a review from a team as a code owner February 11, 2026 21:00

jefchien reviewed Feb 12, 2026

View reviewed changes

the-mann added 18 commits February 12, 2026 10:23

Merge main into fix/poison-pill-deadlock

5e215da

refactor(pusher): Remove unused maxSize parameter from NewRetryHeap

fd2ea56

The retry heap is now unbounded, so maxSize is no longer used. 🤖 Assisted by AI

fix(pusher): Remove redundant updateState call in retryheap

38afc5f

batch.done() already calls updateState() internally, so the explicit call is unnecessary. 🤖 Assisted by AI

test(pusher): Remove empty TestSenderPoolRetryHeap test

3e1ed82

Test had no assertions and was not validating any behavior. 🤖 Assisted by AI

docs(pusher): Clean up verbose test comment in queue_test.go

2d07b38

🤖 Assisted by AI

docs(pusher): Clean up verbose test comment in retryheap_expiry_test.go

c739bb9

🤖 Assisted by AI

test(pusher): Remove unused circuitBreakerHalted variable

4e7393a

Variable was set but never checked in the test. 🤖 Assisted by AI

test(pusher): Use exact assertion for circuit breaker send count

f640a19

Circuit breaker should always block after exactly 1 send attempt, not "at most 1". 🤖 Assisted by AI

test(pusher): Remove ineffective dummyBatch code in TestQueueHaltResume

eb58912

The dummyBatch was not connected to the queue's circuit breaker, so calling done() on it had no effect. Simplified test to only verify halt behavior. 🤖 Assisted by AI

Merge main into fix/poison-pill-deadlock

6f941fe

Merge enable-multithreaded-logging-by-default into fix/poison-pill-de…

e59274a

…adlock

docs(pusher): Remove internal ticket references from test comments

334acdf

🤖 Assisted by AI

refactor(pusher): Simplify fail callback to direct method reference

b6f3b3e

🤖 Assisted by AI

style(pusher): Fix unused parameter lint warnings

98bdc89

🤖 Assisted by AI

the-mann force-pushed the fix/poison-pill-deadlock branch 2 times, most recently from c3d5e69 to 98bdc89 Compare February 13, 2026 17:10

jefchien approved these changes Feb 14, 2026

View reviewed changes

the-mann force-pushed the fix/poison-pill-deadlock branch from a45d9be to 98bdc89 Compare February 19, 2026 18:24

		// Trigger resume by calling the success callback directly
		queueImpl.resume()

		@@ -139,7 +134,7 @@ type RetryHeapProcessor struct {
		func NewRetryHeapProcessor(retryHeap RetryHeap, workerPool WorkerPool, service cloudWatchLogsService, targetManager TargetManager, logger telegraf.Logger, maxRetryDuration time.Duration, retryer retryer.LogThrottleRetryer) RetryHeapProcessor {

Conversation

the-mann commented Feb 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Problem

Solution

Changes

Test Results

Integration Test

Manual Memory Test

1. Memory Stabilizes, Not Growing Unbounded ✅

2. Garbage Collection Active

3. Retry Heap Bounded by Target Count

4. No Memory Leak

Conclusion

Related PRs

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

jefchien left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Comments

the-mann commented Feb 11, 2026 •

edited

Loading