Skip to content
Closed

closed #2617

Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -173,9 +173,22 @@
return
}

// Dedupe LoginUserOperation for the same user — prevents RecoverFromDroppedLoginBug
// and the real login() call from both enqueuing a login op during the timing window.
// Wake the waiter as if we succeeded, since the already-queued op will do the work
// and enqueueAndWait callers (e.g. loginSuspend) would otherwise hang forever.
val op = queueItem.operation
if (op is LoginUserOperation &&
queue.any { it.operation is LoginUserOperation && it.operation.onesignalId == op.onesignalId }
) {
Logging.debug("OperationRepo: internalEnqueue - LoginUserOperation for onesignalId: ${op.onesignalId} already exists in the queue.")
queueItem.waiter?.wake(true)
return
}

if (index != null) {
queue.add(index, queueItem)
} else {

Check failure on line 191 in OneSignalSDK/onesignal/core/src/main/java/com/onesignal/core/internal/operations/impl/OperationRepo.kt

View check run for this annotation

Claude / Claude Code Review

Dedupe block: false-success signal and store leak

The new dedupe block has two issues: (1) it calls waiter.wake(true) immediately on the deduped caller before the original operation completes — if that op ultimately fails (e.g. FAIL_PAUSE_OPREPO, FAIL_UNAUTHORIZED for anonymous user), the caller silently receives a false success; (2) when loadSavedOperations() loads two persisted LoginUserOperations with the same onesignalId (state left from a pre-PR race), the dedupe drops the second op without calling _operationModelStore.remove(queueItem.ope
Comment on lines +183 to 191
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 The new dedupe block has two issues: (1) it calls waiter.wake(true) immediately on the deduped caller before the original operation completes — if that op ultimately fails (e.g. FAIL_PAUSE_OPREPO, FAIL_UNAUTHORIZED for anonymous user), the caller silently receives a false success; (2) when loadSavedOperations() loads two persisted LoginUserOperations with the same onesignalId (state left from a pre-PR race), the dedupe drops the second op without calling _operationModelStore.remove(queueItem.operation.id), leaving it orphaned in the store for the entire session until the next cold start.

Extended reasoning...

Bug 1 — False success signal on deduped waiter

When a LoginUserOperation is detected as a duplicate in internalEnqueue, the code immediately calls queueItem.waiter?.wake(true) and returns. This unblocks any enqueueAndWait caller (e.g. loginSuspend) with a success result before the already-queued operation has actually executed.

The specific failure paths that expose this: if the original queued op (op_A) returns FAIL_PAUSE_OPREPO (server 5xx, repo paused), the deduped caller already returned true from enqueueAndWait and the login state machine proceeds as if login succeeded — but the queue is now paused and all subsequent enqueued operations will never execute for this session. Similarly, FAIL_UNAUTHORIZED for an anonymous user (null externalId) causes op_A to be dropped with wake(false), while the deduped caller already got true.

The current code path that surfaces this: LoginHelper.enqueueLogin (the only direct caller of enqueueAndWait with a LoginUserOperation) checks if (!result) { Logging.warn("Could not login user") }. With the false success, this warning is never triggered even when login actually failed. Since loginSuspend returns Unit and callers don't inspect the boolean further today, the practical impact is limited to a missing log line — but the contract violation is real and future callers that act on the result would be silently misled.

The PR comment acknowledges the trade-off ("Wake the waiter as if we succeeded to avoid hanging forever"). A more precise fix would be to let op_A's actual result propagate to the deduped waiter by chaining it (e.g. storing deduped waiters on the existing queue item and waking them when op_A completes), though that requires more restructuring.

Bug 2 — Store leak when dedup fires during loadSavedOperations

loadSavedOperations() iterates persisted ops and calls internalEnqueue(..., addToStore=false). If the on-disk store already contains two LoginUserOperations with the same onesignalId (which the pre-PR race condition could produce), the dedupe check fires on the second op and returns early — before the if (addToStore) block, and crucially without ever calling _operationModelStore.remove().

Concrete step-by-step:

  1. Pre-PR code writes two LoginUserOperation(onesignalId="local-AAA") entries to the store during a race.
  2. App restarts with this PR installed. loadSavedOperations() loads them in reversed order.
  3. op_A is enqueued successfully into the in-memory queue.
  4. op_B hits the dedupe check — queue.any { ... onesignalId == "local-AAA" } is true. Returns early. _operationModelStore.remove(op_B.id) is never called.
  5. op_A executes, calls _operationModelStore.remove(op_A.id). op_B's ID remains in the store.
  6. Next cold start: op_B is the only LoginUserOperation in the store, so no dedup fires — it executes as an idempotent upsert and is finally removed.

The leak is self-healing after exactly one additional cold start, and the stale login op runs idempotently (upsert path). However, the orphaned entry persists for the entire current session.

Fix for bug 2: Inside the dedupe branch, add if (!addToStore) _operationModelStore.remove(queueItem.operation.id) before return. Since addToStore=false is exclusively used by loadSavedOperations, this correctly means 'this op came from the store; since we are dropping it from the queue, clean it from the store too.'

queue.add(queueItem)
}
}
Expand Down
Loading