Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -50,7 +50,11 @@
private val ioDispatcher: CoroutineDispatcher = Dispatchers.IO,
) : IOneSignal, IServiceProvider {

private val suspendCompletion = CompletableDeferred<Unit>()
// Reset every time the synchronized(initLock) block flips state to IN_PROGRESS so that
// a retry-after-FAILED gets a fresh latch instead of an already-completed one. Mutated only
// under initLock; reads outside that lock must local-capture before suspending on it.
@Volatile
private var suspendCompletion = CompletableDeferred<Unit>()

@Volatile
private var initState: InitState = InitState.NOT_STARTED
Expand Down Expand Up @@ -302,12 +306,15 @@
if (initState.isSDKAccessible()) {
Logging.log(LogLevel.DEBUG, "initWithContext: SDK already initialized or in progress")
return true
}

initState = InitState.IN_PROGRESS
// Fresh latch for this init attempt -- prevents retry-after-FAILED waiters from
// observing the prior init's already-completed deferred.
suspendCompletion = CompletableDeferred()
}

// FeatureManager depends on ConfigModelStore/PreferencesService which requires appContext.

Check failure on line 317 in OneSignalSDK/onesignal/core/src/main/java/com/onesignal/internal/OneSignalImp.kt

View check run for this annotation

Claude / Claude Code Review

Sync initWithContext leaks IN_PROGRESS state and uncompleted latch on pre-internalInit throw

Sync `initWithContext(context, appId)` resets `suspendCompletion` and sets `initState=IN_PROGRESS` inside the synchronized block (lines 305-315), then calls `ensureApplicationServiceStarted(context)` at line 319 outside any try/catch — the new try/catch only wraps `internalInit`'s body. If that call throws (e.g. `ApplicationService.start` line 81's `context.applicationContext as Application` cast on a non-Application context), `initState` is leaked at `IN_PROGRESS` and the just-reset latch is ne
Comment on lines 309 to 317
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Sync initWithContext(context, appId) resets suspendCompletion and sets initState=IN_PROGRESS inside the synchronized block (lines 305-315), then calls ensureApplicationServiceStarted(context) at line 319 outside any try/catch — the new try/catch only wraps internalInit's body. If that call throws (e.g. ApplicationService.start line 81's context.applicationContext as Application cast on a non-Application context), initState is leaked at IN_PROGRESS and the just-reset latch is never completed, causing every re-entrant suspend caller (initWithContextSuspend, waitUntilInitInternal) to hang on await() indefinitely — the exact inverted failure mode this PR's internalInit catch block was designed to prevent. Fix by mirroring the same try/catch around lines 317-332 (or moving ensureApplicationServiceStarted inside internalInit's protected body).

Extended reasoning...

What the bug is

The PR added a try/catch around internalInit's body (with a comment explicitly naming the hang failure mode it prevents) and a fresh-latch reset under synchronized(initLock) at the start of every init attempt. But on the sync initWithContext(context, appId) entry path, there is unprotected code between the latch reset and the internalInit call. That gap re-introduces the very hang the catch block was designed to prevent.

Code path

OneSignalImp.kt lines 297-333:

override fun initWithContext(context: Context, appId: String): Boolean {
    synchronized(initLock) {
        if (initState.isSDKAccessible()) { return true }
        initState = InitState.IN_PROGRESS
        suspendCompletion = CompletableDeferred()  // L1: fresh latch
    }
    ensureApplicationServiceStarted(context)       // <-- line 319, OUTSIDE try/catch
    if (isBackgroundThreadingEnabled) {
        suspendifyOnIO { internalInit(context, appId) }  // catch lives inside internalInit
        return true
    }
    return runBlocking(runtimeIoDispatcher) { internalInit(context, appId) }
}

ensureApplicationServiceStarted reaches ApplicationService.start(context) (ApplicationService.kt line 78), which does:

val application = context.applicationContext as Application  // line 81

That cast throws ClassCastException for any Context whose applicationContext is not an Application instance — uncommon in production but realistic for hosts initializing from ContentProvider contexts (some apps do this for pre-onCreate init), test/instrumentation harnesses that wrap or mock Context, or restricted/launcher contexts that return a ContextWrapper.

Why existing code does not prevent it

The PR's new try { ... } catch (t: Throwable) { ... initState = FAILED; notifyInitComplete() ... } is scoped to internalInit's body (OneSignalImp.kt lines ~360-405). It does not wrap the lines between the synchronized block and the internalInit invocation in the sync entry path. The PR comment on that catch even enumerates the exact symptom — "would otherwise leave initState at IN_PROGRESS forever and suspendCompletion uncompleted -- causing re-entrant suspend callers (e.g. SyncJobService) to hang indefinitely on await()" — but the protection is one stack frame too narrow.

isBackgroundThreadingEnabled itself is already defensively wrapped (catches Throwable from the feature-manager lookup), so the only realistic throw point in the unprotected window is ensureApplicationServiceStarted.

Step-by-step proof

  1. Host MainApplication.onCreate() calls OneSignal.initWithContext(weirdContext, "appId") where weirdContext.applicationContext is a ContextWrapper (not an Application).
  2. Synchronized block (lines 305-315): initState = IN_PROGRESS, suspendCompletion = L1 (fresh).
  3. Line 319: ensureApplicationServiceStarted(weirdContext)(applicationService as ApplicationService).start(weirdContext)val application = weirdContext.applicationContext as Application throws ClassCastException.
  4. Throw escapes initWithContext. SDK state: initState=IN_PROGRESS, suspendCompletion=L1 uncompleted, applicationServiceStarted=false. Host catches the throw (or the process keeps running because Application onCreate exceptions are sometimes survivable, or a persisted JobService entry outlives the throwing process invocation).
  5. Later, SyncJobService.onStartJob fires and calls OneSignal.initWithContext(this) (suspend, no-appId) → initWithContextSuspend(ctx, null).
  6. Synchronized block at lines 678-693: isSDKAccessible() is true (IN_PROGRESS counts). shouldRunInit=false, captures completionToAwait = suspendCompletion = L1.
  7. Line 703: completionToAwait!!.await()blocks forever on L1, which no code path will ever complete because no internalInit was invoked for this generation.
  8. SyncJobService coroutine hangs until the OS kills the worker, holding its JobService budget slot — the same hang failure mode the PR's catch block was supposed to prevent.

The same hang affects waitUntilInitInternal (line 533): any subsequent accessor call (os.user, os.notifications, etc.) under SDK_BACKGROUND_THREADING blocks forever in runBlocking { waitUntilInitInternal() } -> await() on L1.

Why this is a regression introduced by this PR

  • Pre-PR: suspendCompletion was a single-shot val; initWithContextSuspend did not await() on it. A re-entrant suspend caller observing isSDKAccessible() == true early-returned true without awaiting (the SDK-4475 NPE this PR fixes). A leaked latch was harmless because nothing awaited.
  • Post-PR: re-entrant suspend callers now await() on the freshly-reset latch (line 703), and waitUntilInitInternal does the same (line 533). A leaked latch from a pre-internalInit throw turns the prior NPE-on-stale-state into an indefinite hang. The bug class shifts from "returns too early" to "never returns" — exactly the inversion the PR's internalInit catch was designed to prevent.

Impact

Trigger probability is narrow (most hosts pass an Activity / Application / Service context whose applicationContext is the Application), but the consequence is severe and matches the precise hang the PR is attempting to close. SyncJobService callers hold JobService budget slots until the OS terminates the worker; accessor calls under SDK_BACKGROUND_THREADING will block forever. Importantly, the PR's defensive intent is explicit in the new catch comment, so this is a literal gap in the same defensive pattern rather than a hypothetical future hardening.

Suggested fix

Mirror the internalInit catch around the sync-entry body. Two equivalent options:

Option A — wrap the post-lock body:

synchronized(initLock) { ... }
try {
    ensureApplicationServiceStarted(context)
    if (isBackgroundThreadingEnabled) {
        suspendifyOnIO { internalInit(context, appId) }
        return true
    }
    return runBlocking(runtimeIoDispatcher) { internalInit(context, appId) }
} catch (t: Throwable) {
    initFailureException = (initFailureException ?: IllegalStateException(...)).apply { addSuppressed(t) }
    initState = InitState.FAILED
    notifyInitComplete()
    throw t
}

Option B — move ensureApplicationServiceStarted into internalInit so it falls under the existing try/catch. Would need to preserve the order with the isBackgroundThreadingEnabled evaluation (which currently relies on applicationServiceStarted being true to avoid returning false prematurely).

// Ensure app context is available before evaluating feature gates.
ensureApplicationServiceStarted(context)

Expand All @@ -333,48 +340,69 @@
return initWithContextSuspend(context, null)
}

// ReturnCount: 4 returns vs detekt's limit of 2. The function intentionally has
// multiple early-return failure paths (user-locked, missing appId, exception). The
// existing detekt baseline entry was for the unannotated signature; once we added
// @Suppress(TooGenericExceptionCaught) the baseline ID no longer matched, so we
// suppress ReturnCount inline rather than churning the baseline.
Comment on lines +345 to +347
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit, autogenerated comments can be quite verbose and not that relevant, and I'm quite guilty of this too

@Suppress("ReturnCount", "TooGenericExceptionCaught")
private fun internalInit(
context: Context,
appId: String?,
): Boolean {
// Check whether current Android user is accessible.
// Return early if it is inaccessible, as we are unable to complete initialization without access
// to device storage like SharedPreferences.
if (!AndroidUtils.isAndroidUserUnlocked(context)) {
Logging.warn("initWithContext called when device storage is locked, no user data is accessible!")
initState = InitState.FAILED
notifyInitComplete()
return false
}
try {
// Check whether current Android user is accessible.
// Return early if it is inaccessible, as we are unable to complete initialization without access
// to device storage like SharedPreferences.
if (!AndroidUtils.isAndroidUserUnlocked(context)) {
Logging.warn("initWithContext called when device storage is locked, no user data is accessible!")
initState = InitState.FAILED
notifyInitComplete()
return false
}

initEssentials(context)

initEssentials(context)
val startupService = bootstrapServices()

val startupService = bootstrapServices()
// Now that the IoC container is ready, subscribe the Otel lifecycle
// manager to config store events so it reacts to fresh remote config.
otelManager?.subscribeToConfigStore(services.getService<ConfigModelStore>())

// Now that the IoC container is ready, subscribe the Otel lifecycle
// manager to config store events so it reacts to fresh remote config.
otelManager?.subscribeToConfigStore(services.getService<ConfigModelStore>())
val result = resolveAppId(appId, configModel, preferencesService)
if (result.failed) {
val message = "suspendInitInternal: no appId provided or found in local storage. Please pass a valid appId to initWithContext()."
val exception = IllegalStateException(message)
// attach the real crash cause to the init failure exception that will be throw shortly after
initFailureException?.addSuppressed(exception)
Logging.warn(message)
initState = InitState.FAILED
notifyInitComplete()
return false
}
configModel.appId = result.appId!! // safe because failed is false
val forceCreateUser = result.forceCreateUser

val result = resolveAppId(appId, configModel, preferencesService)
if (result.failed) {
val message = "suspendInitInternal: no appId provided or found in local storage. Please pass a valid appId to initWithContext()."
val exception = IllegalStateException(message)
// attach the real crash cause to the init failure exception that will be throw shortly after
initFailureException?.addSuppressed(exception)
Logging.warn(message)
updateConfig()
userSwitcher.initUser(forceCreateUser)
startupService.scheduleStart()
initState = InitState.SUCCESS
notifyInitComplete()
return true
} catch (t: Throwable) {
// Any unchecked throw from initEssentials/bootstrapServices/subscribeToConfigStore/
// updateConfig/userSwitcher.initUser/startupService.scheduleStart would otherwise leave
// initState at IN_PROGRESS forever and `suspendCompletion` uncompleted -- causing
// re-entrant suspend callers (e.g. SyncJobService) to hang indefinitely on `await()`.
// Always reach a terminal state and signal waiters before propagating.
Logging.error("OneSignal: internalInit threw unexpectedly; marking init FAILED", t)
initFailureException = (initFailureException ?: IllegalStateException("OneSignal initWithContext failed.")).apply {
addSuppressed(t)
}
initState = InitState.FAILED
notifyInitComplete()
return false
}

Check failure on line 405 in OneSignalSDK/onesignal/core/src/main/java/com/onesignal/internal/OneSignalImp.kt

View check run for this annotation

Claude / Claude Code Review

Latch lifecycle TOCTOU on retry-after-FAILED races re-introduces SDK-4475 failure modes

Two TOCTOU races on the latch lifecycle re-introduce the very SDK-4475 failure modes this PR is fixing.\n\n**(1) Completer-side wrong-latch race.** `notifyInitComplete()` (L484) reads the volatile `suspendCompletion` at `.complete()` time with no lock and no local capture. All three FAILED paths in `internalInit` (L359-360, L379-380, L402-403) execute `initState = FAILED` and `notifyInitComplete()` as two unsynchronized statements. A concurrent retry can land between them, acquire `initLock`, ob
Comment on lines +392 to 405
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Two TOCTOU races on the latch lifecycle re-introduce the very SDK-4475 failure modes this PR is fixing.\n\n**(1) Completer-side wrong-latch race.** notifyInitComplete() (L484) reads the volatile suspendCompletion at .complete() time with no lock and no local capture. All three FAILED paths in internalInit (L359-360, L379-380, L402-403) execute initState = FAILED and notifyInitComplete() as two unsynchronized statements. A concurrent retry can land between them, acquire initLock, observe FAILED → !isSDKAccessible, flip state→IN_PROGRESS and reset suspendCompletion = L2. The original thread then completes L2 (the wrong generation) instead of L1 — L1 waiters hang forever (the SyncJobService budget-slot leak this PR was meant to close), L2 waiters wake on transient IN_PROGRESS state.\n\n**(2) Awaiter-side stale-state race.** After completionToAwait!!.await() returns at L536 / L703, both call sites re-read live volatile initState outside any lock (L548 / L704). Coroutine continuation reschedule is non-trivial, and a concurrent retry-after-FAILED can flip state during that gap. In initWithContextSuspend this returns false based on transient IN_PROGRESS (silent JobService work drop); in waitUntilInitInternal the L548 FAILED check fails, control falls through past "// initState is guaranteed to be SUCCESS here" and the caller invokes getter() against a not-yet-bootstrapped service — exactly the SessionService-NPE class.\n\nBoth PR regression tests serialize the FAILED→retry sequence so neither covers the concurrent window. A single fix addresses both: change the latch payload to CompletableDeferred<InitState>, have completer paths call .complete(state) on a local captured at the start of internalInit, and have awaiter sites return based on the awaited result rather than re-reading the volatile field.

Extended reasoning...

Summary

The @volatile var + reset-under-lock + local-capture-at-await-site pattern landed in 4101068 closes the PR's targeted failure mode but leaves two unaddressed TOCTOU windows on the same latch lifecycle. Both windows are reachable via realistic host-app + SyncJobService interleavings, and both re-introduce the contract violations this PR was specifically designed to fix.

(1) Completer-side wrong-latch race

notifyInitComplete() (OneSignalImp.kt:483-485) is:

private fun notifyInitComplete() {
    suspendCompletion.complete(Unit)
}

It reads the @Volatile field at .complete() time. The await sites local-capture under initLock, but the completer side does not — so it can complete a different generation of the latch than the one it was supposed to release. The non-atomic pair initState = FAILED + notifyInitComplete() appears at three sites:

  • L358-361 (user-locked branch)
  • L378-381 (missing appId branch)
  • L402-403 (catch block)

All three live outside any synchronized block. isSDKAccessible() returns false for FAILED, so a concurrent caller landing in the synchronized block at L678-693 (or L305-315) is allowed to transition state and reset the latch.

Step-by-step proof

Initial state: initState = NOT_STARTED, suspendCompletion = D0.

  1. T_A calls initWithContextSuspend(ctx, "appId"). Synchronized block flips initState = IN_PROGRESS, allocates suspendCompletion = D1. Releases lock. Begins internalInit.
  2. T_C (re-entrant suspend caller, e.g. SyncJobService.onStartJob → initWithContext(this) → initWithContextSuspend(ctx, null)) enters the synchronized block at L678. Observes isSDKAccessible() == true (state is IN_PROGRESS), captures completionToAwait = suspendCompletion = D1. Releases lock. Suspends on D1.await() at L703.
  3. T_A: internalInit body throws (e.g. inside bootstrapServices iterating registered IBootstrapService implementations). Catch block executes at L398-404. Reaches L402: initState = InitState.FAILED. ⏸️ (thread preemption, GC pause, etc.)
  4. T_B (separate thread, e.g. host-app retry OneSignal.initWithContext(ctx, "appId")) enters the synchronized block at L305 (sync path) or L678 (suspend path). Observes initState == FAILED, !isSDKAccessible(). Flips initState = IN_PROGRESS, allocates suspendCompletion = D2. Releases lock. Spawns its own internalInit.
  5. T_A: resumes at L403 → notifyInitComplete() → reads @Volatile suspendCompletion = D2D2.complete(Unit).

Outcomes

  • D1 is never completed. T_C hangs on D1.await() forever — the exact SyncJobService budget-slot leak this PR was meant to prevent.
  • D2 is completed prematurely. Any waiter that captures D2 after T_B's reset wakes immediately. In initWithContextSuspend it returns initState == SUCCESS against transient IN_PROGRESS → false, silently dropping JobService work. In waitUntilInitInternal the L548 FAILED check fails, the function falls through past the misleading "// initState is guaranteed to be SUCCESS here" comment, and the caller proceeds to getter() against a service whose retry-init bootstrap has not run.

The SUCCESS path at L389-390 is safe because state == SUCCESS keeps isSDKAccessible() == true, so no concurrent caller can reset the latch.

(2) Awaiter-side stale-state race

The symmetric defect is on the awaiter side. After completionToAwait!!.await() returns, both call sites re-read live volatile initState outside the lock:

  • waitUntilInitInternal: L536 (await) → L548 (if (initState == InitState.FAILED))
  • initWithContextSuspend: L703 (await) → L704 (return@withContext initState == InitState.SUCCESS)

The gap between latch completion and continuation resumption depends on coroutine dispatcher scheduling and can be substantial. A concurrent retry-after-FAILED during that gap flips state from FAILED to IN_PROGRESS, defeating the post-await check. This race exists even when notifyInitComplete() correctly completes the right latch — i.e. it is independent of the bug above.

For initWithContextSuspend, returning false on transient IN_PROGRESS is a silent JobService work drop. For waitUntilInitInternal, falling through past the FAILED check on transient IN_PROGRESS is the SessionService-NPE class — the caller invokes getter() against services whose retry init has not bootstrapped (configModel.appId not set, userSwitcher.initUser not called, startupService.scheduleStart not run).

Why the PR's regression tests don't cover this

  • initWithContextSuspend resets latch on retry-after-FAILED serializes: firstInit shouldBe false runs to completion before the second init starts.
  • initWithContextSuspend reaches terminal state when internalInit throws is single-threaded.
  • The original in-flight init waits for completion test does not exercise FAILED retry at all.

Neither test launches a thread between initState = FAILED and notifyInitComplete(), and none launch a re-entrant caller between await() returning and the post-await state read.

Suggested fix

A single change addresses both races:

  1. Change the latch payload to encode the terminal state: CompletableDeferred<InitState>.
  2. At the start of internalInit, capture val myCompletion = suspendCompletion (under or just after the lock that reset it).
  3. All terminal-state paths call myCompletion.complete(InitState.FAILED) or myCompletion.complete(InitState.SUCCESS) on the captured local — never on the volatile field.
  4. Awaiter sites use the awaited result (val terminalState = completionToAwait!!.await()) instead of re-reading volatile initState.

This pattern eliminates both windows by making the terminal state of each generation observable as the latch payload, decoupling it from the live volatile field that concurrent retries mutate.

Alternatively, holding initLock across initState = FAILED and notifyInitComplete() closes window (1) but does not close window (2); window (2) requires either the result-in-payload pattern or a loop that re-captures and re-awaits while observed state is IN_PROGRESS.

configModel.appId = result.appId!! // safe because failed is false
val forceCreateUser = result.forceCreateUser

updateConfig()
userSwitcher.initUser(forceCreateUser)
startupService.scheduleStart()
initState = InitState.SUCCESS
notifyInitComplete()
return true
}

override fun login(
Expand Down Expand Up @@ -474,7 +502,16 @@
* @param operationName Optional operation name to include in error messages (e.g., "login", "logout")
*/
private suspend fun waitUntilInitInternal(operationName: String? = null) {
when (initState) {
// Local-capture state + deferred under initLock so we await on the same generation
// we observed (a concurrent retry-after-FAILED can replace `suspendCompletion`).
val observedState: InitState
val completionToAwait: CompletableDeferred<Unit>?
synchronized(initLock) {
observedState = initState
completionToAwait = if (observedState == InitState.IN_PROGRESS) suspendCompletion else null
}

when (observedState) {
InitState.NOT_STARTED -> {
val message = if (operationName != null) {
"Must call 'initWithContext' before '$operationName'"
Expand All @@ -496,7 +533,7 @@
// This is intentional per PR #2412: "ANR is the lesser of two evils and the app can recover,
// where an uncaught throw it can not." To avoid ANRs, call SDK methods from background threads
// or use the suspend API from coroutines.
suspendCompletion.await()
completionToAwait!!.await()

// Log how long initialization took
val elapsed = System.currentTimeMillis() - startTime
Expand Down Expand Up @@ -632,19 +669,39 @@
): Boolean {
Logging.log(LogLevel.DEBUG, "initWithContext(context: $context, appId: $appId)")

// This ensures the stack trace points to the caller that triggered init, not the async worker thread.
initFailureException = IllegalStateException("OneSignal initWithContext failed.")

// Use IO dispatcher for initialization to prevent ANRs and optimize for I/O operations
return withContext(runtimeIoDispatcher) {
// do not do this again if already initialized or init is in progress
val shouldRunInit: Boolean
// Local-capture under the lock so that even if a concurrent retry-after-FAILED
// resets `suspendCompletion`, we await on the same generation we observed.
val completionToAwait: CompletableDeferred<Unit>?
synchronized(initLock) {
if (initState.isSDKAccessible()) {
Logging.log(LogLevel.DEBUG, "initWithContext: SDK already initialized or in progress")
return@withContext true
shouldRunInit = false
completionToAwait = suspendCompletion
} else {
shouldRunInit = true
completionToAwait = null
initState = InitState.IN_PROGRESS
// Fresh latch for this init attempt.
suspendCompletion = CompletableDeferred()
// Only the call that actually starts init owns the failure-attribution exception.
// Re-entrant callers must not overwrite it -- otherwise the failure stack trace
// would point at the SyncJobService coroutine instead of the original initiator.
initFailureException = IllegalStateException("OneSignal initWithContext failed.")

Check warning on line 691 in OneSignalSDK/onesignal/core/src/main/java/com/onesignal/internal/OneSignalImp.kt

View check run for this annotation

Claude / Claude Code Review

initFailureException stack trace regressed from caller frames to IO worker thread

**Nit / pre-existing-style:** The fix to gate `initFailureException` assignment behind `shouldRunInit` (responding to Copilot 3196748195) inadvertently moved the `IllegalStateException(...)` constructor *inside* `withContext(runtimeIoDispatcher)`. The field's declaration comment at OneSignalImp.kt:62 still says "Save the exception pointing to the caller that triggered init, not the async worker thread," and the deleted comment at the old assignment site ("This ensures the stack trace points to t
Comment on lines +685 to +691
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Nit / pre-existing-style: The fix to gate initFailureException assignment behind shouldRunInit (responding to Copilot 3196748195) inadvertently moved the IllegalStateException(...) constructor inside withContext(runtimeIoDispatcher). The field's declaration comment at OneSignalImp.kt:62 still says "Save the exception pointing to the caller that triggered init, not the async worker thread," and the deleted comment at the old assignment site ("This ensures the stack trace points to the caller that triggered init, not the async worker thread") spelled out the same intent — both are now technically violated. Practical impact is small (the message + suppressed cause carry the real diagnostic info, and on suspend paths the "caller frames" were always coroutine continuation frames anyway), but if you want to honor the documented invariant the construction can be hoisted above withContext while the assignment stays under the lock — i.e. val newFailureEx = IllegalStateException(...) outside, then initFailureException = newFailureEx inside the shouldRunInit branch.

Extended reasoning...

What changed

Pre-PR (per the diff context), initWithContextSuspend constructed and assigned initFailureException immediately on entry, before withContext(runtimeIoDispatcher):

// pre-PR
initFailureException = IllegalStateException("OneSignal initWithContext failed.")
// This ensures the stack trace points to the caller that triggered init, not the async worker thread.
return withContext(runtimeIoDispatcher) { ... }

Post-PR (OneSignalImp.kt:691), the assignment lives inside the shouldRunInit branch of the synchronized(initLock) block — which is itself inside withContext(runtimeIoDispatcher). The constructor therefore now runs on an IO worker continuation, and the captured stack trace contains IO dispatcher frames rather than the caller's coroutine continuation frames. The pre-PR comment that documented the invariant at the assignment site was deleted, but the field-level comment at OneSignalImp.kt:62 still asserts it: Save the exception pointing to the caller that triggered init, not the async worker thread.

Why the change happened

This was a deliberate response to Copilot inline-comment 3196748195: initFailureException is overwritten at the start of every initWithContextSuspend call, even when this invocation does not actually start init. The fix correctly gates the assignment so re-entrant suspend callers (notably the SyncJobService entry point) no longer overwrite the original initiator's failure-attribution exception. That part is sound. The accidental side effect is that hoisting the assignment under the lock also relocated the constructor into the IO continuation.

Practical impact (why this is a nit, not a blocker)

  1. The substantive diagnostic info is preserved. The exception's message (OneSignal initWithContext failed.) is unchanged, and the suppressed cause attached in internalInit (the addSuppressed paths at lines ~691 and ~702) carries the actual failure-site stack into the bootstrap path. Crash reporters and debuggers surfacing this exception still get the meaningful information.
  2. Pre-PR, the "caller's stack" guarantee was already weak on the suspend path. initWithContextSuspend is a suspend function — its callers are already in a coroutine, so the JVM stack at the constructor call site (whether before or after withContext) is dominated by continuation/dispatcher machinery; the user's actual code site (e.g. launch { os.initWithContextSuspend(...) }) is not present at any point. The deleted comment overstated what was actually preserved.
  3. The typical sync initWithContext(ctx, appId) path is unaffected — that path doesn't touch initFailureException at all (only the catch block in internalInit does, on whichever runner thread).
  4. The existing test accessor instances after initWithContext without appID shows the failure reason (SDKInitTests.kt:75-83) only asserts on .message and suppressed[0].message, not on the stack trace, so the regression is not test-detected.

Step-by-step proof of the regression

  1. Host calls os.initWithContextSuspend(context) from inside a coroutine on the Main dispatcher (or from runBlocking on a non-coroutine thread).
  2. Execution enters initWithContextSuspend body. Pre-PR, IllegalStateException(...) is constructed here — its captured stack contains the coroutine's continuation frames on the Main dispatcher (or the runBlocking thread). Post-PR, no construction happens yet.
  3. withContext(runtimeIoDispatcher) switches to an IO dispatcher worker. Post-PR, the synchronized block runs here, and IllegalStateException(...) is constructed now — its captured stack contains IO worker continuation frames.
  4. Init eventually fails. Some later waitUntilInitInternal caller throws initFailureException. Pre-PR: stack trace shows Main dispatcher / runBlocking-thread continuation. Post-PR: stack trace shows IO worker continuation.
  5. Both stacks are coroutine machinery; neither cleanly points at the host app's call site. But pre-PR was at least consistent with the documented invariant of "caller's dispatcher," whereas post-PR is unambiguously "IO worker."

Suggested fix (if you want to honor the documented invariant)

Hoist the constructor above withContext and only the assignment under the lock. The two concerns (no-overwrite and caller-dispatcher capture) are orthogonal:

override suspend fun initWithContextSuspend(...): Boolean {
    Logging.log(...)
    // Capture stack on caller's dispatcher.
    val newFailureEx = IllegalStateException("OneSignal initWithContext failed.")
    return withContext(runtimeIoDispatcher) {
        synchronized(initLock) {
            if (initState.isSDKAccessible()) { ... }
            else {
                initState = InitState.IN_PROGRESS
                suspendCompletion = CompletableDeferred()
                initFailureException = newFailureEx  // assignment under lock; stack captured outside withContext
            }
        }
        ...
    }
}

Or alternately: drop the field-level comment to match what's actually preserved (message + suppressed cause), since the "caller frames" claim was already largely aspirational on a suspend path. Either fix is fine; the current code is internally inconsistent.

}
}
Comment on lines +674 to +693
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 4b5d86f. suspendCompletion is now a @Volatile var reset under synchronized(initLock) whenever state flips into IN_PROGRESS — applied to both the sync initWithContext(ctx, appId) and suspend initWithContextSuspend paths to keep them consistent. Both await sites (initWithContextSuspend and waitUntilInitInternal) now local-capture the deferred under the lock so they wait on the same generation they observed.

Added regression test initWithContextSuspend resets latch on retry-after-FAILED which: forces a FAILED first init, kicks off a stalled retry, then a re-entrant suspend caller. Verified to fail on the un-hardened code with expected:<true> but was:<false> (the re-entrant caller wakes on the stale latch and reads transient IN_PROGRESS state) and pass with the fix.


initState = InitState.IN_PROGRESS
if (!shouldRunInit) {
// Another caller has already started (or completed) init. Honor this method's
// contract by suspending until initialization is *fully* completed -- not just
// kicked off. This closes a race where re-entrant suspend callers (e.g. the
// SyncJobService entry point under SDK_BACKGROUND_THREADING) would otherwise
// proceed to use IBackgroundService implementations like SessionService whose
// bootstrap() had not yet run, NPE'ing on still-null model fields.
Logging.log(LogLevel.DEBUG, "initWithContext: init already in progress or completed, awaiting completion")
completionToAwait!!.await()
return@withContext initState == InitState.SUCCESS
Comment thread
claude[bot] marked this conversation as resolved.
}

val result = internalInit(context, appId)
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -35,13 +35,24 @@ internal class SessionService(
private val _time: ITime,
) : ISessionService, IBootstrapService, IStartableService, IBackgroundService, IApplicationLifecycleHandler {
override val startTime: Long
get() = session!!.startTime
// Pre-bootstrap default returns "now" so call sites computing `_time.currentTimeMillis - startTime`
// (e.g. IAM session-duration / SESSION_TIME triggers) see ~0ms elapsed instead of ~58 years
// (which is what `0L` / Jan 1970 would produce).
get() = session?.startTime ?: _time.currentTimeMillis

/**
* Run in the background when the session would time out, only if a session is currently active.
*
* Returns null if [bootstrap] has not yet run -- callers that invoke us before bootstrap
* (possible under SDK_BACKGROUND_THREADING when init is still in flight) should treat this
* as "no schedule needed" rather than crashing on a null-deref.
*/
override val scheduleBackgroundRunIn: Long?
get() = if (session!!.isValid) config!!.sessionFocusTimeout else null
get() {
val session = this.session ?: return null
val config = this.config ?: return null
return if (session.isValid) config.sessionFocusTimeout else null
}

private val sessionLifeCycleNotifier: EventProducer<ISessionLifecycleHandler> = EventProducer()
private var session: SessionModel? = null
Expand Down Expand Up @@ -69,13 +80,17 @@ internal class SessionService(
}

private fun endSession() {
if (!session!!.isValid) return
val activeDuration = session!!.activeDuration
// Defensive: if bootstrap() has not run yet, there is no session state to end.
// This can happen under SDK_BACKGROUND_THREADING when SyncJobService races with
// an in-flight initWithContext that has not yet reached bootstrapServices().
val session = this.session ?: return
if (!session.isValid) return
val activeDuration = session.activeDuration
Logging.debug("SessionService.backgroundRun: Session ended. activeDuration: $activeDuration")

session!!.isValid = false
session.isValid = false
sessionLifeCycleNotifier.fire { it.onSessionEnded(activeDuration) }
session!!.activeDuration = 0L
session.activeDuration = 0L
}

/**
Expand All @@ -90,34 +105,45 @@ internal class SessionService(
override fun onFocus(firedOnSubscribe: Boolean) {
Logging.log(LogLevel.DEBUG, "SessionService.onFocus() - fired from start: $firedOnSubscribe")

val session = this.session
if (session == null) {
Logging.warn("SessionService.onFocus called before bootstrap; ignoring.")
return
}

// Treat app cold starts as a new session, we attempt to end any previous session to do this.
if (!hasFocused) {
hasFocused = true
endSession()
}

if (!session!!.isValid) {
if (!session.isValid) {
// As the old session was made inactive, we need to create a new session
shouldFireOnSubscribe = firedOnSubscribe
session!!.sessionId = UUID.randomUUID().toString()
session!!.startTime = _time.currentTimeMillis
session!!.focusTime = session!!.startTime
session!!.isValid = true
Logging.debug("SessionService: New session started at ${session!!.startTime}")
session.sessionId = UUID.randomUUID().toString()
session.startTime = _time.currentTimeMillis
session.focusTime = session.startTime
session.isValid = true
Logging.debug("SessionService: New session started at ${session.startTime}")
sessionLifeCycleNotifier.fire { it.onSessionStarted() }
} else {
// existing session: just remember the focus time so we can calculate the active time
// when onUnfocused is called.
session!!.focusTime = _time.currentTimeMillis
session.focusTime = _time.currentTimeMillis
sessionLifeCycleNotifier.fire { it.onSessionActive() }
}
}

override fun onUnfocused() {
val session = this.session
if (session == null) {
Logging.warn("SessionService.onUnfocused called before bootstrap; ignoring.")
return
}
// capture the amount of time the app was focused
val dt = _time.currentTimeMillis - session!!.focusTime
session!!.activeDuration += dt
Logging.log(LogLevel.DEBUG, "SessionService.onUnfocused adding time $dt for total: ${session!!.activeDuration}")
val dt = _time.currentTimeMillis - session.focusTime
session.activeDuration += dt
Logging.log(LogLevel.DEBUG, "SessionService.onUnfocused adding time $dt for total: ${session.activeDuration}")
}

override fun subscribe(handler: ISessionLifecycleHandler) {
Expand Down
Loading