-
Notifications
You must be signed in to change notification settings - Fork 377
fix: SDK-4475 wait for in-flight init in initWithContextSuspend to avoid SessionService NPE #2637
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -50,7 +50,11 @@ | |
| private val ioDispatcher: CoroutineDispatcher = Dispatchers.IO, | ||
| ) : IOneSignal, IServiceProvider { | ||
|
|
||
| private val suspendCompletion = CompletableDeferred<Unit>() | ||
| // Reset every time the synchronized(initLock) block flips state to IN_PROGRESS so that | ||
| // a retry-after-FAILED gets a fresh latch instead of an already-completed one. Mutated only | ||
| // under initLock; reads outside that lock must local-capture before suspending on it. | ||
| @Volatile | ||
| private var suspendCompletion = CompletableDeferred<Unit>() | ||
|
|
||
| @Volatile | ||
| private var initState: InitState = InitState.NOT_STARTED | ||
|
|
@@ -302,12 +306,15 @@ | |
| if (initState.isSDKAccessible()) { | ||
| Logging.log(LogLevel.DEBUG, "initWithContext: SDK already initialized or in progress") | ||
| return true | ||
| } | ||
|
|
||
| initState = InitState.IN_PROGRESS | ||
| // Fresh latch for this init attempt -- prevents retry-after-FAILED waiters from | ||
| // observing the prior init's already-completed deferred. | ||
| suspendCompletion = CompletableDeferred() | ||
| } | ||
|
|
||
| // FeatureManager depends on ConfigModelStore/PreferencesService which requires appContext. | ||
|
Check failure on line 317 in OneSignalSDK/onesignal/core/src/main/java/com/onesignal/internal/OneSignalImp.kt
|
||
| // Ensure app context is available before evaluating feature gates. | ||
| ensureApplicationServiceStarted(context) | ||
|
|
||
|
|
@@ -333,48 +340,69 @@ | |
| return initWithContextSuspend(context, null) | ||
| } | ||
|
|
||
| // ReturnCount: 4 returns vs detekt's limit of 2. The function intentionally has | ||
| // multiple early-return failure paths (user-locked, missing appId, exception). The | ||
| // existing detekt baseline entry was for the unannotated signature; once we added | ||
| // @Suppress(TooGenericExceptionCaught) the baseline ID no longer matched, so we | ||
| // suppress ReturnCount inline rather than churning the baseline. | ||
|
Comment on lines
+345
to
+347
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. nit, autogenerated comments can be quite verbose and not that relevant, and I'm quite guilty of this too |
||
| @Suppress("ReturnCount", "TooGenericExceptionCaught") | ||
| private fun internalInit( | ||
| context: Context, | ||
| appId: String?, | ||
| ): Boolean { | ||
| // Check whether current Android user is accessible. | ||
| // Return early if it is inaccessible, as we are unable to complete initialization without access | ||
| // to device storage like SharedPreferences. | ||
| if (!AndroidUtils.isAndroidUserUnlocked(context)) { | ||
| Logging.warn("initWithContext called when device storage is locked, no user data is accessible!") | ||
| initState = InitState.FAILED | ||
| notifyInitComplete() | ||
| return false | ||
| } | ||
| try { | ||
| // Check whether current Android user is accessible. | ||
| // Return early if it is inaccessible, as we are unable to complete initialization without access | ||
| // to device storage like SharedPreferences. | ||
| if (!AndroidUtils.isAndroidUserUnlocked(context)) { | ||
| Logging.warn("initWithContext called when device storage is locked, no user data is accessible!") | ||
| initState = InitState.FAILED | ||
| notifyInitComplete() | ||
| return false | ||
| } | ||
|
|
||
| initEssentials(context) | ||
|
|
||
| initEssentials(context) | ||
| val startupService = bootstrapServices() | ||
|
|
||
| val startupService = bootstrapServices() | ||
| // Now that the IoC container is ready, subscribe the Otel lifecycle | ||
| // manager to config store events so it reacts to fresh remote config. | ||
| otelManager?.subscribeToConfigStore(services.getService<ConfigModelStore>()) | ||
|
|
||
| // Now that the IoC container is ready, subscribe the Otel lifecycle | ||
| // manager to config store events so it reacts to fresh remote config. | ||
| otelManager?.subscribeToConfigStore(services.getService<ConfigModelStore>()) | ||
| val result = resolveAppId(appId, configModel, preferencesService) | ||
| if (result.failed) { | ||
| val message = "suspendInitInternal: no appId provided or found in local storage. Please pass a valid appId to initWithContext()." | ||
| val exception = IllegalStateException(message) | ||
| // attach the real crash cause to the init failure exception that will be throw shortly after | ||
| initFailureException?.addSuppressed(exception) | ||
| Logging.warn(message) | ||
| initState = InitState.FAILED | ||
| notifyInitComplete() | ||
| return false | ||
| } | ||
| configModel.appId = result.appId!! // safe because failed is false | ||
| val forceCreateUser = result.forceCreateUser | ||
|
|
||
| val result = resolveAppId(appId, configModel, preferencesService) | ||
| if (result.failed) { | ||
| val message = "suspendInitInternal: no appId provided or found in local storage. Please pass a valid appId to initWithContext()." | ||
| val exception = IllegalStateException(message) | ||
| // attach the real crash cause to the init failure exception that will be throw shortly after | ||
| initFailureException?.addSuppressed(exception) | ||
| Logging.warn(message) | ||
| updateConfig() | ||
| userSwitcher.initUser(forceCreateUser) | ||
| startupService.scheduleStart() | ||
| initState = InitState.SUCCESS | ||
| notifyInitComplete() | ||
| return true | ||
| } catch (t: Throwable) { | ||
| // Any unchecked throw from initEssentials/bootstrapServices/subscribeToConfigStore/ | ||
| // updateConfig/userSwitcher.initUser/startupService.scheduleStart would otherwise leave | ||
| // initState at IN_PROGRESS forever and `suspendCompletion` uncompleted -- causing | ||
| // re-entrant suspend callers (e.g. SyncJobService) to hang indefinitely on `await()`. | ||
| // Always reach a terminal state and signal waiters before propagating. | ||
| Logging.error("OneSignal: internalInit threw unexpectedly; marking init FAILED", t) | ||
| initFailureException = (initFailureException ?: IllegalStateException("OneSignal initWithContext failed.")).apply { | ||
| addSuppressed(t) | ||
| } | ||
| initState = InitState.FAILED | ||
| notifyInitComplete() | ||
| return false | ||
| } | ||
|
Check failure on line 405 in OneSignalSDK/onesignal/core/src/main/java/com/onesignal/internal/OneSignalImp.kt
|
||
|
Comment on lines
+392
to
405
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. 🔴 Two TOCTOU races on the latch lifecycle re-introduce the very SDK-4475 failure modes this PR is fixing.\n\n**(1) Completer-side wrong-latch race.** Extended reasoning...SummaryThe @volatile var + reset-under-lock + local-capture-at-await-site pattern landed in 4101068 closes the PR's targeted failure mode but leaves two unaddressed TOCTOU windows on the same latch lifecycle. Both windows are reachable via realistic host-app + SyncJobService interleavings, and both re-introduce the contract violations this PR was specifically designed to fix. (1) Completer-side wrong-latch race
private fun notifyInitComplete() {
suspendCompletion.complete(Unit)
}It reads the
All three live outside any synchronized block. Step-by-step proofInitial state:
Outcomes
The SUCCESS path at L389-390 is safe because (2) Awaiter-side stale-state raceThe symmetric defect is on the awaiter side. After
The gap between latch completion and continuation resumption depends on coroutine dispatcher scheduling and can be substantial. A concurrent retry-after-FAILED during that gap flips state from FAILED to IN_PROGRESS, defeating the post-await check. This race exists even when For Why the PR's regression tests don't cover this
Neither test launches a thread between Suggested fixA single change addresses both races:
This pattern eliminates both windows by making the terminal state of each generation observable as the latch payload, decoupling it from the live volatile field that concurrent retries mutate. Alternatively, holding |
||
| configModel.appId = result.appId!! // safe because failed is false | ||
| val forceCreateUser = result.forceCreateUser | ||
|
|
||
| updateConfig() | ||
| userSwitcher.initUser(forceCreateUser) | ||
| startupService.scheduleStart() | ||
| initState = InitState.SUCCESS | ||
| notifyInitComplete() | ||
| return true | ||
| } | ||
|
|
||
| override fun login( | ||
|
|
@@ -474,7 +502,16 @@ | |
| * @param operationName Optional operation name to include in error messages (e.g., "login", "logout") | ||
| */ | ||
| private suspend fun waitUntilInitInternal(operationName: String? = null) { | ||
| when (initState) { | ||
| // Local-capture state + deferred under initLock so we await on the same generation | ||
| // we observed (a concurrent retry-after-FAILED can replace `suspendCompletion`). | ||
| val observedState: InitState | ||
| val completionToAwait: CompletableDeferred<Unit>? | ||
| synchronized(initLock) { | ||
| observedState = initState | ||
| completionToAwait = if (observedState == InitState.IN_PROGRESS) suspendCompletion else null | ||
| } | ||
|
|
||
| when (observedState) { | ||
| InitState.NOT_STARTED -> { | ||
| val message = if (operationName != null) { | ||
| "Must call 'initWithContext' before '$operationName'" | ||
|
|
@@ -496,7 +533,7 @@ | |
| // This is intentional per PR #2412: "ANR is the lesser of two evils and the app can recover, | ||
| // where an uncaught throw it can not." To avoid ANRs, call SDK methods from background threads | ||
| // or use the suspend API from coroutines. | ||
| suspendCompletion.await() | ||
| completionToAwait!!.await() | ||
|
|
||
| // Log how long initialization took | ||
| val elapsed = System.currentTimeMillis() - startTime | ||
|
|
@@ -632,19 +669,39 @@ | |
| ): Boolean { | ||
| Logging.log(LogLevel.DEBUG, "initWithContext(context: $context, appId: $appId)") | ||
|
|
||
| // This ensures the stack trace points to the caller that triggered init, not the async worker thread. | ||
| initFailureException = IllegalStateException("OneSignal initWithContext failed.") | ||
|
|
||
| // Use IO dispatcher for initialization to prevent ANRs and optimize for I/O operations | ||
| return withContext(runtimeIoDispatcher) { | ||
| // do not do this again if already initialized or init is in progress | ||
| val shouldRunInit: Boolean | ||
| // Local-capture under the lock so that even if a concurrent retry-after-FAILED | ||
| // resets `suspendCompletion`, we await on the same generation we observed. | ||
| val completionToAwait: CompletableDeferred<Unit>? | ||
| synchronized(initLock) { | ||
| if (initState.isSDKAccessible()) { | ||
| Logging.log(LogLevel.DEBUG, "initWithContext: SDK already initialized or in progress") | ||
| return@withContext true | ||
| shouldRunInit = false | ||
| completionToAwait = suspendCompletion | ||
| } else { | ||
| shouldRunInit = true | ||
| completionToAwait = null | ||
| initState = InitState.IN_PROGRESS | ||
| // Fresh latch for this init attempt. | ||
| suspendCompletion = CompletableDeferred() | ||
| // Only the call that actually starts init owns the failure-attribution exception. | ||
| // Re-entrant callers must not overwrite it -- otherwise the failure stack trace | ||
| // would point at the SyncJobService coroutine instead of the original initiator. | ||
| initFailureException = IllegalStateException("OneSignal initWithContext failed.") | ||
|
Check warning on line 691 in OneSignalSDK/onesignal/core/src/main/java/com/onesignal/internal/OneSignalImp.kt
|
||
|
Comment on lines
+685
to
+691
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. 🟡 Nit / pre-existing-style: The fix to gate Extended reasoning...What changedPre-PR (per the diff context), // pre-PR
initFailureException = IllegalStateException("OneSignal initWithContext failed.")
// This ensures the stack trace points to the caller that triggered init, not the async worker thread.
return withContext(runtimeIoDispatcher) { ... }Post-PR (OneSignalImp.kt:691), the assignment lives inside the Why the change happenedThis was a deliberate response to Copilot inline-comment 3196748195: Practical impact (why this is a nit, not a blocker)
Step-by-step proof of the regression
Suggested fix (if you want to honor the documented invariant)Hoist the constructor above override suspend fun initWithContextSuspend(...): Boolean {
Logging.log(...)
// Capture stack on caller's dispatcher.
val newFailureEx = IllegalStateException("OneSignal initWithContext failed.")
return withContext(runtimeIoDispatcher) {
synchronized(initLock) {
if (initState.isSDKAccessible()) { ... }
else {
initState = InitState.IN_PROGRESS
suspendCompletion = CompletableDeferred()
initFailureException = newFailureEx // assignment under lock; stack captured outside withContext
}
}
...
}
}Or alternately: drop the field-level comment to match what's actually preserved (message + suppressed cause), since the "caller frames" claim was already largely aspirational on a suspend path. Either fix is fine; the current code is internally inconsistent. |
||
| } | ||
| } | ||
|
Comment on lines
+674
to
+693
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Addressed in 4b5d86f. Added regression test |
||
|
|
||
| initState = InitState.IN_PROGRESS | ||
| if (!shouldRunInit) { | ||
| // Another caller has already started (or completed) init. Honor this method's | ||
| // contract by suspending until initialization is *fully* completed -- not just | ||
| // kicked off. This closes a race where re-entrant suspend callers (e.g. the | ||
| // SyncJobService entry point under SDK_BACKGROUND_THREADING) would otherwise | ||
| // proceed to use IBackgroundService implementations like SessionService whose | ||
| // bootstrap() had not yet run, NPE'ing on still-null model fields. | ||
| Logging.log(LogLevel.DEBUG, "initWithContext: init already in progress or completed, awaiting completion") | ||
| completionToAwait!!.await() | ||
| return@withContext initState == InitState.SUCCESS | ||
|
claude[bot] marked this conversation as resolved.
|
||
| } | ||
|
|
||
| val result = internalInit(context, appId) | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🔴 Sync
initWithContext(context, appId)resetssuspendCompletionand setsinitState=IN_PROGRESSinside the synchronized block (lines 305-315), then callsensureApplicationServiceStarted(context)at line 319 outside any try/catch — the new try/catch only wrapsinternalInit's body. If that call throws (e.g.ApplicationService.startline 81'scontext.applicationContext as Applicationcast on a non-Application context),initStateis leaked atIN_PROGRESSand the just-reset latch is never completed, causing every re-entrant suspend caller (initWithContextSuspend,waitUntilInitInternal) to hang onawait()indefinitely — the exact inverted failure mode this PR'sinternalInitcatch block was designed to prevent. Fix by mirroring the same try/catch around lines 317-332 (or movingensureApplicationServiceStartedinsideinternalInit's protected body).Extended reasoning...
What the bug is
The PR added a try/catch around
internalInit's body (with a comment explicitly naming the hang failure mode it prevents) and a fresh-latch reset undersynchronized(initLock)at the start of every init attempt. But on the syncinitWithContext(context, appId)entry path, there is unprotected code between the latch reset and theinternalInitcall. That gap re-introduces the very hang the catch block was designed to prevent.Code path
OneSignalImp.ktlines 297-333:ensureApplicationServiceStartedreachesApplicationService.start(context)(ApplicationService.ktline 78), which does:That cast throws
ClassCastExceptionfor anyContextwhoseapplicationContextis not anApplicationinstance — uncommon in production but realistic for hosts initializing fromContentProvidercontexts (some apps do this for pre-onCreateinit), test/instrumentation harnesses that wrap or mockContext, or restricted/launcher contexts that return aContextWrapper.Why existing code does not prevent it
The PR's new
try { ... } catch (t: Throwable) { ... initState = FAILED; notifyInitComplete() ... }is scoped tointernalInit's body (OneSignalImp.ktlines ~360-405). It does not wrap the lines between the synchronized block and theinternalInitinvocation in the sync entry path. The PR comment on that catch even enumerates the exact symptom — "would otherwise leave initState at IN_PROGRESS forever andsuspendCompletionuncompleted -- causing re-entrant suspend callers (e.g. SyncJobService) to hang indefinitely onawait()" — but the protection is one stack frame too narrow.isBackgroundThreadingEnableditself is already defensively wrapped (catchesThrowablefrom the feature-manager lookup), so the only realistic throw point in the unprotected window isensureApplicationServiceStarted.Step-by-step proof
MainApplication.onCreate()callsOneSignal.initWithContext(weirdContext, "appId")whereweirdContext.applicationContextis aContextWrapper(not anApplication).initState = IN_PROGRESS,suspendCompletion = L1(fresh).ensureApplicationServiceStarted(weirdContext)→(applicationService as ApplicationService).start(weirdContext)→val application = weirdContext.applicationContext as ApplicationthrowsClassCastException.initWithContext. SDK state:initState=IN_PROGRESS,suspendCompletion=L1uncompleted,applicationServiceStarted=false. Host catches the throw (or the process keeps running because Application onCreate exceptions are sometimes survivable, or a persisted JobService entry outlives the throwing process invocation).SyncJobService.onStartJobfires and callsOneSignal.initWithContext(this)(suspend, no-appId) →initWithContextSuspend(ctx, null).isSDKAccessible()istrue(IN_PROGRESS counts).shouldRunInit=false, capturescompletionToAwait = suspendCompletion = L1.completionToAwait!!.await()— blocks forever on L1, which no code path will ever complete because nointernalInitwas invoked for this generation.SyncJobServicecoroutine hangs until the OS kills the worker, holding its JobService budget slot — the same hang failure mode the PR's catch block was supposed to prevent.The same hang affects
waitUntilInitInternal(line 533): any subsequent accessor call (os.user,os.notifications, etc.) underSDK_BACKGROUND_THREADINGblocks forever inrunBlocking { waitUntilInitInternal() } -> await()on L1.Why this is a regression introduced by this PR
suspendCompletionwas a single-shotval;initWithContextSuspenddid notawait()on it. A re-entrant suspend caller observingisSDKAccessible() == trueearly-returnedtruewithout awaiting (the SDK-4475 NPE this PR fixes). A leaked latch was harmless because nothing awaited.await()on the freshly-reset latch (line 703), andwaitUntilInitInternaldoes the same (line 533). A leaked latch from a pre-internalInitthrow turns the prior NPE-on-stale-state into an indefinite hang. The bug class shifts from "returns too early" to "never returns" — exactly the inversion the PR'sinternalInitcatch was designed to prevent.Impact
Trigger probability is narrow (most hosts pass an
Activity/Application/Servicecontext whoseapplicationContextis theApplication), but the consequence is severe and matches the precise hang the PR is attempting to close. SyncJobService callers hold JobService budget slots until the OS terminates the worker; accessor calls underSDK_BACKGROUND_THREADINGwill block forever. Importantly, the PR's defensive intent is explicit in the new catch comment, so this is a literal gap in the same defensive pattern rather than a hypothetical future hardening.Suggested fix
Mirror the
internalInitcatch around the sync-entry body. Two equivalent options:Option A — wrap the post-lock body:
synchronized(initLock) { ... } try { ensureApplicationServiceStarted(context) if (isBackgroundThreadingEnabled) { suspendifyOnIO { internalInit(context, appId) } return true } return runBlocking(runtimeIoDispatcher) { internalInit(context, appId) } } catch (t: Throwable) { initFailureException = (initFailureException ?: IllegalStateException(...)).apply { addSuppressed(t) } initState = InitState.FAILED notifyInitComplete() throw t }Option B — move
ensureApplicationServiceStartedintointernalInitso it falls under the existing try/catch. Would need to preserve the order with theisBackgroundThreadingEnabledevaluation (which currently relies onapplicationServiceStartedbeing true to avoid returningfalseprematurely).