Skip to content

fix: [SDK-4474] self-heal SDK-4388-stuck push subscriptions on session start#2636

Open
sherwinski wants to merge 1 commit intoar/sdk-4388-android-sdk-subscription-state-permanently-stuck-at-neverfrom
sherwin/sdk-4474-self-heal-stuck-push-subscriptions
Open

fix: [SDK-4474] self-heal SDK-4388-stuck push subscriptions on session start#2636
sherwinski wants to merge 1 commit intoar/sdk-4388-android-sdk-subscription-state-permanently-stuck-at-neverfrom
sherwin/sdk-4474-self-heal-stuck-push-subscriptions

Conversation

@sherwinski
Copy link
Copy Markdown
Contributor

Description

One Line Summary

On session start, detect when the server's record of the device's push subscription is disabled while the local model says enabled-and-opted-in, and re-assert local truth via PATCH /subscriptions/{id} so users already stuck by SDK-4388 self-heal on next launch — no consumer code changes, no migration script, no toggle.

Details

Motivation

The fix in #2627 (parent PR) prevents new installs from getting stuck by flipping POST /subscriptions to PATCH /subscriptions/{id} in SubscriptionOperationExecutor when the subscription id is server-assigned. That covers the forward path, but it does not rescue the cohort that already triggered the bug on an older SDK build:

  • The buggy POST /subscriptions returned 200 {} (server-side no-op for an existing id).
  • The merged create-subscription + update-subscription(SUBSCRIBED) group was therefore acked and dropped from the local op queue.
  • On in-place upgrade to the fixed SDK there is nothing left in the queue to retry, and the server view stays at enabled:false, notification_types:0 permanently.

I confirmed this empirically with the SDK-4388 demo-app repro: force-stop on the buggy build (server stuck), upgrade to the fixed build, relaunch — no PATCH /subscriptions/{id} is ever observed.

Scope

  • Affected: RefreshUserOperationExecutor.getUser. On every session start the SDK already issues GET /users/by/onesignal_id/{osid}, whose response carries enabled / notification_types per subscription. We now inspect the entry whose id matches the device's cached push subscription id, and if local.enabled && !server.enabled, we emit a follow-up UpdateSubscriptionOperation returned via ExecutionResponse.operations. That op runs through the standard executor path and produces one PATCH /subscriptions/{id} carrying the device's true state.
  • Unchanged: "Device is the source of truth for push" policy. We still do not hydrate the server's push subscription into SubscriptionModelStore (the existing if (subscriptionModel.type != SubscriptionType.PUSH) filter is preserved). The new code only enforces that policy across the wire when divergence is detected.
  • No-op for healthy users. Server-matches-local triggers no PATCH.
  • Respects user intent. A locally-opted-out subscription (optedIn=false, status UNSUBSCRIBE) or one with NO_PERMISSION does not trigger a reconcile — those server-side disables are correct, not stuck.

Cost

One extra PATCH /subscriptions/{id} per stuck device on its first foreground session under the new SDK build. Healthy devices pay nothing. Once the PATCH succeeds, server matches local and no further reconciles fire.

Testing

Unit testing

Added 4 cases to RefreshUserOperationExecutorTests:

  1. divergence — server enabled:false, notification_types:0 + local opted-in + SUBSCRIBED + non-empty token → returns exactly one UpdateSubscriptionOperation with enabled=true, status=SUBSCRIBED, correct ids.
  2. healthy — server enabled:true, notification_types:1 + local opted-in → no follow-up op.
  3. opted-out — server enabled:false, notification_types:-2 + local optedIn=false → no follow-up op (UNSUBSCRIBE is intentional).
  4. NO_PERMISSION — server enabled:false, notification_types:0 + local status=NO_PERMISSION → no follow-up op (OS denial is real).

The existing happy-path test refresh user is successful and models are hydrated properly was updated to set notificationTypes = 1 on the cached push subscription so it stays on the new healthy path.

Test gate: ./gradlew :OneSignal:core:testReleaseUnitTest --tests \"...RefreshUserOperationExecutorTests\" → 10/10 PASS. Two unrelated SDKInitTests failures exist on the parent branch and are not introduced by this PR.

Detekt + production assemble (:OneSignal:core:detekt, :OneSignal:core:assembleRelease) clean.

Manual testing

End-to-end repro and validation are intended to run on the demo app shipped with the parent SDK-4388 branch (which has the "Reproduce SDK-4388" + "Self-heal test" sections):

  1. Force-stop demo on the broken SDK build with cached stuck state (enabled:false, notification_types:0 on dashboard).
  2. Upgrade in place to this branch, relaunch.
  3. Expect on first session: GET /users/by/onesignal_id/{osid} followed by a single PATCH /subscriptions/{id} with enabled:true, notification_types:1.
  4. Dashboard subscription flips from "Never Subscribed" to "Subscribed".
  5. On subsequent sessions: only GET (no further PATCH) — divergence check is a no-op once reconciled.

Affected code checklist

  • Notifications
    • Display
    • Open
    • Push Processing
    • Confirm Deliveries
  • Outcomes
  • Sessions
  • In-App Messaging
  • REST API requests — adds at most one PATCH /subscriptions/{id} per stuck device on first session post-upgrade
  • Public API changes

Checklist

Overview

  • I have filled out all REQUIRED sections above
  • PR does one thing — adds a single divergence detection branch in RefreshUserOperationExecutor.getUser
  • No public API changes

Testing

  • Unit test coverage added for all four boundary cases
  • All automated tests pass (pre-existing SDKInitTests failures are unrelated to this change)
  • Personally tested on device — pending end-to-end verification with the demo-app repro flow described above

Final pass

  • Code is as readable as possible — divergence logic isolated in a single helper buildPushSelfHealOperationOrNull with explanatory comments referencing SDK-4388 / SDK-4474
  • Self-reviewed

Branching: This PR targets the SDK-4388 fix branch (#2627 parent) so the two changes can land together as a complete "prevent new + heal old" pair. If the team prefers to ship the SDK-4388 fix first and follow up with this independently, retarget to main.

Refs: SDK-4474, SDK-4388

…n start

The SDK-4388 fix prevents new installs from getting stuck (POST -> PATCH
flip in SubscriptionOperationExecutor), but already-stuck users on older
SDK builds will not self-heal on a plain in-place upgrade: the buggy POST
already returned 200 {} and the local op queue is empty by the time the
fixed SDK runs.

Add a divergence check to RefreshUserOperationExecutor.getUser. The SDK
already calls GET /users/by/onesignal_id/{osid} on every session start,
and the response carries the server view of `enabled` /
`notification_types`. The "device is source of truth" policy for push
subscriptions is preserved (we still don't hydrate the server's push
subscription into the local model), but when the cached local model
resolves to enabled-and-opted-in while the server's record of the same
subscription id reports disabled, we now emit a follow-up
UpdateSubscriptionOperation. That op runs through the standard executor
path and re-asserts local truth via PATCH /subscriptions/{id} on the
next session start - one extra request per stuck device, no consumer
code changes, no migration script, no toggle.

Tests cover the four boundary cases in RefreshUserOperationExecutorTests:
divergence (server stuck-disabled, local enabled) emits the follow-up
op; healthy (server matches local) does not; locally-opted-out does not
(opt-out is intentional); local NO_PERMISSION does not (OS denial is
real).

Refs: SDK-4388
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant