Skip to content

chore: hotfix #3361 to v1.1.x#3362

Open
julienrbrt wants to merge 5 commits into
v1.1.xfrom
julien/hotfix-ws
Open

chore: hotfix #3361 to v1.1.x#3362
julienrbrt wants to merge 5 commits into
v1.1.xfrom
julien/hotfix-ws

Conversation

@julienrbrt

Copy link
Copy Markdown
Member

Overview

Bring #3361 to a v1.1.x branch as main is already v1.2.x because of the key rotation PR.

@julienrbrt julienrbrt changed the base branch from main to v1.1.x June 19, 2026 11:29
@github-actions

github-actions Bot commented Jun 19, 2026

Copy link
Copy Markdown
Contributor

The latest Buf updates on your PR. Results from workflow CI / buf-check (pull_request).

BuildFormatLintBreakingUpdated (UTC)
✅ passed⏩ skipped✅ passed✅ passedJun 20, 2026, 9:12 PM

@claude

claude Bot commented Jun 19, 2026

Copy link
Copy Markdown
Contributor

Claude finished @julienrbrt's task in 2m 50s —— View job


Code Review: Hotfix #3361 → v1.1.x

Overview: This backport adds WebSocket fallback resilience to the DA client — when the initial WS dial fails, it falls back to HTTP polling and spawns a background goroutine that retries the WS connection. When WS becomes available, the subscriber transparently upgrades from pollLoop to followLoop.


  • Gather PR context
  • Read changed files and understand the backport
  • Review code quality, error handling, concurrency
  • Review security and performance
  • Review test coverage
  • Post comprehensive review feedback

Findings

1. Data Race in tryUpgradeWS — Function Pointer Swap (Medium-High)

In pkg/da/jsonrpc/client.go:157-159, the Internal structs are replaced under c.mu:

c.mu.Lock()
c.Blob.Internal = wsClient.Blob.Internal   // write under lock
c.Header.Internal = wsClient.Header.Internal
c.IsWebSocket.Store(true)
c.mu.Unlock()

However, concurrent goroutines executing RPCs (e.g. blobAPI.GetAll, blobAPI.Subscribe) read blobAPI.Internal.GetAll without holding c.mu. Since block/internal/da/client.go:131 stores a raw pointer: blobAPI: &cfg.DA.Blob, any concurrent RPC call during the swap constitutes a data race that the Go race detector would flag.

The atomic IsWebSocket.Store(true) does guarantee that goroutines which subsequently read IsWebSocket.Load() == true will see the updated Internal fields (Go 1.19+ sequentially-consistent atomics), but goroutines that are already executing RPCs at the time of the swap are not protected.

Suggested mitigation: Guard reads inside the RPC call wrappers with c.mu.RLock(), or introduce an atomic.Pointer over the entire Internal struct. Fix this →


2. Close() Does Not Nil the Closer Under Lock — Double-Close Risk (Low)

pkg/da/jsonrpc/client.go:30-44:

c.mu.Lock()
// retryCancel is nilled under lock ✓
closer := c.closer   // copied but NOT nilled
c.mu.Unlock()
if closer != nil {
    closer()         // two concurrent Close() calls both reach here
}

If two goroutines call Close() concurrently, both will copy the non-nil closer and call it twice. Add c.closer = nil under the lock. Fix this →


3. pollLoopfollowLoop Transition Not Tested (Low-Medium)

All existing tests mock SupportsSubscribe() to return true, so pollLoop is never exercised. There is no test for:

  • SupportsSubscribe() == false on startup → poll path taken
  • Transport upgrade mid-run → pollLoop switches to followLoop

This is the core new behavior introduced in the hotfix, yet codecov confirms block/internal/da/subscriber.go has only 7.14% patch coverage. Consider a test that starts with SupportsSubscribe()→false, delivers a few heights via GetLatestDAHeight, then flips the flag and verifies followLoop takes over.


4. WS Check Not in the select — Up to daBlockTime Delay on Upgrade (Negligible)

block/internal/da/subscriber.go:195-211:

for {
    if s.client.SupportsSubscribe() {  // checked once per tick
        ...
    }
    select {
    case <-ctx.Done():
    case <-ticker.C:
        s.pollDAHeight(ctx)
    }
}

After WS is restored, detection is delayed by up to one full daBlockTime tick. For Celestia (~12 s block time) this is 30 s max delay, which is acceptable. Adding a dedicated ticker or notification channel could reduce this, but it adds complexity for a minor win.


5. wsClient.Close() Called Under c.mu — Potential Deadlock Scrutiny (Non-Issue, Worth Noting)

c.mu.Lock()
if c.IsWebSocket.Load() {
    wsClient.Close()   // acquires wsClient.mu, not c.mu
    return true
}

wsClient.Close() acquires wsClient.mu, which is a different mutex from c.mu. No deadlock risk. Worth noting because the code pattern (calling a method that acquires its own lock while holding another lock) can be subtle — but the analysis confirms it is safe here.


Positive Highlights

  • Clean interface extension: SupportsSubscribe() on the Client interface is well-placed and properly threaded through tracedClient, DummyDA, and all mocks.
  • Atomic-first design: Using atomic.Bool for IsWebSocket is the right primitive here — avoids lock contention on the hot-path read in SupportsSubscribe().
  • retryWSLoop cleanup: retryCancel is correctly set before the goroutine is started and properly cancelled in Close(), avoiding a goroutine leak.
  • pollLoop WS check ordering: Checking before entering the select ensures the transition to followLoop doesn't wait an extra tick after WS is detected.
  • wg.Add(1) before go s.followLoop: The pollLoop correctly increments the WaitGroup before spawning followLoop and decrements itself via defer, keeping the Stop() contract intact.
  • pollDAHeight idempotency: Correctly uses updateHighest (CAS loop) so concurrent calls from catchupLoop inline processing don't clobber each other.

Summary

The logic is sound and the approach is correct. The main concern worth addressing before merge is the data race on Internal struct swap (finding #1) — it's unlikely to manifest in practice but would be caught by -race in CI. The double-close in finding #2 is a minor defensive fix. Test coverage for the pollLoop path (finding #3) would also meaningfully increase confidence in the fallback behavior.

@coderabbitai

coderabbitai Bot commented Jun 19, 2026

Copy link
Copy Markdown
Contributor

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: a1ddc87e-b394-454c-a255-0fc5ea6dfb99

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch julien/hotfix-ws

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai

coderabbitai Bot commented Jun 19, 2026

Copy link
Copy Markdown
Contributor

Caution

Review failed

An error occurred during the review process. Please try again later.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch julien/hotfix-ws

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@codecov

codecov Bot commented Jun 19, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 9.47368% with 86 lines in your changes missing coverage. Please review.
✅ Project coverage is 60.62%. Comparing base (d208cd0) to head (934371c).

Files with missing lines Patch % Lines
pkg/da/jsonrpc/client.go 10.41% 41 Missing and 2 partials ⚠️
block/internal/da/subscriber.go 7.14% 38 Missing and 1 partial ⚠️
block/internal/da/client.go 33.33% 2 Missing ⚠️
block/internal/da/tracing.go 0.00% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           v1.1.x    #3362      +/-   ##
==========================================
- Coverage   60.98%   60.62%   -0.36%     
==========================================
  Files         127      127              
  Lines       13879    13969      +90     
==========================================
+ Hits         8464     8469       +5     
- Misses       4491     4573      +82     
- Partials      924      927       +3     
Flag Coverage Δ
combined 60.62% <9.47%> (-0.36%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant