
Conversation

@magnetised
Contributor

@magnetised magnetised commented Jan 15, 2026

Reduce load on SQLite by moving the core metadata checked by most read requests into ETS, and prevent timeouts during sustained parallel shape creation by writing to SQLite asynchronously in a background process. Writes now go directly to ETS and are flushed to disk in batches.

Only the shape -> handle and handle -> shape lookups still read directly from SQLite. All other (small) metadata is held in an ETS write-through cache populated from SQLite on boot.

Pending writes are flushed to disk on shutdown. This is quick (~1,000 writes/s), but could delay shutdown in an overloaded system, and in-flight writes may be lost if the system crashes.
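The buffering scheme described above can be sketched as a small GenServer; module, table, and function names here are illustrative, not the actual Electric API:

```elixir
defmodule WriteBufferSketch do
  # Illustrative only: writes land in ETS immediately (so reads see
  # them), and a background loop flushes them to SQLite in batches.
  use GenServer

  @flush_interval_ms 50

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts, name: __MODULE__)

  # Client-side write: goes straight into the ordered ETS buffer.
  def add_shape(handle, shape) do
    :ets.insert(:pending_writes, {System.monotonic_time(), {:add, handle, shape}})
    :ok
  end

  @impl true
  def init(_opts) do
    :ets.new(:pending_writes, [:ordered_set, :public, :named_table])
    Process.send_after(self(), :flush, @flush_interval_ms)
    {:ok, %{}}
  end

  @impl true
  def handle_info(:flush, state) do
    flush_batch()
    Process.send_after(self(), :flush, @flush_interval_ms)
    {:noreply, state}
  end

  # Drain remaining writes on shutdown, as the PR description notes.
  @impl true
  def terminate(_reason, _state), do: flush_batch()

  defp flush_batch do
    # Collect pending ops in timestamp order, then apply them to
    # SQLite in one batch (persistence call elided in this sketch).
    entries = :ets.tab2list(:pending_writes)
    Enum.each(entries, fn {ts, _op} -> :ets.delete(:pending_writes, ts) end)
    # apply_batch_to_sqlite(entries)
  end
end
```

A real implementation also needs collision-safe op keys, tombstones for removes, and error handling around the batch write.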

Associated changes:

  • Move ShapeCleaner up in the supervisor hierarchy so it's available before the ShapeStatus process attempts to clean up incomplete/invalid shapes
  • Allow load shedding to be configured via an environment variable, which makes it easier to run load tests locally without maintaining a patch on the default configuration

Summary by CodeRabbit

  • New Features

    • Write-buffered shape metadata with background flushes to reduce DB timeouts
    • Optional max concurrent requests configuration
    • New telemetry for pending shape write buffer size
  • Bug Fixes

    • More accurate snapshot state tracking and consistency during add/remove/flush
  • Chores

    • Adjusted shape-cleaner supervision and startup wiring
  • Tests

    • Updated tests and integration flows to exercise buffering and snapshot semantics


@coderabbitai

coderabbitai bot commented Jan 15, 2026

📝 Walkthrough

Walkthrough

Reworks shape metadata storage to an ETS-backed shape_meta_table and adds a WriteBuffer GenServer that batches shape metadata mutations (add/remove/snapshot) before persisting to SQLite. Query and connection layers made buffer-aware; supervisors, telemetry, config, and tests updated for buffered semantics and labeled checkouts. Public signatures largely preserved.

Changes

Cohort / File(s) Summary
ETS metadata refactor
packages/sync-service/lib/electric/shape_cache/shape_status.ex
Replace shape_last_used_table with shape_meta_table storing 4-tuples (handle, hash, snapshot_started, last_read_time) and update all ETS ops to the new tuple shape.
Shape DB core & queries
packages/sync-service/lib/electric/shape_cache/shape_status/shape_db.ex, packages/sync-service/lib/electric/shape_cache/shape_status/shape_db/query.ex
Add buffer-aware read/write paths that consult the WriteBuffer before SQLite; enqueue adds/removes/snapshot markers; rename streaming/query helpers to expose shape metadata; rename reduce_shape_handles/3 -> reduce_shape_meta/3.
WriteBuffer implementation
packages/sync-service/lib/electric/shape_cache/shape_status/shape_db/write_buffer.ex
New GenServer providing ETS-backed buffering: pending operation queue, tombstones, buffered shape lookups, snapshot queueing, periodic/background flush, synchronous flush API, batching to SQLite, and many inspection helpers (new public API).
Connection pool & telemetry labeling
packages/sync-service/lib/electric/shape_cache/shape_status/shape_db/connection.ex
Checkout/check-out-write APIs extended to accept a label; enqueue/checkout flows carry label; emit OpenTelemetry queue metrics and label explain calls.
Supervisor wiring
packages/sync-service/lib/electric/shape_cache/shape_status/shape_db/supervisor.ex, packages/sync-service/lib/electric/monitored_core_supervisor.ex, packages/sync-service/lib/electric/core_supervisor.ex, packages/sync-service/lib/electric/shapes/supervisor.ex
Add ShapeDb.WriteBuffer child to ShapeDb supervisor; add CleanupTaskSupervisor child via tweaks; remove legacy shape_cleaner from some supervisors.
Telemetry periodic metrics
packages/sync-service/lib/electric/stack_supervisor/telemetry.ex
Add periodic reporting for pending write-buffer size and emit telemetry [:electric, :shape_db, :write_buffer, :pending_writes].
Config and runtime
packages/sync-service/config/runtime.exs
Add optional max_concurrent_requests config read from ELECTRIC_MAX_CONCURRENT_REQUESTS (JSON or nil).
Tests & test helpers
packages/sync-service/test/.../shape_db_test.exs, packages/sync-service/test/support/component_setup.ex, integration-tests/tests/crash-recovery.lux, packages/sync-service/test/.../request_batcher_test.exs
Update tests to use reduce_shape_meta/3, flush write buffer where needed, adjust snapshot state expectations, add small sync delay in one test, and set manual_flush_only: true in test setup.
Changelog / changeset
.changeset/gorgeous-readers-return.md
New changeset documenting the ETS-based write-through cache and asynchronous SQLite writes.
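The shape_meta_table refactor above can be pictured as plain ETS operations on 4-tuple rows; a hypothetical sketch, with table name and tuple positions assumed from the summary:

```elixir
# Rows are {handle, hash, snapshot_started, last_read_time}; the handle
# (position 1) is the key.
table = :ets.new(:shape_meta_table, [:set, :public, read_concurrency: true])

# insert_new avoids clobbering a row added concurrently for the same handle
true = :ets.insert_new(table, {"handle-1", "hash-1", false, System.monotonic_time()})

# flip snapshot_started (position 3) once the snapshot begins
true = :ets.update_element(table, "handle-1", {3, true})

# bump last_read_time (position 4) on the read hot path, feeding the LRU
true = :ets.update_element(table, "handle-1", {4, System.monotonic_time()})

[{_handle, _hash, started, _last_read}] = :ets.lookup(table, "handle-1")
# started is now true
```

update_element mutates a single tuple position in place, which is why per-read bookkeeping like last_read_time stays cheap.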

Sequence Diagram(s)

sequenceDiagram
  participant Client
  participant ShapeDb
  participant WriteBuffer
  participant SQLite

  Client->>ShapeDb: add_shape(handle, shape, hash, relations)
  ShapeDb->>WriteBuffer: enqueue :add (buffer shape, relations, hash)
  WriteBuffer-->>ShapeDb: {:ok, shape_hash}
  Note right of WriteBuffer: background or explicit flush will persist batched ops
  WriteBuffer->>SQLite: batch apply (add/remove/snapshot ops)
  SQLite-->>WriteBuffer: ok
  WriteBuffer-->>ShapeDb: flush completion (cleanup tombstones)
  ShapeDb-->>Client: {:ok, shape_hash}

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Suggested reviewers

  • icehaunter

Poem

🐰 I buffered shapes in an ETSy nest,
Queued snaps and tombs, then let them rest.
Batches woke and scurried to write,
Labels kept checkouts cozy and bright.
Hop, hop—now metadata sleeps through the night.

🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
  • Docstring Coverage (⚠️ Warning): docstring coverage is 43.75%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them.
✅ Passed checks (2 passed)
  • Description Check (✅ Passed): check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check (✅ Passed): the title accurately summarizes the main architectural change, moving SQLite writes to a background thread and buffering read state in ETS, which aligns with the PR's core objective.



@codecov

codecov bot commented Jan 15, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 87.36%. Comparing base (6b7b453) to head (9af4ea1).
⚠️ Report is 11 commits behind head on main.
✅ All tests successful. No failed tests found.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3716      +/-   ##
==========================================
+ Coverage   83.40%   87.36%   +3.96%     
==========================================
  Files          35       23      -12     
  Lines        2380     2011     -369     
  Branches      530      533       +3     
==========================================
- Hits         1985     1757     -228     
+ Misses        393      252     -141     
  Partials        2        2              
Flag Coverage Δ
electric-telemetry ?
elixir ?
packages/experimental 87.73% <ø> (ø)
packages/react-hooks 86.48% <ø> (ø)
packages/start 82.83% <ø> (ø)
packages/typescript-client 93.47% <ø> (ø)
packages/y-electric 56.05% <ø> (ø)
typescript 87.36% <ø> (ø)
unit-tests 87.36% <ø> (+3.96%) ⬆️

Flags with carried forward coverage won't be shown.

☔ View full report in Codecov by Sentry.

@blacksmith-sh

This comment has been minimized.

@magnetised force-pushed the magnetised/sqlite-ets-cache branch from 1cdec3c to 2734d7b on January 15, 2026 at 12:34
@magnetised
Contributor Author

@coderabbitai review

@coderabbitai

coderabbitai bot commented Jan 15, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@magnetised
Contributor Author

benchmark this


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In
`@packages/sync-service/lib/electric/shape_cache/shape_status/shape_db/connection.ex`:
- Around line 106-125: The fallback clause handle_checkout({:checkout, _label},
_from, conn, pool_state) is unreachable because handle_enqueue({:checkout,
label}, pool_state) always enqueues {:checkout, label, now()}, so remove the
dead clause (the two-arity pattern) from connection.ex (or instead replace it
with a clear comment if you intentionally want defensive behavior); update tests
accordingly to ensure no behavior changes.
📜 Review details

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c25526c and 2734d7b.

📒 Files selected for processing (5)
  • packages/sync-service/lib/electric/shape_cache/shape_status.ex
  • packages/sync-service/lib/electric/shape_cache/shape_status/shape_db.ex
  • packages/sync-service/lib/electric/shape_cache/shape_status/shape_db/connection.ex
  • packages/sync-service/lib/electric/shape_cache/shape_status/shape_db/query.ex
  • packages/sync-service/test/electric/shape_cache/shape_status/shape_db_test.exs
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: CR
Repo: electric-sql/electric PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-01-14T14:45:05.838Z
Learning: Define shapes in server/proxy – no client-defined tables/WHERE clauses
📚 Learning: 2026-01-14T14:45:05.838Z
Learnt from: CR
Repo: electric-sql/electric PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-01-14T14:45:05.838Z
Learning: Avoid old Electric patterns (bidirectional SQLite sync, `electrify()` API) – use Electric HTTP streaming with TanStack DB collections instead

Applied to files:

  • packages/sync-service/lib/electric/shape_cache/shape_status/shape_db/connection.ex
🧬 Code graph analysis (2)
packages/sync-service/lib/electric/shape_cache/shape_status/shape_db/connection.ex (3)
packages/sync-service/lib/electric/connection/manager.ex (1)
  • pool_name (180-182)
packages/sync-service/lib/electric/shape_cache/shape_status/shape_db.ex (1)
  • explain (182-186)
packages/sync-service/lib/electric/shape_cache/shape_status/shape_db/query.ex (2)
  • explain (263-265)
  • explain (267-269)
packages/sync-service/lib/electric/shape_cache/shape_status/shape_db.ex (1)
packages/sync-service/lib/electric/shape_cache/shape_status/shape_db/connection.ex (2)
  • checkout_write! (168-180)
  • checkout! (159-166)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (8)
  • GitHub Check: Check and build examples/linearlite-read-only
  • GitHub Check: Check and build examples/tanstack-db-web-starter
  • GitHub Check: Check and build examples/linearlite
  • GitHub Check: Check and build examples/nextjs
  • GitHub Check: Check and build examples/react
  • GitHub Check: Check and build examples/encryption
  • GitHub Check: Check and build examples/proxy-auth
  • GitHub Check: Run Lux integration tests
🔇 Additional comments (18)
packages/sync-service/lib/electric/shape_cache/shape_status/shape_db/query.ex (2)

6-8: LGTM!

The new list_shape_meta query correctly selects the required metadata fields (handle, hash, snapshot_state) for the ETS cache population.


249-253: LGTM!

The stream function correctly transforms the database's integer snapshot_state to a boolean representing "snapshot started" status, consistent with the snapshot_started? function logic at Line 199.

packages/sync-service/lib/electric/shape_cache/shape_status/shape_db.ex (3)

12-15: LGTM!

The updated imports correctly reflect the new function signatures that accept action labels for telemetry tracking.


35-47: LGTM!

Action labels are consistently applied across all checkout operations, enabling meaningful telemetry tracking per operation type.


100-106: LGTM!

The renamed reduce_shape_meta/3 correctly uses the new list_shape_meta_stream and the name accurately reflects the richer metadata being processed.

packages/sync-service/test/electric/shape_cache/shape_status/shape_db_test.exs (1)

176-199: LGTM!

The test comprehensively validates the renamed reduce_shape_meta/3 function, correctly verifying the {handle, hash, snapshot_started} tuple structure with proper setup of both started and non-started snapshots.

packages/sync-service/lib/electric/shape_cache/shape_status/shape_db/connection.ex (3)

132-140: LGTM!

The reduced SQLite page cache (512KB) is appropriate since hot-path lookups now go through the ETS cache. The comment clearly explains the negative value convention.


114-118: LGTM!

The telemetry emission provides valuable observability for pool checkout latency, enabling monitoring and debugging of potential connection pool bottlenecks per operation type.


159-180: LGTM!

The updated function signatures correctly propagate the label through to NimblePool, enabling the telemetry instrumentation in handle_checkout.

packages/sync-service/lib/electric/shape_cache/shape_status.ex (9)

32-34: LGTM!

The tuple structure {handle, hash, snapshot_started, last_read_time} and position constant are clearly documented, making the code self-explanatory.


69-75: LGTM!

The ETS insertion correctly uses insert_new to prevent race conditions and properly sequences the persistent store write before the cache write.


169-175: LGTM!

The hash validation via ETS lookup is an effective optimization, avoiding SQLite queries for shape validation on the hot path.


182-186: LGTM!

The update correctly sequences the persistent write before the cache update and properly uses update_element with the correct tuple position.


190-193: LGTM!

Reading snapshot_started? from ETS cache provides fast hot-path access while maintaining a safe default of false for missing entries.


203-205: Verify: snapshot_complete? not cached in ETS.

While snapshot_started? now reads from the ETS cache (Line 190-193), snapshot_complete? still queries SQLite directly. This asymmetry may be intentional if snapshot_complete? is called less frequently, but it creates inconsistent performance characteristics.

If snapshot_complete? is also on a hot path, consider extending the meta tuple to {handle, hash, snapshot_started, snapshot_complete, last_read_time} and caching both states.


231-267: LGTM!

The LRU algorithm using gb_trees is an efficient approach (O(n) scan with O(log k) tree operations) and correctly handles the new 4-tuple structure.


297-310: LGTM!

The population logic correctly transforms the 3-tuple from ShapeDb.reduce_shape_meta into the 4-tuple required by ETS, initializing all shapes with the same start_time for a fair LRU baseline after restart.


283-295: LGTM!

The ETS table configuration with read_concurrency: true and write_concurrency: :auto is well-suited for a cache that's primarily read-heavy with occasional writes.



@github-actions
Contributor

github-actions bot commented Jan 15, 2026

Benchmark results, triggered for 2734d

  • write fanout completed

write fanout results

  • unrelated shapes one client latency completed

unrelated shapes one client latency results

  • many shapes one client latency completed

many shapes one client latency results

  • concurrent shape creation completed

concurrent shape creation results

  • diverse shape fanout completed

diverse shape fanout results

@magnetised force-pushed the magnetised/sqlite-ets-cache branch from 2734d7b to bb41c9a on January 15, 2026 at 13:49
@balegas
Contributor

balegas commented Jan 19, 2026

Concurrent shape creation looks significantly slower. Do you think this might cause timeouts with high load?

@magnetised force-pushed the magnetised/sqlite-ets-cache branch 2 times, most recently from fee08a1 to 62d685a on January 19, 2026 at 15:32

@magnetised force-pushed the magnetised/sqlite-ets-cache branch from 62d685a to 67a2d9f on January 19, 2026 at 15:42
@magnetised
Contributor Author

benchmark this

@github-actions
Contributor

github-actions bot commented Jan 19, 2026

Benchmark results, triggered for 55864

  • write fanout completed

write fanout results

  • unrelated shapes one client latency completed

unrelated shapes one client latency results

  • many shapes one client latency completed

many shapes one client latency results

  • concurrent shape creation completed

concurrent shape creation results

  • diverse shape fanout completed

diverse shape fanout results


@magnetised
Contributor Author

benchmark this

@github-actions
Contributor

github-actions bot commented Jan 20, 2026

Benchmark results, triggered for b02c0

  • write fanout completed

write fanout results

  • unrelated shapes one client latency completed

unrelated shapes one client latency results

  • many shapes one client latency completed

many shapes one client latency results

  • concurrent shape creation completed

concurrent shape creation results

  • diverse shape fanout completed

diverse shape fanout results

@magnetised self-assigned this on January 20, 2026

@magnetised
Contributor Author

@coderabbitai review

@coderabbitai

coderabbitai bot commented Jan 20, 2026

✅ Actions performed

Review triggered.


@magnetised force-pushed the magnetised/sqlite-ets-cache branch from b9b0fce to fceff29 on January 20, 2026 at 13:52
@magnetised changed the title from "Put ets cache in front of sqlite" to "Move SQLite writes to a background thread and buffer read state" on January 20, 2026
@magnetised marked this pull request as ready for review on January 20, 2026 at 14:29

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 5

🤖 Fix all issues with AI agents
In `@packages/sync-service/config/runtime.exs`:
- Around line 199-206: The current config decodes JSON into atoms using
Jason.decode!(..., keys: :atoms) which can create new atoms and exhaust the BEAM
atom table; update the decode call for max_concurrent_requests to use keys:
:atoms! instead (i.e., replace Jason.decode!(json, keys: :atoms) with
Jason.decode!(json, keys: :atoms!)) so only existing atoms are used; locate the
max_concurrent_requests assignment and the Jason.decode! invocation to make this
change.
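For reference, the difference between the two decode options (illustrative snippet; `:atoms` and `:atoms!` are Jason's documented `keys` values):

```elixir
# keys: :atoms calls String.to_atom/1 on every key, creating new atoms
# on demand. Atoms are never garbage collected, so attacker-controlled
# JSON can exhaust the BEAM atom table.
Jason.decode!(~s({"max": 10}), keys: :atoms)
# => %{max: 10}

# keys: :atoms! uses String.to_existing_atom/1 instead: it only
# succeeds if the atom already exists, raising ArgumentError otherwise,
# which makes it safe for untrusted input.
Jason.decode!(~s({"max": 10}), keys: :atoms!)
```

Since the config already references the expected keys elsewhere, the atoms exist by the time runtime.exs is evaluated, so `:atoms!` is a drop-in hardening.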

In
`@packages/sync-service/lib/electric/shape_cache/shape_status/shape_db/write_buffer.ex`:
- Around line 401-425: The batch write silently ignores non-:ok returns causing
failed DB ops to be dropped; update do_batch_write so each call to
Query.remove_shape, Query.mark_snapshot_started, and
Query.mark_snapshot_complete is enforced with pattern matching (e.g., expect :ok
and let a non-:ok return raise) similar to the existing :ok = Query.add_shape
line, so failures propagate to the rescue block instead of being ignored during
the Enum.each over entries.
- Around line 339-381: flush_until_empty currently leaves entries marked with
flushing=true when do_batch_write returns :error, preventing retries; after
detecting a non-:ok result from do_batch_write(stack_id, entries) you should
iterate the collected entries and reset their flushing flag back to false in the
ops_table so they can be retried later (use the same key shape used by
mark_and_collect_entries/ops_table entries, updating the flushing field via
:ets.update_element or a select/replace on ops_table for each {ts, _op} in
entries). Ensure this logic runs only on failure paths and references the local
variables ops_table, entries, and the flushing flag so successful paths are
unchanged.
- Around line 56-58: The ops table currently keys only by timestamp() which can
collide; add a new function op_key/0 that returns a composite monotonic
timestamp plus a unique integer (e.g., {System.monotonic_time(),
:erlang.unique_integer([:positive])}) and keep timestamp/0 if needed for
readability, then replace all uses of ts = timestamp() in add_shape,
remove_shape, queue_snapshot_started, and queue_snapshot_complete with ts =
op_key() so each op key is ordered and unique, preventing overwrites in the
:ordered_set.
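The suggested composite key could look like the following sketch (flag choice per the comment above; adding `:monotonic` is an optional extra):

```elixir
defmodule OpKeySketch do
  # Composite op key: a monotonic timestamp tie-broken by a unique
  # integer, so two ops enqueued in the same tick never overwrite each
  # other in the :ordered_set. Include :monotonic in the flags if the
  # tie-break order must also match insertion order.
  def op_key do
    {System.monotonic_time(), :erlang.unique_integer([:positive])}
  end
end
```

Because tuples compare element-wise, keys remain sorted by timestamp first, preserving the flush order the buffer relies on.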

In
`@packages/sync-service/test/electric/replication/shape_log_collector/request_batcher_test.exs`:
- Around line 117-120: Replace the fixed Process.sleep(5) with a deterministic
synchronization: wait (via assert_receive) for the RequestBatcher to confirm the
shape is fully enqueued or poll a public/readable status function on
RequestBatcher (e.g., RequestBatcher.enqueued?/2 or similar) until it returns
true before calling RequestBatcher.remove_shape(stack_id, `@shape_handle_2`);
ensure you add or use an existing test hook/message (like an :shape_enqueued
notification) from the add_shape handling so the test can reliably
assert_receive that notification instead of sleeping.
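A deterministic version of that test could look like this sketch; the `notify:` option and `:shape_enqueued` message are hypothetical hooks the batcher would need to gain, and `stack_id` / `@shape_handle_2` are assumed in scope from the existing test:

```elixir
# Instead of Process.sleep/1, have the batcher message the test pid
# once the shape is fully enqueued, then block on that message.
test "remove_shape after add_shape is deterministic", %{stack_id: stack_id} do
  RequestBatcher.add_shape(stack_id, @shape_handle_2, notify: self())

  # assert_receive waits up to the timeout and fails loudly on expiry,
  # rather than racing the batcher like a fixed sleep does
  assert_receive {:shape_enqueued, @shape_handle_2}, 1_000

  RequestBatcher.remove_shape(stack_id, @shape_handle_2)
end
```

assert_receive pins the module attribute's value in the pattern, so the test only proceeds once that specific handle is enqueued.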
🧹 Nitpick comments (2)
packages/sync-service/lib/electric/shape_cache/shape_status/shape_db/connection.ex (1)

123-125: Emit checkout telemetry for non-queued checkouts too.

Fast-path checkouts skip telemetry, which can skew queue-time metrics. Consider emitting the same event with queue_time_μs: 0 in this clause for consistency.

♻️ Suggested adjustment
-  def handle_checkout({:checkout, _label}, _from, conn, pool_state) do
+  def handle_checkout({:checkout, label}, _from, conn, pool_state) do
+    OpenTelemetry.execute(
+      [:electric, :shape_db, :pool, :checkout],
+      %{queue_time_μs: 0},
+      %{label: label, stack_id: Keyword.get(pool_state, :stack_id)}
+    )
     {:ok, conn, conn, pool_state}
   end

(Note the pattern must bind label rather than _label so the telemetry metadata can use it.)
packages/sync-service/lib/electric/shape_cache/shape_status/shape_db/query.ex (1)

223-229: Align snapshot_started? with bitmask semantics.

The comment defines bit 0 as “started.” Using > 0 treats any nonzero state as started. Consider using a bitmask check to match the stated semantics and future-proof if snapshot_state ever becomes 2.

♻️ Suggested adjustment
-    snapshot_started? = snapshot_state > 0
+    snapshot_started? = Bitwise.band(snapshot_state, 1) == 1
📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2734d7b and fceff29.

📒 Files selected for processing (15)
  • integration-tests/tests/crash-recovery.lux
  • packages/sync-service/config/runtime.exs
  • packages/sync-service/lib/electric/core_supervisor.ex
  • packages/sync-service/lib/electric/monitored_core_supervisor.ex
  • packages/sync-service/lib/electric/shape_cache/shape_status.ex
  • packages/sync-service/lib/electric/shape_cache/shape_status/shape_db.ex
  • packages/sync-service/lib/electric/shape_cache/shape_status/shape_db/connection.ex
  • packages/sync-service/lib/electric/shape_cache/shape_status/shape_db/query.ex
  • packages/sync-service/lib/electric/shape_cache/shape_status/shape_db/supervisor.ex
  • packages/sync-service/lib/electric/shape_cache/shape_status/shape_db/write_buffer.ex
  • packages/sync-service/lib/electric/shapes/supervisor.ex
  • packages/sync-service/lib/electric/stack_supervisor/telemetry.ex
  • packages/sync-service/test/electric/replication/shape_log_collector/request_batcher_test.exs
  • packages/sync-service/test/electric/shape_cache/shape_status/shape_db_test.exs
  • packages/sync-service/test/support/component_setup.ex
💤 Files with no reviewable changes (2)
  • packages/sync-service/lib/electric/core_supervisor.ex
  • packages/sync-service/lib/electric/shapes/supervisor.ex
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2026-01-14T14:45:05.838Z
Learnt from: CR
Repo: electric-sql/electric PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-01-14T14:45:05.838Z
Learning: Avoid old Electric patterns (bidirectional SQLite sync, `electrify()` API) – use Electric HTTP streaming with TanStack DB collections instead

Applied to files:

  • packages/sync-service/lib/electric/shape_cache/shape_status/shape_db.ex
  • packages/sync-service/lib/electric/shape_cache/shape_status/shape_db/connection.ex
📚 Learning: 2026-01-15T13:54:51.778Z
Learnt from: magnetised
Repo: electric-sql/electric PR: 3716
File: packages/sync-service/lib/electric/shape_cache/shape_status/shape_db/connection.ex:106-126
Timestamp: 2026-01-15T13:54:51.778Z
Learning: In NimblePool, when a worker is immediately available, the pool skips `handle_enqueue/2` and calls `handle_checkout/4` directly with the original checkout command. When no worker is available, the request goes through `handle_enqueue/2` (where state can be transformed) before `handle_checkout/4` is called. This means both patterns must be handled in `handle_checkout/4` to support both immediate and queued checkouts.

Applied to files:

  • packages/sync-service/lib/electric/shape_cache/shape_status/shape_db/connection.ex
🧬 Code graph analysis (4)
packages/sync-service/lib/electric/monitored_core_supervisor.ex (1)
packages/sync-service/lib/electric/telemetry/sentry.ex (1)
  • set_tags_context (23-25)
packages/sync-service/lib/electric/shape_cache/shape_status.ex (2)
packages/sync-service/lib/electric/shape_cache/shape_status/shape_db/query.ex (2)
  • add_shape (100-131)
  • mark_snapshot_started (176-183)
packages/sync-service/lib/electric/shape_cache/shape_status/shape_db.ex (1)
  • mark_snapshot_started (67-73)
packages/sync-service/test/electric/shape_cache/shape_status/shape_db_test.exs (4)
packages/sync-service/lib/electric/shape_cache/shape_status/shape_db/write_buffer.ex (2)
  • add_shape (205-221)
  • remove_shape (224-240)
packages/sync-service/lib/electric/shape_cache/shape_status/shape_db/query.ex (7)
  • add_shape (100-131)
  • mark_snapshot_started (176-183)
  • shape_for_handle (145-149)
  • handle_for_shape (139-143)
  • mark_snapshot_complete (185-192)
  • remove_shape (194-211)
  • handle_exists? (151-156)
packages/sync-service/lib/electric/shape_cache/shape_status/shape_db.ex (2)
  • mark_snapshot_started (67-73)
  • mark_snapshot_complete (75-81)
packages/sync-service/lib/electric/shapes.ex (1)
  • mark_snapshot_started (123-127)
packages/sync-service/lib/electric/shape_cache/shape_status/shape_db/write_buffer.ex (3)
packages/sync-service/lib/electric/shape_cache/shape_status/shape_db/query.ex (4)
  • add_shape (100-131)
  • remove_shape (194-211)
  • mark_snapshot_started (176-183)
  • mark_snapshot_complete (185-192)
packages/sync-service/lib/electric/shape_cache/shape_status/shape_db/connection.ex (1)
  • checkout_write! (169-181)
packages/sync-service/lib/electric/shape_cache/shape_status/shape_db.ex (2)
  • mark_snapshot_started (67-73)
  • mark_snapshot_complete (75-81)
🔇 Additional comments (26)
packages/sync-service/lib/electric/monitored_core_supervisor.ex (1)

17-17: No action needed — tweaks is guaranteed to be a keyword list by schema validation.

The codebase explicitly defines tweaks as type: :keyword_list in the schema validation (stack_supervisor.ex lines 104–110) with a default of []. This type constraint is enforced across all code paths. Additionally, core_supervisor.ex uses Keyword.fetch!(opts, :tweaks), which would fail immediately if tweaks were ever a map. All existing uses of tweaks in the codebase employ Keyword.get() or Keyword.fetch!() without any defensive logic, confirming that the type is safe and consistent.

Likely an incorrect or invalid review comment.

packages/sync-service/lib/electric/stack_supervisor/telemetry.ex (2)

43-62: Pending-write metric wiring looks good.
Adds the metric and schedules periodic reporting, improving visibility into buffered writes.


84-96: Defensive write-buffer telemetry emission looks good.
Gracefully handles the buffer not being available yet during early startup.

packages/sync-service/test/electric/shape_cache/shape_status/shape_db_test.exs (4)

176-202: Nice coverage for reduce_shape_meta + snapshot_started propagation.
Flushing before reduction makes the SQLite-backed state deterministic in the test.


223-250: Snapshot start/complete tests look good.
These updates align with the new snapshot lifecycle APIs.


273-277: Helper keeps snapshot assertions compact.
Makes the new snapshot-state checks easier to follow.


336-415: Good write-buffer behavior coverage.
The suite exercises buffered visibility, flush behavior, tombstones, and snapshot transitions well.

packages/sync-service/lib/electric/shape_cache/shape_status.ex (5)

32-61: shape_meta_table initialization looks solid.
The new meta-table lifecycle (create/populate) is clear and coherent.


69-139: Add/remove/reset keep the meta cache in sync.
The ETS updates align with the buffered write path.


157-194: Handle validation + snapshot_started caching aligns with the new tuple.
Consistent use of the meta table makes these paths straightforward.


216-263: LRU/last_read_time updates align with the new tuple layout.
The fold now correctly pulls the last_read_time element.


275-304: Meta-table helpers are clean.
Creation and population helpers look good.

packages/sync-service/lib/electric/shape_cache/shape_status/shape_db.ex (7)

2-23: Updated ShapeDb doc + buffering wiring look good.
The module-level intent is clear.


39-86: Buffered mutation path looks consistent.
Add/remove/snapshot/reset all route through the buffer cleanly.


88-131: Buffer-first reads with tombstone checks look good.
The fallback to SQLite is clear and safe.


133-176: Merged buffered/SQLite listing with tombstone filtering looks good.
Deduping by handle makes sense here.


184-209: Reduction/count paths align with the buffer model.
The updated flow matches the new read/write semantics.


216-248: Buffer-aware handle checks and pre-validation flush are good.
This keeps validation consistent with persisted state.


256-260: Pending-buffer metric accessor is a good addition.
Useful for telemetry and load testing.

packages/sync-service/lib/electric/shape_cache/shape_status/shape_db/write_buffer.ex (7)

1-62: Module structure and high-level API look good.
The doc + API layout is easy to follow.


64-69: flush_sync API is handy for tests and shutdown.
Nice addition for deterministic flushes.


73-127: Lookup helpers and tombstone gating read clearly.
The semantics are easy to reason about.


129-153: Buffered shape listing/count helpers look good.
They match the expected buffered semantics.


155-200: Relation filtering and tombstone accounting look consistent.
The logic aligns with the Query-level expectations.


258-315: Init and ETS table setup look good.
The table configuration and manual_flush_only flag are clear.


317-333: Poll/flush/terminate wiring looks good.
The periodic flush loop is straightforward.


@blacksmith-sh

This comment has been minimized.

@coderabbitai coderabbitai bot left a comment
Actionable comments posted: 1

🤖 Fix all issues with AI agents
In
`@packages/sync-service/lib/electric/shape_cache/shape_status/shape_db/write_buffer.ex`:
- Around lines 207-224: when re-adding a handle in add_shape, an earlier tombstone can still hide the new add until the next flush. Update add_shape (which resolves its tables via shapes_table_name/ops_table and inserts into ops_table) to clear any pending tombstone/remove op for that handle before queuing the {:add, ...} entry, e.g. delete matching {:remove, handle, ...} entries from ops_table (using :ets.match_delete or lookup + delete), so that immediate reads (lookup_*/has_handle?) reflect the new add.
🧹 Nitpick comments (1)
packages/sync-service/lib/electric/shape_cache/shape_status/shape_db/write_buffer.ex (1)

24-27: Clarify ops table key shape in docs.

Line 24 describes a single timestamp, but the key is now a composite from op_key/0. Update the doc to prevent confusion.

♻️ Suggested doc tweak
-  `{timestamp, operation, flushing}`
+  `{op_key, operation, flushing}` (where op_key is a composite monotonic key)
📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between fceff29 and 9af4ea1.

📒 Files selected for processing (2)
  • .changeset/gorgeous-readers-return.md
  • packages/sync-service/lib/electric/shape_cache/shape_status/shape_db/write_buffer.ex
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: CR
Repo: electric-sql/electric PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-01-14T14:45:05.838Z
Learning: Define shapes in server/proxy – no client-defined tables/WHERE clauses
📚 Learning: 2026-01-14T14:45:05.838Z
Learnt from: CR
Repo: electric-sql/electric PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-01-14T14:45:05.838Z
Learning: Avoid old Electric patterns (bidirectional SQLite sync, `electrify()` API) – use Electric HTTP streaming with TanStack DB collections instead

Applied to files:

  • .changeset/gorgeous-readers-return.md
🧬 Code graph analysis (1)
packages/sync-service/lib/electric/shape_cache/shape_status/shape_db/write_buffer.ex (3)
packages/sync-service/lib/electric/shape_cache/shape_status/shape_db/supervisor.ex (3)
  • name (6-8)
  • start_link (10-12)
  • init (14-52)
packages/sync-service/lib/electric/shape_cache/shape_status/shape_db/query.ex (3)
  • add_shape (100-131)
  • mark_snapshot_started (176-183)
  • mark_snapshot_complete (185-192)
packages/sync-service/lib/electric/shape_cache/shape_status/shape_db.ex (2)
  • mark_snapshot_started (67-73)
  • mark_snapshot_complete (75-81)
🔇 Additional comments (4)
packages/sync-service/lib/electric/shape_cache/shape_status/shape_db/write_buffer.ex (3)

74-110: Lookup paths look solid.

Tombstone-first filtering and buffer lookups are clear and consistent.


269-331: Lifecycle wiring is clean.

Init, polling, and synchronous flush hooks read clearly and are easy to reason about.


342-433: Flush + cleanup flow is cohesive.

Batching, cleanup, and retry behavior are well-structured.

.changeset/gorgeous-readers-return.md (1)

1-5: Changeset summary aligns with the implementation.


## Crash electric
[shell electric]
# flush pending writes so that the shape we just created is definitely in the db
!Electric.ShapeCache.ShapeStatus.ShapeDb.WriteBuffer.flush_sync("single_stack")
Contributor

If we are first flushing the changes then we are not simulating crash recovery: the test should cover the behaviour of a crash, including unflushed write buffer changes, and if that breaks Electric we should fix it, no?

Contributor

@msfstef msfstef left a comment

Will give it a more detailed look tomorrow. Main comments: 1) many full table scans and tab2lists on the buffer, but it's not guaranteed to be small (?); 2) the crash test now doesn't actually test a crash (?) - how do we handle the in-flight requests being lost?

Will leave the Opus review below as well.

tombstones = tombstones_table_name(stack_id)

shapes_table_name(stack_id)
|> :ets.tab2list()
Contributor

I would rather avoid tab2list where possible and instead do a reduce/fold - less memory pressure and scale independent?
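As a sketch of the suggestion above, a fold can replace the tab2list materialisation. This is illustrative only: the module name and the `{handle, shape_bin, comparable, hash, relations}` tuple layout are assumptions, not the actual buffer code.

```elixir
# Hypothetical sketch: stream over the buffered shapes with :ets.foldl/3
# instead of materialising the whole table via :ets.tab2list/1.
# The 5-element row layout is an assumption for illustration.
defmodule WriteBufferFoldSketch do
  def list_buffered_shapes(shapes_table, tombstones_table) do
    :ets.foldl(
      fn {handle, _bin, _cmp, _hash, _rels} = row, acc ->
        # Skip rows hidden by a pending tombstone; :ets.member/2 is a
        # constant-time key check, not a scan.
        if :ets.member(tombstones_table, handle), do: acc, else: [row | acc]
      end,
      [],
      shapes_table
    )
  end
end
```

The accumulator holds only the surviving rows, so peak memory tracks the result size rather than the full table.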

tombstones = tombstones_table_name(stack_id)

shapes_table_name(stack_id)
|> :ets.tab2list()
Contributor

same as above - prefer reduce over tab2list

end

defp has_buffered_add?(ops_table, handle) do
:ets.match_object(ops_table, {:_, {:add, handle, :_, :_, :_, :_}, :_}) != []
Contributor

Also a full table scan (per tombstone). This is for telemetry only and again the buffer should be small, but it is a bit scary to do a full scan per tombstone entry; surely there are more efficient ways of doing this (since we're doing tab2list on the tombstones anyway), and we don't need to traverse the whole table every time. Having some sort of index might be worth it?

@doc "Look up a handle by comparable shape binary. Returns {:ok, handle} or :not_found."
def lookup_handle(stack_id, comparable_binary) do
tombstones = tombstones_table_name(stack_id)
pattern = {:_, :_, comparable_binary}
Contributor

To clarify, this is a full table scan. I guess the idea is that the buffer should be small and we can afford this, but if write intensity is high won't this be an issue? And it's on the critical path as well.

# Write functions

@doc "Add a shape to the buffer"
def add_shape(stack_id, handle, shape_binary, comparable_binary, hash, relations) do
Contributor

To clarify - there is no limit to the size of this buffer, where it would force a flush and apply back pressure forcefully, right?
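For illustration, a size-based backpressure check could look like the sketch below. The threshold and all names are made up; nothing like this exists in the PR.

```elixir
# Hypothetical bound on the write buffer: when the ops table crosses the
# threshold, the caller is told to flush synchronously before writing more,
# turning unbounded growth into backpressure on writers.
defmodule BoundedBufferSketch do
  @max_buffered_ops 10_000

  def check_pressure(ops_table) do
    # :ets.info(table, :size) reads the row count ETS already tracks,
    # so this check is cheap even on a large buffer.
    if :ets.info(ops_table, :size) >= @max_buffered_ops,
      do: :flush_now,
      else: :ok
  end
end
```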

Contributor

@msfstef msfstef left a comment

Code Review

This is a well-structured performance optimization that introduces an ETS-based write buffer to prevent SQLite timeout cascades. The latest commit addresses some initial concerns (flushing flag reset, operation key uniqueness), but a few issues remain.

🔴 Critical: Crash Recovery Test Semantics

integration-tests/tests/crash-recovery.lux:50-52

The test now explicitly flushes before crashing:

!Electric.ShapeCache.ShapeStatus.ShapeDb.WriteBuffer.flush_sync("single_stack")

This defeats the purpose of crash recovery testing. The test should verify that Electric handles unflushed writes correctly on restart (even if that means the shape doesn't exist). This is a fundamental change in crash semantics that should be addressed in the code/documentation, not papered over in tests.


🔴 Critical: Data Loss on Crash

Acknowledged in the PR description, but worth emphasizing: shapes created just before a crash may not exist after restart. This changes the durability guarantees and should be clearly documented for operators.


🟠 Moderate: TOCTOU Race in handle_for_shape

packages/sync-service/lib/electric/shape_cache/shape_status/shape_db.ex:72-88

case checkout!(stack_id, :handle_for_shape, &Query.handle_for_shape(&1, comparable_shape)) do
  {:ok, handle} ->
    if WriteBuffer.is_tombstoned?(stack_id, handle) do  # Race: tombstone could be added here
      :error
    else
      {:ok, handle}
    end

Between the SQLite lookup and the tombstone check, a concurrent remove_shape could tombstone the handle. The window is small but exists under high concurrency.


🟠 Moderate: Linear Scans in Buffer Lookups

write_buffer.ex:136-152 (lookup_handle) and write_buffer.ex:171-172 (has_buffered_add?) use full table scans with match_object. Under the exact high-load conditions this PR targets (many concurrent shape creations), these could become bottlenecks.

Consider a secondary index for comparable_binary -> handle lookups.
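A minimal sketch of that suggestion, assuming a plain ETS set per direction; none of these module or table names exist in the PR.

```elixir
# Hypothetical secondary index: keep a second ETS set keyed on the comparable
# binary so lookup_handle becomes a constant-time :ets.lookup instead of a
# full match_object scan. Both tables must be mutated together.
defmodule BufferIndexSketch do
  def new do
    shapes = :ets.new(:buffer_shapes, [:set, :public])
    by_cmp = :ets.new(:buffer_by_comparable, [:set, :public])
    {shapes, by_cmp}
  end

  def add({shapes, by_cmp}, handle, cmp_binary) do
    :ets.insert(shapes, {handle, cmp_binary})
    :ets.insert(by_cmp, {cmp_binary, handle})
  end

  def remove({shapes, by_cmp}, handle, cmp_binary) do
    :ets.delete(shapes, handle)
    :ets.delete(by_cmp, cmp_binary)
  end

  def lookup_handle({_shapes, by_cmp}, cmp_binary) do
    case :ets.lookup(by_cmp, cmp_binary) do
      [{^cmp_binary, handle}] -> {:ok, handle}
      [] -> :not_found
    end
  end
end
```

Since all writes already go through one GenServer, keeping the two tables in step does not need extra locking.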


🟡 Minor: Test Flakiness

request_batcher_test.exs:118

# Small delay to ensure the add_shape GenServer.call is processed before remove
Process.sleep(5)

Adding Process.sleep to fix ordering suggests there's a real race condition that could manifest under CI load.


✅ Fixed in Latest Commit

  • Flushing flag reset on error - Now properly allows retry on failed batches
  • Operation key uniqueness - Uses {monotonic_time, unique_integer} tuple to prevent collisions
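The key scheme described in that bullet can be sketched as follows (module name assumed):

```elixir
# Sketch of a collision-free operation key: monotonic time first, with a
# strictly increasing unique integer as tie-breaker, so two ops created in
# the same monotonic-time unit still compare in creation order.
defmodule OpKeySketch do
  def op_key do
    {System.monotonic_time(), System.unique_integer([:monotonic])}
  end
end
```

Because Erlang compares tuples element by element, ordering an ordered_set ETS table on this key yields flush batches in creation order.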

Summary

| Severity | Issue | Status |
| --- | --- | --- |
| 🔴 Critical | Crash recovery test doesn't test actual crash recovery | Open |
| 🔴 Critical | Data loss on crash (needs documentation) | Open |
| 🟠 Moderate | TOCTOU race in handle lookups | Open |
| 🟠 Moderate | Linear scans in buffer lookups | Open |
| 🟡 Minor | Test flakiness with Process.sleep | Open |

The crash recovery test concern is the most pressing - the current test is not validating what it claims to validate.


assert_receive :adding_shape_2
# Small delay to ensure the add_shape GenServer.call is processed before remove
Process.sleep(5)
Contributor

Why not wait for the :processor_called message here as well? After that we wait for the processor_delay set above, so the deduplication should work fine. Generally we patch the processor call entirely, so I'm unclear on why we need a sleep here.


4 participants