Pull request overview
This PR updates diskless cluster replication error propagation so that, when the streaming checkpoint “reset CTS” exception is injected, the replica attach failure surfaces the injected exception message (instead of a generic “wait faulted” message), and adjusts the associated debug test expectation.
Changes:
- Update diskless sync to set replica session status to `FAILED` with the underlying exception message when the streaming checkpoint watchdog (`WaitOrDie`) faults.
- Update the diskless sync status setter to preserve the first error message.
- Update the debug-only cluster test to assert the injected exception message.
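The first two changes can be illustrated with a minimal sketch. It is not the code from this PR: every type and member name below (`SyncStatusSketch`, `SessionStatusSketch`, `WatchdogSketch.ObserveAsync`) is a hypothetical stand-in, and the only behavior it assumes is what the list above states, namely that the failure reason is recorded before cancellation and that the first recorded message is kept.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// All names here are hypothetical stand-ins, not Garnet's actual types.
public enum SyncStatusSketch { InProgress, Success, Failed }

public sealed class SessionStatusSketch
{
    private readonly object gate = new object();

    public SyncStatusSketch Status { get; private set; } = SyncStatusSketch.InProgress;
    public string Error { get; private set; } = "";

    // Always record the latest status, but keep the first non-empty error
    // message so a later generic message cannot overwrite the root cause.
    public void SetStatus(SyncStatusSketch status, string error = "")
    {
        lock (gate)
        {
            Status = status;
            if (error.Length > 0 && Error.Length == 0)
                Error = error;
        }
    }
}

public static class WatchdogSketch
{
    // Await a watchdog task; if it faulted, record its message on the
    // session before canceling, following the ordering described above.
    public static async Task ObserveAsync(
        Task watchdog, SessionStatusSketch session, CancellationTokenSource cts)
    {
        try
        {
            await watchdog.ConfigureAwait(false);
        }
        catch (Exception ex)
        {
            session.SetStatus(SyncStatusSketch.Failed, ex.Message);
            cts.Cancel();
            // A later, generic failure report keeps the first message.
            session.SetStatus(SyncStatusSketch.Failed, "wait faulted");
        }
    }
}
```

With this shape, a test that injects an exception into the watchdog can assert on the injected message itself, because the generic message written after cancellation no longer replaces it.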
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| test/Garnet.test.cluster/ReplicationTests/ClusterReplicationDisklessSyncTests.cs | Adjusts expected error string for the exception-injection scenario. |
| libs/cluster/Server/Replication/PrimaryOps/DisklessReplication/ReplicationSyncManager.cs | Propagates WaitOrDie failure reason into per-session status before canceling. |
| libs/cluster/Server/Replication/PrimaryOps/DisklessReplication/ReplicaSyncSession.cs | Preserves the first error message recorded for a session. |
TalZaccai
approved these changes
Feb 18, 2026
TalZaccai
added a commit
that referenced
this pull request
Feb 26, 2026
* fix ClusterDisklessSyncResetSyncManagerCts
* set message only when error occurs
* address comment
---------
Co-authored-by: Tal Zaccai <talzacc@microsoft.com>
TalZaccai
added a commit
that referenced
this pull request
Feb 27, 2026
* Update Azure Cosmos DB Garnet Cache docs (#1548)
* update registration process and troubleshooting
* update phrasing
* update email
* Update website/docs/azure/faq.md
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
---------
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* wait for recovery before issuing get keys (#1553)
* Parallel ACL test fixes (#1554)
* Parallel ACL tests sometimes run forever, cleaned up to properly use async and also check server responses.
* nit
* format
* timeouts
* reduce timeout
* address comments
* nit
* nit
* Misc fixes: epoch sharing, IEpochAccessor refactoring, lock improveme… (#1555)
* Misc fixes: epoch sharing, IEpochAccessor refactoring, lock improvements, test fixes, and BDN benchmarks
* Update libs/storage/Tsavorite/cs/src/core/TsavoriteLog/TsavoriteLog.cs
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* add using
* Improve SingleWriterMultiReaderLock
* Revert "Improve SingleWriterMultiReaderLock"
This reverts commit 394e11c.
* rename ownedEpoch
---------
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Work around receive latency increasing with larger buffers (#1546)
* shrink receive buffer if it grows past maximum configured - but only if buffer was large enough to serve last request in the first place
* Update libs/common/Networking/NetworkHandler.cs
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Update libs/common/Networking/NetworkHandler.cs
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* don't shrink if we still have pending data greater than the maximum
* use correct variable
---------
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Replace spin-wait with semaphore-based backoff for epoch table exhaustion (#1543)
* When hundreds of threads compete for epoch table entries, the previous spin-wait loop in ReserveEntry caused 100% CPU utilization due to tight spinning with Thread.Yield().
Changes:
- Add SemaphoreSlim-based wait mechanism for threads when epoch table is full
- Split ReserveEntry into fast path (TryAcquireEntry) and slow path (ReserveEntryWait)
- Fast path: probes startOffset1, startOffset2, then circles table twice - fully inlinable
- Slow path: uses try/finally with semaphore wait - marked NoInlining since kernel wait dominates cost anyway
- Release() signals one waiting thread via volatile waiterCount check (nearly zero overhead when no waiters)
- Double-check pattern in ReserveEntryWait prevents lost wakeups: increment waiterCount, re-check for slots, then wait
- SemaphoreSlim uses Monitor.Pulse internally which provides FIFO wake-up order, preventing starvation
Performance characteristics:
- No contention: unchanged - fast path acquires entry with same probing logic
- Table full: threads wait efficiently instead of burning CPU
- Release hot path: single volatile read of waiterCount when no waiters
* add small comment
* clarify comments, increment version
* make lightepoch isolate instances properly
* nits
* nits
* Cancel epoch table waiters on dispose for graceful shutdown
When the epoch table is full, threads block on a SemaphoreSlim in ReserveEntryWait until a slot is released. If LightEpoch is disposed while threads are waiting, they remain blocked indefinitely, preventing graceful shutdown. Add a CancellationTokenSource that is cancelled during Dispose, causing blocked threads to receive an OperationCanceledException. Dispose then spin-waits for all waiters to finish unwinding before disposing the CancellationTokenSource and SemaphoreSlim.
* nit
* comments
* nit
* Apply suggestion from @Copilot
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Apply suggestion from @Copilot
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* nit
* fix dispose to be robust
* nit
* nit
* nit
* no need to refresh here
* add better epoch logging, fix garnet epoch dispose
* nit
* fixes
* Fix MultiDatabase to correctly dispose devices
* undo change to test
* share epoch across all aof instances
* fix testcase to wait for checkpoint to complete
* fix HasKeysInSlots
* add debug helper static method to LightEpoch
* actually add
* reduce logger verbosity
* nits
* fix
* fix CloseLock semantics to ensure dispose happens after write lock is released.
* nit
* change lock style for clarity
* fixes
* updates
* nit
* fix formatting
* update test suite to check LightEpoch disposal
* update tsavo tests to have tear down checks in one place
* ensure epochs are disposed if server throws in constructor
* fix tsavo tests to properly dispose epoch
* fix test
* fixes
* nit
* fix test
* improve comments
* update LightEpoch copy in client
* nit
* clean up struct Entry
* use new epoch for garnet client correctly
* fix
* nit
* fix CanDoBulkDeleteTests
* share client epoch for failover
* fix
* update version for release
---------
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Bump qs from 6.14.1 to 6.14.2 in /website (#1562)
Bumps [qs](https://github.com/ljharb/qs) from 6.14.1 to 6.14.2.
- [Changelog](https://github.com/ljharb/qs/blob/main/CHANGELOG.md)
- [Commits](ljharb/qs@v6.14.1...v6.14.2)
---
updated-dependencies:
- dependency-name: qs
  dependency-version: 6.14.2
  dependency-type: indirect
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* Use shared LightEpoch in parallel ACL/auth tests (#1566)
ParallelTests now use a shared LightEpoch for all GarnetClient instances, improving thread safety and resource management. TestUtils.GetGarnetClient accepts an optional epoch parameter, which is passed to the GarnetClient constructor. This reduces contention and potential corruption during parallel authentication and ACL operations.
* Fix ClusterDisklessSyncResetSyncManagerCts (#1557)
* fix ClusterDisklessSyncResetSyncManagerCts
* set message only when error occurs
* address comment
---------
Co-authored-by: Tal Zaccai <talzacc@microsoft.com>
* Support hostname resolution in MIGRATE command (#1565)
* Initial plan
* Add hostname resolution support to MIGRATE command
Co-authored-by: vazois <96085550+vazois@users.noreply.github.com>
* Add tests for hostname resolution in MIGRATE command
Co-authored-by: vazois <96085550+vazois@users.noreply.github.com>
* Fix test for invalid hostname and improve test robustness
Co-authored-by: vazois <96085550+vazois@users.noreply.github.com>
* Address code review feedback: IPv4 preference, specific exception handling, null check
Co-authored-by: vazois <96085550+vazois@users.noreply.github.com>
* Improve variable naming: resolvedAddress -> effectiveAddress
Co-authored-by: vazois <96085550+vazois@users.noreply.github.com>
* fix formatting
* remove unnecessary DEBUG
* revert DEBUG flag to its original state
* cleanup tests
* Start worker search from index 2 to skip local worker and prevent self-migration
Co-authored-by: vazois <96085550+vazois@users.noreply.github.com>
* Test all resolved IPs against cluster config and revert license changes
Co-authored-by: vazois <96085550+vazois@users.noreply.github.com>
* Apply suggestion from @Copilot
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Remove unnecessary ArgumentOutOfRangeException catch block
Co-authored-by: vazois <96085550+vazois@users.noreply.github.com>
* Use Dns.GetHostEntryAsync for hostname resolution
Co-authored-by: vazois <96085550+vazois@users.noreply.github.com>
* Add ConfigureAwait(false) to Dns.GetHostEntryAsync call
Co-authored-by: vazois <96085550+vazois@users.noreply.github.com>
---------
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: vazois <96085550+vazois@users.noreply.github.com>
Co-authored-by: Vasileios Zois <vazois@microsoft.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* format
* Update libs/server/GarnetDatabase.cs
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Fixing merge issue
* Added XML comment
---------
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: Justine Cocchi <jucocchi@microsoft.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Vasileios Zois <96085550+vazois@users.noreply.github.com>
Co-authored-by: Badrish Chandramouli <badrishc@microsoft.com>
Co-authored-by: kevin-montrose <kmontrose@microsoft.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com>
Co-authored-by: Vasileios Zois <vazois@microsoft.com>