Skip to content

Fix race condition in JournalVirtualDev truncation (#603)#888

Open
OmDoshi13 wants to merge 2 commits into
eBay:stable/v7.xfrom
OmDoshi13:fix/journal-vdev-truncation-race-603
Open

Fix race condition in JournalVirtualDev truncation (#603)#888
OmDoshi13 wants to merge 2 commits into
eBay:stable/v7.xfrom
OmDoshi13:fix/journal-vdev-truncation-race-603

Conversation

@OmDoshi13
Copy link
Copy Markdown

@OmDoshi13 OmDoshi13 commented Jun 2, 2026

Summary

Fixes the concurrency bug reported in #603 where truncate() modified m_journal_chunks and m_data_start_offset concurrently with read and write operations, with no synchronization, causing crashes when a chunk was erased mid-iteration.

  • Add mutable std::shared_mutex m_chunk_mutex to Descriptor; writers (truncate, append_chunk) hold unique_lock, all reader methods hold shared_lock, and process_pwrite_offset (the pwrite path) now also holds shared_lock
  • Change m_data_start_offset from off_t to std::atomic<off_t> (release/acquire) to eliminate UB from unlocked reads in tail_offset() and alloc_next_append_blk()
  • Make update_data_start_offset() private; add public init_data_start_offset() (acquires unique_lock) for use by log_dev.cpp during init/recovery

Files changed

File Change
src/lib/device/journal_vdev.hpp Add mutex, make m_data_start_offset atomic, add init_data_start_offset()
src/lib/device/journal_vdev.cpp Lock all accessors, add init_data_start_offset() impl, use .store()/.load()
src/lib/logstore/log_dev.cpp Replace update_data_start_offset()init_data_start_offset()
src/tests/test_journal_vdev.cpp Same replacement

Test plan

  • Existing test_journal_vdev passes (compile fix at line 119 verified)
  • Run concurrent truncate + read stress test to confirm no crash
  • Run full CTest suite: ctest --output-on-failure inside Docker build

Related

Closes #603

odoshi13 added 2 commits June 2, 2026 16:05
Concurrent truncate() and read/write operations on JournalVirtualDev::Descriptor
had no synchronization, causing crashes when truncate() erased chunks from
m_journal_chunks while readers iterated the same vector.

Changes:
- Add mutable std::shared_mutex m_chunk_mutex to Descriptor; truncate() and
  append_chunk() acquire unique_lock, all reader methods acquire shared_lock,
  and process_pwrite_offset() now acquires shared_lock to protect the pwrite
  path against concurrent truncation
- Change m_data_start_offset from plain off_t to std::atomic<off_t> with
  release/acquire semantics, eliminating undefined behaviour from unlocked
  reads in tail_offset() and alloc_next_append_blk()
- Make update_data_start_offset() private; add public init_data_start_offset()
  (acquires unique_lock) for safe use from log_dev.cpp during init/recovery
- Update log_dev.cpp and test_journal_vdev.cpp to call init_data_start_offset()

Fixes eBay#603
process_pwrite_offset() now holds shared_lock on m_chunk_mutex (added
in the eBay#603 fix). The LOGTRACEMOD call at line 319 was invoking
to_string(), which also acquires shared_lock on the same mutex from
the same thread. std::shared_mutex does not allow recursive same-thread
locking — this would deadlock whenever journalvdev TRACE logging is
enabled. Replace to_string() with to_string_nolock() since the lock is
already held.
@OmDoshi13 OmDoshi13 force-pushed the fix/journal-vdev-truncation-race-603 branch from 77fdc51 to 0c58654 Compare June 2, 2026 15:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

journal vdev truncation

2 participants