fix(segment): initialize memory components when loading collection from disk#232
Conversation
…om disk Fixes alibaba#227 When loading a collection from disk, memory vector indexers were not initialized if the WAL had been deleted after flush. This caused vector search to return empty results after reopening a collection. The fix adds a check after recover() to initialize memory components if they haven't been initialized yet. This ensures vector search works correctly after loading from disk. Changes: - src/db/index/segment/segment.cc: Added check to call init_memory_components() if memory_store_ is null after recovery - tests/db/collection_test.cc: Added regression test Bug227_VectorSearchAfterReopen
|
|
| query.query_vector_ = std::vector<float>{0.1f, 0.2f, 0.3f, 0.4f, 0.5f, | ||
| 0.6f, 0.7f, 0.8f, 0.9f, 1.0f, | ||
| 1.1f, 1.2f, 1.3f, 1.4f, 1.5f, | ||
| 1.6f, 1.7f, 1.8f, 1.9f, 2.0f}; |
There was a problem hiding this comment.
Query vector dimension mismatch
The test constructs a query vector with 20 elements, but the dense_fp32 field created by TestHelper::CreateNormalSchema() has 128 dimensions (see utils.cc line 105: "dense_fp32", DataType::VECTOR_FP32, 128).
Depending on whether dimension validation is enforced in the FLAT indexer's search path:
- If validation exists,
result_before.has_value()will returnfalseon the very first query (before any close/reopen), causing the test to fail for the wrong reason — it never actually exercises the bug fix. - If validation is absent, distance computation over 128-dimensional stored vectors against a 20-dimensional query will read 108 floats of garbage memory (undefined behaviour).
The query vector should be 128 floats:
| query.query_vector_ = std::vector<float>{0.1f, 0.2f, 0.3f, 0.4f, 0.5f, | |
| 0.6f, 0.7f, 0.8f, 0.9f, 1.0f, | |
| 1.1f, 1.2f, 1.3f, 1.4f, 1.5f, | |
| 1.6f, 1.7f, 1.8f, 1.9f, 2.0f}; | |
| query.query_vector_ = std::vector<float>(128, 1.0f); |
(or any representative 128-element pattern — the exact values are not important for this regression test).
| // If memory components are not initialized (e.g., after flush when WAL is | ||
| // deleted), we need to initialize them for the writing block to support | ||
| // vector search on data that hasn't been persisted to index blocks yet. | ||
| if (!memory_store_) { | ||
| s = init_memory_components(); | ||
| CHECK_RETURN_STATUS(s); | ||
| } |
There was a problem hiding this comment.
init_memory_components() deletes an existing forward-block file during Open()
When called from Open() (the new code path), init_memory_components() unconditionally removes any pre-existing file at mem_path (lines 3977–3982), logging it as a "possible crash residue". This is safe for the post-flush normal reopen scenario (the new writing block ID has never been written to disk), but could silently destroy data if:
- The process crashes after
set_writing_forward_blockbut beforememory_store_->Open()returns, and the partially-created file already exists. - Some future code path creates the forward-block file before
Open()completes.
A comment here clarifying that this deletion is intentional and that the file, if it exists at all, can only be incomplete crash residue (because the block ID is freshly allocated) would make the intent explicit and help future readers avoid introducing a bug by calling init_memory_components() in a context where the file might legitimately exist.
Fixes #227
Bug Description
When loading a collection from disk, vector search would return empty results if the WAL had been deleted after flush. This happened because memory vector indexers were not initialized during the Open() operation.
Root Cause
In SegmentImpl::Open():
Fix
Added a check after recover() to call init_memory_components() if memory components haven't been initialized yet. This ensures vector search works correctly after loading from disk.
Testing
Added regression test Bug227_VectorSearchAfterReopen that:
Greptile Summary
This PR fixes a bug where vector search returns empty results after a collection is closed (flushed) and reopened from disk. The root cause was that
SegmentImpl::Open()never calledinit_memory_components()when the WAL had been deleted after a successful flush, leavingmemory_vector_indexers_empty and causingget_combined_vector_indexer()to produce an indexer with no underlying indexers.Key changes:
src/db/index/segment/segment.cc: Adds an 8-line guard afterrecover()inOpen()— ifmemory_store_is stillnullafter recovery (indicating no WAL was replayed),init_memory_components()is called to set up empty-but-functional memory components for the new writing block.tests/db/collection_test.cc: Adds a regression testBug227_VectorSearchAfterReopenthat creates a collection, inserts documents, closes it (triggering flush and WAL deletion), reopens it, and asserts vector search still returns results.Issues found:
dense_fp32field inCreateNormalSchema()is a 128-dimensional field. This dimension mismatch will cause either a query error (failing the test before the reopen scenario is exercised) or undefined behaviour during distance computation. The query vector must be padded to 128 elements.SegmentImpl::recover(), thestatusvariable is declared but never assigned from any of the fourinternal_*calls — per-document WAL replay errors are silently discarded. This is a pre-existing issue, but it interacts with the new fix: partially-initialised memory components after a partially-failed WAL replay will not trigger theif (!memory_store_)guard.Confidence Score: 3/5
segment.ccis correct and well-reasoned. The condition!memory_store_reliably detects the post-flush/no-WAL case, andinit_memory_components()is the right function to call. The score is penalised because the accompanying regression test contains a dimension mismatch (20 vs 128) that either breaks the test before it reaches the reopen assertion or causes undefined behaviour — meaning the test does not actually verify the fix works. Additionally, the pre-existing silent error swallowing inrecover()slightly complicates the correctness argument for the guard condition.tests/db/collection_test.cc— the query vector dimension must be corrected to 128 before the test can validate the fix.Important Files Changed
recover()guard inOpen()that callsinit_memory_components()whenmemory_store_is null (i.e. WAL was absent after a flush). The fix is logically correct and targets a real bug; minor concerns exist around the WAL replay loop's silent error swallowing and the unconditional file deletion insideinit_memory_components()when reached via the new code path.Bug227_VectorSearchAfterReopenthat exercises the close-reopen-search cycle. Contains a critical dimension mismatch: the query vector is 20 floats butdense_fp32is a 128-dimensional field, which will either cause the test to fail before it reaches the reopen step or produce undefined behaviour during distance computation.Sequence Diagram
sequenceDiagram participant App participant Collection participant SegmentImpl participant WAL participant DiskBlocks Note over App,DiskBlocks: Normal close (flush) path App->>Collection: close() / reset() Collection->>SegmentImpl: flush() SegmentImpl->>DiskBlocks: finish_memory_components() — persist forward+vector blocks SegmentImpl->>SegmentImpl: set_writing_forward_block(new empty block) SegmentImpl->>WAL: remove() — WAL deleted Note over App,DiskBlocks: Reopen path (before fix) App->>Collection: Collection::Open(path) Collection->>SegmentImpl: Open() SegmentImpl->>DiskBlocks: load_vector_index_blocks() — persisted indexers loaded SegmentImpl->>WAL: recover() — WAL absent, returns early Note over SegmentImpl: memory_store_ = null, memory_vector_indexers_ = empty App->>Collection: Query(vector) Collection->>SegmentImpl: get_combined_vector_indexer() Note over SegmentImpl: memory_vector_indexers_ empty → no indexer for new writing block → empty results Note over App,DiskBlocks: Reopen path (after fix) App->>Collection: Collection::Open(path) Collection->>SegmentImpl: Open() SegmentImpl->>DiskBlocks: load_vector_index_blocks() — persisted indexers loaded SegmentImpl->>WAL: recover() — WAL absent, returns early SegmentImpl->>SegmentImpl: if (!memory_store_) → init_memory_components() Note over SegmentImpl: memory_store_ initialised, memory_vector_indexers_ populated (empty, ready) App->>Collection: Query(vector) Collection->>SegmentImpl: get_combined_vector_indexer() Note over SegmentImpl: persisted + memory indexers combined → correct resultsComments Outside Diff (1)
src/db/index/segment/segment.cc, line 4096-4118 (link)recover()operation failuresInside the
recover()WAL-replay loop, the localstatusvariable is declared but is never assigned from any of theinternal_insert/internal_update/internal_upsert/internal_deletecalls — all four branches ignore the return value entirely. The error check at line 4119 (if (!status.ok())) therefore never triggers, silently discarding any per-document failure.This is a pre-existing issue, but the new fix depends on
recover()either (a) successfully replaying the WAL (settingmemory_store_) or (b) finding no WAL (leavingmemory_store_null). If a WAL exists but individual document writes fail silently (e.g. schema mismatch after a schema change),recover()will setmemory_store_via the first successfulinternal_insertcall, theif (!memory_store_)guard will be false, and the newinit_memory_components()call will be skipped — meaning memory components may be partially initialised without any diagnostic signal.Consider capturing each return value:
Last reviewed commit: d5b4b31