Skip to content

fix(segment): initialize memory components when loading collection from disk#232

Open
BillionClaw wants to merge 1 commit intoalibaba:mainfrom
BillionClaw:clawoss/fix/issue-227-vector-search-after-reopen
Open

fix(segment): initialize memory components when loading collection from disk#232
BillionClaw wants to merge 1 commit intoalibaba:mainfrom
BillionClaw:clawoss/fix/issue-227-vector-search-after-reopen

Conversation

@BillionClaw
Copy link

@BillionClaw BillionClaw commented Mar 17, 2026

Fixes #227

Bug Description

When loading a collection from disk, vector search would return empty results if the WAL had been deleted after flush. This happened because memory vector indexers were not initialized during the Open() operation.

Root Cause

In SegmentImpl::Open():

  1. load_vector_index_blocks() only loads persisted vector index blocks
  2. recover() replays WAL operations, but if data was flushed, WAL is deleted - leaving nothing to recover
  3. init_memory_components() was never called during Open(), leaving memory_vector_indexers_ empty
  4. When searching, get_combined_vector_indexer() would return an indexer with no underlying indexers, causing empty results

Fix

Added a check after recover() to call init_memory_components() if memory components haven't been initialized yet. This ensures vector search works correctly after loading from disk.

Testing

Added regression test Bug227_VectorSearchAfterReopen that:

  1. Creates a collection and inserts documents
  2. Verifies vector search works (returns 5 results)
  3. Closes and reopens the collection
  4. Verifies vector search still works after reopen (this was failing before the fix)

Greptile Summary

This PR fixes a bug where vector search returns empty results after a collection is closed (flushed) and reopened from disk. The root cause was that SegmentImpl::Open() never called init_memory_components() when the WAL had been deleted after a successful flush, leaving memory_vector_indexers_ empty and causing get_combined_vector_indexer() to produce an indexer with no underlying indexers.

Key changes:

  • src/db/index/segment/segment.cc: Adds an 8-line guard after recover() in Open() — if memory_store_ is still null after recovery (indicating no WAL was replayed), init_memory_components() is called to set up empty-but-functional memory components for the new writing block.
  • tests/db/collection_test.cc: Adds a regression test Bug227_VectorSearchAfterReopen that creates a collection, inserts documents, closes it (triggering flush and WAL deletion), reopens it, and asserts vector search still returns results.

Issues found:

  • The regression test's query vector contains only 20 float values, but the dense_fp32 field in CreateNormalSchema() is a 128-dimensional field. This dimension mismatch will cause either a query error (failing the test before the reopen scenario is exercised) or undefined behaviour during distance computation. The query vector must be padded to 128 elements.
  • Inside SegmentImpl::recover(), the status variable is declared but never assigned from any of the four internal_* calls — per-document WAL replay errors are silently discarded. This is a pre-existing issue, but it interacts with the new fix: partially-initialised memory components after a partially-failed WAL replay will not trigger the if (!memory_store_) guard.

Confidence Score: 3/5

  • The production fix in segment.cc is sound, but the regression test has a critical dimension mismatch that likely prevents it from actually validating the fix.
  • The core one-line logic change in segment.cc is correct and well-reasoned. The condition !memory_store_ reliably detects the post-flush/no-WAL case, and init_memory_components() is the right function to call. The score is penalised because the accompanying regression test contains a dimension mismatch (20 vs 128) that either breaks the test before it reaches the reopen assertion or causes undefined behaviour — meaning the test does not actually verify the fix works. Additionally, the pre-existing silent error swallowing in recover() slightly complicates the correctness argument for the guard condition.
  • tests/db/collection_test.cc — the query vector dimension must be corrected to 128 before the test can validate the fix.

Important Files Changed

Filename Overview
src/db/index/segment/segment.cc Adds a post-recover() guard in Open() that calls init_memory_components() when memory_store_ is null (i.e. WAL was absent after a flush). The fix is logically correct and targets a real bug; minor concerns exist around the WAL replay loop's silent error swallowing and the unconditional file deletion inside init_memory_components() when reached via the new code path.
tests/db/collection_test.cc Adds regression test Bug227_VectorSearchAfterReopen that exercises the close-reopen-search cycle. Contains a critical dimension mismatch: the query vector is 20 floats but dense_fp32 is a 128-dimensional field, which will either cause the test to fail before it reaches the reopen step or produce undefined behaviour during distance computation.

Sequence Diagram

sequenceDiagram
    participant App
    participant Collection
    participant SegmentImpl
    participant WAL
    participant DiskBlocks

    Note over App,DiskBlocks: Normal close (flush) path
    App->>Collection: close() / reset()
    Collection->>SegmentImpl: flush()
    SegmentImpl->>DiskBlocks: finish_memory_components() — persist forward+vector blocks
    SegmentImpl->>SegmentImpl: set_writing_forward_block(new empty block)
    SegmentImpl->>WAL: remove() — WAL deleted

    Note over App,DiskBlocks: Reopen path (before fix)
    App->>Collection: Collection::Open(path)
    Collection->>SegmentImpl: Open()
    SegmentImpl->>DiskBlocks: load_vector_index_blocks() — persisted indexers loaded
    SegmentImpl->>WAL: recover() — WAL absent, returns early
    Note over SegmentImpl: memory_store_ = null, memory_vector_indexers_ = empty
    App->>Collection: Query(vector)
    Collection->>SegmentImpl: get_combined_vector_indexer()
    Note over SegmentImpl: memory_vector_indexers_ empty → no indexer for new writing block → empty results

    Note over App,DiskBlocks: Reopen path (after fix)
    App->>Collection: Collection::Open(path)
    Collection->>SegmentImpl: Open()
    SegmentImpl->>DiskBlocks: load_vector_index_blocks() — persisted indexers loaded
    SegmentImpl->>WAL: recover() — WAL absent, returns early
    SegmentImpl->>SegmentImpl: if (!memory_store_) → init_memory_components()
    Note over SegmentImpl: memory_store_ initialised, memory_vector_indexers_ populated (empty, ready)
    App->>Collection: Query(vector)
    Collection->>SegmentImpl: get_combined_vector_indexer()
    Note over SegmentImpl: persisted + memory indexers combined → correct results
Loading

Comments Outside Diff (1)

  1. src/db/index/segment/segment.cc, line 4096-4118 (link)

    P1 Silent error-handling swallows recover() operation failures

    Inside the recover() WAL-replay loop, the local status variable is declared but is never assigned from any of the internal_insert / internal_update / internal_upsert / internal_delete calls — all four branches ignore the return value entirely. The error check at line 4119 (if (!status.ok())) therefore never triggers, silently discarding any per-document failure.

    This is a pre-existing issue, but the new fix depends on recover() either (a) successfully replaying the WAL (setting memory_store_) or (b) finding no WAL (leaving memory_store_ null). If a WAL exists but individual document writes fail silently (e.g. schema mismatch after a schema change), recover() will set memory_store_ via the first successful internal_insert call, the if (!memory_store_) guard will be false, and the new init_memory_components() call will be skipped — meaning memory components may be partially initialised without any diagnostic signal.

    Consider capturing each return value:

    Status status;
    switch (doc->get_operator()) {
      case Operator::INSERT: {
        status = internal_insert(*doc);
        break;
      }
      case Operator::UPDATE: {
        status = internal_update(*doc);
        break;
      }
      case Operator::UPSERT: {
        status = internal_upsert(*doc);
        break;
      }
      case Operator::DELETE: {
        status = internal_delete(*doc);
        break;
      }
      ...
    }

Last reviewed commit: d5b4b31

Greptile also left 2 inline comments on this PR.

…om disk

Fixes alibaba#227

When loading a collection from disk, memory vector indexers were not
initialized if the WAL had been deleted after flush. This caused vector
search to return empty results after reopening a collection.

The fix adds a check after recover() to initialize memory components
if they haven't been initialized yet. This ensures vector search works
correctly after loading from disk.

Changes:
- src/db/index/segment/segment.cc: Added check to call init_memory_components()
  if memory_store_ is null after recovery
- tests/db/collection_test.cc: Added regression test Bug227_VectorSearchAfterReopen
@CLAassistant
Copy link

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

Comment on lines +4250 to +4253
query.query_vector_ = std::vector<float>{0.1f, 0.2f, 0.3f, 0.4f, 0.5f,
0.6f, 0.7f, 0.8f, 0.9f, 1.0f,
1.1f, 1.2f, 1.3f, 1.4f, 1.5f,
1.6f, 1.7f, 1.8f, 1.9f, 2.0f};
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Query vector dimension mismatch

The test constructs a query vector with 20 elements, but the dense_fp32 field created by TestHelper::CreateNormalSchema() has 128 dimensions (see utils.cc line 105: "dense_fp32", DataType::VECTOR_FP32, 128).

Depending on whether dimension validation is enforced in the FLAT indexer's search path:

  • If validation exists, result_before.has_value() will return false on the very first query (before any close/reopen), causing the test to fail for the wrong reason — it never actually exercises the bug fix.
  • If validation is absent, distance computation over 128-dimensional stored vectors against a 20-dimensional query will read 108 floats of garbage memory (undefined behaviour).

The query vector should be 128 floats:

Suggested change
query.query_vector_ = std::vector<float>{0.1f, 0.2f, 0.3f, 0.4f, 0.5f,
0.6f, 0.7f, 0.8f, 0.9f, 1.0f,
1.1f, 1.2f, 1.3f, 1.4f, 1.5f,
1.6f, 1.7f, 1.8f, 1.9f, 2.0f};
query.query_vector_ = std::vector<float>(128, 1.0f);

(or any representative 128-element pattern — the exact values are not important for this regression test).

Comment on lines +460 to +466
// If memory components are not initialized (e.g., after flush when WAL is
// deleted), we need to initialize them for the writing block to support
// vector search on data that hasn't been persisted to index blocks yet.
if (!memory_store_) {
s = init_memory_components();
CHECK_RETURN_STATUS(s);
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 init_memory_components() deletes an existing forward-block file during Open()

When called from Open() (the new code path), init_memory_components() unconditionally removes any pre-existing file at mem_path (lines 3977–3982), logging it as a "possible crash residue". This is safe for the post-flush normal reopen scenario (the new writing block ID has never been written to disk), but could silently destroy data if:

  1. The process crashes after set_writing_forward_block but before memory_store_->Open() returns, and the partially-created file already exists.
  2. Some future code path creates the forward-block file before Open() completes.

A comment here clarifying that this deletion is intentional and that the file, if it exists at all, can only be incomplete crash residue (because the block ID is freshly allocated) would make the intent explicit and help future readers avoid introducing a bug by calling init_memory_components() in a context where the file might legitimately exist.

@zhourrr zhourrr self-requested a review March 17, 2026 03:15
@zhourrr zhourrr self-assigned this Mar 17, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: load from file,using vector to search failed

3 participants