8 changes: 8 additions & 0 deletions src/db/index/segment/segment.cc
@@ -456,6 +456,14 @@ Status SegmentImpl::Open(const SegmentOptions &options) {
// recover writing block
s = recover();
CHECK_RETURN_STATUS(s);

// If memory components are not initialized (e.g., after flush when WAL is
// deleted), we need to initialize them for the writing block to support
// vector search on data that hasn't been persisted to index blocks yet.
if (!memory_store_) {
s = init_memory_components();
CHECK_RETURN_STATUS(s);
}
Comment on lines +460 to +466 (Contributor):
P2 init_memory_components() deletes an existing forward-block file during Open()

When called from Open() (the new code path), init_memory_components() unconditionally removes any pre-existing file at mem_path (lines 3977–3982), logging it as "possible crash residue". This is safe for the normal post-flush reopen scenario (the new writing block ID has never been written to disk), but could silently destroy data if:

  1. The process crashes after set_writing_forward_block but before memory_store_->Open() returns, and the partially-created file already exists.
  2. Some future code path creates the forward-block file before Open() completes.

A comment here stating that the deletion is intentional, and that any file found at mem_path can only be incomplete crash residue (because the block ID is freshly allocated), would make the intent explicit and help future readers avoid introducing a bug by calling init_memory_components() in a context where the file might legitimately exist.

} else {
// Update block_id_allocator_
BlockID max_block_id{0};
58 changes: 58 additions & 0 deletions tests/db/collection_test.cc
@@ -4224,6 +4224,64 @@ TEST_F(CollectionTest, CornerCase_CreateAndOpen) {
}
}

TEST_F(CollectionTest, Bug227_VectorSearchAfterReopen) {
// Regression test for GitHub issue #227:
// "load from file, using vector to search failed"
//
// This test verifies that vector search works correctly after closing
// and reopening a collection. The bug was that memory vector indexers
// were not initialized when loading from disk, causing search to return
// empty results.

FileHelper::RemoveDirectory(col_path);

// Create schema with vector fields
auto schema = TestHelper::CreateNormalSchema();
auto options = CollectionOptions{false, true, 64 * 1024 * 1024};

// Create collection and insert documents
auto collection = TestHelper::CreateCollectionWithDoc(
col_path, *schema, options, 0, 10, false);
ASSERT_NE(collection, nullptr);

// Verify documents are searchable before closing
VectorQuery query;
query.field_name_ = "dense_fp32";
query.query_vector_ = std::vector<float>{0.1f, 0.2f, 0.3f, 0.4f, 0.5f,
0.6f, 0.7f, 0.8f, 0.9f, 1.0f,
1.1f, 1.2f, 1.3f, 1.4f, 1.5f,
1.6f, 1.7f, 1.8f, 1.9f, 2.0f};
Comment on lines +4250 to +4253 (Contributor):

P1 Query vector dimension mismatch

The test constructs a query vector with 20 elements, but the dense_fp32 field created by TestHelper::CreateNormalSchema() has 128 dimensions (see utils.cc line 105: "dense_fp32", DataType::VECTOR_FP32, 128).

Depending on whether dimension validation is enforced in the FLAT indexer's search path:

  • If validation exists, result_before.has_value() will return false on the very first query (before any close/reopen), causing the test to fail for the wrong reason — it never actually exercises the bug fix.
  • If validation is absent, distance computation over 128-dimensional stored vectors against a 20-dimensional query will read 108 floats of garbage memory (undefined behaviour).

The query vector should be 128 floats:

Suggested change:

- query.query_vector_ = std::vector<float>{0.1f, 0.2f, 0.3f, 0.4f, 0.5f,
-                                          0.6f, 0.7f, 0.8f, 0.9f, 1.0f,
-                                          1.1f, 1.2f, 1.3f, 1.4f, 1.5f,
-                                          1.6f, 1.7f, 1.8f, 1.9f, 2.0f};
+ query.query_vector_ = std::vector<float>(128, 1.0f);

(or any representative 128-element pattern — the exact values are not important for this regression test).

query.topk_ = 5;

auto result_before = collection->Query(query);
ASSERT_TRUE(result_before.has_value())
<< "Query failed before close: " << result_before.error().message();
ASSERT_EQ(result_before.value().size(), 5)
<< "Expected 5 results before close";

// Close collection
collection.reset();

// Reopen collection from disk
auto reopen_result = Collection::Open(col_path, options);
ASSERT_TRUE(reopen_result.has_value())
<< "Failed to reopen collection: " << reopen_result.error().message();
collection = std::move(reopen_result.value());

// Verify documents are still searchable after reopening
// This was failing before the fix - it would return 0 results
auto result_after = collection->Query(query);
ASSERT_TRUE(result_after.has_value())
<< "Query failed after reopen: " << result_after.error().message();
ASSERT_EQ(result_after.value().size(), 5)
<< "Expected 5 results after reopen, got " << result_after.value().size()
<< ". This indicates the vector index was not properly loaded.";

// Verify stats are consistent
auto stats = collection->Stats().value();
ASSERT_EQ(stats.doc_count, 10) << "Document count mismatch after reopen";
}

TEST_F(CollectionTest, CornerCase_CreateIndex) {
auto schema = TestHelper::CreateNormalSchema();
auto options = CollectionOptions{false, true, 64 * 1024 * 1024};