Skip to content

feat(clp-s::filter): Add abstraction layer for filter building, serialization, and deserialization.#2101

Open
ShangDanLuXian wants to merge 7 commits intoy-scope:mainfrom
ShangDanLuXian:filter_reader_writer
Open

feat(clp-s::filter): Add abstraction layer for filter building, serialization, and deserialization.#2101
ShangDanLuXian wants to merge 7 commits intoy-scope:mainfrom
ShangDanLuXian:filter_reader_writer

Conversation

@ShangDanLuXian
Copy link
Copy Markdown
Contributor

@ShangDanLuXian ShangDanLuXian commented Mar 17, 2026

Description

This PR adds a filter serialization layer for clp_s::filter.

It introduces:

  • FilterBuilder for constructing a filter from input values and serializing it
  • FilterReader for reading a serialized filter and performing membership checks
  • shared filter metadata/helpers in FilterOptions, including:
    • filter type parsing
    • filter normalization parsing
    • shared filter file magic
    • shared normalization logic

The current implementation is still backed by BloomFilter. Making this layer fully generic would
likely require a parent abstraction representing filter implementations, but since we currently only
have one filter type, this PR intentionally skips that refactor to keep the scope small.

This PR also:

  • adds filter-specific error codes for unsupported filter type / normalization and corrupt filter
    file metadata
  • wires the new filter library files into the build
  • adds the required clp::string_utils dependency for shared normalization logic

Checklist

  • The PR satisfies the contribution guidelines.
  • This is a breaking change and that has been indicated in the PR title, OR this isn't a
    breaking change.
  • Necessary docs have been updated, OR no docs need to be updated.

Validation performed

Since this change mainly adds file I/O for the filter serialization layer, I am leaving the main
validation to integration tests.

Summary by CodeRabbit

  • New Features

    • Full filter support: create, populate, serialize and read Bloom filters; membership and wildcard-aware query handling; string normalization (None, Lowercase) and parsing utilities.
  • Bug Fixes

    • Improved input/read validation and clearer, specific error responses for unsupported filter types/normalizations and read failures.
  • Chores

    • Build updated to include new filter components and string utilities.

@ShangDanLuXian ShangDanLuXian requested review from a team and gibber9809 as code owners March 17, 2026 16:25
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Mar 17, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

Walkthrough

Adds a filter subsystem: new FilterOptions (types & normalization), FilterBuilder and FilterReader for Bloom filters; extends BloomFilter I/O with validation and renamed methods; adds two new error codes and updates CMake to include new filter sources and link clp::string_utils.

Changes

Cohort / File(s) Summary
CMake Configuration
components/core/cmake/Options/options.cmake, components/core/src/clp_s/filter/CMakeLists.txt
Added CLP_BUILD_CLP_STRING_UTILS to dependency validation; added new filter source/header files to filter CMake and linked clp_s_filter PRIVATE to clp::string_utils.
Error Handling
components/core/src/clp_s/filter/ErrorCode.hpp, components/core/src/clp_s/filter/ErrorCode.cpp
Added enum values UnsupportedFilterType and UnsupportedFilterNormalization; updated ReadFailure message and extended ErrorCategory::message with explicit messages for the new enum values.
Filter Options
components/core/src/clp_s/filter/FilterOptions.hpp, components/core/src/clp_s/filter/FilterOptions.cpp
Introduced FilterType and FilterNormalization enums and parsing/normalization helpers (try_parse_filter_type, normalize_string), including internal Bloom type string and use of clp::string_utils::to_lower.
FilterBuilder
components/core/src/clp_s/filter/FilterBuilder.hpp, components/core/src/clp_s/filter/FilterBuilder.cpp
New FilterBuilder class with create(...) -> Result<FilterBuilder>, add(std::string_view), write(WriterInterface&) const, accessors and constructor; validates normalization and selects/constructs Bloom filter.
FilterReader
components/core/src/clp_s/filter/FilterReader.hpp, components/core/src/clp_s/filter/FilterReader.cpp
New FilterReader class with try_read(ReaderInterface&) -> Result<FilterReader> that reads and validates numeric type/normalization IDs, propagates BloomFilter deserialization, and exposes possibly_contains and possibly_contains_query_string query methods.
BloomFilter API
components/core/src/clp_s/filter/BloomFilter.hpp, components/core/src/clp_s/filter/BloomFilter.cpp
Renamed I/O methods: try_read_from_filetry_read, write_to_filewrite; try_read now performs hash-algorithm parsing/validation, validates num_hash_functions and bit array sizes, and write serializes extended state and bit array contents.

Sequence Diagram(s)

sequenceDiagram
  participant Dev as Developer
  participant Builder as FilterBuilder
  participant Bloom as BloomFilter
  participant Writer as WriterInterface
  participant Reader as FilterReader
  participant ReaderIF as ReaderInterface

  Dev->>Builder: create(type, normalization, expected, fpr)
  Builder->>Bloom: BloomFilter::create(...)
  Bloom-->>Builder: bloom instance
  Dev->>Builder: add(value)
  Builder->>Bloom: insert(normalize(value))
  Dev->>Builder: write(Writer)
  Builder->>Writer: write(type_id, normalization_id, bloom data)

  Reader->>ReaderIF: try_read(ReaderIF)
  ReaderIF->>Reader: read(type_id, normalization_id)
  Reader->>ReaderIF: BloomFilter::try_read(ReaderIF)
  ReaderIF-->>Reader: bloom instance
  Dev->>Reader: possibly_contains_query_string(query)
  Reader->>Bloom: possibly_contains(normalize(query))
  Bloom-->>Reader: bool
  Reader-->>Dev: Result / bool
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 14.29% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and specifically describes the main change: adding an abstraction layer for filter building, serialization, and deserialization.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@components/core/src/clp_s/filter/FilterOptions.cpp`:
- Around line 44-46: The lowercase normalization currently uses
clp::string_utils::to_lower which relies on the process locale and causes
on-disk incompatibility; update the normalization to be locale-independent by
making clp::string_utils::to_lower use the C/POSIX ("classic") locale or an
explicit Unicode-aware normalizer so results are stable across environments
(i.e., pin to std::locale::classic or use a stable UTF-8
normalization/NFKC+lowercase impl), and ensure both FilterBuilder::write_to_file
and FilterReader::possibly_contains use that same deterministic normalization
path.

In `@components/core/src/clp_s/filter/FilterReader.hpp`:
- Around line 22-30: Merge the duplicate `@return` tags in the FilterReader
docblock into a single `@return` section that lists both the initialization error
cases and the forwarded payload-read failure case; update the docblock in
FilterReader.hpp (the function/constructor that returns Result<FilterReader,
ErrorCodeEnum>) so the single `@return` contains the bullet list for metadata
errors (CorruptFilterFile, UnsupportedFilterType,
UnsupportedFilterNormalization, ReadFailure) and a final line stating that
payload read failures forward the underlying filter implementation's return
values.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 7d2c7580-4291-49b2-976c-897530bda378

📥 Commits

Reviewing files that changed from the base of the PR and between 3c51803 and 0574294.

📒 Files selected for processing (10)
  • components/core/cmake/Options/options.cmake
  • components/core/src/clp_s/filter/CMakeLists.txt
  • components/core/src/clp_s/filter/ErrorCode.cpp
  • components/core/src/clp_s/filter/ErrorCode.hpp
  • components/core/src/clp_s/filter/FilterBuilder.cpp
  • components/core/src/clp_s/filter/FilterBuilder.hpp
  • components/core/src/clp_s/filter/FilterOptions.cpp
  • components/core/src/clp_s/filter/FilterOptions.hpp
  • components/core/src/clp_s/filter/FilterReader.cpp
  • components/core/src/clp_s/filter/FilterReader.hpp

Comment on lines +44 to +46
case FilterNormalization::Lowercase:
clp::string_utils::to_lower(normalized);
return normalized;
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot Mar 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== to_lower implementation used by filter normalization ==="
sed -n '135,160p' components/core/src/clp-rust-utils/string_utils/string_utils.hpp

echo
echo "=== filter normalization call chain ==="
rg -n -C2 '\bnormalize_string\s*\(|\bto_lower\s*\(' \
  components/core/src/clp_s/filter/FilterOptions.cpp \
  components/core/src/clp_s/filter/FilterBuilder.cpp \
  components/core/src/clp_s/filter/FilterReader.cpp

echo
echo "=== explicit locale pinning in core sources (if any) ==="
rg -n -C2 '\bsetlocale\s*\(|std::locale::global\s*\(' components/core/src

Repository: y-scope/clp

Length of output: 240


🏁 Script executed:

# Find string_utils files
find components -name "*string_utils*" -type f | head -20

Repository: y-scope/clp

Length of output: 206


🏁 Script executed:

# Check the FilterOptions.cpp file around line 45
head -50 components/core/src/clp_s/filter/FilterOptions.cpp | tail -20

Repository: y-scope/clp

Length of output: 627


🏁 Script executed:

# Search for to_lower usage in the filter directory
rg -n "to_lower" components/core/src/clp_s/filter/

Repository: y-scope/clp

Length of output: 161


🏁 Script executed:

# Read the string_utils.hpp to find to_lower implementation
cat -n components/core/src/clp/string_utils/string_utils.hpp

Repository: y-scope/clp

Length of output: 6711


🏁 Script executed:

# Search for locale-related code in the codebase
rg -n "setlocale|std::locale::global" components/core/src

Repository: y-scope/clp

Length of output: 37


🏁 Script executed:

# Read the implementation of to_lower in the .cpp file
cat -n components/core/src/clp/string_utils/string_utils.cpp

Repository: y-scope/clp

Length of output: 13567


🏁 Script executed:

# Search for FilterOptions usage and persistence
rg -n "FilterOptions|FilterNormalization" components/core/src/clp_s/filter/ --type cpp --type hpp | head -30

Repository: y-scope/clp

Length of output: 82


🏁 Script executed:

# Look for any disk/persistence operations with filters
rg -n "write|persist|serialize|save" components/core/src/clp_s/filter/ --type cpp --type hpp | grep -i filter

Repository: y-scope/clp

Length of output: 82


🏁 Script executed:

# Search for FilterOptions usage
rg -n "FilterOptions|FilterNormalization" components/core/src/clp_s/filter/ | head -40

Repository: y-scope/clp

Length of output: 4048


🏁 Script executed:

# Check FilterBuilder and FilterReader for how filters are persisted/loaded
rg -n "normalize_string|to_lower" components/core/src/clp_s/filter/FilterBuilder.cpp components/core/src/clp_s/filter/FilterReader.cpp

Repository: y-scope/clp

Length of output: 311


🏁 Script executed:

# Look at the FilterBuilder to understand the flow
head -100 components/core/src/clp_s/filter/FilterBuilder.cpp

Repository: y-scope/clp

Length of output: 1934


🏁 Script executed:

# Verify FilterReader applies normalization at query time
cat -n components/core/src/clp_s/filter/FilterReader.cpp | sed -n '50,70p'

Repository: y-scope/clp

Length of output: 838


🏁 Script executed:

# Check if there are any tests showing the locale-dependent behavior
rg -n "locale|tolower|Lowercase" components/core/tests/ | grep -i "filter\|normalize"

Repository: y-scope/clp

Length of output: 37


Locale-dependent filter normalization affects on-disk compatibility.

The to_lower function at line 45 relies on std::tolower with the process locale. When filters built and queried under different locales normalize non-ASCII input differently, they produce membership mismatches in the Bloom filter.

The normalized values are persisted to disk as part of the Bloom filter (FilterBuilder::write_to_file), and the same normalization is reapplied during queries (FilterReader::possibly_contains). Without an explicitly pinned locale, the behaviour becomes environment-dependent.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@components/core/src/clp_s/filter/FilterOptions.cpp` around lines 44 - 46, The
lowercase normalization currently uses clp::string_utils::to_lower which relies
on the process locale and causes on-disk incompatibility; update the
normalization to be locale-independent by making clp::string_utils::to_lower use
the C/POSIX ("classic") locale or an explicit Unicode-aware normalizer so
results are stable across environments (i.e., pin to std::locale::classic or use
a stable UTF-8 normalization/NFKC+lowercase impl), and ensure both
FilterBuilder::write_to_file and FilterReader::possibly_contains use that same
deterministic normalization path.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Incorrect -- the behaviour of clp::string_utils::to_lower doesn't depend on locale. Generally the approach for handling lower-casing for different locales will probably be to treat them as different kinds of normalization.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gibber9809, thanks for the correction — I apologize for the inaccurate review comment. I'll note that clp::string_utils::to_lower is locale-independent for future reviews.


✏️ Learnings added
Learnt from: gibber9809
Repo: y-scope/clp PR: 2101
File: components/core/src/clp_s/filter/FilterOptions.cpp:27-29
Timestamp: 2026-03-20T18:37:35.141Z
Learning: In y-scope/clp, `clp::string_utils::to_lower` (defined in `components/core/src/clp/string_utils/string_utils.hpp`) is locale-independent. Do not raise locale-sensitivity concerns about its usage. For handling lower-casing across different locales in the clp_s::filter module, the approach is to treat different locale lowercasings as distinct normalization types.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.


🧠 Learnings used
Learnt from: gibber9809
Repo: y-scope/clp PR: 2101
File: components/core/src/clp_s/filter/FilterReader.cpp:83-85
Timestamp: 2026-03-20T18:33:35.288Z
Learning: In y-scope/clp’s clp_s filtering code (e.g., clp_s::filter), prefer an allocation-free normalization design using a dedicated normalizer class. The normalizer should own an internal buffer and provide `normalize(std::string_view) -> std::string_view`. When normalization/transformation is required, return a `std::string_view` pointing into the normalizer’s internal buffer; when the input is already in the normalized form (e.g., already lowercase), return the original `std::string_view` directly.

Do not over-optimize this path for `FilterReader::possibly_contains`; it’s acceptable to skip the short-term allocation-free optimization there because bloom filter lookups are infrequent and normalization is expected to be the common case.

Learnt from: haiqi96
Repo: y-scope/clp PR: 523
File: components/core/src/clp/BufferedFileReader.cpp:96-106
Timestamp: 2024-10-24T14:45:26.265Z
Learning: In `components/core/src/clp/BufferedFileReader.cpp`, refactoring the nested error handling conditions may not apply due to the specific logic in the original code.

Learnt from: LinZhihao-723
Repo: y-scope/clp PR: 873
File: components/core/src/clp/ffi/ir_stream/search/QueryHandlerImpl.hpp:320-324
Timestamp: 2025-05-05T01:12:18.561Z
Learning: In the CLP codebase, the `m_case_sensitive_search` flag is used only for actual string value comparisons during query evaluation, not for schema key name matching. Schema keys are always compared case-sensitively regardless of this flag's setting.

Learnt from: SharafMohamed
Repo: y-scope/clp PR: 1313
File: components/core/tests/MockVariableDictionary.hpp:69-79
Timestamp: 2026-01-14T01:19:12.982Z
Learning: In the CLP codebase, case-insensitive tests are not expected for the mock dictionary classes (MockVariableDictionary and MockLogTypeDictionary) used in unit tests. The `ignore_case` parameter in methods like `get_entries_matching_wildcard_string` and `get_entry_matching_value` can safely ignore the parameter and always use case-sensitive matching.

Learnt from: SharafMohamed
Repo: y-scope/clp PR: 1313
File: components/core/tests/test-SchemaSearcher.cpp:136-137
Timestamp: 2025-12-09T01:57:04.839Z
Learning: In C++ files under the components directory, keep forward declarations even when a definition follows in the same scope (such as in anonymous namespaces) to maintain consistency across multiple methods. This style reduces churn and aligns with the existing convention in this repository for both source and tests.

Learnt from: ShangDanLuXian
Repo: y-scope/clp PR: 2010
File: components/core/src/clp_s/filter/FilterBuilder.cpp:24-33
Timestamp: 2026-02-20T15:19:43.528Z
Learning: In the components/core/src/clp_s module, FileWriter usage across multipleCpp files (e.g., JsonConstructor.cpp, SchemaMap.cpp, FilterBuilder.cpp) intentionally omits try-catch around write operations, and treats potential write failures as acceptable data loss. Do not add per-write try-catch in these files; instead, rely on the caller/integration layer to handle exceptions. Apply this pattern to similar FileWriter usage within this module, ensuring reviewers understand that error handling is deferred and that this behavior is consistent across related files.

Comment thread components/core/src/clp_s/filter/FilterReader.hpp
return std::nullopt;
}

auto try_parse_filter_type(std::string_view filter_type_string) -> std::optional<FilterType> {
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one will be used during integration to reject filter writing when processing command line arguments.

Probably I also need a method to return all the available filter type strings so that can show in help.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd maybe leave it for the integration PR then since it isn't directly related, but not a huge deal either way.

If you do keep it though "bloom" needs to be a constexpr std::string_view per our style guide.

Copy link
Copy Markdown
Contributor

@gibber9809 gibber9809 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did an initial pass and looks good overall, just some details to discuss. I think I generally agree that it's fine to not make the interface completely generic to different kinds of filters at the moment -- mostly looks like it will just require a bit of refactoring for the create methods down the line.

* Writes a serialized filter representation to a writer.
* @param writer
*/
auto write_to_file(clp::WriterInterface& writer) const -> void;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we're writing to a WriterInterface maybe we should just call this method write instead of write_to_file?

Comment thread components/core/src/clp_s/filter/FilterBuilder.hpp Outdated
Comment on lines +51 to +56
/**
* @return The configured filter type.
*/
[[nodiscard]] auto get_type() const -> FilterType { return m_type; }

[[nodiscard]] auto get_normalization() const -> FilterNormalization { return m_normalization; }
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't necessarily need docstrings for simple get methods that are described by their name, but if you put it on one here you should put it on both. Also fine to just get rid of the docstring on get_type.

Comment on lines +23 to +36
if (FilterNormalization::None != normalization
&& FilterNormalization::Lowercase != normalization)
{
return ErrorCode{ErrorCodeEnum::UnsupportedFilterNormalization};
}

switch (type) {
case FilterType::Bloom: {
auto bloom_filter{YSTDLIB_ERROR_HANDLING_TRYX(
BloomFilter::create(expected_num_elements, false_positive_rate)
)};
return FilterBuilder{type, normalization, std::move(bloom_filter)};
}
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It reads a bit strangely to use if for conditions on one enum and switch for the conditions on the other enum. For consistency, could we use switch for both?

}

auto FilterBuilder::write_to_file(clp::WriterInterface& writer) const -> void {
writer.write(cFilterFileMagic.data(), cFilterFileMagic.size());
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think that this needs file magic?

Within an archive we generally don't use file magic, and instead rely on some archive-level metadata about the different sections of the archive. I can see the argument for when this is external to an archive, but I'm not sure if this file format is actually independent as it is right now anyway (in that it doesn't indicate which archive it's a filter for), so to use it externally to an archive we have to have some external way to tell that this is a filter for a given archive anyway. Potentially open to keeping it if you can think of a use-case though.

When we get to doing packed filters though I can definitely see file magic making sense.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is intentionally no archive-to-filter binding encoded in this library itself, since this filter serialization layer may be used in different ways in the future.

My reason for keeping the file magic is narrower: it gives us a lightweight validation that the bytes being packed are actually a serialized filter file, which helps guard against malformed files or unrelated junk in the staging area.

That said, I agree this is arguable. In my prototype implementation, the packing flow identified staged filters by filename convention:
<filter_output_dir>/<archive_id>.var.dict.filter

So in that flow, the archive association comes from the filename/path convention rather than from the filter payload itself. That already filters out most unrelated files in the staging directory (if there is any), unless something else happens to match the same naming pattern. Do you think we need to protect that case?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think your reason for keeping it as lightweight validation when it's staged makes sense, so lets keep it for now. We can always revisit whether the file magic part is required for filters in archives when we get to that.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, revisiting this one more time since the new plan doesn't involve staging anymore -- do you think we still need the file magic here?

I'm also happy to leave this for now and decide whether we should change it when we get to writing the indexing flow to avoid holding up this PR any longer.

return std::nullopt;
}

auto try_parse_filter_type(std::string_view filter_type_string) -> std::optional<FilterType> {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd maybe leave it for the integration PR then since it isn't directly related, but not a huge deal either way.

If you do keep it though "bloom" needs to be a constexpr std::string_view per our style guide.

Comment on lines +36 to +53
[[nodiscard]] auto try_parse_filter_type_id(uint8_t filter_type_id) -> std::optional<FilterType>;

/**
* Parses a filter type string.
* @param filter_type_string Case-sensitive filter type string.
* @return Parsed filter type for known strings, or std::nullopt otherwise.
*/
[[nodiscard]] auto try_parse_filter_type(std::string_view filter_type_string)
-> std::optional<FilterType>;

/**
* Parses a serialized filter normalization ID.
* @param filter_normalization_id Encoded filter normalization ID from serialized payload.
* @return Parsed filter normalization for known IDs, or std::nullopt otherwise.
*/
[[nodiscard]] auto try_parse_filter_normalization_id(uint8_t filter_normalization_id)
-> std::optional<FilterNormalization>;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Besides the command line related method, are there any users for these parse methods besides FilterReader? I'm wondering whether it makes sense for these to just be private methods in that class.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I grouped these together intentionally since they are closely related, but I agree this is probably exposing a bit too much. I think it makes sense to keep the string-based one shared, and move the serialized-ID ones into FilterReader.

* @param value
* @return true if the filter may contain `value`; false if it definitely does not.
*/
[[nodiscard]] auto possibly_contains(std::string_view value) const -> bool;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think it makes sense to have another method here that accepts a wildcard search string and returns whether it is possibly contained in the data based on the filter?

For the current bloom filter based approach it'll just be always true unless the search string contains no wildcards, in which case we actually do the lookup, but even that could be useful. Definitely more useful for other kinds of filters though.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it makes sense. I’ll add a wildcard-check function with a switch on filter type; for Bloom, it will always return true, as this type is for exact match only.

Comment thread components/core/src/clp_s/filter/FilterReader.hpp Outdated
Comment thread components/core/src/clp_s/filter/FilterReader.hpp Outdated
Co-authored-by: Devin Gibson <gibber9809@users.noreply.github.com>
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (2)
components/core/src/clp_s/filter/FilterBuilder.hpp (2)

51-56: 🧹 Nitpick | 🔵 Trivial

Add missing docstring for get_normalization().

The get_type() method has a docstring but get_normalization() does not. For consistency, either add a docstring to get_normalization() or remove the one from get_type().

📝 Proposed fix to add docstring
     [[nodiscard]] auto get_type() const -> FilterType { return m_type; }
 
+    /**
+     * `@return` The configured normalization strategy.
+     */
     [[nodiscard]] auto get_normalization() const -> FilterNormalization { return m_normalization; }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@components/core/src/clp_s/filter/FilterBuilder.hpp` around lines 51 - 56, Add
a missing docstring for the getter get_normalization() to match get_type():
locate the method [[nodiscard]] auto get_normalization() const ->
FilterNormalization and add a brief Doxygen comment (e.g., "@return The
configured filter normalization.") so both getters have consistent
documentation; alternatively remove the docstring from get_type() if the project
prefers no getter comments, but keep style consistent across get_type() and
get_normalization().

45-49: 🧹 Nitpick | 🔵 Trivial

Consider renaming write_to_file to write.

Since this method writes to a WriterInterface rather than directly to a file, the name write would be more accurate and consistent with the abstraction level.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@components/core/src/clp_s/filter/FilterBuilder.hpp` around lines 45 - 49, The
method name write_to_file is misleading because it writes to a WriterInterface;
rename the method to write to match the abstraction: update the declaration in
FilterBuilder (change write_to_file to write), update any corresponding
definition/implementation (FilterBuilder::write) and all call sites that invoke
write_to_file, and adjust the Doxygen comment to reflect the new name; ensure
the signature remains auto write(clp::WriterInterface& writer) const -> void and
run a build to find remaining references.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@components/core/src/clp_s/filter/FilterReader.hpp`:
- Line 48: Add a brief docblock above the get_normalization() accessor to match
the style used for get_type(); locate the method named get_normalization() in
FilterReader.hpp and insert a short comment describing that it returns the
FilterNormalization for this filter instance (mentioning it is a const,
nodiscard accessor), following the same wording/format used by the existing
get_type() docblock for consistency.

---

Duplicate comments:
In `@components/core/src/clp_s/filter/FilterBuilder.hpp`:
- Around line 51-56: Add a missing docstring for the getter get_normalization()
to match get_type(): locate the method [[nodiscard]] auto get_normalization()
const -> FilterNormalization and add a brief Doxygen comment (e.g., "@return The
configured filter normalization.") so both getters have consistent
documentation; alternatively remove the docstring from get_type() if the project
prefers no getter comments, but keep style consistent across get_type() and
get_normalization().
- Around line 45-49: The method name write_to_file is misleading because it
writes to a WriterInterface; rename the method to write to match the
abstraction: update the declaration in FilterBuilder (change write_to_file to
write), update any corresponding definition/implementation
(FilterBuilder::write) and all call sites that invoke write_to_file, and adjust
the Doxygen comment to reflect the new name; ensure the signature remains auto
write(clp::WriterInterface& writer) const -> void and run a build to find
remaining references.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: b23bbce7-1194-446b-85e0-959ba27e0aaa

📥 Commits

Reviewing files that changed from the base of the PR and between 0574294 and aeb07b4.

📒 Files selected for processing (2)
  • components/core/src/clp_s/filter/FilterBuilder.hpp
  • components/core/src/clp_s/filter/FilterReader.hpp

*/
[[nodiscard]] auto get_type() const -> FilterType { return m_type; }

[[nodiscard]] auto get_normalization() const -> FilterNormalization { return m_normalization; }
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial

Add a docblock for get_normalization() for consistency.

The get_type() accessor has a docblock, but get_normalization() does not. Consider adding a brief docblock for consistency.

📝 Suggested docblock
+    /**
+     * `@return` The parsed filter normalization strategy.
+     */
     [[nodiscard]] auto get_normalization() const -> FilterNormalization { return m_normalization; }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
[[nodiscard]] auto get_normalization() const -> FilterNormalization { return m_normalization; }
/**
* `@return` The parsed filter normalization strategy.
*/
[[nodiscard]] auto get_normalization() const -> FilterNormalization { return m_normalization; }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@components/core/src/clp_s/filter/FilterReader.hpp` at line 48, Add a brief
docblock above the get_normalization() accessor to match the style used for
get_type(); locate the method named get_normalization() in FilterReader.hpp and
insert a short comment describing that it returns the FilterNormalization for
this filter instance (mentioning it is a const, nodiscard accessor), following
the same wording/format used by the existing get_type() docblock for
consistency.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
components/core/src/clp_s/filter/BloomFilter.cpp (1)

75-97: ⚠️ Potential issue | 🟠 Major

Reject oversized payloads before allocating the bit array.

bit_array_bytes is trusted all the way to Line 96. A corrupt filter can declare a multi-GB array that is arithmetically consistent with bit_array_size, forcing a huge allocation or bad_alloc/OOM before we ever return CorruptFilterPayload. Please enforce a maximum serialized Bloom-filter size before constructing bit_array.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@components/core/src/clp_s/filter/BloomFilter.cpp` around lines 75 - 97,
Reject excessively large serialized bit arrays before allocating memory: after
reading bit_array_bytes and computing expected_bit_array_bytes (using
min_bytes_containing_bits) but before constructing
ystdlib::containers::Array<uint8_t> bit_array, add a check that
expected_bit_array_bytes (or bit_array_bytes) does not exceed a defensive
maximum (e.g., MAX_SERIALIZED_BLOOM_BYTES or a reasonable cap based on available
memory/limits); if it does, return
ErrorCode{ErrorCodeEnum::CorruptFilterPayload}. Update the logic around
reader.try_read_numeric_value, bit_array_size, bit_array_bytes, and the
allocation site for bit_array to perform the size validation first to avoid huge
allocations or OOM.
♻️ Duplicate comments (1)
components/core/src/clp_s/filter/FilterOptions.cpp (1)

22-29: ⚠️ Potential issue | 🟠 Major

Lowercase still is not safe for persisted filter keys.

normalize_string() still delegates to clp::string_utils::to_lower, which lowercases byte-by-byte. For UTF-8 input that is not a real case fold, and because the result is persisted by FilterBuilder::add() and re-applied by FilterReader::possibly_contains(), the on-disk key can vary by locale/process and miss non-ASCII case matches. Please make this path deterministic and explicit—either ASCII-only with validation, or a fixed Unicode-aware fold—before using it in the serialized format.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@components/core/src/clp_s/filter/FilterOptions.cpp` around lines 22 - 29,
normalize_string currently calls clp::string_utils::to_lower for
FilterNormalization::Lowercase which does byte-wise lowercasing and is not
deterministic for UTF-8; change normalize_string(FilterNormalization::Lowercase)
to one of two explicit, deterministic behaviors: either (A) validate that the
input is pure ASCII and perform ASCII-only lowercasing (reject or error for
non-ASCII so persisted keys remain deterministic), or (B) perform a proper
Unicode case fold using a deterministic library/function (e.g., a
unicode_case_fold helper based on ICU or a portable Unicode case-fold table) and
replace the clp::string_utils::to_lower call with that; ensure the chosen
behavior is used consistently when writing keys in FilterBuilder::add and
reading keys in FilterReader::possibly_contains, and add/update tests to cover
multi-byte UTF-8 inputs and the rejection or fold behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@components/core/src/clp_s/filter/FilterReader.cpp`:
- Around line 83-85: The possibly_contains implementation currently always calls
normalize_string causing an allocation; add a fast path in
FilterReader::possibly_contains that checks m_normalization (the "None" case)
and directly calls m_bloom_filter.possibly_contains using the original
std::string_view (or its hash/equivalent used by the bloom filter) to avoid
materializing a std::string, otherwise fall back to
normalize_string(normalized_value, m_normalization) and pass that to
m_bloom_filter.possibly_contains.

---

Outside diff comments:
In `@components/core/src/clp_s/filter/BloomFilter.cpp`:
- Around line 75-97: Reject excessively large serialized bit arrays before
allocating memory: after reading bit_array_bytes and computing
expected_bit_array_bytes (using min_bytes_containing_bits) but before
constructing ystdlib::containers::Array<uint8_t> bit_array, add a check that
expected_bit_array_bytes (or bit_array_bytes) does not exceed a defensive
maximum (e.g., MAX_SERIALIZED_BLOOM_BYTES or a reasonable cap based on available
memory/limits); if it does, return
ErrorCode{ErrorCodeEnum::CorruptFilterPayload}. Update the logic around
reader.try_read_numeric_value, bit_array_size, bit_array_bytes, and the
allocation site for bit_array to perform the size validation first to avoid huge
allocations or OOM.

---

Duplicate comments:
In `@components/core/src/clp_s/filter/FilterOptions.cpp`:
- Around line 22-29: normalize_string currently calls
clp::string_utils::to_lower for FilterNormalization::Lowercase which does
byte-wise lowercasing and is not deterministic for UTF-8; change
normalize_string(FilterNormalization::Lowercase) to one of two explicit,
deterministic behaviors: either (A) validate that the input is pure ASCII and
perform ASCII-only lowercasing (reject or error for non-ASCII so persisted keys
remain deterministic), or (B) perform a proper Unicode case fold using a
deterministic library/function (e.g., a unicode_case_fold helper based on ICU or
a portable Unicode case-fold table) and replace the clp::string_utils::to_lower
call with that; ensure the chosen behavior is used consistently when writing
keys in FilterBuilder::add and reading keys in FilterReader::possibly_contains,
and add/update tests to cover multi-byte UTF-8 inputs and the rejection or fold
behavior.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: db313041-ff87-47c3-981c-26fe9aa78562

📥 Commits

Reviewing files that changed from the base of the PR and between aeb07b4 and 4959a3a.

📒 Files selected for processing (8)
  • components/core/src/clp_s/filter/BloomFilter.cpp
  • components/core/src/clp_s/filter/BloomFilter.hpp
  • components/core/src/clp_s/filter/FilterBuilder.cpp
  • components/core/src/clp_s/filter/FilterBuilder.hpp
  • components/core/src/clp_s/filter/FilterOptions.cpp
  • components/core/src/clp_s/filter/FilterOptions.hpp
  • components/core/src/clp_s/filter/FilterReader.cpp
  • components/core/src/clp_s/filter/FilterReader.hpp

Comment on lines +83 to +85
auto FilterReader::possibly_contains(std::string_view value) const -> bool {
auto const normalized_value{normalize_string(value, m_normalization)};
return m_bloom_filter.possibly_contains(normalized_value);
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot Mar 18, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial

Fast-path the None normalization case.

This currently materializes a std::string on every lookup, even when no normalization is configured. The exact-match path can stay allocation-free by hashing the original std::string_view directly.

♻️ Suggested change
 auto FilterReader::possibly_contains(std::string_view value) const -> bool {
+    if (FilterNormalization::None == m_normalization) {
+        return m_bloom_filter.possibly_contains(value);
+    }
     auto const normalized_value{normalize_string(value, m_normalization)};
     return m_bloom_filter.possibly_contains(normalized_value);
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@components/core/src/clp_s/filter/FilterReader.cpp` around lines 83 - 85, The
possibly_contains implementation currently always calls normalize_string causing
an allocation; add a fast path in FilterReader::possibly_contains that checks
m_normalization (the "None" case) and directly calls
m_bloom_filter.possibly_contains using the original std::string_view (or its
hash/equivalent used by the bloom filter) to avoid materializing a std::string,
otherwise fall back to normalize_string(normalized_value, m_normalization) and
pass that to m_bloom_filter.possibly_contains.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I think the proper way to address this is to add a normalizer class that internally maintains a buffer and has a method normalize(std::string_view) -> std::string_view that will return an std::string_view of its internal buffer when it has to do normalization and return the input std::string_view otherwise.

Note that even when performing a given normalization you can sometimes scan and then directly return the input string without copying it into the internal buffer (e.g. when the input string is already lowercase).

That said, since we don't expect to have to check too many values for a given bloom filter + we expect that normalization will be required in the common case, optimizing to occasionally avoid a copy here probably isn't worthwhile at the moment.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gibber9809, that makes sense — the normalizer class approach is a cleaner abstraction, and deferring given the current usage patterns is reasonable. I'll drop this suggestion.


✏️ Learnings added
Learnt from: gibber9809
Repo: y-scope/clp PR: 2101
File: components/core/src/clp_s/filter/FilterReader.cpp:83-85
Timestamp: 2026-03-20T18:33:35.288Z
Learning: In `y-scope/clp`, the preferred approach for allocation-free normalization in `clp_s::filter` is a dedicated normalizer class that maintains an internal buffer and exposes a `normalize(std::string_view) -> std::string_view` method. It should return a `std::string_view` into its internal buffer when transformation is needed, or return the original `std::string_view` directly when the input is already in the normalized form (e.g., already lowercase). This optimization is considered not worthwhile in the short term for `FilterReader::possibly_contains` since bloom filter lookups are infrequent and normalization is expected to be the common case.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.


🧠 Learnings used
Learnt from: haiqi96
Repo: y-scope/clp PR: 523
File: components/core/src/clp/BufferedFileReader.cpp:96-106
Timestamp: 2024-10-24T14:45:26.265Z
Learning: In `components/core/src/clp/BufferedFileReader.cpp`, refactoring the nested error handling conditions may not apply due to the specific logic in the original code.

Learnt from: davemarco
Repo: y-scope/clp PR: 700
File: components/core/src/clp/streaming_archive/ArchiveMetadata.hpp:153-155
Timestamp: 2025-01-30T19:26:33.869Z
Learning: When working with constexpr strings (string literals with static storage duration), std::string_view is the preferred choice for member variables as it's more efficient and safe, avoiding unnecessary memory allocations.

Learnt from: haiqi96
Repo: y-scope/clp PR: 523
File: components/core/src/clp/BufferedFileReader.hpp:158-158
Timestamp: 2024-10-24T14:46:00.664Z
Learning: In `components/core/src/clp/BufferedFileReader.hpp`, the `BufferedFileReader::try_read` function explicitly calls `BufferReader::try_read`, so documentation should reference `BufferReader::try_read`.

Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 2093
File: components/core/src/clp_s/ffi/sfa/ClpArchiveReader.cpp:142-151
Timestamp: 2026-03-13T01:43:01.336Z
Learning: In y-scope/clp, `clp_s::ffi::sfa::ClpArchiveReader` is scoped exclusively to single-file archives (SFA). In an SFA, a single source file is guaranteed never to be split across multiple range-index entries, so `m_file_infos` and `m_file_names` are always 1:1 with source files. Do not flag concerns about coalescing split-file range fragments in this class.

Learnt from: SharafMohamed
Repo: y-scope/clp PR: 1313
File: components/core/tests/test-SchemaSearcher.cpp:136-137
Timestamp: 2025-12-09T01:57:04.839Z
Learning: In C++ files under the components directory, keep forward declarations even when a definition follows in the same scope (such as in anonymous namespaces) to maintain consistency across multiple methods. This style reduces churn and aligns with the existing convention in this repository for both source and tests.

Learnt from: ShangDanLuXian
Repo: y-scope/clp PR: 2010
File: components/core/src/clp_s/filter/FilterBuilder.cpp:24-33
Timestamp: 2026-02-20T15:19:43.528Z
Learning: In the components/core/src/clp_s module, FileWriter usage across multipleCpp files (e.g., JsonConstructor.cpp, SchemaMap.cpp, FilterBuilder.cpp) intentionally omits try-catch around write operations, and treats potential write failures as acceptable data loss. Do not add per-write try-catch in these files; instead, rely on the caller/integration layer to handle exceptions. Apply this pattern to similar FileWriter usage within this module, ensuring reviewers understand that error handling is deferred and that this behavior is consistent across related files.

@gibber9809 gibber9809 self-requested a review March 20, 2026 17:59
Copy link
Copy Markdown
Contributor

@gibber9809 gibber9809 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apologies for taking a long time to get back to this. Just a few details I think are worth changing, but otherwise looks like this PR is looking good.

Comment thread components/core/src/clp_s/filter/FilterReader.hpp Outdated
Comment thread components/core/src/clp_s/filter/FilterReader.hpp Outdated
* @param value
* @return true if the filter may contain `value`; false if it definitely does not.
*/
[[nodiscard]] auto possibly_contains_wildcard(std::string_view value) const -> bool;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we rename to possibly_contains_query_string, or another more specific name? possibly_contains_wildcard reads to me like we're checking if the filter contains string values with wildcards.

Comment thread components/core/src/clp_s/filter/FilterReader.cpp Outdated
}

auto FilterBuilder::write_to_file(clp::WriterInterface& writer) const -> void {
writer.write(cFilterFileMagic.data(), cFilterFileMagic.size());
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, revisiting this one more time since the new plan doesn't involve staging anymore -- do you think we still need the file magic here?

I'm also happy to leave this for now and decide whether we should change it when we get to writing the indexing flow to avoid holding up this PR any longer.

Co-authored-by: Devin Gibson <gibber9809@users.noreply.github.com>
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@components/core/src/clp_s/filter/FilterReader.cpp`:
- Around line 88-99: The Bloom case in FilterReader::possibly_contains_wildcard
uses clp_s::search::ast::has_unescaped_wildcards and
clp::string_utils::unescape_string but their headers are not included and the
case body lacks braces; add explicit includes for the headers that define
has_unescaped_wildcards and unescape_string at the top of the file (do not rely
on transitive includes like FilterOptions.hpp) and wrap the FilterType::Bloom
case body in { ... } so the auto const unescaped_value declaration is scoped
properly within possibly_contains_wildcard.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: cabfe753-a8df-402d-9619-9a7d6abf4712

📥 Commits

Reviewing files that changed from the base of the PR and between 4959a3a and 06437b4.

📒 Files selected for processing (2)
  • components/core/src/clp_s/filter/FilterReader.cpp
  • components/core/src/clp_s/filter/FilterReader.hpp

Comment on lines +88 to +99
auto FilterReader::possibly_contains_wildcard(std::string_view value) const -> bool {
switch (m_type) {
case FilterType::Bloom:
if (clp_s::search::ast::has_unescaped_wildcards(value)) {
return true;
}
auto const unescaped_value{clp::string_utils::unescape_string(value)};
return possibly_contains(unescaped_value);
}

return true;
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
fd -t f 'SearchUtils.hpp' components/core/src/clp_s/search/ast
fd -t f 'string_utils.hpp' components/core/src/clp/string_utils
rg -nP '\bunescape_string\s*\(' -C2 --type=cpp --type=hpp
rg -nP '\bhas_unescaped_wildcards\s*\(' -C2 --type=cpp --type=hpp

Repository: y-scope/clp

Length of output: 234


🏁 Script executed:

cat -n components/core/src/clp_s/filter/FilterReader.cpp | head -40

Repository: y-scope/clp

Length of output: 1377


🏁 Script executed:

rg -n 'unescape_string' components/core/src/clp/string_utils/string_utils.hpp -A 3

Repository: y-scope/clp

Length of output: 185


🏁 Script executed:

rg -n 'has_unescaped_wildcards' components/core/src/clp_s/search/ast/SearchUtils.hpp -A 3

Repository: y-scope/clp

Length of output: 216


🏁 Script executed:

cat -n components/core/src/clp_s/filter/FilterOptions.hpp | head -30

Repository: y-scope/clp

Length of output: 843


🏁 Script executed:

sed -n '88,99p' components/core/src/clp_s/filter/FilterReader.cpp

Repository: y-scope/clp

Length of output: 482


🏁 Script executed:

cat components/core/src/clp_s/filter/FilterOptions.hpp | head -50

Repository: y-scope/clp

Length of output: 1269


🏁 Script executed:

rg -n 'include.*SearchUtils|include.*string_utils' components/core/src/clp_s/filter/FilterOptions.hpp

Repository: y-scope/clp

Length of output: 37


Add missing includes and wrap the case body in braces.

Two critical issues in possibly_contains_wildcard:

  1. This translation unit uses clp_s::search::ast::has_unescaped_wildcards (line 91) and clp::string_utils::unescape_string (line 94), but neither header is included. The code currently compiles only via transitive includes through FilterOptions.hpp. Add explicit includes for both to prevent build breaks if transitive paths change.

  2. The variable auto const unescaped_value{...} is declared directly under the case label without braces. While it compiles today because Bloom is the only case, adding another FilterType will trigger a cross-initialization error (jump to case label crosses initialization). Wrap the case body in {}.

Proposed fix
 `#include` <clp/ReaderInterface.hpp>
+#include <clp/string_utils/string_utils.hpp>
+
+#include <clp_s/search/ast/SearchUtils.hpp>

 `#include` "BloomFilter.hpp"
 `#include` "ErrorCode.hpp"
 `#include` "FilterOptions.hpp"
 auto FilterReader::possibly_contains_wildcard(std::string_view value) const -> bool {
     switch (m_type) {
-        case FilterType::Bloom:
+        case FilterType::Bloom: {
             if (clp_s::search::ast::has_unescaped_wildcards(value)) {
                 return true;
             }
             auto const unescaped_value{clp::string_utils::unescape_string(value)};
             return possibly_contains(unescaped_value);
+        }
     }

     return true;
 }
🧰 Tools
🪛 Cppcheck (2.20.0)

[style] 88-88: The function 'possibly_contains_wildcard' is never used.

(unusedFunction)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@components/core/src/clp_s/filter/FilterReader.cpp` around lines 88 - 99, The
Bloom case in FilterReader::possibly_contains_wildcard uses
clp_s::search::ast::has_unescaped_wildcards and
clp::string_utils::unescape_string but their headers are not included and the
case body lacks braces; add explicit includes for the headers that define
has_unescaped_wildcards and unescape_string at the top of the file (do not rely
on transitive includes like FilterOptions.hpp) and wrap the FilterType::Bloom
case body in { ... } so the auto const unescaped_value declaration is scoped
properly within possibly_contains_wildcard.

Copy link
Copy Markdown
Contributor

@gibber9809 gibber9809 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes look good, but will hold off on marking approved until we finish cutting a release from main.

For the PR title, how about:

feat(clp-s::filter): Add abstraction layer for filter building, serialization, and deserialization.

@ShangDanLuXian ShangDanLuXian changed the title feat(clp_s::filter): add filter builder/reader serialization layer feat(clp-s::filter): Add abstraction layer for filter building, serialization, and deserialization. Apr 21, 2026
@gibber9809 gibber9809 self-requested a review April 22, 2026 14:33
Copy link
Copy Markdown
Contributor

@gibber9809 gibber9809 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Marking approved now that the release is done.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants