feat(clp-s::filter): Add abstraction layer for filter building, serialization, and deserialization.#2101
feat(clp-s::filter): Add abstraction layer for filter building, serialization, and deserialization.#2101ShangDanLuXian wants to merge 7 commits intoy-scope:mainfrom
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
WalkthroughAdds a filter subsystem: new FilterOptions (types & normalization), FilterBuilder and FilterReader for Bloom filters; extends BloomFilter I/O with validation and renamed methods; adds two new error codes and updates CMake to include new filter sources and link Changes
Sequence Diagram(s)sequenceDiagram
participant Dev as Developer
participant Builder as FilterBuilder
participant Bloom as BloomFilter
participant Writer as WriterInterface
participant Reader as FilterReader
participant ReaderIF as ReaderInterface
Dev->>Builder: create(type, normalization, expected, fpr)
Builder->>Bloom: BloomFilter::create(...)
Bloom-->>Builder: bloom instance
Dev->>Builder: add(value)
Builder->>Bloom: insert(normalize(value))
Dev->>Builder: write(Writer)
Builder->>Writer: write(type_id, normalization_id, bloom data)
Reader->>ReaderIF: try_read(ReaderIF)
ReaderIF->>Reader: read(type_id, normalization_id)
Reader->>ReaderIF: BloomFilter::try_read(ReaderIF)
ReaderIF-->>Reader: bloom instance
Dev->>Reader: possibly_contains_query_string(query)
Reader->>Bloom: possibly_contains(normalize(query))
Bloom-->>Reader: bool
Reader-->>Dev: Result / bool
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes 🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@components/core/src/clp_s/filter/FilterOptions.cpp`:
- Around line 44-46: The lowercase normalization currently uses
clp::string_utils::to_lower which relies on the process locale and causes
on-disk incompatibility; update the normalization to be locale-independent by
making clp::string_utils::to_lower use the C/POSIX ("classic") locale or an
explicit Unicode-aware normalizer so results are stable across environments
(i.e., pin to std::locale::classic or use a stable UTF-8
normalization/NFKC+lowercase impl), and ensure both FilterBuilder::write_to_file
and FilterReader::possibly_contains use that same deterministic normalization
path.
In `@components/core/src/clp_s/filter/FilterReader.hpp`:
- Around line 22-30: Merge the duplicate `@return` tags in the FilterReader
docblock into a single `@return` section that lists both the initialization error
cases and the forwarded payload-read failure case; update the docblock in
FilterReader.hpp (the function/constructor that returns Result<FilterReader,
ErrorCodeEnum>) so the single `@return` contains the bullet list for metadata
errors (CorruptFilterFile, UnsupportedFilterType,
UnsupportedFilterNormalization, ReadFailure) and a final line stating that
payload read failures forward the underlying filter implementation's return
values.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 7d2c7580-4291-49b2-976c-897530bda378
📒 Files selected for processing (10)
components/core/cmake/Options/options.cmakecomponents/core/src/clp_s/filter/CMakeLists.txtcomponents/core/src/clp_s/filter/ErrorCode.cppcomponents/core/src/clp_s/filter/ErrorCode.hppcomponents/core/src/clp_s/filter/FilterBuilder.cppcomponents/core/src/clp_s/filter/FilterBuilder.hppcomponents/core/src/clp_s/filter/FilterOptions.cppcomponents/core/src/clp_s/filter/FilterOptions.hppcomponents/core/src/clp_s/filter/FilterReader.cppcomponents/core/src/clp_s/filter/FilterReader.hpp
| case FilterNormalization::Lowercase: | ||
| clp::string_utils::to_lower(normalized); | ||
| return normalized; |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "=== to_lower implementation used by filter normalization ==="
sed -n '135,160p' components/core/src/clp-rust-utils/string_utils/string_utils.hpp
echo
echo "=== filter normalization call chain ==="
rg -n -C2 '\bnormalize_string\s*\(|\bto_lower\s*\(' \
components/core/src/clp_s/filter/FilterOptions.cpp \
components/core/src/clp_s/filter/FilterBuilder.cpp \
components/core/src/clp_s/filter/FilterReader.cpp
echo
echo "=== explicit locale pinning in core sources (if any) ==="
rg -n -C2 '\bsetlocale\s*\(|std::locale::global\s*\(' components/core/srcRepository: y-scope/clp
Length of output: 240
🏁 Script executed:
# Find string_utils files
find components -name "*string_utils*" -type f | head -20Repository: y-scope/clp
Length of output: 206
🏁 Script executed:
# Check the FilterOptions.cpp file around line 45
head -50 components/core/src/clp_s/filter/FilterOptions.cpp | tail -20Repository: y-scope/clp
Length of output: 627
🏁 Script executed:
# Search for to_lower usage in the filter directory
rg -n "to_lower" components/core/src/clp_s/filter/Repository: y-scope/clp
Length of output: 161
🏁 Script executed:
# Read the string_utils.hpp to find to_lower implementation
cat -n components/core/src/clp/string_utils/string_utils.hppRepository: y-scope/clp
Length of output: 6711
🏁 Script executed:
# Search for locale-related code in the codebase
rg -n "setlocale|std::locale::global" components/core/srcRepository: y-scope/clp
Length of output: 37
🏁 Script executed:
# Read the implementation of to_lower in the .cpp file
cat -n components/core/src/clp/string_utils/string_utils.cppRepository: y-scope/clp
Length of output: 13567
🏁 Script executed:
# Search for FilterOptions usage and persistence
rg -n "FilterOptions|FilterNormalization" components/core/src/clp_s/filter/ --type cpp --type hpp | head -30Repository: y-scope/clp
Length of output: 82
🏁 Script executed:
# Look for any disk/persistence operations with filters
rg -n "write|persist|serialize|save" components/core/src/clp_s/filter/ --type cpp --type hpp | grep -i filterRepository: y-scope/clp
Length of output: 82
🏁 Script executed:
# Search for FilterOptions usage
rg -n "FilterOptions|FilterNormalization" components/core/src/clp_s/filter/ | head -40Repository: y-scope/clp
Length of output: 4048
🏁 Script executed:
# Check FilterBuilder and FilterReader for how filters are persisted/loaded
rg -n "normalize_string|to_lower" components/core/src/clp_s/filter/FilterBuilder.cpp components/core/src/clp_s/filter/FilterReader.cppRepository: y-scope/clp
Length of output: 311
🏁 Script executed:
# Look at the FilterBuilder to understand the flow
head -100 components/core/src/clp_s/filter/FilterBuilder.cppRepository: y-scope/clp
Length of output: 1934
🏁 Script executed:
# Verify FilterReader applies normalization at query time
cat -n components/core/src/clp_s/filter/FilterReader.cpp | sed -n '50,70p'Repository: y-scope/clp
Length of output: 838
🏁 Script executed:
# Check if there are any tests showing the locale-dependent behavior
rg -n "locale|tolower|Lowercase" components/core/tests/ | grep -i "filter\|normalize"Repository: y-scope/clp
Length of output: 37
Locale-dependent filter normalization affects on-disk compatibility.
The to_lower function at line 45 relies on std::tolower with the process locale. When filters built and queried under different locales normalize non-ASCII input differently, they produce membership mismatches in the Bloom filter.
The normalized values are persisted to disk as part of the Bloom filter (FilterBuilder::write_to_file), and the same normalization is reapplied during queries (FilterReader::possibly_contains). Without an explicitly pinned locale, the behaviour becomes environment-dependent.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@components/core/src/clp_s/filter/FilterOptions.cpp` around lines 44 - 46, The
lowercase normalization currently uses clp::string_utils::to_lower which relies
on the process locale and causes on-disk incompatibility; update the
normalization to be locale-independent by making clp::string_utils::to_lower use
the C/POSIX ("classic") locale or an explicit Unicode-aware normalizer so
results are stable across environments (i.e., pin to std::locale::classic or use
a stable UTF-8 normalization/NFKC+lowercase impl), and ensure both
FilterBuilder::write_to_file and FilterReader::possibly_contains use that same
deterministic normalization path.
There was a problem hiding this comment.
Incorrect -- the behaviour of clp::string_utils::to_lower doesn't depend on locale. Generally the approach for handling lower-casing for different locales will probably be to treat them as different kinds of normalization.
There was a problem hiding this comment.
@gibber9809, thanks for the correction — I apologize for the inaccurate review comment. I'll note that clp::string_utils::to_lower is locale-independent for future reviews.
✏️ Learnings added
Learnt from: gibber9809
Repo: y-scope/clp PR: 2101
File: components/core/src/clp_s/filter/FilterOptions.cpp:27-29
Timestamp: 2026-03-20T18:37:35.141Z
Learning: In y-scope/clp, `clp::string_utils::to_lower` (defined in `components/core/src/clp/string_utils/string_utils.hpp`) is locale-independent. Do not raise locale-sensitivity concerns about its usage. For handling lower-casing across different locales in the clp_s::filter module, the approach is to treat different locale lowercasings as distinct normalization types.
Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.
🧠 Learnings used
Learnt from: gibber9809
Repo: y-scope/clp PR: 2101
File: components/core/src/clp_s/filter/FilterReader.cpp:83-85
Timestamp: 2026-03-20T18:33:35.288Z
Learning: In y-scope/clp’s clp_s filtering code (e.g., clp_s::filter), prefer an allocation-free normalization design using a dedicated normalizer class. The normalizer should own an internal buffer and provide `normalize(std::string_view) -> std::string_view`. When normalization/transformation is required, return a `std::string_view` pointing into the normalizer’s internal buffer; when the input is already in the normalized form (e.g., already lowercase), return the original `std::string_view` directly.
Do not over-optimize this path for `FilterReader::possibly_contains`; it’s acceptable to skip the short-term allocation-free optimization there because bloom filter lookups are infrequent and normalization is expected to be the common case.
Learnt from: haiqi96
Repo: y-scope/clp PR: 523
File: components/core/src/clp/BufferedFileReader.cpp:96-106
Timestamp: 2024-10-24T14:45:26.265Z
Learning: In `components/core/src/clp/BufferedFileReader.cpp`, refactoring the nested error handling conditions may not apply due to the specific logic in the original code.
Learnt from: LinZhihao-723
Repo: y-scope/clp PR: 873
File: components/core/src/clp/ffi/ir_stream/search/QueryHandlerImpl.hpp:320-324
Timestamp: 2025-05-05T01:12:18.561Z
Learning: In the CLP codebase, the `m_case_sensitive_search` flag is used only for actual string value comparisons during query evaluation, not for schema key name matching. Schema keys are always compared case-sensitively regardless of this flag's setting.
Learnt from: SharafMohamed
Repo: y-scope/clp PR: 1313
File: components/core/tests/MockVariableDictionary.hpp:69-79
Timestamp: 2026-01-14T01:19:12.982Z
Learning: In the CLP codebase, case-insensitive tests are not expected for the mock dictionary classes (MockVariableDictionary and MockLogTypeDictionary) used in unit tests. The `ignore_case` parameter in methods like `get_entries_matching_wildcard_string` and `get_entry_matching_value` can safely ignore the parameter and always use case-sensitive matching.
Learnt from: SharafMohamed
Repo: y-scope/clp PR: 1313
File: components/core/tests/test-SchemaSearcher.cpp:136-137
Timestamp: 2025-12-09T01:57:04.839Z
Learning: In C++ files under the components directory, keep forward declarations even when a definition follows in the same scope (such as in anonymous namespaces) to maintain consistency across multiple methods. This style reduces churn and aligns with the existing convention in this repository for both source and tests.
Learnt from: ShangDanLuXian
Repo: y-scope/clp PR: 2010
File: components/core/src/clp_s/filter/FilterBuilder.cpp:24-33
Timestamp: 2026-02-20T15:19:43.528Z
Learning: In the components/core/src/clp_s module, FileWriter usage across multipleCpp files (e.g., JsonConstructor.cpp, SchemaMap.cpp, FilterBuilder.cpp) intentionally omits try-catch around write operations, and treats potential write failures as acceptable data loss. Do not add per-write try-catch in these files; instead, rely on the caller/integration layer to handle exceptions. Apply this pattern to similar FileWriter usage within this module, ensuring reviewers understand that error handling is deferred and that this behavior is consistent across related files.
| return std::nullopt; | ||
| } | ||
|
|
||
| auto try_parse_filter_type(std::string_view filter_type_string) -> std::optional<FilterType> { |
There was a problem hiding this comment.
This one will be used during integration to reject filter writing when processing command line arguments.
Probably I also need a method to return all the available filter type strings so that can show in help.
There was a problem hiding this comment.
I'd maybe leave it for the integration PR then since it isn't directly related, but not a huge deal either way.
If you do keep it though "bloom" needs to be a constexpr std::string_view per our style guide.
gibber9809
left a comment
There was a problem hiding this comment.
Did an initial pass and looks good overall, just some details to discuss. I think I generally agree that it's fine to not make the interface completely generic to different kinds of filters at the moment -- mostly looks like it will just require a bit of refactoring for the create methods down the line.
| * Writes a serialized filter representation to a writer. | ||
| * @param writer | ||
| */ | ||
| auto write_to_file(clp::WriterInterface& writer) const -> void; |
There was a problem hiding this comment.
Since we're writing to a WriterInterface maybe we should just call this method write instead of write_to_file?
| /** | ||
| * @return The configured filter type. | ||
| */ | ||
| [[nodiscard]] auto get_type() const -> FilterType { return m_type; } | ||
|
|
||
| [[nodiscard]] auto get_normalization() const -> FilterNormalization { return m_normalization; } |
There was a problem hiding this comment.
Don't necessarily need docstrings for simple get methods that are described by their name, but if you put it on one here you should put it on both. Also fine to just get rid of the docstring on get_type.
| if (FilterNormalization::None != normalization | ||
| && FilterNormalization::Lowercase != normalization) | ||
| { | ||
| return ErrorCode{ErrorCodeEnum::UnsupportedFilterNormalization}; | ||
| } | ||
|
|
||
| switch (type) { | ||
| case FilterType::Bloom: { | ||
| auto bloom_filter{YSTDLIB_ERROR_HANDLING_TRYX( | ||
| BloomFilter::create(expected_num_elements, false_positive_rate) | ||
| )}; | ||
| return FilterBuilder{type, normalization, std::move(bloom_filter)}; | ||
| } | ||
| } |
There was a problem hiding this comment.
It reads a bit strangely to use if for conditions on one enum and switch for the conditions on the other enum. For consistency, could we use switch for both?
| } | ||
|
|
||
| auto FilterBuilder::write_to_file(clp::WriterInterface& writer) const -> void { | ||
| writer.write(cFilterFileMagic.data(), cFilterFileMagic.size()); |
There was a problem hiding this comment.
Do you think that this needs file magic?
Within an archive we generally don't use file magic, and instead rely on some archive-level metadata about the different sections of the archive. I can see the argument for when this is external to an archive, but I'm not sure if this file format is actually independent as it is right now anyway (in that it doesn't indicate which archive it's a filter for), so to use it externally to an archive we have to have some external way to tell that this is a filter for a given archive anyway. Potentially open to keeping it if you can think of a use-case though.
When we get to doing packed filters though I can definitely see file magic making sense.
There was a problem hiding this comment.
There is intentionally no archive-to-filter binding encoded in this library itself, since this filter serialization layer may be used in different ways in the future.
My reason for keeping the file magic is narrower: it gives us a lightweight validation that the bytes being packed are actually a serialized filter file, which helps guard against malformed files or unrelated junk in the staging area.
That said, I agree this is arguable. In my prototype implementation, the packing flow identified staged filters by filename convention:
<filter_output_dir>/<archive_id>.var.dict.filter
So in that flow, the archive association comes from the filename/path convention rather than from the filter payload itself. That already filters out most unrelated files in the staging directory (if there is any), unless something else happens to match the same naming pattern. Do you think we need to protect that case?
There was a problem hiding this comment.
I think your reason for keeping it as lightweight validation when it's staged makes sense, so lets keep it for now. We can always revisit whether the file magic part is required for filters in archives when we get to that.
There was a problem hiding this comment.
Sorry, revisiting this one more time since the new plan doesn't involve staging anymore -- do you think we still need the file magic here?
I'm also happy to leave this for now and decide whether we should change it when we get to writing the indexing flow to avoid holding up this PR any longer.
| return std::nullopt; | ||
| } | ||
|
|
||
| auto try_parse_filter_type(std::string_view filter_type_string) -> std::optional<FilterType> { |
There was a problem hiding this comment.
I'd maybe leave it for the integration PR then since it isn't directly related, but not a huge deal either way.
If you do keep it though "bloom" needs to be a constexpr std::string_view per our style guide.
| [[nodiscard]] auto try_parse_filter_type_id(uint8_t filter_type_id) -> std::optional<FilterType>; | ||
|
|
||
| /** | ||
| * Parses a filter type string. | ||
| * @param filter_type_string Case-sensitive filter type string. | ||
| * @return Parsed filter type for known strings, or std::nullopt otherwise. | ||
| */ | ||
| [[nodiscard]] auto try_parse_filter_type(std::string_view filter_type_string) | ||
| -> std::optional<FilterType>; | ||
|
|
||
| /** | ||
| * Parses a serialized filter normalization ID. | ||
| * @param filter_normalization_id Encoded filter normalization ID from serialized payload. | ||
| * @return Parsed filter normalization for known IDs, or std::nullopt otherwise. | ||
| */ | ||
| [[nodiscard]] auto try_parse_filter_normalization_id(uint8_t filter_normalization_id) | ||
| -> std::optional<FilterNormalization>; | ||
|
|
There was a problem hiding this comment.
Besides the command line related method, are there any users for these parse methods besides FilterReader? I'm wondering whether it makes sense for these to just be private methods in that class.
There was a problem hiding this comment.
I grouped these together intentionally since they are closely related, but I agree this is probably exposing a bit too much. I think it makes sense to keep the string-based one shared, and move the serialized-ID ones into FilterReader.
| * @param value | ||
| * @return true if the filter may contain `value`; false if it definitely does not. | ||
| */ | ||
| [[nodiscard]] auto possibly_contains(std::string_view value) const -> bool; |
There was a problem hiding this comment.
Do you think it makes sense to have another method here that accepts a wildcard search string and returns whether it is possibly contained in the data based on the filter?
For the current bloom filter based approach it'll just be always true unless the search string contains no wildcards, in which case we actually do the lookup, but even that could be useful. Definitely more useful for other kinds of filters though.
There was a problem hiding this comment.
Yeah, it makes sense. I’ll add a wildcard-check function with a switch on filter type; for Bloom, it will always return true, as this type is for exact match only.
Co-authored-by: Devin Gibson <gibber9809@users.noreply.github.com>
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (2)
components/core/src/clp_s/filter/FilterBuilder.hpp (2)
51-56: 🧹 Nitpick | 🔵 TrivialAdd missing docstring for
get_normalization().The
get_type()method has a docstring butget_normalization()does not. For consistency, either add a docstring toget_normalization()or remove the one fromget_type().📝 Proposed fix to add docstring
[[nodiscard]] auto get_type() const -> FilterType { return m_type; } + /** + * `@return` The configured normalization strategy. + */ [[nodiscard]] auto get_normalization() const -> FilterNormalization { return m_normalization; }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@components/core/src/clp_s/filter/FilterBuilder.hpp` around lines 51 - 56, Add a missing docstring for the getter get_normalization() to match get_type(): locate the method [[nodiscard]] auto get_normalization() const -> FilterNormalization and add a brief Doxygen comment (e.g., "@return The configured filter normalization.") so both getters have consistent documentation; alternatively remove the docstring from get_type() if the project prefers no getter comments, but keep style consistent across get_type() and get_normalization().
45-49: 🧹 Nitpick | 🔵 TrivialConsider renaming
write_to_filetowrite.Since this method writes to a
WriterInterfacerather than directly to a file, the namewritewould be more accurate and consistent with the abstraction level.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@components/core/src/clp_s/filter/FilterBuilder.hpp` around lines 45 - 49, The method name write_to_file is misleading because it writes to a WriterInterface; rename the method to write to match the abstraction: update the declaration in FilterBuilder (change write_to_file to write), update any corresponding definition/implementation (FilterBuilder::write) and all call sites that invoke write_to_file, and adjust the Doxygen comment to reflect the new name; ensure the signature remains auto write(clp::WriterInterface& writer) const -> void and run a build to find remaining references.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@components/core/src/clp_s/filter/FilterReader.hpp`:
- Line 48: Add a brief docblock above the get_normalization() accessor to match
the style used for get_type(); locate the method named get_normalization() in
FilterReader.hpp and insert a short comment describing that it returns the
FilterNormalization for this filter instance (mentioning it is a const,
nodiscard accessor), following the same wording/format used by the existing
get_type() docblock for consistency.
---
Duplicate comments:
In `@components/core/src/clp_s/filter/FilterBuilder.hpp`:
- Around line 51-56: Add a missing docstring for the getter get_normalization()
to match get_type(): locate the method [[nodiscard]] auto get_normalization()
const -> FilterNormalization and add a brief Doxygen comment (e.g., "@return The
configured filter normalization.") so both getters have consistent
documentation; alternatively remove the docstring from get_type() if the project
prefers no getter comments, but keep style consistent across get_type() and
get_normalization().
- Around line 45-49: The method name write_to_file is misleading because it
writes to a WriterInterface; rename the method to write to match the
abstraction: update the declaration in FilterBuilder (change write_to_file to
write), update any corresponding definition/implementation
(FilterBuilder::write) and all call sites that invoke write_to_file, and adjust
the Doxygen comment to reflect the new name; ensure the signature remains auto
write(clp::WriterInterface& writer) const -> void and run a build to find
remaining references.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: b23bbce7-1194-446b-85e0-959ba27e0aaa
📒 Files selected for processing (2)
components/core/src/clp_s/filter/FilterBuilder.hppcomponents/core/src/clp_s/filter/FilterReader.hpp
| */ | ||
| [[nodiscard]] auto get_type() const -> FilterType { return m_type; } | ||
|
|
||
| [[nodiscard]] auto get_normalization() const -> FilterNormalization { return m_normalization; } |
There was a problem hiding this comment.
🧹 Nitpick | 🔵 Trivial
Add a docblock for get_normalization() for consistency.
The get_type() accessor has a docblock, but get_normalization() does not. Consider adding a brief docblock for consistency.
📝 Suggested docblock
+ /**
+ * `@return` The parsed filter normalization strategy.
+ */
[[nodiscard]] auto get_normalization() const -> FilterNormalization { return m_normalization; }📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| [[nodiscard]] auto get_normalization() const -> FilterNormalization { return m_normalization; } | |
| /** | |
| * `@return` The parsed filter normalization strategy. | |
| */ | |
| [[nodiscard]] auto get_normalization() const -> FilterNormalization { return m_normalization; } |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@components/core/src/clp_s/filter/FilterReader.hpp` at line 48, Add a brief
docblock above the get_normalization() accessor to match the style used for
get_type(); locate the method named get_normalization() in FilterReader.hpp and
insert a short comment describing that it returns the FilterNormalization for
this filter instance (mentioning it is a const, nodiscard accessor), following
the same wording/format used by the existing get_type() docblock for
consistency.
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
components/core/src/clp_s/filter/BloomFilter.cpp (1)
75-97:⚠️ Potential issue | 🟠 MajorReject oversized payloads before allocating the bit array.
bit_array_bytesis trusted all the way to Line 96. A corrupt filter can declare a multi-GB array that is arithmetically consistent withbit_array_size, forcing a huge allocation orbad_alloc/OOM before we ever returnCorruptFilterPayload. Please enforce a maximum serialized Bloom-filter size before constructingbit_array.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@components/core/src/clp_s/filter/BloomFilter.cpp` around lines 75 - 97, Reject excessively large serialized bit arrays before allocating memory: after reading bit_array_bytes and computing expected_bit_array_bytes (using min_bytes_containing_bits) but before constructing ystdlib::containers::Array<uint8_t> bit_array, add a check that expected_bit_array_bytes (or bit_array_bytes) does not exceed a defensive maximum (e.g., MAX_SERIALIZED_BLOOM_BYTES or a reasonable cap based on available memory/limits); if it does, return ErrorCode{ErrorCodeEnum::CorruptFilterPayload}. Update the logic around reader.try_read_numeric_value, bit_array_size, bit_array_bytes, and the allocation site for bit_array to perform the size validation first to avoid huge allocations or OOM.
♻️ Duplicate comments (1)
components/core/src/clp_s/filter/FilterOptions.cpp (1)
22-29:⚠️ Potential issue | 🟠 Major
Lowercasestill is not safe for persisted filter keys.
normalize_string()still delegates toclp::string_utils::to_lower, which lowercases byte-by-byte. For UTF-8 input that is not a real case fold, and because the result is persisted byFilterBuilder::add()and re-applied byFilterReader::possibly_contains(), the on-disk key can vary by locale/process and miss non-ASCII case matches. Please make this path deterministic and explicit—either ASCII-only with validation, or a fixed Unicode-aware fold—before using it in the serialized format.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@components/core/src/clp_s/filter/FilterOptions.cpp` around lines 22 - 29, normalize_string currently calls clp::string_utils::to_lower for FilterNormalization::Lowercase which does byte-wise lowercasing and is not deterministic for UTF-8; change normalize_string(FilterNormalization::Lowercase) to one of two explicit, deterministic behaviors: either (A) validate that the input is pure ASCII and perform ASCII-only lowercasing (reject or error for non-ASCII so persisted keys remain deterministic), or (B) perform a proper Unicode case fold using a deterministic library/function (e.g., a unicode_case_fold helper based on ICU or a portable Unicode case-fold table) and replace the clp::string_utils::to_lower call with that; ensure the chosen behavior is used consistently when writing keys in FilterBuilder::add and reading keys in FilterReader::possibly_contains, and add/update tests to cover multi-byte UTF-8 inputs and the rejection or fold behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@components/core/src/clp_s/filter/FilterReader.cpp`:
- Around line 83-85: The possibly_contains implementation currently always calls
normalize_string causing an allocation; add a fast path in
FilterReader::possibly_contains that checks m_normalization (the "None" case)
and directly calls m_bloom_filter.possibly_contains using the original
std::string_view (or its hash/equivalent used by the bloom filter) to avoid
materializing a std::string, otherwise fall back to
normalize_string(normalized_value, m_normalization) and pass that to
m_bloom_filter.possibly_contains.
---
Outside diff comments:
In `@components/core/src/clp_s/filter/BloomFilter.cpp`:
- Around line 75-97: Reject excessively large serialized bit arrays before
allocating memory: after reading bit_array_bytes and computing
expected_bit_array_bytes (using min_bytes_containing_bits) but before
constructing ystdlib::containers::Array<uint8_t> bit_array, add a check that
expected_bit_array_bytes (or bit_array_bytes) does not exceed a defensive
maximum (e.g., MAX_SERIALIZED_BLOOM_BYTES or a reasonable cap based on available
memory/limits); if it does, return
ErrorCode{ErrorCodeEnum::CorruptFilterPayload}. Update the logic around
reader.try_read_numeric_value, bit_array_size, bit_array_bytes, and the
allocation site for bit_array to perform the size validation first to avoid huge
allocations or OOM.
---
Duplicate comments:
In `@components/core/src/clp_s/filter/FilterOptions.cpp`:
- Around line 22-29: normalize_string currently calls
clp::string_utils::to_lower for FilterNormalization::Lowercase which does
byte-wise lowercasing and is not deterministic for UTF-8; change
normalize_string(FilterNormalization::Lowercase) to one of two explicit,
deterministic behaviors: either (A) validate that the input is pure ASCII and
perform ASCII-only lowercasing (reject or error for non-ASCII so persisted keys
remain deterministic), or (B) perform a proper Unicode case fold using a
deterministic library/function (e.g., a unicode_case_fold helper based on ICU or
a portable Unicode case-fold table) and replace the clp::string_utils::to_lower
call with that; ensure the chosen behavior is used consistently when writing
keys in FilterBuilder::add and reading keys in FilterReader::possibly_contains,
and add/update tests to cover multi-byte UTF-8 inputs and the rejection or fold
behavior.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: db313041-ff87-47c3-981c-26fe9aa78562
📒 Files selected for processing (8)
components/core/src/clp_s/filter/BloomFilter.cppcomponents/core/src/clp_s/filter/BloomFilter.hppcomponents/core/src/clp_s/filter/FilterBuilder.cppcomponents/core/src/clp_s/filter/FilterBuilder.hppcomponents/core/src/clp_s/filter/FilterOptions.cppcomponents/core/src/clp_s/filter/FilterOptions.hppcomponents/core/src/clp_s/filter/FilterReader.cppcomponents/core/src/clp_s/filter/FilterReader.hpp
| auto FilterReader::possibly_contains(std::string_view value) const -> bool { | ||
| auto const normalized_value{normalize_string(value, m_normalization)}; | ||
| return m_bloom_filter.possibly_contains(normalized_value); |
There was a problem hiding this comment.
🧹 Nitpick | 🔵 Trivial
Fast-path the None normalization case.
This currently materializes a std::string on every lookup, even when no normalization is configured. The exact-match path can stay allocation-free by hashing the original std::string_view directly.
♻️ Suggested change
auto FilterReader::possibly_contains(std::string_view value) const -> bool {
+ if (FilterNormalization::None == m_normalization) {
+ return m_bloom_filter.possibly_contains(value);
+ }
auto const normalized_value{normalize_string(value, m_normalization)};
return m_bloom_filter.possibly_contains(normalized_value);
}🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@components/core/src/clp_s/filter/FilterReader.cpp` around lines 83 - 85, The
possibly_contains implementation currently always calls normalize_string causing
an allocation; add a fast path in FilterReader::possibly_contains that checks
m_normalization (the "None" case) and directly calls
m_bloom_filter.possibly_contains using the original std::string_view (or its
hash/equivalent used by the bloom filter) to avoid materializing a std::string,
otherwise fall back to normalize_string(normalized_value, m_normalization) and
pass that to m_bloom_filter.possibly_contains.
There was a problem hiding this comment.
Yeah I think the proper way to address this is to add a normalizer class that internally maintains a buffer and has a method normalize(std::string_view) -> std::string_view that will return an std::string_view of its internal buffer when it has to do normalization and return the input std::string_view otherwise.
Note that even when performing a given normalization you can sometimes scan and then directly return the input string without copying it into the internal buffer (e.g. when the input string is already lowercase).
That said, since we don't expect to have to check too many values for a given bloom filter + we expect that normalization will be required in the common case, optimizing to occasionally avoid a copy here probably isn't worthwhile at the moment.
There was a problem hiding this comment.
@gibber9809, that makes sense — the normalizer class approach is a cleaner abstraction, and deferring given the current usage patterns is reasonable. I'll drop this suggestion.
✏️ Learnings added
Learnt from: gibber9809
Repo: y-scope/clp PR: 2101
File: components/core/src/clp_s/filter/FilterReader.cpp:83-85
Timestamp: 2026-03-20T18:33:35.288Z
Learning: In `y-scope/clp`, the preferred approach for allocation-free normalization in `clp_s::filter` is a dedicated normalizer class that maintains an internal buffer and exposes a `normalize(std::string_view) -> std::string_view` method. It should return a `std::string_view` into its internal buffer when transformation is needed, or return the original `std::string_view` directly when the input is already in the normalized form (e.g., already lowercase). This optimization is considered not worthwhile in the short term for `FilterReader::possibly_contains` since bloom filter lookups are infrequent and normalization is expected to be the common case.
Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.
🧠 Learnings used
Learnt from: haiqi96
Repo: y-scope/clp PR: 523
File: components/core/src/clp/BufferedFileReader.cpp:96-106
Timestamp: 2024-10-24T14:45:26.265Z
Learning: In `components/core/src/clp/BufferedFileReader.cpp`, refactoring the nested error handling conditions may not apply due to the specific logic in the original code.
Learnt from: davemarco
Repo: y-scope/clp PR: 700
File: components/core/src/clp/streaming_archive/ArchiveMetadata.hpp:153-155
Timestamp: 2025-01-30T19:26:33.869Z
Learning: When working with constexpr strings (string literals with static storage duration), std::string_view is the preferred choice for member variables as it's more efficient and safe, avoiding unnecessary memory allocations.
Learnt from: haiqi96
Repo: y-scope/clp PR: 523
File: components/core/src/clp/BufferedFileReader.hpp:158-158
Timestamp: 2024-10-24T14:46:00.664Z
Learning: In `components/core/src/clp/BufferedFileReader.hpp`, the `BufferedFileReader::try_read` function explicitly calls `BufferReader::try_read`, so documentation should reference `BufferReader::try_read`.
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 2093
File: components/core/src/clp_s/ffi/sfa/ClpArchiveReader.cpp:142-151
Timestamp: 2026-03-13T01:43:01.336Z
Learning: In y-scope/clp, `clp_s::ffi::sfa::ClpArchiveReader` is scoped exclusively to single-file archives (SFA). In an SFA, a single source file is guaranteed never to be split across multiple range-index entries, so `m_file_infos` and `m_file_names` are always 1:1 with source files. Do not flag concerns about coalescing split-file range fragments in this class.
Learnt from: SharafMohamed
Repo: y-scope/clp PR: 1313
File: components/core/tests/test-SchemaSearcher.cpp:136-137
Timestamp: 2025-12-09T01:57:04.839Z
Learning: In C++ files under the components directory, keep forward declarations even when a definition follows in the same scope (such as in anonymous namespaces) to maintain consistency across multiple methods. This style reduces churn and aligns with the existing convention in this repository for both source and tests.
Learnt from: ShangDanLuXian
Repo: y-scope/clp PR: 2010
File: components/core/src/clp_s/filter/FilterBuilder.cpp:24-33
Timestamp: 2026-02-20T15:19:43.528Z
Learning: In the components/core/src/clp_s module, FileWriter usage across multipleCpp files (e.g., JsonConstructor.cpp, SchemaMap.cpp, FilterBuilder.cpp) intentionally omits try-catch around write operations, and treats potential write failures as acceptable data loss. Do not add per-write try-catch in these files; instead, rely on the caller/integration layer to handle exceptions. Apply this pattern to similar FileWriter usage within this module, ensuring reviewers understand that error handling is deferred and that this behavior is consistent across related files.
gibber9809
left a comment
There was a problem hiding this comment.
Apologies for taking a long time to get back to this. Just a few details I think are worth changing, but otherwise looks like this PR is looking good.
| * @param value | ||
| * @return true if the filter may contain `value`; false if it definitely does not. | ||
| */ | ||
| [[nodiscard]] auto possibly_contains_wildcard(std::string_view value) const -> bool; |
There was a problem hiding this comment.
Could we rename to possibly_contains_query_string, or another more specific name? possibly_contains_wildcard reads to me like we're checking if the filter contains string values with wildcards.
| } | ||
|
|
||
| auto FilterBuilder::write_to_file(clp::WriterInterface& writer) const -> void { | ||
| writer.write(cFilterFileMagic.data(), cFilterFileMagic.size()); |
There was a problem hiding this comment.
Sorry, revisiting this one more time since the new plan doesn't involve staging anymore -- do you think we still need the file magic here?
I'm also happy to leave this for now and decide whether we should change it when we get to writing the indexing flow to avoid holding up this PR any longer.
Co-authored-by: Devin Gibson <gibber9809@users.noreply.github.com>
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@components/core/src/clp_s/filter/FilterReader.cpp`:
- Around line 88-99: The Bloom case in FilterReader::possibly_contains_wildcard
uses clp_s::search::ast::has_unescaped_wildcards and
clp::string_utils::unescape_string but their headers are not included and the
case body lacks braces; add explicit includes for the headers that define
has_unescaped_wildcards and unescape_string at the top of the file (do not rely
on transitive includes like FilterOptions.hpp) and wrap the FilterType::Bloom
case body in { ... } so the auto const unescaped_value declaration is scoped
properly within possibly_contains_wildcard.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: cabfe753-a8df-402d-9619-9a7d6abf4712
📒 Files selected for processing (2)
components/core/src/clp_s/filter/FilterReader.cppcomponents/core/src/clp_s/filter/FilterReader.hpp
| auto FilterReader::possibly_contains_wildcard(std::string_view value) const -> bool { | ||
| switch (m_type) { | ||
| case FilterType::Bloom: | ||
| if (clp_s::search::ast::has_unescaped_wildcards(value)) { | ||
| return true; | ||
| } | ||
| auto const unescaped_value{clp::string_utils::unescape_string(value)}; | ||
| return possibly_contains(unescaped_value); | ||
| } | ||
|
|
||
| return true; | ||
| } |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
fd -t f 'SearchUtils.hpp' components/core/src/clp_s/search/ast
fd -t f 'string_utils.hpp' components/core/src/clp/string_utils
rg -nP '\bunescape_string\s*\(' -C2 --type=cpp --type=hpp
rg -nP '\bhas_unescaped_wildcards\s*\(' -C2 --type=cpp --type=hppRepository: y-scope/clp
Length of output: 234
🏁 Script executed:
cat -n components/core/src/clp_s/filter/FilterReader.cpp | head -40Repository: y-scope/clp
Length of output: 1377
🏁 Script executed:
rg -n 'unescape_string' components/core/src/clp/string_utils/string_utils.hpp -A 3Repository: y-scope/clp
Length of output: 185
🏁 Script executed:
rg -n 'has_unescaped_wildcards' components/core/src/clp_s/search/ast/SearchUtils.hpp -A 3Repository: y-scope/clp
Length of output: 216
🏁 Script executed:
cat -n components/core/src/clp_s/filter/FilterOptions.hpp | head -30Repository: y-scope/clp
Length of output: 843
🏁 Script executed:
sed -n '88,99p' components/core/src/clp_s/filter/FilterReader.cppRepository: y-scope/clp
Length of output: 482
🏁 Script executed:
cat components/core/src/clp_s/filter/FilterOptions.hpp | head -50Repository: y-scope/clp
Length of output: 1269
🏁 Script executed:
rg -n 'include.*SearchUtils|include.*string_utils' components/core/src/clp_s/filter/FilterOptions.hppRepository: y-scope/clp
Length of output: 37
Add missing includes and wrap the case body in braces.
Two critical issues in possibly_contains_wildcard:
-
This translation unit uses
clp_s::search::ast::has_unescaped_wildcards(line 91) andclp::string_utils::unescape_string(line 94), but neither header is included. The code currently compiles only via transitive includes throughFilterOptions.hpp. Add explicit includes for both to prevent build breaks if transitive paths change. -
The variable
auto const unescaped_value{...}is declared directly under thecaselabel without braces. While it compiles today becauseBloomis the only case, adding anotherFilterTypewill trigger a cross-initialization error (jump to case label crosses initialization). Wrap the case body in{}.
Proposed fix
`#include` <clp/ReaderInterface.hpp>
+#include <clp/string_utils/string_utils.hpp>
+
+#include <clp_s/search/ast/SearchUtils.hpp>
`#include` "BloomFilter.hpp"
`#include` "ErrorCode.hpp"
`#include` "FilterOptions.hpp" auto FilterReader::possibly_contains_wildcard(std::string_view value) const -> bool {
switch (m_type) {
- case FilterType::Bloom:
+ case FilterType::Bloom: {
if (clp_s::search::ast::has_unescaped_wildcards(value)) {
return true;
}
auto const unescaped_value{clp::string_utils::unescape_string(value)};
return possibly_contains(unescaped_value);
+ }
}
return true;
}🧰 Tools
🪛 Cppcheck (2.20.0)
[style] 88-88: The function 'possibly_contains_wildcard' is never used.
(unusedFunction)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@components/core/src/clp_s/filter/FilterReader.cpp` around lines 88 - 99, The
Bloom case in FilterReader::possibly_contains_wildcard uses
clp_s::search::ast::has_unescaped_wildcards and
clp::string_utils::unescape_string but their headers are not included and the
case body lacks braces; add explicit includes for the headers that define
has_unescaped_wildcards and unescape_string at the top of the file (do not rely
on transitive includes like FilterOptions.hpp) and wrap the FilterType::Bloom
case body in { ... } so the auto const unescaped_value declaration is scoped
properly within possibly_contains_wildcard.
gibber9809
left a comment
There was a problem hiding this comment.
Changes look good, but will hold off on marking approved until we finish cutting a release from main.
For the PR title, how about:
feat(clp-s::filter): Add abstraction layer for filter building, serialization, and deserialization.
gibber9809
left a comment
There was a problem hiding this comment.
Marking approved now that the release is done.
Description
This PR adds a filter serialization layer for
clp_s::filter.It introduces:
FilterBuilderfor constructing a filter from input values and serializing itFilterReaderfor reading a serialized filter and performing membership checksFilterOptions, including:The current implementation is still backed by
BloomFilter. Making this layer fully generic wouldlikely require a parent abstraction representing filter implementations, but since we currently only
have one filter type, this PR intentionally skips that refactor to keep the scope small.
This PR also:
file metadata
clp::string_utilsdependency for shared normalization logicChecklist
breaking change.
Validation performed
Since this change mainly adds file I/O for the filter serialization layer, I am leaving the main
validation to integration tests.
Summary by CodeRabbit
New Features
Bug Fixes
Chores