feat(schema): scan context support set table schema #394
  return RefreshReadRangesAfterCleanUp();
}

Status PrefetchFileBatchReaderImpl::RefreshReadRangesAfterCleanUp() {
Why split out the RefreshReadRangesAfterCleanUp() function?
I split out RefreshReadRangesAfterCleanUp() because SetReadSchema() needs a different ordering now:
CleanUp();
reader->SetReadSchema(...);
RefreshReadRangesAfterCleanUp();
The original RefreshReadRanges() always calls CleanUp() internally. If SetReadSchema() called it directly after stopping the prefetch thread, we would either reset schemas before cleanup, which caused the ORC crash, or call CleanUp() twice.
So the split keeps the behavior clear:
RefreshReadRanges()
  -> CleanUp()
  -> RefreshReadRangesAfterCleanUp()

SetReadSchema()
  -> CleanUp()
  -> reset reader schemas
  -> RefreshReadRangesAfterCleanUp()
This lets us reuse the range-refresh logic while guaranteeing that schema reset happens only after the background prefetch thread has fully stopped.
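To make the ordering above concrete, here is a minimal sketch. It is not the real PrefetchFileBatchReaderImpl: PrefetchReaderSketch, its members, and the recorded trace are hypothetical stand-ins that only log call order, and the "prefetch thread" is modeled as a boolean flag rather than a real thread.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for the reader discussed above. It records the order
// of steps instead of managing real threads, readers, or read ranges.
class PrefetchReaderSketch {
 public:
  // Existing entry point: clean up first, then rebuild the ranges.
  void RefreshReadRanges() {
    CleanUp();
    RefreshReadRangesAfterCleanUp();
  }

  // New entry point: the schema reset sits between the two steps, so
  // CleanUp() runs exactly once and always before the reset.
  void SetReadSchema(const std::string& schema) {
    CleanUp();
    // Invariant from the discussion: no background prefetch while resetting.
    assert(!prefetch_running_);
    schema_ = schema;
    trace_.push_back("ResetSchema");
    RefreshReadRangesAfterCleanUp();
  }

  const std::vector<std::string>& trace() const { return trace_; }

 private:
  // Models stopping the background prefetch work.
  void CleanUp() {
    prefetch_running_ = false;
    trace_.push_back("CleanUp");
  }

  // Range-refresh logic shared by both entry points; assumes CleanUp() ran.
  void RefreshReadRangesAfterCleanUp() { trace_.push_back("RefreshRanges"); }

  bool prefetch_running_ = true;  // assumed initial state for the sketch
  std::string schema_;
  std::vector<std::string> trace_;
};
```

In this model, SetReadSchema() records one CleanUp step before the schema reset, and RefreshReadRanges() keeps its original two-step behavior.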
Purpose
Add ScanContextBuilder::SetTableSchema so scan operations can reuse a pre-loaded table schema instead of always loading the latest schema from table metadata.
For main branch data table scans, TableScan now parses the provided schema JSON via TableSchema::CreateFromJson(...) and skips SchemaManager::Latest(). If no schema is provided, or the scan targets a non-main branch, the existing latest-schema loading behavior is preserved.
Tests
cmake --build build --target paimon-core-test
ctest -R paimon-core-test --output-on-failure
API and Format
This change adds a public API in include/paimon/scan_context.h:
ScanContextBuilder& SetTableSchema(const std::string& table_schema)
No storage format or protocol changes.
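The selection rule from the Purpose section can be sketched as follows. This is an illustration only: ScanRequestSketch, SchemaSource, and ChooseSchemaSource are hypothetical names, not part of the library. The branch name "main" and the use of std::optional to model "no schema provided" are assumptions, and the sketch ignores the data-table versus non-data-table distinction.

```cpp
#include <optional>
#include <string>

// Hypothetical description of a scan request; not the real ScanContext.
struct ScanRequestSketch {
  std::string branch = "main";                   // assumed main-branch name
  std::optional<std::string> table_schema_json;  // empty when no schema was set
};

// Where the scan would take its schema from.
enum class SchemaSource { kProvidedJson, kLatestFromMetadata };

// Use the caller-provided schema only for main-branch scans that actually
// carry one; every other case keeps the existing latest-schema loading path.
SchemaSource ChooseSchemaSource(const ScanRequestSketch& request) {
  const bool is_main_branch = request.branch == "main";
  if (is_main_branch && request.table_schema_json.has_value()) {
    return SchemaSource::kProvidedJson;
  }
  return SchemaSource::kLatestFromMetadata;
}
```

The point of the sketch is that the provided schema is an opt-in fast path: callers that never set a schema see no change in behavior.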
Documentation
This introduces a new scan-side optimization API. Inline API documentation is added in scan_context.h.
Generative AI tooling
Generated-by: OpenAI Codex GPT-5