Skip to content

[python] Support query auth (row filter & column masking) for REST catalog#8136

Open
MgjLLL wants to merge 1 commit into
apache:masterfrom
MgjLLL:python-query-auth
Open

[python] Support query auth (row filter & column masking) for REST catalog#8136
MgjLLL wants to merge 1 commit into
apache:masterfrom
MgjLLL:python-query-auth

Conversation

@MgjLLL

@MgjLLL MgjLLL commented Jun 5, 2026

Copy link
Copy Markdown
Contributor

Purpose

Adds query-auth support to the Python client so it honors the row-level filter and column masking rules returned by a REST catalog, matching the existing JVM client behavior.

When the new option query-auth.enabled is set to true, before producing a Plan the client calls POST /v1/.../databases/{db}/tables/{tb}/auth with the projected fields, receives { filter, columnMasking }, and applies them on the read path:

  • RESTApi.auth_table_query issues the call (new request/response models AuthTableQueryRequest / AuthTableQueryResponse, new path in ResourcePaths.auth_table).
  • TableQueryAuth / TableQueryAuthResult (catalog/table_query_auth.py) wrap the result and convert each split to a QueryAuthSplit.
  • predicate_json_parser (common/predicate_json_parser.py) parses Paimon predicate JSON into a PyArrow compute filter (EQ/NEQ/LT/LTEQ/GT/GTEQ/IS_NULL/IS_NOT_NULL/IN/NOT_IN/STARTS_WITH/ENDS_WITH/CONTAINS/AND/OR/NOT).
  • AuthFilterReader / AuthMaskingReader / ColumnProjectReader (read/reader/auth_masking_reader.py) implement row filtering, column masking transforms (NULL, FIELD_REF, CAST, UPPER, LOWER, CONCAT, CONCAT_WS) and final projection back to the user's requested columns.
  • read_builder / stream_read_builder / table_read / table_scan / file_store_table / catalog_environment / rest_catalog are wired to invoke the auth call and pull extra fields required only by the auth filter.

Behavior is gated by the new CoreOptions.QUERY_AUTH_ENABLED (query-auth.enabled, default false), so existing users see no change.

Tests

Three new test files (994+ lines, all passing locally under pytest):

  • paimon-python/pypaimon/tests/predicate_json_parser_test.py — covers each predicate kind, nested AND/OR/NOT, type coercion, null handling, and extract_referenced_fields.
  • paimon-python/pypaimon/tests/auth_masking_reader_test.py — covers each masking transform, missing-field validation, and projection back to the user-requested columns.
  • paimon-python/pypaimon/tests/table_query_auth_test.py — end-to-end coverage: REST catalog calls auth_table_query, the result is plumbed into the plan, splits become QueryAuthSplit, and reads return filtered + masked rows.

Local check:

cd paimon-python
python -m pytest pypaimon/tests/predicate_json_parser_test.py \
                  pypaimon/tests/auth_masking_reader_test.py \
                  pypaimon/tests/table_query_auth_test.py -q
flake8 --config dev/cfg.ini pypaimon/  # 已在改动范围内通过

API and Format

  • New catalog option: query-auth.enabled (boolean, default false).
  • New REST endpoint consumed by the client: POST /v1/{prefix}/databases/{db}/tables/{tb}/auth. Request { "select": [...] }, response { "filter": [<predicate-json>...], "columnMasking": { <col>: <transform-json>, ... } }. The contract follows the existing Java client; no server-side change is required for catalogs that already implement query auth.
  • No change to existing user-facing Python APIs. New types (AuthTableQueryRequest, AuthTableQueryResponse, TableQueryAuth, TableQueryAuthResult, QueryAuthSplit, AuthFilterReader, AuthMaskingReader, ColumnProjectReader) are additive and live under existing modules.
  • File format / on-disk layout: unchanged.

Documentation

The new option query-auth.enabled should be reflected in the Python configuration reference. Happy to add the docs entry in this PR or in a follow-up — please advise.

This closes #8135

@JingsongLi

Copy link
Copy Markdown
Contributor

I found a few correctness issues in the query-auth paths introduced here:

  1. paimon-python/pypaimon/read/datasource/split_provider.py:127 constructs ReadBuilder(self._ensure_table()) directly. That bypasses FileStoreTable.new_read_builder(), which is where the REST query auth is injected. As a result, pypaimon.ray.read_paimon(...) can read REST tables without applying server-side row filters or column masking. I think this should either call self._ensure_table().new_read_builder() or explicitly pass the table's query auth into the builder, with a Ray regression test for row filtering/masking.

  2. paimon-python/pypaimon/read/stream_read_builder.py:117 stores _query_auth, but new_streaming_scan() does not pass it into AsyncStreamingTableScan. The plans returned from streaming_table_scan.py:322 and streaming_table_scan.py:386 also do not go through auth_result.convert_plan(). So table.new_stream_read_builder() skips row filters and column masking for both the initial scan and later delta/changelog scans. The streaming scan should preserve and apply query auth before returning each plan.

  3. The auth reader wrappers currently assume the inner reader supports read_arrow_batch() (paimon-python/pypaimon/read/reader/auth_masking_reader.py:38 and :66). For primary-key tables with non-raw-convertible splits, TableRead can create a MergeFileSplitRead, whose create_reader() returns the normal row RecordReader path rather than a RecordBatchReader. Wrapping that in AuthFilterReader/AuthMaskingReader will fail with AttributeError when query auth is enabled. This needs either a row-reader auth path, conversion to a batch-capable reader before wrapping, or routing/rejecting these splits explicitly.

@MgjLLL

MgjLLL commented Jun 8, 2026

Copy link
Copy Markdown
Contributor Author

Fixes for issues raised by @JingsongLi, plus one additional issue found during analysis.
8 files changed, +188 -13 lines.

Fix 1: Ray read path bypasses auth (split_provider.py)

_ensure_planned() directly constructed ReadBuilder(table), bypassing the query_auth injection from FileStoreTable.new_read_builder().

Fix: Changed to self._ensure_table().new_read_builder(). Removed the direct ReadBuilder import.

Fix 2: Streaming read path skips auth entirely (streaming_table_scan.py, stream_read_builder.py)

StreamReadBuilder stored _query_auth but never passed it to AsyncStreamingTableScan. All three plan creation methods (initial/follow-up/catch-up) produced plans without auth wrapping.

Fix:

  • Added query_auth parameter to AsyncStreamingTableScan.__init__()
  • Added _apply_auth(plan) method that calls query_auth(select)convert_plan(plan)
  • Applied auth to _create_initial_plan, _create_follow_up_plan, _create_catch_up_plan
  • Extracted _create_initial_plan_raw() to avoid double auth in catch-up path
  • StreamReadBuilder.new_streaming_scan() now passes query_auth

Fix 3: Primary-key table merge reader incompatible with auth wrappers (auth_masking_reader.py, table_read.py)

MergeFileSplitRead.create_reader() returns a RecordReader (row-level), but AuthFilterReader/AuthMaskingReader require RecordBatchReader (batch-level with read_arrow_batch()). Java has a unified RecordReader<InternalRow> so this
mismatch doesn't exist in JVM.

Fix:

  • Added RecordReaderToBatchAdapter(RecordBatchReader) that collects OffsetRow tuples from read_batch()/next() and converts them to pa.RecordBatch
  • _authed_reader() now detects non-RecordBatchReader and wraps with the adapter before applying auth wrappers

Fix 4: Parallel read path bypasses auth + missing raw_convertible proxy (table_read.py, query_auth_split.py)

_read_one_split_to_batches() called _create_split_read(split).create_reader() directly, bypassing QueryAuthSplit detection. Additionally, QueryAuthSplit lacked a raw_convertible property proxy, causing AttributeError when accessed.

Fix:

  • Changed to self._create_reader_for_split(split) which correctly handles QueryAuthSplit
  • Added raw_convertible property proxy to QueryAuthSplit

New Tests (for fixes)

  • TestQueryAuthSplitRawConvertible (2 tests) — verifies raw_convertible proxy for true/false
  • TestRecordReaderToBatchAdapter (5 tests) — basic conversion, multi-batch, empty reader, close delegation, integration with AuthFilterReader

All 90 auth-related tests pass. Full test suite: 318/324 pass (6 pre-existing failures unrelated to this PR).

@MgjLLL

MgjLLL commented Jun 8, 2026

Copy link
Copy Markdown
Contributor Author

I found a few correctness issues in the query-auth paths introduced here:

  1. paimon-python/pypaimon/read/datasource/split_provider.py:127 constructs ReadBuilder(self._ensure_table()) directly. That bypasses FileStoreTable.new_read_builder(), which is where the REST query auth is injected. As a result, pypaimon.ray.read_paimon(...) can read REST tables without applying server-side row filters or column masking. I think this should either call self._ensure_table().new_read_builder() or explicitly pass the table's query auth into the builder, with a Ray regression test for row filtering/masking.
  2. paimon-python/pypaimon/read/stream_read_builder.py:117 stores _query_auth, but new_streaming_scan() does not pass it into AsyncStreamingTableScan. The plans returned from streaming_table_scan.py:322 and streaming_table_scan.py:386 also do not go through auth_result.convert_plan(). So table.new_stream_read_builder() skips row filters and column masking for both the initial scan and later delta/changelog scans. The streaming scan should preserve and apply query auth before returning each plan.
  3. The auth reader wrappers currently assume the inner reader supports read_arrow_batch() (paimon-python/pypaimon/read/reader/auth_masking_reader.py:38 and :66). For primary-key tables with non-raw-convertible splits, TableRead can create a MergeFileSplitRead, whose create_reader() returns the normal row RecordReader path rather than a RecordBatchReader. Wrapping that in AuthFilterReader/AuthMaskingReader will fail with AttributeError when query auth is enabled. This needs either a row-reader auth path, conversion to a batch-capable reader before wrapping, or routing/rejecting these splits explicitly.

@JingsongLi All 3 issues fixed (+ 1 additional parallel path bypass found during analysis). See updated PR description. PTAL.

if not self.filter and not self.column_masking:
return plan
auth_splits = [QueryAuthSplit(split, self) for split in plan.splits()]
return Plan(auth_splits)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Java TableScan.Plan does not carry a snapshot id, but Python Plan does and the update / row-id update paths use it as check_from_snapshot. Wrapping the plan here drops plan.snapshot_id, so a query-auth table planned from a non-empty snapshot becomes snapshot_id=None; table_update then emits commit messages with -1, which disables the row-id conflict checks (and related global-index update checks). Please preserve the original plan metadata, e.g. Plan(auth_splits, snapshot_id=plan.snapshot_id).


return reader

def _create_split_read_with_read_type(self, split, read_type):

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This auth-specific construction bypasses the normal PK read path above. In _create_split_read, PK tables inject missing sequence.field columns into the inner read type and then project them back out, matching the Java withReadType + outer projection behavior. Here, if query auth is enabled and the user projects id,val from a PK table with sequence.field=ts, MergeFileSplitRead is built without ts; that can either fail with sequence.field ... not found or merge by file sequence instead of the configured user sequence. Please reuse the existing _create_split_read widening/project-back logic for effective_read_type, or factor it so the auth path cannot drift from the normal PK path.

elif function == "LIKE":
raw = literals[0]
escaped = re.escape(raw)
pattern = escaped.replace("%", ".*").replace("_", ".")

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does not match the JVM LIKE semantics. Java treats backslash as the default escape character before expanding % / _, so a policy predicate like LIKE admin\\_% matches admin_foo and not adminXfoo. Escaping the whole string first and then replacing every % / _ makes escaped wildcards behave as wildcards (or requires a literal backslash), so Python can allow/deny different rows from the Java client for the same auth filter. Please port the Java Like.sqlToRegexLike behavior, including invalid escape handling.

@MgjLLL

MgjLLL commented Jun 11, 2026

Copy link
Copy Markdown
Contributor Author

@JingsongLi All 3 issues fixed (+ several related correctness issues found during follow-up self-review). New commit 7d037c6a8 squashes the round-2 fixes into one increment on top of 482fdadd7. PTAL.

Fixes for the three review comments

1. catalog/table_query_auth.py — wrapping drops Plan.snapshot_id

convert_plan now returns Plan(auth_splits, snapshot_id=plan.snapshot_id).

Without this, query-auth tables planned from a non-empty snapshot lost row-id conflict / global-index update checks (check_from_snapshot=-1). New test TestConvertPlanPreservesSnapshotId covers both non-null and None cases.

2. read/table_read.py_create_split_read_with_read_type bypasses PK widening

Removed the parallel method. _create_split_read now accepts an optional read_type, so the auth path goes through the same sequence.field injection + outer projection used by the normal PK read path.

id,val projection over a PK table with sequence.field=ts no longer fails or merges by file sequence.

3. common/predicate_json_parser.pyLIKE escape mismatches JVM

Ported Like.sqlToRegexLike: backslash is the default escape character, \% / \_ are literal, invalid escape sequences raise.

LIKE admin\_% now matches admin_foo (not adminXfoo), so policy predicates evaluate identically to the JVM client.

Related correctness fixes (self-found, all on the same auth surface)

  • Streaming path previously stored _query_auth but never applied it. StreamReadBuilder now forwards query_auth and read_type to AsyncStreamingTableScan, which calls _apply_auth on initial / follow-up / catch-up plans. Catch-up uses _create_initial_plan_raw to avoid double auth.
  • Iterator / parallel / Ray paths consolidated through _create_reader_for_split. to_iterator gains a RecordBatchReader branch for PyTorch / generic iterator consumers. scan_with_stats applies auth for parity with plan().
  • Row kind preserved on PK + auth: RecordReaderToBatchAdapter encodes row_kind into a _row_kind column when include_row_kind=True; ColumnProjectReader keeps _row_kind even when user projection drops it; to_iterator restores it via OffsetRow.set_row_kind_byte without leaking the column into row_tuple.
  • Adapter chunking: when an inner read_batch yields more rows than chunk_size, the surplus rows are carried over to the next flush instead of being dropped.
  • QueryAuthSplit attribute delegation: __getattr__ forwards file_size, file_paths, data_deletion_files, raw_convertible, etc. to the inner split (Ray / explain / Daft paths), while guarding _-prefixed names to avoid pickle recursion.
  • Daft explain_scan consistency: ExplainSplitInfo now carries has_auth, populated from QueryAuthSplit in _build_explain_result. Daft's _split_has_auth accepts both real splits and the explain descriptor, so reported reader_mode and fallback_reason="query auth active" match the actual read path.
  • Pickle safety when auth is disabled: table_query_auth returns None instead of a local lambda, keeping FileStoreTable Ray-pickle-safe.

Tests

  • New tests for snapshot_id preservation, row_kind through adapter / project reader, chunk-size carry-over, and QueryAuthSplit attribute delegation incl. pickle round-trip.
  • 45 query-auth related tests pass:
pytest pypaimon/tests/ -k "explain or query_auth or table_query_auth"

@MgjLLL MgjLLL force-pushed the python-query-auth branch 2 times, most recently from bf32ff2 to 15ee1c1 Compare June 15, 2026 06:13

@JingsongLi JingsongLi left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found a few correctness issues in the Python query-auth implementation.

elif function == "IN":
return pc.is_in(value_array, pa.array(converted, type=value_array.type))
elif function == "NOT_IN":
return pc.invert(pc.is_in(value_array, pa.array(converted, type=value_array.type)))

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Java Paimon's NotIn returns false when the input field is null, but pc.is_in(null, ...) is false and inverting it makes null rows pass. For a row-filter auth rule such as dept NOT_IN ('blocked'), Python would expose rows where dept is null. Please combine this with pc.is_valid(value_array) and also preserve Java's null-literal behavior, where any null literal makes NOT_IN false.

self._parsed_rules = {col: json.loads(tj) for col, tj in masking_rules.items()}
read_field_names = {f.name for f in read_fields}
for col_name, transform in self._parsed_rules.items():
for ref_name in _collect_all_field_refs_from_transform(transform):

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This validates references for every masking rule before checking whether the masked target column is actually projected. If REST returns a rule like secret = FIELD_REF(email) and the user reads only id, the Python reader raises because email is absent even though secret is not returned. Java skips masking rules whose target column is absent from the output row type before remapping inputs, so this should filter to projected target columns first.

return pa.array([True] * batch_len, type=pa.bool_())
elif function == "FALSE":
return pa.array([False] * batch_len, type=pa.bool_())
raise ValueError(f"Unknown leaf function: {function}")

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IS_NAN is a valid Java Paimon predicate (PredicateBuilder.isNaN / IsNaN.NAME) and can be serialized in REST auth filters. With the current switch it fails every Python read with Unknown leaf function: IS_NAN; please add a branch using pc.is_nan for float/double arrays.

@MgjLLL MgjLLL requested a review from JingsongLi June 22, 2026 02:12

@JingsongLi JingsongLi left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left a few inline comments on query-auth correctness and serialization edge cases.

elif function == "IS_NOT_NULL":
return pc.is_valid(value_array)
elif function == "IN":
return pc.is_in(value_array, pa.array(converted, type=value_array.type))

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Java's In.test returns false when the field value is null and skips null literals. pc.is_in treats NULL IN (..., NULL) as true, so an auth row filter like dept IN ('eng', NULL) would leak rows where dept is null. Please drop null literals and wrap the result with pc.if_else(pc.is_valid(value_array), ..., False); if no non-null literals remain, the result should be all false.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. Now drop null literals from the value set and wrap with pc.if_else(pc.is_valid(value_array), in_mask, False). If all literals are null, returns all-false. Also added null literal guards for EQUAL/NOT_EQUAL/LT/LE/GT/GE to match Java's LeafBinaryFunction semantics.

if not options.query_auth_enabled or self.catalog_loader is None:
return None

def auth(select):

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This nested auth function is stored on FileStoreTable as _query_auth_fn, which makes auth-enabled tables/read tasks unpickleable (AttributeError: Can't get local object ...). Ray/Daft and the existing serialization paths pickle tables or read tasks, so this breaks those paths once query-auth.enabled is true. Please use a top-level callable/dataclass or reconstruct the catalog auth function lazily instead of storing a local closure.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. Replaced the nested closure with a top-level _TableQueryAuthFn class that is pickleable. Verified with pickle roundtrip test.

return pa.nulls(len(original_batch), type=pa.string())
sep = self._resolve_input(inputs[0], original_batch)
values = [self._resolve_input(inp, original_batch) for inp in inputs[1:]]
return pc.binary_join_element_wise(*values, sep, null_handling='skip')

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

binary_join_element_wise(..., null_handling='skip') does not match Java BinaryString.concatWs when the separator is non-null and all value inputs for a row are null. Java returns an empty string for that row, but Arrow returns a shorter array, so set_column can fail with ArrowInvalid: Added column's length must match record batch's length. Please normalize this case so the result always has len(original_batch) rows; the predicate-side CONCAT_WS path has the same issue.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. PyArrow's binary_join_element_wise with null_handling='skip' returns an empty array (length 0) when all values are null — confirmed by testing. Now pad the result to batch length and use pc.if_else(sep_null, None, pc.if_else(pc.is_valid(result), result, "")) to match Java's BinaryString.concatWs behavior: all-null values → empty string, null separator → null. Same fix applied in the predicate transform path.

from pypaimon.schema.data_types import DataField


class TableNoPermissionException(Exception):

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This defines a second TableNoPermissionException that is unrelated to pypaimon.catalog.catalog_exception.TableNoPermissionException. RESTCatalog.auth_table_query raises this new type on 403, so callers catching the catalog exception will miss query-auth denials. Please reuse the existing catalog exception or make this subclass it.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. Removed the duplicate class from table_query_auth.py. Updated catalog_exception.TableNoPermissionException to accept an optional cause parameter and handle both Identifier objects and plain strings. All callers now use the unified exception.

@MgjLLL MgjLLL force-pushed the python-query-auth branch from 034f3b3 to 040a6bf Compare June 23, 2026 07:50
return pa.nulls(len(batch), type=pa.string())
result = pc.binary_join_element_wise(*values, sep, null_handling='skip')
if len(result) < len(batch):
padded = result.to_pylist() + [None] * (len(batch) - len(result))

@JingsongLi JingsongLi Jun 23, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Padding the compacted Arrow result at the end misaligns rows. pc.binary_join_element_wise(..., null_handling="skip") drops rows whose value inputs are all null, so for a=[None, "x", None], b=[None, "y", "z"], sep="-" it returns ["x-y", "z"]; this code pads that to ["x-y", "z", ""], while Java BinaryString.concatWs should evaluate the rows as ["", "x-y", "z"]. In a row-filter predicate that can allow or deny the wrong rows. Please make the CONCAT_WS transform row-preserving, and add a mixed all-null/non-null regression test.

values = [self._resolve_input(inp, original_batch) for inp in inputs[1:]]
result = pc.binary_join_element_wise(*values, sep, null_handling='skip')
if len(result) < len(original_batch):
padded = result.to_pylist() + [None] * (len(original_batch) - len(result))

@JingsongLi JingsongLi Jun 23, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has the same row-alignment problem as the predicate CONCAT_WS path. Arrow returns a compacted array when some rows have all-null value inputs, so appending padding at the end shifts the later rows masked values. For example a=[None, "x", None], b=[None, "y", "z"] with sep="-" becomes ["x-y", "z", ""] here, but Java BinaryString.concatWs would produce ["", "x-y", "z"]. That masks the wrong rows; please preserve original row positions and cover the mixed all-null/non-null case in tests.

for col_idx, masked_array in masked_columns.items():
original_field = original_batch.schema.field(col_idx)
if masked_array.type != original_field.type:
masked_array = pc.cast(masked_array, original_field.type)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Column masking should not force the transformed value back to the original column type. Java TableQueryAuthResult.transform writes transform.transform(row) directly; RESTCatalogTest.testColumnMaskingApplyOnRead masks an INT column with CastTransform(..., STRING) and then reads "100" with getString(2). Here the same {"name":"CAST","type":"STRING"} mask on an int column is cast back to int32, so the mask is effectively undone or can fail for non-castable masking expressions. Please replace the column using the masked array type instead, and update the schema-preservation test accordingly.

@MgjLLL MgjLLL force-pushed the python-query-auth branch from 040a6bf to aad7740 Compare June 23, 2026 13:26
def extract_row_filter(self) -> Optional[Callable[[pa.RecordBatch], pa.Array]]:
if not self.filter:
return None
filters = [parse_predicate_to_batch_filter(json_str) for json_str in self.filter]

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Java TableQueryAuthResult.extractPredicate() skips empty filter strings before deserializing them, but this path tries to parse every entry. If the REST server returns filter=[""] (or a mix of empty and valid filters), Python will fail the read with JSONDecodeError in both extract_row_filter() and get_extra_fields_for_filter(), while the JVM client treats the empty entry as no-op. Please filter out empty/blank JSON strings before wrapping the plan or before parsing, and add a regression test for empty filter entries.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. Now filter empty/blank entries at the source — TableQueryAuthResult constructor strips falsy entries from the filter list ([f for f in filter if f]) and empty keys/values from column_masking dict. AuthMaskingReader also guards if not tj: continue before json.loads. Python truthiness check (if f) matches Java StringUtils.isEmpty semantics: skips null and "" but passes through whitespace strings. Added regression tests for mixed empty/valid filters and empty masking values.


def new_read_builder(self) -> 'ReadBuilder':
return ReadBuilder(self)
def new_read_builder(self, *, _skip_auth=False) -> 'ReadBuilder':

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why adding _skip_auth?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Python write paths (e.g. table_update.py, table_upsert_by_key.py,
file_store_write.py) internally call new_read_builder() to read existing
data for operations like UPDATE, UPSERT, and MERGE INTO. This is a pre-existing
design that predates this PR.

Once query auth is injected into new_read_builder(), those internal reads
would also be subject to row filtering and column masking — returning a subset
of rows or masked column values to the write engine, which would corrupt the
write results.

_skip_auth=True bypasses auth for these internal, system-initiated reads,
matching Java's design: Java's write paths operate at a lower abstraction level
(e.g. LocalTableQuery, SortedGlobalIndexBuilder) that never passes through
AbstractDataTableScan's auth injection point in the first place.

The leading underscore signals that this is an internal API not intended for
external callers.

@JingsongLi JingsongLi left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to take a closer look here. This feature cannot affect the Public API.

  1. Why modify daft?
  2. Why modify read_builder?

This is not just questions. This is the tendency of design, and I need you to provide clear solutions instead of why you are currently changing them.
Java has not affected the Public API.

@JingsongLi

Copy link
Copy Markdown
Contributor

AI cannot solve these problems, please provide feedback in the form of a real person.

@MgjLLL MgjLLL force-pushed the python-query-auth branch from 44bdea6 to be6c040 Compare July 1, 2026 08:17
…talog

When query-auth.enabled is true, the Python client calls POST
/v1/.../databases/{db}/tables/{tb}/auth before producing a Plan and
applies row-level filters and column masking transforms on the read
path, matching the Java client behavior.

Auth is injected via post-construction on TableScan/StreamingTableScan
by ReadBuilder/StreamReadBuilder — no Public API signatures change.
Write paths disable auth before plan() to preserve data integrity.

New files:
- catalog/table_query_auth.py: TableQueryAuthResult
- common/predicate_json_parser.py: JSON predicate to PyArrow filters
- read/query_auth_split.py: QueryAuthSplit wrapper
- read/reader/auth_masking_reader.py: filter, masking, projection readers
@MgjLLL MgjLLL force-pushed the python-query-auth branch from be6c040 to e774825 Compare July 3, 2026 08:21
auth_result = self._auth_query()
plan = self.file_scanner.scan()
if auth_result is not None:
plan = auth_result.convert_plan(plan)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This auth wrapping is only applied by plan(), but scan_with_stats() below still returns the raw scanner plan. ReadBuilder.explain() uses scan_with_stats(), and Daft explain derives routing from those split details; the actual Daft scan path sees QueryAuthSplit and falls back to pypaimon, while explain will still report native Parquet for the same auth-enabled scan. Please apply the same _auth_query().convert_plan(...) logic in scan_with_stats() while preserving the ScanStats, and add an explain regression for an auth-enabled table.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Feature] [python] Support query auth (row filter & column masking) for REST catalog

2 participants