Skip to content

Fix list-of-struct nested projection#2

Open
godnight10061 wants to merge 16 commits intodevelopfrom
fix-list-of-struct-projection
Open

Fix list-of-struct nested projection#2
godnight10061 wants to merge 16 commits intodevelopfrom
fix-list-of-struct-projection

Conversation

@godnight10061
Copy link
Copy Markdown
Owner

@godnight10061 godnight10061 commented Jan 20, 2026

Summary

  • Enable nested projection for list and fixed_size_list via internal get_item_list and list layout pushdown.
  • Improve split planning:
    • FixedSizeList: map element splits back to row splits.
    • List: map offsets splits back to row splits.
  • Add unit regressions for scan dtype simplification (list and fixed_size_list).

Tests

Local (Windows):

  • cargo +nightly fmt --all --check
  • cargo clippy --locked --workspace --all-targets --exclude vortex-test-e2e-cuda --exclude vortex-cuda --exclude vortex-cub --exclude vortex-nvcomp --exclude vortex-bench --exclude vortex-python --exclude vortex-duckdb --exclude vortex-fuzz --exclude duckdb-bench --exclude lance-bench --exclude datafusion-bench --exclude random-access-bench --exclude compress-bench -- -D warnings
  • cargo nextest run --locked --workspace --all-features --no-fail-fast --exclude vortex-test-e2e-cuda --exclude vortex-cuda --exclude vortex-cub --exclude vortex-nvcomp --exclude vortex-bench --exclude vortex-python --exclude vortex-duckdb --exclude vortex-fuzz --exclude duckdb-bench --exclude lance-bench --exclude datafusion-bench --exclude random-access-bench --exclude compress-bench

Notes:

  • setup-rust installs protoc; if running manually, ensure protoc is available on PATH (or PROTOC set).

References

@godnight10061 godnight10061 added the fix Bug fixes label Jan 20, 2026
@godnight10061
Copy link
Copy Markdown
Owner Author

/gemini review

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a significant improvement by fixing nested projection for list<struct> types. This is achieved by adding a new shredded list layout (vortex.list) and extending get_item and the list reader to support this new structure. The implementation appears robust and is accompanied by thorough tests, including a new integration test that validates the end-to-end workflow. While the overall approach is sound, I've identified several areas with significant code duplication, particularly in get_item.rs, which could be refactored to enhance maintainability. My review includes suggestions for these refactorings.

Comment thread vortex-array/src/expr/exprs/get_item.rs Outdated
Comment thread vortex-array/src/expr/exprs/get_item.rs Outdated
Comment thread vortex-layout/src/layouts/list/mod.rs Outdated
Comment thread vortex-layout/src/layouts/list/writer.rs
@godnight10061
Copy link
Copy Markdown
Owner Author

/gemini review

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request effectively addresses nested projection for list<struct> types by introducing a new shredded list layout (vortex.list) and extending get_item to support it. The changes are comprehensive, including updates to the layout system, expression evaluation, and the addition of thorough tests. The implementation of the new list reader and writer strategies appears solid.

My review includes a couple of suggestions for refactoring in get_item.rs to reduce code duplication in the evaluate and execute methods, which would improve long-term maintainability. Overall, this is a great contribution that significantly enhances Vortex's capabilities for handling nested data.

Comment thread vortex-array/src/expr/exprs/get_item.rs Outdated
Comment thread vortex-array/src/expr/exprs/get_item.rs Outdated
@godnight10061 godnight10061 force-pushed the fix-list-of-struct-projection branch from d5a8576 to 87d9553 Compare January 21, 2026 02:35
@godnight10061
Copy link
Copy Markdown
Owner Author

/gemini review

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

The pull request significantly enhances the handling of nested projections, particularly for List<Struct> types, by introducing a shredded layout (vortex.list). The new Map expression is a well-designed addition, seamlessly integrating into the expression simplification and evaluation framework. The vortex-layout changes correctly implement the ListLayout and its associated reader and writer strategies, which is crucial for efficient storage and retrieval of shredded list data. Furthermore, the updates in vortex-datafusion and vortex-duckdb ensure consistent application of expression optimization, leveraging the new Map expression's simplification capabilities. The new test cases provide thorough validation of the core functionality.

Comment thread vortex-array/src/expr/exprs/map.rs Outdated
Comment thread vortex-layout/src/layouts/table.rs
Comment thread vortex-layout/src/layouts/list/writer.rs
Comment thread vortex-layout/src/layouts/list/reader.rs
Comment thread vortex-layout/src/layouts/list/reader.rs
Comment thread vortex-layout/src/layouts/list/mod.rs
Comment thread vortex-array/src/expr/exprs/get_item.rs Outdated
Comment thread vortex-array/src/expr/exprs/get_item.rs Outdated
Comment thread vortex-array/src/expr/exprs/get_item.rs Outdated
Comment thread vortex-array/src/expr/exprs/get_item.rs Outdated
@godnight10061
Copy link
Copy Markdown
Owner Author

/gemini review

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request effectively addresses the nested projection for list<struct> types by introducing a new vortex.list shredded layout and a map expression. The changes correctly extend get_item and list reader pushdown logic to support this functionality. Comprehensive tests have been added to validate the new behavior and optimizations. The integration of the new ListStrategy into the TableStrategy and the strategic use of optimize_recursive across different modules ensure that the new simplification rules are applied, leading to a robust and efficient solution.

@godnight10061 godnight10061 force-pushed the fix-list-of-struct-projection branch from 6a2a8f7 to 2973c55 Compare January 23, 2026 07:12
@godnight10061 godnight10061 added the changelog/fix Changelog: bug fix label Jan 23, 2026
@godnight10061
Copy link
Copy Markdown
Owner Author

/gemini review

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces significant improvements for handling nested projections on list-of-struct and fixed-size-list-of-struct types. It achieves this by introducing a new GetItemList expression and a vortex.list layout encoding, which allows for shredding list arrays into their components (validity, offsets, elements) for more efficient processing and projection pushdown. The changes are well-tested with new unit and integration tests, and the dtype calculation logic in scan builders has been updated to account for the new simplification rules. The explicit marking of GetItemList as unstable and non-serializable is a clear indication of its current internal optimization role.

Comment thread vortex-array/src/expr/exprs/get_item.rs
Comment thread vortex-array/src/expr/exprs/get_item.rs
Comment thread vortex-array/src/expr/exprs/get_item_list.rs
Comment thread vortex-array/src/expr/exprs/get_item_list.rs
Comment thread vortex-file/tests/test_write_table.rs Outdated
Comment thread vortex-layout/src/layouts/table.rs
Comment thread vortex-scan/src/gpu/gpubuilder.rs Outdated
Comment thread vortex-scan/src/gpu/gpubuilder.rs Outdated
Comment thread vortex-scan/src/scan_builder.rs Outdated
Comment thread vortex-scan/src/scan_builder.rs Outdated
@godnight10061
Copy link
Copy Markdown
Owner Author

/gemini review

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces significant improvements for handling nested projections on lists of structs by implementing a shredded layout (vortex.list) and a new GetItemList expression. The changes correctly propagate nullability and enable pushdown optimizations for these complex types. New tests validate the functionality for both variable-size and fixed-size lists of structs.

Comment thread vortex-layout/src/layouts/list/reader.rs Outdated
@godnight10061
Copy link
Copy Markdown
Owner Author

/gemini review

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces significant improvements for handling nested projections involving lists of structs and fixed-size lists of structs. The core changes include a new GetItemList expression to specifically manage projections on list elements, and updates to the GetItem expression's simplify method to transform nested get_item calls into get_item_list when appropriate. The new ListLayout and ListStrategy provide robust mechanisms for reading and writing these complex data types, including improved split planning for FixedSizeList elements. Furthermore, the ScanBuilder and GpuScanBuilder have been updated to ensure that DType calculations correctly account for these simplified projections, preventing errors with nested field access. The added test cases thoroughly validate the new functionality and ensure proper behavior for various list-of-struct scenarios.

@godnight10061 godnight10061 force-pushed the fix-list-of-struct-projection branch 2 times, most recently from a7a6b7d to da77c0c Compare January 24, 2026 03:12
@godnight10061
Copy link
Copy Markdown
Owner Author

/gemini review

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a significant feature: nested projection for lists of structs. This is achieved through a new get_item_list expression and a corresponding ListLayout. The changes are extensive and well-implemented, including improved split planning for lists and fixed-size lists.

A major refactoring from fields() to unmasked_fields() across the codebase enhances clarity regarding how struct validity is handled. Additionally, there are notable improvements in error handling, with panics being replaced by VortexResult in several places.

The PR also brings substantial additions to CUDA support, with new kernels for dictionary decoding and new benchmarks. The introduction of new fuzzing targets for compressed encodings is a great step towards improving the robustness of the library.

Overall, this is a high-quality contribution with well-thought-out changes and good test coverage. I have a couple of minor suggestions for code quality and documentation correctness.

Comment on lines +106 to +132
let ends_array = match ends_ptype {
PType::U8 => {
let ends_typed: Vec<u8> = ends
.iter()
.map(|&e| u8::try_from(e).vortex_expect("end value fits in u8"))
.collect();
PrimitiveArray::new(Buffer::copy_from(ends_typed), Validity::NonNullable).into_array()
}
PType::U16 => {
let ends_typed: Vec<u16> = ends
.iter()
.map(|&e| u16::try_from(e).vortex_expect("end value fits in u16"))
.collect();
PrimitiveArray::new(Buffer::copy_from(ends_typed), Validity::NonNullable).into_array()
}
PType::U32 => {
let ends_typed: Vec<u32> = ends
.iter()
.map(|&e| u32::try_from(e).vortex_expect("end value fits in u32"))
.collect();
PrimitiveArray::new(Buffer::copy_from(ends_typed), Validity::NonNullable).into_array()
}
PType::U64 => {
PrimitiveArray::new(Buffer::copy_from(ends), Validity::NonNullable).into_array()
}
_ => unreachable!("Only unsigned integer types are valid for ends"),
};
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The logic for creating primitive arrays for different unsigned integer types is a bit repetitive. You can make this more concise by using PrimitiveArray::from_iter and avoiding the intermediate Vec collection. This will improve readability and maintainability.

    let ends_array = match ends_ptype {
        PType::U8 => PrimitiveArray::from_iter(
            ends.iter()
                .map(|&e| u8::try_from(e).vortex_expect("end value fits in u8")),
        )
        .into_array(),
        PType::U16 => PrimitiveArray::from_iter(
            ends.iter()
                .map(|&e| u16::try_from(e).vortex_expect("end value fits in u16")),
        )
        .into_array(),
        PType::U32 => PrimitiveArray::from_iter(
            ends.iter()
                .map(|&e| u32::try_from(e).vortex_expect("end value fits in u32")),
        )
        .into_array(),
        PType::U64 => PrimitiveArray::from(ends).into_array(),
        _ => unreachable!("Only unsigned integer types are valid for ends"),
    };

}

impl<T: DeviceRepr + Send + Sync + 'static> DeviceBuffer for CudaDeviceBuffer<T> {
/// Returns the number of elements in the buffer of type T.
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The documentation for len() is now incorrect after the implementation change. It states that it returns the number of elements, but the implementation now correctly returns the number of bytes to conform to the DeviceBuffer trait. The doc comment should be updated to reflect this.

Suggested change
/// Returns the number of elements in the buffer of type T.
/// Returns the number of bytes in the buffer.

@godnight10061 godnight10061 changed the base branch from develop to develop-upstream January 25, 2026 12:37
@godnight10061 godnight10061 changed the base branch from develop-upstream to develop January 25, 2026 12:37
@godnight10061
Copy link
Copy Markdown
Owner Author

/gemini review

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a significant feature enhancement by enabling nested projection for list<struct> and fixed_size_list<struct> types. This is achieved through a new internal GetItemList expression and a simplification rule on GetItem. The changes are comprehensive, affecting expression analysis, execution, and the scan builder to accommodate the new simplification path. A new vortex.list layout with reader/writer implementations and improved split planning is also a key part of this update. The added regression tests are thorough and validate the new functionality well. The code is of high quality, and I have one suggestion for a minor refactoring to reduce code duplication in the GetItemList execution logic for better maintainability.

Comment on lines +135 to +178
match input.dtype() {
DType::List(..) => {
let list = input.execute::<ListViewArray>(args.ctx)?;
let struct_elems = list.elements().clone().execute::<StructArray>(args.ctx)?;

let field = struct_elems.unmasked_field_by_name(field_name)?.clone();
let field = match struct_elems.dtype().nullability() {
Nullability::NonNullable => field,
Nullability::Nullable => mask(&field, &struct_elems.validity_mask()?.not())?,
};

ListViewArray::try_new(
field,
list.offsets().clone(),
list.sizes().clone(),
list.validity().clone(),
)?
.into_array()
.execute(args.ctx)
}
DType::FixedSizeList(..) => {
let list = input.execute::<FixedSizeListArray>(args.ctx)?;
let struct_elems = list.elements().clone().execute::<StructArray>(args.ctx)?;

let field = struct_elems.unmasked_field_by_name(field_name)?.clone();
let field = match struct_elems.dtype().nullability() {
Nullability::NonNullable => field,
Nullability::Nullable => mask(&field, &struct_elems.validity_mask()?.not())?,
};

FixedSizeListArray::try_new(
field,
list.list_size(),
list.validity().clone(),
list.len(),
)?
.into_array()
.execute(args.ctx)
}
_ => Err(vortex_err!(
"Expected list scope for GetItemList execution, got {}",
input.dtype()
)),
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The logic for handling DType::List and DType::FixedSizeList is very similar, particularly the part that applies the struct's validity mask to the projected field. To improve maintainability and reduce code duplication, consider refactoring the common logic.

For example, you could extract a helper function to apply the mask from the struct elements:

fn apply_struct_mask(field: crate::Array, struct_elems: &StructArray) -> VortexResult<crate::Array> {
    match struct_elems.dtype().nullability() {
        Nullability::NonNullable => Ok(field),
        Nullability::Nullable => crate::compute::mask(&field, &struct_elems.validity_mask()?.not()),
    }
}

Using this helper in both match arms would make the execute method cleaner. You might find opportunities for further generalization to handle both list types in a more unified manner.

@godnight10061
Copy link
Copy Markdown
Owner Author

/gemini review

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a significant feature: nested projection for list<struct> and fixed_size_list<struct> types. This is achieved by adding a new internal expression GetItemList and a corresponding ListLayout. The changes are well-implemented, with improved split planning and necessary adjustments to the scan builder logic. The addition of regression tests is also a great touch. My review focuses on a few opportunities to improve code clarity and enforce invariants.

Comment on lines +68 to +72
match &layout.dtype {
DType::List(..) => 2 + validity_children, // offsets + elements
DType::FixedSizeList(..) => 1 + validity_children, // elements
_ => 0,
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The build function for ListLayout ensures that the dtype is always a List or FixedSizeList. Therefore, the _ arm in this match should be unreachable. Using unreachable! would make this invariant explicit and prevent silent failures if the layout is ever constructed incorrectly.

Suggested change
match &layout.dtype {
DType::List(..) => 2 + validity_children, // offsets + elements
DType::FixedSizeList(..) => 1 + validity_children, // elements
_ => 0,
}
match &layout.dtype {
DType::List(..) => 2 + validity_children, // offsets + elements
DType::FixedSizeList(..) => 1 + validity_children, // elements
_ => unreachable!("ListLayout only supports List and FixedSizeList dtypes, got {}", layout.dtype),
}

Comment on lines +104 to +120
let idx = match self.dtype() {
DType::List(..) => {
if self.dtype().is_nullable() {
2
} else {
1
}
}
DType::FixedSizeList(..) => {
if self.dtype().is_nullable() {
1
} else {
0
}
}
_ => return Err(vortex_err!("Expected list dtype, got {}", self.dtype())),
};
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The logic to determine the index of the elements child can be simplified by using the boolean-to-integer cast of is_nullable().

        let idx = match self.dtype() {
            DType::List(..) => 1 + self.dtype().is_nullable() as usize,
            DType::FixedSizeList(..) => self.dtype().is_nullable() as usize,
            _ => return Err(vortex_err!("Expected list dtype, got {}", self.dtype())),
        };

Comment on lines +194 to +224
let offsets_slice_u64: VortexResult<Vec<u64>> =
match offsets.ptype() {
ptype if ptype.is_unsigned_int() => vortex_dtype::match_each_unsigned_integer_ptype!(ptype, |T| {
Ok(offsets
.as_slice::<T>()
.iter()
.map(|&v| v.to_u64())
.collect())
}),
ptype if ptype.is_signed_int() => {
vortex_dtype::match_each_signed_integer_ptype!(
ptype,
|T| {
offsets
.as_slice::<T>()
.iter()
.map(|&v| {
u64::try_from(v).map_err(|_| {
vortex_err!(
"List offsets must be convertible to u64"
)
})
})
.collect()
}
)
}
other => Err(vortex_err!(
"List offsets must be an integer type, got {other}"
)),
};
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This block for converting list offsets to Vec<u64> is quite complex and deeply nested. To improve readability and maintainability, consider extracting this logic into a separate helper function, for example fn offsets_to_u64(offsets: &PrimitiveArray) -> VortexResult<Vec<u64>>.

@godnight10061 godnight10061 force-pushed the fix-list-of-struct-projection branch 3 times, most recently from 830efb6 to cc87962 Compare January 29, 2026 03:33
@godnight10061
Copy link
Copy Markdown
Owner Author

/gemini review

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a significant feature by enabling nested projections on lists of structs through a new GetItemList expression. The implementation is comprehensive, adding new layouts, readers, and writers for list types, and correctly updating the scan builders to handle expression simplification. The code is well-structured and includes good test coverage for the new functionality. I've identified a few opportunities for refactoring to improve code clarity and reduce duplication, which are detailed in the specific comments.

Comment on lines +140 to +144
let field = struct_elems.unmasked_field_by_name(field_name)?.clone();
let field = match struct_elems.dtype().nullability() {
Nullability::NonNullable => field,
Nullability::Nullable => mask(&field, &struct_elems.validity_mask()?.not())?,
};
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The logic to get the projected field and apply the struct's validity mask can be simplified by using the existing StructArray::field_by_name method. This helper function encapsulates this exact logic, making the code more concise and consistent with other parts of the codebase.

Suggested change
let field = struct_elems.unmasked_field_by_name(field_name)?.clone();
let field = match struct_elems.dtype().nullability() {
Nullability::NonNullable => field,
Nullability::Nullable => mask(&field, &struct_elems.validity_mask()?.not())?,
};
let field = struct_elems.field_by_name(field_name)?;

Comment on lines +159 to +163
let field = struct_elems.unmasked_field_by_name(field_name)?.clone();
let field = match struct_elems.dtype().nullability() {
Nullability::NonNullable => field,
Nullability::Nullable => mask(&field, &struct_elems.validity_mask()?.not())?,
};
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Similar to the DType::List branch, this block can be simplified by using the StructArray::field_by_name helper method to get the projected field while correctly handling the struct's nullability.

Suggested change
let field = struct_elems.unmasked_field_by_name(field_name)?.clone();
let field = match struct_elems.dtype().nullability() {
Nullability::NonNullable => field,
Nullability::Nullable => mask(&field, &struct_elems.validity_mask()?.not())?,
};
let field = struct_elems.field_by_name(field_name)?;

Comment on lines +142 to +255
match dtype {
DType::List(_, list_nullability) => {
let offsets_reader = self.offsets()?.clone();
let elements_reader = self.elements()?.clone();
let row_range_clone = row_range.clone();
let element_expr = element_expr.clone();

Ok(Box::pin(async move {
let row_len = usize::try_from(row_range_clone.end - row_range_clone.start)
.vortex_expect("row range must fit in usize");

let offsets_row_range = row_range_clone.start..row_range_clone.end + 1;
let offsets_len = row_len + 1;
let offsets_fut = offsets_reader.projection_evaluation(
&offsets_row_range,
&root(),
MaskFuture::new_true(offsets_len),
)?;

let (offsets, validity) = try_join!(offsets_fut, async move {
match validity_fut {
Some(v) => v.await.map(Some),
None => Ok(None),
}
})?;

let offsets = offsets.to_primitive();
let offsets_slice = offsets.as_slice::<u64>();
let base = *offsets_slice.first().unwrap_or(&0u64);
let end = *offsets_slice.last().unwrap_or(&base);

let elements_row_range = base..end;
let elements_len = usize::try_from(end - base)
.vortex_expect("element range must fit in usize");
let elements = elements_reader.projection_evaluation(
&elements_row_range,
&element_expr,
MaskFuture::new_true(elements_len),
)?;

let elements = elements.await?;

let normalized_offsets = vortex_array::arrays::PrimitiveArray::from_iter(
offsets_slice.iter().map(|v| *v - base),
)
.into_array();

let validity = match (list_nullability, validity) {
(Nullability::NonNullable, _) => {
vortex_array::validity::Validity::NonNullable
}
(Nullability::Nullable, Some(v)) => {
vortex_array::validity::Validity::Array(v)
}
(Nullability::Nullable, None) => vortex_array::validity::Validity::AllValid,
};

Ok(ListArray::try_new(elements, normalized_offsets, validity)?.into_array())
}))
}
DType::FixedSizeList(_, list_size, list_nullability) => {
let elements_reader = self.elements()?.clone();
let row_range_clone = row_range.clone();
let element_expr = element_expr.clone();

Ok(Box::pin(async move {
let row_len_u64 = row_range_clone.end - row_range_clone.start;
let row_len =
usize::try_from(row_len_u64).vortex_expect("row range must fit in usize");

let list_size_u64 = u64::from(list_size);
let element_start = row_range_clone
.start
.checked_mul(list_size_u64)
.ok_or_else(|| vortex_err!("FixedSizeList element start overflow"))?;
let element_end = row_range_clone
.end
.checked_mul(list_size_u64)
.ok_or_else(|| vortex_err!("FixedSizeList element end overflow"))?;

let elements_row_range = element_start..element_end;
let elements_len = usize::try_from(element_end - element_start)
.vortex_expect("element range must fit in usize");
let elements_fut = elements_reader.projection_evaluation(
&elements_row_range,
&element_expr,
MaskFuture::new_true(elements_len),
)?;

let (elements, validity) = try_join!(elements_fut, async move {
match validity_fut {
Some(v) => v.await.map(Some),
None => Ok(None),
}
})?;

let validity = match (list_nullability, validity) {
(Nullability::NonNullable, _) => {
vortex_array::validity::Validity::NonNullable
}
(Nullability::Nullable, Some(v)) => {
vortex_array::validity::Validity::Array(v)
}
(Nullability::Nullable, None) => vortex_array::validity::Validity::AllValid,
};

Ok(
FixedSizeListArray::try_new(elements, list_size, validity, row_len)?
.into_array(),
)
}))
}
_ => Err(vortex_err!("Expected list dtype, got {}", dtype)),
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

There's some repeated logic for handling validity_fut and constructing the final Validity in both branches of the match dtype. This could be refactored to reduce duplication.

You could move the async move block to wrap the whole function body and resolve validity_fut once at the beginning. Then, inside the match, you would only need to await the list-specific futures (offsets_fut or elements_fut). This would make the code cleaner and avoid repeating the validity handling logic.

Comment on lines 189 to 203
pub fn dtype(&self) -> VortexResult<DType> {
self.projection.return_dtype(self.layout_reader.dtype())
// NOTE: `GetItem` may simplify into `GetItemList` for list-of-struct projections.
// To avoid rejecting valid nested projections (like `items.a`) we must simplify before
// validating the return dtype.
//
// Also, `row_idx` support is provided by `RowIdxLayoutReader`, so use the same reader
// enrichment as `prepare`.
let layout_reader = Arc::new(RowIdxLayoutReader::new(
self.row_offset,
self.layout_reader.clone(),
self.session.clone(),
));
let projection = self.projection.optimize_recursive(layout_reader.dtype())?;
projection.return_dtype(layout_reader.dtype())
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The logic for creating the RowIdxLayoutReader and simplifying the projection is duplicated between this dtype method and the prepare method. This could be extracted into a private helper method to avoid code duplication and ensure consistency.

For example, you could have a prepare_projection_and_reader(&self) method that returns the simplified projection, its dtype, and the enriched layout reader. Both dtype() and prepare() could then call this helper.

@godnight10061
Copy link
Copy Markdown
Owner Author

/gemini review

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a significant and valuable feature: enabling nested projection for list<struct> and fixed_size_list<struct>. The implementation is well-designed, introducing a new get_item_list expression and a simplify rule on get_item to handle these projections effectively. The addition of a new ListLayout and improvements to split planning for FixedSizeList are also great enhancements. The code is clean, well-structured, and includes thorough regression tests. I have one suggestion regarding a potential missed optimization for List split planning.

// TODO(a10y): Variable-size lists can only be split "naturally" based on elements,
// but mapping element splits back to row ranges requires offsets (or precomputed
// row->element boundary metadata) that we don't have at split-planning time.
DType::List(..) => {}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The PR description mentions improving split planning for List by splitting via offsets. However, this branch of register_splits is currently empty.

While the TODO comment correctly points out the difficulty of mapping element splits back to rows, we could still implement splitting based on the chunk boundaries of the offsets array itself. A split in the offsets array at index i would correspond to a split in the list array at row i.

Was this intentionally omitted? It seems like a useful optimization that would align with the PR's goals.

@godnight10061
Copy link
Copy Markdown
Owner Author

/gemini review

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces support for nested projections on list<struct> and fixed_size_list<struct> types, which is a great enhancement. This is primarily achieved by adding a new internal get_item_list expression and implementing a pushdown mechanism in the new ListLayout. The changes also include improvements to split planning for list types and add comprehensive regression tests. The implementation is solid and the changes are well-tested. I have one minor suggestion to simplify the code in the new ListReader.

Comment on lines +329 to +338
for offsets_split in offsets_splits {
if offsets_split <= row_range.start {
continue;
}

let row_split = offsets_split - 1;
if row_split > row_range.start && row_split < row_range.end {
splits.insert(row_split);
}
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The initial check if offsets_split <= row_range.start is redundant. The offsets_splits are generated for an offsets_range which starts at row_range.start, so offsets_split will always be >= row_range.start. The only case where the condition is true is offsets_split == row_range.start, which results in row_split = row_range.start - 1. This is already correctly handled by the subsequent if row_split > row_range.start check. Removing the redundant condition will simplify the loop.

                for offsets_split in offsets_splits {
                    let row_split = offsets_split - 1;
                    if row_split > row_range.start && row_split < row_range.end {
                        splits.insert(row_split);
                    }
                }

@godnight10061
Copy link
Copy Markdown
Owner Author

/gemini review

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request effectively enables nested projection for list<struct> and fixed_size_list<struct> types. The introduction of the internal get_item_list expression is a clever and pragmatic solution to handle this without a full map expression implementation. The changes are well-integrated across the expression system, layout management, and scanning logic. The new ListLayout with its reader and writer components is well-designed and correctly handles component shredding. The improvements to split planning for list types are also a valuable addition. The added unit tests are comprehensive and provide good coverage for the new functionality. Overall, this is a high-quality contribution that significantly enhances Vortex's capabilities for handling nested data.

@godnight10061 godnight10061 force-pushed the fix-list-of-struct-projection branch from 6c6326a to 3690d75 Compare January 30, 2026 09:12
@godnight10061
Copy link
Copy Markdown
Owner Author

/gemini review

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces significant improvements by enabling nested projection for list<struct> and fixed_size_list<struct> types. This is achieved through the addition of an internal get_item_list expression and enhanced split planning for list layouts. The changes are well-implemented, including new layout components, updated expression simplification logic, and comprehensive unit tests to validate the new functionality. The refactoring in scan_builder.rs also improves code organization.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

changelog/fix Changelog: bug fix fix Bug fixes

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant