
refactor parquet -> icebug-disk #476

Merged
adsharma merged 10 commits into LadybugDB:main from aheev:refactor-parquet
May 13, 2026

Conversation

@aheev (Contributor) commented May 10, 2026

Apart from renames:

  • added checks for mixed type tables
  • added version checks
  • updated storage = format
  • fixed StorageManager defaulting to icebug-disk
  • added doc

Closes #469

@aheev (Contributor, Author) commented May 10, 2026

dataset PR: LadybugDB/dataset#1

@aheev aheev force-pushed the refactor-parquet branch 2 times, most recently from 5a77447 to e63c9b7 on May 11, 2026 04:40
@aheev (Contributor, Author) commented May 11, 2026

@adsharma could you PTAL? The minimal test failure can be fixed by merging the dataset PR.

@aheev (Contributor, Author) commented May 11, 2026

Ran the tests added in this PR locally via the CLI too. Also, lbug -i schema.cypher works fine.

@adsharma (Contributor) commented

minimal test failure can be fixed by merging dataset LadybugDB/dataset#1

Could you add a commit to reset the submodule pointer? Tests should pass then. Also see my comment about making things work when schema.cypher is not in the same dir.

@adsharma (Contributor) commented

(storage = 'icebug-disk:<path-to-dir>');

This doesn't leave room for saying icebug-disk where the path is s3://server.com/path-to-dir, which the VFS supports (but not tested in CI because we don't have testable s3 urls).

How about:

(storage = '<any valid url>', format = 'icebug-disk')

The format could be guessed/defaulted based on the file extension in the URL. This is consistent with how DuckDB does EXPORT/IMPORT.
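The extension-based guessing suggested above could be sketched like this. This is a hypothetical illustration, not LadybugDB's implementation; the mapping table and the default fallback are assumptions.

```python
# Hypothetical sketch: infer a storage format from the URL's file extension,
# falling back to a default for extensionless (directory-style) URLs.
from urllib.parse import urlparse
from os.path import splitext

EXTENSION_FORMATS = {".parquet": "parquet", ".csv": "csv"}  # illustrative mapping

def guess_format(storage_url: str, default: str = "icebug-disk") -> str:
    path = urlparse(storage_url).path           # handles s3://, file://, plain paths
    ext = splitext(path)[1].lower()
    return EXTENSION_FORMATS.get(ext, default)  # directories fall back to default

print(guess_format("s3://bucket/export/nodes.parquet"))  # parquet
print(guess_format("s3://bucket/foo/bar"))               # icebug-disk
```

Under this scheme an explicit format = '...' option would always win over the guess, matching how DuckDB lets FORMAT override the extension in EXPORT/IMPORT.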

I haven't reviewed the rest of the PR. But think it's directionally correct.

@aheev (Contributor, Author) commented May 12, 2026

(storage = 'icebug-disk:<path-to-dir>');

This doesn't leave room for saying icebug-disk where the path is s3://server.com/path-to-dir, which the VFS supports (but not tested in CI because we don't have testable s3 urls).

The path is parsed out by strictly checking for the icebug-disk: prefix.
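The two option styles under discussion can be contrasted with a small sketch. Function names and the "native" fallback are illustrative assumptions, not the actual LadybugDB code.

```python
# Hypothetical sketch of the two styles discussed in this thread.

def parse_prefixed(storage: str) -> tuple[str, str]:
    """Style 1: format fused into the storage string, 'icebug-disk:<path>'."""
    prefix = "icebug-disk:"
    if storage.startswith(prefix):
        return "icebug-disk", storage[len(prefix):]
    return "native", storage  # assumed default format name

def parse_separate(storage: str, fmt: str = "native") -> tuple[str, str]:
    """Style 2: storage is any URL; the format is its own option."""
    return fmt, storage

# The prefix style does still yield a usable s3 URL after the split...
print(parse_prefixed("icebug-disk:s3://bucket/foo"))
# ...but the separate-option style keeps storage parsable by any URL library,
# with no project-specific prefix to strip first.
print(parse_separate("s3://bucket/foo", "icebug-disk"))
```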

How about:

(storage = '<any valid url>', format = 'icebug-disk')

all in for separate dir path and format options

The format could be guessed/defaulted based on the file extension in the URL. This is consistent with how DuckDB does EXPORT/IMPORT.

storage is a dir path, right? Also, defaulting to icebug-disk whenever storage is a dir would block any future formats from using the storage option.

@aheev aheev force-pushed the refactor-parquet branch from 5a810b8 to 1007994 on May 12, 2026 11:36
@adsharma (Contributor) commented
LGTM except for this item:

all in for separate dir path and format options

Looks like you haven't implemented it yet? A test case for:

(storage = 's3://bucket/foo/bar', format = 'icebug-disk')

and we should be good to go. Also format = 'bad' should fail.

@aheev (Contributor, Author) commented May 12, 2026

LGTM except for this item:

all in for separate dir path and format options

Looks like you haven't implemented it yet? A test case for:

I was waiting for your signal 🙃

@aheev (Contributor, Author) commented May 12, 2026

(storage = 's3://bucket/foo/bar', format = 'icebug-disk')

This should fail, right?

@adsharma (Contributor) commented

It's the main use case for icebug-disk. I expect a dir on remote object storage will be a common case. I couldn't tell if you agree with separating format and path logically by string-splitting on icebug-disk: or via a separate format = x option.

lbug> CREATE NODE TABLE city(id INT32, name STRING, population INT64, PRIMARY KEY(id)) WITH (storage = 'icebug-disk:s3://bucket/foo', format = 'icedisk');

...
Error: IO exception: Cannot open file s3:/bucket/foo/nodes_city.parquet: No such file or directory

@adsharma (Contributor) commented

I was waiting for your signal

Missed this comment. I like the separate format = 'x' option, so storage is parsable with any URL parsing lib.

@aheev (Contributor, Author) commented May 12, 2026

It's the main use case for icebug-disk. I expect a dir on remote object storage will be a common case. I couldn't tell if you agree with separating format and path logically by string-splitting on icebug-disk: or via a separate format = x option.

lbug> CREATE NODE TABLE city(id INT32, name STRING, population INT64, PRIMARY KEY(id)) WITH (storage = 'icebug-disk:s3://bucket/foo', format = 'icedisk');

...
Error: IO exception: Cannot open file s3:/bucket/foo/nodes_city.parquet: No such file or directory

separate format option

@aheev (Contributor, Author) commented May 12, 2026

new dataset PR: LadybugDB/dataset#2

@aheev aheev force-pushed the refactor-parquet branch from d6601f4 to 3b089ce on May 13, 2026 02:12
@aheev (Contributor, Author) commented May 13, 2026

@adsharma could you PTAL?

@adsharma (Contributor) left a review

Two minor issues - suggest making a small PR post merge.

Two major pre-existing issues (out of scope for this PR), but noting here for future reference:

  • High: src/storage/table/ice_disk_node_table.cpp:306 assigns nodeID.offset = rowIndex, where rowIndex is only the row’s position within the currently assigned Parquet row group. With multiple row groups, every row group starts offsets at 0, producing duplicate node IDs and breaking joins/relationship traversals.
    This needs a row-group base offset from Parquet metadata added to the local row index.
  • High: src/storage/table/ice_disk_rel_table.cpp:311 drops relationships for later bound nodes in the same row group. When the scan sees a second boundOffset, it breaks, then scanInternalByRowGroups increments currentRowGroup, so the rest of that row group is never revisited. A query with a batch containing multiple source nodes can return only the first node’s relationships.
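The fix the first issue calls for — adding a row-group base offset to the local row index — can be sketched in a few lines. This is an illustrative model only, with assumed names; the actual fix lives in the C++ table code and would read the per-row-group sizes from Parquet metadata.

```python
# Hypothetical sketch: global node offset = base offset of the row group
# (cumulative rows before it, from Parquet metadata) + local row index.
from itertools import accumulate

def row_group_base_offsets(row_group_sizes: list[int]) -> list[int]:
    """Base offset of each row group = total rows in all earlier row groups."""
    return [0] + list(accumulate(row_group_sizes))[:-1]

def global_offset(row_group: int, row_index: int, bases: list[int]) -> int:
    return bases[row_group] + row_index

# Two row groups of 100 and 50 rows: without the base offset, row 0 of
# group 1 would collide with row 0 of group 0 and produce duplicate node IDs.
bases = row_group_base_offsets([100, 50])
print(bases)                       # [0, 100]
print(global_offset(1, 0, bases))  # 100, not a duplicate 0
```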

Comment threads: src/include/storage/table/ice_disk_utils.h and src/include/common/constants.h
@adsharma adsharma merged commit 37143c9 into LadybugDB:main May 13, 2026
4 checks passed
@aheev aheev deleted the refactor-parquet branch May 13, 2026 04:49
@aheev (Contributor, Author) commented May 13, 2026

Two minor issues - suggest making a small PR post merge.

Two major pre-existing issues (out of scope for this PR), but noting here for future reference:

  • High: src/storage/table/ice_disk_node_table.cpp:306 assigns nodeID.offset = rowIndex, where rowIndex is only the row’s position within the currently assigned Parquet row group. With multiple row groups, every row group starts offsets at 0, producing duplicate node IDs and breaking joins/relationship traversals.
    This needs a row-group base offset from Parquet metadata added to the local row index.
  • High: src/storage/table/ice_disk_rel_table.cpp:311 drops relationships for later bound nodes in the same row group. When the scan sees a second boundOffset, it breaks, then scanInternalByRowGroups increments currentRowGroup, so the rest of that row group is never revisited. A query with a batch containing multiple source nodes can return only the first node’s relationships.

Guess I have to resurrect #429 😓

Development

Successfully merging this pull request may close these issues:

Feature: Add Icebug-disk spec impl

2 participants