
refactor parquet -> icebug-disk #476

Merged
adsharma merged 10 commits into LadybugDB:main from aheev:refactor-parquet
May 13, 2026

Conversation

@aheev (Contributor) commented May 10, 2026

Apart from renames:

  • added checks for mixed type tables
  • added version checks
  • updated storage = format
  • fixed StorageManager defaulting to icebug-disk
  • added doc

Closes #469

@aheev (Contributor, Author) commented May 10, 2026

dataset PR: LadybugDB/dataset#1

@aheev aheev force-pushed the refactor-parquet branch 2 times, most recently from 5a77447 to e63c9b7 on May 11, 2026 04:40
@aheev (Contributor, Author) commented May 11, 2026

@adsharma could you PTAL? The minimal test failure can be fixed by merging the dataset PR.

@aheev (Contributor, Author) commented May 11, 2026

Ran the tests added in this PR locally via the CLI too. Also, lbug -i schema.cypher works fine.

@adsharma (Contributor) commented

minimal test failure can be fixed by merging dataset LadybugDB/dataset#1

Could you add a commit to reset the submodule pointer? Tests should pass then. Also see my comment about making things work when schema.cypher is not in the same dir.

@adsharma (Contributor) commented

(storage = 'icebug-disk:<path-to-dir>');

This doesn't leave room for saying icebug-disk where the path is s3://server.com/path-to-dir, which the VFS supports (but not tested in CI because we don't have testable s3 urls).

How about:

(storage = '<any valid url>', format = 'icebug-disk')

The format could be guessed/defaulted based on the file extension in the URL. This is consistent with how DuckDB does EXPORT/IMPORT.
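The extension-based guessing suggested above could be sketched like this. This is a hypothetical illustration, not LadybugDB's implementation; the mapping table and the default fallback are assumptions.

```python
# Hypothetical sketch: infer a storage format from the URL's file extension,
# falling back to a default for extensionless (directory-style) URLs.
from urllib.parse import urlparse
from os.path import splitext

EXTENSION_FORMATS = {".parquet": "parquet", ".csv": "csv"}  # illustrative mapping

def guess_format(storage_url: str, default: str = "icebug-disk") -> str:
    path = urlparse(storage_url).path           # handles s3://, file://, plain paths
    ext = splitext(path)[1].lower()
    return EXTENSION_FORMATS.get(ext, default)  # directories fall back to default

print(guess_format("s3://bucket/export/nodes.parquet"))  # parquet
print(guess_format("s3://bucket/foo/bar"))               # icebug-disk
```

Under this scheme an explicit format = '...' option would always win over the guess, matching how DuckDB lets FORMAT override the extension in EXPORT/IMPORT.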

I haven't reviewed the rest of the PR. But think it's directionally correct.

@aheev (Contributor, Author) commented May 12, 2026

(storage = 'icebug-disk:<path-to-dir>');

This doesn't leave room for saying icebug-disk where the path is s3://server.com/path-to-dir, which the VFS supports (but not tested in CI because we don't have testable s3 urls).

The path is parsed out by strictly checking for the icebug-disk: prefix.
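The two option styles under discussion can be contrasted with a small sketch. Function names and the "native" fallback are illustrative assumptions, not the actual LadybugDB code.

```python
# Hypothetical sketch of the two styles discussed in this thread.

def parse_prefixed(storage: str) -> tuple[str, str]:
    """Style 1: format fused into the storage string, 'icebug-disk:<path>'."""
    prefix = "icebug-disk:"
    if storage.startswith(prefix):
        return "icebug-disk", storage[len(prefix):]
    return "native", storage  # assumed default format name

def parse_separate(storage: str, fmt: str = "native") -> tuple[str, str]:
    """Style 2: storage is any URL; the format is its own option."""
    return fmt, storage

# The prefix style does still yield a usable s3 URL after the split...
print(parse_prefixed("icebug-disk:s3://bucket/foo"))
# ...but the separate-option style keeps storage parsable by any URL library,
# with no project-specific prefix to strip first.
print(parse_separate("s3://bucket/foo", "icebug-disk"))
```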

How about:

(storage = '<any valid url>', format = 'icebug-disk')

all in for separate dir path and format options

The format could be guessed/defaulted based on the file extension in the URL. This is consistent with how DuckDB does EXPORT/IMPORT.

storage is a dir path, right? Also, defaulting to icebug-disk whenever storage is a dir would block any future formats from using the storage option.

@aheev aheev force-pushed the refactor-parquet branch from 5a810b8 to 1007994 on May 12, 2026 11:36
@adsharma (Contributor) commented
LGTM except for this item:

all in for separate dir path and format options

Looks like you haven't implemented it yet? A test case for:

(storage = 's3://bucket/foo/bar', format = 'icebug-disk')

and we should be good to go. Also format = 'bad' should fail.

@aheev (Contributor, Author) commented May 12, 2026

LGTM except for this item:

all in for separate dir path and format options

Looks like you haven't implemented it yet? A test case for:

I was waiting for your signal 🙃

@aheev (Contributor, Author) commented May 12, 2026

(storage = 's3://bucket/foo/bar', format = 'icebug-disk')

This should fail, right?

@adsharma (Contributor) commented

It's the main use case for icebug-disk. I expect a dir on remote object storage will be a common case. I couldn't tell if you agree with separating format and path logically by string-splitting on icebug-disk: or via a separate format = x option.

lbug> CREATE NODE TABLE city(id INT32, name STRING, population INT64, PRIMARY KEY(id)) WITH (storage = 'icebug-disk:s3://bucket/foo', format = 'icedisk');

...
Error: IO exception: Cannot open file s3:/bucket/foo/nodes_city.parquet: No such file or directory

@adsharma (Contributor) commented

I was waiting for your signal

Missed this comment. I like the separate format = 'x' option, so storage is parsable with any URL parsing lib.

@aheev (Contributor, Author) commented May 12, 2026

It's the main use case for icebug-disk. I expect a dir on remote object storage will be a common case. I couldn't tell if you agree with separating format and path logically by string-splitting on icebug-disk: or via a separate format = x option.

lbug> CREATE NODE TABLE city(id INT32, name STRING, population INT64, PRIMARY KEY(id)) WITH (storage = 'icebug-disk:s3://bucket/foo', format = 'icedisk');

...
Error: IO exception: Cannot open file s3:/bucket/foo/nodes_city.parquet: No such file or directory

separate format option

@aheev (Contributor, Author) commented May 12, 2026

new dataset PR: LadybugDB/dataset#2

@aheev aheev force-pushed the refactor-parquet branch from d6601f4 to 3b089ce on May 13, 2026 02:12
@aheev (Contributor, Author) commented May 13, 2026

@adsharma could you PTAL?

@adsharma (Contributor) left a review

Two minor issues - suggest making a small PR post merge.

Two major pre-existing issues (out of scope for this PR), but noting here for future reference:

  • High: src/storage/table/ice_disk_node_table.cpp:306 assigns nodeID.offset = rowIndex, where rowIndex is only the row’s position within the currently assigned Parquet row group. With multiple row groups, every row group starts offsets at 0, producing duplicate node IDs and breaking joins/relationship traversals.
    This needs a row-group base offset from Parquet metadata added to the local row index.
  • High: src/storage/table/ice_disk_rel_table.cpp:311 drops relationships for later bound nodes in the same row group. When the scan sees a second boundOffset, it breaks, then scanInternalByRowGroups increments currentRowGroup, so the rest of that row group is never revisited. A query with a batch containing multiple source nodes can return only the first node’s relationships.
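The fix the first issue calls for — adding a row-group base offset to the local row index — can be sketched in a few lines. This is an illustrative model only, with assumed names; the actual fix lives in the C++ table code and would read the per-row-group sizes from Parquet metadata.

```python
# Hypothetical sketch: global node offset = base offset of the row group
# (cumulative rows before it, from Parquet metadata) + local row index.
from itertools import accumulate

def row_group_base_offsets(row_group_sizes: list[int]) -> list[int]:
    """Base offset of each row group = total rows in all earlier row groups."""
    return [0] + list(accumulate(row_group_sizes))[:-1]

def global_offset(row_group: int, row_index: int, bases: list[int]) -> int:
    return bases[row_group] + row_index

# Two row groups of 100 and 50 rows: without the base offset, row 0 of
# group 1 would collide with row 0 of group 0 and produce duplicate node IDs.
bases = row_group_base_offsets([100, 50])
print(bases)                       # [0, 100]
print(global_offset(1, 0, bases))  # 100, not a duplicate 0
```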

Comment threads: src/include/storage/table/ice_disk_utils.h and src/include/common/constants.h
@adsharma adsharma merged commit 37143c9 into LadybugDB:main May 13, 2026
4 checks passed
@aheev aheev deleted the refactor-parquet branch May 13, 2026 04:49
@aheev (Contributor, Author) commented May 13, 2026

Two minor issues - suggest making a small PR post merge.

Two major pre-existing issues (out of scope for this PR), but noting here for future reference:

  • High: src/storage/table/ice_disk_node_table.cpp:306 assigns nodeID.offset = rowIndex, where rowIndex is only the row’s position within the currently assigned Parquet row group. With multiple row groups, every row group starts offsets at 0, producing duplicate node IDs and breaking joins/relationship traversals.
    This needs a row-group base offset from Parquet metadata added to the local row index.
  • High: src/storage/table/ice_disk_rel_table.cpp:311 drops relationships for later bound nodes in the same row group. When the scan sees a second boundOffset, it breaks, then scanInternalByRowGroups increments currentRowGroup, so the rest of that row group is never revisited. A query with a batch containing multiple source nodes can return only the first node’s relationships.

Guess I have to resurrect #429 😓

Development

Successfully merging this pull request may close these issues:

Feature: Add Icebug-disk spec impl

2 participants