feat(sedona-raster-zarr): cloud storage backends (s3, gcs, azure, http) via object_store#888
Draft
james-willis wants to merge 1 commit into
Draft
feat(sedona-raster-zarr): cloud storage backends (s3, gcs, azure, http) via object_store#888james-willis wants to merge 1 commit into
james-willis wants to merge 1 commit into
Conversation
5af7c08 to
946e6a4
Compare
946e6a4 to
4d3de71
Compare
…ect_store
ZarrChunkReader now accepts file://, bare paths, s3://, gs:// / gcs://,
az:// / abfs:// / abfss://, and http:// / https://. The reader's
public API is pure-storage-in: hand it an
Arc<dyn AsyncReadableListableStorageTraits> and a group URI (for
chunk-anchor URI formatting) and it walks the chunk grid. Where the
storage came from — registry lookup, env-var-built builder, an
in-memory test fixture — is the caller's concern, not the reader's.
* `open_storage_from_uri(uri, store_override: Option<Arc<dyn ObjectStore>>)`
is the helper for callers that don't already hold a credentialed
store. `Some(store)` uses it directly (with PrefixStore for
s3/gs/az; http(s) is rooted at the URL by its builder). `None`
falls back to per-scheme *Builder::from_env().with_url(uri).build()
— the same env-var credential discovery that read_format's
orchestrated path gets via ensure_object_store_registered_with_options.
file:// and bare paths always go through zarrs_filesystem's
FilesystemStore wrapped in SyncToAsyncStorageAdapter, so the local
backend shows up on the same async storage surface.
* ZarrChunkReader::try_new is async; the sync RecordBatchReader
streaming surface is unchanged since next() is pure CPU.
* No dedicated tokio runtime; async tasks run on whatever runtime
the caller is on (DataFusion's executor for the future SQL UDTF,
an ad-hoc current-thread runtime block_on'd in PyZarrChunkReader::new
for the Python FFI, #[tokio::test] in tests).
* PyZarrChunkReader::new builds storage via open_storage_from_uri
(env-var credentials by default), then drives the async try_new
through a local current-thread tokio runtime. No synthetic
RuntimeEnv, no datafusion-execution dep on the FFI cdylib.
* Replaces Group::child_arrays (which opens every child up front and
errors hard on any per-array metadata failure) with two purpose-
built paths driven by the caller's arrays filter:
- arrays: Some([...]) opens each by name with Array::async_open.
No listing — usable against backends that can't list (plain HTTPS
without WebDAV, S3-via-HttpStore).
- arrays: None lists direct children with storage.list_dir, then
Array::async_open each. Per-array open failures are logged at
warn! and skipped, so a single malformed sibling (e.g. an
xarray-style fixed-length-Unicode coord variable with a null
fill_value that zarrs 0.23 can't open) no longer poisons the
rest of the group.
* zarrs_object_store pinned to 0.5 in the workspace; 0.6 depends on
object_store 0.13, semver-incompatible with DataFusion 52's
object_store 0.12.x.
* Cloud smoke tests pass strictly against the public anonymous
ITS_LIVE v2 ice-velocity datacubes (s3://its-live-data/...). Same
bucket via s3:// (AmazonS3Builder) and via the virtual-hosted
HTTPS URL (HttpStore), with an explicit M11/M12 filter to avoid
listing on the HTTPS path.
4d3de71 to
3fbe051
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Adds cloud storage backends to
sedona-raster-zarr.ZarrChunkReadernow acceptsfile://, bare paths,s3://,gs:///gcs://,az:///abfs:///abfss://, andhttp:///https://.All Async
This package has been migrated to use all async for record loading and byte retrieval. This allows a single approach (cloud was always going to be async, but file system was originally sync) without shimming sync fs code into the async RS_EnsureLoaded udf and async cloud access code into the sync record loading machinery.
Version pinning
zarrs_object_storeis pinned to0.5in the workspace.0.6(the latest) depends onobject_store 0.13, which is semver-incompatible with theobject_store 0.12.xthe rest of the workspace uses through DataFusion 52.