Drop defensive copies of freshly-allocated nodata-mask arrays#1558
Merged
brendancol merged 2 commits intomainfrom May 11, 2026
Merged
Drop defensive copies of freshly-allocated nodata-mask arrays#1558brendancol merged 2 commits intomainfrom
brendancol merged 2 commits intomainfrom
Conversation
Two ``arr = arr.copy()`` calls on the read path duplicated a multi-MB array right before an in-place sentinel-to-NaN rewrite. The arrays were freshly allocated by ``_read_tiles`` / ``_read_strips`` / ``_fetch_decode_cog_http_tiles`` and uniquely owned at that point, so the copies were pure peak-memory waste. Removed: - ``_geotiff_to_xarray`` (was ~line 518): eager-read nodata path. - ``_delayed_read_window`` (was ~line 1448): dask-chunked read. Kept (with a comment explaining why): - ``to_geotiff`` (~line 990): ``arr`` may be a view of a caller-owned numpy buffer (``np.asarray(raw)`` or ``np.moveaxis``). - ``_write_single_tile`` (~line 1050): same shape; ``arr`` may alias caller data through ``np.asarray(chunk_data)``. Adds ``test_nodata_no_extra_copy_1553.py`` covering float32 strip, float32 tiled, uint16 tiled, dask-chunked float, dask-chunked uint16, repeat reads (independence), and a writer-no-mutate guard for the defensive copies kept on the write path.
Contributor
There was a problem hiding this comment.
Pull request overview
This PR reduces peak memory usage on the GeoTIFF read path by removing two redundant defensive arr.copy() calls immediately before in-place nodata sentinel → NaN rewrites, while explicitly documenting why equivalent copies must remain on the write path to avoid mutating caller-owned buffers. It also adds a targeted regression test suite for the nodata rewrite behavior.
Changes:
- Removed two unnecessary
arr.copy()calls on the read path before nodata masking (eager read + dask windowed reads). - Kept the two write-path defensive copies and added comments explaining their necessity to prevent caller-buffer mutation.
- Added a new regression test module covering nodata sentinel handling across strip/tiled reads and dask-chunked reads, plus a writer non-mutation guard.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
xrspatial/geotiff/__init__.py |
Drops read-path copies before in-place nodata masking; adds rationale comments for retained write-path copies. |
xrspatial/geotiff/tests/test_nodata_no_extra_copy_1553.py |
Adds regression coverage for nodata masking (strip/tiled + dask) and a writer no-mutation guard. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Add test_write_single_tile_does_not_mutate_caller_input to cover the second kept defensive copy (the .vrt tiled-output path). Calls _write_single_tile directly with a numpy buffer + integer nodata so the NaN -> sentinel rewrite at line 1065 fires, then asserts the source array still has its NaNs in the original positions. Verified with a mutation test: dropping the .copy() at line 1065 makes the test fail loudly (sentinel stamped over the source's NaN cells), restoring it makes it pass. Real regression guard, not vacuous. Updated the test_writer_does_not_mutate_caller_input docstring to reflect what each test now covers.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
arr = arr.copy()calls on the GeoTIFF read path. They duplicated a freshly-allocated multi-MB array right before an in-place sentinel-to-NaN rewrite, doubling peak memory for nodata-bearing reads.arr = arr.copy()calls on the write path. The source array there can be a view of a caller-owned numpy buffer, and mutating without a copy would corrupt user input. Each kept call now has a comment explaining why.Closes #1553.
Sites changed
_geotiff_to_xarrayxrspatial/geotiff/__init__.py:518arris freshly allocated by_read_tiles/_read_strips; orientation handler returnsnp.ascontiguousarray(...); no other reference held._delayed_read_windowxrspatial/geotiff/__init__.py:1448arrcomes from_fetch_decode_cog_http_tilesorread_to_array; both freshly allocate.to_geotiffxrspatial/geotiff/__init__.py:990arrmay benp.asarray(raw)ornp.moveaxis(...)of a caller's numpy DataArray (a view). Mutating it would corrupt user data._write_single_tilexrspatial/geotiff/__init__.py:1050arr = np.asarray(chunk_data)may alias caller-owned data.Test plan
pytest xrspatial/geotiff/tests/test_nodata_no_extra_copy_1553.py -v-> 7 passed.