Skip to content

Drop defensive copies of freshly-allocated nodata-mask arrays#1558

Merged
brendancol merged 2 commits intomainfrom
issue-1553
May 11, 2026
Merged

Drop defensive copies of freshly-allocated nodata-mask arrays#1558
brendancol merged 2 commits intomainfrom
issue-1553

Conversation

@brendancol
Copy link
Copy Markdown
Contributor

Summary

  • Drops two unnecessary arr = arr.copy() calls on the GeoTIFF read path. They duplicated a freshly-allocated multi-MB array right before an in-place sentinel-to-NaN rewrite, doubling peak memory for nodata-bearing reads.
  • Keeps the two arr = arr.copy() calls on the write path. The source array there can be a view of a caller-owned numpy buffer, and mutating without a copy would corrupt user input. Each kept call now has a comment explaining why.
  • Adds a regression test file covering both fixed sites and a guard for the kept write-side copies.

Closes #1553.

Sites changed

Site File:line (pre-edit) Action Reason
_geotiff_to_xarray xrspatial/geotiff/__init__.py:518 Drop copy arr is freshly allocated by _read_tiles/_read_strips; orientation handler returns np.ascontiguousarray(...); no other reference held.
_delayed_read_window xrspatial/geotiff/__init__.py:1448 Drop copy arr comes from _fetch_decode_cog_http_tiles or read_to_array; both freshly allocate.
to_geotiff xrspatial/geotiff/__init__.py:990 Keep copy + comment arr may be np.asarray(raw) or np.moveaxis(...) of a caller's numpy DataArray (a view). Mutating it would corrupt user data.
_write_single_tile xrspatial/geotiff/__init__.py:1050 Keep copy + comment arr = np.asarray(chunk_data) may alias caller-owned data.

Test plan

  • New regression suite passes: pytest xrspatial/geotiff/tests/test_nodata_no_extra_copy_1553.py -v -> 7 passed.
  • Full geotiff suite passes (deselecting 3 pre-existing matplotlib/Python 3.14 deepcopy failures unrelated to this PR): 943 passed, 6 skipped, 3 deselected.
  • Tests cover float32 strip, float32 tiled, uint16 tiled, dask-chunked float, dask-chunked uint16, repeat-read independence, and a writer-no-mutate guard.

Two ``arr = arr.copy()`` calls on the read path duplicated a
multi-MB array right before an in-place sentinel-to-NaN rewrite.
The arrays were freshly allocated by ``_read_tiles`` /
``_read_strips`` / ``_fetch_decode_cog_http_tiles`` and uniquely
owned at that point, so the copies were pure peak-memory waste.

Removed:
- ``_geotiff_to_xarray`` (was ~line 518): eager-read nodata path.
- ``_delayed_read_window`` (was ~line 1448): dask-chunked read.

Kept (with a comment explaining why):
- ``to_geotiff`` (~line 990): ``arr`` may be a view of a caller-owned
  numpy buffer (``np.asarray(raw)`` or ``np.moveaxis``).
- ``_write_single_tile`` (~line 1050): same shape; ``arr`` may alias
  caller data through ``np.asarray(chunk_data)``.

Adds ``test_nodata_no_extra_copy_1553.py`` covering float32 strip,
float32 tiled, uint16 tiled, dask-chunked float, dask-chunked uint16,
repeat reads (independence), and a writer-no-mutate guard for the
defensive copies kept on the write path.
@github-actions github-actions Bot added the performance PR touches performance-sensitive code label May 10, 2026
@brendancol brendancol requested a review from Copilot May 11, 2026 03:23
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR reduces peak memory usage on the GeoTIFF read path by removing two redundant defensive arr.copy() calls immediately before in-place nodata sentinel → NaN rewrites, while explicitly documenting why equivalent copies must remain on the write path to avoid mutating caller-owned buffers. It also adds a targeted regression test suite for the nodata rewrite behavior.

Changes:

  • Removed two unnecessary arr.copy() calls on the read path before nodata masking (eager read + dask windowed reads).
  • Kept the two write-path defensive copies and added comments explaining their necessity to prevent caller-buffer mutation.
  • Added a new regression test module covering nodata sentinel handling across strip/tiled reads and dask-chunked reads, plus a writer non-mutation guard.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File Description
xrspatial/geotiff/__init__.py Drops read-path copies before in-place nodata masking; adds rationale comments for retained write-path copies.
xrspatial/geotiff/tests/test_nodata_no_extra_copy_1553.py Adds regression coverage for nodata masking (strip/tiled + dask) and a writer no-mutation guard.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread xrspatial/geotiff/tests/test_nodata_no_extra_copy_1553.py
Add test_write_single_tile_does_not_mutate_caller_input to cover the
second kept defensive copy (the .vrt tiled-output path). Calls
_write_single_tile directly with a numpy buffer + integer nodata so
the NaN -> sentinel rewrite at line 1065 fires, then asserts the
source array still has its NaNs in the original positions.

Verified with a mutation test: dropping the .copy() at line 1065 makes
the test fail loudly (sentinel stamped over the source's NaN cells),
restoring it makes it pass. Real regression guard, not vacuous.

Updated the test_writer_does_not_mutate_caller_input docstring to
reflect what each test now covers.
@brendancol brendancol merged commit f157746 into main May 11, 2026
10 of 11 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

performance PR touches performance-sensitive code

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Drop defensive copies of freshly-allocated nodata-mask arrays

2 participants