Vectorize mode-resampling for COG overview generation #1526

Merged
brendancol merged 1 commit into xarray-contrib:main from brendancol:perf/mode-overview-vectorize
May 9, 2026

@brendancol
Contributor

Summary

  • Replaces the per-pixel np.unique double loop in _block_reduce_2d(method='mode') with a vectorized sort-and-count over the (oh, ow, 4) block tensor.
  • Measured on a 1024x1024 uint8 input: 1008 ms (prior) -> 20 ms (new), about 50x faster (locally reproduced as 1037 ms -> 27 ms, ~39x).
  • Output is bit-exact identical to the prior implementation, verified by a reference copy of the old loop in the new test file.
  • Tie-break semantics preserved: when two values appear equally often the smaller value wins, because sorting groups equal values and np.argmax returns the leftmost max-count index.

Implementation

After reshaping each 2x2 block to a row of 4 cells:

  1. Sort along the last axis so equal values are contiguous.
  2. For each of the 4 positions, count cells equal to it (small fixed loop of length 4, each iteration vectorized over oh * ow).
  3. np.argmax picks the leftmost position with the highest count, which after sorting is the smallest tied value.
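The three steps above can be sketched roughly as follows. This is a minimal illustration, not the code merged into `_writer.py`: the function name `mode_reduce_2x2` is hypothetical, and the trimming of odd trailing rows and columns is an assumption about how non-even shapes are handled.

```python
import numpy as np


def mode_reduce_2x2(arr: np.ndarray) -> np.ndarray:
    """Downsample a 2D array by 2 using the per-block mode (ties -> smallest).

    Hypothetical helper for illustration only.
    """
    # Assumption: drop an odd trailing row/column so the array tiles into 2x2 blocks.
    h2 = (arr.shape[0] // 2) * 2
    w2 = (arr.shape[1] // 2) * 2
    oh, ow = h2 // 2, w2 // 2

    # (oh, 2, ow, 2) -> (oh, ow, 2, 2) -> (oh, ow, 4): one row of 4 cells per block.
    blocks = (
        arr[:h2, :w2]
        .reshape(oh, 2, ow, 2)
        .transpose(0, 2, 1, 3)
        .reshape(oh, ow, 4)
    )

    # Step 1: sort so equal values are contiguous and in ascending order.
    s = np.sort(blocks, axis=-1)

    # Step 2: for each of the 4 positions, count the cells equal to it.
    # The Python loop has fixed length 4; each iteration is vectorized over oh * ow.
    counts = np.empty((oh, ow, 4), dtype=np.uint8)
    for k in range(4):
        counts[..., k] = (s == s[..., k:k + 1]).sum(axis=-1)

    # Step 3: argmax returns the first maximum, which after sorting is the
    # smallest of the tied values.
    best = np.argmax(counts, axis=-1)
    return np.take_along_axis(s, best[..., None], axis=-1)[..., 0]


example = np.array(
    [
        [1, 2, 5, 5],
        [2, 1, 5, 7],
        [3, 3, 9, 8],
        [3, 3, 7, 6],
    ],
    dtype=np.uint8,
)
print(mode_reduce_2x2(example))
# [[1 5]
#  [3 6]]
```

The top-left block is a two-way tie between 1 and 2 and resolves to 1; the bottom-right block has four distinct values and resolves to the smallest, 6.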

Test plan

  • pytest xrspatial/geotiff/tests/test_mode_overview_perf.py -x -q (48 passed)
  • pytest xrspatial/geotiff/tests/test_cog.py xrspatial/geotiff/tests/test_sparse_cog.py -x -q (30 passed)
  • Bit-exact match against the prior reference for random uint8/uint16/int16/int32/uint32/int64 inputs at sizes including 17x19, 100x101, 64x65.
  • Hand-crafted tie-break cases (two-way tie, three-way tie, three-of-a-kind, all-same).
  • Sanity guard: 1024x1024 uint8 path completes in under 100 ms.
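A bit-exact cross-check of the kind described in the test plan might look like the sketch below. Everything here is illustrative: `mode_reference`, `mode_vectorized`, and `check_bit_exact` are hypothetical names, the `np.unique` loop only stands in for the prior implementation, and the value range 0..7 is an assumed choice to make ties frequent.

```python
import numpy as np


def mode_reference(arr):
    """Per-block np.unique loop (slow); a stand-in for the prior behaviour."""
    oh, ow = arr.shape[0] // 2, arr.shape[1] // 2
    out = np.empty((oh, ow), dtype=arr.dtype)
    for i in range(oh):
        for j in range(ow):
            block = arr[2 * i:2 * i + 2, 2 * j:2 * j + 2]
            vals, counts = np.unique(block, return_counts=True)
            # np.unique returns sorted values, so argmax picks the smallest on ties.
            out[i, j] = vals[np.argmax(counts)]
    return out


def mode_vectorized(arr):
    """Compact sort-and-count version over the (oh, ow, 4) block tensor."""
    oh, ow = arr.shape[0] // 2, arr.shape[1] // 2
    blocks = arr[:2 * oh, :2 * ow].reshape(oh, 2, ow, 2).transpose(0, 2, 1, 3)
    s = np.sort(blocks.reshape(oh, ow, 4), axis=-1)
    counts = np.stack(
        [(s == s[..., k:k + 1]).sum(axis=-1) for k in range(4)], axis=-1
    )
    best = np.argmax(counts, axis=-1)
    return np.take_along_axis(s, best[..., None], axis=-1)[..., 0]


def check_bit_exact(seed=42):
    """Compare both paths on random inputs; returns the number of cases checked."""
    rng = np.random.default_rng(seed)
    checked = 0
    for dtype in (np.uint8, np.uint16, np.int16, np.int32, np.uint32, np.int64):
        for shape in ((17, 19), (100, 101), (64, 65)):
            # Assumed small categorical-style range so that ties occur often.
            arr = rng.integers(0, 8, size=shape, dtype=dtype)
            got, want = mode_vectorized(arr), mode_reference(arr)
            assert got.dtype == want.dtype
            assert np.array_equal(got, want)
            checked += 1
    return checked


print(check_bit_exact())
```

Comparing dtype as well as values matters here, since a silent upcast would still pass `np.array_equal` while changing the bytes written to the overview.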

Replace the per-pixel double loop in `_block_reduce_2d(method='mode')`
with a vectorized sort-and-count over the (oh, ow, 4) block tensor.
On a 1024x1024 uint8 input the reference implementation took ~1037 ms;
the vectorized path runs in ~27 ms (about 39x faster).

Output is bit-exact identical to the prior implementation. Tie-break
semantics ("lowest value wins" on equal counts) are preserved because
sorting brings equal values adjacent and `np.argmax` returns the
leftmost (smallest) position when counts tie.

Adds tests/test_mode_overview_perf.py with bit-exact comparison
against a copy of the old reference for randomized inputs across
uint8/uint16/int16/int32/uint32/int64 and odd dimensions, hand-crafted
tie-break cases, and a 100 ms sanity guard on a 1024^2 input.
@github-actions bot added the performance label (PR touches performance-sensitive code) on May 8, 2026
@brendancol requested a review from Copilot on May 8, 2026 22:11

Copilot AI left a comment

Pull request overview

This PR optimizes GeoTIFF/COG overview generation by replacing the previous per-block np.unique loop used for method='mode' downsampling with a vectorized NumPy approach, aiming to drastically reduce runtime while preserving the prior tie-break behavior.

Changes:

  • Replaced _block_reduce_2d(..., method='mode') implementation with a vectorized sort-and-count approach over 2x2 blocks.
  • Added correctness tests that compare output bit-for-bit against a reference implementation and cover key tie-break cases.
  • Added a performance-oriented test intended to guard against regressions in the mode-resampling path.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| xrspatial/geotiff/_writer.py | Implements the new vectorized mode resampling logic for 2x2 block reduction. |
| xrspatial/geotiff/tests/test_mode_overview_perf.py | Adds reference-based correctness tests, tie-break tests, and a runtime budget check for mode resampling. |

Comment on lines +106 to +117
def test_perf_under_100ms_on_1024sq_uint8():
    rng = np.random.default_rng(seed=0)
    arr = rng.integers(0, 16, size=(1024, 1024), dtype=np.uint8)
    # Warmup
    _block_reduce_2d(arr, 'mode')
    t0 = time.perf_counter()
    out = _block_reduce_2d(arr, 'mode')
    elapsed = time.perf_counter() - t0
    assert out.shape == (512, 512)
    assert elapsed < 0.1, (
        f"mode resampling took {elapsed*1000:.1f} ms (threshold 100 ms)"
    )

    rng = np.random.default_rng(seed=42)
    info = np.iinfo(dtype)
    # Use a small categorical-style range so ties happen often.
    lo = max(info.min, 0)
    h2 = (shape[0] // 2) * 2
    w2 = (shape[1] // 2) * 2
    if h2 == 0 or w2 == 0:
        return
@brendancol merged commit 5d06bda into xarray-contrib:main on May 9, 2026
15 checks passed