Boolean indexing improvements #1923
Conversation
Deleted rendered PR docs from intelpython.github.com/dpctl, latest should be updated shortly. 🤞
Array API standard conformance tests for dpctl=0.19.0dev0=py310hdf72452_296 ran successfully.
Force-pushed from f9abe3e to a5b03ca
Array API standard conformance tests for dpctl=0.19.0dev0=py310hdf72452_302 ran successfully.
Array API standard conformance tests for dpctl=0.19.0dev0=py310h93fe807_313 ran successfully.
Incidentally, a nice improvement from the latest commit is the acceleration of cumulative sum: whereas in 0.18.3:
ndgrigorian left a comment
LGTM! This brings a nice performance boost.
1. Use shared local memory to optimize access to neighboring elements of cumulative sums.
2. Introduce a contig variant for the masked_extract code.
3. Remove the unused orthog_nelems functor argument and add a local_accessor argument instead.

The example

```
import dpctl.tensor as dpt

x = dpt.ones(20241024, dtype='f4')
m = dpt.ones(x.size, dtype='b1')
%time x[m]
```

decreased from 41 ms on an Iris Xe WSL box to 37 ms.
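For intuition, here is a hypothetical pure-Python sketch (not the dpctl SYCL kernel) of how masked extraction uses neighboring cumulative-sum values: each element's destination is determined by two adjacent entries of the mask's cumulative sum, which is exactly the data the kernel stages in shared local memory.

```
import numpy as np

def masked_extract(x, m):
    # cumulative_sum[i] counts the selected elements among x[0:i]
    cumulative_sum = np.concatenate(([0], np.cumsum(m.astype(np.int64))))
    out = np.empty(cumulative_sum[-1], dtype=x.dtype)
    for i in range(x.size):
        # each "work-item" reads two neighboring cumulative-sum values;
        # in the kernel these reads are served from shared local memory
        lo, hi = cumulative_sum[i], cumulative_sum[i + 1]
        if hi != lo:
            out[lo] = x[i]
    return out

x = np.arange(8, dtype=np.float32)
m = np.array([1, 0, 1, 1, 0, 0, 1, 0], dtype=bool)
print(masked_extract(x, m))  # [0. 2. 3. 6.]
```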
Use local_accessor to improve memory bandwidth of the work-group.
Use shared local memory to improve global memory bandwidth.
Also implement get_lws to choose the local work-group size from given choices I0 > I1 > I2 > ...: if n > I0, use I0; otherwise, if n > I1, use I1; and so on.
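A minimal sketch of that selection rule (the candidate sizes below are placeholders, not necessarily the values used in dpctl):

```
def get_lws(n, choices=(256, 128, 64, 32)):
    # choices are ordered I0 > I1 > I2 > ...; pick the largest
    # candidate that n exceeds, falling back to the smallest one
    for lws in choices:
        if n > lws:
            return lws
    return choices[-1]

print(get_lws(1000))  # 256
print(get_lws(100))   # 64
print(get_lws(8))     # 32
```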
The chunk update kernels processed consecutive elements in contiguous memory, hence the sub-group memory access pattern was sub-optimal (no coalescing). This PR changes these kernels to process n_wi elements which are a sub-group size apart, improving the memory access pattern.

Running a micro-benchmark based on code from gh-1249 (for shape = (n, n) where n = 4096) with this change:

```
(dev_dpctl) opavlyk@mtl-world:~/repos/dpctl$ ONEAPI_DEVICE_SELECTOR=cuda:gpu python index.py
0.010703916665753004
(dev_dpctl) opavlyk@mtl-world:~/repos/dpctl$ ONEAPI_DEVICE_SELECTOR=cuda:gpu python index.py
0.01079747307597211
(dev_dpctl) opavlyk@mtl-world:~/repos/dpctl$ ONEAPI_DEVICE_SELECTOR=cuda:gpu python index.py
0.010864820314088353
(dev_dpctl) opavlyk@mtl-world:~/repos/dpctl$ python index.py
0.023878061203975922
(dev_dpctl) opavlyk@mtl-world:~/repos/dpctl$ python index.py
0.023666468500677083
```

while before:

```
(dev_dpctl) opavlyk@mtl-world:~/repos/dpctl$ ONEAPI_DEVICE_SELECTOR=cuda:gpu python index.py
0.011415911812542213
(dev_dpctl) opavlyk@mtl-world:~/repos/dpctl$ ONEAPI_DEVICE_SELECTOR=cuda:gpu python index.py
0.011722088705196424
(dev_dpctl) opavlyk@mtl-world:~/repos/dpctl$ ONEAPI_DEVICE_SELECTOR=level_zero:gpu python index.py
0.030126182353813893
(dev_dpctl) opavlyk@mtl-world:~/repos/dpctl$ ONEAPI_DEVICE_SELECTOR=level_zero:gpu python index.py
0.030459783371986338
```

Running the same code using NumPy (same size):

```
(dev_dpctl) opavlyk@mtl-world:~/repos/dpctl$ python index_np.py
0.01416253090698134
(dev_dpctl) opavlyk@mtl-world:~/repos/dpctl$ python index_np.py
0.014979530811413296
```

The reason the Level-Zero device is slower has to do with a slow allocation/deallocation bug. The OpenCL device has better timing. With this change:

```
(dev_dpctl) opavlyk@mtl-world:~/repos/dpctl$ ONEAPI_DEVICE_SELECTOR=opencl:gpu python index.py
0.015038836885381627
(dev_dpctl) opavlyk@mtl-world:~/repos/dpctl$ ONEAPI_DEVICE_SELECTOR=opencl:gpu python index.py
0.01527448468496678
```

before:

```
(dev_dpctl) opavlyk@mtl-world:~/repos/dpctl$ ONEAPI_DEVICE_SELECTOR=opencl:gpu python index.py
0.01758851639115838
(dev_dpctl) opavlyk@mtl-world:~/repos/dpctl$ ONEAPI_DEVICE_SELECTOR=opencl:gpu python index.py
0.017089676241286926
```
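To make the access-pattern change concrete, here is a hedged, index-only sketch (sg_base, lane and sg_size are illustrative names, not the kernel's actual variables) contrasting the two layouts:

```
def indices_consecutive(sg_base, lane, n_wi, sg_size):
    # old pattern: each work-item walks n_wi consecutive elements,
    # so the lanes of a sub-group touch addresses n_wi apart
    start = sg_base + lane * n_wi
    return [start + k for k in range(n_wi)]

def indices_strided(sg_base, lane, n_wi, sg_size):
    # new pattern: each work-item processes n_wi elements that are a
    # sub-group size apart, so at every step the lanes of a sub-group
    # touch consecutive addresses (coalesced loads/stores)
    return [sg_base + lane + k * sg_size for k in range(n_wi)]

# with sg_size = 8 and n_wi = 4, lane 1 of the first sub-group touches:
print(indices_consecutive(0, 1, 4, 8))  # [4, 5, 6, 7]
print(indices_strided(0, 1, 4, 8))      # [1, 9, 17, 25]
```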
Changed left-over update kernel to use coalesced memory access.
…ed extract/place code
Force-pushed from f365bba to a8e7600
Array API standard conformance tests for dpctl=0.19.0dev0=py310h93fe807_319 ran successfully.
Array API standard conformance tests for dpctl=0.19.0dev0=py310h93fe807_321 ran successfully.
Array API conformance test failures are unrelated. Merging.
This PR is motivated by gh-1249. It changes the masked extract, masked place and nonzero kernels to reduce global memory bandwidth usage through the use of shared local memory.

Kernels are changed from being range-based to being nd_range-based. Work-items of a work-group collectively load `lws + 1` elements of `cumulative_sum` values into a local accessor. Each work-item reads two values of this cumulative sum, and reading them from SLM reduces the number of GM accesses by a factor of 2.

For masked extract, which is called in gh-1249, this PR introduces a contiguous `src` array specialization to improve performance, since `x_2d[m_2d]` runs slower than `x_2d_flat[m_2d_flat]`.
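For reference, a rough reconstruction of that comparison based on the gh-1249 benchmark shape (the 4096x4096 size is taken from the micro-benchmark above; time each indexing expression with %timeit in IPython to reproduce the comparison):

```
import dpctl.tensor as dpt

n = 4096
x_2d = dpt.ones((n, n), dtype="f4")
m_2d = dpt.ones((n, n), dtype="b1")

# flattened views of the same data; before the contiguous-src
# specialization, the flat variant was noticeably faster
x_2d_flat = dpt.reshape(x_2d, (n * n,))
m_2d_flat = dpt.reshape(m_2d, (n * n,))

r_strided = x_2d[m_2d]            # 2-D masked extract
r_contig = x_2d_flat[m_2d_flat]   # contiguous masked extract
```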