Boolean indexing improvements #1923
Conversation
Deleted rendered PR docs from intelpython.github.com/dpctl, latest should be updated shortly. 🤞
Array API standard conformance tests for dpctl=0.19.0dev0=py310hdf72452_296 ran successfully.
Force-pushed from f9abe3e to a5b03ca
Array API standard conformance tests for dpctl=0.19.0dev0=py310hdf72452_302 ran successfully.
Array API standard conformance tests for dpctl=0.19.0dev0=py310h93fe807_313 ran successfully.
Incidentally, a nice improvement from the latest commit is the acceleration of cumulative sum: whereas in 0.18.3:
ndgrigorian left a comment
LGTM! This brings a nice performance boost.
1. Use shared local memory to optimize access to neighboring elements of cumulative sums.
2. Introduce a contig variant for the masked_extract code.
3. Remove the unused orthog_nelems functor argument and add a local_accessor argument instead.

The example

```
import dpctl.tensor as dpt

x = dpt.ones(20241024, dtype='f4')
m = dpt.ones(x.size, dtype='b1')
%time x[m]
```

decreased from 41 ms on an Iris Xe WSL box to 37 ms.
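For intuition, here is a hypothetical pure-Python sketch (not the dpctl SYCL kernel) of how masked extraction uses neighboring cumulative-sum values: each element's destination is determined by two adjacent entries of the mask's cumulative sum, which is exactly the data the kernel stages in shared local memory.

```
import numpy as np

def masked_extract(x, m):
    # cumulative_sum[i] counts the selected elements among x[0:i]
    cumulative_sum = np.concatenate(([0], np.cumsum(m.astype(np.int64))))
    out = np.empty(cumulative_sum[-1], dtype=x.dtype)
    for i in range(x.size):
        # each "work-item" reads two neighboring cumulative-sum values;
        # in the kernel these reads are served from shared local memory
        lo, hi = cumulative_sum[i], cumulative_sum[i + 1]
        if hi != lo:
            out[lo] = x[i]
    return out

x = np.arange(8, dtype=np.float32)
m = np.array([1, 0, 1, 1, 0, 0, 1, 0], dtype=bool)
print(masked_extract(x, m))  # [0. 2. 3. 6.]
```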
Use local_accessor to improve memory bandwidth of the work-group.
Use shared local memory to improve global memory bandwidth.
Also implement get_lws to choose the local work-group size from given choices I0 > I1 > I2 > ...: if n > I0, use I0; otherwise, if n > I1, use I1; and so on.
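A minimal sketch of that selection rule (the candidate sizes below are placeholders, not necessarily the values used in dpctl):

```
def get_lws(n, choices=(256, 128, 64, 32)):
    # choices are ordered I0 > I1 > I2 > ...; pick the largest
    # candidate that n exceeds, falling back to the smallest one
    for lws in choices:
        if n > lws:
            return lws
    return choices[-1]

print(get_lws(1000))  # 256
print(get_lws(100))   # 64
print(get_lws(8))     # 32
```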
The chunk update kernels processed consecutive elements in contiguous memory, hence the sub-group memory access pattern was sub-optimal (no coalescing). This PR changes these kernels to process n_wi elements which are a sub-group size apart, improving the memory access pattern.

Running a micro-benchmark based on code from gh-1249 (for shape = (n, n) where n = 4096) with this change:

```
(dev_dpctl) opavlyk@mtl-world:~/repos/dpctl$ ONEAPI_DEVICE_SELECTOR=cuda:gpu python index.py
0.010703916665753004
(dev_dpctl) opavlyk@mtl-world:~/repos/dpctl$ ONEAPI_DEVICE_SELECTOR=cuda:gpu python index.py
0.01079747307597211
(dev_dpctl) opavlyk@mtl-world:~/repos/dpctl$ ONEAPI_DEVICE_SELECTOR=cuda:gpu python index.py
0.010864820314088353
(dev_dpctl) opavlyk@mtl-world:~/repos/dpctl$ python index.py
0.023878061203975922
(dev_dpctl) opavlyk@mtl-world:~/repos/dpctl$ python index.py
0.023666468500677083
```

while before:

```
(dev_dpctl) opavlyk@mtl-world:~/repos/dpctl$ ONEAPI_DEVICE_SELECTOR=cuda:gpu python index.py
0.011415911812542213
(dev_dpctl) opavlyk@mtl-world:~/repos/dpctl$ ONEAPI_DEVICE_SELECTOR=cuda:gpu python index.py
0.011722088705196424
(dev_dpctl) opavlyk@mtl-world:~/repos/dpctl$ ONEAPI_DEVICE_SELECTOR=level_zero:gpu python index.py
0.030126182353813893
(dev_dpctl) opavlyk@mtl-world:~/repos/dpctl$ ONEAPI_DEVICE_SELECTOR=level_zero:gpu python index.py
0.030459783371986338
```

Running the same code using NumPy (same size):

```
(dev_dpctl) opavlyk@mtl-world:~/repos/dpctl$ python index_np.py
0.01416253090698134
(dev_dpctl) opavlyk@mtl-world:~/repos/dpctl$ python index_np.py
0.014979530811413296
```

The reason the Level-Zero device is slower has to do with a slow allocation/deallocation bug. The OpenCL device has better timing. With this change:

```
(dev_dpctl) opavlyk@mtl-world:~/repos/dpctl$ ONEAPI_DEVICE_SELECTOR=opencl:gpu python index.py
0.015038836885381627
(dev_dpctl) opavlyk@mtl-world:~/repos/dpctl$ ONEAPI_DEVICE_SELECTOR=opencl:gpu python index.py
0.01527448468496678
```

before:

```
(dev_dpctl) opavlyk@mtl-world:~/repos/dpctl$ ONEAPI_DEVICE_SELECTOR=opencl:gpu python index.py
0.01758851639115838
(dev_dpctl) opavlyk@mtl-world:~/repos/dpctl$ ONEAPI_DEVICE_SELECTOR=opencl:gpu python index.py
0.017089676241286926
```
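To make the access-pattern change concrete, here is a hedged, index-only sketch (sg_base, lane and sg_size are illustrative names, not the kernel's actual variables) contrasting the two layouts:

```
def indices_consecutive(sg_base, lane, n_wi, sg_size):
    # old pattern: each work-item walks n_wi consecutive elements,
    # so the lanes of a sub-group touch addresses n_wi apart
    start = sg_base + lane * n_wi
    return [start + k for k in range(n_wi)]

def indices_strided(sg_base, lane, n_wi, sg_size):
    # new pattern: each work-item processes n_wi elements that are a
    # sub-group size apart, so at every step the lanes of a sub-group
    # touch consecutive addresses (coalesced loads/stores)
    return [sg_base + lane + k * sg_size for k in range(n_wi)]

# with sg_size = 8 and n_wi = 4, lane 1 of the first sub-group touches:
print(indices_consecutive(0, 1, 4, 8))  # [4, 5, 6, 7]
print(indices_strided(0, 1, 4, 8))      # [1, 9, 17, 25]
```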
Changed left-over update kernel to use coalesced memory access.
…ed extract/place code
Force-pushed from f365bba to a8e7600
Array API standard conformance tests for dpctl=0.19.0dev0=py310h93fe807_319 ran successfully.
Array API standard conformance tests for dpctl=0.19.0dev0=py310h93fe807_321 ran successfully.
Array API conformance test failures are unrelated. Merging.
This PR is motivated by gh-1249. It changes the masked extract, masked place and nonzero kernels to reduce global memory bandwidth usage through the use of shared local memory.

Kernels are changed from being range-based to being nd_range-based. Work-items of a work-group collectively load `lws + 1` elements of `cumulative_sum` values into a local accessor. Each work-item reads two values of this cumulative sum, and reading them from SLM reduces the number of GM accesses by a factor of 2.

For masked extract, which is called in gh-1249, this PR introduces a contiguous `src` array specialization to improve performance, since `x_2d[m_2d]` runs slower than `x_2d_flat[m_2d_flat]`.
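For reference, a rough reconstruction of that comparison based on the gh-1249 benchmark shape (the 4096x4096 size is taken from the micro-benchmark above; time each indexing expression with %timeit in IPython to reproduce the comparison):

```
import dpctl.tensor as dpt

n = 4096
x_2d = dpt.ones((n, n), dtype="f4")
m_2d = dpt.ones((n, n), dtype="b1")

# flattened views of the same data; before the contiguous-src
# specialization, the flat variant was noticeably faster
x_2d_flat = dpt.reshape(x_2d, (n * n,))
m_2d_flat = dpt.reshape(m_2d, (n * n,))

r_strided = x_2d[m_2d]            # 2-D masked extract
r_contig = x_2d_flat[m_2d_flat]   # contiguous masked extract
```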