(Prototype) q1_0 nrc = 2 and diabolic tiles branches#21

Draft
pl752 wants to merge 9 commits into PrismML-Eng:master from pl752:perf/q1_0_nrc2

pl752 commented Apr 9, 2026

Pretty much a direct continuation of #10.
Vibe-coded prototype (proof of concept only, needs refining) of nrows = 2 branches for x86 SIMD.
Yields significant PP improvements, as it allows better utilization of memory bandwidth (the y operand stays hot, compute density is higher).
I also think ARM NEON is worth extending with nrows = 2.
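
To make the operand-reuse argument concrete, here is a minimal scalar sketch in plain Python (not the intrinsics used in this PR). It assumes a simplified 1-bit format where each weight is +1 or -1 encoded as one bit, with no per-block scale; the real q1_0 layout may differ. The function names `dot_1bit` and `dot_1bit_2x2` are made up for illustration.

```python
def dot_1bit(bits, y):
    """Scalar reference: +y[i] where bit i of `bits` is set, -y[i] otherwise."""
    return sum(v if (bits >> i) & 1 else -v for i, v in enumerate(y))


def dot_1bit_2x2(x0, x1, y0, y1):
    """2x2 tile: two weight rows (bit masks) against two activation columns.

    Each step reads one bit per row and one value per column (4 operand
    reads) and updates 4 accumulators. Four independent 1x1 dots would
    read 8 operands for the same amount of work.
    """
    assert len(y0) == len(y1)
    a00 = a01 = a10 = a11 = 0
    for i in range(len(y0)):
        s0 = 1 if (x0 >> i) & 1 else -1  # sign from row 0, reused for both columns
        s1 = 1 if (x1 >> i) & 1 else -1  # sign from row 1, reused for both columns
        v0, v1 = y0[i], y1[i]            # activations, reused for both rows
        a00 += s0 * v0
        a01 += s0 * v1
        a10 += s1 * v0
        a11 += s1 * v1
    return (a00, a01), (a10, a11)


# Hypothetical toy inputs: 4 weights per row, integer activations.
print(dot_1bit_2x2(0b1011, 0b0110, [1, 2, 3, 4], [5, 6, 7, 8]))
```

In a SIMD version the loop body would presumably operate on whole vector registers, but the reuse pattern is the same: every loaded operand feeds two accumulators instead of one.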

| flow | run | nrc=1 | nrc=2 | delta |
| --- | --- | ---: | ---: | ---: |
| SSSE3 | pp512 | 43.22 t/s | 52.69 t/s | +21.91% |
| SSSE3 | tg128 | 32.10 t/s | 32.16 t/s | +0.19% |
| AVX | pp512 | 52.76 t/s | 62.05 t/s | +17.61% |
| AVX | tg128 | 40.21 t/s | 40.34 t/s | +0.32% |
| AVX + F16C | pp512 | 68.62 t/s | 92.15 t/s | +34.29% |
| AVX + F16C | tg128 | 42.87 t/s | 45.36 t/s | +5.81% (idk) |
| AVX2 + FMA | pp512 | 121.54 t/s | 160.94 t/s | +32.42% |
| AVX2 + FMA | tg128 | 68.73 t/s | 69.24 t/s | +0.74% |
| AVX512BW | pp512 | 128.38 t/s | 158.17 t/s | +23.20% |
| AVX512BW | tg128 | 74.15 t/s | 71.90 t/s | -3.03% |

Also, for some reason the AVX-512 optimizations consistently hurt PP performance with nrows = 2, and the results are sometimes inconsistent.
The code for these branches is enormous and most likely suboptimal, so suggestions are welcome; register spills occur, of course.
The funny part is that I tried iterating on the AVX2 prototype but didn't manage to achieve any improvement.

I have also tried altering the tile geometry to use rectangular blocks because of the significant operand size asymmetry, as was attempted in #4 by @Marxist-Leninist. This yields some changes, but the results are inconclusive.

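| blck_0 | AVX2 pp512 | delta | AVX2 tg128 | delta | AVX-512 pp512 | delta | AVX-512 tg128 | delta |
| ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| 16 | 172.63 | - | 73.36 | - | 170.97 | - | 75.24 | - |
| 32 | 176.26 | +2.10% | 74.00 | +0.87% | 171.14 | +0.10% | 75.57 | +0.44% |
| 64 | 175.55 | +1.69% | 74.42 | +1.44% | 172.33 | +0.80% | 76.57 | +1.77% |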

@pl752 pl752 changed the title (Prototype) q1_0 nrc = 2 brances (Prototype) q1_0 nrc = 2 branches Apr 9, 2026
@github-actions github-actions bot added the ggml label Apr 9, 2026

zcattacz commented Apr 9, 2026

Hi @pl752, impressive. Initially the t/s gain was not visible, since my test prompt does not have a long context. Have you considered a 4x4 or even 8x8 mode for AVX-512?

2x2 batch, no loop/if
| model                          |       size |     params | backend    | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| qwen3 1.7B unknown, may not work | 231.13 MiB |     1.72 B | CPU        |       2 |           pp512 |         24.61 ± 0.25 |
| qwen3 1.7B unknown, may not work | 231.13 MiB |     1.72 B | CPU        |       2 |           tg128 |         14.32 ± 0.29 |

no loop/if + single accumulator
| model                          |       size |     params | backend    | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| qwen3 1.7B Q1_0                | 231.13 MiB |     1.72 B | CPU        |       2 |           pp512 |         18.40 ± 1.04 |
| qwen3 1.7B Q1_0                | 231.13 MiB |     1.72 B | CPU        |       2 |           tg128 |         13.79 ± 0.74 |



pl752 commented Apr 9, 2026

@zcattacz I was experimenting with repacking and other shapes of the 2x2 dot for AVX2, though not very successfully; a 4x4 dot is likely my next target.
8x8 is ridiculous and is likely to hit diminishing returns.


pl752 commented Apr 9, 2026

Most likely I will implement 4x1, 8x1, 4x2 or other kernel shapes outside the default mul_mat.


zcattacz commented Apr 9, 2026

Some points from AI's analysis that seem to make sense:

  • tg hits the memory bandwidth bottleneck; 70% sth is likely the ceiling for that hardware. The activation matrix is almost always [D,1], effectively 1xN during tg. The extra overhead in AVX-512 handling isn't paying off.
  • 4x4 is the standard sweet spot used in modern llama.cpp AVX-512 kernels; expect another 15–35% improvement.
  • Check out the smart gear-switching logic in ggml_compute_forward_mul_mat in ggml/src/ggml-cpu/ggml-cpu.c to avoid overhead.
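
The bandwidth point in the first bullet can be illustrated with a simple counting model. This is only a sketch under stated assumptions: it counts block loads per dot product for a register tile and ignores caches, block sizes and the size difference between 1-bit weights and activations. The helper `loads_per_dot` is hypothetical, not part of ggml.

```python
def loads_per_dot(rows, cols):
    """Return (weight block loads, activation block loads) per dot product
    for a register tile of `rows` weight rows by `cols` activation columns."""
    dots = rows * cols  # one dot product per (row, column) pair
    return rows / dots, cols / dots


for shape in [(1, 1), (2, 1), (4, 1), (2, 2), (4, 4)]:
    w, a = loads_per_dot(*shape)
    print(f"{shape[0]}x{shape[1]}: weight loads/dot = {w:.2f}, "
          f"activation loads/dot = {a:.2f}")
```

With a single activation column (the tg case) the weight loads per dot stay at 1.00 no matter how many rows the tile covers, so under this model a taller tile cannot reduce weight traffic; only extra columns, as in pp, do.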


pl752 commented Apr 9, 2026

Also, I have found the reason for the lower-than-usual results: I somehow missed that the later tests were run with -fa 0. Still, it doesn't change things much.


pl752 commented Apr 9, 2026

Also, cooking the slop continues: I have hooked up a 4x2 kernel and got 200+ for PP on AVX-512.


pl752 commented Apr 9, 2026

I have finished trying various tile shapes (the final shapes used are 1x1, 2x2, 2x1, 4x2 and 4x4); large tiles are only used where that is reasonable given register counts and memory bandwidth limits.
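
As a rough illustration of how register count can constrain the tile shape, here is a hypothetical selector in Python. It is not the dispatch logic from this PR. It assumes one accumulator register per output, one register per loaded operand and a fixed number of temporaries, and it reads a shape as weight rows by activation columns; all of these are assumptions. The register counts used in the example are the architectural ones for x86-64: 16 vector registers with AVX2, 32 with AVX-512.

```python
# Shapes named in the comment above, largest first.
SHAPES = [(4, 4), (4, 2), (2, 2), (2, 1), (1, 1)]


def regs_needed(rows, cols, temps=2):
    """Assumed cost model: accumulators + operand registers + temporaries."""
    return rows * cols + rows + cols + temps


def pick_tile(rows_left, cols_left, vector_regs):
    """Pick the largest listed shape that fits the remaining work and registers."""
    for r, c in SHAPES:
        if r <= rows_left and c <= cols_left and regs_needed(r, c) <= vector_regs:
            return r, c
    return 1, 1  # scalar-shaped fallback


print(pick_tile(8, 8, 32))  # many columns, 32 registers (AVX-512)
print(pick_tile(8, 8, 16))  # many columns, 16 registers (AVX2)
print(pick_tile(8, 1, 32))  # single activation column, as in tg
```

Under this cost model a 4x4 tile needs 26 registers, which fits only the 32-register case, and a single activation column falls back to 2x1. A real kernel may need more temporaries or tolerate some spills, so the thresholds here are illustrative only.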

The resulting code is pretty cursed/diabolic (of course it is vibe-coded and won't go anywhere near mainline); however, it seems to more or less max out my CPU, barring other significant refinements.

Results relative to nrc=2 are as follows (SSSE3 was not affected code-wise); most of the benefit comes from AVX-512 (4x4) in pp and AVX2/AVX-512 (2x1) in tg:

| flow | run | baseline | new | delta |
| --- | --- | ---: | ---: | ---: |
| AVX | pp512 | 94.66 t/s | 94.72 t/s | +0.06% |
| AVX | tg128 | 43.11 t/s | 43.20 t/s | +0.21% |
| AVX + F16C | pp512 | 94.78 t/s | 94.57 t/s | -0.22% |
| AVX + F16C | tg128 | 48.15 t/s | 48.10 t/s | -0.10% |
| AVX2 + FMA | pp512 | 180.95 t/s | 183.14 t/s | +1.21% |
| AVX2 + FMA | tg128 | 78.22 t/s | 80.79 t/s | +3.29% |
| AVX512BW | pp512 | 177.11 t/s | 207.96 t/s | +17.42% |
| AVX512BW | tg128 | 80.99 t/s | 83.01 t/s | +2.49% |

I have tried other shapes (larger, or wider/longer) and didn't obtain notable improvements.
This PR (again) is just a demo of the extremes in x86 SIMD implementations, so the bad code is intended.

@pl752 pl752 changed the title (Prototype) q1_0 nrc = 2 branches (Prototype) q1_0 nrc = 2 and diabolic tiles branches Apr 9, 2026