(Prototype) q1_0 nrc = 2 and diabolic tiles branches by pl752 · Pull Request #21 · PrismML-Eng/llama.cpp

pl752 · 2026-04-09T08:20:54Z

Pretty much a direct continuation of #10.
Vibe-coded prototype (proof of concept only, needs refining) of nrows = 2 branches for x86 SIMD.
It yields significant PP improvements because it makes better use of memory bandwidth (the y operand stays hot, and compute density is higher).
I also think ARM NEON is worth extending with nrows = 2.
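To illustrate the "hot y operand" point, here is a minimal scalar sketch of a 2x2 tile next to a 1x1 dot. It is not the PR's code: `blk1_t`, `QK`, `dot_1x1` and `dot_2x2` are hypothetical names, the 1-bit block layout is a simplified stand-in for the real q1_0 format, activations are plain floats instead of a quantized type, and it assumes nrows = 2 means a 2x2 output tile.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define QK 32 /* hypothetical block size; not the real q1_0 layout */

/* Hypothetical 1-bit block: bit i set means weight +d, clear means -d. */
typedef struct {
    float    d;
    uint32_t bits;
} blk1_t;

/* Signed contribution of activation yv for bit i of a block. */
static float signed_y(uint32_t bits, int i, float yv) {
    return ((bits >> i) & 1u) ? yv : -yv;
}

/* 1x1: one weight row against one activation column. */
static float dot_1x1(const blk1_t *x, const float *y, size_t nblocks) {
    float sum = 0.0f;
    for (size_t b = 0; b < nblocks; ++b) {
        float acc = 0.0f;
        for (int i = 0; i < QK; ++i) {
            acc += signed_y(x[b].bits, i, y[b * QK + i]);
        }
        sum += x[b].d * acc;
    }
    return sum;
}

/* 2x2: two weight rows against two activation columns.
 * out[2 * r + c] holds the dot of row r with column c. */
static void dot_2x2(const blk1_t *x0, const blk1_t *x1,
                    const float *y0, const float *y1,
                    size_t nblocks, float out[4]) {
    assert(x0 && x1 && y0 && y1 && out);
    float sum[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    for (size_t b = 0; b < nblocks; ++b) {
        float acc[4] = {0.0f, 0.0f, 0.0f, 0.0f};
        const uint32_t b0 = x0[b].bits; /* each weight word is loaded once */
        const uint32_t b1 = x1[b].bits;
        for (int i = 0; i < QK; ++i) {
            const float ya = y0[b * QK + i]; /* each y value feeds two rows */
            const float yb = y1[b * QK + i];
            acc[0] += signed_y(b0, i, ya);
            acc[1] += signed_y(b0, i, yb);
            acc[2] += signed_y(b1, i, ya);
            acc[3] += signed_y(b1, i, yb);
        }
        sum[0] += x0[b].d * acc[0];
        sum[1] += x0[b].d * acc[1];
        sum[2] += x1[b].d * acc[2];
        sum[3] += x1[b].d * acc[3];
    }
    for (int j = 0; j < 4; ++j) out[j] = sum[j];
}
```

The 2x2 variant produces four outputs from the same number of loads that two 1x1 calls would need, which is the compute-density gain described above. It also needs four live accumulators, which hints at where register pressure comes from in the SIMD versions.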

| flow | run | nrc=1 | nrc=2 | delta |
| --- | --- | ---: | ---: | ---: |
| `SSSE3` | `pp512` | 43.22 t/s | 52.69 t/s | +21.91% |
| `SSSE3` | `tg128` | 32.10 t/s | 32.16 t/s | +0.19% |
| `AVX` | `pp512` | 52.76 t/s | 62.05 t/s | +17.61% |
| `AVX` | `tg128` | 40.21 t/s | 40.34 t/s | +0.32% |
| `AVX` + `F16C` | `pp512` | 68.62 t/s | 92.15 t/s | +34.29% |
| `AVX` + `F16C` | `tg128` | 42.87 t/s | 45.36 t/s | +5.81% (idk) |
| `AVX2` + `FMA` | `pp512` | 121.54 t/s | 160.94 t/s | +32.42% |
| `AVX2` + `FMA` | `tg128` | 68.73 t/s | 69.24 t/s | +0.74% |
| `AVX512BW` | `pp512` | 128.38 t/s | 158.17 t/s | +23.20% |
| `AVX512BW` | `tg128` | 74.15 t/s | 71.90 t/s | -3.03% |
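The delta column appears to be the relative change of the nrc=2 throughput over the nrc=1 throughput. A minimal check, assuming that definition (`delta_pct` is a hypothetical helper, not something from the PR):

```c
/* Relative change of `now` over `base`, in percent. */
static double delta_pct(double base, double now) {
    return (now / base - 1.0) * 100.0;
}
```

For the `SSSE3` `pp512` row, `delta_pct(43.22, 52.69)` comes out at about 21.91, and for the `AVX512BW` `tg128` row, `delta_pct(74.15, 71.90)` comes out at about -3.03, matching the table.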

Also, for some reason the AVX-512 opts consistently hurt PP performance for nrows = 2, and the results are sometimes inconsistent.
The code for these branches is enormous and most likely suboptimal, so suggestions are welcome. Register spills occur, of course.
The funny part is that I tried iterating on the AVX2 prototype but didn't manage to achieve any improvements.

I have also tried altering the tile geometry to use rectangular blocks, because of the significant operand size asymmetry, as was attempted in #4 by @Marxist-Leninist. This yields some changes, but the results are inconclusive.

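| `blck_0` | AVX2 `pp512` (t/s) | delta | AVX2 `tg128` (t/s) | delta | AVX-512 `pp512` (t/s) | delta | AVX-512 `tg128` (t/s) | delta |
| ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| 16 | 172.63 | - | 73.36 | - | 170.97 | - | 75.24 | - |
| 32 | 176.26 | +2.10% | 74.00 | +0.87% | 171.14 | +0.10% | 75.57 | +0.44% |
| 64 | 175.55 | +1.69% | 74.42 | +1.44% | 172.33 | +0.80% | 76.57 | +1.77% |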
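For readers unfamiliar with the tile geometry being varied here, the following is a rough sketch of a blocked matmul loop with separate tile extents. It assumes `blck_0` tiles the weight rows and `blck_1` tiles the activation rows, and that only `blck_0` was changed in the table above. `mul_mat_tiled` and `dot_f32` are hypothetical names, and plain floats stand in for the quantized operands.

```c
#include <assert.h>

/* Dot product of two length-k float rows. */
static float dot_f32(const float *a, const float *b, int k) {
    float s = 0.0f;
    for (int i = 0; i < k; ++i) s += a[i] * b[i];
    return s;
}

/* out[ir1 * nr0 + ir0] = dot(weight row ir0, activation row ir1).
 * w is nr0 x k and act is nr1 x k, both row-major.
 * blck_0 tiles the weight rows, blck_1 tiles the activation rows. */
static void mul_mat_tiled(const float *w, const float *act, float *out,
                          int nr0, int nr1, int k, int blck_0, int blck_1) {
    assert(blck_0 > 0 && blck_1 > 0);
    for (int iir1 = 0; iir1 < nr1; iir1 += blck_1) {
        for (int iir0 = 0; iir0 < nr0; iir0 += blck_0) {
            /* Inside one tile, the same few rows of both operands are reused. */
            for (int ir1 = iir1; ir1 < iir1 + blck_1 && ir1 < nr1; ++ir1) {
                for (int ir0 = iir0; ir0 < iir0 + blck_0 && ir0 < nr0; ++ir0) {
                    out[ir1 * nr0 + ir0] = dot_f32(w + ir0 * k, act + ir1 * k, k);
                }
            }
        }
    }
}
```

The tile shape changes only the traversal order, never the result. A rectangular tile such as 64x16 keeps more weight rows in flight per activation row, which is one way to account for the operands having very different sizes.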

zcattacz · 2026-04-09T09:40:56Z

Hi @pl752, impressive. Initially the tps gain was not visible, since my test prompt did not have a long context. Have you considered a 4x4 ~~or even 8x8~~ mode for AVX-512?

2x2 batch, no loop/if
| model                          |       size |     params | backend    | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |                        
| qwen3 1.7B unknown, may not work | 231.13 MiB |     1.72 B | CPU        |       2 |           pp512 |         24.61 ± 0.25 |                      
| qwen3 1.7B unknown, may not work | 231.13 MiB |     1.72 B | CPU        |       2 |           tg128 |         14.32 ± 0.29 |                      

no loop/if + single accumulator
| model                          |       size |     params | backend    | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| qwen3 1.7B Q1_0                | 231.13 MiB |     1.72 B | CPU        |       2 |           pp512 |         18.40 ± 1.04 |
| qwen3 1.7B Q1_0                | 231.13 MiB |     1.72 B | CPU        |       2 |           tg128 |         13.79 ± 0.74 |

pl752 · 2026-04-09T10:56:52Z

@zcattacz I was experimenting with repacking and other shapes of the 2x2 dot for AVX2, though not very successfully; a 4x4 dot is likely my next target.
8x8 is ridiculous and is likely to result in diminishing returns.

pl752 · 2026-04-09T11:09:24Z

Most likely I will implement 4x1, 8x1, 4x2 or other kernel shapes outside the default mul_mat.

zcattacz · 2026-04-09T11:21:49Z

Some points from the AI's analysis that seem to make sense:

- tg hits the memory bandwidth bottleneck; 70-something t/s is likely the ceiling for that hardware. The activation matrix is almost always [D,1] during tg, so the matmul is effectively 1xN. The extra overhead of the AVX-512 handling isn't paying off.
- 4x4 is the standard sweet spot used in modern llama.cpp AVX512 kernels; expect another 15–35% improvement.
- Check out the smart gear-switching logic in `ggml_compute_forward_mul_mat` in `ggml/src/ggml-cpu/ggml-cpu.c` to avoid overhead.
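The bandwidth argument can be put as a back-of-the-envelope bound: if every generated token has to stream all weights from memory once, throughput cannot exceed bandwidth divided by weight size. The sketch below assumes exactly that simple model and ignores activation traffic and caches; `tg_ceiling_tps` and `batched_ceiling_tps` are hypothetical helpers, and any bandwidth figure plugged in is an assumption, not a measurement from this thread.

```c
/* Rough upper bound on decode speed (tokens/s) when each token
 * streams all weights from memory once. Inputs are assumptions. */
static double tg_ceiling_tps(double bandwidth_bytes_per_s, double weight_bytes) {
    return bandwidth_bytes_per_s / weight_bytes;
}

/* When n_tokens share each weight load (as in prompt processing),
 * the bandwidth-only bound scales by n_tokens; compute then becomes
 * the limit well before this bound is reached. */
static double batched_ceiling_tps(double bandwidth_bytes_per_s,
                                  double weight_bytes, int n_tokens) {
    return n_tokens * tg_ceiling_tps(bandwidth_bytes_per_s, weight_bytes);
}
```

With a hypothetical 20 GB/s of usable bandwidth and 250 MB of weights, the decode bound is 80 t/s. Under this model, wider kernels cannot raise tg much once it is near the bound, while pp has headroom because each weight load is shared across many tokens.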

pl752 · 2026-04-09T12:23:33Z

Also, I have found the reason for the lower-than-usual results: I somehow missed that the later tests were run with -fa 0. Still, it doesn't change things much.

pl752 · 2026-04-09T12:32:08Z

Also, cooking ~~the slop~~ continues: I have hooked up a 4x2 kernel and got 200+ t/s for PP on AVX-512.

pl752 · 2026-04-09T16:02:15Z

I have finished trying various tile shapes (the final forms used are 1x1, 2x2, 2x1, 4x2 and 4x4); large tiles are only used where register counts and memory bandwidth limits make them reasonable.
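One way to picture how tile shapes could be chosen from register counts is the sketch below. It is an illustration and not the PR's dispatch logic: `tile_t`, `tile_regs` and `pick_tile` are hypothetical, and the register cost model (one vector accumulator per output plus one register per loaded row and column) is a rough assumption. The register file sizes are the architectural ones in 64-bit mode: 16 vector registers for AVX/AVX2 and 32 for AVX-512.

```c
#include <stddef.h>

typedef struct {
    int rows; /* weight rows handled per kernel call */
    int cols; /* activation columns handled per kernel call */
} tile_t;

/* Assumed cost model: one accumulator per output,
 * plus one register per loaded x row and y column. */
static int tile_regs(tile_t t) {
    return t.rows * t.cols + t.rows + t.cols;
}

/* Pick the largest listed shape that fits the remaining work
 * and the vector register budget. */
static tile_t pick_tile(int rows_left, int cols_left, int vec_regs) {
    static const tile_t cand[] = {{4, 4}, {4, 2}, {2, 2}, {2, 1}, {1, 1}};
    const tile_t fallback = {1, 1};
    for (size_t i = 0; i < sizeof cand / sizeof cand[0]; ++i) {
        if (cand[i].rows <= rows_left && cand[i].cols <= cols_left &&
            tile_regs(cand[i]) <= vec_regs) {
            return cand[i];
        }
    }
    return fallback;
}
```

Under this model `pick_tile(8, 8, 32)` selects 4x4 and `pick_tile(8, 8, 16)` selects 4x2, while a single activation column, as in token generation, falls back to 2x1.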

The resulting code is pretty cursed/diabolic (ofc it is vibe-coded and won't go anywhere near mainline); however, it seems to more or less max out my CPU, unless there are other significant refinements to be found.

Results relative to the nrc=2 state are as follows (SSSE3 was not affected code-wise); most of the benefit comes from AVX-512 (4x4) in pp and AVX2/AVX-512 (2x1) in tg:

| flow | run | baseline | new | delta |
| --- | --- | ---: | ---: | ---: |
| `AVX` | `pp512` | 94.66 t/s | 94.72 t/s | +0.06% |
| `AVX` | `tg128` | 43.11 t/s | 43.20 t/s | +0.21% |
| `AVX` + `F16C` | `pp512` | 94.78 t/s | 94.57 t/s | -0.22% |
| `AVX` + `F16C` | `tg128` | 48.15 t/s | 48.10 t/s | -0.10% |
| `AVX2` + `FMA` | `pp512` | 180.95 t/s | 183.14 t/s | +1.21% |
| `AVX2` + `FMA` | `tg128` | 78.22 t/s | 80.79 t/s | +3.29% |
| `AVX512BW` | `pp512` | 177.11 t/s | 207.96 t/s | +17.42% |
| `AVX512BW` | `tg128` | 80.99 t/s | 83.01 t/s | +2.49% |

I have tried other shapes (larger or wider/longer) and didn't obtain notable improvements.
This PR (again) is just a demo of extremes in x86 SIMD implementations, so bad code is intended.

pl752 added 5 commits April 7, 2026 11:46

Implemented optimized q1_0 dot for x86 and generic

195593b

Removed redundant helper definition

e29cd48

Initial AVX2 nrc2 experiment

4d0a787

Experiment with rectangular tile

7ee5400

Implemented prototype nrc2 branches for SSSE3 and AVX

50d5d39

pl752 changed the title ~~(Prototype) q1_0 nrc = 2 brances~~ (Prototype) q1_0 nrc = 2 branches Apr 9, 2026

github-actions bot added the ggml label Apr 9, 2026

pl752 mentioned this pull request Apr 9, 2026

(Performance) Optimized x86 and generic q1_0(_g128) dot #10

Open

pl752 added 4 commits April 9, 2026 17:43

Hooked up special q1_0 4x2 dot kernel for AVX2/512

e5ce6d4

Implemented 4x4 dot fastpath

34f547d

Added 2x1 kernel for slightly faster q1_0 decoding

bec8051

Completed x86 SIMD extremities demo experiments

a88ee90

pl752 changed the title ~~(Prototype) q1_0 nrc = 2 branches~~ (Prototype) q1_0 nrc = 2 and diabolic tiles branches Apr 9, 2026

pl752 wants to merge 9 commits into PrismML-Eng:master from pl752:perf/q1_0_nrc2
