
ggml-cpu: Optimized x86 and generic cpu q1_0 dot (follow up) #21636

Open

pl752 wants to merge 7 commits into ggml-org:master from pl752:perf/q1_0_g128_no_nofma

Conversation

Contributor

@pl752 pl752 commented Apr 8, 2026

Hello, I have prepared an optimized implementation of the CPU q1_0 dot product (mainly for Bonsai LLM models). This is a continuation of PR PrismML-Eng#10; the list of experiments conducted and some additional benchmark results can be found there.

This PR implements:

  • A more efficient generic implementation (less bit math and fewer multiplications) of the dot product for (q1_0; q8_0)
  • x86-specific SIMD implementations of the dot product for (q1_0; q8_0), covering most realistic x86_64 targets (from SSSE3 to AVX2)
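As a hedged illustration of the "less bit math and multiplications" idea (the function names and the 8-element shape here are mine, not the PR's actual code): for 1-bit weights that decode to ±1, the per-group integer sum can use the identity Σ signⱼ·yⱼ = 2·Σ_{bit set} yⱼ − Σ yⱼ, trading a select-and-multiply per element for a conditional add:

```c
#include <stdint.h>

// Reference path: decode each bit to +1/-1 and multiply (illustrative only).
int dot_naive(uint8_t bits, const int8_t y[8]) {
    int s = 0;
    for (int j = 0; j < 8; ++j) {
        s += (((bits >> j) & 1) ? 1 : -1) * y[j];
    }
    return s;
}

// Same result with no multiplications: 2*(masked sum) - (total sum).
int dot_fast(uint8_t bits, const int8_t y[8]) {
    int sel = 0, all = 0;
    for (int j = 0; j < 8; ++j) {
        all += y[j];
        if ((bits >> j) & 1) sel += y[j];  // add y[j] only where the weight bit is set
    }
    return 2 * sel - all;
}
```

In a real kernel the total sum per q8_0 block can be computed once and reused, and the masked partial sum maps well onto SIMD mask/blend operations.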

Checks performed so far:

  • test-quantization-fns passes
  • model behaves well
  • perplexity runs completed for 5x512 batches of wikitext-2-test (unpacked gguf as a reference, Bonsai 1.7B)
  • llama-bench runs for Bonsai 1.7B
  • verified that the generated assembly is efficient (no register spills, good pipeline pressure)
Benchmark results for Bonsai 1.7B

Benchmarks were performed with:

  • CPU: AMD Ryzen 5 7640HS (at 65w)
  • WSL VM
  • LPDDR5 @ 6400 MT/s (JEDEC timings)
  • Threads: 10
| Flow | pp 512 t/s | tg 128 t/s | Speedup |
|---|---|---|---|
| Initial* | 2.05 | 1.32 | 1.0x / 1.0x |
| Scalar | 13.07 | 9.38 | 6.4x / 7.1x |
| SSSE3 | 43.43 | 32.56 | 21.2x / 24.6x |
| AVX | 53.54 | 40.70 | 26.1x / 30.8x |
| AVX + F16C** | 73.87 | 45.94 | 36.0x / 34.7x |
| AVX2 + FMA | 131.03 | 73.85 | 63.9x / 55.9x |
| AVX512 | 137.75 | 76.91 | 67.1x / 58.2x |

"*": Results for the current mainline variant were extrapolated due to me being impatient
"**": F16C is also enabled for the AVX2/AVX512 rows and disabled for the earlier ones (to reflect CPU ISA generations)

Perplexity summary for Bonsai 1.7B
| Metric | Scalar | SSSE3 | AVX | AVX2 + FMA |
|---|---|---|---|---|
| Same top p | 99.451 ± 0.207 % | 99.059 ± 0.271 % | 99.373 ± 0.221 % | 99.686 ± 0.157 % |
| Mean KLD | 0.000213 ± 0.000008 | 0.000228 ± 0.000010 | 0.000235 ± 0.000010 | 0.000218 ± 0.000009 |
| Maximum KLD | 0.004783 | 0.004070 | 0.004658 | 0.005173 |
| 99.9% KLD | 0.002648 | 0.003666 | 0.003888 | 0.003778 |
| 99.0% KLD | 0.001295 | 0.001730 | 0.001676 | 0.001318 |
| Median KLD | 0.000129 | 0.000141 | 0.000143 | 0.000134 |
| 1.0% KLD | -0.000012 | -0.000009 | -0.000007 | -0.000006 |
| Minimum KLD | -0.000051 | -0.000040 | -0.000057 | -0.000045 |
| Mean Δp | 0.000 ± 0.009 % | 0.011 ± 0.010 % | 0.000 ± 0.010 % | 0.011 ± 0.010 % |
| Maximum Δp | 2.770 % | 2.917 % | 2.709 % | 3.366 % |
| 99.9% Δp | 1.851 % | 2.036 % | 2.166 % | 2.707 % |
| 99.0% Δp | 1.192 % | 1.359 % | 1.314 % | 1.268 % |
| 95.0% Δp | 0.486 % | 0.534 % | 0.540 % | 0.551 % |
| Median Δp | -0.000 % | 0.000 % | 0.000 % | 0.000 % |
| 5.0% Δp | -0.465 % | -0.558 % | -0.576 % | -0.494 % |
| 1.0% Δp | -1.020 % | -1.034 % | -1.099 % | -0.989 % |
| 0.1% Δp | -1.888 % | -1.412 % | -1.783 % | -1.675 % |
| Minimum Δp | -2.109 % | -1.823 % | -1.859 % | -2.133 % |
| RMS Δp | 0.334 ± 0.017 % | 0.360 ± 0.018 % | 0.362 ± 0.017 % | 0.364 ± 0.022 % |

Things still to be done (most likely not in this PR):

  • AVX512 implementation for Zen 4: I was unable to achieve meaningful improvements beyond the compiler's own optimizations, and it is unlikely to actually help on Zen 4 due to the problem shape, the currently small instruction footprint, and memory pressure
  • An implementation for Zen 5 or modern Xeons, as they have a faster AVX512 pipeline
  • Implementing branches for nrc==2, as it shows potential for further speedup (the pipeline is already pretty hot in terms of memory bandwidth); probably a next PR soon
  • Maybe some experiments beyond that (repack -> specialized mmvq/mmq; experimenting with scratch buffer configurations); I haven't found a good use for these so far — a plain 4x4 dot is promising, but still WIP
  • I have a RISC-V SBC with a vector size of 256 and fp/bf support (SpacemiT K1), so maybe a future PR for RISC-V SIMD (or even SpacemiT MMA?)

People who have also contributed

(other people who provided useful insights or experimented themselves)

AI usage disclosure

  • Was used for automating benchmarks, some of the tests and creating tables
  • Was NOT used to write any other text for PR or human interaction
  • Was used for prototyping and iteration (guided by me, final code was mostly manually refined and tested)

@pl752 pl752 marked this pull request as ready for review April 8, 2026 19:52
@pl752 pl752 requested a review from ggerganov as a code owner April 8, 2026 19:52
@pl752
Contributor Author

pl752 commented Apr 8, 2026

Aaand, we are live. Okay, reviews, requests, and questions are welcome.

@khosravipasha
Contributor

Tested this on an x86 CPU I have access to, an "AMD EPYC 7543 32-Core Processor" (it's in the cloud).

Before this PR it ran at <1 tok/s for the smallest model, so this is a decent speedup. Not sure how the speed compares with other quantization formats for models of similar size on CPU only; I have not actively tried them.

CPU Benchmarks (fa=1, CPU-only build)

| Model | Threads | pp512 (t/s) | tg128 (t/s) |
|---|---|---|---|
| Bonsai-1.7B | 4 | 65.0 ± 3.8 | 41.1 ± 1.2 |
| Bonsai-1.7B | 8 | 128.5 ± 6.5 | 52.2 ± 0.2 |
| Bonsai-1.7B | 10 | 153.1 ± 5.6 | 57.4 ± 3.0 |
| Bonsai-4B | 4 | 27.0 ± 1.8 | 20.0 ± 0.6 |
| Bonsai-4B | 8 | 50.0 ± 3.3 | 34.0 ± 0.6 |
| Bonsai-4B | 10 | 59.7 ± 2.1 | 34.8 ± 0.3 |
| Bonsai-8B | 4 | 14.9 ± 0.3 | 12.2 ± 0.2 |
| Bonsai-8B | 8 | 27.6 ± 1.1 | 20.4 ± 1.0 |
| Bonsai-8B | 10 | 33.9 ± 1.3 | 22.9 ± 0.5 |

KL divergence with unpacked version:

| Build | Model | Mean KLD | Same Top Token | Status |
|---|---|---|---|---|
| CPU | 1.7B | 0.000261 ± 0.000009 | 99.22% | PASS |
| CPU | 4B | 0.000214 ± 0.000014 | 99.14% | PASS |
| CPU | 8B | 0.000200 ± 0.000008 | 99.61% | PASS |

@pl752
Contributor Author

pl752 commented Apr 10, 2026

Sorry for the late commit. I noticed that I forgot to take the C = A×B + 0 → C = A×B shortcut in the AVX kernel for the first repeating block, as is done for AVX2. ~1% improvement in t/s, only for the AVX-era ISA. No accuracy/perplexity changes.
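For illustration, the shortcut amounts to peeling the first block so the accumulator is seeded by a plain multiply rather than being zero-initialized and then added into. A scalar sketch (the function name and shape here are hypothetical, not the kernel's real signature; in the AVX kernel the same idea replaces an add-of-multiply-into-zero with a bare `_mm256_mul_ps` for the first repeating block):

```c
// Peel the first iteration: seed the accumulator with the first product
// (C = A*B) instead of initializing it to zero and paying an extra add
// (C = A*B + 0) on the first block.
float dot_peeled(const float *a, const float *b, int n) {
    float acc = a[0] * b[0];      // first block: plain multiply, no add
    for (int i = 1; i < n; ++i) {
        acc += a[i] * b[i];       // remaining blocks: multiply-accumulate
    }
    return acc;
}
```

The result is identical to the zero-seeded loop; the saving is one redundant add per row, which only matters on targets without fused multiply-add (hence AVX-era only).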

@github-actions github-actions bot added the ggml changes relating to the ggml tensor library for machine learning label Apr 10, 2026
@pl752
Contributor Author

pl752 commented Apr 11, 2026

As I am still awaiting review from @ggerganov or somebody else, I decided to perform a small code cleanup.

@khosravipasha
Contributor

@pl752 maybe rename the PR to something simpler, e.g. "CPU: Q1_0 x86 optimizations"
(I saw a few more CPU PRs that were AI-generated and closed, so they might have missed this one)

@pl752 pl752 changed the title (Performance; ggml-cpu) Optimized x86 and generic cpu q1_0 dot (follow up) ggml-cpu: Optimized x86 and generic cpu q1_0 dot (follow up) Apr 12, 2026
@pl752
Contributor Author

pl752 commented Apr 12, 2026

Why not, actually; I have changed the name to the more usual style for this place. I think the staff are currently busy and will get here sooner or later; maybe we need to ping another reviewer who specializes in the CPU backend of ggml. For now I will switch to other projects, and maybe explore opportunities to optimize q1_0 CUDA, do some work towards RISC-V support (vector SIMD and SpacemiT IME extensions) for q1_0, or start working on a fork optimized for sm70 (Tesla V100), as they are pretty abundant on the Chinese second-hand market at a good price (and because I already have two). Also, the tile dot (standard nrc=2 and special larger kernels) will be further refined and explored for better ways of utilizing the hardware (let's hope the review will not take so long that I consider pushing it to the current PR).

@zcattacz

I can see why this should be low priority:

  • so much unfinished business in the desc... needs time to breathe
  • reads like a procedural optimization of a recently merged good-enough impl... heck, does it actually change anything?
  • tagged only the owner of a large active repo... who's usually the last gatekeeper

Hmm, we're aiming for the annual sweep.

@pl752
Contributor Author

pl752 commented Apr 13, 2026

I think I needed to point out more clearly that the current checkpoint is ready to be merged (in my opinion), as it already provides huge performance improvements over the current minimal viable implementation, and to tag some people who usually review this kind of change.

@CISC, @am17an — I'd be grateful if you could take a look at this PR.

Contributor

@am17an am17an left a comment


Sorry I forgot to press submit on my review.

Comment thread ggml/src/ggml-cpu/arch/x86/quants.c Outdated
@pl752
Contributor Author

pl752 commented Apr 14, 2026

@khosravipasha there is an old scalar implementation left in the code for ARM; I think it can be replaced with a call to the generic function, as is now done for x86. What do you think? (I am running a perplexity run in an emulator for now; it seems to work properly.)

@khosravipasha
Contributor

@am17an awesome thanks.

@pl752 What is the change exactly? It's not included here, right? How much is the improvement? Maybe best to do it in another PR, since this one only mentions x86.

@pl752
Contributor Author

pl752 commented Apr 14, 2026

@khosravipasha The PR mentions x86 and GENERIC; this, in my opinion, means the arch-agnostic generic implementation is part of it too, so for consistency it is logical to use it from the other implementations as well. As for the difference, I am unable to measure performance due to emulation, but I highly doubt that code which runs nearly an order of magnitude slower on x86 will behave relatively differently on ARM.

As for the change, it just replaces the whole else section for ARM with one similar to x86:

Patch
diff --git a/ggml/src/ggml-cpu/arch/arm/quants.c b/ggml/src/ggml-cpu/arch/arm/quants.c
index e09db59..cec5815 100644
--- a/ggml/src/ggml-cpu/arch/arm/quants.c
+++ b/ggml/src/ggml-cpu/arch/arm/quants.c
@@ -151,8 +151,6 @@ void ggml_vec_dot_q1_0_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const voi
     const block_q1_0 * GGML_RESTRICT x = vx;
     const block_q8_0 * GGML_RESTRICT y = vy;
 
-    float sumf = 0.0f;
-
 #if defined(__ARM_NEON)
     float32x4_t sumv = vdupq_n_f32(0.0f);
 
@@ -212,31 +210,13 @@ void ggml_vec_dot_q1_0_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const voi
         }
     }
 
-    sumf = vaddvq_f32(sumv);
+    *s = vaddvq_f32(sumv);
 #else
-    // Scalar fallback
-    for (int i = 0; i < nb; i++) {
-        const float d0 = GGML_FP16_TO_FP32(x[i].d);
-
-        // Process 4 Q8_0 blocks
-        for (int k = 0; k < 4; k++) {
-            const float d1 = GGML_FP16_TO_FP32(y[i*4 + k].d);
-
-            int sumi = 0;
-            for (int j = 0; j < QK8_0; j++) {
-                const int bit_index = k * QK8_0 + j;
-                const int byte_index = bit_index / 8;
-                const int bit_offset = bit_index % 8;
-
-                const int xi = ((x[i].qs[byte_index] >> bit_offset) & 1) ? 1 : -1;
-                sumi += xi * y[i*4 + k].qs[j];
-            }
-            sumf += d0 * d1 * sumi;
-        }
-    }
+    UNUSED(nb);
+    UNUSED(x);
+    UNUSED(y);
+    ggml_vec_dot_q1_0_q8_0_generic(n, s, bs, vx, bx, vy, by, nrc);
 #endif
-
-    *s = sumf;
 }
Which system are you using for benchmarking and validating the ARM implementation, and are you able to test it?

PS:

Perplexity for non-NEON
chunk             PPL               ln(PPL(Q)/PPL(base))          KL Divergence              Δp RMS            Same top p
   1      13.9305 ±    3.1730      -0.00190 ±    0.00229       0.00019 ±    0.00002     0.356 ±  0.064 %    100.000 ±  0.000 %
   2      20.1898 ±    3.4351       0.01389 ±    0.01155       0.00021 ±    0.00001     0.322 ±  0.037 %    99.608 ±  0.277 %
   3      20.8716 ±    2.7911       0.01009 ±    0.00771       0.00021 ±    0.00001     0.331 ±  0.026 %    99.216 ±  0.319 %
   4      21.2229 ±    2.3914       0.00747 ±    0.00581       0.00022 ±    0.00001     0.336 ±  0.021 %    99.314 ±  0.259 %
   5      21.0933 ±    2.1040       0.00594 ±    0.00468       0.00022 ±    0.00001     0.334 ±  0.018 %    99.451 ±  0.207 %

====== Perplexity statistics ======
Mean PPL(Q)                   :  21.093286 ±   2.103980
Mean PPL(base)                :  20.968387 ±   2.074795
Cor(ln(PPL(Q)), ln(PPL(base))):  99.89%
Mean ln(PPL(Q)/PPL(base))     :   0.005939 ±   0.004676
Mean PPL(Q)/PPL(base)         :   1.005957 ±   0.004704
Mean PPL(Q)-PPL(base)         :   0.124898 ±   0.101192

====== KL divergence statistics ======
Mean    KLD:   0.000217 ±   0.000009
Maximum KLD:   0.004590
99.9%   KLD:   0.002847
99.0%   KLD:   0.001380
95.0%   KLD:   0.000646
90.0%   KLD:   0.000488
Median  KLD:   0.000135
10.0%   KLD:   0.000002
 5.0%   KLD:   0.000000
 1.0%   KLD:  -0.000007
 0.1%   KLD:  -0.000032
Minimum KLD:  -0.000035

====== Token probability statistics ======
Mean    Δp:  0.009 ± 0.009 %
Maximum Δp:  3.273%
99.9%   Δp:  2.085%
99.0%   Δp:  1.149%
95.0%   Δp:  0.567%
90.0%   Δp:  0.296%
75.0%   Δp:  0.053%
Median  Δp: -0.000%
25.0%   Δp: -0.057%
10.0%   Δp: -0.312%
 5.0%   Δp: -0.493%
 1.0%   Δp: -0.934%
 0.1%   Δp: -1.448%
Minimum Δp: -1.827%
RMS Δp    :  0.334 ± 0.018 %
Same top p: 99.451 ± 0.207 %

@khosravipasha
Contributor

@pl752 Thanks for the clarification. I only have access to a Mac, which uses the NEON path; I have not tried other ARM CPUs.
I will leave it to the CPU experts whether to include this now or not (okay with me either way).

@pl752
Contributor Author

pl752 commented Apr 14, 2026

@khosravipasha you can test any path by disabling the native build and defining the used instruction sets manually:

Flags as for example for non-NEON
-DGGML_NATIVE=OFF \
-DGGML_CPU_ARM_ARCH=armv8-a+nosimd \
-DGGML_LLAMAFILE=OFF
UPD: the perplexity run is healthy for the NEON path too, so I will push; please notify me if any regressions emerge.

@pl752
Contributor Author

pl752 commented Apr 14, 2026

@am17an I have pushed the replacement of the ARM fallback path with the new generic one, for consistency with x86. No further changes are planned for this PR if nothing unexpected happens.

Contributor

@am17an am17an left a comment


Need someone else to approve to merge as well. @ggerganov?
