
ggml-cpu: Optimized x86 and generic cpu q1_0 dot (follow up) #21636

Open

pl752 wants to merge 7 commits into ggml-org:master from pl752:perf/q1_0_g128_no_nofma

Conversation

Contributor

@pl752 pl752 commented Apr 8, 2026

Hello, I have prepared an optimized implementation of the CPU q1_0 dot product (mainly for Bonsai LLM models). This is a continuation of PR PrismML-Eng#10; the list of experiments conducted and some additional benchmark results can be found there.

This PR implements:

  • A more efficient generic implementation (less bit math and fewer multiplications) of the dot product for (q1_0; q8_0)
  • x86-specific SIMD implementations of the dot product for (q1_0; q8_0), covering most realistic x86_64 targets (from SSSE3 to AVX2)
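As a hedged illustration of the "less bit math and multiplications" idea (the function names and the 8-element shape here are mine, not the PR's actual code): for 1-bit weights that decode to ±1, the per-group integer sum can use the identity Σ signⱼ·yⱼ = 2·Σ_{bit set} yⱼ − Σ yⱼ, trading a select-and-multiply per element for a conditional add:

```c
#include <stdint.h>

// Reference path: decode each bit to +1/-1 and multiply (illustrative only).
int dot_naive(uint8_t bits, const int8_t y[8]) {
    int s = 0;
    for (int j = 0; j < 8; ++j) {
        s += (((bits >> j) & 1) ? 1 : -1) * y[j];
    }
    return s;
}

// Same result with no multiplications: 2*(masked sum) - (total sum).
int dot_fast(uint8_t bits, const int8_t y[8]) {
    int sel = 0, all = 0;
    for (int j = 0; j < 8; ++j) {
        all += y[j];
        if ((bits >> j) & 1) sel += y[j];  // add y[j] only where the weight bit is set
    }
    return 2 * sel - all;
}
```

In a real kernel the total sum per q8_0 block can be computed once and reused, and the masked partial sum maps well onto SIMD mask/blend operations.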

Checks performed so far:

  • test-quantization-fns passes
  • model behaves well
  • perplexity runs completed for 5x512 batches of wikitext-2-test (unpacked gguf as a reference, Bonsai 1.7B)
  • llama-bench runs for Bonsai 1.7B
  • verified that the generated assembly is efficient (no register spills, good pipeline pressure)
Benchmark results for Bonsai 1.7B

Benchmarks were performed with:

  • CPU: AMD Ryzen 5 7640HS (at 65w)
  • WSL VM
  • LPDDR5 @ 6400 MT/s (JEDEC timings)
  • Threads: 10
| Flow | pp 512 t/s | tg 128 t/s | Speedup |
|---|---|---|---|
| Initial* | 2.05 | 1.32 | 1.0x / 1.0x |
| Scalar | 13.07 | 9.38 | 6.4x / 7.1x |
| SSSE3 | 43.43 | 32.56 | 21.2x / 24.6x |
| AVX | 53.54 | 40.70 | 26.1x / 30.8x |
| AVX + F16C** | 73.87 | 45.94 | 36.0x / 34.7x |
| AVX2 + FMA | 131.03 | 73.85 | 63.9x / 55.9x |
| AVX512 | 137.75 | 76.91 | 67.1x / 58.2x |

"*": Results for the current mainline variant were extrapolated due to me being impatient
"**": F16C is also enabled for the AVX2/AVX512 rows and disabled for the earlier ones (to reflect CPU ISA generations)

Perplexity summary for Bonsai 1.7B
| Metric | Scalar | SSSE3 | AVX | AVX2 + FMA |
|---|---|---|---|---|
| Same top p | 99.451 ± 0.207 % | 99.059 ± 0.271 % | 99.373 ± 0.221 % | 99.686 ± 0.157 % |
| Mean KLD | 0.000213 ± 0.000008 | 0.000228 ± 0.000010 | 0.000235 ± 0.000010 | 0.000218 ± 0.000009 |
| Maximum KLD | 0.004783 | 0.004070 | 0.004658 | 0.005173 |
| 99.9% KLD | 0.002648 | 0.003666 | 0.003888 | 0.003778 |
| 99.0% KLD | 0.001295 | 0.001730 | 0.001676 | 0.001318 |
| Median KLD | 0.000129 | 0.000141 | 0.000143 | 0.000134 |
| 1.0% KLD | -0.000012 | -0.000009 | -0.000007 | -0.000006 |
| Minimum KLD | -0.000051 | -0.000040 | -0.000057 | -0.000045 |
| Mean Δp | 0.000 ± 0.009 % | 0.011 ± 0.010 % | 0.000 ± 0.010 % | 0.011 ± 0.010 % |
| Maximum Δp | 2.770 % | 2.917 % | 2.709 % | 3.366 % |
| 99.9% Δp | 1.851 % | 2.036 % | 2.166 % | 2.707 % |
| 99.0% Δp | 1.192 % | 1.359 % | 1.314 % | 1.268 % |
| 95.0% Δp | 0.486 % | 0.534 % | 0.540 % | 0.551 % |
| Median Δp | -0.000 % | 0.000 % | 0.000 % | 0.000 % |
| 5.0% Δp | -0.465 % | -0.558 % | -0.576 % | -0.494 % |
| 1.0% Δp | -1.020 % | -1.034 % | -1.099 % | -0.989 % |
| 0.1% Δp | -1.888 % | -1.412 % | -1.783 % | -1.675 % |
| Minimum Δp | -2.109 % | -1.823 % | -1.859 % | -2.133 % |
| RMS Δp | 0.334 ± 0.017 % | 0.360 ± 0.018 % | 0.362 ± 0.017 % | 0.364 ± 0.022 % |

Things still to be done (most likely not in this PR):

  • AVX512 implementation for Zen 4: I was unable to achieve meaningful improvements beyond the compiler's own optimizations, and it is unlikely to actually help on Zen 4 due to the problem shape, the currently small instruction footprint, and memory pressure
  • An implementation for Zen 5 or modern Xeons, as they have a faster AVX512 pipeline
  • Implementing branches for nrc==2, as it shows potential for further speedup (the pipeline is already pretty hot in terms of memory bandwidth); probably a next PR soon
  • Maybe some experiments beyond that (repack -> specialized mmvq/mmq; experimenting with scratch buffer configurations); I haven't found a good use for these so far — a plain 4x4 dot is promising, but still WIP
  • I have a RISC-V SBC with a vector size of 256 and fp/bf support (SpacemiT K1), so maybe a future PR for RISC-V SIMD (or even SpacemiT MMA?)

People who have also contributed

(other people who provided useful insights or experimented themselves)

AI usage disclosure

  • Was used for automating benchmarks, some of the tests and creating tables
  • Was NOT used to write any other text for PR or human interaction
  • Was used for prototyping and iteration (guided by me, final code was mostly manually refined and tested)

@pl752 pl752 marked this pull request as ready for review April 8, 2026 19:52
@pl752 pl752 requested a review from ggerganov as a code owner April 8, 2026 19:52
@pl752
Contributor Author

pl752 commented Apr 8, 2026

Aaand, we are live. Okay, reviews, requests, and questions are welcome.

@khosravipasha
Contributor

Tested this on an x86 CPU I have access to, an "AMD EPYC 7543 32-Core Processor" (it's in the cloud).

Before this PR it ran at <1 tok/s for the smallest model, so this is a decent speedup. Not sure how the speed compares with other quantization formats for models of similar size on CPU only; I have not actively tried them.

CPU Benchmarks (fa=1, CPU-only build)

| Model | Threads | pp512 (t/s) | tg128 (t/s) |
|---|---|---|---|
| Bonsai-1.7B | 4 | 65.0 ± 3.8 | 41.1 ± 1.2 |
| Bonsai-1.7B | 8 | 128.5 ± 6.5 | 52.2 ± 0.2 |
| Bonsai-1.7B | 10 | 153.1 ± 5.6 | 57.4 ± 3.0 |
| Bonsai-4B | 4 | 27.0 ± 1.8 | 20.0 ± 0.6 |
| Bonsai-4B | 8 | 50.0 ± 3.3 | 34.0 ± 0.6 |
| Bonsai-4B | 10 | 59.7 ± 2.1 | 34.8 ± 0.3 |
| Bonsai-8B | 4 | 14.9 ± 0.3 | 12.2 ± 0.2 |
| Bonsai-8B | 8 | 27.6 ± 1.1 | 20.4 ± 1.0 |
| Bonsai-8B | 10 | 33.9 ± 1.3 | 22.9 ± 0.5 |

KL divergence with unpacked version:

| Build | Model | Mean KLD | Same Top Token | Status |
|---|---|---|---|---|
| CPU | 1.7B | 0.000261 ± 0.000009 | 99.22% | PASS |
| CPU | 4B | 0.000214 ± 0.000014 | 99.14% | PASS |
| CPU | 8B | 0.000200 ± 0.000008 | 99.61% | PASS |

@pl752
Contributor Author

pl752 commented Apr 10, 2026

Sorry for the late commit. I noticed that I forgot to take the C = A×B + 0 → C = A×B shortcut in the AVX kernel for the first repeating block, as is done for AVX2. ~1% improvement in t/s, only for the AVX-era ISA. No accuracy/perplexity changes.
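For illustration, the shortcut amounts to peeling the first block so the accumulator is seeded by a plain multiply rather than being zero-initialized and then added into. A scalar sketch (the function name and shape here are hypothetical, not the kernel's real signature; in the AVX kernel the same idea replaces an add-of-multiply-into-zero with a bare `_mm256_mul_ps` for the first repeating block):

```c
// Peel the first iteration: seed the accumulator with the first product
// (C = A*B) instead of initializing it to zero and paying an extra add
// (C = A*B + 0) on the first block.
float dot_peeled(const float *a, const float *b, int n) {
    float acc = a[0] * b[0];      // first block: plain multiply, no add
    for (int i = 1; i < n; ++i) {
        acc += a[i] * b[i];       // remaining blocks: multiply-accumulate
    }
    return acc;
}
```

The result is identical to the zero-seeded loop; the saving is one redundant add per row, which only matters on targets without fused multiply-add (hence AVX-era only).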

@github-actions github-actions bot added the ggml changes relating to the ggml tensor library for machine learning label Apr 10, 2026
@pl752
Contributor Author

pl752 commented Apr 11, 2026

As I am still awaiting review from @ggerganov or somebody else, I decided to perform a small code cleanup.

@khosravipasha
Contributor

@pl752 maybe rename the PR to something simpler, e.g. "CPU: Q1_0 x86 optimizations"
(I saw a few more CPU PRs that were AI-generated and closed, so they might have missed this one)

@pl752 pl752 changed the title (Performance; ggml-cpu) Optimized x86 and generic cpu q1_0 dot (follow up) ggml-cpu: Optimized x86 and generic cpu q1_0 dot (follow up) Apr 12, 2026
@pl752
Contributor Author

pl752 commented Apr 12, 2026

Why not, actually; I have changed the name to the more usual style for this place. I think the staff are currently busy and will get here sooner or later; maybe we need to ping another reviewer who specializes in the CPU backend of ggml. For now I will switch to other projects, and maybe explore opportunities to optimize q1_0 CUDA, do some work towards RISC-V support (vector SIMD and SpacemiT IME extensions) for q1_0, or start working on a fork optimized for sm70 (Tesla V100), as they are pretty abundant on the Chinese second-hand market at a good price (and because I already have two). Also, the tile dot (standard nrc=2 and special larger kernels) will be further refined and explored for better ways of utilizing the hardware (let's hope the review will not take so long that I consider pushing it to the current PR).

@zcattacz

I can see why this should be low priority:

  • so much unfinished business in the desc... needs time to breathe
  • reads like a procedural optimization of a recently merged good-enough impl... heck, does it actually change anything?
  • tagged only the owner of a large active repo... who's usually the last gatekeeper

Hmm, we're aiming for the annual sweep.

@pl752
Contributor Author

pl752 commented Apr 13, 2026

I think I needed to point out more clearly that the current checkpoint is ready to be merged (in my opinion), as it already provides huge performance improvements over the current minimal viable implementation, and to tag some people who usually review this kind of change.

@CISC, @am17an — I'd be grateful if you could take a look at this PR.

Contributor

@am17an am17an left a comment


Sorry I forgot to press submit on my review.

Comment thread ggml/src/ggml-cpu/arch/x86/quants.c Outdated
@pl752
Contributor Author

pl752 commented Apr 14, 2026

@khosravipasha there is an old scalar implementation left in the code for ARM; I think it can be replaced with a call to the generic function, as is now done for x86. What do you think? (I am running a perplexity run in an emulator for now; it seems to work properly.)

@khosravipasha
Contributor

@am17an awesome thanks.

@pl752 What is the change exactly? It's not included here, right? How much is the improvement? Maybe best to do it in another PR, since this one only mentions x86.

@pl752
Contributor Author

pl752 commented Apr 14, 2026

@khosravipasha The PR mentions x86 and GENERIC; this, in my opinion, means the arch-agnostic generic implementation is part of it too, so for consistency it is logical to use it from the other implementations as well. As for the difference, I am unable to measure performance due to emulation, but I highly doubt that code which runs nearly an order of magnitude slower on x86 will behave relatively differently on ARM.

As for the change, it just replaces the whole else section for ARM with one similar to x86:

Patch
diff --git a/ggml/src/ggml-cpu/arch/arm/quants.c b/ggml/src/ggml-cpu/arch/arm/quants.c
index e09db59..cec5815 100644
--- a/ggml/src/ggml-cpu/arch/arm/quants.c
+++ b/ggml/src/ggml-cpu/arch/arm/quants.c
@@ -151,8 +151,6 @@ void ggml_vec_dot_q1_0_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const voi
     const block_q1_0 * GGML_RESTRICT x = vx;
     const block_q8_0 * GGML_RESTRICT y = vy;
 
-    float sumf = 0.0f;
-
 #if defined(__ARM_NEON)
     float32x4_t sumv = vdupq_n_f32(0.0f);
 
@@ -212,31 +210,13 @@ void ggml_vec_dot_q1_0_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const voi
         }
     }
 
-    sumf = vaddvq_f32(sumv);
+    *s = vaddvq_f32(sumv);
 #else
-    // Scalar fallback
-    for (int i = 0; i < nb; i++) {
-        const float d0 = GGML_FP16_TO_FP32(x[i].d);
-
-        // Process 4 Q8_0 blocks
-        for (int k = 0; k < 4; k++) {
-            const float d1 = GGML_FP16_TO_FP32(y[i*4 + k].d);
-
-            int sumi = 0;
-            for (int j = 0; j < QK8_0; j++) {
-                const int bit_index = k * QK8_0 + j;
-                const int byte_index = bit_index / 8;
-                const int bit_offset = bit_index % 8;
-
-                const int xi = ((x[i].qs[byte_index] >> bit_offset) & 1) ? 1 : -1;
-                sumi += xi * y[i*4 + k].qs[j];
-            }
-            sumf += d0 * d1 * sumi;
-        }
-    }
+    UNUSED(nb);
+    UNUSED(x);
+    UNUSED(y);
+    ggml_vec_dot_q1_0_q8_0_generic(n, s, bs, vx, bx, vy, by, nrc);
 #endif
-
-    *s = sumf;
 }
Which system are you using for benchmarking and validating the ARM implementation, and are you able to test it?

PS:

Perplexity for non-NEON
chunk             PPL               ln(PPL(Q)/PPL(base))          KL Divergence              Δp RMS            Same top p
   1      13.9305 ±    3.1730      -0.00190 ±    0.00229       0.00019 ±    0.00002     0.356 ±  0.064 %    100.000 ±  0.000 %
   2      20.1898 ±    3.4351       0.01389 ±    0.01155       0.00021 ±    0.00001     0.322 ±  0.037 %    99.608 ±  0.277 %
   3      20.8716 ±    2.7911       0.01009 ±    0.00771       0.00021 ±    0.00001     0.331 ±  0.026 %    99.216 ±  0.319 %
   4      21.2229 ±    2.3914       0.00747 ±    0.00581       0.00022 ±    0.00001     0.336 ±  0.021 %    99.314 ±  0.259 %
   5      21.0933 ±    2.1040       0.00594 ±    0.00468       0.00022 ±    0.00001     0.334 ±  0.018 %    99.451 ±  0.207 %

====== Perplexity statistics ======
Mean PPL(Q)                   :  21.093286 ±   2.103980
Mean PPL(base)                :  20.968387 ±   2.074795
Cor(ln(PPL(Q)), ln(PPL(base))):  99.89%
Mean ln(PPL(Q)/PPL(base))     :   0.005939 ±   0.004676
Mean PPL(Q)/PPL(base)         :   1.005957 ±   0.004704
Mean PPL(Q)-PPL(base)         :   0.124898 ±   0.101192

====== KL divergence statistics ======
Mean    KLD:   0.000217 ±   0.000009
Maximum KLD:   0.004590
99.9%   KLD:   0.002847
99.0%   KLD:   0.001380
95.0%   KLD:   0.000646
90.0%   KLD:   0.000488
Median  KLD:   0.000135
10.0%   KLD:   0.000002
 5.0%   KLD:   0.000000
 1.0%   KLD:  -0.000007
 0.1%   KLD:  -0.000032
Minimum KLD:  -0.000035

====== Token probability statistics ======
Mean    Δp:  0.009 ± 0.009 %
Maximum Δp:  3.273%
99.9%   Δp:  2.085%
99.0%   Δp:  1.149%
95.0%   Δp:  0.567%
90.0%   Δp:  0.296%
75.0%   Δp:  0.053%
Median  Δp: -0.000%
25.0%   Δp: -0.057%
10.0%   Δp: -0.312%
 5.0%   Δp: -0.493%
 1.0%   Δp: -0.934%
 0.1%   Δp: -1.448%
Minimum Δp: -1.827%
RMS Δp    :  0.334 ± 0.018 %
Same top p: 99.451 ± 0.207 %

@khosravipasha
Contributor

@pl752 Thanks for the clarification. I only have access to a Mac, which uses the NEON path; I have not tried other ARM CPUs.
I will leave it to the CPU experts whether to include this now or not (okay with me either way).

@pl752
Contributor Author

pl752 commented Apr 14, 2026

@khosravipasha you can test any path by disabling the native build and defining the used instruction sets manually:

Flags as for example for non-NEON
-DGGML_NATIVE=OFF \
-DGGML_CPU_ARM_ARCH=armv8-a+nosimd \
-DGGML_LLAMAFILE=OFF
UPD: the perplexity run is healthy for the NEON path too, so I will push; please notify me if any regressions emerge.

@pl752
Contributor Author

pl752 commented Apr 14, 2026

@am17an I have pushed the replacement of the ARM fallback path with the new generic one, for consistency with x86. No further changes are planned for this PR if nothing unexpected happens.

Contributor

@am17an am17an left a comment


Need someone else to approve to merge as well. @ggerganov?
