ggml-cpu: Optimized x86 and generic cpu q1_0 dot (follow up)#21636
pl752 wants to merge 7 commits into ggml-org:master from
Conversation
Tested this on an x86 CPU I have access to, an "AMD EPYC 7543 32-Core Processor" (it's in the cloud). Before this change it runs at <1 tok/s for the smallest model, so it's a decent speedup. I'm not sure how the speed compares with other quantization formats for models of similar size on CPU only; I have not actively tried them. CPU benchmarks (fa=1, CPU-only build)
KL divergence with unpacked version:
Sorry for the late commit. I noticed that I had forgotten to take the C=0; C=AxB+0 to C=AxB shortcut in the AVX kernel for the first repeating block, as is done for AVX2. ~1% improvement in
…uplicated generic fallback
As I am still awaiting a review from @ggerganov or somebody else, I decided to perform a small code cleanup.
@pl752 maybe rename the PR to something simpler: CPU: Q1_0 x86 optimizations
Why not, actually; I have changed the name to the more usual style for this repo. I think the staff are currently busy and will get here sooner or later; maybe we need to ping another reviewer who specializes in the cpu backend of ggml. For now I will just switch to other projects, and maybe explore opportunities to optimize q1_0 cuda, do some work towards Risc-V support (vector SIMD and spacemit ime extensions) for q1_0, or start working on a fork optimized for sm70 (tesla v100), as they are pretty abundant on the Chinese second-hand market at a pretty good price (and because I already have two). Also, the tile dot (standard nrc=2 and special larger kernels) will be further refined and explored for better ways of utilizing the hardware (let's hope the review will not take so much time that I would consider pushing it to the current PR).
I can see why this should be low priority:
Hmm, we're aiming for the annual sweep.
I think I needed to point out more clearly that the current checkpoint is, in my opinion, ready to be merged (as it already provides huge performance improvements over the current minimal viable implementation) and tag some people who usually review this kind of change (@CISC, @am17an). To the people I've tagged: I'd be grateful if you could take a look at this PR.
am17an
left a comment
Sorry I forgot to press submit on my review.
@khosravipasha there is an old scalar implementation left in the code for ARM; I think it can be replaced with a call to the generic func, like on x86 now. What do you think? (I am running a perplexity run in an emulator for now, and it seems to work properly)
@khosravipasha the PR mentions x86 and GENERIC; this, in my opinion, means that the arch-agnostic generic implementation is part of it too, so for consistency it is logical to include it in the other implementations as well. As for the difference, I am unable to say anything about performance due to emulation, but I highly doubt that code which runs nearly an order of magnitude slower on x86 will behave much differently on ARM. The change is just replacing the whole else section for ARM with one similar to x86. Patch
PS: Perplexity for non-NEON
@pl752 Thanks for the clarification. I only have access to a Mac, which is the NEON path; I have not tried it on other ARM CPUs.
@khosravipasha you can test any path by disabling the native build and defining the used instruction sets manually. Example flags for non-NEON
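On the x86 side, such a build might look like the sketch below (flag names assume upstream llama.cpp's GGML_* CMake options; for a non-NEON ARM build the instruction sets are instead selected via compiler architecture flags, so treat this as an x86 illustration only):

```shell
# Sketch: build an AVX-only (no AVX2/FMA/F16C/AVX512) CPU path for testing,
# with native ISA detection disabled. Flag names assume upstream llama.cpp;
# adjust for your tree.
cmake -B build \
    -DGGML_NATIVE=OFF \
    -DGGML_AVX=ON \
    -DGGML_AVX2=OFF \
    -DGGML_FMA=OFF \
    -DGGML_F16C=OFF \
    -DGGML_AVX512=OFF
cmake --build build --config Release
```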
@am17an I have pushed the replacement of the fallback path for ARM with the new generic one, for consistency with x86. For now, no further changes are planned for this PR, if nothing unexpected happens.
am17an
left a comment
Need someone else to approve to merge as well. @ggerganov?
Hello, I have prepared an optimized implementation of the cpu q1_0 dot product (mainly for Bonsai LLM models). This is a continuation of the PrismML-Eng#10 PR; the list of experiments conducted and some other benchmark results can be found there.
This PR implements:
Checks performed so far:
Benchmark results for Bonsai 1.7B
Benchmarks were performed with:
[Table: pp 512 t/s and tg 128 t/s for SSSE3, AVX, AVX+F16C**, AVX2+FMA, AVX512]
"*": Results for the current mainline variant were extrapolated due to me being impatient
"**": F16C is enabled for AVX2/512 too and disabled previously (to reflect cpu ISA generations)
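The pp/tg columns above correspond to an invocation roughly like the following (a sketch assuming upstream llama-bench flags; the model filename is a placeholder):

```shell
# pp 512 = prompt processing of 512 tokens, tg 128 = generating 128 tokens,
# flash attention enabled; the model path below is a placeholder
./llama-bench -m ./bonsai-1.7b-q1_0.gguf -p 512 -n 128 -fa 1
```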
Perplexity summary for Bonsai 1.7B
[Table: perplexity for SSSE3, AVX, AVX2+FMA]
Things still to be done (most likely not this PR):
- AVX512 implementation (I was unable to achieve meaningful improvements aside from opts from the compiler) for Zen 4: pretty unlikely to be actually helpful on Zen 4 due to the problem shape, currently small instruction length, and memory pressure
- nrc==2, as it shows potential for further speedup (the pipeline is already pretty hot in terms of memory bandwidth); next (?) PR soon, probably
- Maybe some experiments outside (repack -> specialized mmvq/mmq; experimenting with scratch buffer configurations) (haven't found a good use for it as of now; plain 4x4 dot is promising, but still WIP)
People who have also contributed
(other people who provided useful insights or experimented themselves)
AI usage disclosure