Quantized SDPA by CC-Yeh · Pull Request #3026 · ml-explore/mlx

CC-Yeh · 2026-01-20T22:18:09Z

Proposed changes

Add Metal quantized SDPA vector kernels based on #1515

Speedup vs fp16

TODO:

Support Affine and NVFP4
Adapt Faster two pass sdpa #3023 once it's merged.
Cleanup

What improve performance:

Removed thread storage k, v to reduce register pressure (was waiting on synchronization).
Fused computation with dequantization
Tuned reading size ('uint16_t'/'uin32_t') for quantized k/v
Manual unroll better than clang loop optimizer

Checklist

I have read the CONTRIBUTING document
I have run pre-commit run --all-files to format my code / installed pre-commit prior to committing changes
I have added tests that prove my fix is effective or that my feature works
I have updated the necessary documentation (if needed)

awni · 2026-01-21T00:44:14Z

The numbers seem quite good.. a little too good to be true 😅

What's the difference between SDPA and Attention in the benchmark? Also what's the query sequence length used for the benchmark?

CC-Yeh · 2026-01-21T01:32:35Z

The numbers seem quite good.. a little too good to be true 😅

Totally agree, must be missing something 🤔

What's the difference between SDPA and Attention in the benchmark? Also what's the query sequence length used for the benchmark?

Attention is a simple reference implementation built from matmul + softmax + matmul (Maybe too naive?).
SDPA uses mx.fast.scaled_dot_product_attention, which hits the sdpa_vector_2pass kernels when Lq ≤ 8 (this case).

The query sequence length here is 1 (q.shape = (1, 32, 1, 128)), so this benchmark is measuring the single-token decode case, where one new token attends to a long KV cache (L = 32768).

CC-Yeh · 2026-01-21T02:01:41Z

@awni
Fixed some bugs in dequantizing 8bit and benchmark(unneccessary dequantization steps).
Now the numbers make more sense 😃

awni · 2026-01-21T02:51:25Z

So if I’m understanding correctly the fused implementation is slower in the quantized case than the unfused ops-based one?

CC-Yeh · 2026-01-21T03:07:03Z

Fused SDPA is faster: MXFP4 15.33 ms vs 24.71 ms, and MXFP8 26.09 ms vs 46.48 ms to decode a single query.

awni · 2026-01-21T03:11:37Z

Very nice!!

awni · 2026-01-21T14:23:32Z

mlx/fast.cpp

+  if (qmode == QuantizationMode::Nvfp4) {
+    throw std::invalid_argument(
+        "[quantized_scaled_dot_product_attention] Mode 'nvfp4' is not supported for fast attention.");
+  }


Why not nvfp4?

It’s on the way! I just wanted to make sure the PR structure was okay first.

Added support

awni · 2026-01-21T14:23:59Z

mlx/fast.cpp

+  if (qmode == QuantizationMode::Affine) {
+    throw std::invalid_argument(
+        "[quantized_scaled_dot_product_attention] Only fp quantization modes are supported.");
+  }


Why not affine?

Btw not suggesting we necessarily do it. Maybe it's better to be more limited in the quants we support here. Maybe fp8, fp4 are fine to start?

For example I don't think it's necessary to support every bit width because in practice no-one will ever use 2, 3 for KV cache quantization.

Added initial support, still has more room for tuning bit 2/3/5/6

awni · 2026-01-27T14:41:57Z

@CC-Yeh I'm interested in this PR moving forward. Let me know if you have questions. Also no need to support everything on a first pass. I think doing one 8-bit (fp8 / int8) quant well for Metal / CUDA is already probably good enough to start.

CC-Yeh · 2026-01-29T22:21:35Z

@awni

I’ve added the Metal paths for mxfp4/8, nvfp4, and affine(2/3/4/5/6/8) (affine is not optimized).
Further tuning likely needs validation on other machines.

For the CUDA path (maybe next PR), Colab doesn’t support NVFP4, so would need help for that.

awni · 2026-01-29T22:24:25Z

affine(2/3/4/5/6/8)

What group sizes did you do for that? I"m not convinced we need broad support for bitwidth X group size. I expect bits < 4 to be used rarely if ever.

CC-Yeh · 2026-01-29T22:35:34Z

affine(2/3/4/5/6/8)

What group sizes did you do for that? I"m not convinced we need broad support there. I expect < 4 to be used rarely.

What group sizes do you think we should support for affine? Currently it's templated so it can handle various
sizes, but I can limit the instantiations if there's a specific set that's practical.

template <typename T, int D, QuantMode mode, int group_size, int bits>
[[kernel]] void quant_sdpa_vector_2pass_1(

awni · 2026-01-29T22:37:52Z

Yes totally. I think it's good to keep it generic. But probably better to limit initial support and grow than vice versa.

I would maybe start with bits = {4, 6, 8} and just group_size = 32. I think 32 is most flexible for the head dimension right?

CC-Yeh · 2026-01-30T23:04:11Z

Yes totally. I think it's good to keep it generic. But probably better to limit initial support and grow than vice versa.

I would maybe start with bits = {4, 6, 8} and just group_size = 32. I think 32 is most flexible for the head dimension right?

Limited the affine support.

Yeah, 32 is most flexible for head dim.

…nary size)

CC-Yeh · 2026-02-07T13:48:25Z

Hey @awni

Just fine-tuned the block sizes and GQA factors, and switched from template kernels to a function_constant approach to trade some cold-start latency for reduced binary size. Ready for review!

altaic · 2026-02-07T19:24:56Z

mlx/backend/metal/scaled_dot_product_attention.cpp

  kname += "_";
  kname += std::to_string(q.shape(-1));
+  kname += "_";
+  kname += std::to_string(q.shape(-1));


I think you meant v.shape(-1) here?

Yeah, in 2 pass kernels both values are the same.

CC-Yeh force-pushed the quantized_sdpa branch from b64b7dc to 11b24f5 Compare January 20, 2026 22:23

CC-Yeh force-pushed the quantized_sdpa branch from 11b24f5 to 640ec94 Compare January 21, 2026 01:29

awni reviewed Jan 21, 2026

View reviewed changes

CC-Yeh force-pushed the quantized_sdpa branch from f3dc49d to 5af4060 Compare January 29, 2026 18:10

CC-Yeh changed the title ~~[WIP] Quantized SDPA~~ Quantized SDPA Jan 29, 2026

CC-Yeh marked this pull request as ready for review January 29, 2026 22:04

CC-Yeh requested a review from awni January 29, 2026 22:22

CC-Yeh force-pushed the quantized_sdpa branch from 3bc3e28 to c72fad9 Compare January 29, 2026 22:27

CC-Yeh added 9 commits February 7, 2026 21:15

first attempt

0e37a5b

fix

25287b4

Unify mxfp4/8 paths and optimize mxfp8 fused calculation

6879713

supports nvpf4

1dfc65d

supports affine 4/8 bits

a27d6d6

supports affine 2/3/5/6 bits

fa35d19

clean up

0481e19

adapt ml-explore#3023

49ae04c

Limit affine SDPA to group_size=32 and bits={4,6,8}

bd60a70

CC-Yeh added 4 commits February 7, 2026 21:15

fix group_size for nvfp4 and simplify code

cd874c6

refactor: use function_constant to reduce template specializations(bi…

e58230f

…nary size)

tune blocks

0269843

fix

d692162

CC-Yeh force-pushed the quantized_sdpa branch from 75231ee to d692162 Compare February 7, 2026 13:43

cleanup

4a25689

CC-Yeh force-pushed the quantized_sdpa branch from 1e8bfc1 to 4a25689 Compare February 7, 2026 15:44

CC-Yeh added 3 commits February 8, 2026 00:16

cleanup + refactor

c52f2ab

enable causal

a944812

support sinks

106ef6b

altaic reviewed Feb 7, 2026

View reviewed changes

cleanup

48650a4

CC-Yeh force-pushed the quantized_sdpa branch from 0046b95 to 48650a4 Compare February 8, 2026 03:19

cleanup

6291e80

Comments

Conversation

CC-Yeh commented Jan 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Proposed changes

Speedup vs fp16

Checklist

Uh oh!

awni commented Jan 21, 2026

Uh oh!

CC-Yeh commented Jan 21, 2026

Uh oh!

CC-Yeh commented Jan 21, 2026

Uh oh!

awni commented Jan 21, 2026

Uh oh!

CC-Yeh commented Jan 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

awni commented Jan 21, 2026

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

awni commented Jan 27, 2026

Uh oh!

CC-Yeh commented Jan 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

awni commented Jan 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

CC-Yeh commented Jan 29, 2026

Uh oh!

awni commented Jan 29, 2026

Uh oh!

CC-Yeh commented Jan 30, 2026

Uh oh!

CC-Yeh commented Feb 7, 2026

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

CC-Yeh commented Jan 20, 2026 •

edited

Loading

CC-Yeh commented Jan 21, 2026 •

edited

Loading

CC-Yeh commented Jan 29, 2026 •

edited

Loading

awni commented Jan 29, 2026 •

edited

Loading