cranelift(aarch64): share base+index across offset loads via uimm12 amode#13766
cranelift(aarch64): share base+index across offset loads via uimm12 amode#13766darmie wants to merge 3 commits into
Conversation
…mode When several loads share `base + (extended index)` and differ only by a uimm12-scaled offset, keep `base + index` as one CSE-able value and fold each offset into the load's immediate, so a single `add base, index, uxtw` is reused instead of a per-load `add base, #offset`. A zero offset keeps the single-instruction RegExtended form.
The amode CSE folds constant offsets into load/store immediates and keeps base+index as one add. Single-use cases are neutral (offset moves into the immediate); debug-exceptions sheds 2 instructions by reusing base+index across four offsets.
|
Thanks for this! I'm wary of the duplication being added here though, and amode rules are some of the most sensitive in Cranelift so we ideally want as little duplication as possible. For example the offset=0 rules are duplicates of above rules, and even using |
Address review feedback: the offset==0 rules only existed to beat the new UnsignedOffset rules for a zero offset, duplicating rules 4/5. Replace them with a `uimm12_scaled_nonzero_from_i64` guard on rules 8/9 so a zero offset falls through to the existing RegExtended rules instead. The UnsignedOffset rules simply take higher priority than 4/5 for non-zero offsets. Codegen and all aarch64 filetests are unchanged.
I have removed the duplication. Also added |
Subscribe to Label ActionDetailsThis issue or pull request has been labeled: "cranelift", "cranelift:area:aarch64", "isle"Thus the following users have been cc'd because of the following labels:
To subscribe or unsubscribe from this label, edit the |
|
This helps remove some duplication, yeah, but this is still effectively a duplication of rule (2), however, and it's also a duplication of 3/4/5 in terms of matching pattern. My thinking is that what you want can be achieved by raising the priority of rule 2 because the premise here is that the static offset should have the highest priority for getting folded into the instruction itself while leaving all other computations in a CSE-able form. |
When several loads share
base + (extended index)and differ only by auimm12-scaled offset (e.g. fields read through a computed index), AArch64 amode
selection emitted a separate
add base, #offsetper load, defeating CSE of theshared
base + index.This folds the offset into the load's immediate and keeps
base + indexas onevalue, so a single
add base, index, uxtwis materialized and reused. A zerooffset keeps the single-instruction
RegExtendedform, so it isneutral-or-better: a cluster of N such loads drops N address-adds to 1, a lone
load is unchanged.
Tested: new aarch64 precise-output filetest + a stack round-trip runtest
(aarch64/x86_64/s390x); the full aarch64 filetest and runtest suites pass.