[FEM.Elastic] Cache stiffness matrices before computations#6069

Open
fredroy wants to merge 2 commits into sofa-framework:master from fredroy:femelasticity_cachestiffmat

fredroy commented Apr 6, 2026

[with-all-tests]

Based on

Cache the assembled stiffness matrices in a flat, contiguous vector.
-> Cache-friendly, so it should be faster
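
A minimal sketch of the idea, not the code from this PR: `ElementStiffnessSketch`, `cacheAssembledMatrices`, and `applyElement` are hypothetical names, and the real `FactorizedElementStiffness` layout is not reproduced here. It copies the assembled matrices into one flat vector once, so the matrix-free product can then read them back to back.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the per-element data (NOT the SOFA type).
template <std::size_t NDofs>
struct ElementStiffnessSketch
{
    std::array<double, NDofs * NDofs> factorized{}; // placeholder for the factorized form
    std::array<double, NDofs * NDofs> assembled{};  // dense row-major element matrix
};

// Copy every assembled matrix into one flat, contiguous buffer (done once, up front).
template <std::size_t NDofs>
std::vector<double> cacheAssembledMatrices(
    const std::vector<ElementStiffnessSketch<NDofs>>& elements)
{
    std::vector<double> flat;
    flat.reserve(elements.size() * NDofs * NDofs);
    for (const auto& element : elements)
        flat.insert(flat.end(), element.assembled.begin(), element.assembled.end());
    return flat;
}

// y = K_e * x for element e, reading K_e straight from the flat buffer.
template <std::size_t NDofs>
std::array<double, NDofs> applyElement(
    const std::vector<double>& flat, std::size_t e, const std::array<double, NDofs>& x)
{
    assert((e + 1) * NDofs * NDofs <= flat.size());
    const double* K = flat.data() + e * NDofs * NDofs;
    std::array<double, NDofs> y{};
    for (std::size_t i = 0; i < NDofs; ++i)
        for (std::size_t j = 0; j < NDofs; ++j)
            y[i] += K[i * NDofs + j] * x[j];
    return y;
}
```

The buffer is built once and only read afterwards, so the hot loop touches nothing but the matrix entries and the input vector.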

Results depend on the OS/CPU:

  • nothing really changed for the assembled versions
  • matrixfree is about 1.25x to 4.4x faster in the benchmarks below, depending on MT and/or CPU architecture

On macOS the speedup is quite high (~3x faster), both sequential and parallel, for both hexa and tetra,
but on Linux/Intel the speedup only shows up for hexa (???) 🤔

Benches: (on Validation/cantilever_beam with 10x10x60 grid)

macOS on M3 Pro (12 P + 4 E cores)

before:

examples/Validation/cantilever_beam/tetrahedron/linear/parallel/matrixfree/TetrahedronLinearSmallStrainFEMForceField.scn
[INFO]    [BatchGUI] 100 iterations done in 36.3091 s ( 2.75413 FPS).
examples/Validation/cantilever_beam/tetrahedron/linear/parallel/assembled/TetrahedronLinearSmallStrainFEMForceField.scn
[INFO]    [BatchGUI] 100 iterations done in 38.253 s ( 2.61417 FPS).
examples/Validation/cantilever_beam/tetrahedron/linear/sequential/matrixfree/TetrahedronLinearSmallStrainFEMForceField.scn
[INFO]    [BatchGUI] 100 iterations done in 58.5249 s ( 1.70868 FPS).
examples/Validation/cantilever_beam/tetrahedron/linear/sequential/assembled/TetrahedronLinearSmallStrainFEMForceField.scn
[INFO]    [BatchGUI] 100 iterations done in 38.3411 s ( 2.60816 FPS).
examples/Validation/cantilever_beam/tetrahedron/corotational/parallel/matrixfree/TetrahedronCorotationalFEMForceField.scn
[INFO]    [BatchGUI] 100 iterations done in 32.5673 s ( 3.07056 FPS).
examples/Validation/cantilever_beam/tetrahedron/corotational/parallel/assembled/TetrahedronCorotationalFEMForceField.scn
[INFO]    [BatchGUI] 100 iterations done in 38.1398 s ( 2.62194 FPS).
examples/Validation/cantilever_beam/tetrahedron/corotational/sequential/matrixfree/TetrahedronCorotationalFEMForceField.scn
[INFO]    [BatchGUI] 100 iterations done in 66.6544 s ( 1.50028 FPS).
examples/Validation/cantilever_beam/tetrahedron/corotational/sequential/assembled/TetrahedronCorotationalFEMForceField.scn
[INFO]    [BatchGUI] 100 iterations done in 38.6166 s ( 2.58956 FPS).
examples/Validation/cantilever_beam/hexahedron/linear/parallel/matrixfree/HexahedronLinearSmallStrainFEMForceField.scn
[INFO]    [BatchGUI] 100 iterations done in 24.7706 s ( 4.03704 FPS).
examples/Validation/cantilever_beam/hexahedron/linear/parallel/assembled/HexahedronLinearSmallStrainFEMForceField.scn
[INFO]    [BatchGUI] 100 iterations done in 41.7348 s ( 2.39608 FPS).
examples/Validation/cantilever_beam/hexahedron/linear/sequential/matrixfree/HexahedronLinearSmallStrainFEMForceField.scn
[INFO]    [BatchGUI] 100 iterations done in 34.9136 s ( 2.86421 FPS).
examples/Validation/cantilever_beam/hexahedron/linear/sequential/assembled/HexahedronLinearSmallStrainFEMForceField.scn
[INFO]    [BatchGUI] 100 iterations done in 42.795 s ( 2.33672 FPS).
examples/Validation/cantilever_beam/hexahedron/corotational/parallel/matrixfree/HexahedronCorotationalFEMForceField.scn
[INFO]    [BatchGUI] 100 iterations done in 23.5781 s ( 4.24123 FPS).
examples/Validation/cantilever_beam/hexahedron/corotational/parallel/assembled/HexahedronCorotationalFEMForceField.scn
[INFO]    [BatchGUI] 100 iterations done in 41.41 s ( 2.41487 FPS).
examples/Validation/cantilever_beam/hexahedron/corotational/sequential/matrixfree/HexahedronCorotationalFEMForceField.scn
[INFO]    [BatchGUI] 100 iterations done in 32.6968 s ( 3.0584 FPS).
examples/Validation/cantilever_beam/hexahedron/corotational/sequential/assembled/HexahedronCorotationalFEMForceField.scn
[INFO]    [BatchGUI] 100 iterations done in 41.7608 s ( 2.39459 FPS).

after:

examples/Validation/cantilever_beam/tetrahedron/linear/parallel/matrixfree/TetrahedronLinearSmallStrainFEMForceField.scn
[INFO]    [BatchGUI] 100 iterations done in 10.601 s ( 9.43308 FPS).
examples/Validation/cantilever_beam/tetrahedron/linear/parallel/assembled/TetrahedronLinearSmallStrainFEMForceField.scn
[INFO]    [BatchGUI] 100 iterations done in 37.6018 s ( 2.65945 FPS).
examples/Validation/cantilever_beam/tetrahedron/linear/sequential/matrixfree/TetrahedronLinearSmallStrainFEMForceField.scn
[INFO]    [BatchGUI] 100 iterations done in 18.0416 s ( 5.54276 FPS).
examples/Validation/cantilever_beam/tetrahedron/linear/sequential/assembled/TetrahedronLinearSmallStrainFEMForceField.scn
[INFO]    [BatchGUI] 100 iterations done in 37.8509 s ( 2.64194 FPS).
examples/Validation/cantilever_beam/tetrahedron/corotational/parallel/matrixfree/TetrahedronCorotationalFEMForceField.scn
[INFO]    [BatchGUI] 100 iterations done in 10.9561 s ( 9.12732 FPS).
examples/Validation/cantilever_beam/tetrahedron/corotational/parallel/assembled/TetrahedronCorotationalFEMForceField.scn
[INFO]    [BatchGUI] 100 iterations done in 37.7793 s ( 2.64695 FPS).
examples/Validation/cantilever_beam/tetrahedron/corotational/sequential/matrixfree/TetrahedronCorotationalFEMForceField.scn
[INFO]    [BatchGUI] 100 iterations done in 22.2861 s ( 4.48711 FPS).
examples/Validation/cantilever_beam/tetrahedron/corotational/sequential/assembled/TetrahedronCorotationalFEMForceField.scn
[INFO]    [BatchGUI] 100 iterations done in 37.8271 s ( 2.64361 FPS).
examples/Validation/cantilever_beam/hexahedron/linear/parallel/matrixfree/HexahedronLinearSmallStrainFEMForceField.scn
[INFO]    [BatchGUI] 100 iterations done in 5.62417 s ( 17.7804 FPS).
examples/Validation/cantilever_beam/hexahedron/linear/parallel/assembled/HexahedronLinearSmallStrainFEMForceField.scn
[INFO]    [BatchGUI] 100 iterations done in 40.9311 s ( 2.44313 FPS).
examples/Validation/cantilever_beam/hexahedron/linear/sequential/matrixfree/HexahedronLinearSmallStrainFEMForceField.scn
[INFO]    [BatchGUI] 100 iterations done in 13.0542 s ( 7.66039 FPS).
examples/Validation/cantilever_beam/hexahedron/linear/sequential/assembled/HexahedronLinearSmallStrainFEMForceField.scn
[INFO]    [BatchGUI] 100 iterations done in 40.8362 s ( 2.44881 FPS).
examples/Validation/cantilever_beam/hexahedron/corotational/parallel/matrixfree/HexahedronCorotationalFEMForceField.scn
[INFO]    [BatchGUI] 100 iterations done in 5.91004 s ( 16.9204 FPS).
examples/Validation/cantilever_beam/hexahedron/corotational/parallel/assembled/HexahedronCorotationalFEMForceField.scn
[INFO]    [BatchGUI] 100 iterations done in 41.0514 s ( 2.43597 FPS).
examples/Validation/cantilever_beam/hexahedron/corotational/sequential/matrixfree/HexahedronCorotationalFEMForceField.scn
[INFO]    [BatchGUI] 100 iterations done in 15.0846 s ( 6.62926 FPS).
examples/Validation/cantilever_beam/hexahedron/corotational/sequential/assembled/HexahedronCorotationalFEMForceField.scn
[INFO]    [BatchGUI] 100 iterations done in 40.9415 s ( 2.44251 FPS).

Linux on i7-13700K (8 P + 8 E cores)

before:

examples/Validation/cantilever_beam/hexahedron/corotational/sequential/matrixfree/HexahedronCorotationalFEMForceField.scn
[INFO]    [BatchGUI] 100 iterations done in 41.3984 s ( 2.41555 FPS).  
examples/Validation/cantilever_beam/hexahedron/corotational/sequential/assembled/HexahedronCorotationalFEMForceField.scn
[INFO]    [BatchGUI] 100 iterations done in 42.7336 s ( 2.34008 FPS).  
examples/Validation/cantilever_beam/hexahedron/corotational/parallel/matrixfree/HexahedronCorotationalFEMForceField.scn
[INFO]    [BatchGUI] 100 iterations done in 7.89221 s ( 12.6707 FPS).  
examples/Validation/cantilever_beam/hexahedron/corotational/parallel/assembled/HexahedronCorotationalFEMForceField.scn
[INFO]    [BatchGUI] 100 iterations done in 43.0416 s ( 2.32333 FPS).  
examples/Validation/cantilever_beam/hexahedron/linear/sequential/matrixfree/HexahedronLinearSmallStrainFEMForceField.scn
[INFO]    [BatchGUI] 100 iterations done in 38.1027 s ( 2.62449 FPS).  
examples/Validation/cantilever_beam/hexahedron/linear/sequential/assembled/HexahedronLinearSmallStrainFEMForceField.scn
[INFO]    [BatchGUI] 100 iterations done in 42.7734 s ( 2.3379 FPS).  
examples/Validation/cantilever_beam/hexahedron/linear/parallel/matrixfree/HexahedronLinearSmallStrainFEMForceField.scn
[INFO]    [BatchGUI] 100 iterations done in 7.83321 s ( 12.7662 FPS).  
examples/Validation/cantilever_beam/hexahedron/linear/parallel/assembled/HexahedronLinearSmallStrainFEMForceField.scn
[INFO]    [BatchGUI] 100 iterations done in 42.8358 s ( 2.33449 FPS).  
examples/Validation/cantilever_beam/tetrahedron/corotational/sequential/matrixfree/TetrahedronCorotationalFEMForceField.scn
[INFO]    [BatchGUI] 100 iterations done in 55.3949 s ( 1.80522 FPS).  
examples/Validation/cantilever_beam/tetrahedron/corotational/sequential/assembled/TetrahedronCorotationalFEMForceField.scn
[INFO]    [BatchGUI] 100 iterations done in 40.1035 s ( 2.49355 FPS).  
examples/Validation/cantilever_beam/tetrahedron/corotational/parallel/matrixfree/TetrahedronCorotationalFEMForceField.scn
[INFO]    [BatchGUI] 100 iterations done in 12.6961 s ( 7.87642 FPS).  
examples/Validation/cantilever_beam/tetrahedron/corotational/parallel/assembled/TetrahedronCorotationalFEMForceField.scn
[INFO]    [BatchGUI] 100 iterations done in 39.7326 s ( 2.51682 FPS).  
examples/Validation/cantilever_beam/tetrahedron/linear/sequential/matrixfree/TetrahedronLinearSmallStrainFEMForceField.scn
[INFO]    [BatchGUI] 100 iterations done in 50.079 s ( 1.99684 FPS).  
examples/Validation/cantilever_beam/tetrahedron/linear/sequential/assembled/TetrahedronLinearSmallStrainFEMForceField.scn
[INFO]    [BatchGUI] 100 iterations done in 39.7847 s ( 2.51353 FPS).  
examples/Validation/cantilever_beam/tetrahedron/linear/parallel/matrixfree/TetrahedronLinearSmallStrainFEMForceField.scn
[INFO]    [BatchGUI] 100 iterations done in 11.6404 s ( 8.59079 FPS).  
examples/Validation/cantilever_beam/tetrahedron/linear/parallel/assembled/TetrahedronLinearSmallStrainFEMForceField.scn
[INFO]    [BatchGUI] 100 iterations done in 39.5339 s ( 2.52947 FPS). 

after:

./../../src/sandbox/examples/Validation/cantilever_beam/hexahedron/corotational/sequential/matrixfree/HexahedronCorotationalFEMForceField.scn
[INFO]    [BatchGUI] 100 iterations done in 22.2626 s ( 4.49183 FPS).  
examples/Validation/cantilever_beam/hexahedron/corotational/sequential/assembled/HexahedronCorotationalFEMForceField.scn
[INFO]    [BatchGUI] 100 iterations done in 42.9017 s ( 2.33091 FPS).  
examples/Validation/cantilever_beam/hexahedron/corotational/parallel/matrixfree/HexahedronCorotationalFEMForceField.scn
[INFO]    [BatchGUI] 100 iterations done in 6.31693 s ( 15.8305 FPS).  
examples/Validation/cantilever_beam/hexahedron/corotational/parallel/assembled/HexahedronCorotationalFEMForceField.scn
[INFO]    [BatchGUI] 100 iterations done in 42.9067 s ( 2.33064 FPS).  
examples/Validation/cantilever_beam/hexahedron/linear/sequential/matrixfree/HexahedronLinearSmallStrainFEMForceField.scn
[INFO]    [BatchGUI] 100 iterations done in 21.1257 s ( 4.73356 FPS).  
examples/Validation/cantilever_beam/hexahedron/linear/sequential/assembled/HexahedronLinearSmallStrainFEMForceField.scn
[INFO]    [BatchGUI] 100 iterations done in 42.7848 s ( 2.33728 FPS).  
examples/Validation/cantilever_beam/hexahedron/linear/parallel/matrixfree/HexahedronLinearSmallStrainFEMForceField.scn
[INFO]    [BatchGUI] 100 iterations done in 6.24967 s ( 16.0008 FPS).  
examples/Validation/cantilever_beam/hexahedron/linear/parallel/assembled/HexahedronLinearSmallStrainFEMForceField.scn
[INFO]    [BatchGUI] 100 iterations done in 42.9491 s ( 2.32834 FPS).  
examples/Validation/cantilever_beam/tetrahedron/corotational/sequential/matrixfree/TetrahedronCorotationalFEMForceField.scn
[INFO]    [BatchGUI] 100 iterations done in 54.303 s ( 1.84152 FPS).  
examples/Validation/cantilever_beam/tetrahedron/corotational/sequential/assembled/TetrahedronCorotationalFEMForceField.scn
[INFO]    [BatchGUI] 100 iterations done in 40.3483 s ( 2.47842 FPS).  
examples/Validation/cantilever_beam/tetrahedron/corotational/parallel/matrixfree/TetrahedronCorotationalFEMForceField.scn
[INFO]    [BatchGUI] 100 iterations done in 13.21 s ( 7.57002 FPS).  
examples/Validation/cantilever_beam/tetrahedron/corotational/parallel/assembled/TetrahedronCorotationalFEMForceField.scn
[INFO]    [BatchGUI] 100 iterations done in 39.6303 s ( 2.52332 FPS).  
examples/Validation/cantilever_beam/tetrahedron/linear/sequential/matrixfree/TetrahedronLinearSmallStrainFEMForceField.scn
[INFO]    [BatchGUI] 100 iterations done in 46.3272 s ( 2.15856 FPS).  
examples/Validation/cantilever_beam/tetrahedron/linear/sequential/assembled/TetrahedronLinearSmallStrainFEMForceField.scn
[INFO]    [BatchGUI] 100 iterations done in 39.8387 s ( 2.51012 FPS).  
examples/Validation/cantilever_beam/tetrahedron/linear/parallel/matrixfree/TetrahedronLinearSmallStrainFEMForceField.scn
[INFO]    [BatchGUI] 100 iterations done in 12.6507 s ( 7.90468 FPS).  
examples/Validation/cantilever_beam/tetrahedron/linear/parallel/assembled/TetrahedronLinearSmallStrainFEMForceField.scn
[INFO]    [BatchGUI] 100 iterations done in 39.7568 s ( 2.51529 FPS).
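
The speedup factors can be recomputed from the logs above as time before divided by time after. The sketch below does this for a subset of the matrix-free runs; `Run`, `speedup`, and `matrixFreeRuns` are hypothetical helper names, and the timings are copied from the BatchGUI lines.

```cpp
#include <cassert>
#include <string>
#include <vector>

// One matrix-free run: seconds for 100 iterations, before and after the change.
struct Run
{
    std::string label;
    double secondsBefore;
    double secondsAfter;
};

// Speedup factor: a value above 1 means the cached version is faster.
inline double speedup(const Run& run)
{
    assert(run.secondsBefore > 0.0 && run.secondsAfter > 0.0);
    return run.secondsBefore / run.secondsAfter;
}

// A subset of the matrix-free timings reported in the logs.
inline std::vector<Run> matrixFreeRuns()
{
    return {
        {"macOS tetra linear parallel",   36.3091, 10.601},
        {"macOS tetra linear sequential", 58.5249, 18.0416},
        {"macOS hexa linear parallel",    24.7706, 5.62417},
        {"macOS hexa corot sequential",   32.6968, 15.0846},
        {"Linux hexa linear parallel",    7.83321, 6.24967},
        {"Linux hexa corot sequential",   41.3984, 22.2626},
        {"Linux tetra linear parallel",   11.6404, 12.6507},
        {"Linux tetra corot sequential",  55.3949, 54.303},
    };
}
```

Over all sixteen matrix-free pairs in the logs, this ratio is about 2.2x to 4.4x on macOS, about 1.25x to 1.86x for hexahedra on Linux, and between about 0.92x and 1.08x for tetrahedra on Linux.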

Explanation by Claude of why the results differ between CPUs (seems plausible 🫠)

Why the contiguous buffer optimization helps more on M3 Max than i7-13700K

  1. Memory access pattern is the key

  The optimization replaces indirect access through FactorizedElementStiffness (which contains both the factorized form and the assembled matrix, plus other metadata) with direct access to a
  contiguous array of just the assembled matrices. The benefit is data locality: instead of striding through large structs to reach the matrix you need, you're reading densely-packed matrices
  sequentially.

  2. M3 Max: unified memory, but L1/L2 cache still matters

  The M3 Max has excellent memory bandwidth (~400 GB/s), but the real win here isn't bandwidth — it's cache line utilization. Before the optimization, each FactorizedElementStiffness struct is large
  (it contains the factorized representation + the dense NxN matrix + other data). When you iterate over elements and only read the assembled matrix portion, you're pulling entire cache lines that are
   mostly wasted on the other struct members. With the contiguous buffer, every byte fetched into cache is useful.

  The roughly 2-4x speedup on matrixfree paths suggests the old code was severely cache-polluted: the working set of stiffness data exceeded L2 but the contiguous version fits (or at least fits much better).
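
To put a number on the stride argument above, here is a small sketch. `FatElementSketch` is a hypothetical layout, not the SOFA struct: the factor of 2 for the factorized payload is an assumption chosen only so that the sizes land near the ~14 KB and ~4.6 KB figures quoted in the commit message on this page. It computes how much of the address range spanned per element the matrix-vector product actually reads, and says nothing about measured cache behaviour.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical layout (NOT the SOFA struct): a factorized payload followed by the
// assembled matrix. The 2x factor is an assumption made for illustration only.
template <std::size_t NDofs>
struct FatElementSketch
{
    double factorized[2 * NDofs * NDofs];
    double assembled[NDofs * NDofs];
};

// Distance in bytes between two consecutive assembled matrices when they live inside the structs.
template <std::size_t NDofs>
constexpr std::size_t strideBytes() { return sizeof(FatElementSketch<NDofs>); }

// Size of one assembled matrix, which is also the stride in a packed buffer.
template <std::size_t NDofs>
constexpr std::size_t matrixBytes() { return sizeof(double) * NDofs * NDofs; }

// Fraction of the bytes spanned per element that the matrix-vector product reads.
template <std::size_t NDofs>
constexpr double fractionOfSpanRead()
{
    return static_cast<double>(matrixBytes<NDofs>()) /
           static_cast<double>(strideBytes<NDofs>());
}
```

Under this assumed layout a 24-DOF element spans 13824 bytes, of which 4608 are the matrix, so the loop reads one third of the memory it walks over; in the packed buffer that fraction is 1.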

  3. i7-13700K: why less improvement?

  Several factors:

  - Cache sizes and prefetchers: The i7-13700K has 30MB L3 shared cache vs M3 Max's ~48MB, and the per-core L2 is 2MB (P-cores) vs M3 Max's ~16MB shared L2 cluster, so the Intel caches are not the larger ones. However, x86 cache prefetchers are
  extremely aggressive: Intel's hardware prefetcher can detect strided access patterns and prefetch ahead even through large structs. Apple Silicon prefetchers are good but may not handle
  large-stride patterns as aggressively.
  - GCC vs Clang codegen: GCC may already be generating code that handles the indirect access pattern well (e.g., better loop unrolling or vectorization of the matrix-vector product even with the old
  layout). Clang on macOS might have been generating less optimal code for the strided access, making the contiguous layout a bigger relative win.

  4. Why no speedup for tetrahedron on Linux?

  This is the most telling data point. For tetrahedra:
  - NumberOfNodesInElement = 4, NumberOfDofsInElement = 12
  - Assembled matrix size: 12×12 = 144 doubles = 1152 bytes per element

  For hexahedra:
  - NumberOfNodesInElement = 8, NumberOfDofsInElement = 24
  - Assembled matrix size: 24×24 = 576 doubles = 4608 bytes per element

  The FactorizedElementStiffness struct overhead is proportionally much larger for hexahedra (the factorized form + metadata adds significant padding around the 4.6KB matrix). For tetrahedra, the
  assembled matrix is small enough that:

  1. The FactorizedElementStiffness struct for tet is also smaller, so more elements fit in cache even without the optimization
  2. Intel's hardware prefetcher handles the smaller stride effectively
  3. The matrix-vector product itself (12×12) is fast enough that the bottleneck may be elsewhere (e.g., the scatter/gather of nodal displacements, rotation multiplication, or task scheduling
  overhead)

  On M3 Max, even the tet case shows a 3x to 3.4x speedup, which suggests Apple Silicon is more sensitive to cache pollution from the struct padding, possibly because its prefetcher doesn't compensate as
  well for the strided access pattern, or because the M3 Max's memory subsystem has higher latency for cache misses that the contiguous layout avoids.
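
The per-element matrix sizes discussed in point 4 follow from the node count: 3 DOFs per node, a dense square matrix, and the scalar width. The sketch below uses hypothetical helper names and reports the size for both 4-byte and 8-byte scalars; the ~4.6 KB per 24x24 matrix quoted in the commit message on this page corresponds to the 8-byte case.

```cpp
#include <cassert>
#include <cstddef>

// Degrees of freedom per element, assuming 3 translational DOFs per node.
constexpr std::size_t dofsPerElement(std::size_t nodes) { return 3 * nodes; }

// Storage of one dense element stiffness matrix for a given scalar type.
template <typename Scalar>
constexpr std::size_t elementMatrixBytes(std::size_t nodes)
{
    return dofsPerElement(nodes) * dofsPerElement(nodes) * sizeof(Scalar);
}
```

For a tetrahedron (4 nodes) this gives 576 bytes with `float` and 1152 bytes with `double`; for a hexahedron (8 nodes) it gives 2304 and 4608 bytes respectively.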

By submitting this pull request, I acknowledge that
I have read, understand, and agree to the SOFA Developer Certificate of Origin (DCO).


Reviewers will merge this pull-request only if

  • it builds with SUCCESS for all platforms on the CI.
  • it does not generate new warnings.
  • it does not generate new unit test failures.
  • it does not generate new scene test failures.
  • it does not break API compatibility.
  • it is more than 1 week old (or has fast-merge label).

fredroy added 2 commits April 6, 2026 09:01
  Extract assembled stiffness matrices into a separate contiguous buffer
  (m_assembledStiffnessMatrices) to replace getReadAccessor calls on
  Data<vector<FactorizedElementStiffness>> inside parallel forEachRange
  lambdas. The read accessor acquires a shared lock on the Data object,
  causing contention across threads and effectively serializing the
  parallel work during CG iterations. Using a direct const reference to a
  plain vector eliminates this synchronization bottleneck (~3x speedup in
  parallel mode). As a secondary benefit, the contiguous buffer only
  stores the assembled 24x24 matrices (~4.6 KB each) rather than the full
  FactorizedElementStiffness structs (~14 KB each), improving cache
  utilization.
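
The locking pattern described in this commit message can be sketched in isolation. `GuardedVector` below is a hypothetical stand-in, not SOFA's `Data` or its read accessor, and a plain `std::mutex` with a counter replaces whatever synchronization the real accessor performs. The sketch is single-threaded and only counts how often the lock is taken.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical lock-guarded container (NOT SOFA's Data<T>).
class GuardedVector
{
public:
    explicit GuardedVector(std::vector<double> values) : m_values(std::move(values)) {}

    // Every read takes the lock, mimicking an accessor created inside the hot loop.
    double read(std::size_t i) const
    {
        std::lock_guard<std::mutex> lock(m_mutex);
        ++m_acquisitions;
        return m_values[i];
    }

    // Take the lock once and copy the payload into a plain vector.
    std::vector<double> snapshot() const
    {
        std::lock_guard<std::mutex> lock(m_mutex);
        ++m_acquisitions;
        return m_values;
    }

    // Number of lock acquisitions so far (meant for this single-threaded sketch only).
    std::size_t acquisitions() const { return m_acquisitions; }

private:
    std::vector<double> m_values;
    mutable std::mutex m_mutex;
    mutable std::size_t m_acquisitions = 0;
};

// Hot loop that goes through the guarded accessor: one lock acquisition per read.
inline double sumThroughAccessor(const GuardedVector& data, std::size_t count)
{
    double sum = 0.0;
    for (std::size_t i = 0; i < count; ++i)
        sum += data.read(i);
    return sum;
}

// Hot loop over the cached plain vector: no lock is touched at all.
inline double sumFromCache(const std::vector<double>& cache)
{
    double sum = 0.0;
    for (const double value : cache)
        sum += value;
    return sum;
}
```

With several worker threads, each acquisition in the first loop is a point where threads can wait on one another, while the second loop only reads memory that nobody modifies during the solve.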
fredroy added the pr: enhancement, pr: status wip, pr: based on previous PR, pr: AI-aided, and pr: status to review labels and removed the pr: status wip label on Apr 6, 2026
fredroy commented Apr 6, 2026

[ci-build][with-all-tests]
