Math: FFT: Optimizations with Xtensa HiFi code #10637
singalsu wants to merge 2 commits into thesofproject:main
This patch optimizes the cycle count of the radix-2 Cooley-Tukey implementation with three changes:

- Dedicated depth-1 stage: all N/2 butterflies use the real twiddle factor W^0 = 1+0j, so the complex multiply is replaced by a plain add or subtract.
- Skip multiply for j=0 in stages >= 2: the first butterfly in every group also uses W^0, saving an additional ~N/2 complex multiplications across all remaining stages.
- Pointer arithmetic: replace per-butterfly index arithmetic (outx[k+j], outx[k+j+n], twiddle[i*j]) with auto-incrementing pointers and strided twiddle access (tw_r += stride), eliminating integer multiplies for address computation.

This change saves 11 MCPS (from 74 MCPS to 63 MCPS) in the STFT Process module on the MTL platform with 1024/256 size/hop FFT processing. It was tested with the scripts:

    scripts/rebuild-testbench.sh -p mtl
    scripts/sof-testbench-helper.sh -x -m stft_process_1024_256_ \
        -p profile-stft_process.txt

Signed-off-by: Seppo Ingalsuo <seppo.ingalsuo@linux.intel.com>
6acb7ad to 0fef7d3
Pull request overview
This PR refactors the multi-radix FFT implementation by moving the generic multi-FFT logic out of fft_multi.c and introducing a HiFi3-optimized implementation using Xtensa intrinsics, plus a small HiFi3 FFT kernel optimization.
Changes:
- Split dft3_32() and fft_multi_execute_32() out of fft_multi.c into new generic and HiFi3-specific source files.
- Add a new HiFi3-optimized multi-FFT implementation (fft_multi_hifi3.c) using packed complex arithmetic and fused MAC operations.
- Optimize the HiFi3 32-bit FFT kernel by special-casing the first stage and skipping the twiddle multiply for j=0.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| test/cmocka/src/math/fft/CMakeLists.txt | Updates unit test build inputs to compile the new split multi-FFT sources. |
| src/math/fft/fft_multi_hifi3.c | Adds HiFi3-optimized dft3_32() and fft_multi_execute_32() implementations. |
| src/math/fft/fft_multi_generic.c | Adds the generic dft3_32() and fft_multi_execute_32() implementations previously in fft_multi.c. |
| src/math/fft/fft_multi.c | Removes dft3_32()/fft_multi_execute_32() from this file, leaving plan allocation/free + twiddle table inclusion. |
| src/math/fft/fft_32_hifi3.c | Optimizes HiFi3 FFT stage execution (skip twiddle multiply in first stage and for j=0). |
| src/math/fft/CMakeLists.txt | Adds the new multi-FFT source files to the build when CONFIG_MATH_FFT_MULTI is enabled. |
This patch adds HiFi3 versions of the functions dft3_32() and fft_multi_execute_32(). The functions are implemented in fft_multi_hifi3.c and the generic versions are moved to fft_multi_generic.c.

On the MTL platform the optimization saves 119 MCPS, from 237 MCPS to 118 MCPS. The test was run with the scripts:

    scripts/rebuild-testbench.sh -p mtl
    scripts/sof-testbench-helper.sh -x -m stft_process_1536_240_ \
        -p profile-stft_process.txt

The above STFT used an FFT length of 1536 with hop 240.

Signed-off-by: Seppo Ingalsuo <seppo.ingalsuo@linux.intel.com>
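A length of 1536 is not a power of two, which is where a 3-point DFT comes in. The sketch below assumes the common decomposition N = 3 * M (1536 = 3 * 512): three length-M transforms of the decimated inputs, a twiddle multiply, and one 3-point DFT per output bin. It is not taken from fft_multi_hifi3.c. It uses double precision, omits the 1/3 scaling that DFT3_SCALE suggests, and replaces the radix-2 sub-FFTs with a naive DFT to stay short. All names (`struct cplx`, `cmul()`, `dft3()`, `fft_3xm_sketch()`) are hypothetical, and the caller is assumed to supply w[k] = exp(-2*pi*j*k/n) for k = 0 .. n-1.

```c
#include <assert.h>
#include <stddef.h>

struct cplx { double re, im; };	/* hypothetical type, not the SOF one */

static struct cplx cmul(struct cplx a, struct cplx b)
{
	struct cplx c = { a.re * b.re - a.im * b.im, a.re * b.im + a.im * b.re };

	return c;
}

/* Unscaled 3-point DFT using -0.5 and sqrt(3)/2 as the only coefficients. */
static void dft3(const struct cplx g[3], struct cplx y[3])
{
	const double cr = -0.5, ci = 0.86602540378443864676;
	double sr = g[1].re + g[2].re, si = g[1].im + g[2].im;	/* g1 + g2 */
	double dr = g[1].re - g[2].re, di = g[1].im - g[2].im;	/* g1 - g2 */

	y[0].re = g[0].re + sr;
	y[0].im = g[0].im + si;
	/* y1 = g0 + cr * (g1 + g2) - j * ci * (g1 - g2) */
	y[1].re = g[0].re + cr * sr + ci * di;
	y[1].im = g[0].im + cr * si - ci * dr;
	/* y2 = g0 + cr * (g1 + g2) + j * ci * (g1 - g2) */
	y[2].re = g[0].re + cr * sr - ci * di;
	y[2].im = g[0].im + cr * si + ci * dr;
}

/* x: n inputs, y: n outputs, tmp: n scratch entries, n must be 3 * m. */
void fft_3xm_sketch(const struct cplx *x, struct cplx *y, struct cplx *tmp,
		    const struct cplx *w, size_t n)
{
	size_t m = n / 3;

	assert(n > 0 && n % 3 == 0);

	/* Three length-m transforms of x[3 * i + r]; naive O(m^2) stand-in. */
	for (size_t r = 0; r < 3; r++) {
		for (size_t k = 0; k < m; k++) {
			struct cplx acc = { 0.0, 0.0 };

			for (size_t i = 0; i < m; i++) {
				struct cplx t = cmul(x[3 * i + r], w[(3 * i * k) % n]);

				acc.re += t.re;
				acc.im += t.im;
			}
			tmp[r * m + k] = acc;
		}
	}

	/* Twiddle the three partial results, then combine with a 3-point DFT. */
	for (size_t k = 0; k < m; k++) {
		struct cplx g[3], out[3];

		g[0] = tmp[k];
		g[1] = cmul(tmp[m + k], w[k]);
		g[2] = cmul(tmp[2 * m + k], w[2 * k]);
		dft3(g, out);
		y[k] = out[0];
		y[k + m] = out[1];
		y[k + 2 * m] = out[2];
	}
}
```

In a real implementation the inner naive loop would be a power-of-two FFT, so the cost of the 3-point combine and the twiddle multiplies becomes a visible share of the total, which is consistent with the MCPS saving reported above coming from optimizing exactly those parts.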
0fef7d3 to 9e70d76
Pull request overview
Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.
    #ifdef DEBUG_DUMP_TO_FILE
    	FILE *fh1 = fopen("debug_fft_multi_int1.txt", "w");
    	FILE *fh2 = fopen("debug_fft_multi_int2.txt", "w");
    	FILE *fh3 = fopen("debug_fft_multi_twiddle.txt", "w");
    	FILE *fh4 = fopen("debug_fft_multi_dft_out.txt", "w");
    #endif
DEBUG_DUMP_TO_FILE block uses FILE, fopen(), fprintf(), and fclose() but this file does not include <stdio.h>. If DEBUG_DUMP_TO_FILE is enabled, this will fail to compile; include <stdio.h> within the same #ifdef DEBUG_DUMP_TO_FILE guard (or unconditionally if preferred).
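A minimal sketch of the guarded include suggested in this comment, assuming the dump code stays optional. The function `debug_dump_sketch()` is a hypothetical stand-in for the body of fft_multi_execute_32(); only the file name is taken from the snippet above.

```c
/* Uncomment to enable the debug dumps (assumed to be off by default). */
/* #define DEBUG_DUMP_TO_FILE */

#ifdef DEBUG_DUMP_TO_FILE
#include <stdio.h>	/* FILE, fopen(), fprintf(), fclose() */
#endif

/* Hypothetical stand-in for a function that optionally dumps its data. */
int debug_dump_sketch(const int *data, int n)
{
#ifdef DEBUG_DUMP_TO_FILE
	FILE *fh1 = fopen("debug_fft_multi_int1.txt", "w");

	if (!fh1)
		return -1;
	for (int i = 0; i < n; i++)
		fprintf(fh1, "%d\n", data[i]);
	fclose(fh1);
#else
	/* Dumps disabled: nothing from <stdio.h> is referenced. */
	(void)data;
	(void)n;
#endif
	return 0;
}
```

Keeping the include inside the same guard means a normal firmware build never pulls in <stdio.h>, while a debug build compiles without further changes.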
    /** @brief Q1.31 constant -0.5 */
    #define DFT3_COEFR -1073741824
    /** @brief Q1.31 constant sqrt(3)/2 */
    #define DFT3_COEFI 1859775393
    /** @brief Q1.31 constant 1/3 */
    #define DFT3_SCALE 715827883

    /**
     * dft3_32() - Compute 3-point DFT of Q1.31 complex data (HiFi3).
     * @param x_in Pointer to 3 input complex samples in Q1.31.
     * @param y Pointer to 3 output complex samples in Q1.31.
The DFT3 documentation here says the output is Q1.31, but the public API comment for dft3_32() in sof/math/fft.h describes the output as Q3.29. Please reconcile these so callers have a single, accurate definition of the output format (either update this comment to match the API, or adjust the API docs if they are incorrect).
    -/** @brief Q1.31 constant -0.5 */
    -#define DFT3_COEFR -1073741824
    -/** @brief Q1.31 constant sqrt(3)/2 */
    -#define DFT3_COEFI 1859775393
    -/** @brief Q1.31 constant 1/3 */
    -#define DFT3_SCALE 715827883
    -/**
    - * dft3_32() - Compute 3-point DFT of Q1.31 complex data (HiFi3).
    - * @param x_in Pointer to 3 input complex samples in Q1.31.
    - * @param y Pointer to 3 output complex samples in Q1.31.
    +/** @brief Q3.29 constant -0.5 (same fixed-point format as dft3_32() inputs) */
    +#define DFT3_COEFR -1073741824
    +/** @brief Q3.29 constant sqrt(3)/2 */
    +#define DFT3_COEFI 1859775393
    +/** @brief Q3.29 constant 1/3 */
    +#define DFT3_SCALE 715827883
    +/**
    + * dft3_32() - Compute 3-point DFT of Q3.29 complex data (HiFi3).
    + * @param x_in Pointer to 3 input complex samples in Q3.29.
    + * @param y Pointer to 3 output complex samples in Q3.29.
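When reconciling the formats, it may help to check what the three constants decode to under each interpretation. The sketch below is a small check with a hypothetical helper `q_to_double()`; the macro values are copied from the snippet, with parentheses added around the negative one. Read with 31 fractional bits the values come out as -0.5, about 0.8660 and about 0.3333; read with 29 fractional bits they are four times larger, so the format of the coefficients and the format of the input and output samples can be documented separately.

```c
#include <assert.h>
#include <stdint.h>

#define DFT3_COEFR (-1073741824)
#define DFT3_COEFI 1859775393
#define DFT3_SCALE 715827883

/* Hypothetical helper: decode a raw value that has frac_bits fractional bits. */
static double q_to_double(int32_t v, int frac_bits)
{
	assert(frac_bits >= 0 && frac_bits <= 31);
	return (double)v / (double)((int64_t)1 << frac_bits);
}
```

For example, `q_to_double(DFT3_COEFR, 31)` gives -0.5, while `q_to_double(DFT3_COEFR, 29)` gives -2.0.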
No description provided.