Math: FFT: Optimizations with Xtensa HiFi code #10637

Open
singalsu wants to merge 2 commits into thesofproject:main from singalsu:math_fft_xtensa_hifi_optimizations

@singalsu
Collaborator

No description provided.

This patch optimizes the cycle count of the radix-2 Cooley-Tukey
implementation with three changes:

- Dedicated depth-1 stage: all N/2 butterflies use a real twiddle
  factor W^0 = 1+0j, so the complex multiply is replaced by plain
  add or subtract.

- Skip multiply for j=0 in stages >= 2: The first butterfly in every
  group also uses W^0, saving an additional ~N/2 complex multiplications
  across all remaining stages.

- Pointer arithmetic: replace per-butterfly index arithmetic
  (outx[k+j], outx[k+j+n], twiddle[i*j]) with auto-incrementing
  pointers and strided twiddle access (tw_r += stride), eliminating
  integer multiplies for address computation.
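
The three changes can be illustrated with a minimal sketch. This is not the code from the patch: it uses double-precision values instead of Q1.31 data and HiFi3 intrinsics, a fixed length of 8 with a hard-coded twiddle table, and hypothetical names (struct cplx, fft8_sketch, butterfly). It only shows the loop structure: an add/subtract-only first stage, a multiply-free j = 0 butterfly in later stages, and pointers that advance by a stride instead of indexed accesses.

```c
#include <stddef.h>

/* Illustrative complex type; the patch itself works on Q1.31 fixed-point data. */
struct cplx {
	double re;
	double im;
};

#define FFT_N 8		/* transform length, fixed for this sketch */
#define FFT_BITS 3	/* log2(FFT_N) */

/* Twiddle factors W_8^k = exp(-j*2*pi*k/8) for k = 0..3 */
static const struct cplx tw[FFT_N / 2] = {
	{ 1.0, 0.0 },
	{ 0.70710678118654752, -0.70710678118654752 },
	{ 0.0, -1.0 },
	{ -0.70710678118654752, -0.70710678118654752 },
};

/* Reverse the lowest FFT_BITS bits of v */
static size_t bit_reverse(size_t v)
{
	size_t r = 0;
	int b;

	for (b = 0; b < FFT_BITS; b++) {
		r = (r << 1) | (v & 1);
		v >>= 1;
	}
	return r;
}

/* One butterfly; t is the lower input after any twiddle multiply */
static void butterfly(struct cplx *top, struct cplx *bot, struct cplx t)
{
	struct cplx a = *top;

	top->re = a.re + t.re;
	top->im = a.im + t.im;
	bot->re = a.re - t.re;
	bot->im = a.im - t.im;
}

void fft8_sketch(const struct cplx *in, struct cplx *out)
{
	struct cplx *top, *bot, *end = out + FFT_N;
	const struct cplx *w;
	struct cplx t;
	size_t i, j, m, n, stride;

	/* Bit-reversed copy of the input */
	for (i = 0; i < FFT_N; i++)
		out[bit_reverse(i)] = in[i];

	/* Stage 1: every butterfly uses W^0 = 1, so only add and subtract */
	for (top = out; top < end; top += 2)
		butterfly(top, top + 1, top[1]);

	/* Stages >= 2: m is the group size, n the number of butterflies per group */
	for (m = 4; m <= FFT_N; m <<= 1) {
		n = m >> 1;
		stride = FFT_N / m;
		/* The inner loop advances top by n, so adding n again reaches the next group */
		for (top = out; top < end; top += n) {
			bot = top + n;
			w = tw + stride;	/* twiddle for j = 1 */

			/* j = 0 uses W^0 = 1: no complex multiply */
			butterfly(top, bot, *bot);
			top++;
			bot++;

			for (j = 1; j < n; j++) {
				t.re = w->re * bot->re - w->im * bot->im;
				t.im = w->re * bot->im + w->im * bot->re;
				butterfly(top, bot, t);
				top++;
				bot++;
				w += stride;	/* strided twiddle access, no i*j multiply */
			}
		}
	}
}
```

With a unit impulse at input index 0 every output of this sketch is 1+0j, and with the impulse at index 1 the outputs follow the twiddle sequence W_8^k, which is a quick way to check the stage logic.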

This change saves 11 MCPS (from 74 MCPS to 63 MCPS) in the STFT Process
module on the MTL platform with 1024/256 size/hop FFT processing. It was
tested with the scripts:

scripts/rebuild-testbench.sh -p mtl
scripts/sof-testbench-helper.sh -x -m stft_process_1024_256_ \
  -p profile-stft_process.txt

Signed-off-by: Seppo Ingalsuo <seppo.ingalsuo@linux.intel.com>
@singalsu singalsu force-pushed the math_fft_xtensa_hifi_optimizations branch from 6acb7ad to 0fef7d3 Compare March 20, 2026 16:00
@singalsu singalsu marked this pull request as ready for review March 20, 2026 16:29
Copilot AI review requested due to automatic review settings March 20, 2026 16:29
@singalsu singalsu requested a review from kv2019i as a code owner March 20, 2026 16:29
Contributor

Copilot AI left a comment

Pull request overview

This PR refactors the multi-radix FFT implementation by moving the generic multi-FFT logic out of fft_multi.c and introducing a HiFi3-optimized implementation using Xtensa intrinsics, plus a small HiFi3 FFT kernel optimization.

Changes:

  • Split dft3_32() and fft_multi_execute_32() out of fft_multi.c into new generic and HiFi3-specific source files.
  • Add a new HiFi3-optimized multi-FFT implementation (fft_multi_hifi3.c) using packed complex arithmetic and fused MAC operations.
  • Optimize the HiFi3 32-bit FFT kernel by special-casing the first stage and skipping the twiddle multiply for j=0.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| test/cmocka/src/math/fft/CMakeLists.txt | Updates unit test build inputs to compile the new split multi-FFT sources. |
| src/math/fft/fft_multi_hifi3.c | Adds HiFi3-optimized dft3_32() and fft_multi_execute_32() implementations. |
| src/math/fft/fft_multi_generic.c | Adds the generic dft3_32() and fft_multi_execute_32() implementations previously in fft_multi.c. |
| src/math/fft/fft_multi.c | Removes dft3_32()/fft_multi_execute_32() from this file, leaving plan allocation/free + twiddle table inclusion. |
| src/math/fft/fft_32_hifi3.c | Optimizes HiFi3 FFT stage execution (skip twiddle multiply in first stage and for j=0). |
| src/math/fft/CMakeLists.txt | Adds the new multi-FFT source files to the build when CONFIG_MATH_FFT_MULTI is enabled. |

This patch adds HiFi3 versions of the functions dft3_32() and
fft_multi_execute_32(). The functions are implemented in
fft_multi_hifi3.c and the generic versions are moved to
fft_multi_generic.c.

On the MTL platform the optimization saves 119 MCPS, from 237 MCPS
to 118 MCPS. The test was run with the scripts:

scripts/rebuild-testbench.sh -p mtl
scripts/sof-testbench-helper.sh -x -m stft_process_1536_240_ \
  -p profile-stft_process.txt

The above STFT used FFT length of 1536 with hop 240.
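
Since 1536 = 3 x 512, a length like this needs a 3-point DFT in addition to radix-2 stages. The following is a minimal floating-point sketch of such a 3-point DFT, not the Q1.31 HiFi3 code from the patch. The names struct cplx and dft3_sketch are hypothetical, and the 1/3 output scaling is an assumption inferred from the DFT3_SCALE constant name quoted in the review excerpt; how fft_multi_execute_32() combines the pieces is not shown here.

```c
/* Illustrative complex type; the patch itself works on fixed-point data. */
struct cplx {
	double re;
	double im;
};

#define DFT3_COEF_R (-0.5)			/* real part of W_3 = exp(-j*2*pi/3) */
#define DFT3_COEF_I 0.8660254037844386		/* sqrt(3)/2, magnitude of its imaginary part */
#define DFT3_SCALE_F (1.0 / 3.0)		/* assumed output scaling */

/*
 * 3-point DFT:
 *   y0 = x0 + x1 + x2
 *   y1 = x0 - 0.5 * (x1 + x2) - j * sqrt(3)/2 * (x1 - x2)
 *   y2 = x0 - 0.5 * (x1 + x2) + j * sqrt(3)/2 * (x1 - x2)
 * each scaled by 1/3 in this sketch.
 */
void dft3_sketch(const struct cplx x[3], struct cplx y[3])
{
	/* Sum and difference of x1 and x2 */
	double sr = x[1].re + x[2].re;
	double si = x[1].im + x[2].im;
	double dr = x[1].re - x[2].re;
	double di = x[1].im - x[2].im;

	/* Common term x0 - 0.5 * (x1 + x2) */
	double cr = x[0].re + DFT3_COEF_R * sr;
	double ci = x[0].im + DFT3_COEF_R * si;

	/* Rotated term -j * sqrt(3)/2 * (x1 - x2) */
	double tr = DFT3_COEF_I * di;
	double ti = -DFT3_COEF_I * dr;

	y[0].re = DFT3_SCALE_F * (x[0].re + sr);
	y[0].im = DFT3_SCALE_F * (x[0].im + si);
	y[1].re = DFT3_SCALE_F * (cr + tr);
	y[1].im = DFT3_SCALE_F * (ci + ti);
	y[2].re = DFT3_SCALE_F * (cr - tr);
	y[2].im = DFT3_SCALE_F * (ci - ti);
}
```

With a constant input of three ones the sketch returns 1 in bin 0 and zeros in the other two bins, so with this scaling the output stays within the input range.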

Signed-off-by: Seppo Ingalsuo <seppo.ingalsuo@linux.intel.com>
@singalsu singalsu force-pushed the math_fft_xtensa_hifi_optimizations branch from 0fef7d3 to 9e70d76 Compare March 20, 2026 16:51
@singalsu singalsu requested a review from Copilot March 20, 2026 17:11
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.

Comment on lines +78 to +83
#ifdef DEBUG_DUMP_TO_FILE
FILE *fh1 = fopen("debug_fft_multi_int1.txt", "w");
FILE *fh2 = fopen("debug_fft_multi_int2.txt", "w");
FILE *fh3 = fopen("debug_fft_multi_twiddle.txt", "w");
FILE *fh4 = fopen("debug_fft_multi_dft_out.txt", "w");
#endif
Copilot AI Mar 20, 2026

The DEBUG_DUMP_TO_FILE block uses FILE, fopen(), fprintf(), and fclose(), but this file does not include <stdio.h>. If DEBUG_DUMP_TO_FILE is enabled, compilation will fail; include <stdio.h> within the same #ifdef DEBUG_DUMP_TO_FILE guard (or unconditionally if preferred).
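
A guarded include along the lines of this comment could look like the sketch below. The helper debug_dump_value() is hypothetical and not part of the PR; it only shows that both the include and the stdio calls sit behind the same macro, so a normal build never references <stdio.h>.

```c
/* Pull in stdio only for debug builds, mirroring the guard around the dump code. */
#ifdef DEBUG_DUMP_TO_FILE
#include <stdio.h>
#endif

/*
 * Hypothetical helper, not part of the PR: appends one complex value to a
 * dump file when DEBUG_DUMP_TO_FILE is defined and does nothing otherwise.
 * Returns the number of values written.
 */
int debug_dump_value(int re, int im)
{
#ifdef DEBUG_DUMP_TO_FILE
	FILE *fh = fopen("debug_fft_multi_int1.txt", "a");

	if (!fh)
		return 0;
	fprintf(fh, "%d %d\n", re, im);
	fclose(fh);
	return 1;
#else
	/* Debug dump compiled out: silence unused-parameter warnings */
	(void)re;
	(void)im;
	return 0;
#endif
}
```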

Comment on lines +37 to +47
/** @brief Q1.31 constant -0.5 */
#define DFT3_COEFR -1073741824
/** @brief Q1.31 constant sqrt(3)/2 */
#define DFT3_COEFI 1859775393
/** @brief Q1.31 constant 1/3 */
#define DFT3_SCALE 715827883

/**
* dft3_32() - Compute 3-point DFT of Q1.31 complex data (HiFi3).
* @param x_in Pointer to 3 input complex samples in Q1.31.
* @param y Pointer to 3 output complex samples in Q1.31.
Copilot AI Mar 20, 2026

The DFT3 documentation here says the output is Q1.31, but the public API comment for dft3_32() in sof/math/fft.h describes the output as Q3.29. Please reconcile these so callers have a single, accurate definition of the output format (either update this comment to match the API, or adjust the API docs if they are incorrect).

Suggested change
- /** @brief Q1.31 constant -0.5 */
- #define DFT3_COEFR -1073741824
- /** @brief Q1.31 constant sqrt(3)/2 */
- #define DFT3_COEFI 1859775393
- /** @brief Q1.31 constant 1/3 */
- #define DFT3_SCALE 715827883
- /**
- * dft3_32() - Compute 3-point DFT of Q1.31 complex data (HiFi3).
- * @param x_in Pointer to 3 input complex samples in Q1.31.
- * @param y Pointer to 3 output complex samples in Q1.31.
+ /** @brief Q3.29 constant -0.5 (same fixed-point format as dft3_32() inputs) */
+ #define DFT3_COEFR -1073741824
+ /** @brief Q3.29 constant sqrt(3)/2 */
+ #define DFT3_COEFI 1859775393
+ /** @brief Q3.29 constant 1/3 */
+ #define DFT3_SCALE 715827883
+ /**
+ * dft3_32() - Compute 3-point DFT of Q3.29 complex data (HiFi3).
+ * @param x_in Pointer to 3 input complex samples in Q3.29.
+ * @param y Pointer to 3 output complex samples in Q3.29.
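
When reconciling the formats, it may help to check which scaling reproduces the quoted literals. The sketch below is a plain arithmetic check, not code from the PR, and to_fixed() is a hypothetical helper. It converts a real value to a fixed-point integer with a given number of fractional bits: 31 for Q1.31, 29 for Q3.29. It only concerns how the coefficient literals are scaled and does not settle which format the dft3_32() input and output data use.

```c
#include <stdint.h>

/*
 * Hypothetical helper: convert v to a signed fixed-point integer with
 * frac_bits fractional bits, rounding to nearest (ties away from zero).
 * The caller must keep v * 2^frac_bits within the int32_t range.
 */
int32_t to_fixed(double v, int frac_bits)
{
	double scaled = v * (double)((int64_t)1 << frac_bits);

	return (int32_t)(scaled + (scaled >= 0.0 ? 0.5 : -0.5));
}
```

With 31 fractional bits, the values -0.5, sqrt(3)/2 and 1/3 convert to -1073741824, 1859775393 and 715827883, the same literals as DFT3_COEFR, DFT3_COEFI and DFT3_SCALE. With 29 fractional bits, -0.5 would instead be -268435456.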
