feat: add optional DiT lyric timing output by javAlborz · Pull Request #79 · ServeurpersoCom/acestep.cpp

javAlborz · 2026-05-20T14:00:45Z

Summary

add an opt-in return_lyric_timing request field for ACE-Step DiT lyric-attention timing
capture selected cross-attention heads during a Python-style alignment replay and convert them to compact line/token timing JSON
expose timing from ace-synth as *.lyric-timing.json and from /synth as an optional lyric_timing multipart JSON part
document the request field and response behavior

Notes

This is DiT lyric-attention timing, not ASR ground truth. If generated vocals skip or mutate words, the timestamps can reflect the model's intended lyric plan rather than the audible waveform.

Validation

g++ -std=c++17 ... -fsyntax-only src/pipeline-synth-ops.cpp
g++ -std=c++17 ... -fsyntax-only tools/ace-synth.cpp
g++ -std=c++17 ... -fsyntax-only tools/ace-server.cpp

Full local CMake build was not run in this environment because cmake is not installed on the host.

Summary by CodeRabbit

Release Notes

New Features
- Added optional return_lyric_timing parameter to enable lyric timing output
- API responses now include lyric timing JSON data in multipart responses when enabled
- CLI tools generate *.lyric-timing.json sidecar files containing lyric timestamp information
Documentation
- Updated API endpoint and CLI documentation to describe the new lyric timing feature and output format

coderabbitai · 2026-05-20T14:00:54Z

📝 Walkthrough

Walkthrough

This PR implements optional lyric timing capture during music synthesis. When enabled via return_lyric_timing request flag, the DiT model exports cross-attention scores during generation, which are post-processed through alignment matrix computation and dynamic time warping to produce token-frame timing metadata as JSON sidecars.

Changes

Lyric Timing Feature

Layer / File(s)	Summary
Request schema and documentation `src/request.h`, `src/request.cpp`, `docs/ARCHITECTURE.md`, `README.md`	Introduces `return_lyric_timing` boolean flag (default `false`) with request parsing, serialization, and user documentation across API and CLI references.
DiT graph cross-attention score capture `src/dit.h`, `src/dit-graph.h`	Adds `capture_lyric_attn` flag to `DiTGGML` model and layer-indexed conditional cross-attention path: when enabled for specific layers, explicitly computes and exports attention scores as named graph outputs instead of using flash attention.
DiT sampler alignment and score extraction `src/dit-sampler.h`	Extends `dit_ggml_generate` with optional post-sampling phase: replays alignment forward step on aligned latent, re-runs DiT graph, extracts cross-attention score tensors from layers 2–6, and populates caller-provided head/frame accumulation buffers for selected token ranges.
Lyric timing alignment algorithm `src/lyric-timing.h`	Standalone DSP implementation: incremental BPE token decoding, normalized attention-based alignment matrix with repeated median suppression, dynamic time warping for monotonic token/frame path recovery, and JSON/LRC output generation with per-token start/end times and error handling.
Debug tensor utilities `src/debug.h`	Adds 3D float tensor dumper and i32 tensor dumping infrastructure (generic and 1D wrapper) for debugging captured attention matrices and alignment metadata.
Pipeline lyric token extraction and metadata `src/pipeline-synth-impl.h`, `src/pipeline-synth-ops.cpp`	Extends `SynthState` with lyric timing fields; implements lyric token range identification in `ops_encode_text` by tokenizing language/lyric header and trimming at first EOS.
Pipeline generation orchestration `src/pipeline-synth-ops.cpp`	Orchestrates feature in `ops_dit_generate`: snapshots DiT flags, conditionally enables capture and forces non-flash attention, passes `LyricAttentionDebug` to sampler, validates captured data, builds JSON via DTW algorithm or error JSON on failure.
Job API and getter `src/pipeline-synth.h`, `src/pipeline-synth.cpp`	Exposes `ace_synth_job_get_lyric_timing_json()` public API to retrieve lyric timing JSON (limited to `track_idx == 0` for batched jobs); initializes lyric metadata in job allocation.
Output threading and serialization `tools/synth-batch-runner.h`, `tools/ace-server.cpp`, `tools/ace-synth.cpp`	Threads lyric timing JSON from jobs through batch runner (optional output parameter), integrates into server synth worker and multipart response builder (injects optional `lyric_timing` JSON part), and extends CLI to write per-track JSON sidecars.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 Timing hops with whispered care,
Cross-attention woven through the air,
DTW aligns the beat and lyric dance,
Each token finds its frame's sweet chance,
JSON sidecars bloom, the rhythm's truth!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The PR title 'feat: add optional DiT lyric timing output' accurately describes the main change: adding an optional feature for DiT lyric timing output with JSON serialization.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 6

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/ARCHITECTURE.md`:
- Around line 938-940: Update the multipart/mixed docs for synth to explicitly
state the batch limitation: clarify that the optional lyric_timing JSON part is
sample-0 scoped (applies only to the first generated track in a batch) and not
produced per track; reference the existing "synth" multipart/mixed description
and mention "lyric_timing" and "sample-0" so readers know only one lyric_timing
part will be present when multiple tracks are generated in a single batch.

In `@src/dit-sampler.h`:
- Around line 785-791: The code extracting lyric/timing scores currently reads
only the first batch slice from tensor t_scores by using ggml_backend_tensor_get
with offset 0 and sample_elems = n0*n1*n2 (variables n0, n1, n2, sample_elems,
and t_scores), which ignores any batch dimension; change this to either (A)
guard/fail-fast: detect the batch size (the remaining dimension of t_scores,
e.g. n3 or t_scores->ne[3]) and return an error or assert if batch > 1 so
callers know per-sample output is unsupported, or (B) support batching by
looping over batch index b and calling ggml_backend_tensor_get with the proper
per-batch offset (b * sample_elems) to fill per-sample score buffers; update the
related blocks (the similar section around lines 798-815) to use the same guard
or per-batch loop and ensure sample_elems is computed as n0*n1*n2 and multiplied
by batch when computing total bytes read.
- Around line 743-749: The code currently hardcodes t_last = 1.0f / (float)
num_steps which can mismatch non-linear/custom schedules; instead compute t_last
from the actual sampling schedule used by the sampler (e.g., read the final
timestep value from the schedule array or call the schedule accessor used by
sampling) and assign that value to t_last before calling ggml_backend_tensor_set
for t_t and t_tr (replace the 1.0f/num_steps expression with the schedule's
final timestep used by the sampling code so replay aligns exactly with the
sampler).

In `@src/pipeline-synth-ops.cpp`:
- Around line 480-485: The current code uses header_ids.size() to set
s.lyric_start_idx, which is wrong because BPE merges can cross the header+lyrics
boundary; instead encode the full concatenated string (e.g., combined_str =
header_str + lyric_str) with bpe_encode to get full_ids and then compute the
split index by mapping token boundaries to character/byte offsets (walk tokens
from full_ids and accumulate decoded byte/char lengths until you reach
header_str.size()); set s.lyric_token_ids to the lyric slice from full_ids
starting at that computed split index and set s.lyric_start_idx and
s.lyric_end_idx from that split and lyric length (update uses of header_ids,
header_str, lyric_ids, bpe_encode, s.lyric_token_ids, s.lyric_start_idx,
s.lyric_end_idx).

In `@src/pipeline-synth.cpp`:
- Around line 672-674: The function ace_synth_job_get_lyric_timing_json
currently gates access to lyric timing by hardcoding track_idx == 0; remove that
check and implement per-track validation and retrieval: instead of "track_idx !=
0", validate track_idx against the actual per-job track container (e.g.
job->state.lyric_timing_json.size() or job->state.track_count), return nullptr
if out of range, and return the specific per-track string (e.g.
job->state.lyric_timing_json[track_idx].c_str()) only if that entry is
non-empty; update ace_synth_job_get_lyric_timing_json to use those symbols for
correct multi-track behavior.

In `@tools/ace-synth.cpp`:
- Around line 333-341: The current write block for timing_path opens tf and
writes lyric_timings[b] but ignores fwrite/fclose results; after opening tf in
the block that uses tf, check that fwrite returns the expected byte count
(lyric_timings[b].size()) and check fclose returns 0, and on either failure emit
a descriptive stderr message including errno/strerror (or perror) that includes
timing_path and batch index b, and propagate the failure by setting a non-zero
exit/error flag or returning an error from the enclosing function so the process
exits non-zero instead of silently continuing.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 911578a7-d8c9-4cfa-816c-e79942c32081

📥 Commits

Reviewing files that changed from the base of the PR and between 4922ed1 and 38d3ca6.

📒 Files selected for processing (16)

README.md
docs/ARCHITECTURE.md
src/debug.h
src/dit-graph.h
src/dit-sampler.h
src/dit.h
src/lyric-timing.h
src/pipeline-synth-impl.h
src/pipeline-synth-ops.cpp
src/pipeline-synth.cpp
src/pipeline-synth.h
src/request.cpp
src/request.h
tools/ace-server.cpp
tools/ace-synth.cpp
tools/synth-batch-runner.h

coderabbitai · 2026-05-20T14:08:06Z

+  synth:      multipart/mixed (one audio part + one latent part per track, paired;
+               optional lyric_timing JSON part when requested)
  understand: multipart/mixed (one json part + one latent part for the source)


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Document the batch limitation for lyric_timing explicitly.

This section should state that lyric timing is currently sample-0 scoped (only the first generated track in a batch), otherwise clients may assume one lyric_timing part per track.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/ARCHITECTURE.md` around lines 938 - 940, Update the multipart/mixed docs for synth to explicitly state the batch limitation: clarify that the optional lyric_timing JSON part is sample-0 scoped (applies only to the first generated track in a batch) and not produced per track; reference the existing "synth" multipart/mixed description and mention "lyric_timing" and "sample-0" so readers know only one lyric_timing part will be present when multiple tracks are generated in a single batch.

coderabbitai · 2026-05-20T14:08:06Z

+        float t_last = 1.0f / (float) num_steps;
+        if (t_t) {
+            ggml_backend_tensor_set(t_t, &t_last, 0, sizeof(float));
+        }
+        if (t_tr) {
+            ggml_backend_tensor_set(t_tr, &t_last, 0, sizeof(float));
+        }


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Use the actual final schedule timestep for alignment replay.

Line 743 hardcodes 1.0f / num_steps, which can desync replay from the real final denoise step when schedule is non-linear/custom. Use the same last timestep used by sampling.

Proposed fix

- float t_last = 1.0f / (float) num_steps; + float t_last = schedule[num_steps - 1];

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

float t_last = 1.0f / (float) num_steps;

if (t_t) {

ggml_backend_tensor_set(t_t, &t_last, 0, sizeof(float));

}

if (t_tr) {

ggml_backend_tensor_set(t_tr, &t_last, 0, sizeof(float));

}

float t_last = schedule[num_steps - 1];

if (t_t) {

ggml_backend_tensor_set(t_t, &t_last, 0, sizeof(float));

}

if (t_tr) {

ggml_backend_tensor_set(t_tr, &t_last, 0, sizeof(float));

}

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/dit-sampler.h` around lines 743 - 749, The code currently hardcodes t_last = 1.0f / (float) num_steps which can mismatch non-linear/custom schedules; instead compute t_last from the actual sampling schedule used by the sampler (e.g., read the final timestep value from the schedule array or call the schedule accessor used by sampling) and assign that value to t_last before calling ggml_backend_tensor_set for t_t and t_tr (replace the 1.0f/num_steps expression with the schedule's final timestep used by the sampling code so replay aligns exactly with the sampler).

coderabbitai · 2026-05-20T14:08:06Z

+            int n0 = (int) t_scores->ne[0];  // encoder tokens
+            int n1 = (int) t_scores->ne[1];  // latent frames
+            int n2 = (int) t_scores->ne[2];  // heads
+            int sample_elems = n0 * n1 * n2;
+            std::vector<float> scores(sample_elems);
+            ggml_backend_tensor_get(t_scores, scores.data(), 0, sample_elems * sizeof(float));
+


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Lyric score extraction currently ignores batch dimension and only reads sample 0.

Lines 785-791 read only n0*n1*n2 from offset 0, so for N > 1 timing is extracted from the first batch slice only. This silently produces incomplete/wrong results for batched synthesis.

Proposed guard (fail fast until per-sample output is supported)

+ if (N != 1) { + fprintf(stderr, "[DiT] lyric timing capture currently supports N==1 only\n"); + ggml_free(ctx); + return -1; + } + for (int layer = 2; layer <= 6; layer++) {

Also applies to: 798-815

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/dit-sampler.h` around lines 785 - 791, The code extracting lyric/timing scores currently reads only the first batch slice from tensor t_scores by using ggml_backend_tensor_get with offset 0 and sample_elems = n0*n1*n2 (variables n0, n1, n2, sample_elems, and t_scores), which ignores any batch dimension; change this to either (A) guard/fail-fast: detect the batch size (the remaining dimension of t_scores, e.g. n3 or t_scores->ne[3]) and return an error or assert if batch > 1 so callers know per-sample output is unsupported, or (B) support batching by looping over batch index b and calling ggml_backend_tensor_get with the proper per-batch offset (b * sample_elems) to fill per-sample score buffers; update the related blocks (the similar section around lines 798-815) to use the same guard or per-batch loop and ensure sample_elems is computed as n0*n1*n2 and multiplied by batch when computing total bytes read.

coderabbitai · 2026-05-20T14:08:06Z

+                std::string  header_str = std::string("# Languages\n") + language_b + "\n\n# Lyric\n";
+                auto         header_ids = bpe_encode(bpe, header_str.c_str(), false);
+
+                s.lyric_token_ids = lyric_ids;
+                s.lyric_start_idx = (int) header_ids.size();
+                s.lyric_end_idx   = (int) lyric_ids.size();


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Compute lyric_start_idx from the full token stream, not a separately tokenized header.

With BPE, merges can cross the header + lyrics boundary, so header_ids.size() is not guaranteed to match the real start index inside lyric_ids. This can mis-slice tokens and skew timing JSON.

A safer approach is to derive the boundary from the already-tokenized full sequence (e.g., decode-incremental char offsets against header_str.size()), then set lyric_start_idx from that exact crossing point.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/pipeline-synth-ops.cpp` around lines 480 - 485, The current code uses header_ids.size() to set s.lyric_start_idx, which is wrong because BPE merges can cross the header+lyrics boundary; instead encode the full concatenated string (e.g., combined_str = header_str + lyric_str) with bpe_encode to get full_ids and then compute the split index by mapping token boundaries to character/byte offsets (walk tokens from full_ids and accumulate decoded byte/char lengths until you reach header_str.size()); set s.lyric_token_ids to the lyric slice from full_ids starting at that computed split index and set s.lyric_start_idx and s.lyric_end_idx from that split and lyric length (update uses of header_ids, header_str, lyric_ids, bpe_encode, s.lyric_token_ids, s.lyric_start_idx, s.lyric_end_idx).

coderabbitai · 2026-05-20T14:08:07Z

+const char * ace_synth_job_get_lyric_timing_json(const AceSynthJob * job, int track_idx) {
+    if (!job || track_idx != 0 || job->state.lyric_timing_json.empty()) {
+        return nullptr;


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Remove the hardcoded track-0 gate in lyric timing getter.

Line 673 forces track_idx != 0 to return nullptr, so batch outputs beyond index 0 never receive lyric timing even though callers query per-track.

Suggested fix

const char * ace_synth_job_get_lyric_timing_json(const AceSynthJob * job, int track_idx) { - if (!job || track_idx != 0 || job->state.lyric_timing_json.empty()) { + if (!job || track_idx < 0 || track_idx >= job->batch_n || job->state.lyric_timing_json.empty()) { return nullptr; } return job->state.lyric_timing_json.c_str(); }

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/pipeline-synth.cpp` around lines 672 - 674, The function ace_synth_job_get_lyric_timing_json currently gates access to lyric timing by hardcoding track_idx == 0; remove that check and implement per-track validation and retrieval: instead of "track_idx != 0", validate track_idx against the actual per-job track container (e.g. job->state.lyric_timing_json.size() or job->state.track_count), return nullptr if out of range, and return the specific per-track string (e.g. job->state.lyric_timing_json[track_idx].c_str()) only if that entry is non-empty; update ace_synth_job_get_lyric_timing_json to use those symbols for correct multi-track behavior.

coderabbitai · 2026-05-20T14:08:07Z

+            FILE * tf = fopen(timing_path, "wb");
+            if (!tf) {
+                fprintf(stderr, "[Ace-Synth Batch%d] FATAL: failed to write %s\n", b, timing_path);
+            } else {
+                fwrite(lyric_timings[b].data(), 1, lyric_timings[b].size(), tf);
+                fputc('\n', tf);
+                fclose(tf);
+                fprintf(stderr, "[Ace-Synth Batch%d] Wrote %s\n", b, timing_path);
+            }


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Check and surface lyric sidecar write failures.

The sidecar path ignores fwrite/fclose outcomes, and write failures don’t affect exit status. This can silently produce incomplete or missing .lyric-timing.json files in automation.

Suggested fix

+ bool lyric_timing_write_failed = false; // Write output files for (int b = 0; b < (int) all_audio.size(); b++) { @@ FILE * tf = fopen(timing_path, "wb"); if (!tf) { - fprintf(stderr, "[Ace-Synth Batch%d] FATAL: failed to write %s\n", b, timing_path); + fprintf(stderr, "[Ace-Synth Batch%d] ERROR: failed to open %s\n", b, timing_path); + lyric_timing_write_failed = true; } else { - fwrite(lyric_timings[b].data(), 1, lyric_timings[b].size(), tf); - fputc('\n', tf); - fclose(tf); - fprintf(stderr, "[Ace-Synth Batch%d] Wrote %s\n", b, timing_path); + const size_t want = lyric_timings[b].size(); + const size_t wrote = fwrite(lyric_timings[b].data(), 1, want, tf); + const bool ok = (wrote == want) && (fputc('\n', tf) != EOF) && (fclose(tf) == 0); + if (!ok) { + fprintf(stderr, "[Ace-Synth Batch%d] ERROR: failed to flush %s\n", b, timing_path); + lyric_timing_write_failed = true; + } else { + fprintf(stderr, "[Ace-Synth Batch%d] Wrote %s\n", b, timing_path); + } } } @@ - fprintf(stderr, "[Ace-Synth] All done\n"); - return 0; + if (lyric_timing_write_failed) { + return 1; + } + fprintf(stderr, "[Ace-Synth] All done\n"); + return 0;

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tools/ace-synth.cpp` around lines 333 - 341, The current write block for timing_path opens tf and writes lyric_timings[b] but ignores fwrite/fclose results; after opening tf in the block that uses tf, check that fwrite returns the expected byte count (lyric_timings[b].size()) and check fclose returns 0, and on either failure emit a descriptive stderr message including errno/strerror (or perror) that includes timing_path and batch index b, and propagate the failure by setting a non-zero exit/error flag or returning an error from the enclosing function so the process exits non-zero instead of silently continuing.

feat: add optional DiT lyric timing output

38d3ca6

javAlborz marked this pull request as ready for review May 20, 2026 14:03

coderabbitai Bot reviewed May 20, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: add optional DiT lyric timing output#79

feat: add optional DiT lyric timing output#79
javAlborz wants to merge 1 commit into
ServeurpersoCom:masterfrom
javAlborz:feature/lyric-timing-attention

javAlborz commented May 20, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented May 20, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot May 20, 2026

Uh oh!

coderabbitai Bot May 20, 2026

Uh oh!

coderabbitai Bot May 20, 2026

Uh oh!

coderabbitai Bot May 20, 2026

Uh oh!

coderabbitai Bot May 20, 2026

Uh oh!

coderabbitai Bot May 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

javAlborz commented May 20, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Notes

Validation

Summary by CodeRabbit

Release Notes

Uh oh!

coderabbitai Bot commented May 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 20, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 20, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 20, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 20, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 20, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 20, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

javAlborz commented May 20, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented May 20, 2026 •

edited

Loading