Skip to content

feat: add optional DiT lyric timing output#79

Open
javAlborz wants to merge 1 commit into
ServeurpersoCom:masterfrom
javAlborz:feature/lyric-timing-attention
Open

feat: add optional DiT lyric timing output#79
javAlborz wants to merge 1 commit into
ServeurpersoCom:masterfrom
javAlborz:feature/lyric-timing-attention

Conversation

@javAlborz
Copy link
Copy Markdown

@javAlborz javAlborz commented May 20, 2026

Summary

  • add an opt-in return_lyric_timing request field for ACE-Step DiT lyric-attention timing
  • capture selected cross-attention heads during a Python-style alignment replay and convert them to compact line/token timing JSON
  • expose timing from ace-synth as *.lyric-timing.json and from /synth as an optional lyric_timing multipart JSON part
  • document the request field and response behavior

Notes

This is DiT lyric-attention timing, not ASR ground truth. If generated vocals skip or mutate words, the timestamps can reflect the model's intended lyric plan rather than the audible waveform.

Validation

  • g++ -std=c++17 ... -fsyntax-only src/pipeline-synth-ops.cpp
  • g++ -std=c++17 ... -fsyntax-only tools/ace-synth.cpp
  • g++ -std=c++17 ... -fsyntax-only tools/ace-server.cpp

Full local CMake build was not run in this environment because cmake is not installed on the host.

Summary by CodeRabbit

Release Notes

  • New Features

    • Added optional return_lyric_timing parameter to enable lyric timing output
    • API responses now include lyric timing JSON data in multipart responses when enabled
    • CLI tools generate *.lyric-timing.json sidecar files containing lyric timestamp information
  • Documentation

    • Updated API endpoint and CLI documentation to describe the new lyric timing feature and output format

Review Change Stack

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 20, 2026

📝 Walkthrough

Walkthrough

This PR implements optional lyric timing capture during music synthesis. When enabled via return_lyric_timing request flag, the DiT model exports cross-attention scores during generation, which are post-processed through alignment matrix computation and dynamic time warping to produce token-frame timing metadata as JSON sidecars.

Changes

Lyric Timing Feature

Layer / File(s) Summary
Request schema and documentation
src/request.h, src/request.cpp, docs/ARCHITECTURE.md, README.md
Introduces return_lyric_timing boolean flag (default false) with request parsing, serialization, and user documentation across API and CLI references.
DiT graph cross-attention score capture
src/dit.h, src/dit-graph.h
Adds capture_lyric_attn flag to DiTGGML model and layer-indexed conditional cross-attention path: when enabled for specific layers, explicitly computes and exports attention scores as named graph outputs instead of using flash attention.
DiT sampler alignment and score extraction
src/dit-sampler.h
Extends dit_ggml_generate with optional post-sampling phase: replays alignment forward step on aligned latent, re-runs DiT graph, extracts cross-attention score tensors from layers 2–6, and populates caller-provided head/frame accumulation buffers for selected token ranges.
Lyric timing alignment algorithm
src/lyric-timing.h
Standalone DSP implementation: incremental BPE token decoding, normalized attention-based alignment matrix with repeated median suppression, dynamic time warping for monotonic token/frame path recovery, and JSON/LRC output generation with per-token start/end times and error handling.
Debug tensor utilities
src/debug.h
Adds 3D float tensor dumper and i32 tensor dumping infrastructure (generic and 1D wrapper) for debugging captured attention matrices and alignment metadata.
Pipeline lyric token extraction and metadata
src/pipeline-synth-impl.h, src/pipeline-synth-ops.cpp
Extends SynthState with lyric timing fields; implements lyric token range identification in ops_encode_text by tokenizing language/lyric header and trimming at first EOS.
Pipeline generation orchestration
src/pipeline-synth-ops.cpp
Orchestrates feature in ops_dit_generate: snapshots DiT flags, conditionally enables capture and forces non-flash attention, passes LyricAttentionDebug to sampler, validates captured data, builds JSON via DTW algorithm or error JSON on failure.
Job API and getter
src/pipeline-synth.h, src/pipeline-synth.cpp
Exposes ace_synth_job_get_lyric_timing_json() public API to retrieve lyric timing JSON (limited to track_idx == 0 for batched jobs); initializes lyric metadata in job allocation.
Output threading and serialization
tools/synth-batch-runner.h, tools/ace-server.cpp, tools/ace-synth.cpp
Threads lyric timing JSON from jobs through batch runner (optional output parameter), integrates into server synth worker and multipart response builder (injects optional lyric_timing JSON part), and extends CLI to write per-track JSON sidecars.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 Timing hops with whispered care,
Cross-attention woven through the air,
DTW aligns the beat and lyric dance,
Each token finds its frame's sweet chance,
JSON sidecars bloom, the rhythm's truth!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The PR title 'feat: add optional DiT lyric timing output' accurately describes the main change: adding an optional feature for DiT lyric timing output with JSON serialization.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@javAlborz javAlborz marked this pull request as ready for review May 20, 2026 14:03
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 6

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/ARCHITECTURE.md`:
- Around line 938-940: Update the multipart/mixed docs for synth to explicitly
state the batch limitation: clarify that the optional lyric_timing JSON part is
sample-0 scoped (applies only to the first generated track in a batch) and not
produced per track; reference the existing "synth" multipart/mixed description
and mention "lyric_timing" and "sample-0" so readers know only one lyric_timing
part will be present when multiple tracks are generated in a single batch.

In `@src/dit-sampler.h`:
- Around line 785-791: The code extracting lyric/timing scores currently reads
only the first batch slice from tensor t_scores by using ggml_backend_tensor_get
with offset 0 and sample_elems = n0*n1*n2 (variables n0, n1, n2, sample_elems,
and t_scores), which ignores any batch dimension; change this to either (A)
guard/fail-fast: detect the batch size (the remaining dimension of t_scores,
e.g. n3 or t_scores->ne[3]) and return an error or assert if batch > 1 so
callers know per-sample output is unsupported, or (B) support batching by
looping over batch index b and calling ggml_backend_tensor_get with the proper
per-batch offset (b * sample_elems) to fill per-sample score buffers; update the
related blocks (the similar section around lines 798-815) to use the same guard
or per-batch loop and ensure sample_elems is computed as n0*n1*n2 and multiplied
by batch when computing total bytes read.
- Around line 743-749: The code currently hardcodes t_last = 1.0f / (float)
num_steps which can mismatch non-linear/custom schedules; instead compute t_last
from the actual sampling schedule used by the sampler (e.g., read the final
timestep value from the schedule array or call the schedule accessor used by
sampling) and assign that value to t_last before calling ggml_backend_tensor_set
for t_t and t_tr (replace the 1.0f/num_steps expression with the schedule's
final timestep used by the sampling code so replay aligns exactly with the
sampler).

In `@src/pipeline-synth-ops.cpp`:
- Around line 480-485: The current code uses header_ids.size() to set
s.lyric_start_idx, which is wrong because BPE merges can cross the header+lyrics
boundary; instead encode the full concatenated string (e.g., combined_str =
header_str + lyric_str) with bpe_encode to get full_ids and then compute the
split index by mapping token boundaries to character/byte offsets (walk tokens
from full_ids and accumulate decoded byte/char lengths until you reach
header_str.size()); set s.lyric_token_ids to the lyric slice from full_ids
starting at that computed split index and set s.lyric_start_idx and
s.lyric_end_idx from that split and lyric length (update uses of header_ids,
header_str, lyric_ids, bpe_encode, s.lyric_token_ids, s.lyric_start_idx,
s.lyric_end_idx).

In `@src/pipeline-synth.cpp`:
- Around line 672-674: The function ace_synth_job_get_lyric_timing_json
currently gates access to lyric timing by hardcoding track_idx == 0; remove that
check and implement per-track validation and retrieval: instead of "track_idx !=
0", validate track_idx against the actual per-job track container (e.g.
job->state.lyric_timing_json.size() or job->state.track_count), return nullptr
if out of range, and return the specific per-track string (e.g.
job->state.lyric_timing_json[track_idx].c_str()) only if that entry is
non-empty; update ace_synth_job_get_lyric_timing_json to use those symbols for
correct multi-track behavior.

In `@tools/ace-synth.cpp`:
- Around line 333-341: The current write block for timing_path opens tf and
writes lyric_timings[b] but ignores fwrite/fclose results; after opening tf in
the block that uses tf, check that fwrite returns the expected byte count
(lyric_timings[b].size()) and check fclose returns 0, and on either failure emit
a descriptive stderr message including errno/strerror (or perror) that includes
timing_path and batch index b, and propagate the failure by setting a non-zero
exit/error flag or returning an error from the enclosing function so the process
exits non-zero instead of silently continuing.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 911578a7-d8c9-4cfa-816c-e79942c32081

📥 Commits

Reviewing files that changed from the base of the PR and between 4922ed1 and 38d3ca6.

📒 Files selected for processing (16)
  • README.md
  • docs/ARCHITECTURE.md
  • src/debug.h
  • src/dit-graph.h
  • src/dit-sampler.h
  • src/dit.h
  • src/lyric-timing.h
  • src/pipeline-synth-impl.h
  • src/pipeline-synth-ops.cpp
  • src/pipeline-synth.cpp
  • src/pipeline-synth.h
  • src/request.cpp
  • src/request.h
  • tools/ace-server.cpp
  • tools/ace-synth.cpp
  • tools/synth-batch-runner.h

Comment thread docs/ARCHITECTURE.md
Comment on lines +938 to 940
synth: multipart/mixed (one audio part + one latent part per track, paired;
optional lyric_timing JSON part when requested)
understand: multipart/mixed (one json part + one latent part for the source)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Document the batch limitation for lyric_timing explicitly.

This section should state that lyric timing is currently sample-0 scoped (only the first generated track in a batch), otherwise clients may assume one lyric_timing part per track.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/ARCHITECTURE.md` around lines 938 - 940, Update the multipart/mixed docs
for synth to explicitly state the batch limitation: clarify that the optional
lyric_timing JSON part is sample-0 scoped (applies only to the first generated
track in a batch) and not produced per track; reference the existing "synth"
multipart/mixed description and mention "lyric_timing" and "sample-0" so readers
know only one lyric_timing part will be present when multiple tracks are
generated in a single batch.

Comment thread src/dit-sampler.h
Comment on lines +743 to +749
float t_last = 1.0f / (float) num_steps;
if (t_t) {
ggml_backend_tensor_set(t_t, &t_last, 0, sizeof(float));
}
if (t_tr) {
ggml_backend_tensor_set(t_tr, &t_last, 0, sizeof(float));
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Use the actual final schedule timestep for alignment replay.

Line 743 hardcodes 1.0f / num_steps, which can desync replay from the real final denoise step when schedule is non-linear/custom. Use the same last timestep used by sampling.

Proposed fix
-        float t_last = 1.0f / (float) num_steps;
+        float t_last = schedule[num_steps - 1];
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
float t_last = 1.0f / (float) num_steps;
if (t_t) {
ggml_backend_tensor_set(t_t, &t_last, 0, sizeof(float));
}
if (t_tr) {
ggml_backend_tensor_set(t_tr, &t_last, 0, sizeof(float));
}
float t_last = schedule[num_steps - 1];
if (t_t) {
ggml_backend_tensor_set(t_t, &t_last, 0, sizeof(float));
}
if (t_tr) {
ggml_backend_tensor_set(t_tr, &t_last, 0, sizeof(float));
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/dit-sampler.h` around lines 743 - 749, The code currently hardcodes
t_last = 1.0f / (float) num_steps which can mismatch non-linear/custom
schedules; instead compute t_last from the actual sampling schedule used by the
sampler (e.g., read the final timestep value from the schedule array or call the
schedule accessor used by sampling) and assign that value to t_last before
calling ggml_backend_tensor_set for t_t and t_tr (replace the 1.0f/num_steps
expression with the schedule's final timestep used by the sampling code so
replay aligns exactly with the sampler).

Comment thread src/dit-sampler.h
Comment on lines +785 to +791
int n0 = (int) t_scores->ne[0]; // encoder tokens
int n1 = (int) t_scores->ne[1]; // latent frames
int n2 = (int) t_scores->ne[2]; // heads
int sample_elems = n0 * n1 * n2;
std::vector<float> scores(sample_elems);
ggml_backend_tensor_get(t_scores, scores.data(), 0, sample_elems * sizeof(float));

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Lyric score extraction currently ignores batch dimension and only reads sample 0.

Lines 785-791 read only n0*n1*n2 from offset 0, so for N > 1 timing is extracted from the first batch slice only. This silently produces incomplete/wrong results for batched synthesis.

Proposed guard (fail fast until per-sample output is supported)
+        if (N != 1) {
+            fprintf(stderr, "[DiT] lyric timing capture currently supports N==1 only\n");
+            ggml_free(ctx);
+            return -1;
+        }
+
         for (int layer = 2; layer <= 6; layer++) {

Also applies to: 798-815

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/dit-sampler.h` around lines 785 - 791, The code extracting lyric/timing
scores currently reads only the first batch slice from tensor t_scores by using
ggml_backend_tensor_get with offset 0 and sample_elems = n0*n1*n2 (variables n0,
n1, n2, sample_elems, and t_scores), which ignores any batch dimension; change
this to either (A) guard/fail-fast: detect the batch size (the remaining
dimension of t_scores, e.g. n3 or t_scores->ne[3]) and return an error or assert
if batch > 1 so callers know per-sample output is unsupported, or (B) support
batching by looping over batch index b and calling ggml_backend_tensor_get with
the proper per-batch offset (b * sample_elems) to fill per-sample score buffers;
update the related blocks (the similar section around lines 798-815) to use the
same guard or per-batch loop and ensure sample_elems is computed as n0*n1*n2 and
multiplied by batch when computing total bytes read.

Comment on lines +480 to +485
std::string header_str = std::string("# Languages\n") + language_b + "\n\n# Lyric\n";
auto header_ids = bpe_encode(bpe, header_str.c_str(), false);

s.lyric_token_ids = lyric_ids;
s.lyric_start_idx = (int) header_ids.size();
s.lyric_end_idx = (int) lyric_ids.size();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Compute lyric_start_idx from the full token stream, not a separately tokenized header.

With BPE, merges can cross the header + lyrics boundary, so header_ids.size() is not guaranteed to match the real start index inside lyric_ids. This can mis-slice tokens and skew timing JSON.

A safer approach is to derive the boundary from the already-tokenized full sequence (e.g., decode-incremental char offsets against header_str.size()), then set lyric_start_idx from that exact crossing point.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/pipeline-synth-ops.cpp` around lines 480 - 485, The current code uses
header_ids.size() to set s.lyric_start_idx, which is wrong because BPE merges
can cross the header+lyrics boundary; instead encode the full concatenated
string (e.g., combined_str = header_str + lyric_str) with bpe_encode to get
full_ids and then compute the split index by mapping token boundaries to
character/byte offsets (walk tokens from full_ids and accumulate decoded
byte/char lengths until you reach header_str.size()); set s.lyric_token_ids to
the lyric slice from full_ids starting at that computed split index and set
s.lyric_start_idx and s.lyric_end_idx from that split and lyric length (update
uses of header_ids, header_str, lyric_ids, bpe_encode, s.lyric_token_ids,
s.lyric_start_idx, s.lyric_end_idx).

Comment thread src/pipeline-synth.cpp
Comment on lines +672 to +674
const char * ace_synth_job_get_lyric_timing_json(const AceSynthJob * job, int track_idx) {
if (!job || track_idx != 0 || job->state.lyric_timing_json.empty()) {
return nullptr;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Remove the hardcoded track-0 gate in lyric timing getter.

Line 673 forces track_idx != 0 to return nullptr, so batch outputs beyond index 0 never receive lyric timing even though callers query per-track.

Suggested fix
 const char * ace_synth_job_get_lyric_timing_json(const AceSynthJob * job, int track_idx) {
-    if (!job || track_idx != 0 || job->state.lyric_timing_json.empty()) {
+    if (!job || track_idx < 0 || track_idx >= job->batch_n || job->state.lyric_timing_json.empty()) {
         return nullptr;
     }
     return job->state.lyric_timing_json.c_str();
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/pipeline-synth.cpp` around lines 672 - 674, The function
ace_synth_job_get_lyric_timing_json currently gates access to lyric timing by
hardcoding track_idx == 0; remove that check and implement per-track validation
and retrieval: instead of "track_idx != 0", validate track_idx against the
actual per-job track container (e.g. job->state.lyric_timing_json.size() or
job->state.track_count), return nullptr if out of range, and return the specific
per-track string (e.g. job->state.lyric_timing_json[track_idx].c_str()) only if
that entry is non-empty; update ace_synth_job_get_lyric_timing_json to use those
symbols for correct multi-track behavior.

Comment thread tools/ace-synth.cpp
Comment on lines +333 to +341
FILE * tf = fopen(timing_path, "wb");
if (!tf) {
fprintf(stderr, "[Ace-Synth Batch%d] FATAL: failed to write %s\n", b, timing_path);
} else {
fwrite(lyric_timings[b].data(), 1, lyric_timings[b].size(), tf);
fputc('\n', tf);
fclose(tf);
fprintf(stderr, "[Ace-Synth Batch%d] Wrote %s\n", b, timing_path);
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Check and surface lyric sidecar write failures.

The sidecar path ignores fwrite/fclose outcomes, and write failures don’t affect exit status. This can silently produce incomplete or missing .lyric-timing.json files in automation.

Suggested fix
+    bool lyric_timing_write_failed = false;
     // Write output files
     for (int b = 0; b < (int) all_audio.size(); b++) {
@@
             FILE * tf = fopen(timing_path, "wb");
             if (!tf) {
-                fprintf(stderr, "[Ace-Synth Batch%d] FATAL: failed to write %s\n", b, timing_path);
+                fprintf(stderr, "[Ace-Synth Batch%d] ERROR: failed to open %s\n", b, timing_path);
+                lyric_timing_write_failed = true;
             } else {
-                fwrite(lyric_timings[b].data(), 1, lyric_timings[b].size(), tf);
-                fputc('\n', tf);
-                fclose(tf);
-                fprintf(stderr, "[Ace-Synth Batch%d] Wrote %s\n", b, timing_path);
+                const size_t want  = lyric_timings[b].size();
+                const size_t wrote = fwrite(lyric_timings[b].data(), 1, want, tf);
+                const bool ok = (wrote == want) && (fputc('\n', tf) != EOF) && (fclose(tf) == 0);
+                if (!ok) {
+                    fprintf(stderr, "[Ace-Synth Batch%d] ERROR: failed to flush %s\n", b, timing_path);
+                    lyric_timing_write_failed = true;
+                } else {
+                    fprintf(stderr, "[Ace-Synth Batch%d] Wrote %s\n", b, timing_path);
+                }
             }
         }
@@
-    fprintf(stderr, "[Ace-Synth] All done\n");
-    return 0;
+    if (lyric_timing_write_failed) {
+        return 1;
+    }
+    fprintf(stderr, "[Ace-Synth] All done\n");
+    return 0;
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/ace-synth.cpp` around lines 333 - 341, The current write block for
timing_path opens tf and writes lyric_timings[b] but ignores fwrite/fclose
results; after opening tf in the block that uses tf, check that fwrite returns
the expected byte count (lyric_timings[b].size()) and check fclose returns 0,
and on either failure emit a descriptive stderr message including errno/strerror
(or perror) that includes timing_path and batch index b, and propagate the
failure by setting a non-zero exit/error flag or returning an error from the
enclosing function so the process exits non-zero instead of silently continuing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant