feat: add optional DiT lyric timing output#79
Conversation
📝 WalkthroughWalkthroughThis PR implements optional lyric timing capture during music synthesis. When enabled via ChangesLyric Timing Feature
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 6
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docs/ARCHITECTURE.md`:
- Around line 938-940: Update the multipart/mixed docs for synth to explicitly
state the batch limitation: clarify that the optional lyric_timing JSON part is
sample-0 scoped (applies only to the first generated track in a batch) and not
produced per track; reference the existing "synth" multipart/mixed description
and mention "lyric_timing" and "sample-0" so readers know only one lyric_timing
part will be present when multiple tracks are generated in a single batch.
In `@src/dit-sampler.h`:
- Around line 785-791: The code extracting lyric/timing scores currently reads
only the first batch slice from tensor t_scores by using ggml_backend_tensor_get
with offset 0 and sample_elems = n0*n1*n2 (variables n0, n1, n2, sample_elems,
and t_scores), which ignores any batch dimension; change this to either (A)
guard/fail-fast: detect the batch size (the remaining dimension of t_scores,
e.g. n3 or t_scores->ne[3]) and return an error or assert if batch > 1 so
callers know per-sample output is unsupported, or (B) support batching by
looping over batch index b and calling ggml_backend_tensor_get with the proper
per-batch offset (b * sample_elems) to fill per-sample score buffers; update the
related blocks (the similar section around lines 798-815) to use the same guard
or per-batch loop and ensure sample_elems is computed as n0*n1*n2 and multiplied
by batch when computing total bytes read.
- Around line 743-749: The code currently hardcodes t_last = 1.0f / (float)
num_steps which can mismatch non-linear/custom schedules; instead compute t_last
from the actual sampling schedule used by the sampler (e.g., read the final
timestep value from the schedule array or call the schedule accessor used by
sampling) and assign that value to t_last before calling ggml_backend_tensor_set
for t_t and t_tr (replace the 1.0f/num_steps expression with the schedule's
final timestep used by the sampling code so replay aligns exactly with the
sampler).
In `@src/pipeline-synth-ops.cpp`:
- Around line 480-485: The current code uses header_ids.size() to set
s.lyric_start_idx, which is wrong because BPE merges can cross the header+lyrics
boundary; instead encode the full concatenated string (e.g., combined_str =
header_str + lyric_str) with bpe_encode to get full_ids and then compute the
split index by mapping token boundaries to character/byte offsets (walk tokens
from full_ids and accumulate decoded byte/char lengths until you reach
header_str.size()); set s.lyric_token_ids to the lyric slice from full_ids
starting at that computed split index and set s.lyric_start_idx and
s.lyric_end_idx from that split and lyric length (update uses of header_ids,
header_str, lyric_ids, bpe_encode, s.lyric_token_ids, s.lyric_start_idx,
s.lyric_end_idx).
In `@src/pipeline-synth.cpp`:
- Around line 672-674: The function ace_synth_job_get_lyric_timing_json
currently gates access to lyric timing by hardcoding track_idx == 0; remove that
check and implement per-track validation and retrieval: instead of "track_idx !=
0", validate track_idx against the actual per-job track container (e.g.
job->state.lyric_timing_json.size() or job->state.track_count), return nullptr
if out of range, and return the specific per-track string (e.g.
job->state.lyric_timing_json[track_idx].c_str()) only if that entry is
non-empty; update ace_synth_job_get_lyric_timing_json to use those symbols for
correct multi-track behavior.
In `@tools/ace-synth.cpp`:
- Around line 333-341: The current write block for timing_path opens tf and
writes lyric_timings[b] but ignores fwrite/fclose results; after opening tf in
the block that uses tf, check that fwrite returns the expected byte count
(lyric_timings[b].size()) and check fclose returns 0, and on either failure emit
a descriptive stderr message including errno/strerror (or perror) that includes
timing_path and batch index b, and propagate the failure by setting a non-zero
exit/error flag or returning an error from the enclosing function so the process
exits non-zero instead of silently continuing.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 911578a7-d8c9-4cfa-816c-e79942c32081
📒 Files selected for processing (16)
README.mddocs/ARCHITECTURE.mdsrc/debug.hsrc/dit-graph.hsrc/dit-sampler.hsrc/dit.hsrc/lyric-timing.hsrc/pipeline-synth-impl.hsrc/pipeline-synth-ops.cppsrc/pipeline-synth.cppsrc/pipeline-synth.hsrc/request.cppsrc/request.htools/ace-server.cpptools/ace-synth.cpptools/synth-batch-runner.h
| synth: multipart/mixed (one audio part + one latent part per track, paired; | ||
| optional lyric_timing JSON part when requested) | ||
| understand: multipart/mixed (one json part + one latent part for the source) |
There was a problem hiding this comment.
Document the batch limitation for lyric_timing explicitly.
This section should state that lyric timing is currently sample-0 scoped (only the first generated track in a batch), otherwise clients may assume one lyric_timing part per track.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@docs/ARCHITECTURE.md` around lines 938 - 940, Update the multipart/mixed docs
for synth to explicitly state the batch limitation: clarify that the optional
lyric_timing JSON part is sample-0 scoped (applies only to the first generated
track in a batch) and not produced per track; reference the existing "synth"
multipart/mixed description and mention "lyric_timing" and "sample-0" so readers
know only one lyric_timing part will be present when multiple tracks are
generated in a single batch.
| float t_last = 1.0f / (float) num_steps; | ||
| if (t_t) { | ||
| ggml_backend_tensor_set(t_t, &t_last, 0, sizeof(float)); | ||
| } | ||
| if (t_tr) { | ||
| ggml_backend_tensor_set(t_tr, &t_last, 0, sizeof(float)); | ||
| } |
There was a problem hiding this comment.
Use the actual final schedule timestep for alignment replay.
Line 743 hardcodes 1.0f / num_steps, which can desync replay from the real final denoise step when schedule is non-linear/custom. Use the same last timestep used by sampling.
Proposed fix
- float t_last = 1.0f / (float) num_steps;
+ float t_last = schedule[num_steps - 1];📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| float t_last = 1.0f / (float) num_steps; | |
| if (t_t) { | |
| ggml_backend_tensor_set(t_t, &t_last, 0, sizeof(float)); | |
| } | |
| if (t_tr) { | |
| ggml_backend_tensor_set(t_tr, &t_last, 0, sizeof(float)); | |
| } | |
| float t_last = schedule[num_steps - 1]; | |
| if (t_t) { | |
| ggml_backend_tensor_set(t_t, &t_last, 0, sizeof(float)); | |
| } | |
| if (t_tr) { | |
| ggml_backend_tensor_set(t_tr, &t_last, 0, sizeof(float)); | |
| } |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/dit-sampler.h` around lines 743 - 749, The code currently hardcodes
t_last = 1.0f / (float) num_steps which can mismatch non-linear/custom
schedules; instead compute t_last from the actual sampling schedule used by the
sampler (e.g., read the final timestep value from the schedule array or call the
schedule accessor used by sampling) and assign that value to t_last before
calling ggml_backend_tensor_set for t_t and t_tr (replace the 1.0f/num_steps
expression with the schedule's final timestep used by the sampling code so
replay aligns exactly with the sampler).
| int n0 = (int) t_scores->ne[0]; // encoder tokens | ||
| int n1 = (int) t_scores->ne[1]; // latent frames | ||
| int n2 = (int) t_scores->ne[2]; // heads | ||
| int sample_elems = n0 * n1 * n2; | ||
| std::vector<float> scores(sample_elems); | ||
| ggml_backend_tensor_get(t_scores, scores.data(), 0, sample_elems * sizeof(float)); | ||
|
|
There was a problem hiding this comment.
Lyric score extraction currently ignores batch dimension and only reads sample 0.
Lines 785-791 read only n0*n1*n2 from offset 0, so for N > 1 timing is extracted from the first batch slice only. This silently produces incomplete/wrong results for batched synthesis.
Proposed guard (fail fast until per-sample output is supported)
+ if (N != 1) {
+ fprintf(stderr, "[DiT] lyric timing capture currently supports N==1 only\n");
+ ggml_free(ctx);
+ return -1;
+ }
+
for (int layer = 2; layer <= 6; layer++) {Also applies to: 798-815
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/dit-sampler.h` around lines 785 - 791, The code extracting lyric/timing
scores currently reads only the first batch slice from tensor t_scores by using
ggml_backend_tensor_get with offset 0 and sample_elems = n0*n1*n2 (variables n0,
n1, n2, sample_elems, and t_scores), which ignores any batch dimension; change
this to either (A) guard/fail-fast: detect the batch size (the remaining
dimension of t_scores, e.g. n3 or t_scores->ne[3]) and return an error or assert
if batch > 1 so callers know per-sample output is unsupported, or (B) support
batching by looping over batch index b and calling ggml_backend_tensor_get with
the proper per-batch offset (b * sample_elems) to fill per-sample score buffers;
update the related blocks (the similar section around lines 798-815) to use the
same guard or per-batch loop and ensure sample_elems is computed as n0*n1*n2 and
multiplied by batch when computing total bytes read.
| std::string header_str = std::string("# Languages\n") + language_b + "\n\n# Lyric\n"; | ||
| auto header_ids = bpe_encode(bpe, header_str.c_str(), false); | ||
|
|
||
| s.lyric_token_ids = lyric_ids; | ||
| s.lyric_start_idx = (int) header_ids.size(); | ||
| s.lyric_end_idx = (int) lyric_ids.size(); |
There was a problem hiding this comment.
Compute lyric_start_idx from the full token stream, not a separately tokenized header.
With BPE, merges can cross the header + lyrics boundary, so header_ids.size() is not guaranteed to match the real start index inside lyric_ids. This can mis-slice tokens and skew timing JSON.
A safer approach is to derive the boundary from the already-tokenized full sequence (e.g., decode-incremental char offsets against header_str.size()), then set lyric_start_idx from that exact crossing point.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/pipeline-synth-ops.cpp` around lines 480 - 485, The current code uses
header_ids.size() to set s.lyric_start_idx, which is wrong because BPE merges
can cross the header+lyrics boundary; instead encode the full concatenated
string (e.g., combined_str = header_str + lyric_str) with bpe_encode to get
full_ids and then compute the split index by mapping token boundaries to
character/byte offsets (walk tokens from full_ids and accumulate decoded
byte/char lengths until you reach header_str.size()); set s.lyric_token_ids to
the lyric slice from full_ids starting at that computed split index and set
s.lyric_start_idx and s.lyric_end_idx from that split and lyric length (update
uses of header_ids, header_str, lyric_ids, bpe_encode, s.lyric_token_ids,
s.lyric_start_idx, s.lyric_end_idx).
| const char * ace_synth_job_get_lyric_timing_json(const AceSynthJob * job, int track_idx) { | ||
| if (!job || track_idx != 0 || job->state.lyric_timing_json.empty()) { | ||
| return nullptr; |
There was a problem hiding this comment.
Remove the hardcoded track-0 gate in lyric timing getter.
Line 673 forces track_idx != 0 to return nullptr, so batch outputs beyond index 0 never receive lyric timing even though callers query per-track.
Suggested fix
const char * ace_synth_job_get_lyric_timing_json(const AceSynthJob * job, int track_idx) {
- if (!job || track_idx != 0 || job->state.lyric_timing_json.empty()) {
+ if (!job || track_idx < 0 || track_idx >= job->batch_n || job->state.lyric_timing_json.empty()) {
return nullptr;
}
return job->state.lyric_timing_json.c_str();
}🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/pipeline-synth.cpp` around lines 672 - 674, The function
ace_synth_job_get_lyric_timing_json currently gates access to lyric timing by
hardcoding track_idx == 0; remove that check and implement per-track validation
and retrieval: instead of "track_idx != 0", validate track_idx against the
actual per-job track container (e.g. job->state.lyric_timing_json.size() or
job->state.track_count), return nullptr if out of range, and return the specific
per-track string (e.g. job->state.lyric_timing_json[track_idx].c_str()) only if
that entry is non-empty; update ace_synth_job_get_lyric_timing_json to use those
symbols for correct multi-track behavior.
| FILE * tf = fopen(timing_path, "wb"); | ||
| if (!tf) { | ||
| fprintf(stderr, "[Ace-Synth Batch%d] FATAL: failed to write %s\n", b, timing_path); | ||
| } else { | ||
| fwrite(lyric_timings[b].data(), 1, lyric_timings[b].size(), tf); | ||
| fputc('\n', tf); | ||
| fclose(tf); | ||
| fprintf(stderr, "[Ace-Synth Batch%d] Wrote %s\n", b, timing_path); | ||
| } |
There was a problem hiding this comment.
Check and surface lyric sidecar write failures.
The sidecar path ignores fwrite/fclose outcomes, and write failures don’t affect exit status. This can silently produce incomplete or missing .lyric-timing.json files in automation.
Suggested fix
+ bool lyric_timing_write_failed = false;
// Write output files
for (int b = 0; b < (int) all_audio.size(); b++) {
@@
FILE * tf = fopen(timing_path, "wb");
if (!tf) {
- fprintf(stderr, "[Ace-Synth Batch%d] FATAL: failed to write %s\n", b, timing_path);
+ fprintf(stderr, "[Ace-Synth Batch%d] ERROR: failed to open %s\n", b, timing_path);
+ lyric_timing_write_failed = true;
} else {
- fwrite(lyric_timings[b].data(), 1, lyric_timings[b].size(), tf);
- fputc('\n', tf);
- fclose(tf);
- fprintf(stderr, "[Ace-Synth Batch%d] Wrote %s\n", b, timing_path);
+ const size_t want = lyric_timings[b].size();
+ const size_t wrote = fwrite(lyric_timings[b].data(), 1, want, tf);
+ const bool ok = (wrote == want) && (fputc('\n', tf) != EOF) && (fclose(tf) == 0);
+ if (!ok) {
+ fprintf(stderr, "[Ace-Synth Batch%d] ERROR: failed to flush %s\n", b, timing_path);
+ lyric_timing_write_failed = true;
+ } else {
+ fprintf(stderr, "[Ace-Synth Batch%d] Wrote %s\n", b, timing_path);
+ }
}
}
@@
- fprintf(stderr, "[Ace-Synth] All done\n");
- return 0;
+ if (lyric_timing_write_failed) {
+ return 1;
+ }
+ fprintf(stderr, "[Ace-Synth] All done\n");
+ return 0;🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tools/ace-synth.cpp` around lines 333 - 341, The current write block for
timing_path opens tf and writes lyric_timings[b] but ignores fwrite/fclose
results; after opening tf in the block that uses tf, check that fwrite returns
the expected byte count (lyric_timings[b].size()) and check fclose returns 0,
and on either failure emit a descriptive stderr message including errno/strerror
(or perror) that includes timing_path and batch index b, and propagate the
failure by setting a non-zero exit/error flag or returning an error from the
enclosing function so the process exits non-zero instead of silently continuing.
Summary
return_lyric_timingrequest field for ACE-Step DiT lyric-attention timingace-synthas*.lyric-timing.jsonand from/synthas an optionallyric_timingmultipart JSON partNotes
This is DiT lyric-attention timing, not ASR ground truth. If generated vocals skip or mutate words, the timestamps can reflect the model's intended lyric plan rather than the audible waveform.
Validation
g++ -std=c++17 ... -fsyntax-only src/pipeline-synth-ops.cppg++ -std=c++17 ... -fsyntax-only tools/ace-synth.cppg++ -std=c++17 ... -fsyntax-only tools/ace-server.cppFull local CMake build was not run in this environment because
cmakeis not installed on the host.Summary by CodeRabbit
Release Notes
New Features
return_lyric_timingparameter to enable lyric timing output*.lyric-timing.jsonsidecar files containing lyric timestamp informationDocumentation