[None][fix] Fix MTP rewind corner case #9693
Conversation
/bot run
📝 Walkthrough: Added a pre-emptive existence check in `rewindKVCache` in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp.
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~8 minutes
Pre-merge checks: ❌ 2 warnings, ✅ 1 passed
Actionable comments posted: 0
🧹 Nitpick comments (1)
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (1)
2845-2856: Guard against stale request IDs in `rewindKVCache` is good; double-check concurrency assumptions

The early existence check under `mSequencesMtx` cleanly handles the MTP overlap case where `rewindKVCache` is invoked after the request has already been removed, avoiding an out-of-range on `mSequences.at()` via `getSequence()` / `removeToken`.

One thing to sanity-check: if `removeSequence(requestId)` can run concurrently with `rewindKVCache(requestId, …)`, there is still a narrow window between this pre-check and the subsequent `removeToken()` calls where the sequence could be extracted, and `getSequence()` would still `at()` a missing key. If the scheduler guarantees these two operations are not concurrent for the same `requestId`, this implementation is sufficient; otherwise, a future refactor (e.g., splitting `removeToken` into a locked helper and holding `mSequencesMtx` across the whole rewind) might be worth considering.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.{cpp,h,cu}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{cpp,h,cu}: Closing braces of namespaces should have a comment saying the namespace it closes (e.g., `} // namespace foo`)
Prefer `const` or `constexpr` variables over `#define` whenever possible, as the latter are not visible to the compiler
A variable that is not modified after its initialization should be declared as `const`
Except `0` (only used in comparison for checking signness/existence/emptiness) and `nullptr`, `true`, `false`, all other literals should only be used for variable initialization and should be replaced with named constants
Use Allman indentation style for braces in C++
Put the semicolon for an empty `for` or `while` loop in a new line
The statement forming the body of a `switch`, `while`, `do .. while` or `for` statement shall be a compound statement (use brace-delimited statements)
`if` and `else` should always be followed by brace-delimited statements, even if empty or a single statement
C++ filenames should use camel case with first letter lowercase (e.g., `thisIsASubDir` and `thisIsAFilename.cpp`)
All filenames involved in compilation of a compilation target must have case-insensitive unique filenames
All types (including class names) should use camel case with uppercase first letter (e.g., `FooBarClass`)
Local variables, methods and namespaces should use camel case with first letter lowercase (e.g., `localFooBar`)
Non-magic-number global variables that are non-static and not defined in anonymous namespace should use camel case prefixed by a lower case 'g' (e.g., `gDontUseGlobalFoos`)
Non-magic-number global variables that are static or defined in an anonymous namespace should use camel case prefixed by a lower case 's' (e.g., `sMutableStaticGlobal`)
Locally visible static variables should use camel case with lowercase prefix 's' as the first letter of the name (e.g., `static std::once_flag sFlag;`)
Public, private and protected class member variables should use camel case prefixed with 'm' (e.g., `mNbFooValues`), though the 'm' pre...
Files:
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
**/*.{cpp,h,cu,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
All TensorRT-LLM Open Source Software code files should contain an NVIDIA copyright header that includes the current year at the top
Files:
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
🧠 Learnings
📓 Common learnings
Learnt from: thorjohnsen
Repo: NVIDIA/TensorRT-LLM PR: 6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.248Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.
Learnt from: eopXD
Repo: NVIDIA/TensorRT-LLM PR: 6768
File: cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h:0-0
Timestamp: 2025-08-20T06:48:45.368Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, updateSequenceCacheBlockOffsets is only called when adding a sequence, not during detach operations. During detach, the cache block bookkeeping is handled by GenerationRequest::removeFrontBlock.
Learnt from: eopXD
Repo: NVIDIA/TensorRT-LLM PR: 6768
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:2010-2045
Timestamp: 2025-08-21T09:41:49.347Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, updateSequenceCacheBlockOffsets is specifically for updating bookkeeping when blocks are added during the context phase, not for refreshing offsets after detach operations. During detach operations, GenerationRequest::removeFrontBlock handles the necessary cache block bookkeeping internally.
Learnt from: eopXD
Repo: NVIDIA/TensorRT-LLM PR: 6768
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:577-579
Timestamp: 2025-08-20T06:56:02.889Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, maxSequenceLength is now enforced as a non-optional argument in the BlockManager constructor, so concerns about std::nullopt defaulting to 0 are not applicable. When windowSize > maxSequenceLength, a warning should be added instead of handling optional parameter cases.
Learnt from: eopXD
Repo: NVIDIA/TensorRT-LLM PR: 6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:54.897Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp addToken function, newly allocated blocks are unshared by design. The beam search path in addToken (when sequence.getNumTokens() > windowSize) is currently broken/non-functional with SWA, so the block allocation doesn't follow a shared-then-unshared pattern.
📚 Learning: 2025-08-15T06:46:53.813Z
Learnt from: eopXD
Repo: NVIDIA/TensorRT-LLM PR: 6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:53.813Z
Learning: In the TensorRT-LLM KV cache manager, SWA (Sliding Window Attention) combined with beam search is currently in a broken/non-functional state and is planned for future rework. During preparatory refactoring phases, code related to SWA+beam search may intentionally remain in a non-working state until the broader rework is completed.
Applied to files:
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
📚 Learning: 2025-08-06T08:18:28.669Z
Learnt from: zhengd-nv
Repo: NVIDIA/TensorRT-LLM PR: 6633
File: cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp:145-155
Timestamp: 2025-08-06T08:18:28.669Z
Learning: In cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp, the existing `mMtxForMap` mutex in DataSenderImpl is sufficient to synchronize measurement file operations in the `release` method, as all file operations occur within the same critical section that protects the `mRequestToSession` map access.
Applied to files:
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
🧬 Code graph analysis (1)
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (1)
cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp (16)
requestId (108-111), requestId (108-108), requestId (113-117), requestId (113-114), requestId (119-124), requestId (119-121), requestId (126-130), requestId (126-127), requestId (132-135), requestId (132-132), requestId (137-140), requestId (137-137), requestId (169-172), requestId (169-169), requestId (190-194), requestId (190-191)
@lfr-0531 Could you please help the customer review this fix? Thank you!
Force-pushed from 88b27ce to be1f481
/bot run
PR_Github #29632 [ run ] triggered by Bot. Commit:
PR_Github #29632 [ run ] completed with state
eopXD left a comment:
Slightly confused about the corner case here.
Rewind is only called if the generation request is not in a complete state. This should be a hard assertion rather than returning with only a warning.
```python
# rewind kv cache
for request in scheduled_batch.generation_requests:
    if request.state != LlmRequestState.GENERATION_COMPLETE:
        if request.py_rewind_len > 0:
            self.rewind_kv_cache(request, request.py_rewind_len)
```
May I ask what specific case you encountered that requires the code change in this MR?
Hi, this corner case also happens when using TensorRT-LLM with FlexKV. In that case, when KV cache offloading finishes, the corresponding resources are released. However, when combined with overlap scheduling + MTP, if the MTP verification fails in the overlap step, a KV cache rewind may be triggered. At that point, the KV cache has already been released by the earlier offloading, so the rewind operation attempts to access non-existent KV cache entries, which leads to an IndexError.
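To make the failure mode concrete, here is a small self-contained toy (not TensorRT-LLM code; the map stands in for `mSequences`) showing why an unguarded `at()`-style lookup fails once the entry has been released, while an existence check turns the rewind into a safe no-op:

```cpp
#include <cstdio>
#include <map>
#include <stdexcept>

int main()
{
    // Toy stand-in for the sequence table: requestId -> sequence state.
    std::map<unsigned long, int> sequences{{42, 7}};

    // 1. KV cache offloading finishes; the request's resources are released.
    sequences.erase(42);

    // 2. Overlap step: MTP verification fails and triggers a rewind.
    //    Unguarded lookup (pre-fix behavior) throws on the stale ID.
    try
    {
        (void) sequences.at(42);
    }
    catch (std::out_of_range const&)
    {
        std::puts("at() on a removed request throws");
    }

    // 3. Guarded lookup (the shape of the fix): skip the rewind instead.
    if (sequences.find(42) == sequences.end())
    {
        std::puts("request already removed; skipping rewind");
    }
    return 0;
}
```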
Force-pushed from be1f481 to 526b3ca
/bot run
PR_Github #30972 [ run ] triggered by Bot. Commit:
After discussion, we believe that the situation where KV offloading to storage overlaps with MTP is inherently conflicting. Could @jthomson04 help take a look or add some tests? Thanks! cc @pcastonguay
PR_Github #30972 [ run ] completed with state
Force-pushed from 526b3ca to cca24d7
/bot run
/bot run
PR_Github #32838 [ run ] triggered by Bot. Commit:
PR_Github #32839 [ run ] triggered by Bot. Commit:
PR_Github #32838 [ run ] completed with state
PR_Github #32839 [ run ] completed with state
@axxx03 Can you please run pre-commit and push the changes?
/bot run --disable-fail-fast
PR_Github #33211 [ run ] triggered by Bot. Commit:
@axxx03 it's failing the pre-commit check
PR_Github #33211 [ run ] completed with state
@dagil-nvidia @jthomson04 fyi I opened #10941 to fix pre-commit and trigger CI so it can run overnight.
Closing. #10941 has been merged.
Sorry for the late reply. |
Description
When using the overlap scheduler with MTP, the request may have already been terminated and removed from mSequences before rewindKVCache is called. rewindKVCache should detect this case and skip the rewind instead of accessing a removed sequence.