Conversation

@axxx03 (Contributor) commented Dec 4, 2025

Description

When using the overlap scheduler with MTP, a request may already have been terminated and removed from mSequences by the time rewindKVCache is called; we should avoid rewinding in that case.
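
For context, a minimal sketch of the kind of guard this adds in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp. It assumes the existing mSequences map and mSequencesMtx mutex mentioned in the review below; the parameter names and the logging call are illustrative, not the exact code:

    void KVCacheManager::rewindKVCache(RequestIdType requestId, SizeType32 rewindLengths)
    {
        {
            // Sketch of the guard: if the request was already terminated and its
            // sequence removed (overlap scheduler + MTP), skip the rewind entirely.
            std::scoped_lock lock(mSequencesMtx);
            if (mSequences.find(requestId) == mSequences.end())
            {
                TLLM_LOG_WARNING("rewindKVCache: request %lu no longer exists in mSequences, skipping rewind", requestId);
                return;
            }
        }
        // ... existing rewind logic (removeToken() calls, etc.) follows ...
    }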

@axxx03 axxx03 requested a review from a team as a code owner December 4, 2025 06:39
@axxx03 (Contributor Author) commented Dec 4, 2025

/bot run

@coderabbitai bot (Contributor) commented Dec 4, 2025

📝 Walkthrough

Added a pre-emptive existence check in KVCacheManager::rewindKVCache to return early if the target sequence no longer exists in mSequences, preventing rewinding operations on removed sequences in multi-token prediction overlap scenarios.

Changes

  • KVCache defensive guard (cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp): Added existence validation for the sequence requestId before the rewind operation; the function returns early if the sequence is not found in mSequences, skipping the token rewind to prevent errors on terminated or removed sequences.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

  • Verify the guard condition logic and scope (locked guard placement)
  • Confirm that early return doesn't skip necessary cleanup or state synchronization
  • Check that sequence removal scenarios (overlap termination, MTP mode) align with this defensive behavior

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)
  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 0.00%, which is insufficient; the required threshold is 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.
  • Description check: ⚠️ Warning. The PR description is minimal and lacks critical required sections from the template, such as test coverage details and PR checklist completion. Resolution: complete the description with (1) a detailed explanation of the bug scenario and why it occurs, (2) specific test coverage details that validate the fix, and (3) confirmation of PR checklist items, particularly testing and coding guidelines compliance.
✅ Passed checks (1 passed)
  • Title check: ✅ Passed. The title specifically identifies a bug fix for an MTP rewind corner case, which directly aligns with the actual change: adding an existence check in KVCacheManager::rewindKVCache to handle missing sequences in MTP overlap mode.

@coderabbitai bot (Contributor) left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (1)

2845-2856: Guard against stale request IDs in rewindKVCache is good; double‑check concurrency assumptions

The early existence check under mSequencesMtx cleanly handles the MTP overlap case where rewindKVCache is invoked after the request has already been removed, avoiding an out‑of‑range on mSequences.at() via getSequence() / removeToken.

One thing to sanity‑check: if removeSequence(requestId) can run concurrently with rewindKVCache(requestId, …), there’s still a narrow window between this pre‑check and the subsequent removeToken() calls where the sequence could be extracted, and getSequence() would still at() a missing key. If the scheduler guarantees these two operations are not concurrent for the same requestId, this implementation is sufficient; otherwise, a future refactor (e.g., splitting removeToken into a locked helper and holding mSequencesMtx across the whole rewind) might be worth considering.
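
If that refactor is ever pursued, it might look roughly like the following. removeTokenLocked is a hypothetical helper that assumes the caller already holds mSequencesMtx, and the loop is only a sketch of the existing rewind flow, not the actual implementation:

    void KVCacheManager::rewindKVCache(RequestIdType requestId, SizeType32 rewindLengths)
    {
        // Hold mSequencesMtx across the whole rewind so a concurrent
        // removeSequence(requestId) cannot slip in between the existence check
        // and the token removal.
        std::scoped_lock lock(mSequencesMtx);
        auto it = mSequences.find(requestId);
        if (it == mSequences.end())
        {
            return;
        }
        for (SizeType32 i = 0; i < rewindLengths; ++i)
        {
            // Hypothetical variant of removeToken() that does not re-acquire the mutex.
            removeTokenLocked(it->second);
        }
    }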

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f9aa86d and 88b27ce.

📒 Files selected for processing (1)
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.{cpp,h,cu}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.{cpp,h,cu}: Closing braces of namespaces should have a comment saying the namespace it closes (e.g., } // namespace foo)
Prefer const or constexpr variables over #define whenever possible, as the latter are not visible to the compiler
A variable that is not modified after its initialization should be declared as const
Except 0 (only used in comparison for checking signness/existence/emptiness) and nullptr, true, false, all other literals should only be used for variable initialization and should be replaced with named constants
Use Allman indentation style for braces in C++
Put the semicolon for an empty for or while loop in a new line
The statement forming the body of a switch, while, do .. while or for statement shall be a compound statement (use brace-delimited statements)
If and else should always be followed by brace-delimited statements, even if empty or a single statement
C++ filenames should use camel case with first letter lowercase (e.g., thisIsASubDir and thisIsAFilename.cpp)
All filenames involved in compilation of a compilation target must have case-insensitive unique filenames
All types (including class names) should use camel case with uppercase first letter (e.g., FooBarClass)
Local variables, methods and namespaces should use camel case with first letter lowercase (e.g., localFooBar)
Non-magic-number global variables that are non-static and not defined in anonymous namespace should use camel case prefixed by a lower case 'g' (e.g., gDontUseGlobalFoos)
Non-magic-number global variables that are static or defined in an anonymous namespace should use camel case prefixed by a lower case 's' (e.g., sMutableStaticGlobal)
Locally visible static variables should use camel case with lowercase prefix 's' as the first letter of the name (e.g., static std::once_flag sFlag;)
Public, private and protected class member variables should use camel case prefixed with 'm' (e.g., mNbFooValues), though the 'm' pre...

Files:

  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
**/*.{cpp,h,cu,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

All TensorRT-LLM Open Source Software code files should contain an NVIDIA copyright header that includes the current year at the top

Files:

  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
🧠 Learnings (8)
📓 Common learnings
Learnt from: thorjohnsen
Repo: NVIDIA/TensorRT-LLM PR: 6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.248Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.
Learnt from: eopXD
Repo: NVIDIA/TensorRT-LLM PR: 6768
File: cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h:0-0
Timestamp: 2025-08-20T06:48:45.368Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, updateSequenceCacheBlockOffsets is only called when adding a sequence, not during detach operations. During detach, the cache block bookkeeping is handled by GenerationRequest::removeFrontBlock.
Learnt from: eopXD
Repo: NVIDIA/TensorRT-LLM PR: 6768
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:2010-2045
Timestamp: 2025-08-21T09:41:49.347Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, updateSequenceCacheBlockOffsets is specifically for updating bookkeeping when blocks are added during the context phase, not for refreshing offsets after detach operations. During detach operations, GenerationRequest::removeFrontBlock handles the necessary cache block bookkeeping internally.
Learnt from: eopXD
Repo: NVIDIA/TensorRT-LLM PR: 6768
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:577-579
Timestamp: 2025-08-20T06:56:02.889Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, maxSequenceLength is now enforced as a non-optional argument in the BlockManager constructor, so concerns about std::nullopt defaulting to 0 are not applicable. When windowSize > maxSequenceLength, a warning should be added instead of handling optional parameter cases.
Learnt from: eopXD
Repo: NVIDIA/TensorRT-LLM PR: 6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:54.897Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp addToken function, newly allocated blocks are unshared by design. The beam search path in addToken (when sequence.getNumTokens() > windowSize) is currently broken/non-functional with SWA, so the block allocation doesn't follow a shared-then-unshared pattern.
📚 Learning: 2025-08-21T09:41:49.347Z
Learnt from: eopXD
Repo: NVIDIA/TensorRT-LLM PR: 6768
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:2010-2045
Timestamp: 2025-08-21T09:41:49.347Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, updateSequenceCacheBlockOffsets is specifically for updating bookkeeping when blocks are added during the context phase, not for refreshing offsets after detach operations. During detach operations, GenerationRequest::removeFrontBlock handles the necessary cache block bookkeeping internally.

Applied to files:

  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
📚 Learning: 2025-08-20T06:48:45.368Z
Learnt from: eopXD
Repo: NVIDIA/TensorRT-LLM PR: 6768
File: cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h:0-0
Timestamp: 2025-08-20T06:48:45.368Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, updateSequenceCacheBlockOffsets is only called when adding a sequence, not during detach operations. During detach, the cache block bookkeeping is handled by GenerationRequest::removeFrontBlock.

Applied to files:

  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
📚 Learning: 2025-08-15T06:46:54.897Z
Learnt from: eopXD
Repo: NVIDIA/TensorRT-LLM PR: 6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:54.897Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp addToken function, newly allocated blocks are unshared by design. The beam search path in addToken (when sequence.getNumTokens() > windowSize) is currently broken/non-functional with SWA, so the block allocation doesn't follow a shared-then-unshared pattern.

Applied to files:

  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
📚 Learning: 2025-08-20T06:56:02.889Z
Learnt from: eopXD
Repo: NVIDIA/TensorRT-LLM PR: 6768
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:577-579
Timestamp: 2025-08-20T06:56:02.889Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, maxSequenceLength is now enforced as a non-optional argument in the BlockManager constructor, so concerns about std::nullopt defaulting to 0 are not applicable. When windowSize > maxSequenceLength, a warning should be added instead of handling optional parameter cases.

Applied to files:

  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
📚 Learning: 2025-08-14T21:04:50.248Z
Learnt from: thorjohnsen
Repo: NVIDIA/TensorRT-LLM PR: 6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.248Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.

Applied to files:

  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
📚 Learning: 2025-08-15T06:46:53.813Z
Learnt from: eopXD
Repo: NVIDIA/TensorRT-LLM PR: 6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:53.813Z
Learning: In the TensorRT-LLM KV cache manager, SWA (Sliding Window Attention) combined with beam search is currently in a broken/non-functional state and is planned for future rework. During preparatory refactoring phases, code related to SWA+beam search may intentionally remain in a non-working state until the broader rework is completed.

Applied to files:

  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
📚 Learning: 2025-08-06T08:18:28.669Z
Learnt from: zhengd-nv
Repo: NVIDIA/TensorRT-LLM PR: 6633
File: cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp:145-155
Timestamp: 2025-08-06T08:18:28.669Z
Learning: In cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp, the existing `mMtxForMap` mutex in DataSenderImpl is sufficient to synchronize measurement file operations in the `release` method, as all file operations occur within the same critical section that protects the `mRequestToSession` map access.

Applied to files:

  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
🧬 Code graph analysis (1)
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (1)
cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp (16)
  • requestId (108-111)
  • requestId (108-108)
  • requestId (113-117)
  • requestId (113-114)
  • requestId (119-124)
  • requestId (119-121)
  • requestId (126-130)
  • requestId (126-127)
  • requestId (132-135)
  • requestId (132-132)
  • requestId (137-140)
  • requestId (137-137)
  • requestId (169-172)
  • requestId (169-169)
  • requestId (190-194)
  • requestId (190-191)

@svc-trtllm-gh-bot svc-trtllm-gh-bot added the "Community want to contribute" (PRs initiated from Community) label on Dec 4, 2025
@Shixiaowei02 Shixiaowei02 requested a review from lfr-0531 December 8, 2025 12:52
@Shixiaowei02 (Collaborator)

@lfr-0531 Could you please help the customer review this fix? Thank you!

@Shixiaowei02 Shixiaowei02 force-pushed the bugfix/mtp_with_async_scheduler branch from 88b27ce to be1f481 on December 23, 2025 13:42
@Shixiaowei02 (Collaborator)

/bot run

@Shixiaowei02 Shixiaowei02 changed the title from "[BugFix] Fix MTP rewind corner case" to "[None][fix] Fix MTP rewind corner case" on Dec 23, 2025
@tensorrt-cicd (Collaborator)

PR_Github #29632 [ run ] triggered by Bot. Commit: be1f481

@tensorrt-cicd (Collaborator)

PR_Github #29632 [ run ] completed with state DISABLED
CI server is currently disabled for scheduled maintenance. Estimated completion time: 12 PM PST on 12/23.

@eopXD (Collaborator) left a comment


Slightly confused about the corner case here.

Rewind is only called if the generation request is not in a complete state. This should be a hard assertion rather than returning with only a warning.

        # rewind kv cache
        for request in scheduled_batch.generation_requests:
            if request.state != LlmRequestState.GENERATION_COMPLETE:
                if request.py_rewind_len > 0:
                    self.rewind_kv_cache(request, request.py_rewind_len)

May I ask what specific case have you encountered that requires the code change of this MR?
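
(For comparison, the hard-assertion variant suggested here would look roughly like the following inside rewindKVCache, still under the sequences mutex; TLLM_CHECK_WITH_INFO is used as the assertion macro and the message text is only illustrative:)

    {
        std::scoped_lock lock(mSequencesMtx);
        // Fail fast instead of warning: callers are expected to rewind only
        // requests that are not yet GENERATION_COMPLETE.
        TLLM_CHECK_WITH_INFO(mSequences.find(requestId) != mSequences.end(),
            "rewindKVCache called for request %lu, which no longer exists in mSequences", requestId);
    }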

@axxx03 (Contributor Author) commented Dec 31, 2025

Slightly confused about the corner case here.

Rewind is only called if the generation request is not in a complete state. This should be a hard assertion rather than returning with only a warning.

        # rewind kv cache
        for request in scheduled_batch.generation_requests:
            if request.state != LlmRequestState.GENERATION_COMPLETE:
                if request.py_rewind_len > 0:
                    self.rewind_kv_cache(request, request.py_rewind_len)

May I ask what specific case have you encountered that requires the code change of this MR?

Hi, this corner case also happens when using TensorRT-LLM with FlexKV.

In this case, when KV cache offloading finishes, the corresponding resources are released. However, when combined with overlap scheduling + MTP, if the MTP verification fails in the overlap step, a KV cache rewind may be triggered. At this point, the KV cache has already been released by the earlier offloading, so the rewind operation attempts to access non-existent KV cache entries, which leads to an IndexError.
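
To make the failure mode concrete, here is a tiny standalone toy (not TensorRT-LLM code; all names are made up) that mimics the two lookup styles: once the entry has been erased by the offload/termination path, the unguarded .at() lookup throws, while the guarded variant skips the rewind.

    #include <cstdint>
    #include <iostream>
    #include <stdexcept>
    #include <unordered_map>

    // Toy stand-in for mSequences: requestId -> number of cached tokens.
    using RequestId = std::uint64_t;
    std::unordered_map<RequestId, int> sequences;

    void rewindUnguarded(RequestId id, int n)
    {
        sequences.at(id) -= n; // throws std::out_of_range if id was already removed
    }

    void rewindGuarded(RequestId id, int n)
    {
        auto it = sequences.find(id);
        if (it == sequences.end())
        {
            std::cout << "request " << id << " already removed, skipping rewind\n";
            return;
        }
        it->second -= n;
    }

    int main()
    {
        sequences[42] = 10;
        sequences.erase(42); // offload finished, request terminated and removed
        rewindGuarded(42, 3); // skips cleanly
        try
        {
            rewindUnguarded(42, 3); // reproduces the out-of-range failure
        }
        catch (std::out_of_range const& e)
        {
            std::cout << "out_of_range: " << e.what() << '\n';
        }
        return 0;
    }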

@Shixiaowei02 Shixiaowei02 force-pushed the bugfix/mtp_with_async_scheduler branch from be1f481 to 526b3ca on January 8, 2026 02:25
@Shixiaowei02 (Collaborator)

/bot run

@tensorrt-cicd (Collaborator)

PR_Github #30972 [ run ] triggered by Bot. Commit: 526b3ca

@Shixiaowei02 (Collaborator)

After discussion, we believe that the situation where KV offloading to storage overlaps with MTP is inherently conflicting. Could @jthomson04 help take a look or add some tests? Thanks!

cc @pcastonguay

@tensorrt-cicd (Collaborator)

PR_Github #30972 [ run ] completed with state SUCCESS. Commit: 526b3ca
/LLM/main/L0_MergeRequest_PR pipeline #23930 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

@Shixiaowei02 Shixiaowei02 force-pushed the bugfix/mtp_with_async_scheduler branch from 526b3ca to cca24d7 on January 21, 2026 02:15
@Shixiaowei02 (Collaborator)

/bot run

1 similar comment
@Shixiaowei02 (Collaborator)

/bot run

@tensorrt-cicd (Collaborator)

PR_Github #32838 [ run ] triggered by Bot. Commit: cca24d7

@tensorrt-cicd (Collaborator)

PR_Github #32839 [ run ] triggered by Bot. Commit: cca24d7

@tensorrt-cicd (Collaborator)

PR_Github #32838 [ run ] completed with state ABORTED. Commit: cca24d7

@tensorrt-cicd (Collaborator)

PR_Github #32839 [ run ] completed with state SUCCESS. Commit: cca24d7
/LLM/main/L0_MergeRequest_PR pipeline #25408 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

@jthomson04 (Collaborator)

@axxx03 Can you please run pre-commit and push the changes?

@pcastonguay (Collaborator)

/bot run --disable-fail-fast

@tensorrt-cicd (Collaborator)

PR_Github #33211 [ run ] triggered by Bot. Commit: cca24d7

@dagil-nvidia

@axxx03 it's failing the pre-commit check

@tensorrt-cicd (Collaborator)

PR_Github #33211 [ run ] completed with state SUCCESS. Commit: cca24d7
/LLM/main/L0_MergeRequest_PR pipeline #25659 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

@pcastonguay (Collaborator)

@dagil-nvidia @jthomson04 fyi I opened #10941 to fix pre-commit and trigger CI so it can run overnight.

@pcastonguay (Collaborator)

Closing. #10941 has been merged.

@axxx03 (Contributor Author) commented Jan 26, 2026

Sorry for the late reply.
