Avoid CPU offload wait_event for validation #2793
vasunvidia wants to merge 2 commits into NVIDIA:main from
Conversation
Greptile Summary
This PR optimizes the CPU-offload code path for the case where no tensors end up in `fwd_gpu_tensor_group.tensor_list` (e.g. during validation).
Confidence Score: 5/5
Important Files Changed
Sequence Diagram

```mermaid
sequenceDiagram
    participant CS as Current CUDA Stream
    participant OS as Offload Stream
    participant OLS as OffloadableLayerState
    Note over OLS: start_offload() — tensor_list non-empty
    CS->>OLS: push_tensor(t)
    OLS->>OS: wait_event(t.start_reload_event)
    OS->>OS: copy_(tensor, non_blocking=True)
    OLS->>OS: finish_offload_event.record()
    Note over OLS: release_activation_forward_gpu_memory() — tensor_list non-empty
    OLS->>CS: wait_event(finish_offload_event)
    OLS->>OLS: fwd_gpu_tensor_group = TensorGroup()
    OLS->>OLS: del finish_offload_event
    Note over OLS: start_offload() — tensor_list EMPTY (e.g. validation)
    OLS-->>OLS: (no event created, no copy issued)
    Note over OLS: release_activation_forward_gpu_memory() — tensor_list EMPTY
    OLS-->>OLS: (no wait_event called — PR optimization)
```
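The two guard paths in the diagram can be modeled without a GPU. The sketch below is illustrative only: events are stand-in flags rather than `torch.cuda.Event` objects, and apart from the names quoted in the PR (`fwd_gpu_tensor_group`, `tensor_list`, `finish_offload_event`, `start_offload`, `release_activation_forward_gpu_memory`, `TensorGroup`) everything here is hypothetical.

```python
class TensorGroup:
    def __init__(self):
        self.tensor_list = []

class OffloadableLayerState:
    """Toy model of the empty-list fast path (not the real PR code)."""
    def __init__(self):
        self.fwd_gpu_tensor_group = TensorGroup()
        self.waited = False  # stands in for current_stream().wait_event(...)

    def start_offload(self):
        # Only record a finish event when there is actually work to copy.
        if len(self.fwd_gpu_tensor_group.tensor_list) > 0:
            self.finish_offload_event = "recorded"

    def release_activation_forward_gpu_memory(self):
        # PR optimization: skip the wait_event when nothing was offloaded.
        if len(self.fwd_gpu_tensor_group.tensor_list) > 0:
            self.waited = True
            self.fwd_gpu_tensor_group = TensorGroup()
            del self.finish_offload_event

# Validation: no activations pushed, so tensor_list stays empty.
validation = OffloadableLayerState()
validation.start_offload()
validation.release_activation_forward_gpu_memory()

# Training: one activation queued, so the event/wait machinery runs.
training = OffloadableLayerState()
training.fwd_gpu_tensor_group.tensor_list.append("activation")
training.start_offload()
training.release_activation_forward_gpu_memory()

print(validation.waited, training.waited)  # → False True
```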
Reviews (2): Last reviewed commit: "[pre-commit.ci] auto fixes from pre-comm..."
```python
if len(self.fwd_gpu_tensor_group.tensor_list) > 0:
    torch.cuda.current_stream().wait_event(self.finish_offload_event)  # type: ignore[arg-type]
    # GPU memory can be released safely after the offload.
    # Notice that the memory needs to be kept alive when GPU->CPU copy is performed.
    self.fwd_gpu_tensor_group = TensorGroup()
    del self.finish_offload_event
```
fwd_gpu_tensor_group not reset on empty-list path
When tensor_list is empty, fwd_gpu_tensor_group is left holding the (empty) processed TensorGroup returned by tensor_group_process_before_offload() rather than being reset to a fresh TensorGroup(). This is benign — there are no GPU tensor references to release — and release_all_memory() will always reinitialise it before the next forward pass. However, the behaviour is now asymmetric with the non-empty path and could confuse future readers.
Consider resetting fwd_gpu_tensor_group unconditionally to keep the post-condition of this method consistent regardless of whether tensors were present:
Suggested change:

```python
self.state = "offload_finished"
if len(self.fwd_gpu_tensor_group.tensor_list) > 0:
    torch.cuda.current_stream().wait_event(self.finish_offload_event)  # type: ignore[arg-type]
    # GPU memory can be released safely after the offload.
    # Notice that the memory needs to be kept alive when GPU->CPU copy is performed.
    del self.finish_offload_event
self.fwd_gpu_tensor_group = TensorGroup()
```
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
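The post-condition the suggestion is after (a fresh `fwd_gpu_tensor_group` regardless of whether tensors were offloaded) can be checked with a toy model; this is illustrative, using a plain list in place of `TensorGroup` and a flag in place of the CUDA event.

```python
def release(state):
    """Suggested shape: wait/del only when tensors were offloaded,
    but reset fwd_gpu_tensor_group unconditionally afterwards."""
    if len(state["fwd_gpu_tensor_group"]) > 0:
        # stands in for current_stream().wait_event(finish_offload_event)
        state.pop("finish_offload_event", None)
    state["fwd_gpu_tensor_group"] = []  # fresh group on BOTH paths
    return state

empty = release({"fwd_gpu_tensor_group": []})
nonempty = release({"fwd_gpu_tensor_group": ["t"], "finish_offload_event": "e"})
print(empty["fwd_gpu_tensor_group"], nonempty["fwd_gpu_tensor_group"])  # → [] []
```

Either way both paths now leave the method in the same state, which is the consistency the comment asks for.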
/te-ci pytorch
Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com>
for more information, see https://pre-commit.ci Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com>
63ed892 to b71c0d5
Description
Please include a brief summary of the changes, relevant motivation and context.
Fixes # (issue)
Type of change
Changes
Please list the changes introduced in this PR:
Checklist: