Open
Details: test_checkpoint_sharding.py::TestCheckpointShard::test[bigscience/bloom-560m-fp16] silently hangs if an unsupported version of transformers (>4.43.4) is installed, because SystemExit is not handled properly in the pool worker. The fix is to raise RuntimeError instead of calling sys.exit. Signed-off-by: Artem Kuzmitckii <artem.kuzmitckii@amd.com>
PKUWZP (Collaborator) approved these changes on Mar 6, 2026 and left a comment:
Thanks for sending the fix. This PR fixes a test hang in test_checkpoint_sharding.py for the BLOOM 560m model. The root cause is that sys.exit() is called inside a multiprocessing pool worker when an incompatible transformers version is detected. sys.exit() raises SystemExit, which is a BaseException rather than an Exception, so multiprocessing.Pool does not capture it and send it back to the parent process. The worker process exits, the task result never arrives, and the parent waits on it indefinitely. The fix replaces sys.exit() with raise RuntimeError(...), which the pool catches in the worker and re-raises in the parent.
If @loadams and @hwchen2017 have no questions, I approve this fix.
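The behaviour described in the review comment can be reproduced outside the test suite. The following is a minimal sketch, not DeepSpeed's code: the worker functions and the `outcome` helper are hypothetical names, and the `fork` start method is an assumption (POSIX only) made to keep the example short. A timeout on `get()` stands in for the hang so the script terminates.

```python
import multiprocessing as mp
import sys


def exits_on_bad_version(_):
    # Stand-in for the old behaviour: the worker calls sys.exit().
    # SystemExit is not an Exception subclass, so the pool does not capture it.
    sys.exit("unsupported transformers version")


def raises_on_bad_version(_):
    # Stand-in for the fix: the worker raises an ordinary exception,
    # which the pool pickles and re-raises in the parent.
    raise RuntimeError("unsupported transformers version")


def outcome(worker, timeout=2.0):
    """Run `worker` in a one-process pool and report what the parent observes."""
    ctx = mp.get_context("fork")  # assumption: POSIX platform
    with ctx.Pool(processes=1) as pool:
        pending = pool.apply_async(worker, (None,))
        try:
            pending.get(timeout=timeout)
        except mp.TimeoutError:
            # Without the timeout, get() would block here indefinitely.
            return "timeout"
        except RuntimeError as exc:
            return f"RuntimeError: {exc}"
    return "ok"


if __name__ == "__main__":
    print(outcome(exits_on_bad_version))
    print(outcome(raises_on_bad_version))
```

Under these assumptions, the first call should report a timeout, because the worker dies without returning a result, while the second should surface the RuntimeError in the parent right away.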
test_checkpoint_sharding.py::TestCheckpointShard::test[bigscience/bloom-560m-fp16] silently hangs if an unsupported version of transformers (>4.43.4) is installed, because SystemExit is not handled properly in the pool worker. The fix is to raise RuntimeError instead of calling sys.exit.
Test:
Run
pytest tests/unit/inference/test_checkpoint_sharding.py -k 'bloom-560m-fp16'
Result: no hangs, throw exception: