
[Bloom] Fix hangs of bloom test#7890

Open
k-artem wants to merge 2 commits into deepspeedai:master from k-artem:fix_bloom_test_hangs

Conversation

@k-artem
Contributor

@k-artem k-artem commented Mar 6, 2026

test_checkpoint_sharding.py::TestCheckpointShard::test[bigscience/bloom-560m-fp16] silently hangs if an incorrect version of transformers (>4.43.4) is installed, because SystemExit is not handled properly in the pool worker.

The fix is to throw RuntimeError instead of calling sys.exit.

  • Test:
    Run pytest tests/unit/inference/test_checkpoint_sharding.py -k 'bloom-560m-fp16'

  • Result: no hang; an exception is raised instead:

E           RuntimeError: Transformers version 4.57.3 exceeds version 4.43.4! After transformers version 4.43.4, BLOOM inference with DeepSpeed is no longer supported.

/usr/lib/python3.12/multiprocessing/pool.py:774: RuntimeError
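The fix pattern can be sketched as follows. This is a minimal standalone sketch, not the actual DeepSpeed code: `check_transformers_version` and `MAX_SUPPORTED_TRANSFORMERS` are hypothetical stand-ins for the real version check.

```python
import multiprocessing

# Hypothetical stand-in for the last supported transformers version.
MAX_SUPPORTED_TRANSFORMERS = (4, 43, 4)


def check_transformers_version(version_string):
    """Raise RuntimeError (not sys.exit) so multiprocessing.Pool can
    pickle the exception and re-raise it in the parent process."""
    installed = tuple(int(part) for part in version_string.split("."))
    if installed > MAX_SUPPORTED_TRANSFORMERS:
        raise RuntimeError(
            f"Transformers version {version_string} exceeds version 4.43.4! "
            "After transformers version 4.43.4, BLOOM inference with "
            "DeepSpeed is no longer supported."
        )
    return True


if __name__ == "__main__":
    with multiprocessing.Pool(processes=1) as pool:
        async_result = pool.apply_async(check_transformers_version, ("4.57.3",))
        try:
            # With sys.exit() in the worker, SystemExit would kill the
            # worker and this .get() could block forever; a RuntimeError
            # is pickled and re-raised here in the parent instead.
            async_result.get(timeout=60)
        except RuntimeError as err:
            print(f"caught in parent: {err}")
```

The key point is that `multiprocessing.Pool` workers catch `Exception` and ship it back to the parent, whereas `SystemExit` derives from `BaseException` and escapes that handler, leaving the parent waiting on a result that never arrives.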

Details:
test_checkpoint_sharding.py::TestCheckpointShard::test[bigscience/bloom-560m-fp16]
silently hangs if an incorrect version of transformers (>4.43.4) is installed,
because SystemExit is not handled properly in the pool worker.
The fix is to throw RuntimeError.

Signed-off-by: Artem Kuzmitckii <artem.kuzmitckii@amd.com>
@PKUWZP PKUWZP self-requested a review March 6, 2026 21:08
Collaborator

@PKUWZP PKUWZP left a comment


Thanks for sending the fix. This PR fixes a test hang in test_checkpoint_sharding.py for the BLOOM 560m model. The root cause is that sys.exit() is called inside a multiprocessing pool worker when an incompatible transformers version is detected. sys.exit() raises SystemExit, which multiprocessing.Pool does not propagate back to the parent process cleanly — it silently hangs instead. The fix replaces sys.exit() with raise RuntimeError(...), which is properly caught and re-raised by the pool.

If @loadams and @hwchen2017 do not have questions I approve this PR fix.


3 participants