[Engine] apiserver&engine exit when work failed to start #6322
+6
−0
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Motivation
当前启动时如果worker启动失败,apiserver&engine不会退出。具体原因是会一直在
while self.get_profile_block_num_signal.value[0] == 0:等待,但是因为worker已经挂了,永远拿不到If the worker fails to start during the current startup process, the apiserver&engine will not exit. The specific reason is that it will keep waiting in the loop
while self.get_profile_block_num_signal.value[0] == 0:, but since the worker has already crashed, it will never receive the signalModifications
增加判断worker进程是否存在
Add a check to determine whether the worker process exists
Usage or Command
no
Accuracy Tests
no
Checklist
[FDConfig],[APIServer],[Engine],[Scheduler],[PD Disaggregation],[Executor],[Graph Optimization],[Speculative Decoding],[RL],[Models],[Quantization],[Loader],[OP],[KVCache],[DataProcessor],[BugFix],[Docs],[CI],[Optimization],[Feature],[Benchmark],[Others],[XPU],[HPU],[GCU],[DCU],[Iluvatar],[Metax]]pre-commitbefore commit.releasebranch, make sure the PR has been submitted to thedevelopbranch, then cherry-pick it to thereleasebranch with the[Cherry-Pick]PR tag.