DAOS-18238 chk: handle CRT_EVS_GRPMOD event from CaRT PG#17459
Ticket title is 'recovery/cat_recov_core.py:CatRecovCoreTest.test_daos_cat_recov_core - CR20-28 failed - 1 rank adminexcluded, others checkerstarted'
Test stage NLT on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17459/1/testReport/
Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17459/1/testReport/
Test stage Functional Hardware Medium MD on SSD completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17459/1/testReport/
To guarantee that a rank death event is never missed, the CR logic needs to handle the event from both SWIM and the CaRT PG, even though this produces many redundant event callbacks.
Test-tag: recovery
Signed-off-by: Fan Yong <fan.yong@hpe.com>
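Since the same death event can now arrive once per source, the handler has to tolerate duplicates. The following is only a minimal sketch of that idea, not the code in this PR: the enum names, `struct dead_set`, `MAX_RANKS` and `handle_event()` are all hypothetical, and a 64-bit bitmap stands in for whatever bookkeeping the checker really uses.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stddef.h>

/* Hypothetical stand-ins for the CaRT event source and event type. */
enum evt_src  { SRC_SWIM, SRC_GRPMOD };
enum evt_type { EVT_ALIVE, EVT_DEAD };

#define MAX_RANKS 64

/* Hypothetical record of ranks already reported dead, one bit per rank. */
struct dead_set {
	uint64_t bits;
};

/*
 * Accept a death event from either source. Returns true only the first
 * time a given rank is reported dead; later reports for the same rank
 * (typically from the other source) are treated as redundant callbacks.
 */
static bool
handle_event(struct dead_set *ds, uint32_t rank, enum evt_src src,
	     enum evt_type type)
{
	uint64_t mask;

	assert(ds != NULL);
	(void)src; /* both sources are accepted on purpose */

	if (type != EVT_DEAD || rank >= MAX_RANKS)
		return false;

	mask = UINT64_C(1) << rank;
	if (ds->bits & mask)
		return false; /* duplicate: already recorded */

	ds->bits |= mask;
	return true;
}
```

With this shape, a death reported by SWIM and then again by the group-membership source triggers the real handling once; the second callback is the "useless" one the description mentions.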
e5f878f to 8104c06
	D_GOTO(out, rc = -DER_NOMEM);

	cdr->cdr_rank = rank;
} else if (d_list_empty(&ins->ci_dead_ranks)) {
[Question] Is ci_dead_ranks protected by ci_abt_mutex?
In theory, yes. But this check only cares whether a CRT_EVT_DEAD event for the rank has ever been added to ins->ci_dead_ranks, which would have happened before the current CRT_EVT_ALIVE event. So if the list is now empty, the related CRT_EVT_DEAD event has either been handled already or is being handled. In both cases the current CRT_EVT_ALIVE event is useless and can be ignored. Racing with events for other ranks does not matter. So it is still OK to check whether the list is empty without taking ci_abt_mutex. On the other hand, if the list is not empty, the subsequent logic takes ci_abt_mutex and looks for the earlier CRT_EVT_DEAD event for the rank, which serializes insertion into and removal from the list.
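The pattern described in this reply (cheap unlocked emptiness check, then a locked search only when the list is non-empty) can be sketched as below. This is an illustration under assumptions, not the PR's code: `struct instance`, `struct dead_rank`, `on_rank_dead()` and `on_rank_alive()` are hypothetical, a pthread mutex stands in for ci_abt_mutex, a singly linked list stands in for ci_dead_ranks, and dropping the matching record on an alive event is assumed behaviour. Because plain pthreads code cannot read the list head unlocked without a data race, the sketch keeps an atomic counter for the fast path.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical record of one pending "rank dead" event. */
struct dead_rank {
	uint32_t          rank;
	struct dead_rank *next;
};

/* Hypothetical stand-in for the checker instance. */
struct instance {
	pthread_mutex_t   lock;   /* plays the role of ci_abt_mutex */
	struct dead_rank *dead;   /* plays the role of ci_dead_ranks */
	atomic_uint       n_dead; /* list length, readable without the lock */
};

/* Queue a dead event; the list is only modified under the lock. */
static int
on_rank_dead(struct instance *ins, uint32_t rank)
{
	struct dead_rank *dr;

	assert(ins != NULL);
	dr = malloc(sizeof(*dr));
	if (dr == NULL)
		return -1;
	dr->rank = rank;

	pthread_mutex_lock(&ins->lock);
	dr->next  = ins->dead;
	ins->dead = dr;
	atomic_fetch_add(&ins->n_dead, 1);
	pthread_mutex_unlock(&ins->lock);
	return 0;
}

/* Handle an alive event; returns true if a pending dead record was dropped. */
static bool
on_rank_alive(struct instance *ins, uint32_t rank)
{
	struct dead_rank **pp;
	bool               found = false;

	assert(ins != NULL);

	/* Fast path without the lock: nothing pending, so nothing to cancel. */
	if (atomic_load(&ins->n_dead) == 0)
		return false;

	/* Slow path: the search and removal are serialized by the lock. */
	pthread_mutex_lock(&ins->lock);
	for (pp = &ins->dead; *pp != NULL; pp = &(*pp)->next) {
		if ((*pp)->rank == rank) {
			struct dead_rank *victim = *pp;

			*pp = victim->next;
			free(victim);
			atomic_fetch_sub(&ins->n_dead, 1);
			found = true;
			break;
		}
	}
	pthread_mutex_unlock(&ins->lock);
	return found;
}
```

A stale "empty" answer on the fast path is harmless for the reason given above: if a dead event for the same rank had been queued earlier and is no longer visible, it has already been picked up for handling.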
Ping reviewers @jgmoore-or @gnailzenh, thanks!
To guarantee that a rank death event is never missed, the CR logic needs to handle the event from both SWIM and the CaRT PG, even though this produces many redundant event callbacks.
Test-tag: recovery