DAOS-18238 chk: handle CRT_EVS_GRPMOD event from CaRT PG#17459

Merged
gnailzenh merged 1 commit into master from Nasf-Fan/DAOS-18238_1
Feb 1, 2026

@Nasf-Fan Nasf-Fan commented Jan 27, 2026

To guarantee that a rank death event is never missed, the related CR logic needs to handle the event from both SWIM and the CaRT PG, even though this results in many redundant event callbacks.

Test-tag: recovery

Steps for the author:

  • Commit message follows the guidelines.
  • Appropriate Features or Test-tag pragmas were used.
  • Appropriate Functional Test Stages were run.
  • At least two positive code reviews including at least one code owner from each category referenced in the PR.
  • Testing is complete. If necessary, forced-landing label added and a reason added in a comment.

After all prior steps are complete:

  • Gatekeeper requested (daos-gatekeeper added as a reviewer).

github-actions bot commented Jan 27, 2026

Ticket title is 'recovery/cat_recov_core.py:CatRecovCoreTest.test_daos_cat_recov_core - CR20-28 failed - 1 rank adminexcluded, others checkerstarted'
Status is 'In Review'
Labels: '2.6.4-aurora.p1,2.8.0tb1,ci_master_daily,ci_master_provider,daily_test'
https://daosio.atlassian.net/browse/DAOS-18238

@daosbuild3
Collaborator

Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17459/1/testReport/

@daosbuild3
Collaborator

Test stage Functional Hardware Medium MD on SSD completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17459/1/testReport/

To guarantee that a rank death event is never missed, the related
CR logic needs to handle the event from both SWIM and the CaRT PG,
even though this results in many redundant event callbacks.

Test-tag: recovery

Signed-off-by: Fan Yong <fan.yong@hpe.com>
@Nasf-Fan Nasf-Fan force-pushed the Nasf-Fan/DAOS-18238_1 branch from e5f878f to 8104c06 January 28, 2026 04:44
@Nasf-Fan Nasf-Fan marked this pull request as ready for review January 29, 2026 02:02
D_GOTO(out, rc = -DER_NOMEM);

cdr->cdr_rank = rank;
} else if (d_list_empty(&ins->ci_dead_ranks)) {
Contributor

[Question] Is ci_dead_ranks protected by ci_abt_mutex?

Contributor Author

@Nasf-Fan Nasf-Fan Jan 29, 2026

In theory, yes. But this check only cares whether a CRT_EVT_DEAD event for the rank has ever been added to ins->ci_dead_ranks, which would have happened before the current CRT_EVT_ALIVE event. So if the list is now empty, the related CRT_EVT_DEAD event has either already been handled or is being handled. In both cases the current CRT_EVT_ALIVE event is useless and can be ignored. Racing with events for other ranks does not matter. So checking whether the list is empty without taking ci_abt_mutex is still OK. On the other hand, if the list is not empty, the subsequent logic takes ci_abt_mutex and searches for the earlier CRT_EVT_DEAD event for the related rank, which serializes inserting events into and removing them from the list.
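
The pattern described above (an unlocked emptiness check as a fast path, followed by a locked search only when the list is non-empty) can be sketched as follows. This is an illustrative approximation, not the code from the patch: `struct instance`, `record_dead`, and `handle_alive` are hypothetical names, a pthread mutex stands in for ci_abt_mutex, and a singly linked list with an atomic head stands in for the ci_dead_ranks d_list. The atomic head is an assumption made so the unlocked read is well defined in portable C11. What the real code does with a matched entry is not shown in this thread; the sketch simply removes it.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical pending-death record (stand-in for the cdr entries). */
struct dead_rank {
	uint32_t          rank;
	struct dead_rank *next;
};

/* Hypothetical instance: lock stands in for ci_abt_mutex, dead for ci_dead_ranks. */
struct instance {
	pthread_mutex_t             lock;
	_Atomic(struct dead_rank *) dead;
};

void
instance_init(struct instance *ins)
{
	assert(ins != NULL);
	pthread_mutex_init(&ins->lock, NULL);
	atomic_init(&ins->dead, NULL);
}

/* DEAD path: always takes the lock to insert. Returns 0, or -1 on allocation failure. */
int
record_dead(struct instance *ins, uint32_t rank)
{
	struct dead_rank *cdr = malloc(sizeof(*cdr));

	if (cdr == NULL)
		return -1;
	cdr->rank = rank;

	pthread_mutex_lock(&ins->lock);
	cdr->next = atomic_load(&ins->dead);
	atomic_store(&ins->dead, cdr);
	pthread_mutex_unlock(&ins->lock);
	return 0;
}

/*
 * ALIVE path: returns true if an earlier pending death for this rank was
 * found and removed, false if the ALIVE event can simply be ignored.
 */
bool
handle_alive(struct instance *ins, uint32_t rank)
{
	struct dead_rank *cur;
	struct dead_rank *prev = NULL;
	bool              found;

	/* Fast path without the lock: an empty list means nothing is pending. */
	if (atomic_load(&ins->dead) == NULL)
		return false;

	/* Slow path: the search and removal are serialized by the lock. */
	pthread_mutex_lock(&ins->lock);
	for (cur = atomic_load(&ins->dead); cur != NULL; prev = cur, cur = cur->next) {
		if (cur->rank != rank)
			continue;
		if (prev == NULL)
			atomic_store(&ins->dead, cur->next);
		else
			prev->next = cur->next;
		break;
	}
	pthread_mutex_unlock(&ins->lock);

	found = (cur != NULL);
	free(cur); /* free(NULL) is a no-op when nothing matched */
	return found;
}
```

The unlocked check can only produce a harmless answer under the argument given in the reply: a stale "empty" means the matching death was already consumed, and a stale "non-empty" just falls through to the locked search, which decides authoritatively.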

@Nasf-Fan Nasf-Fan requested a review from liw January 30, 2026 02:46
@Nasf-Fan
Contributor Author

Ping reviewers @jgmoore-or @gnailzenh , thanks!

@gnailzenh gnailzenh merged commit bdfdf73 into master Feb 1, 2026
41 checks passed
@gnailzenh gnailzenh deleted the Nasf-Fan/DAOS-18238_1 branch February 1, 2026 12:26