DAOS-18487 rebuild: disallow reint/extend when with DOWN targets#17462

Open
liuxuezhao wants to merge 2 commits into master from lxz/reint_check_allow

Conversation

@liuxuezhao liuxuezhao commented Jan 27, 2026

When not in delay_rebuild mode, disallow reint/extend if the pool has DOWN targets; the user should retry after the rebuild is done.

Features: rebuild
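
The rule described above can be pictured with a minimal sketch. It is not the code in this patch: the `tgt_state` enum, the return codes, and both function names are hypothetical, and delay_rebuild mode is assumed to reduce to a single boolean.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical target states; illustrative only, not the DAOS enum. */
enum tgt_state { TGT_UPIN, TGT_DOWN, TGT_DOWNOUT };

/* Hypothetical return codes for this sketch. */
enum { REINT_OK = 0, REINT_RETRY_LATER = -1 };

/* Count targets that are still DOWN, i.e. not yet rebuilt to DOWN_OUT. */
static size_t
count_down_targets(const enum tgt_state *tgts, size_t nr)
{
    size_t i;
    size_t down = 0;

    assert(tgts != NULL || nr == 0);
    for (i = 0; i < nr; i++) {
        if (tgts[i] == TGT_DOWN)
            down++;
    }
    return down;
}

/*
 * Reject reint/extend while any target is DOWN, unless the pool is in
 * delay_rebuild mode (assumed here to be a plain boolean flag).
 */
static int
check_reint_allowed(const enum tgt_state *tgts, size_t nr, bool delay_rebuild)
{
    if (delay_rebuild)
        return REINT_OK;
    return count_down_targets(tgts, nr) > 0 ? REINT_RETRY_LATER : REINT_OK;
}
```

In this sketch the decision depends only on target states and the delay_rebuild flag, so a caller that gets `REINT_RETRY_LATER` simply reissues the request once no target is DOWN.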

Steps for the author:

  • Commit message follows the guidelines.
  • Appropriate Features or Test-tag pragmas were used.
  • Appropriate Functional Test Stages were run.
  • At least two positive code reviews including at least one code owner from each category referenced in the PR.
  • Testing is complete. If necessary, forced-landing label added and a reason added in a comment.

After all prior steps are complete:

  • Gatekeeper requested (daos-gatekeeper added as a reviewer).

@liuxuezhao liuxuezhao requested review from a team as code owners January 27, 2026 07:37
@github-actions

Errors are Unable to load ticket data
https://daosio.atlassian.net/browse/DAOS-18487

* not the PS leader of the specified term, this
* rdb_resign call does nothing.
*/
rdb_resign(svc->ps_rsvc.s_db, svc->ps_rsvc.s_term);
Contributor

I am not sure we need to call rdb_resign here. It looks copied from pool_svc_update_map(); maybe @liw could confirm.

Contributor

I'm not sure why reint and extend need to check the self-heal policy: The only automatic recovery (i.e., self-heal) is the SWIM-based exclusion case; all reint/extend are manual. Am I missing @liuxuezhao's point, I wonder?

Contributor Author

@liuxuezhao liuxuezhao Jan 27, 2026

If pool_rebuild is enabled and there are DOWN targets, REINT should be disallowed.
But if pool_rebuild is disabled, REINT cannot be disallowed even with DOWN targets. That is the difference, right?

If we don't check pool_rebuild, do you mean we should always allow, or always disallow, REINT when there are already DOWN targets?

Contributor

Yes, I think the self-heal policy can't tell us anything about whether there is an ongoing rebuild task or not. Also, because reint/extend are manual, they are not managed by the self-heal policy. Hence, it sounds like we don't need to look at the self-heal policy here.

Contributor

@liw liw Jan 27, 2026

Example 1: The rebuild flags in both the system and pool self-heal properties are disabled, but the administrator invokes dmg pool exclude ...; a rebuild task is then in progress while self-heal rebuild is disabled.

Example 2: The rebuild flags in both the system and pool self-heal properties are enabled and an engine is excluded, but the administrator then happens to disable one of the rebuild flags; again, a rebuild task is in progress while self-heal rebuild is disabled.

Contributor Author

So you lean toward always disallowing REINT if there are already DOWN targets, regardless of the pool_rebuild policy setting?

Then how would it work when the user disables pool_rebuild, so that SWIM exclusion does not trigger a rebuild? A subsequent REINT would not trigger a rebuild either. That is not what we want, is it?

Contributor

Right, we do want the reint to happen. All I'm saying is that looking at the self-heal policy does not seem to be the right way. What is the fundamental issue? Is it "has there been any rebuild attempt for the DOWN targets?"

Contributor Author

OK, let's remove the pool_rebuild policy check for now; we can refine it later if needed. Thanks.

Contributor

I can't see the full context of the code/conversation here due to the subsequent force push. Consider the case where a user has disabled rebuild in a pool's self_heal property and a SWIM exclusion occurs without triggering a rebuild: if a reintegrate command is then issued, this patch will not permit it. So, if the administrator actually wants to reintegrate the targets:

  • First, the pool self_heal property would need to be modified to enable rebuild, and the dmg system self-heal eval command would be issued. Those actions allow the DOWN targets to be rebuilt to the DOWN_OUT state.
  • After this, another reintegrate command could be issued, which would then be permitted, and a reintegration rebuild would run.

Do I have it right? If so, that sounds OK to me.
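
The sequence in this comment can be walked through with a toy model. This is a sketch under stated assumptions, not DAOS code: the enum and all three function names are hypothetical, the exclude rebuild is modeled as a single step that moves every DOWN target to DOWN_OUT, and an accepted reintegration is simplified to return DOWN_OUT targets straight to an in-service state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical target states; illustrative only, not the DAOS enum. */
enum tgt_state { TGT_UPIN, TGT_DOWN, TGT_DOWNOUT };

/* True if at least one target is still DOWN. */
static bool
has_down_target(const enum tgt_state *tgts, size_t nr)
{
    size_t i;

    assert(tgts != NULL || nr == 0);
    for (i = 0; i < nr; i++) {
        if (tgts[i] == TGT_DOWN)
            return true;
    }
    return false;
}

/* Model a finished exclude rebuild: DOWN becomes DOWN_OUT. Returns the count moved. */
static size_t
complete_exclude_rebuild(enum tgt_state *tgts, size_t nr)
{
    size_t i;
    size_t moved = 0;

    assert(tgts != NULL || nr == 0);
    for (i = 0; i < nr; i++) {
        if (tgts[i] == TGT_DOWN) {
            tgts[i] = TGT_DOWNOUT;
            moved++;
        }
    }
    return moved;
}

/*
 * Model a reintegrate request: refused while any target is DOWN; otherwise
 * DOWN_OUT targets are brought back (simplified to TGT_UPIN in one step).
 */
static bool
try_reintegrate(enum tgt_state *tgts, size_t nr)
{
    size_t i;

    if (has_down_target(tgts, nr))
        return false;
    for (i = 0; i < nr; i++) {
        if (tgts[i] == TGT_DOWNOUT)
            tgts[i] = TGT_UPIN;
    }
    return true;
}
```

With one DOWN target, a first `try_reintegrate` call is refused and leaves the states untouched; after `complete_exclude_rebuild` the same call is accepted, which mirrors the two-step procedure described in the list.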

wangshilong previously approved these changes Jan 27, 2026
@wangshilong wangshilong requested a review from kccain January 27, 2026 08:01
If not in delay_rebuild mode disallow reint/extend if with DOWN targets,
user should try later after rebuild done.

Features: rebuild

Signed-off-by: Xuezhao Liu <xuezhao.liu@hpe.com>
Features: rebuild

Signed-off-by: Xuezhao Liu <xuezhao.liu@hpe.com>
@liuxuezhao liuxuezhao force-pushed the lxz/reint_check_allow branch from 7257867 to 20cf39d Compare January 27, 2026 09:12
@daosbuild3
Collaborator

Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17462/3/execution/node/1355/log

@daosbuild3
Collaborator

Test stage Functional Hardware Medium MD on SSD completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17462/3/testReport/

@daosbuild3
Collaborator

Test stage Functional Hardware Large MD on SSD completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17462/3/testReport/
