DAOS-18487 rebuild: disallow reint/extend when with DOWN targets#17462
liuxuezhao wants to merge 2 commits into master
src/pool/srv_pool.c
Outdated
     * not the PS leader of the specified term, this
     * rdb_resign call does nothing.
     */
    rdb_resign(svc->ps_rsvc.s_db, svc->ps_rsvc.s_term);
I am not sure we need to call rdb_resign here. It looks copied from pool_svc_update_map(); maybe @liw could confirm.
I'm not sure why reint and extend need to check the self-heal policy: The only automatic recovery (i.e., self-heal) is the SWIM-based exclusion case; all reint/extend are manual. Am I missing @liuxuezhao's point, I wonder?
If pool_rebuild is enabled and there is a DOWN tgt, REINT should be disallowed.
But if pool_rebuild is disabled, REINT cannot be disallowed even with a DOWN tgt. That is the difference, right?
If we don't check pool_rebuild, do you mean we should always allow, or always disallow, REINT when there is already a DOWN tgt?
Yes, I think the self-heal policy can't tell us anything about whether there is an ongoing rebuild task or not. Also, because reint/extend are manual, they are not managed by the self-heal policy. Hence, it sounds like we don't need to look at the self-heal policy here.
Example 1: The rebuild flags in both the system and pool self-heal properties are disabled, but the administrator invokes dmg pool exclude ... As a result, a rebuild task is in progress while self-heal rebuild is disabled.
Example 2: The rebuild flags in both the system and pool self-heal properties are enabled, an engine is excluded, but the administrator happens to disable one of the rebuild flags. Again, a rebuild task is in progress while self-heal rebuild is disabled.
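The two examples can be restated as a small C sketch. Everything in it is hypothetical: `pool_view`, `reint_ok_by_policy` and `reint_ok_by_state` are illustrative names, not DAOS code. It only shows that a check keyed on the self-heal flag would admit a reint in both examples even though a rebuild is running, while a check keyed on DOWN targets would not. It does not settle which behaviour is wanted.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified view of a pool; not a DAOS structure. */
struct pool_view {
    bool self_heal_rebuild; /* rebuild flag of the self-heal properties */
    bool has_down_tgt;      /* at least one target is DOWN */
    bool rebuild_running;   /* a rebuild task is in progress */
};

/* Check keyed on the policy: refuse only if rebuild is enabled and a target is DOWN. */
bool reint_ok_by_policy(const struct pool_view *p)
{
    return !(p->self_heal_rebuild && p->has_down_tgt);
}

/* Check keyed on the pool map state alone: refuse whenever a target is DOWN. */
bool reint_ok_by_state(const struct pool_view *p)
{
    return !p->has_down_tgt;
}

/* Examples 1 and 2 end in the same state: flag off, DOWN target, rebuild running. */
void check_examples(void)
{
    struct pool_view ex = {
        .self_heal_rebuild = false,
        .has_down_tgt      = true,
        .rebuild_running   = true,
    };

    assert(reint_ok_by_policy(&ex));  /* the policy check lets the reint through */
    assert(!reint_ok_by_state(&ex));  /* the state check refuses it */
}
```

Neither helper looks at `rebuild_running`, which is the gap the examples point at: the flag and the running task are independent pieces of state.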
So you tend to always disallow REINT if there is already a DOWN tgt, regardless of the pool_rebuild policy setting?
Then how would it work when the user disables pool_rebuild, so that SWIM exclusion does not trigger rebuild? The following REINT would not trigger rebuild either. That is not what we want, no?
Right, we do want the reint to happen. What I'm saying is only "it seems looking at self-heal policy is not the right way". Where is the fundamental issue? Is it about "has there been any rebuild attempt for the DOWN targets"?
OK, let's remove the pool_rebuild policy check for now; we can refine it later if needed. Thanks.
I can't see the full context of the code/conversation here due to the subsequent force push. For the case that a user has disabled rebuild in a pool's self_heal property, a SWIM exclusion occurs and does not trigger rebuild - then, if a reintegrate command is issued, with this patch it will not be permitted. So, if the administrator actually wants to reintegrate the targets:
- First, the pool self_heal property would need to be modified to enable rebuild, and the dmg system self-heal eval command would be issued. Those actions will allow the DOWN targets to be rebuilt to DOWN_OUT state.
- After this, another reintegrate command could be issued, which would then be permitted and a reintegration rebuild would run.
Do I have it right? If so, that sounds OK to me.
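A minimal sketch of that sequence as a state walk, assuming a deliberately simplified model: the enum, `rebuild_complete` and `reintegrate` are hypothetical names, and `TGT_UP` stands in for whatever state a reintegrated target finally reaches. It is not the patch itself.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical target states, named after the states mentioned above. */
enum tgt_state { TGT_UP, TGT_DOWN, TGT_DOWN_OUT };

/* Model of a finished exclusion rebuild: every DOWN target becomes DOWN_OUT. */
void rebuild_complete(enum tgt_state *tgts, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (tgts[i] == TGT_DOWN)
            tgts[i] = TGT_DOWN_OUT;
}

/*
 * Model of a reintegrate request. It is refused (returns false, map untouched)
 * while any target is still DOWN; otherwise DOWN_OUT targets are brought back.
 */
bool reintegrate(enum tgt_state *tgts, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (tgts[i] == TGT_DOWN)
            return false;

    for (size_t i = 0; i < n; i++)
        if (tgts[i] == TGT_DOWN_OUT)
            tgts[i] = TGT_UP;
    return true;
}
```

With one target left DOWN after a SWIM exclusion, the first `reintegrate` call is refused. After `rebuild_complete` (enabling rebuild and evaluating self-heal in the description above), the second call succeeds.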
If not in delay_rebuild mode disallow reint/extend if with DOWN targets, user should try later after rebuild done.
Features: rebuild
Signed-off-by: Xuezhao Liu <xuezhao.liu@hpe.com>
ad8b720 to 7257867
Features: rebuild
Signed-off-by: Xuezhao Liu <xuezhao.liu@hpe.com>
7257867 to 20cf39d
Test stage NLT on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17462/3/testReport/
Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17462/3/execution/node/1355/log
Test stage Functional Hardware Medium MD on SSD completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17462/3/testReport/
Test stage Functional Hardware Large MD on SSD completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17462/3/testReport/
When not in delay_rebuild mode, disallow reint/extend if the pool has DOWN targets; the user should retry after the rebuild is done.
Features: rebuild
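A rough sketch of the rule in the description, for illustration only. `check_reint_extend`, `ERR_TRY_LATER` and the parameters are hypothetical and do not reflect the actual change in src/pool/srv_pool.c or any real DAOS error code.

```c
#include <stdbool.h>

/* Hypothetical return code meaning "retry after the rebuild is done". */
#define ERR_TRY_LATER (-1)

/*
 * Returns 0 if a reint/extend request may proceed.
 * In delay_rebuild mode the DOWN-target check is skipped entirely;
 * otherwise any DOWN target causes the request to be refused.
 */
int check_reint_extend(bool delay_rebuild, unsigned int down_tgts)
{
    if (delay_rebuild)
        return 0;
    return down_tgts > 0 ? ERR_TRY_LATER : 0;
}
```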