OCPEDGE-2393: Revert "NO-JIRA: [TNF] disable etcd member management after etcd handover"#1570
OCPEDGE-2393: Revert "NO-JIRA: [TNF] disable etcd member management after etcd handover"#1570fonta-rh wants to merge 1 commit intoopenshift:mainfrom
Conversation
|
@fonta-rh: This pull request explicitly references no jira issue. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
WalkthroughThis PR removes the dual-replica topology check infrastructure from cluster member management. It eliminates the Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes 🚥 Pre-merge checks | ✅ 6 | ❌ 2❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (6 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
📝 Coding Plan
Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 golangci-lint (2.11.3)Error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/docs/product/migration-guide for migration instructions Comment Tip You can disable sequence diagrams in the walkthrough.Disable the |
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
pkg/operator/clustermemberremovalcontroller/clustermemberremovalcontroller.go (1)
91-96:⚠️ Potential issue | 🔴 CriticalKeep the TNF handover ownership guard in the removal path.
PR
#1534explicitly moved DualReplica post-handover member add/remove under Pacemaker, and this revert removes both the topology signal and the early return from this controller. After this change,sync()can call the member-removal paths again in the same state Pacemaker is supposed to own, which reintroduces a dual-writer/quorum-risking path on etcd membership. Please restore an explicit handover check here unless the Pacemaker-side ownership was removed in the same change. (github.com)Also applies to: 105-171
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/operator/clustermemberremovalcontroller/clustermemberremovalcontroller.go` around lines 91 - 96, The TNF handover ownership guard that prevented this controller from performing DualReplica post-handover member add/remove was removed; restore an explicit ownership check so the removal path is skipped when Pacemaker owns handover. Add the handover/topology check back into the clustermemberremovalcontroller (e.g., at the start of syncer.Sync or in the controller setup that uses factory.New()/WithSync) to return early when the TNF/pacemaker handover flag indicates Pacemaker ownership (reinstating the previous early-return and topology signal guard used around DualReplica member removal), so the controller does not execute member-removal code while Pacemaker is the owner.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/operator/clustermembercontroller/clustermembercontroller.go`:
- Around line 93-95: The change removed the infra-based guard/informer handling
and allows reconcileMembers() to add/promote learners unconditionally, which can
race Pacemaker after DualReplica handover; restore the previous guard/informer
behavior in the controller builder (the
factory.New().ResyncEvery(...).WithBareInformers(networkInformer.Informer()).WithInformers(informers...)
chain) and add an explicit ownership check inside reconcileMembers() that skips
learner add/promote operations when DualReplica external-etcd handover is
complete or when Pacemaker owns membership changes (i.e., gate the code paths
that perform learner add/promote with a check for the handover/ownership state
before making membership changes).
---
Outside diff comments:
In
`@pkg/operator/clustermemberremovalcontroller/clustermemberremovalcontroller.go`:
- Around line 91-96: The TNF handover ownership guard that prevented this
controller from performing DualReplica post-handover member add/remove was
removed; restore an explicit ownership check so the removal path is skipped when
Pacemaker owns handover. Add the handover/topology check back into the
clustermemberremovalcontroller (e.g., at the start of syncer.Sync or in the
controller setup that uses factory.New()/WithSync) to return early when the
TNF/pacemaker handover flag indicates Pacemaker ownership (reinstating the
previous early-return and topology signal guard used around DualReplica member
removal), so the controller does not execute member-removal code while Pacemaker
is the owner.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 684872d1-13d7-4023-8554-0f21ac2958ba
📒 Files selected for processing (5)
pkg/operator/ceohelpers/external_etcd_status.gopkg/operator/ceohelpers/external_etcd_status_test.gopkg/operator/clustermembercontroller/clustermembercontroller.gopkg/operator/clustermemberremovalcontroller/clustermemberremovalcontroller.gopkg/operator/starter.go
💤 Files with no reviewable changes (3)
- pkg/operator/ceohelpers/external_etcd_status.go
- pkg/operator/starter.go
- pkg/operator/ceohelpers/external_etcd_status_test.go
| return factory.New().ResyncEvery(time.Minute). | ||
| WithBareInformers(networkInformer.Informer(), infrastructureInformer.Informer()). | ||
| WithBareInformers(networkInformer.Informer()). | ||
| WithInformers(informers...). |
There was a problem hiding this comment.
Do not re-enable CEO learner add/promote after DualReplica handover without coordination.
According to PR #1534, once DualReplica external-etcd handover completes, Pacemaker owns membership changes. Removing the infra-based guard and informer here makes reconcileMembers() add/promote learners unconditionally again, with no replacement ownership check in this file. That can race Pacemaker on the same etcd membership state. (github.com)
Also applies to: 105-135
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pkg/operator/clustermembercontroller/clustermembercontroller.go` around lines
93 - 95, The change removed the infra-based guard/informer handling and allows
reconcileMembers() to add/promote learners unconditionally, which can race
Pacemaker after DualReplica handover; restore the previous guard/informer
behavior in the controller builder (the
factory.New().ResyncEvery(...).WithBareInformers(networkInformer.Informer()).WithInformers(informers...)
chain) and add an explicit ownership check inside reconcileMembers() that skips
learner add/promote operations when DualReplica external-etcd handover is
complete or when Pacemaker owns membership changes (i.e., gate the code paths
that perform learner add/promote with a check for the handover/ownership state
before making membership changes).
|
/lgtm |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: dusk125, fonta-rh The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
/retest-required |
|
@fonta-rh: all tests passed! Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
|
@fonta-rh: An error was encountered searching for bug OCPEDGE-2393 on the Jira server at https://issues.redhat.com. No known errors were detected, please see the full error message for details. Full error message.
request failed. Please analyze the request body for more details. Status code: 403:
Please contact an administrator to resolve this issue, then request a bug refresh with DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
/jira refresh |
|
@fonta-rh: An error was encountered searching for bug OCPEDGE-2393 on the Jira server at https://issues.redhat.com. No known errors were detected, please see the full error message for details. Full error message.
request failed. Please analyze the request body for more details. Status code: 403:
Please contact an administrator to resolve this issue, then request a bug refresh with DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
@fonta-rh: The referenced Jira(s) [OCPEDGE-2393] could not be located, all automatically applied jira labels will be removed. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
/hold After testing, this doesn't improve reliability on IPv4, which was the primary reason for exploring this revert. |
Summary
This reverts PR #1534 (merge commit b3d205a), which disabled etcd member management (add/remove) in DualReplica (TNF) clusters after the external etcd transition completed.
ShouldSkipMemberManagementForDualReplicahelper fromceohelpersClusterMemberControllerandClusterMemberRemovalControllerinfrastructureInformerwiring from both controllers andstarter.goReverts #1534
Summary by CodeRabbit
Refactoring