OCPEDGE-2393: Revert "NO-JIRA: [TNF] disable etcd member management after etcd handover" by fonta-rh · Pull Request #1570 · openshift/cluster-etcd-operator

fonta-rh · 2026-03-13T13:18:29Z

Summary

This reverts PR #1534 (merge commit b3d205a), which disabled etcd member management (add/remove) in DualReplica (TNF) clusters after the external etcd transition completed.

Removes ShouldSkipMemberManagementForDualReplica helper from ceohelpers
Removes DualReplica skip guards from ClusterMemberController and ClusterMemberRemovalController
Removes infrastructureInformer wiring from both controllers and starter.go

Reverts #1534

Summary by CodeRabbit

Refactoring

Removed dual-replica topology checks from cluster member management workflows, simplifying the member reconciliation logic
Eliminated conditional skip behavior previously used to gate member management based on infrastructure topology status
Cleaned up infrastructure observer dependencies from cluster member management and removal controllers

…isable-etcd-member-promotion-in-tnf" This reverts commit b3d205a, reversing changes made to 3f63dbe.

openshift-ci-robot · 2026-03-13T13:18:33Z

@fonta-rh: This pull request explicitly references no jira issue.

Details

In response to this:

Summary

This reverts PR #1534 (merge commit b3d205a), which disabled etcd member management (add/remove) in DualReplica (TNF) clusters after the external etcd transition completed.

Removes ShouldSkipMemberManagementForDualReplica helper from ceohelpers

Removes DualReplica skip guards from ClusterMemberController and ClusterMemberRemovalController

Removes infrastructureInformer wiring from both controllers and starter.go

Reverts #1534

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

coderabbitai · 2026-03-13T13:18:48Z

Walkthrough

This PR removes the dual-replica topology check infrastructure from cluster member management. It eliminates the ShouldSkipMemberManagementForDualReplica function, removes infrastructure lister dependencies from multiple controllers, and simplifies the member reconciliation flow by removing conditional member management gating based on topology.

Changes

Cohort / File(s)	Summary
External ETCD Status Helpers `pkg/operator/ceohelpers/external_etcd_status.go`, `pkg/operator/ceohelpers/external_etcd_status_test.go`	Removed `ShouldSkipMemberManagementForDualReplica` function definition and its complete test suite, including related imports and test infrastructure.
Cluster Member Controllers `pkg/operator/clustermembercontroller/clustermembercontroller.go`, `pkg/operator/clustermemberremovalcontroller/clustermemberremovalcontroller.go`	Removed infrastructure lister field and informer parameter from both controllers. Eliminated dual-replica skip logic from reconciliation paths; removed conditional member management gating that previously checked topology status.
Operator Starter `pkg/operator/starter.go`	Removed Infrastructure informer usage from envVarController and NewQuorumChecker initialization paths.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 6 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.
Test Structure And Quality	❓ Inconclusive	The custom check requests review of Ginkgo test code quality, but this codebase uses standard Go testing package with table-driven tests, not Ginkgo.	Clarify whether the check should apply to standard Go testing patterns or confirm Ginkgo-specific review is required. The current tests follow existing codebase patterns.

✅ Passed checks (6 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately describes the main change: reverting PR `#1534` which disabled etcd member management in DualReplica clusters. This is directly supported by the file summaries showing removal of ShouldSkipMemberManagementForDualReplica logic and related infrastructure checks.
Stable And Deterministic Test Names	✅ Passed	The custom check validates Ginkgo test names for stability; however, this repository uses standard Go testing framework, not Ginkgo, making this check not applicable.
Microshift Test Compatibility	✅ Passed	This PR is a revert that removes previously added code and does not introduce any new Ginkgo e2e tests.
Single Node Openshift (Sno) Test Compatibility	✅ Passed	PR is a revert that removes code and unit tests, not adding any new Ginkgo e2e tests.
Ipv6 And Disconnected Network Test Compatibility	✅ Passed	This pull request is a revert that removes previously added code and does not add any new Ginkgo e2e tests.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment

📝 Coding Plan

Generate coding plan for human review comments

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.11.3)

Error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/docs/product/migration-guide for migration instructions
The command is terminated due to an error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/docs/product/migration-guide for migration instructions

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

Tip

You can disable sequence diagrams in the walkthrough.

Disable the reviews.sequence_diagrams setting to disable sequence diagrams in the walkthrough.

coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

pkg/operator/clustermemberremovalcontroller/clustermemberremovalcontroller.go (1)
91-96: ⚠️ Potential issue | 🔴 Critical

Keep the TNF handover ownership guard in the removal path.

PR #1534 explicitly moved DualReplica post-handover member add/remove under Pacemaker, and this revert removes both the topology signal and the early return from this controller. After this change, sync() can call the member-removal paths again in the same state Pacemaker is supposed to own, which reintroduces a dual-writer/quorum-risking path on etcd membership. Please restore an explicit handover check here unless the Pacemaker-side ownership was removed in the same change. (github.com)

Also applies to: 105-171
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@pkg/operator/clustermemberremovalcontroller/clustermemberremovalcontroller.go`
around lines 91 - 96, The TNF handover ownership guard that prevented this
controller from performing DualReplica post-handover member add/remove was
removed; restore an explicit ownership check so the removal path is skipped when
Pacemaker owns handover. Add the handover/topology check back into the
clustermemberremovalcontroller (e.g., at the start of syncer.Sync or in the
controller setup that uses factory.New()/WithSync) to return early when the
TNF/pacemaker handover flag indicates Pacemaker ownership (reinstating the
previous early-return and topology signal guard used around DualReplica member
removal), so the controller does not execute member-removal code while Pacemaker
is the owner.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/operator/clustermembercontroller/clustermembercontroller.go`:
- Around line 93-95: The change removed the infra-based guard/informer handling
and allows reconcileMembers() to add/promote learners unconditionally, which can
race Pacemaker after DualReplica handover; restore the previous guard/informer
behavior in the controller builder (the
factory.New().ResyncEvery(...).WithBareInformers(networkInformer.Informer()).WithInformers(informers...)
chain) and add an explicit ownership check inside reconcileMembers() that skips
learner add/promote operations when DualReplica external-etcd handover is
complete or when Pacemaker owns membership changes (i.e., gate the code paths
that perform learner add/promote with a check for the handover/ownership state
before making membership changes).

---

Outside diff comments:
In
`@pkg/operator/clustermemberremovalcontroller/clustermemberremovalcontroller.go`:
- Around line 91-96: The TNF handover ownership guard that prevented this
controller from performing DualReplica post-handover member add/remove was
removed; restore an explicit ownership check so the removal path is skipped when
Pacemaker owns handover. Add the handover/topology check back into the
clustermemberremovalcontroller (e.g., at the start of syncer.Sync or in the
controller setup that uses factory.New()/WithSync) to return early when the
TNF/pacemaker handover flag indicates Pacemaker ownership (reinstating the
previous early-return and topology signal guard used around DualReplica member
removal), so the controller does not execute member-removal code while Pacemaker
is the owner.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 684872d1-13d7-4023-8554-0f21ac2958ba

📥 Commits

Reviewing files that changed from the base of the PR and between f570797 and ef3a8ae.

📒 Files selected for processing (5)

pkg/operator/ceohelpers/external_etcd_status.go
pkg/operator/ceohelpers/external_etcd_status_test.go
pkg/operator/clustermembercontroller/clustermembercontroller.go
pkg/operator/clustermemberremovalcontroller/clustermemberremovalcontroller.go
pkg/operator/starter.go

💤 Files with no reviewable changes (3)

pkg/operator/ceohelpers/external_etcd_status.go
pkg/operator/starter.go
pkg/operator/ceohelpers/external_etcd_status_test.go

coderabbitai · 2026-03-13T13:26:28Z

 	return factory.New().ResyncEvery(time.Minute).
-		WithBareInformers(networkInformer.Informer(), infrastructureInformer.Informer()).
+		WithBareInformers(networkInformer.Informer()).
 		WithInformers(informers...).


⚠️ Potential issue | 🔴 Critical

Do not re-enable CEO learner add/promote after DualReplica handover without coordination.

According to PR #1534, once DualReplica external-etcd handover completes, Pacemaker owns membership changes. Removing the infra-based guard and informer here makes reconcileMembers() add/promote learners unconditionally again, with no replacement ownership check in this file. That can race Pacemaker on the same etcd membership state. (github.com)

Also applies to: 105-135

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@pkg/operator/clustermembercontroller/clustermembercontroller.go` around lines 93 - 95, The change removed the infra-based guard/informer handling and allows reconcileMembers() to add/promote learners unconditionally, which can race Pacemaker after DualReplica handover; restore the previous guard/informer behavior in the controller builder (the factory.New().ResyncEvery(...).WithBareInformers(networkInformer.Informer()).WithInformers(informers...) chain) and add an explicit ownership check inside reconcileMembers() that skips learner add/promote operations when DualReplica external-etcd handover is complete or when Pacemaker owns membership changes (i.e., gate the code paths that perform learner add/promote with a check for the handover/ownership state before making membership changes).

dusk125 · 2026-03-13T17:19:48Z

/lgtm

openshift-ci · 2026-03-13T17:20:09Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dusk125, fonta-rh

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [dusk125]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

eggfoobar · 2026-03-13T21:16:00Z

/retest-required

openshift-ci · 2026-03-14T00:34:51Z

@fonta-rh: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

openshift-ci-robot · 2026-03-16T07:52:34Z

@fonta-rh: An error was encountered searching for bug OCPEDGE-2393 on the Jira server at https://issues.redhat.com. No known errors were detected, please see the full error message for details.

Full error message.


request failed. Please analyze the request body for more details. Status code: 403:

Please contact an administrator to resolve this issue, then request a bug refresh with /jira refresh.

Details

In response to this:

Summary

This reverts PR #1534 (merge commit b3d205a), which disabled etcd member management (add/remove) in DualReplica (TNF) clusters after the external etcd transition completed.

Removes ShouldSkipMemberManagementForDualReplica helper from ceohelpers

Removes DualReplica skip guards from ClusterMemberController and ClusterMemberRemovalController

Removes infrastructureInformer wiring from both controllers and starter.go

Reverts #1534

Summary by CodeRabbit

Refactoring

Removed dual-replica topology checks from cluster member management workflows, simplifying the member reconciliation logic

Eliminated conditional skip behavior previously used to gate member management based on infrastructure topology status

Cleaned up infrastructure observer dependencies from cluster member management and removal controllers

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

fonta-rh · 2026-03-16T07:53:19Z

/jira refresh

openshift-ci-robot · 2026-03-16T07:53:23Z

@fonta-rh: An error was encountered searching for bug OCPEDGE-2393 on the Jira server at https://issues.redhat.com. No known errors were detected, please see the full error message for details.

Full error message.


request failed. Please analyze the request body for more details. Status code: 403:

Please contact an administrator to resolve this issue, then request a bug refresh with /jira refresh.

Details

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

openshift-ci-robot · 2026-03-16T07:53:24Z

@fonta-rh: The referenced Jira(s) [OCPEDGE-2393] could not be located, all automatically applied jira labels will be removed.

Details

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

jaypoulz · 2026-03-17T11:51:06Z

/hold

After testing, this doesn't improve reliability on IPv4, which was the primary reason for exploring this revert.

Revert "Merge pull request openshift#1534 from clobrano/enhancement/d…

ef3a8ae

…isable-etcd-member-promotion-in-tnf" This reverts commit b3d205a, reversing changes made to 3f63dbe.

openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Mar 13, 2026

openshift-ci Bot requested review from hasbro17 and ironcladlou March 13, 2026 13:19

coderabbitai Bot reviewed Mar 13, 2026

View reviewed changes

openshift-ci Bot assigned dusk125 Mar 13, 2026

openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label Mar 13, 2026

openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 13, 2026

fonta-rh changed the title ~~Revert "NO-JIRA: [TNF] disable etcd member management after etcd handover"~~ OCPEDGE-2393: Revert "NO-JIRA: [TNF] disable etcd member management after etcd handover" Mar 16, 2026

openshift-ci-robot removed the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Mar 16, 2026

openshift-ci Bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 17, 2026

fonta-rh closed this Mar 25, 2026

fonta-rh deleted the revert-pr-1534 branch March 25, 2026 09:29

Conversation

fonta-rh commented Mar 13, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Summary by CodeRabbit

Refactoring

Uh oh!

openshift-ci-robot commented Mar 13, 2026

Summary

Uh oh!

coderabbitai Bot commented Mar 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

❌ Failed checks (1 warning, 1 inconclusive)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Mar 13, 2026

Choose a reason for hiding this comment

Uh oh!

dusk125 commented Mar 13, 2026

Uh oh!

openshift-ci Bot commented Mar 13, 2026

Uh oh!

eggfoobar commented Mar 13, 2026

Uh oh!

openshift-ci Bot commented Mar 14, 2026

Uh oh!

openshift-ci-robot commented Mar 16, 2026

Summary

Summary by CodeRabbit

Refactoring

Uh oh!

fonta-rh commented Mar 16, 2026

Uh oh!

openshift-ci-robot commented Mar 16, 2026

Uh oh!

openshift-ci-robot commented Mar 16, 2026

Uh oh!

jaypoulz commented Mar 17, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

fonta-rh commented Mar 13, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Mar 13, 2026 •

edited

Loading