Skip to content

OCPEDGE-2393: Revert "NO-JIRA: [TNF] disable etcd member management after etcd handover"#1570

Closed
fonta-rh wants to merge 1 commit intoopenshift:mainfrom
fonta-rh:revert-pr-1534
Closed

OCPEDGE-2393: Revert "NO-JIRA: [TNF] disable etcd member management after etcd handover"#1570
fonta-rh wants to merge 1 commit intoopenshift:mainfrom
fonta-rh:revert-pr-1534

Conversation

@fonta-rh
Copy link
Copy Markdown
Contributor

@fonta-rh fonta-rh commented Mar 13, 2026

Summary

This reverts PR #1534 (merge commit b3d205a), which disabled etcd member management (add/remove) in DualReplica (TNF) clusters after the external etcd transition completed.

  • Removes ShouldSkipMemberManagementForDualReplica helper from ceohelpers
  • Removes DualReplica skip guards from ClusterMemberController and ClusterMemberRemovalController
  • Removes infrastructureInformer wiring from both controllers and starter.go

Reverts #1534

Summary by CodeRabbit

Refactoring

  • Removed dual-replica topology checks from cluster member management workflows, simplifying the member reconciliation logic
  • Eliminated conditional skip behavior previously used to gate member management based on infrastructure topology status
  • Cleaned up infrastructure observer dependencies from cluster member management and removal controllers

…isable-etcd-member-promotion-in-tnf"

This reverts commit b3d205a, reversing
changes made to 3f63dbe.
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Mar 13, 2026
@openshift-ci-robot
Copy link
Copy Markdown

@fonta-rh: This pull request explicitly references no jira issue.

Details

In response to this:

Summary

This reverts PR #1534 (merge commit b3d205a), which disabled etcd member management (add/remove) in DualReplica (TNF) clusters after the external etcd transition completed.

  • Removes ShouldSkipMemberManagementForDualReplica helper from ceohelpers
  • Removes DualReplica skip guards from ClusterMemberController and ClusterMemberRemovalController
  • Removes infrastructureInformer wiring from both controllers and starter.go

Reverts #1534

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Mar 13, 2026

Walkthrough

This PR removes the dual-replica topology check infrastructure from cluster member management. It eliminates the ShouldSkipMemberManagementForDualReplica function, removes infrastructure lister dependencies from multiple controllers, and simplifies the member reconciliation flow by removing conditional member management gating based on topology.

Changes

Cohort / File(s) Summary
External ETCD Status Helpers
pkg/operator/ceohelpers/external_etcd_status.go, pkg/operator/ceohelpers/external_etcd_status_test.go
Removed ShouldSkipMemberManagementForDualReplica function definition and its complete test suite, including related imports and test infrastructure.
Cluster Member Controllers
pkg/operator/clustermembercontroller/clustermembercontroller.go, pkg/operator/clustermemberremovalcontroller/clustermemberremovalcontroller.go
Removed infrastructure lister field and informer parameter from both controllers. Eliminated dual-replica skip logic from reconciliation paths; removed conditional member management gating that previously checked topology status.
Operator Starter
pkg/operator/starter.go
Removed Infrastructure informer usage from envVarController and NewQuorumChecker initialization paths.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 6 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Test Structure And Quality ❓ Inconclusive The custom check requests review of Ginkgo test code quality, but this codebase uses standard Go testing package with table-driven tests, not Ginkgo. Clarify whether the check should apply to standard Go testing patterns or confirm Ginkgo-specific review is required. The current tests follow existing codebase patterns.
✅ Passed checks (6 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately describes the main change: reverting PR #1534 which disabled etcd member management in DualReplica clusters. This is directly supported by the file summaries showing removal of ShouldSkipMemberManagementForDualReplica logic and related infrastructure checks.
Stable And Deterministic Test Names ✅ Passed The custom check validates Ginkgo test names for stability; however, this repository uses standard Go testing framework, not Ginkgo, making this check not applicable.
Microshift Test Compatibility ✅ Passed This PR is a revert that removes previously added code and does not introduce any new Ginkgo e2e tests.
Single Node Openshift (Sno) Test Compatibility ✅ Passed PR is a revert that removes code and unit tests, not adding any new Ginkgo e2e tests.
Ipv6 And Disconnected Network Test Compatibility ✅ Passed This pull request is a revert that removes previously added code and does not add any new Ginkgo e2e tests.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
📝 Coding Plan
  • Generate coding plan for human review comments

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.11.3)

Error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/docs/product/migration-guide for migration instructions
The command is terminated due to an error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/docs/product/migration-guide for migration instructions


Comment @coderabbitai help to get the list of available commands and usage tips.

Tip

You can disable sequence diagrams in the walkthrough.

Disable the reviews.sequence_diagrams setting to disable sequence diagrams in the walkthrough.

@openshift-ci openshift-ci Bot requested review from hasbro17 and ironcladlou March 13, 2026 13:19
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
pkg/operator/clustermemberremovalcontroller/clustermemberremovalcontroller.go (1)

91-96: ⚠️ Potential issue | 🔴 Critical

Keep the TNF handover ownership guard in the removal path.

PR #1534 explicitly moved DualReplica post-handover member add/remove under Pacemaker, and this revert removes both the topology signal and the early return from this controller. After this change, sync() can call the member-removal paths again in the same state Pacemaker is supposed to own, which reintroduces a dual-writer/quorum-risking path on etcd membership. Please restore an explicit handover check here unless the Pacemaker-side ownership was removed in the same change. (github.com)

Also applies to: 105-171

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@pkg/operator/clustermemberremovalcontroller/clustermemberremovalcontroller.go`
around lines 91 - 96, The TNF handover ownership guard that prevented this
controller from performing DualReplica post-handover member add/remove was
removed; restore an explicit ownership check so the removal path is skipped when
Pacemaker owns handover. Add the handover/topology check back into the
clustermemberremovalcontroller (e.g., at the start of syncer.Sync or in the
controller setup that uses factory.New()/WithSync) to return early when the
TNF/pacemaker handover flag indicates Pacemaker ownership (reinstating the
previous early-return and topology signal guard used around DualReplica member
removal), so the controller does not execute member-removal code while Pacemaker
is the owner.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/operator/clustermembercontroller/clustermembercontroller.go`:
- Around line 93-95: The change removed the infra-based guard/informer handling
and allows reconcileMembers() to add/promote learners unconditionally, which can
race Pacemaker after DualReplica handover; restore the previous guard/informer
behavior in the controller builder (the
factory.New().ResyncEvery(...).WithBareInformers(networkInformer.Informer()).WithInformers(informers...)
chain) and add an explicit ownership check inside reconcileMembers() that skips
learner add/promote operations when DualReplica external-etcd handover is
complete or when Pacemaker owns membership changes (i.e., gate the code paths
that perform learner add/promote with a check for the handover/ownership state
before making membership changes).

---

Outside diff comments:
In
`@pkg/operator/clustermemberremovalcontroller/clustermemberremovalcontroller.go`:
- Around line 91-96: The TNF handover ownership guard that prevented this
controller from performing DualReplica post-handover member add/remove was
removed; restore an explicit ownership check so the removal path is skipped when
Pacemaker owns handover. Add the handover/topology check back into the
clustermemberremovalcontroller (e.g., at the start of syncer.Sync or in the
controller setup that uses factory.New()/WithSync) to return early when the
TNF/pacemaker handover flag indicates Pacemaker ownership (reinstating the
previous early-return and topology signal guard used around DualReplica member
removal), so the controller does not execute member-removal code while Pacemaker
is the owner.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 684872d1-13d7-4023-8554-0f21ac2958ba

📥 Commits

Reviewing files that changed from the base of the PR and between f570797 and ef3a8ae.

📒 Files selected for processing (5)
  • pkg/operator/ceohelpers/external_etcd_status.go
  • pkg/operator/ceohelpers/external_etcd_status_test.go
  • pkg/operator/clustermembercontroller/clustermembercontroller.go
  • pkg/operator/clustermemberremovalcontroller/clustermemberremovalcontroller.go
  • pkg/operator/starter.go
💤 Files with no reviewable changes (3)
  • pkg/operator/ceohelpers/external_etcd_status.go
  • pkg/operator/starter.go
  • pkg/operator/ceohelpers/external_etcd_status_test.go

Comment on lines 93 to 95
return factory.New().ResyncEvery(time.Minute).
WithBareInformers(networkInformer.Informer(), infrastructureInformer.Informer()).
WithBareInformers(networkInformer.Informer()).
WithInformers(informers...).
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Do not re-enable CEO learner add/promote after DualReplica handover without coordination.

According to PR #1534, once DualReplica external-etcd handover completes, Pacemaker owns membership changes. Removing the infra-based guard and informer here makes reconcileMembers() add/promote learners unconditionally again, with no replacement ownership check in this file. That can race Pacemaker on the same etcd membership state. (github.com)

Also applies to: 105-135

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/operator/clustermembercontroller/clustermembercontroller.go` around lines
93 - 95, The change removed the infra-based guard/informer handling and allows
reconcileMembers() to add/promote learners unconditionally, which can race
Pacemaker after DualReplica handover; restore the previous guard/informer
behavior in the controller builder (the
factory.New().ResyncEvery(...).WithBareInformers(networkInformer.Informer()).WithInformers(informers...)
chain) and add an explicit ownership check inside reconcileMembers() that skips
learner add/promote operations when DualReplica external-etcd handover is
complete or when Pacemaker owns membership changes (i.e., gate the code paths
that perform learner add/promote with a check for the handover/ownership state
before making membership changes).

@dusk125
Copy link
Copy Markdown
Contributor

dusk125 commented Mar 13, 2026

/lgtm

@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label Mar 13, 2026
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented Mar 13, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dusk125, fonta-rh

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 13, 2026
@eggfoobar
Copy link
Copy Markdown
Contributor

/retest-required

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented Mar 14, 2026

@fonta-rh: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@fonta-rh fonta-rh changed the title Revert "NO-JIRA: [TNF] disable etcd member management after etcd handover" OCPEDGE-2393: Revert "NO-JIRA: [TNF] disable etcd member management after etcd handover" Mar 16, 2026
@openshift-ci-robot
Copy link
Copy Markdown

@fonta-rh: An error was encountered searching for bug OCPEDGE-2393 on the Jira server at https://issues.redhat.com. No known errors were detected, please see the full error message for details.

Full error message. request failed. Please analyze the request body for more details. Status code: 403:

Please contact an administrator to resolve this issue, then request a bug refresh with /jira refresh.

Details

In response to this:

Summary

This reverts PR #1534 (merge commit b3d205a), which disabled etcd member management (add/remove) in DualReplica (TNF) clusters after the external etcd transition completed.

  • Removes ShouldSkipMemberManagementForDualReplica helper from ceohelpers
  • Removes DualReplica skip guards from ClusterMemberController and ClusterMemberRemovalController
  • Removes infrastructureInformer wiring from both controllers and starter.go

Reverts #1534

Summary by CodeRabbit

Refactoring

  • Removed dual-replica topology checks from cluster member management workflows, simplifying the member reconciliation logic
  • Eliminated conditional skip behavior previously used to gate member management based on infrastructure topology status
  • Cleaned up infrastructure observer dependencies from cluster member management and removal controllers

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot openshift-ci-robot removed the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Mar 16, 2026
@fonta-rh
Copy link
Copy Markdown
Contributor Author

/jira refresh

@openshift-ci-robot
Copy link
Copy Markdown

@fonta-rh: An error was encountered searching for bug OCPEDGE-2393 on the Jira server at https://issues.redhat.com. No known errors were detected, please see the full error message for details.

Full error message. request failed. Please analyze the request body for more details. Status code: 403:

Please contact an administrator to resolve this issue, then request a bug refresh with /jira refresh.

Details

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot
Copy link
Copy Markdown

@fonta-rh: The referenced Jira(s) [OCPEDGE-2393] could not be located, all automatically applied jira labels will be removed.

Details

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@jaypoulz
Copy link
Copy Markdown
Contributor

/hold

After testing, this doesn't improve reliability on IPv4, which was the primary reason for exploring this revert.

@openshift-ci openshift-ci Bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 17, 2026
@fonta-rh fonta-rh closed this Mar 25, 2026
@fonta-rh fonta-rh deleted the revert-pr-1534 branch March 25, 2026 09:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. lgtm Indicates that a PR is ready to be merged.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants