CLOUDP-306333: Remove monitoring hosts on downscaling #652
base: master
Conversation
MCK 1.6.2 Release Notes · Bug Fixes
```go
controlledFeature *controlledfeature.ControlledFeature
// hosts are used for both automation agents and monitoring endpoints.
// They are necessary for emulating the "agents are ready" behavior, as the
// operator checks that hosts exist for the agents.
// In Ops Manager, "hosts" and "automation agents" are two different things:
```
Without fixing the mock (separating the agent types), some sharded cluster tests were failing after the changes:
- TestMultiClusterShardedScalingWithOverrides
- TestMultiClusterShardedScaling
- TestReconcileCreateShardedCluster
- [...]
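Roughly the shape of the separation (the type and field names below are illustrative, not the actual mock code):

```go
// Illustrative only: the mocked OM connection tracks automation agents and
// monitored hosts separately, so tests can assert on each independently.
type mockedOmConnection struct {
	agentHostnames     []string // what the automation agents register
	monitoredHostnames []string // what GET /hosts would return
}

// addMonitoredHost records a host on the monitoring side only.
func (c *mockedOmConnection) addMonitoredHost(hostname string) {
	c.monitoredHostnames = append(c.monitoredHostnames, hostname)
}
```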
```go
}

// getAllHostnames returns the hostnames of all replicas across all clusters.
// Unhealthy clusters are ignored when reachableClustersOnly is set to true.
func (r *ReconcileMongoDbMultiReplicaSet) getAllHostnames(mrs mdbmultiv1.MongoDBMultiCluster, clusterSpecList mdb.ClusterSpecList, reachableClustersOnly bool, log *zap.SugaredLogger) ([]string, error) {
```
I extracted this logic, which was in `updateDeploymentRs`, into a subfunction.
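Roughly what the extracted subfunction looks like; `isClusterReachable` and `hostnamesForCluster` below are hypothetical stand-ins for the real per-cluster logic, using the repo types from the hunk above:

```go
func (r *ReconcileMongoDbMultiReplicaSet) getAllHostnames(mrs mdbmultiv1.MongoDBMultiCluster, clusterSpecList mdb.ClusterSpecList, reachableClustersOnly bool, log *zap.SugaredLogger) ([]string, error) {
	var hostnames []string
	for _, spec := range clusterSpecList {
		// Skip clusters we cannot reach when the caller only wants healthy ones.
		if reachableClustersOnly && !isClusterReachable(spec.ClusterName) {
			log.Debugf("Skipping unreachable cluster %s", spec.ClusterName)
			continue
		}
		hostnames = append(hostnames, hostnamesForCluster(mrs, spec)...)
	}
	return hostnames, nil
}
```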
changelog/20251216_fix_remove_hosts_from_monitoring_on_scale_down.md
```md
---
kind: fix
date: 2025-12-16
---

* Fix an issue to ensure that hosts are consistently removed from Ops Manager monitoring during MongoDB and AppDB scale-down events.
```
LGTM!
```go
}

for i := 0; i < 2; i++ {
	reconciler, err = newAppDbMultiReconciler(ctx, kubeClient, opsManager, globalClusterMap, log, omConnectionFactory.GetConnectionFunc)
```
Can we verify that only the hosts that are taken down in that particular reconcile loop are removed from monitoring?
I would like to ensure we're checking that we're not removing ALL hosts immediately.
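For example, something along these lines (the hostnames and the `getMonitoredHostnames` accessor are hypothetical):

```go
// After a reconcile that scales members from 3 down to 2, only the removed
// member should disappear from monitoring; the others must remain.
monitored := getMonitoredHostnames(omConnectionFactory.GetConnection())
assert.Contains(t, monitored, "appdb-0-svc.ns.svc.cluster.local")
assert.Contains(t, monitored, "appdb-1-svc.ns.svc.cluster.local")
assert.NotContains(t, monitored, "appdb-2-svc.ns.svc.cluster.local")
```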
lsierant left a comment
Awesome changes!
I like how extensively you've unit tested it!
Summary
Fixes CLOUDP-306333: when scaling down, removed hosts keep appearing in the OM UI, and monitoring agents keep trying to reach them.
Problem
The operator was not sending DELETE requests to the /hosts endpoint when scaling down. This affected multiple deployment types (the ticket was initially opened for AppDB).
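For context, this is the kind of call that was missing. The Ops Manager public API exposes a DELETE on the hosts resource; the client wiring below is a minimal illustrative sketch, not the operator's actual `om.Connection` code:

```go
package hosts

import (
	"fmt"
	"net/http"
)

// removeMonitoredHost issues the DELETE that stops OM from monitoring a host.
// Real code would also handle digest auth and error response bodies.
func removeMonitoredHost(client *http.Client, baseURL, groupID, hostID string) error {
	url := fmt.Sprintf("%s/api/public/v1.0/groups/%s/hosts/%s", baseURL, groupID, hostID)
	req, err := http.NewRequest(http.MethodDelete, url, nil)
	if err != nil {
		return err
	}
	resp, err := client.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode >= 300 {
		return fmt.Errorf("removing host %s failed: %s", hostID, resp.Status)
	}
	return nil
}
```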
Solution
On each reconcile, we now fetch the hosts currently monitored by OM, compute the desired hostnames, and remove every monitored host that is no longer desired (sketched after this paragraph).
When fetching monitored hosts, we rely on the assumption that one OM project = one deployment.
The goal of this design is to be idempotent: if the operator crashes in the middle of a reconciliation, we always compare what we have (OM state) with what we want.
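A minimal sketch of that comparison, with simplified stand-in types (the real implementation goes through the OM connection, exercised by the new `GetAllMonitoredHostnames`/`RemoveUndesiredMonitoringHosts` tests below):

```go
// Host is a simplified stand-in for an entry returned by the OM hosts endpoint.
type Host struct {
	ID       string
	Hostname string
}

// removeUndesiredHosts deletes every monitored host that is not in the desired
// set. Running it twice is a no-op, which is what makes the design idempotent.
func removeUndesiredHosts(monitored []Host, desired map[string]struct{}, remove func(hostID string) error) error {
	for _, h := range monitored {
		if _, ok := desired[h.Hostname]; !ok {
			if err := remove(h.ID); err != nil {
				return err
			}
		}
	}
	return nil
}
```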
Some previous approaches in controllers were computing the diff inside the reconciliation loop itself, for example in the RS controller.
Design decisions to note
New Tests
`GetAllMonitoredHostnames` and `RemoveUndesiredMonitoringHosts`
Proof of Work
Tests pass.
Checklist