CLOUDP-306333: Remove monitoring hosts on downscaling #652
base: master
Conversation
MCK 1.6.2 Release Notes · Bug Fixes
```go
controlledFeature *controlledfeature.ControlledFeature
// hosts are used for both automation agents and monitoring endpoints.
// They are necessary for emulating the "agents are ready" behavior, as the
// operator checks that hosts exist for the agents.
// In Ops Manager, "hosts" and "automation agents" are two different things:
```
Without fixing the mock (separating the agent types), some sharded cluster tests were failing after the changes:
- TestMultiClusterShardedScalingWithOverrides
- TestMultiClusterShardedScaling
- TestReconcileCreateShardedCluster
- [...]
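Roughly the shape of the separation (the type and field names below are illustrative, not the actual mock code):

```go
// Illustrative only: the mocked OM connection tracks automation agents and
// monitored hosts separately, so tests can assert on each independently.
type mockedOmConnection struct {
	agentHostnames     []string // what the automation agents register
	monitoredHostnames []string // what GET /hosts would return
}

// addMonitoredHost records a host on the monitoring side only.
func (c *mockedOmConnection) addMonitoredHost(hostname string) {
	c.monitoredHostnames = append(c.monitoredHostnames, hostname)
}
```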
```go
}

// getAllHostnames returns the hostnames of all replicas across all clusters.
// Unhealthy clusters are ignored when reachableClustersOnly is set to true.
func (r *ReconcileMongoDbMultiReplicaSet) getAllHostnames(mrs mdbmultiv1.MongoDBMultiCluster, clusterSpecList mdb.ClusterSpecList, reachableClustersOnly bool, log *zap.SugaredLogger) ([]string, error) {
```
I extracted this logic, which was in `updateDeploymentRs`, into a subfunction.
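Roughly what the extracted subfunction looks like; `isClusterReachable` and `hostnamesForCluster` below are hypothetical stand-ins for the real per-cluster logic, using the repo types from the hunk above:

```go
func (r *ReconcileMongoDbMultiReplicaSet) getAllHostnames(mrs mdbmultiv1.MongoDBMultiCluster, clusterSpecList mdb.ClusterSpecList, reachableClustersOnly bool, log *zap.SugaredLogger) ([]string, error) {
	var hostnames []string
	for _, spec := range clusterSpecList {
		// Skip clusters we cannot reach when the caller only wants healthy ones.
		if reachableClustersOnly && !isClusterReachable(spec.ClusterName) {
			log.Debugf("Skipping unreachable cluster %s", spec.ClusterName)
			continue
		}
		hostnames = append(hostnames, hostnamesForCluster(mrs, spec)...)
	}
	return hostnames, nil
}
```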
changelog/20251216_fix_remove_hosts_from_monitoring_on_scale_down.md
```md
---
kind: fix
date: 2025-12-16
---

* Fix an issue to ensure that hosts are consistently removed from Ops Manager monitoring during MongoDB and AppDB scale-down events.
```
LGTM!
```go
}

for i := 0; i < 2; i++ {
	reconciler, err = newAppDbMultiReconciler(ctx, kubeClient, opsManager, globalClusterMap, log, omConnectionFactory.GetConnectionFunc)
```
Can we verify that only the hosts that are taken down in that particular reconcile loop are removed from monitoring?
I would like to ensure we're checking that we're not removing ALL hosts immediately.
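For example, something along these lines (the hostnames and the `getMonitoredHostnames` accessor are hypothetical):

```go
// After a reconcile that scales members from 3 down to 2, only the removed
// member should disappear from monitoring; the others must remain.
monitored := getMonitoredHostnames(omConnectionFactory.GetConnection())
assert.Contains(t, monitored, "appdb-0-svc.ns.svc.cluster.local")
assert.Contains(t, monitored, "appdb-1-svc.ns.svc.cluster.local")
assert.NotContains(t, monitored, "appdb-2-svc.ns.svc.cluster.local")
```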
lsierant left a comment
Awesome changes!
I like how extensively you've unit tested it!
Summary
Fixes CLOUDP-306333: when scaling down, removed hosts keep appearing in the OM UI, and monitoring agents keep trying to reach them.
Problem
The operator was not sending DELETE requests to the /hosts endpoint when scaling down. This affected multiple deployment types (the ticket was initially opened for AppDB).
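For context, this is the kind of call that was missing. The Ops Manager public API exposes a DELETE on the hosts resource; the client wiring below is a minimal illustrative sketch, not the operator's actual `om.Connection` code:

```go
package hosts

import (
	"fmt"
	"net/http"
)

// removeMonitoredHost issues the DELETE that stops OM from monitoring a host.
// Real code would also handle digest auth and error response bodies.
func removeMonitoredHost(client *http.Client, baseURL, groupID, hostID string) error {
	url := fmt.Sprintf("%s/api/public/v1.0/groups/%s/hosts/%s", baseURL, groupID, hostID)
	req, err := http.NewRequest(http.MethodDelete, url, nil)
	if err != nil {
		return err
	}
	resp, err := client.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode >= 300 {
		return fmt.Errorf("removing host %s failed: %s", hostID, resp.Status)
	}
	return nil
}
```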
Solution
On each reconcile, we now fetch the hosts currently monitored by OM, compute the desired hostnames, and remove every monitored host that is no longer desired (sketched after this paragraph).
When fetching monitored hosts, we rely on the assumption that one OM project = one deployment.
The goal of this design is to be idempotent: if the operator crashes in the middle of a reconciliation, we always compare what we have (OM state) with what we want.
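A minimal sketch of that comparison, with simplified stand-in types (the real implementation goes through the OM connection, exercised by the new `GetAllMonitoredHostnames`/`RemoveUndesiredMonitoringHosts` tests below):

```go
// Host is a simplified stand-in for an entry returned by the OM hosts endpoint.
type Host struct {
	ID       string
	Hostname string
}

// removeUndesiredHosts deletes every monitored host that is not in the desired
// set. Running it twice is a no-op, which is what makes the design idempotent.
func removeUndesiredHosts(monitored []Host, desired map[string]struct{}, remove func(hostID string) error) error {
	for _, h := range monitored {
		if _, ok := desired[h.Hostname]; !ok {
			if err := remove(h.ID); err != nil {
				return err
			}
		}
	}
	return nil
}
```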
Some previous approaches in controllers were computing the diff inside the reconciliation loop itself, for example in the RS controller.
Design decisions to note
New Tests
`GetAllMonitoredHostnames` and `RemoveUndesiredMonitoringHosts`
Proof of Work
Tests pass.
Checklist