Description
Is your feature request related to a problem? Please describe.
Currently, when a Cortex Alertmanager loads an erroneous Alertmanager configuration that crashes the pod, the replica set extension brings in additional Alertmanagers to maintain the replication factor. Those pods then receive the same config, which leads to even more errors.
Describe the solution you'd like
There should be a mechanism that limits the blast radius of erroneous configs and similar issues. If an error occurs, only the 3 Alertmanagers (or however many the replication factor is set to) that own the user with the problem config should be impacted, while the rest of the fleet continues operating.
Describe alternatives you've considered
I've considered permanently disabling replica set extension, but there are cases where an Alertmanager becomes unhealthy for legitimate reasons, and disabling the extension could hurt the Alertmanagers' HA response. I think a feature flag would be better, so that operators can choose the trade-off between safety and high availability.
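To illustrate the trade-off, here is a minimal toy model of the cascade, not Cortex's actual ring code. It is written in Python for brevity, although Cortex itself is Go. The names `select_owners`, `crashed_after_bad_config`, and the `extend_on_unhealthy` flag are hypothetical. It assumes every instance that loads the bad config crashes, and that each unhealthy owner causes one more instance to be pulled in when extension is enabled.

```python
def select_owners(healthy, start, replication_factor, extend_on_unhealthy):
    """Walk the ring from `start` and return the indices that own the tenant.

    `healthy` is a list of booleans, one per instance (hypothetical model).
    With extension enabled, each unhealthy owner adds one more instance.
    """
    owners = []
    wanted = replication_factor
    for offset in range(len(healthy)):
        if len(owners) >= wanted:
            break
        idx = (start + offset) % len(healthy)
        owners.append(idx)
        if extend_on_unhealthy and not healthy[idx]:
            wanted += 1  # replica set extension: pull in one more instance
    return owners


def crashed_after_bad_config(fleet_size, replication_factor, extend_on_unhealthy):
    """Count how many instances end up crashed by a single bad tenant config."""
    healthy = [True] * fleet_size
    while True:
        owners = select_owners(healthy, 0, replication_factor, extend_on_unhealthy)
        # Assumption: every healthy owner loads the bad config and crashes.
        newly_crashed = [i for i in owners if healthy[i]]
        if not newly_crashed:
            break
        for i in newly_crashed:
            healthy[i] = False
    return healthy.count(False)


if __name__ == "__main__":
    print("extension on: ", crashed_after_bad_config(10, 3, True))
    print("extension off:", crashed_after_bad_config(10, 3, False))
```

In this model, a fleet of 10 with a replication factor of 3 loses every instance when extension is on, and only the 3 owners when it is off. The real behavior depends on how the ring decides health and ownership, so treat this only as a picture of the failure mode.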
Additional context
This is based on an actual case where one erroneous config led to the entire fleet of Alertmanagers going down. We tried restarting the pods and editing the StatefulSet, but nothing worked until we diagnosed the issue and intervened manually.