Feature flag for disabling the alert manager replica set extension

**Is your feature request related to a problem? Please describe.**
Currently, when a Cortex alert manager loads an erroneous configuration that crashes the pod, the replica set extension tries to sync in more alert managers to maintain the replication factor. This causes more pods to receive the config, which leads to even more errors.

**Describe the solution you'd like**
There should be a mechanism that limits the blast radius of erroneous configs and similar issues. If an error occurs, only the alert managers that own the user with the problem config (3, or whatever the replication factor is set to) should be impacted, while the rest of the fleet continues operating.
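
To make the difference in blast radius concrete, here is a toy model of the behavior described above. It is not Cortex code: `select_replicas`, `simulate_poison_config`, and the `extend_replica_set` parameter are hypothetical names invented for this sketch, and the ring is reduced to a plain list walked from a hashed starting position. It also assumes a crashed pod stays unhealthy rather than restarting and crashing again.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Instance:
    name: str
    healthy: bool = True


def ring_start(tenant: str, size: int) -> int:
    """Map a tenant to a starting position on the (simplified) ring."""
    digest = hashlib.sha256(tenant.encode()).digest()
    return int.from_bytes(digest[:4], "big") % size


def select_replicas(ring, tenant, rf, extend_replica_set):
    """Return the healthy instances that should receive the tenant's config.

    With extension on, unhealthy instances are skipped and the walk continues
    until `rf` healthy instances are found. With extension off, only the first
    `rf` positions are considered, healthy or not.
    """
    start = ring_start(tenant, len(ring))
    ordered = [ring[(start + i) % len(ring)] for i in range(len(ring))]
    if not extend_replica_set:
        return [inst for inst in ordered[:rf] if inst.healthy]
    return [inst for inst in ordered if inst.healthy][:rf]


def simulate_poison_config(num_instances, rf, extend_replica_set, tenant="tenant-a"):
    """Count how many instances crash when one tenant's config is fatal."""
    ring = [Instance(f"alertmanager-{i}") for i in range(num_instances)]
    while True:
        owners = select_replicas(ring, tenant, rf, extend_replica_set)
        if not owners:
            break
        for inst in owners:
            # Assumption: loading the bad config crashes the pod for good.
            inst.healthy = False
    return sum(1 for inst in ring if not inst.healthy)


if __name__ == "__main__":
    print(simulate_poison_config(9, 3, extend_replica_set=True))   # 9
    print(simulate_poison_config(9, 3, extend_replica_set=False))  # 3
```

In this model, with extension enabled the bad config walks the whole ring and takes down all 9 instances; with it disabled, only the 3 original owners are affected, at the cost of that tenant having no healthy replica.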

**Describe alternatives you've considered**
I've considered permanently disabling the replica set extension, but there are cases where an alert manager becomes unhealthy for legitimate reasons, and disabling the extension could impact the alert managers' HA responses, so I think a feature flag would be better. That way operators can choose the trade-off between safety and high availability.

**Additional context**
This is based on an actual case where one erroneous config led to the entire fleet of alert managers going down. There were attempts to restart the pods and edit the stateful set, but nothing worked until we diagnosed the issue and manually intervened.


Feature flag for disabling the alert manager replica set extension #7200
