Feature flag for disabling the alert manager replica set extension #7200

@b-wu26

Is your feature request related to a problem? Please describe.
Currently, when a Cortex alert manager loads an erroneous configuration that crashes its pod, the replica set extension syncs that configuration to additional alert managers to maintain the replication factor. More pods then receive the bad config, which leads to even more crashes.

Describe the solution you'd like
There should be a mechanism to limit the blast radius of erroneous configs and similar issues. If an error occurs, only the 3 alert managers that own the affected user's config (or however many the replication factor is set to) should be impacted, while the rest of the fleet continues operating.

Describe alternatives you've considered
I've considered permanently disabling the replica set extension, but there are cases where an alert manager becomes unhealthy for legitimate reasons, and disabling the extension could then hurt the alert managers' high availability. I think a feature flag would be better: operators can choose the trade-off between safety and high availability.
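
To make the trade-off concrete, here is a minimal sketch of the two behaviours on a toy ring. It is not Cortex's ring code: `Instance`, `select_replicas`, and the `extend_replica_set` parameter are hypothetical names used only for illustration, and health is modelled as a simple boolean.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    name: str
    healthy: bool


def select_replicas(ring, start, replication_factor, extend_replica_set):
    """Return the names of the instances that would load a tenant's config.

    ring: instances in ring order (hypothetical, simplified model).
    start: index of the tenant's first owner on the ring.
    extend_replica_set: hypothetical flag; when True, unhealthy instances
    are skipped and replaced by the next instances on the ring.
    """
    selected = []
    for offset in range(len(ring)):
        inst = ring[(start + offset) % len(ring)]
        if extend_replica_set and not inst.healthy:
            # Extension on: walk past the unhealthy owner to a new instance.
            continue
        selected.append(inst.name)
        if len(selected) == replication_factor:
            break
    return selected


# Six alert managers; the three original owners have crashed on a bad config.
ring = [Instance(f"am-{i}", healthy=(i >= 3)) for i in range(6)]

print(select_replicas(ring, 0, 3, extend_replica_set=True))
# ['am-3', 'am-4', 'am-5']  (the bad config spreads to the rest of the fleet)
print(select_replicas(ring, 0, 3, extend_replica_set=False))
# ['am-0', 'am-1', 'am-2']  (the impact stays with the original owners)
```

With the flag off, the tenant loses its replicas until the owners recover, which is the high-availability cost mentioned above.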

Additional context
This is based on an actual case where one erroneous config led to the entire fleet of alert managers going down. We tried restarting the pods and editing the StatefulSet, but nothing worked until we diagnosed the issue and intervened manually.