Feature Request: PostSync Monitoring on all Config Sync managed resources #1886

@tylerreidwaze

Checklist

  • I did not find a related open enhancement request.
  • I understand that enhancement requests filed in the GitHub repository are by default low priority.
  • If this request is time-sensitive, I have submitted a corresponding issue with GCP support.

Describe the feature

The current experimental PostSync feature only monitors RootSyncs and RepoSyncs. However, most issues that can affect those resources have been shifted left: our CI pipelines catch them earlier using kubectl apply --dry-run=server. What would be really powerful is PostSync monitoring for the resources that our RootSyncs manage.

We consume Config Sync as part of Config Controller, so our RootSyncs manage KCC/GCP resources. Those resources are a lot harder to dry-run, however, because there is a strong dependency on the GCP APIs' willingness to accept a given change. We often have resources reaching the UpdateFailed or DependencyNotFound states. Currently, the finest granularity at which we can monitor and alert on this is the RootSync level, using the pipeline_error_observed metric. This is a huge downside because a RootSync can have more than 1000 resources, and we have to alert a centralized team rather than the owner of the breaking change. Ideally, we would measure, monitor, and alert on this at the per-resource level.

The feature request is for PostSync to also emit a structured log entry for these resource failures, so that we can follow the PostSync flow of logs -> pubsub -> workload (which would file a bug to the relevant assignee) on a per-resource rather than per-RootSync basis. There is some intricacy to work around here because a resource might be in the DependencyNotFound state for a few minutes and then resolve once the dependency is created.
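To make the transient-failure concern concrete, here is a minimal Python sketch of one way a log line could be held back until a failure has persisted past a grace period. It is only an illustration of the requested behaviour, not how PostSync works today: the `FailureDebouncer` class, the `grace_seconds` default, and the JSON field names are all hypothetical.

```python
import json
from dataclasses import dataclass, field
from typing import Dict, Optional, Set, Tuple

# (kind, namespace, name) identifies one managed resource in this sketch.
ResourceKey = Tuple[str, str, str]


@dataclass
class FailureDebouncer:
    """Emit one structured log line per failure that outlives a grace period."""

    grace_seconds: float = 600.0  # hypothetical default, tune per environment
    _first_seen: Dict[ResourceKey, float] = field(default_factory=dict)
    _reported: Set[ResourceKey] = field(default_factory=set)

    def observe(self, key: ResourceKey, reason: Optional[str], now: float) -> Optional[str]:
        """Record one observation; `reason` is None when the resource is healthy.

        Returns a JSON log line the first time a failure has lasted at least
        `grace_seconds`, otherwise None.
        """
        if reason is None:
            # The resource recovered (e.g. the dependency was created): reset.
            self._first_seen.pop(key, None)
            self._reported.discard(key)
            return None

        first = self._first_seen.setdefault(key, now)
        if key in self._reported or now - first < self.grace_seconds:
            return None

        self._reported.add(key)
        kind, namespace, name = key
        # Hypothetical log schema; the real field names would be up to PostSync.
        return json.dumps(
            {
                "kind": kind,
                "namespace": namespace,
                "name": name,
                "reason": reason,
                "failing_for_seconds": now - first,
            },
            sort_keys=True,
        )
```

With this shape, a resource that sits in DependencyNotFound for a couple of minutes and then recovers never produces a log line, while one stuck in UpdateFailed produces exactly one line that the pubsub consumer can route to the owner.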

An example of those failures on the GKE resources:

> kc get LoggingLink -n REDACTED
NAME                            AGE   READY   STATUS         STATUS AGE
fooservicefranc-stg-stg1-link   9d    False   UpdateFailed   9d
insightsprocess-stg-stg1-link   9d    False   UpdateFailed   9d
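As far as we can tell, the READY and STATUS columns above come from the resource's Ready condition under status.conditions. Assuming that layout, a per-resource check could look roughly like the Python sketch below, which takes a resource already decoded from JSON (for example from kubectl get -o json). The `ready_failure` function, the `FAILURE_REASONS` set, and the returned record shape are hypothetical names for illustration.

```python
from typing import Optional

# The two failure reasons named in this issue; other reasons may also matter.
FAILURE_REASONS = {"UpdateFailed", "DependencyNotFound"}


def ready_failure(resource: dict) -> Optional[dict]:
    """Return a small failure record if Ready is False for a known reason.

    Assumes the resource exposes a `Ready` entry in `status.conditions`
    with `status`, `reason`, and `lastTransitionTime` fields.
    """
    conditions = resource.get("status", {}).get("conditions", [])
    for cond in conditions:
        if cond.get("type") != "Ready":
            continue
        if cond.get("status") == "False" and cond.get("reason") in FAILURE_REASONS:
            meta = resource.get("metadata", {})
            return {
                "kind": resource.get("kind"),
                "namespace": meta.get("namespace"),
                "name": meta.get("name"),
                "reason": cond.get("reason"),
                "since": cond.get("lastTransitionTime"),
            }
    # Healthy, still reconciling for another reason, or no Ready condition yet.
    return None
```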

And we can actually see those failures from Config Sync/nomos:

> nomos status --name REDACTED | grep logginglink
     REDACTED   logginglink.logging.cnrm.cloud.google.com/fooservicefranc-stg-stg1-link   InProgress   ac89d2a
     REDACTED   logginglink.logging.cnrm.cloud.google.com/insightsprocess-stg-stg1-link   InProgress   ac89d2a
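As a stopgap, the per-resource rows above can be scraped into records. The Python sketch below assumes each row has exactly the four whitespace-separated columns seen in this output (namespace, kind.group/name, status, commit). That is an assumption about human-readable output, not a stable interface, and rows for cluster-scoped resources with an empty namespace column would be skipped. `ManagedResource` and `parse_resource_line` are hypothetical names.

```python
from typing import NamedTuple, Optional


class ManagedResource(NamedTuple):
    namespace: str
    kind_group: str  # e.g. "logginglink.logging.cnrm.cloud.google.com"
    name: str
    status: str      # e.g. "InProgress"
    commit: str


def parse_resource_line(line: str) -> Optional[ManagedResource]:
    """Parse one per-resource row; return None for rows that do not match."""
    parts = line.split()
    if len(parts) != 4 or "/" not in parts[1]:
        return None
    namespace, resource, status, commit = parts
    kind_group, name = resource.split("/", 1)
    return ManagedResource(namespace, kind_group, name, status, commit)
```

Scraping CLI output like this is brittle, which is part of why a structured per-resource log from PostSync would be preferable.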

Importance

This is a blocker for our adoption of PostSync. Monitoring only our RootSyncs doesn't provide us much value. If we can't get per-resource monitoring from the system, we are reaching the scale where we would need to implement this feature ourselves.

Labels: enhancement