Checklist
Describe the feature
The current experimental PostSync feature is scoped to monitoring only `RootSync`s and `RepoSync`s. However, most issues affecting those resources have been shifted left and are caught earlier in our CI pipelines using `kubectl apply --dry-run=server`. What would be really powerful is PostSync monitoring for the resources that our `RootSync`s control.
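For context, a minimal sketch of the kind of shift-left check we run in CI (the manifest path is illustrative): a server-side dry-run pushes the objects through schema and admission validation without persisting them, which catches most `RootSync`/`RepoSync` issues before they ever sync.

```sh
# Illustrative CI step: server-side dry-run of the rendered manifests.
# Catches schema/admission errors for the sync objects themselves,
# but tells us nothing about post-apply, GCP-side failures.
kubectl apply --dry-run=server --recursive -f rendered-manifests/
```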
We are consuming Config Sync as part of Config Controller, so our `RootSync`s manage KCC/GCP resources. Those resources are much harder to dry-run, however, because there is a strong dependency on the GCP APIs' willingness to accept a given change. We often have resources reaching `UpdateFailed` or `DependencyNotFound` states. Currently, the lowest granularity at which we can monitor and alert on this is the `RootSync` level, using the `pipeline_error_observed` metric. This is a huge downside because a `RootSync` can have more than 1000 resources, so we end up alerting a centralized team rather than the owner of the breaking change. Ideally we would measure, monitor, and alert on this at the per-resource level.
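For illustration, this is roughly the coarsest alert we can build today. The query and label names are assumptions about how the metric is scraped in our setup, not a documented contract:

```sh
# Hypothetical query against a Prometheus scraping Config Sync metrics:
# one bit per RootSync, regardless of how many of its resources failed.
curl -s 'http://prometheus.example:9090/api/v1/query' \
  --data-urlencode 'query=max by (name) (pipeline_error_observed) > 0'
```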
The feature request is for PostSync to also emit a structured log for these per-resource failures, so that we can follow the PostSync flow of logs -> pubsub -> workload (which would file a bug to the relevant assignee) on a per-resource rather than per-`RootSync` basis. There is some intricacy to work through here, because a resource might sit in the `DependencyNotFound` state for a few minutes and then resolve once the dependency is created.
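As a sketch of the downstream leg, assuming PostSync emitted such a log, the routing could be a plain Cloud Logging sink. The `jsonPayload` field names are hypothetical since the structured log does not exist yet, and the debouncing of transient `DependencyNotFound` states would live in the consuming workload:

```sh
# Hypothetical: route per-resource PostSync failure logs to Pub/Sub.
# jsonPayload.* fields are invented for illustration; the sink and
# topic names are placeholders.
gcloud logging sinks create postsync-resource-failures \
  pubsub.googleapis.com/projects/PROJECT_ID/topics/postsync-failures \
  --log-filter='jsonPayload.status=("UpdateFailed" OR "DependencyNotFound")'
```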
An example of those failures on the GKE resources:

```
> kc get LoggingLink -n REDACTED
NAME                            AGE   READY   STATUS         STATUS AGE
fooservicefranc-stg-stg1-link   9d    False   UpdateFailed   9d
insightsprocess-stg-stg1-link   9d    False   UpdateFailed   9d
```
And we can actually see those failures from Config Sync via `nomos`:
```
> nomos status --name REDACTED | grep logginglink
REDACTED   logginglink.logging.cnrm.cloud.google.com/fooservicefranc-stg-stg1-link   InProgress   ac89d2a
REDACTED   logginglink.logging.cnrm.cloud.google.com/insightsprocess-stg-stg1-link   InProgress   ac89d2a
```
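Until then, the closest per-resource view we can get is to script a walk over the KCC objects' `Ready` conditions ourselves. A rough sketch (the resource kind and output format are just examples):

```sh
# List KCC resources whose Ready condition is False, with the failure reason.
kubectl get logginglinks --all-namespaces -o json | jq -r '
  .items[]
  | .status.conditions[]? as $c
  | select($c.type == "Ready" and $c.status == "False")
  | "\(.metadata.namespace)/\(.metadata.name)\t\($c.reason)"'
```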
Importance
This is a blocker for our adoption of PostSync. Monitoring only our `RootSync`s doesn't provide us much value. If we can't get per-resource monitoring from the system, we are at a scale where we would need to implement this feature ourselves.