Remove readiness probe otel-agent sidecars #1791
google-oss-prow[bot] merged 9 commits into GoogleContainerTools:main
Summary of Changes
Hello @tiffanny29631, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request enhances the robustness of the otel-agent deployment by implementing Kubernetes startupProbe definitions. This change provides a dedicated grace period for the agent to fully initialize, mitigating issues where slow startup times could lead to unnecessary pod restarts, and improving the overall stability of the service.
Highlights
- Kubernetes Probe Configuration: Introduced startupProbe configurations for the otel-agent across multiple Kubernetes manifest files. This new probe specifically targets port 13133 with an httpGet check, configured to allow up to 5 minutes for the agent to initialize before standard readiness and liveness checks begin.
- Reliability Improvement: The addition of the startupProbe aims to prevent premature restarts and CrashLoopBackOff errors for the otel-agent, especially in environments with heavy load or resource constraints where the agent might take longer to bind to its health check port.
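For readers unfamiliar with startup probes, the fragment below is a minimal sketch of what a probe matching this summary might look like. It is not taken from the PR's manifests: the probe path, `periodSeconds`, and `failureThreshold` values are assumptions chosen so that the total budget works out to the 5 minutes mentioned above.

```yaml
# Illustrative Pod spec fragment only; not the manifest from this PR.
containers:
- name: otel-agent
  startupProbe:
    httpGet:
      path: /              # assumed health check path
      port: 13133          # health check port named in the summary
    periodSeconds: 10      # assumed probe interval
    failureThreshold: 30   # 30 attempts x 10s = 300s (5 minutes) before the container is restarted
```

While a startup probe is still failing within its budget, Kubernetes holds off on liveness and readiness checks for that container, which is the grace period the summary describes.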
Pull Request Overview
This PR removes readiness probes and health check configurations from otel-agent sidecars across reconciler, reconciler-manager, and resourcegroup controllers. The change prevents unnecessary pod unready states that can occur when the health check component fails under CPU throttling or port binding issues.
- Removes health_check extension configuration from otel-agent ConfigMaps
- Removes readiness probes and port 13133 from container specifications
- Updates test configurations and helper methods to align with the new installation approach
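As a rough illustration of the kind of configuration this overview refers to, the fragments below sketch a health_check extension in a collector config and the matching port and readiness probe on the sidecar container. They are assumptions for illustration, not the contents of the files changed in this PR; the endpoint address and probe path in particular are guesses.

```yaml
# Collector config fragment (as it might appear inside an otel-agent ConfigMap).
# Sketch of the settings being removed; values are assumed.
extensions:
  health_check:
    endpoint: 0.0.0.0:13133   # assumed bind address for the health check
service:
  extensions: [health_check]  # dropping this entry disables the extension
---
# Container spec fragment for the sidecar; sketch only.
containers:
- name: otel-agent
  ports:
  - containerPort: 13133      # health check port, removed along with the probe
  readinessProbe:
    httpGet:
      path: /                 # assumed health check path
      port: 13133
```

With both pieces gone, the sidecar no longer influences the pod's Ready condition, so a health check that fails to bind or respond cannot mark the pod unready.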
Reviewed Changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| test/kustomization/expected.yaml | Updates expected test output removing health check configurations |
| manifests/templates/resourcegroup-manifest.yaml | Removes health check from resourcegroup otel-agent |
| manifests/templates/reconciler-manager.yaml | Removes readiness probe from reconciler-manager otel-agent |
| manifests/templates/reconciler-manager-configmap.yaml | Removes health check configuration from reconciler template |
| manifests/otel-agent-reconciler-cm.yaml | Removes health check from reconciler otel-agent ConfigMap |
| manifests/otel-agent-cm.yaml | Removes health check from base otel-agent ConfigMap |
| e2e/testcases/cli_test.go | Updates test methods to use new installation approach |
| e2e/nomostest/config_sync.go | Adds new installation method using direct manifest application |
Hold this until we address the SSA->CSA issue that we discussed.
/unhold The test & deployment plan is finalized in go/cdcs-field-ownership.
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: Camila-B
To guarantee the e2e client claims ownership of all fields for objects that might have drifted, use client-side apply when reinstalling Config Sync and Webhook after ConfigManagement was previously installed and removed.
The readiness probe on the otel-agent container was causing operational issues:
- False alarms during slow cluster startup due to health check binding failures
- Inconsistent with other containers (git-sync, reconciler) which don't use readiness probes
- Redundant for a telemetry sidecar that doesn't provide direct user-facing services
/lgtm
Merged commit e4fe2cd into GoogleContainerTools:main
* test: Add client side install method for restoring Config Sync

  To guarantee the e2e client claims ownership of all fields for objects that might have drifted, use client-side apply when reinstalling Config Sync and Webhook after ConfigManagement was previously installed and removed.
* Add server-side false flag
* Add server-side false flag
* Add go-client update as re-install option
* Also re-install webhook after test done
* Create object if not found when updating
* Attach resourceVersion for updating existing objects
* Remove readiness probe from otel-agent sidecar

  The readiness probe on the otel-agent container was causing operational issues:
  - False alarms during slow cluster startup due to health check binding failures
  - Inconsistent with other containers (git-sync, reconciler) which don't use readiness probes
  - Redundant for a telemetry sidecar that doesn't provide direct user-facing services
* remove healthcheck from otel-agent
Remove the readiness probe and health check from otel-agent containers in reconciler, reconciler-manager, and resourcegroup controller to align container behavior. The healthcheck component can fail to bind to the port or respond under CPU throttling, causing unnecessary pod unready states even when the container is running. The otel-collector health check in the config-management-monitoring namespace is retained since it is tied to a Service.