NE-2418: Add haproxy_max_connections metric #728
openshift-merge-bot[bot] merged 1 commit into openshift:master
Conversation
@alebedev87: This pull request references NE-2418 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
176bedf to 8693a02 (force-push)
Add a new haproxy_max_connections gauge metric that exposes the process-wide maximum connections configured for HAProxy. The metric is extracted from the public frontend's "slim" field (field 6) in HAProxy's "show stat" CSV output. Since the router configures both global and defaults sections with the same ROUTER_MAX_CONNECTIONS value, the public frontend's session limit reflects the process-wide maxconn setting. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
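To illustrate the extraction described above, here is a minimal, hypothetical sketch (not the router's actual code) of pulling the session limit from `show stat` output. It assumes HAProxy's documented CSV layout, where `slim` is column 7 (index 6, after `pxname`, `svname`, `qcur`, `qmax`, `scur`, `smax`); the function and sample data are illustrative:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseFrontendSlim extracts the "slim" value (session limit, CSV field
// index 6) for the named frontend from HAProxy "show stat" CSV output.
// Frontend rows are identified by svname == "FRONTEND".
func parseFrontendSlim(csv, frontend string) (int, error) {
	for _, line := range strings.Split(csv, "\n") {
		fields := strings.Split(line, ",")
		if len(fields) > 6 && fields[0] == frontend && fields[1] == "FRONTEND" {
			return strconv.Atoi(fields[6])
		}
	}
	return 0, fmt.Errorf("frontend %q not found", frontend)
}

func main() {
	// Truncated sample in the documented column order:
	// pxname,svname,qcur,qmax,scur,smax,slim,...
	stats := "# pxname,svname,qcur,qmax,scur,smax,slim\n" +
		"public,FRONTEND,,,12,40,20000\n"
	slim, err := parseFrontendSlim(stats, "public")
	if err != nil {
		panic(err)
	}
	fmt.Println(slim) // prints 20000
}
```

Since the router sets the same `ROUTER_MAX_CONNECTIONS` value in both the `global` and `defaults` sections, this per-frontend limit equals the process-wide `maxconn`.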
8693a02 to a0ed627 (force-push)
The new e2e test from CIO is passing.
/retitle NE-2418: Add haproxy_max_connections metric
/assign
// The router configures both global and defaults sections with the same ROUTER_MAX_CONNECTIONS value,
// so the public frontend's limit (field 6/slim) reflects the process-wide maxconn setting.
// NOTE: If the defaults maxconn is ever configured differently from global maxconn,
// this approach will no longer accurately represent the process-wide limit.
This metric is already available via haproxy_frontend_current_sessions; it's just hidden behind this configuration, which is missing field 6:
router/pkg/router/metrics/haproxy/haproxy.go
Lines 109 to 111 in d8ed355
Also, as you pointed this is a frontend metric. The global one is available via show info, and reading global metrics from there is preferable because it not only provides the correct one, but also provides the current global connections, which is the metric to be tracked along with maxconn to alert users about the availability of their connection limits.
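The `show info` approach the reviewer suggests can be sketched as follows. This is a hypothetical illustration, not code from this PR: `show info` returns `Key: value` lines over the admin socket, and its real output includes `Maxconn` and `CurrConns` fields; the parser below is an assumption-laden sketch operating on a canned sample rather than a live socket:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseShowInfo extracts integer-valued fields (e.g. "Maxconn", "CurrConns")
// from HAProxy "show info" output, which is formatted as "Key: value" lines.
// Non-integer values (like "Name: HAProxy") are skipped.
func parseShowInfo(info string) map[string]int {
	out := map[string]int{}
	for _, line := range strings.Split(info, "\n") {
		k, v, ok := strings.Cut(line, ": ")
		if !ok {
			continue
		}
		if n, err := strconv.Atoi(strings.TrimSpace(v)); err == nil {
			out[k] = n
		}
	}
	return out
}

func main() {
	// Canned sample standing in for output read from the admin socket.
	sample := "Name: HAProxy\nMaxconn: 20000\nCurrConns: 37\n"
	info := parseShowInfo(sample)
	fmt.Println(info["Maxconn"], info["CurrConns"]) // prints 20000 37
}
```

The appeal of this source, per the review comment, is that it yields both the process-wide limit and the current global connection count from one place, rather than inferring the limit from a frontend row.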
This metric is already available via haproxy_frontend_current_sessions; it's just hidden behind this configuration, which is missing field 6:
Yes, we skipped it because it's fairly static. Maybe it was completely static at the time the decision was made; the IngressController's max connections tuning option might have been added later.
Also, as you pointed this is a frontend metric. The global one is available via show info, and reading global metrics from there is preferable because it not only provides the correct one
Right. I was thinking of this approach too. What was puzzling me is the scraping behavior, which can go one of two ways: the HTTP endpoint or the admin socket. Since show info always uses the unix socket, it would depart from this behavior by hard-wiring the max connections metric to the "unix socket scraping". That's why I decided to go the easiest path, which is getting the maxconn from the scraped data we already have (any frontend's maxconn would do the job, since we don't have any configuration knob to set maxconn on frontends). I didn't look up how to get the global maxconn from the stats webpage; maybe it's possible too. However, your remark made me think about whether we are obliged to keep this as a requirement (being able to scrape from the HTTP endpoint). CIO hardcodes the metrics type to haproxy, which disables the stats webpage. I think I need to get more history on this; maybe @Miciah has a stronger opinion about whether we can scrape the max connections from the admin socket without implementing it in the HTTP stats.
also provides the current global connections, which is the metric to be tracked along with maxconn to alert users about the availability of their connection limits.
Yes. That's another point I was thinking of. A haproxy_current_connections metric can be convenient, I agree. However, the same data can be retrieved from the haproxy_frontend_current_sessions metric; all frontends have to be summed, though.
Yes, we skipped it because it's fairly static.
Indeed. It reports the current configuration only. This is the same data that this PR is providing if I'm not mistaken.
maybe @Miciah has a stronger opinion about whether we can scrape the max connections from the admin socket without implementing it in the HTTP stats.
... and
However the same data can be retrieved from the haproxy_frontend_current_sessions metric. All frontends have to be used, though.
Just my 2c on it.
If I understood it correctly, this effort comes from an issue at a client, where an unmonitored maxconn was reached and caused an outage. My proposal is to expose the real data from the best source, and not only the max but the current value as well. Anything we calculate or infer ourselves might be wrong, maybe today, maybe in the future when we change some approach and start to expose inaccurate data.
After a discussion with @Miciah over Slack, he expressed his preference for using CSV data (what's returned by show stat) whenever the needed value is present there. Since with the current architecture frontend maxconn == global maxconn, the CSV data can be used for the global maxconn metric.
I think I'll stick with this implementation then, as it's the simplest. The point about using the best source (show info) is fair, and we may need to come back to it in the future. Also, I think that this will have to be paired with the decommissioning (or redesign) of the HTTP endpoint.
/lgtm
#728 (comment) is an unaddressed point though; we're choosing the simplest approach for now.
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: jcmoraisjr. The full list of commands accepted by this bot can be found here. The pull request process is described here. Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
/assign @ShudiLi
For verification.
@alebedev87 LGTM overall, but after I updated tuningOptions/maxConnections to 2000, the attached metrics picture showed the stats of the deleted router pods; maybe we could remove those lines for the deleted router pods.
@ShudiLi: yes, metrics with
Example (a contrived one) of an alerting rule which fires when the cluster's ingress (all routers) is approaching the limit of connections:
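A hypothetical sketch of such a rule follows; the group and alert names, the 80% threshold, and the `for` duration are illustrative assumptions, not anything from this PR. It sums haproxy_frontend_current_sessions across all frontends (as discussed earlier in the thread) and compares that against the new haproxy_max_connections metric:

```yaml
# Illustrative Prometheus alerting rule sketch; names and threshold are assumptions.
groups:
- name: ingress-connections
  rules:
  - alert: IngressApproachingMaxConnections
    expr: |
      sum(haproxy_frontend_current_sessions)
        / sum(haproxy_max_connections) > 0.8
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "Ingress routers are above 80% of their configured maxconn."
```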
Btw we can see some of the existing metrics giving |
@alebedev87 Thanks for the explanation. As we discussed in Slack, similar to haproxy_up, the None value is standard Prometheus retention behavior, and it won't be taken into account in the alert rule.
/verified by @ShudiLi
@ShudiLi: This PR has been marked as verified.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
@alebedev87: all tests passed! Full PR test history. Your PR dashboard.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.





Add a new haproxy_max_connections gauge metric that exposes the process-wide maximum connections configured for HAProxy.
The metric is extracted from the public frontend's "slim" field (field 6) in HAProxy's "show stat" CSV output. Since the router configures both global and defaults sections with the same ROUTER_MAX_CONNECTIONS value, the public frontend's session limit reflects the process-wide maxconn setting.
E2E test: openshift/cluster-ingress-operator#1361.