
NE-2418: Add haproxy_max_connections metric#728

Merged
openshift-merge-bot[bot] merged 1 commit into openshift:master from alebedev87:maxconn_metric
Mar 13, 2026
Conversation

@alebedev87
Contributor

@alebedev87 alebedev87 commented Feb 6, 2026

Add a new haproxy_max_connections gauge metric that exposes the process-wide maximum connections configured for HAProxy.

The metric is extracted from the public frontend's "slim" field (field 6) in HAProxy's "show stat" CSV output. Since the router configures both global and defaults sections with the same ROUTER_MAX_CONNECTIONS value, the public frontend's session limit reflects the process-wide maxconn setting.

E2E test: openshift/cluster-ingress-operator#1361.


@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Feb 6, 2026
@openshift-ci-robot
Contributor

openshift-ci-robot commented Feb 6, 2026

@alebedev87: This pull request references NE-2418 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set.

Details

In response to this:

Add a new haproxy_max_connections gauge metric that exposes the process-wide maximum connections configured for HAProxy.

The metric is extracted from the public frontend's "slim" field (field 6) in HAProxy's "show stat" CSV output. Since the router configures both global and defaults sections with the same ROUTER_MAX_CONNECTIONS value, the public frontend's session limit reflects the process-wide maxconn setting.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@alebedev87 alebedev87 changed the title NE-2418: Add haproxy_max_connections metric [WIP] NE-2418: Add haproxy_max_connections metric Feb 6, 2026
@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 6, 2026
Add a new haproxy_max_connections gauge metric that exposes the
process-wide maximum connections configured for HAProxy.

The metric is extracted from the public frontend's "slim" field
(field 6) in HAProxy's "show stat" CSV output. Since the router
configures both global and defaults sections with the same
ROUTER_MAX_CONNECTIONS value, the public frontend's session limit
reflects the process-wide maxconn setting.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
@alebedev87
Contributor Author

alebedev87 commented Feb 25, 2026

The new e2e test from CIO is passing.

/retitle NE-2418: Add haproxy_max_connections metric

@openshift-ci openshift-ci Bot changed the title [WIP] NE-2418: Add haproxy_max_connections metric NE-2418: Add haproxy_max_connections metric Feb 25, 2026
@openshift-ci openshift-ci Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 25, 2026
@openshift-ci-robot
Contributor

openshift-ci-robot commented Feb 25, 2026

@alebedev87: This pull request references NE-2418 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set.

Details

In response to this:

Add a new haproxy_max_connections gauge metric that exposes the process-wide maximum connections configured for HAProxy.

The metric is extracted from the public frontend's "slim" field (field 6) in HAProxy's "show stat" CSV output. Since the router configures both global and defaults sections with the same ROUTER_MAX_CONNECTIONS value, the public frontend's session limit reflects the process-wide maxconn setting.

E2E test: openshift/cluster-ingress-operator#1361.



@jcmoraisjr
Member

/assign

// The router configures both global and defaults sections with the same ROUTER_MAX_CONNECTIONS value,
// so the public frontend's limit (field 6/slim) reflects the process-wide maxconn setting.
// NOTE: If the defaults maxconn is ever configured differently from global maxconn,
// this approach will no longer accurately represent the process-wide limit.
Member


This metric is already available via haproxy_frontend_current_sessions; it's just hidden behind this configuration, which is missing index 6:

// defaultSelectedMetrics is the list of metrics included by default. These metrics are a subset
// of the metrics exposed by haproxy_exporter by default for performance reasons.
var defaultSelectedMetrics = []int{2, 4, 5, 7, 8, 9, 13, 14, 17, 21, 24, 33, 35, 39, 40, 41, 42, 43, 44, 58, 59, 60, 79, 85}

Also, as you pointed this is a frontend metric. The global one is available via show info, and reading global metrics from there is preferable because it not only provides the correct one, but also provides the current global connections, which is the metric to be tracked along with maxconn to alert users about the availability of their connection limits.

Contributor Author


This metric is already available via haproxy_frontend_current_sessions; it's just hidden behind this configuration, which is missing index 6:

Yes, we skipped it because it's fairly static. Maybe it was completely static at the time the decision was made; the IngressController's max connections tuning option might have been added later.

Also, as you pointed this is a frontend metric. The global one is available via show info, and reading global metrics from there is preferable because it not only provides the correct one

Right, I was thinking of this approach too. What was puzzling me is the scraping behavior, which can go one of two ways: the HTTP endpoint or the admin socket. Since show info would always use the unix socket, it would depart from this behavior by hard-wiring the max_connections metric to the "unix socket scraping". That's why I decided to take the easiest path, which is getting the maxconn from the scraped data we already have (any frontend's maxconn would do, since we don't have any configuration knob to set maxconn on individual frontends). I didn't look up how to get the global maxconn from the stats webpage; maybe it's possible too. However, your remark made me wonder whether we are obliged to keep this as a requirement (being able to scrape from the HTTP endpoint). CIO hardcodes the metrics type to haproxy, which disables the stats webpage. I think I need to dig up more history on this; maybe @Miciah has a stronger opinion about whether we can scrape the max connections from the admin socket without implementing it for the HTTP stats.

also provides the current global connections, which is the metric to be tracked along with maxconn to alert users about the availability of their connection limits.

Yes, that's another point I was thinking of. A haproxy_current_connections metric could be convenient, I agree. However, the same data can be retrieved from the haproxy_frontend_current_sessions metric; all frontends would have to be used, though.

Member


Yes, we skipped it because it's fairly static.

Indeed. It reports the current configuration only. This is the same data that this PR is providing if I'm not mistaken.

maybe @Miciah has a stronger opinion about whether we can scrape the max connections from the admin socket without implementing it for the HTTP stats.

... and

However the same data can be retrieved from haproxy_frontend_current_sessions metric. All frontends have to be used though.

Just my 2c on it.

If I understood it correctly, this effort comes from an issue at a customer, where an unmonitored maxconn was reached and caused an outage. My proposal is to expose the real data from the best source, and not only the max but the current value as well. Anything we calculate or infer ourselves might be wrong, maybe today, maybe in the future when we change some approach and start to expose inaccurate data.

Contributor Author


After a discussion with @Miciah over Slack, he expressed his preference for using the CSV data (what's returned by show stat) whenever the needed value is present there. Since with the current architecture frontend maxconn == global maxconn, the CSV data can be used for the global maxconn metric.

I'll stick with this implementation then, as it's the simplest. The point about using the best source (show info) is fair, and we may need to come back to it in the future. Also, I think that will have to be paired with the decommissioning (or redesign) of the HTTP endpoint.

@jcmoraisjr
Member

/lgtm
/approve

#728 (comment) remains an unaddressed point, though; we're choosing the simplest approach for now.

@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label Mar 12, 2026
@openshift-ci
Contributor

openshift-ci Bot commented Mar 12, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jcmoraisjr

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 12, 2026
@alebedev87
Contributor Author

alebedev87 commented Mar 12, 2026

/assign @ShudiLi

For verification.

@ShudiLi

ShudiLi commented Mar 13, 2026

Screenshot 2026-03-13 at 10 54 49

@ShudiLi

ShudiLi commented Mar 13, 2026

@alebedev87 LGTM overall, but after I updated tuningOptions/maxConnections to 2000, the attached metrics picture showed the stats of the deleted router pods. Maybe we could remove those lines for the deleted router pods.

@alebedev87
Contributor Author

but after I updated tuningOptions/maxConnections to 2000, the attached metrics picture showed the stats of the deleted router pods; maybe we could remove those lines for the deleted router pods.

@ShudiLi: yes, metrics with a None value are old time series. Prometheus has a retention period, which means older metrics don't disappear from the display immediately. They should not be a problem for alerting rules, though, because the rules will use aggregate functions like sum() which don't take stale metrics into account.

A contrived example: an alerting rule which fires when the cluster's ingress (all routers) is approaching its connection limit (screenshots of the rule and the resulting graph attached).
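For reference, such a rule might look roughly like the following. This is a hypothetical sketch: the group name, alert name, and 80% threshold are illustrative, and only haproxy_max_connections comes from this PR (haproxy_frontend_current_sessions is the pre-existing router metric discussed above).

```yaml
# Hypothetical PrometheusRule fragment, not shipped by this PR.
groups:
- name: ingress-connections
  rules:
  - alert: RouterApproachingMaxConnections
    # sum() aggregates across frontends and router pods, so stale
    # (None) series from deleted pods do not affect the ratio.
    expr: |
      sum(haproxy_frontend_current_sessions)
        / sum(haproxy_max_connections) > 0.8
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: Ingress routers are using more than 80% of their connection limit.
```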

By the way, some of the existing metrics produce None time series too (see the attached screenshot).

@ShudiLi

ShudiLi commented Mar 13, 2026

@alebedev87 Thanks for the explanation. As we discussed in Slack, similar to haproxy_up, the None value is standard Prometheus retention behavior, and it won't be taken into account by the alert rules.

@ShudiLi

ShudiLi commented Mar 13, 2026

/verified by @ShudiLi

@openshift-ci-robot openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label Mar 13, 2026
@openshift-ci-robot
Contributor

@ShudiLi: This PR has been marked as verified by @ShudiLi.

Details

In response to this:

/verified by @ShudiLi


@openshift-ci
Contributor

openshift-ci Bot commented Mar 13, 2026

@alebedev87: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.


Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged. verified Signifies that the PR passed pre-merge verification criteria


4 participants