feat: consecutive-window alerts#2472
Conversation
🦋 Changeset detectedLatest commit: 41b9820 The changes in this PR will be included in the next version bump. This PR includes changesets to release 4 packages
Not sure what this means? Click here to learn what changesets are. Click here if you're a maintainer who wants to add another changeset to this PR |
|
@mrkaye97 is attempting to deploy a commit to the HyperDX Team on Vercel. A member of the Team first needs to authorize it. |
Greptile SummaryThis PR adds consecutive-window alerting, introducing a new
Confidence Score: 4/5The happy-path consecutive-window logic is well-tested and the implementation is sound, but several edge-case concerns in the core alert-processing file are still in active discussion and not yet resolved in code. The main logic paths — suppressing notifications until N consecutive violations, carrying forward the fired flag through PENDING records, and bounding the history lookback to exactly the right time window — are implemented consistently across all three alert-type branches and covered by the new test cases. The lower score reflects three open concerns from prior review threads that have been acknowledged but not yet committed: the temporal-proximity guard for consecutive windows after a service outage, the fired !== false legacy-record guard, and the spurious resolution notification when a PENDING alert resolves to OK after an outage-induced gap. packages/api/src/tasks/checkAlerts/index.ts — the shouldFireBasedOnConsecutiveWindows and sendNotificationIfResolved functions are the critical paths that still have open correctness concerns from prior threads. Important Files Changed
Sequence Diagram%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
participant Provider as DefaultAlertProvider
participant DB as MongoDB (AlertHistory)
participant PA as processAlert
participant CH as ClickHouse
Provider->>DB: getPreviousAlertHistories(alertIds, now, queue)
Provider->>DB: getConsecutiveWindowHistories(alerts, now, queue)
Note over Provider,DB: Both run in parallel via shared PQueue
Provider->>PA: "processAlert(now, {alert, previousMap, recentHistoryMap})"
PA->>CH: fetch metric data for window
alt threshold exceeded
PA->>PA: shouldFireBasedOnConsecutiveWindows(groupKey)
alt M-1 prior windows all ALERT/PENDING
PA->>PA: "history.state = ALERT, fired = true"
PA-->>Provider: trySendNotification (ALERT)
else insufficient consecutive windows
PA->>PA: "history.state = PENDING, fired = previous.fired === true"
end
else threshold not exceeded
PA->>PA: "history.state = OK"
alt "previousHistory.fired !== false"
PA-->>Provider: sendNotificationIfResolved (OK)
end
end
PA->>DB: updateAlertState(historyRecords)
%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
sequenceDiagram
participant Provider as DefaultAlertProvider
participant DB as MongoDB (AlertHistory)
participant PA as processAlert
participant CH as ClickHouse
Provider->>DB: getPreviousAlertHistories(alertIds, now, queue)
Provider->>DB: getConsecutiveWindowHistories(alerts, now, queue)
Note over Provider,DB: Both run in parallel via shared PQueue
Provider->>PA: "processAlert(now, {alert, previousMap, recentHistoryMap})"
PA->>CH: fetch metric data for window
alt threshold exceeded
PA->>PA: shouldFireBasedOnConsecutiveWindows(groupKey)
alt M-1 prior windows all ALERT/PENDING
PA->>PA: "history.state = ALERT, fired = true"
PA-->>Provider: trySendNotification (ALERT)
else insufficient consecutive windows
PA->>PA: "history.state = PENDING, fired = previous.fired === true"
end
else threshold not exceeded
PA->>PA: "history.state = OK"
alt "previousHistory.fired !== false"
PA-->>Provider: sendNotificationIfResolved (OK)
end
end
PA->>DB: updateAlertState(historyRecords)
Reviews (20): Last reviewed commit: "fix: move to advanced settings" | Re-trigger Greptile |
0214704 to
897fa74
Compare
|
Hi @mrkaye97, thanks for the PR! We are happy to add this feature. On your three points above:
When ready, please also add a changeset via |
Great, thanks for the feedback! Will try to get these changes up today |
dcc7eaa to
a55f73e
Compare
|
@karl-power I think I've addressed those comments, let me know what you think when you get a chance, and thanks! |
|
Thanks for the changes @mrkaye97! I've tested it and it's looking great so far. I have a few pieces of feedback/requests:
I think this can be addressed by assigning the pending state here: And then with a small UI update to
Thanks! |
I think I've addressed those! Will see what Greptile thinks Edit: Hah, Greptile flagged something else - will fix that in a bit |
karl-power
left a comment
There was a problem hiding this comment.
Hi @mrkaye97 Thanks for the changes! This is in good shape now. There are a couple of related integration tests now failing however (dashboard.test.ts, alerts.test.ts). Could you take a look at these? Thanks!
|
I think I've fixed those! The linters and tests are passing locally, at least |
Thanks for the last changes @mrkaye97. I realised we could have many groups per alert, so refactored a bit to make one db query per alert to get all group's histories x alert windows. This also shares a queue with another set of queries we do for alerts. |
Good call! That makes sense to me 🙌 |
|
@mrkaye97 Thanks for working on this feature. Code and functionality wise this looks great. What are your thoughts on moving this setting under "Advanced Settings"? IMO its a nice feature, but not something people need when setting up every alert. We can add a new line item there:
Something like "Trigger only when the threshold is met for [ 1 ] consecutive windows." |
Sure, can do! |
|
Pushed that and updated the screenshot - hopefully closer to what you had in mind! |






Summary
Fixes #2469 (comment)
Adding consecutive-window configuration to alerts, so that you can specify a condition like "only fire this alert after some condition is met for N consecutive windows." This helps prevent flaky alerts (and pages), and cut down on alert noise in many cases.
Not too sure on a few things here, so would love some feedback:
My UX sense isn't amazing - I thought that doing this in-line made sense, but it may make sense to put the number of windows on its own line below the original alert config, e.g.One implementation detail I was wondering about is if we'd want to allow for alerting if e.g. three out of five windows alert, or similar. For instance "fire the alert if we see this log line at least once in three of five consecutive windows." I could see this being useful, but it also adds some overhead, and it feels like it'd be easy to add that feature in a non-breaking way in the future if someone would like to have it, so I figured I'd leave it off for now.In the current implementation, the alert in the UI shows as "firing" even if no notification has actually sent yet, because the condition has been met, but the threshold is not. Not sure if that's okay, or if I should figure out some way around it!Screenshots or video
Screenshot from my local of the new alert modal:
tooltip:
Testing Locally
I tested by creating a trivial alert that'd fire on any error log that we see at least once per minute for three minutes, and then sent a curl once per minute to trigger it:
Sent webhooks to
webhook.site, and saw logs matching what we'd expect:And then:
How to test on Vercel preview
Preview routes:
Wasn't sure what to put here!
References
Source Issue: #2469 (comment)