Skip to content

feat: consecutive-window alerts#2472

Open
mrkaye97 wants to merge 34 commits into
hyperdxio:mainfrom
mrkaye97:mk/add-multi-window-lookback-alerts
Open

feat: consecutive-window alerts#2472
mrkaye97 wants to merge 34 commits into
hyperdxio:mainfrom
mrkaye97:mk/add-multi-window-lookback-alerts

Conversation

@mrkaye97

@mrkaye97 mrkaye97 commented Jun 16, 2026

Copy link
Copy Markdown

Summary

Fixes #2469 (comment)

Adding consecutive-window configuration to alerts, so that you can specify a condition like "only fire this alert after some condition is met for N consecutive windows." This helps prevent flaky alerts (and pages), and cut down on alert noise in many cases.

Not too sure on a few things here, so would love some feedback:

  1. My UX sense isn't amazing - I thought that doing this in-line made sense, but it may make sense to put the number of windows on its own line below the original alert config, e.g.
  2. One implementation detail I was wondering about is if we'd want to allow for alerting if e.g. three out of five windows alert, or similar. For instance "fire the alert if we see this log line at least once in three of five consecutive windows." I could see this being useful, but it also adds some overhead, and it feels like it'd be easy to add that feature in a non-breaking way in the future if someone would like to have it, so I figured I'd leave it off for now.
  3. In the current implementation, the alert in the UI shows as "firing" even if no notification has actually sent yet, because the condition has been met, but the threshold is not. Not sure if that's okay, or if I should figure out some way around it!

Screenshots or video

Screenshot from my local of the new alert modal:

Screenshot 2026-06-26 at 12 57 57 PM

tooltip:

Screenshot 2026-06-26 at 1 08 57 PM

Testing Locally

I tested by creating a trivial alert that'd fire on any error log that we see at least once per minute for three minutes, and then sent a curl once per minute to trigger it:

 curl -s -X POST http://localhost:30996/v1/logs \
    -H "Content-Type: application/json" \
    -H "Authorization: << ingestion api key >>" \
    -d '{
      "resourceLogs": [{
        "resource": {
          "attributes": [{"key": "service.name", "value": {"stringValue": "test-service"}}]
        },
        "scopeLogs": [{
          "logRecords": [{
            "timeUnixNano": "'$(date +%s)000000000'",
            "severityText": "ERROR",
            "severityNumber": 17,
            "body": {"stringValue": "something went wrong"}
          }]
        }]  
      }]
    }'

Sent webhooks to webhook.site, and saw logs matching what we'd expect:

  "text": "🚨 Alert for \"test-alert\" - 1 lines found | \n1 lines found, which meets or exceeds the threshold of 1 lines\nTime Range (UTC): [Jun 16 2:52:00 PM - Jun 16 2:53:00 PM)\n\n```\n\"2026-06-16T14:52:00.001000000Z\",\"hdx-oss-dev-api\",\"info\",\"check-alerts started at Tue Jun 16 2026 10:52:00 GMT-0400 (Eastern Daylight Time)\"\n\"2026-06-16T14:52:00.004000000Z\",\"hdx-oss-dev-api\",\"info\",\"Connection established to MongoDB\"\n\"2026-06-16T14:52:00.004000000Z\",\"hdx-oss-dev-api\",\"debug\",\"finished provider initialization\"\n\"2026-06-16T14:52:00.016000000Z\",\"hdx-oss-dev-api\",\"debug\",\"Fetched 1 alert tasks to process\"\n\"2026-06-16T14:52:00.017000000Z\",\"hdx-oss-dev-api\",\"debug\",\"Obtained teams and webhooks for all alertTasks\"\n\n``` | http://localhost:30296/search/6a3162175459c7092c078df3?from=1781621520000&to=1781621580000&isLive=false | ALERT | 1781621520000 | 1781621580000 | 7e645d5a3c56111a86f2cba9d8f46af046943a6f"

And then:

{"text": "✅ Alert for \"test-alert\" - 0 lines found | The alert has been resolved.\nTime Range (UTC): [Jun 16 2:53:00 PM - Jun 16 2:54:00 PM)\n | http://localhost:30296/search/6a3162175459c7092c078df3?from=1781621580000&to=1781621640000&isLive=false | OK | 1781621580000 | 1781621640000 | 7e645d5a3c56111a86f2cba9d8f46af046943a6f"}

How to test on Vercel preview

Preview routes:

Wasn't sure what to put here!

References

Source Issue: #2469 (comment)

@changeset-bot

changeset-bot Bot commented Jun 16, 2026

Copy link
Copy Markdown

🦋 Changeset detected

Latest commit: 41b9820

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 4 packages
Name Type
@hyperdx/common-utils Minor
@hyperdx/api Minor
@hyperdx/app Minor
@hyperdx/otel-collector Minor

Not sure what this means? Click here to learn what changesets are.

Click here if you're a maintainer who wants to add another changeset to this PR

@vercel

vercel Bot commented Jun 16, 2026

Copy link
Copy Markdown

@mrkaye97 is attempting to deploy a commit to the HyperDX Team on Vercel.

A member of the Team first needs to authorize it.

@greptile-apps

greptile-apps Bot commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR adds consecutive-window alerting, introducing a new PENDING state so that an alert only fires after its threshold condition is met for N consecutive evaluation windows. The feature is implemented end-to-end: a numConsecutiveWindows field on the IAlert model, a fired flag on AlertHistory records to track whether a notification was ever dispatched, a new getConsecutiveWindowHistories batch-fetcher that enforces a temporal lookback bound, and UI controls in both the saved-search and tile alert editors.

  • Core logic (checkAlerts/index.ts): shouldFireBasedOnConsecutiveWindows reads pre-fetched recent histories (newest-first, bounded to the exact lookback window) and requires all M-1 prior windows to be ALERT or PENDING before transitioning to ALERT and sending a notification; the fired flag is propagated through PENDING records so sendNotificationIfResolved correctly suppresses resolution webhooks when no notification was ever sent.
  • Provider integration (providers/default.ts): getPreviousAlertHistories and getConsecutiveWindowHistories are run in parallel with a shared PQueue to stay within the global connection-pool cap, and PENDING is now included in the alert's final state rollup.
  • UI (AlertsPage, AlertScheduleFields, AlertHistoryCards): A new "Pending" section sits between Alerting and OK in the alert list; history cards gain an orange color class; and a NumberInput in the advanced schedule section exposes the consecutive-window count.

Confidence Score: 4/5

The happy-path consecutive-window logic is well-tested and the implementation is sound, but several edge-case concerns in the core alert-processing file are still in active discussion and not yet resolved in code.

The main logic paths — suppressing notifications until N consecutive violations, carrying forward the fired flag through PENDING records, and bounding the history lookback to exactly the right time window — are implemented consistently across all three alert-type branches and covered by the new test cases. The lower score reflects three open concerns from prior review threads that have been acknowledged but not yet committed: the temporal-proximity guard for consecutive windows after a service outage, the fired !== false legacy-record guard, and the spurious resolution notification when a PENDING alert resolves to OK after an outage-induced gap.

packages/api/src/tasks/checkAlerts/index.ts — the shouldFireBasedOnConsecutiveWindows and sendNotificationIfResolved functions are the critical paths that still have open correctness concerns from prior threads.

Important Files Changed

Filename Overview
packages/api/src/tasks/checkAlerts/index.ts Core alert processing logic updated for consecutive-window support; adds shouldFireBasedOnConsecutiveWindows, getConsecutiveWindowHistories, and PENDING state transitions with fired flag propagation. Several edge-case concerns were raised in prior review threads and are still in various stages of acknowledgement by the author.
packages/api/src/tasks/checkAlerts/providers/default.ts DefaultAlertProvider updated to fetch consecutive-window histories in parallel with previous histories using a shared PQueue; PENDING state correctly propagated to alert's final state determination.
packages/api/src/models/alertHistory.ts Adds optional fired boolean field to IAlertHistory and its Mongoose schema to track whether a notification was ever dispatched for a given history record.
packages/api/src/tasks/checkAlerts/tests/checkAlerts.test.ts Comprehensive new tests cover numConsecutiveWindows=1 fire-immediately, suppression until N windows, no spurious resolution before any fire, per-group independence, and getConsecutiveWindowHistories unit tests with timestamp boundary assertions.
packages/app/src/components/AlertScheduleFields.tsx Adds a NumberInput for numConsecutiveWindows inside the advanced schedule collapsible; missing integer-only enforcement on the input allows decimal submission that fails only at API validation with no UI hint.
packages/app/src/components/alerts/AlertHistoryCards.tsx Adds PENDING color class via stateToBgColorClass helper and improves tooltip label to reflect pending state; logic is sound.
packages/common-utils/src/types.ts Adds PENDING to the shared AlertState enum and numConsecutiveWindows to AlertBaseObjectSchema and AlertsPageItemSchema.
packages/app/src/AlertsPage.tsx Adds a Pending section between Alerting and OK sections in the alert card list with an hourglass icon and orange badge.

Sequence Diagram

%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
    participant Provider as DefaultAlertProvider
    participant DB as MongoDB (AlertHistory)
    participant PA as processAlert
    participant CH as ClickHouse

    Provider->>DB: getPreviousAlertHistories(alertIds, now, queue)
    Provider->>DB: getConsecutiveWindowHistories(alerts, now, queue)
    Note over Provider,DB: Both run in parallel via shared PQueue

    Provider->>PA: "processAlert(now, {alert, previousMap, recentHistoryMap})"
    PA->>CH: fetch metric data for window

    alt threshold exceeded
        PA->>PA: shouldFireBasedOnConsecutiveWindows(groupKey)
        alt M-1 prior windows all ALERT/PENDING
            PA->>PA: "history.state = ALERT, fired = true"
            PA-->>Provider: trySendNotification (ALERT)
        else insufficient consecutive windows
            PA->>PA: "history.state = PENDING, fired = previous.fired === true"
        end
    else threshold not exceeded
        PA->>PA: "history.state = OK"
        alt "previousHistory.fired !== false"
            PA-->>Provider: sendNotificationIfResolved (OK)
        end
    end

    PA->>DB: updateAlertState(historyRecords)
Loading
%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
sequenceDiagram
    participant Provider as DefaultAlertProvider
    participant DB as MongoDB (AlertHistory)
    participant PA as processAlert
    participant CH as ClickHouse

    Provider->>DB: getPreviousAlertHistories(alertIds, now, queue)
    Provider->>DB: getConsecutiveWindowHistories(alerts, now, queue)
    Note over Provider,DB: Both run in parallel via shared PQueue

    Provider->>PA: "processAlert(now, {alert, previousMap, recentHistoryMap})"
    PA->>CH: fetch metric data for window

    alt threshold exceeded
        PA->>PA: shouldFireBasedOnConsecutiveWindows(groupKey)
        alt M-1 prior windows all ALERT/PENDING
            PA->>PA: "history.state = ALERT, fired = true"
            PA-->>Provider: trySendNotification (ALERT)
        else insufficient consecutive windows
            PA->>PA: "history.state = PENDING, fired = previous.fired === true"
        end
    else threshold not exceeded
        PA->>PA: "history.state = OK"
        alt "previousHistory.fired !== false"
            PA-->>Provider: sendNotificationIfResolved (OK)
        end
    end

    PA->>DB: updateAlertState(historyRecords)
Loading

Reviews (20): Last reviewed commit: "fix: move to advanced settings" | Re-trigger Greptile

Comment thread packages/api/src/tasks/checkAlerts/index.ts
Comment thread packages/api/src/tasks/checkAlerts/index.ts Outdated
Comment thread packages/api/src/tasks/checkAlerts/index.ts Outdated
@mrkaye97 mrkaye97 force-pushed the mk/add-multi-window-lookback-alerts branch from 0214704 to 897fa74 Compare June 16, 2026 15:01
@mrkaye97 mrkaye97 marked this pull request as ready for review June 16, 2026 15:03
Comment thread packages/api/src/tasks/checkAlerts/index.ts Outdated
@karl-power

karl-power commented Jun 22, 2026

Copy link
Copy Markdown
Contributor

Hi @mrkaye97, thanks for the PR! We are happy to add this feature.

On your three points above:

  1. UX is ok for now.
  2. Sounds good. We're happy to leave it out for now and potentially add it in the future.
  3. Can we add a new "pending" (yellow) state to the alert? An alert would be in pending state when breaching the threshold, but not yet for enough windows to be in "alerting" state.

When ready, please also add a changeset via yarn changeset.

@mrkaye97

Copy link
Copy Markdown
Author

Hi @mrkaye97, thanks for the PR! We are happy to add this feature.

On your three points above:

  1. UX is ok for now.
  2. Sounds good. We're happy to leave it out for now and potentially add it in the future.
  3. Can we add a new "pending" (yellow) state to the alert? An alert would be in pending state when breaching the threshold, but not yet for enough windows to be in "alerting" state.

When ready, please also add a changeset via yarn changeset.

Great, thanks for the feedback! Will try to get these changes up today

@mrkaye97 mrkaye97 force-pushed the mk/add-multi-window-lookback-alerts branch 2 times, most recently from dcc7eaa to a55f73e Compare June 22, 2026 18:21
@mrkaye97 mrkaye97 requested a review from karl-power June 22, 2026 18:39
@mrkaye97

Copy link
Copy Markdown
Author

@karl-power I think I've addressed those comments, let me know what you think when you get a chance, and thanks!

@karl-power

karl-power commented Jun 23, 2026

Copy link
Copy Markdown
Contributor

Thanks for the changes @mrkaye97! I've tested it and it's looking great so far. I have a few pieces of feedback/requests:

  1. I should have asked for the pending state to be orange instead of yellow. I missed that we already use yellow for the ok state.

  2. I noticed that the pending state isn't getting applied to alert history records, but only to the alert itself. That leaves the UI in a strange position where we show a green history in the AlertHistoryCard component for a history record that should have been pending:

Screenshot 2026-06-23 at 16 09 22

I think this can be addressed by assigning the pending state here:

state: group.states.includes(AlertState.ALERT)

And then with a small UI update to AlertHistoryCard we'll be able to show the history like this:
Screenshot 2026-06-23 at 15 27 42

  1. Can you also please address the new point in the Greptile feedback about resolving stale pending records for disappeared group keys?

Thanks!

@mrkaye97

mrkaye97 commented Jun 23, 2026

Copy link
Copy Markdown
Author

Thanks for the changes @mrkaye97! I've tested it and it's looking great so far. I have a few pieces of feedback/requests:

  1. I should have asked for the pending state to be orange instead of yellow. I missed that we already use yellow for the ok state.
  2. I noticed that the pending state isn't getting applied to alert history records, but only to the alert itself. That leaves the UI in a strange position where we show a green history in the AlertHistoryCard component for a history record that should have been pending:
Screenshot 2026-06-23 at 16 09 22 I think this can be addressed by assigning the pending state here:

state: group.states.includes(AlertState.ALERT)

And then with a small UI update to AlertHistoryCard we'll be able to show the history like this: Screenshot 2026-06-23 at 15 27 42

  1. Can you also please address the new point in the Greptile feedback about resolving stale pending records for disappeared group keys?

Thanks!

I think I've addressed those! Will see what Greptile thinks

Edit: Hah, Greptile flagged something else - will fix that in a bit

Comment thread packages/api/src/tasks/checkAlerts/index.ts Outdated

@karl-power karl-power left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @mrkaye97 Thanks for the changes! This is in good shape now. There are a couple of related integration tests now failing however (dashboard.test.ts, alerts.test.ts). Could you take a look at these? Thanks!

@mrkaye97

Copy link
Copy Markdown
Author

I think I've fixed those! The linters and tests are passing locally, at least

@karl-power

Copy link
Copy Markdown
Contributor

I think I've fixed those! The linters and tests are passing locally, at least

Thanks for the last changes @mrkaye97. I realised we could have many groups per alert, so refactored a bit to make one db query per alert to get all group's histories x alert windows. This also shares a queue with another set of queries we do for alerts.

@mrkaye97

Copy link
Copy Markdown
Author

I think I've fixed those! The linters and tests are passing locally, at least

Thanks for the last changes @mrkaye97. I realised we could have many groups per alert, so refactored a bit to make one db query per alert to get all group's histories x alert windows. This also shares a queue with another set of queries we do for alerts.

Good call! That makes sense to me 🙌

@fleon

fleon commented Jun 26, 2026

Copy link
Copy Markdown
Contributor

@mrkaye97 Thanks for working on this feature. Code and functionality wise this looks great. What are your thoughts on moving this setting under "Advanced Settings"?

IMO its a nice feature, but not something people need when setting up every alert. We can add a new line item there:

CleanShot 2026-06-26 at 14 28 28@2x

Something like "Trigger only when the threshold is met for [ 1 ] consecutive windows."

@mrkaye97

Copy link
Copy Markdown
Author

@mrkaye97 Thanks for working on this feature. Code and functionality wise this looks great. What are your thoughts on moving this setting under "Advanced Settings"?

IMO its a nice feature, but not something people need when setting up every alert. We can add a new line item there:

CleanShot 2026-06-26 at 14 28 28@2x Something like "Trigger only when the threshold is met for [ 1 ] consecutive windows."

Sure, can do!

@mrkaye97

Copy link
Copy Markdown
Author

Pushed that and updated the screenshot - hopefully closer to what you had in mind!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feature request: generalize alert configuration to allow for "fire alert after N occurrences in M consecutive windows"

3 participants