Skip to content

fm: make sp_ereport_ingester status output less gaslighty#10382

Open
hawkw wants to merge 3 commits intomainfrom
eliza/stop-ingester-gaslighting
Open

fm: make sp_ereport_ingester status output less gaslighty#10382
hawkw wants to merge 3 commits intomainfrom
eliza/stop-ingester-gaslighting

Conversation

@hawkw
Copy link
Copy Markdown
Member

@hawkw hawkw commented May 6, 2026

Originally, when implementing the sp_ereport_ingester background task, I had the foolish and bad idea to make the task's status output omit entries for SPs which we communicated with successfully but which had no new ereports, thinking that it was "not interesting" to list them. Unfortunately, this also means that the total count of SPs that the task talked to, and the count of total HTTP requests sent, will also not count any SPs which had no ereports. This behavior is incredibly misleading to anyone who isn't aware that they are being intentionally excluded, which, as it turns out, includes me (since I had forgotten that I did this).

This commit fixes this by not going out of our way to exclude these SPs from the status output. It turns out that this actually makes the code somewhat simpler, which is another argument in favor of the idea that I never should have done this. Sigh.

Fixes #10380

Originally, when implementing the `sp_ereport_ingester` background task,
I had the foolish and bad idea to make the task's status output omit
entries for SPs which we communicated with successfully but which had no
new ereports, thinking that it was "not interesting" to list them.
Unfortunately, this also means that the total count of SPs that the task
talked to, and the count of total HTTP requests sent, will also not
count any SPs which had no ereports. This behavior is *incredibly
misleading* to anyone who isn't aware that they are being intentionally
excluded, which, as it turns out, includes me (since I had forgotten
that I did this).

This commit fixes this by not going out of our way to exclude these SPs
from the status output. It turns out that this actually makes the code
somewhat simpler, which is another argument in favor of the idea that I
never should have done this. Sigh.

Fixes #10380
@hawkw hawkw requested review from mergeconflict and smklein May 6, 2026 03:05
@hawkw hawkw added Debugging For when you want better data in debugging an issue (log messages, post mortem debugging, and more) fault-management Everything related to the fault-management initiative (RFD480 and others) labels May 6, 2026
@hawkw
Copy link
Copy Markdown
Member Author

hawkw commented May 6, 2026

@hawkw
Copy link
Copy Markdown
Member Author

hawkw commented May 6, 2026

Huh, well, this is weird and upsetting: https://buildomat.eng.oxide.computer/wg/0/details/01KQXPZCBW2NHNXNEH0BGF1EQB/BcTcBzPAAKSDAzJ3U3i0Pi7K6b8Y806PLnRRLXofAduFgCIf/01KQXPZSQB0ZRAP1X61NM18WMQ#S8248

and of course, it's a flake: https://buildomat.eng.oxide.computer/wg/0/details/01KQXPZCBW2NHNXNEH0BGF1EQB/BcTcBzPAAKSDAzJ3U3i0Pi7K6b8Y806PLnRRLXofAduFgCIf/01KQXPZSQB0ZRAP1X61NM18WMQ#S8617

As far as I can tell, what happened here is that one of the simulated SP's ereport listeners in this test received a UDP packet from...something, and it was not the correct length to be an ereport request header, so the simulated SP task panicked. This is odd to me because the ereport ingester task isn't even enabled in this test, so I'm really not sure where that UDP packet came from. It was on localhost port 47491, but there is no mention of that port in any of the logs, and it could presumably have been anything.

I think maybe the solution is just to have the simulated SPs not panic upon receipt of invalid ereport requests. but...it's really weird that this happened and I'd like to understand where it came from. I'd feel a bit gross about not having the test fail if MGS actually sends a malformed request, but I think that's something that other testing (such as, say, the ones where we actually try to collect ereports). I just want to know where the random UDP packet came from.

@hawkw
Copy link
Copy Markdown
Member Author

hawkw commented May 6, 2026

Okay, so I am fairly confident that the test flake is not in any way related to this change, and we are just unlucky enough to have hit a vaguely ereport-related test flake in a branch that makes changes that are also vaguely ereport-related. I've opened #10387 for the test flake and will re-run CI.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Debugging For when you want better data in debugging an issue (log messages, post mortem debugging, and more) fault-management Everything related to the fault-management initiative (RFD480 and others)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

sp_ereport_ingester's status is misleading

1 participant