Skip to content

[omicron-nexus] add an error injection test for deleting local storage#9997

Merged
sunshowers merged 2 commits intomainfrom
sunshowers/spr/omicron-nexus-add-an-error-injection-test-for-deleting-local-storage
Mar 10, 2026
Merged

[omicron-nexus] add an error injection test for deleting local storage#9997
sunshowers merged 2 commits intomainfrom
sunshowers/spr/omicron-nexus-add-an-error-injection-test-for-deleting-local-storage

Conversation

@sunshowers
Copy link
Copy Markdown
Contributor

@sunshowers sunshowers commented Mar 7, 2026

In #9993 we found that we'd accidentally dropped the indefinite retry logic. Add a test that fails if we retry fewer than 8 times. I've confirmed that the test fails before #9993.

I tried bumping the number of tries up to 32, but the test took far too long even with Tokio's time::pause support. The reason for that is that all the background tasks get kicked off many times. 8 seems like a decent number of retries to check against.

Created using spr 1.3.6-beta.1
sunshowers added a commit to oxidecomputer/progenitor-extras that referenced this pull request Mar 9, 2026
…ly (#3)

See #2.

I've tested this with the new test added in
oxidecomputer/omicron#9997. I've also tested a
local change which enables the clippy::disallowed_methods lint in
omicron for `retry_operation` and `retry_operation_while`, suggesting
use of `_indefinitely` instead. I believe that should be a complete
structural fix for #2.
sunshowers added a commit that referenced this pull request Mar 9, 2026
…9998)

While working on #9997 I saw this error in the logs:

```
saga ACTION error at node "delete_local_storage": deserialize failed: unknown variant failed to delete local storage: failed at attempt 4: retries exhausted: Error Response: status: 503 Service Unavailable; headers: {\"content-type\": \"application/json\", \"x-request-id\": \"4a21fdec-b7b2-4f37-a99d-c218efa1c701\", \"content-length\": \"94\", \"date\": \"Sat, 07 Mar 2026 05:34:47 GMT\"}; value: Error { error_code: None, message: \"Service Unavailable\", request_id: \"4a21fdec-b7b2-4f37-a99d-c218efa1c701\" }, expected one of ObjectNotFound, ObjectAlreadyExists, InvalidRequest, Unauthenticated, InvalidValue, Forbidden, InternalError, ServiceUnavailable, InsufficientCapacity, TypeVersionMismatch, Conflict, NotFound, Gone
```

The root cause for that was that we were passing in a string as an error
rather than a structured omicron-common error. That's because steno's
`ActionError::action_failed` accepts
[anything](https://docs.rs/steno/latest/steno/trait.ActionData.html)
that implements `Debug + DeserializeOwned`.

Fix this by:

* adding `nexus_types::saga::saga_action_failed`
* banning most uses of `ActionError::action_failed` through clippy
disallowed-methods
* updating all the call sites

There are two instances of `ActionError::action_failed` that remain,
both of which are marked with `expect(clippy::disallowed_methods)`:

1. The invocation inside `saga_action_failed`.
2. In `nexus/src/app/sagas/instance_update/mod.rs`, the instance updater
lock error which is handled specially.
Created using spr 1.3.6-beta.1
@sunshowers sunshowers enabled auto-merge (squash) March 10, 2026 00:31
@sunshowers sunshowers merged commit 836493d into main Mar 10, 2026
16 checks passed
@sunshowers sunshowers deleted the sunshowers/spr/omicron-nexus-add-an-error-injection-test-for-deleting-local-storage branch March 10, 2026 02:02
sunshowers added a commit that referenced this pull request Mar 10, 2026
…10021)

Use `clippy::disallowed_methods` to ban non-indefinite retries.

The iteration test is no longer necessary in omicron-common since
there's both a unit test inside progenitor-extras for it, and the
integration test added in #9997.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants