Skip to content

nexus: allow instance updates to complete when one switch is down#10389

Merged
hawkw merged 10 commits intomainfrom
eliza/instance-update-dead-switch
May 7, 2026
Merged

nexus: allow instance updates to complete when one switch is down#10389
hawkw merged 10 commits intomainfrom
eliza/instance-update-dead-switch

Conversation

@hawkw
Copy link
Copy Markdown
Member

@hawkw hawkw commented May 6, 2026

Currently, the instance_update saga will unwind if it attempts to update an instance whose active VMM has transitioned to Destroyed or Failed while one switch zone is unreachable. This means that the instance will remain in the Stopping state, and will not transition to Stopped or Failed, until both switch zones have come back. This is unfortunate, since it prevents the instance from being deleted or restarted, either manually by a user or automatically by instance reincarnation, until the disappeared switch reappears. This is described in #10343.

This commit fixes this issue by changing the instance_networking::instance_delete_dpd_config function to return Ok(()) even if an error occurs while notifying dendrite of the NAT state change. This is safe to do because the latest NAT state is propagated to both switches by an RPW, so when the switch that has disappeared returns, it will still receive the latest state even if we were not able to notify it in the saga. This is identical to what instance_ensure_dpd_config has done since PR #8914 --- we just needed to make the same change in instance_delete_dpd_config.

I've also added a test ensuring that the instance update saga can complete while one Dendrite is gone, which failed prior to this change.

Fixes #10343.

@hawkw hawkw self-assigned this May 6, 2026
@hawkw hawkw added bug Something that isn't working. nexus Related to nexus labels May 6, 2026
@hawkw
Copy link
Copy Markdown
Member Author

hawkw commented May 6, 2026

A note for reviewers: https://github.com/oxidecomputer/omicron/pull/10389/changes#diff-a91e7ca4fc6c9b747df5d8b1b79c49f1595b9179369c1f9e7cd5f3ea00e1bc45 is pretty much the entire fix here. The rest of this PR is just adding tests.

@hawkw
Copy link
Copy Markdown
Member Author

hawkw commented May 6, 2026

While I was working on the instance_networking change, I noticed that the delete_dpd_config_by_entry() function also calls notify_dendrite_nat_state() and returns any Err from that function, just like what instance_delete_dpd_config() was doing prior to this change:

notify_dendrite_nat_state(datastore, log, resolver, opctx_alloc, None).await

I thought that this also seemed sketchy, so I went to check the the one place where this function is used, an unwinding action in the instance_ip_attach saga. At first glance, this looked scary, because we map_err it into an ActionFailed error, which made me worry that the saga might get stuck if it unwinds while a switch is down...but then I noticed that after we map it into an ActionFailed error, we just log a warning and return Ok(()):

if let Err(e) = osagactx
.nexus()
.delete_dpd_config_by_entry(&opctx, &nat_entry)
.await
.map_err(saga_action_failed)
{
error!(log, "siia_nat_undo: failed to notify DPD: {e}");
}

So, I guess that's fine, even though I had a brief moment of panic. I kind of feel like maybe we should nonetheless change delete_dpd_config_by_entry to not return an error if the notify_dendrite_nat_state() call fails, and log it instead, so that we can have a convention that all the instance_networking functions don't return an error if they can't talk to a switch zone. This wouldn't actually change any current behavior but it would make me feel less worried about a future change using delete_dpd_config_by_entry() someplace else and slap a ? on it. Maybe that's worth doing in a separate branch?

@hawkw hawkw requested review from bnaecker and removed request for jgallagher May 6, 2026 22:58
@hawkw
Copy link
Copy Markdown
Member Author

hawkw commented May 6, 2026

this test failure is an unrelated flake, and i don't see any existing issues for it. I`ve opened #10398 for that.

Comment thread nexus/src/app/instance_network.rs Outdated
@hawkw hawkw requested a review from iximeow May 7, 2026 17:36
Copy link
Copy Markdown
Collaborator

@bnaecker bnaecker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, I agree keeping the error-handling policy consistent within this module is better.

@hawkw
Copy link
Copy Markdown
Member Author

hawkw commented May 7, 2026

Looks good to me, I agree keeping the error-handling policy consistent within this module is better.

Cool, I'll probably open a follow-up PR to also change delete_dpd_config_by_entry().

@hawkw hawkw merged commit 5af1f71 into main May 7, 2026
16 checks passed
@hawkw hawkw deleted the eliza/instance-update-dead-switch branch May 7, 2026 18:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something that isn't working. nexus Related to nexus

Projects

None yet

Development

Successfully merging this pull request may close these issues.

instance_update sagas cannot complete while one switch zone is down

4 participants