nexus: allow instance updates to complete when one switch is down#10389
nexus: allow instance updates to complete when one switch is down#10389
Conversation
|
A note for reviewers: https://github.com/oxidecomputer/omicron/pull/10389/changes#diff-a91e7ca4fc6c9b747df5d8b1b79c49f1595b9179369c1f9e7cd5f3ea00e1bc45 is pretty much the entire fix here. The rest of this PR is just adding tests. |
|
While I was working on the omicron/nexus/src/app/instance_network.rs Line 986 in 3681845 I thought that this also seemed sketchy, so I went to check the the one place where this function is used, an unwinding action in the omicron/nexus/src/app/sagas/instance_ip_attach.rs Lines 243 to 250 in cb99c45 So, I guess that's fine, even though I had a brief moment of panic. I kind of feel like maybe we should nonetheless change |
|
this test failure is an unrelated flake, and i don't see any existing issues for it. I`ve opened #10398 for that. |
bnaecker
left a comment
There was a problem hiding this comment.
Looks good to me, I agree keeping the error-handling policy consistent within this module is better.
Cool, I'll probably open a follow-up PR to also change |
Currently, the
instance_updatesaga will unwind if it attempts to update an instance whose active VMM has transitioned toDestroyedorFailedwhile one switch zone is unreachable. This means that the instance will remain in theStoppingstate, and will not transition toStoppedorFailed, until both switch zones have come back. This is unfortunate, since it prevents the instance from being deleted or restarted, either manually by a user or automatically by instance reincarnation, until the disappeared switch reappears. This is described in #10343.This commit fixes this issue by changing the
instance_networking::instance_delete_dpd_configfunction to returnOk(())even if an error occurs while notifying dendrite of the NAT state change. This is safe to do because the latest NAT state is propagated to both switches by an RPW, so when the switch that has disappeared returns, it will still receive the latest state even if we were not able to notify it in the saga. This is identical to whatinstance_ensure_dpd_confighas done since PR #8914 --- we just needed to make the same change ininstance_delete_dpd_config.I've also added a test ensuring that the instance update saga can complete while one Dendrite is gone, which failed prior to this change.
Fixes #10343.