Summary
TestE2E_MultiClientIBRL_RouteLiveness intermittently fails with a route-convergence timeout. After the test blocks and then unblocks a path between IBRL clients, it waits (via require.Eventually) for connectivity to be restored, and that condition never becomes true within the timeout. Surfaced on an unrelated dependency-bump PR (#3924, tabled — a table-formatting crate with no connection to routing), so it is not caused by that change.
This is a third, distinct e2e flake class, separate from the multicast one (#3907) and the Docker network address-pool one (TestE2E_BackwardCompatibility, "Pool overlaps").
Failing assertion
multi_client_ibrl_liveness_test.go:449:
Error: Condition never satisfied
Test: TestE2E_MultiClientIBRL_RouteLiveness
Messages: pass %d: unblock c1: c1->c3 restored
Error Trace:
e2e/multi_client_ibrl_test.go:514
e2e/multi_client_ibrl_liveness_test.go:402
e2e/multi_client_ibrl_liveness_test.go:449
e2e/multi_client_ibrl_liveness_test.go:249
--- FAIL: TestE2E_MultiClientIBRL_RouteLiveness (338.65s)
The condition that was never satisfied is the c1->c3 restored check after unblocking the c1 path, i.e. the route did not reconverge in time.
Diagnostic context
At failure, the affected client's diagnostic dump showed a healthy tunnel and BGP session:
doublezero status: tunnel doublezero0, "session_status":"BGP Session Up", "user_type":"IBRL"
So the daemon/BGP session was up; what lagged was the specific route restoration between clients after the unblock.
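The distinction drawn here (control plane up, data path not restored) can be made explicit when triaging a dump. The following is a minimal sketch only: it assumes the status output is a JSON object carrying the two fields quoted above, and `clientStatus` and `diagnose` are hypothetical names, not part of the doublezero codebase.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// clientStatus mirrors only the two fields quoted in the dump above.
// The surrounding JSON object shape is an assumption.
type clientStatus struct {
	SessionStatus string `json:"session_status"`
	UserType      string `json:"user_type"`
}

// diagnose combines the control-plane view (session status from the dump)
// with the result of a separate data-path probe such as the c1->c3 check.
func diagnose(raw []byte, pathRestored bool) (string, error) {
	var s clientStatus
	if err := json.Unmarshal(raw, &s); err != nil {
		return "", fmt.Errorf("decode status: %w", err)
	}
	sessionUp := s.SessionStatus == "BGP Session Up"
	switch {
	case sessionUp && pathRestored:
		return "healthy", nil
	case sessionUp:
		return "session up, path not restored", nil
	default:
		return "session not up", nil
	}
}

func main() {
	raw := []byte(`{"session_status":"BGP Session Up","user_type":"IBRL"}`)
	verdict, err := diagnose(raw, false)
	if err != nil {
		panic(err)
	}
	fmt.Println(verdict) // prints "session up, path not restored"
}
```

The "session up, path not restored" verdict corresponds to the state observed in this failure.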
To investigate
Open question for the investigator: is this purely a timing flake (the Eventually window at multi_client_ibrl_liveness_test.go:402/:449 too tight for reconvergence under -parallel=12 load), or a real reconvergence gap (route genuinely not restored after unblock)? The diagnostic dump (BGP up, but path not restored) leans toward a real-but-slow reconvergence worth confirming before just widening the timeout.
- Reproduce by running TestE2E_MultiClientIBRL_RouteLiveness repeatedly under shard load (it's in e2e shard 2).
- Inspect the block/unblock + restore poll logic at e2e/multi_client_ibrl_liveness_test.go:249,402,449 and the helper at e2e/multi_client_ibrl_test.go:514.
Acceptance criteria
- TestE2E_MultiClientIBRL_RouteLiveness passes reliably across repeated runs.
- If it's a timing issue: the restore-wait is hardened (appropriate timeout/interval) with a clear rationale.
- If it's a real reconvergence delay/bug: the product issue is fixed (or documented), not just the test.
Evidence
- /run-e2e, shard 2; job 82813640956