e2e: flaky TestE2E_MultiClientIBRL_RouteLiveness — route restoration 'Condition never satisfied' #3935

Description

@ben-dz

Summary

TestE2E_MultiClientIBRL_RouteLiveness intermittently fails with a route-convergence timeout. After the test blocks and then unblocks a path between IBRL clients, it waits (via require.Eventually) for connectivity to be restored, and that condition never becomes true within the timeout. The failure surfaced on an unrelated dependency-bump PR (#3924, which bumps tabled, a table-formatting crate with no connection to routing), so that change is not the cause.

This is a third, distinct e2e flake class, separate from the multicast one (#3907) and the Docker network address-pool one (TestE2E_BackwardCompatibility, "Pool overlaps").

Failing assertion

multi_client_ibrl_liveness_test.go:449:
    Error:    Condition never satisfied
    Test:     TestE2E_MultiClientIBRL_RouteLiveness
    Messages: pass %d: unblock c1: c1->c3 restored
    Error Trace:
        e2e/multi_client_ibrl_test.go:514
        e2e/multi_client_ibrl_liveness_test.go:402
        e2e/multi_client_ibrl_liveness_test.go:449
        e2e/multi_client_ibrl_liveness_test.go:249
--- FAIL: TestE2E_MultiClientIBRL_RouteLiveness (338.65s)

The condition that was never satisfied is the c1->c3 restored check after unblocking the c1 path, i.e. the route did not reconverge within the timeout.

Diagnostic context

At failure, the affected client's diagnostic dump showed a healthy tunnel and BGP session:

doublezero status: tunnel doublezero0, "session_status":"BGP Session Up", "user_type":"IBRL"

So the daemon/BGP session was up; what lagged was the specific route restoration between clients after the unblock.

Evidence

To investigate

Open question for the investigator: is this purely a timing flake (the Eventually window at multi_client_ibrl_liveness_test.go:402/:449 too tight for reconvergence under -parallel=12 load), or a real reconvergence gap (route genuinely not restored after unblock)? The diagnostic dump (BGP up, but path not restored) leans toward a real-but-slow reconvergence worth confirming before just widening the timeout.

  • Reproduce by running TestE2E_MultiClientIBRL_RouteLiveness repeatedly under shard load (it's in e2e shard 2).
  • Inspect the block/unblock + restore poll logic at e2e/multi_client_ibrl_liveness_test.go:249,402,449 and the helper at e2e/multi_client_ibrl_test.go:514.

Acceptance criteria

  • TestE2E_MultiClientIBRL_RouteLiveness passes reliably across repeated runs.
  • If it's a timing issue: the restore-wait is hardened (appropriate timeout/interval) with a clear rationale.
  • If it's a real reconvergence delay/bug: the product issue is fixed (or documented), not just the test.
