[TRACKER] GRE Tunnel Capacity Study #3744

Description

@elitegreg

Track the work to find the per-device user-GRE-tunnel ceiling on DoubleZero's physical Arista switches and pick a new MaxUserTunnelSlots default. Full design: Notion doc.

Scope at a glance

  • Control-plane stress only on isolated test switches (dzd8 7130LBR, dzd10 7280CR3A). No data plane.
  • Scale onchain user records against a DUT and measure the provisioning pipeline (ledger → stress controller → agent pull → eAPI commit).
  • Production controller and dzd1–dzd4 untouched; a parallel stress controller serves the DUTs.
  • Restore DUTs to pre-study state via EOS configure replace checkpoint: at the end.
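
One way to picture the per-step measurement is a record of stage latencies for each scale step, plus a check that reports the first step that breaks. This is a minimal sketch, not the study's orchestrator: the `StepResult` type, the `find_breaking_step` helper, the stage names, and the single latency threshold are all hypothetical, chosen only to mirror the pipeline stages listed above.

```python
from dataclasses import dataclass, field

# Hypothetical stage names mirroring the pipeline described in the issue:
# ledger -> stress controller -> agent pull -> eAPI commit.
STAGES = ("ledger", "stress_controller", "agent_pull", "eapi_commit")


@dataclass
class StepResult:
    user_count: int                      # onchain user records targeted at the DUT
    stage_seconds: dict = field(default_factory=dict)  # stage name -> latency (s)
    committed: bool = True               # whether the eAPI commit succeeded


def find_breaking_step(results, max_stage_seconds):
    """Return (step, trigger) for the first failing step, or None if none fail.

    Steps are examined in order of increasing user_count. A step fails when
    its commit did not succeed or any stage exceeded max_stage_seconds
    (an assumed, illustrative threshold).
    """
    for step in sorted(results, key=lambda r: r.user_count):
        if not step.committed:
            return step, "eapi_commit failed"
        for stage in STAGES:
            if step.stage_seconds.get(stage, 0.0) > max_stage_seconds:
                return step, f"{stage} exceeded {max_stage_seconds}s"
    return None
```

A caller would pass one `StepResult` per scale step and a threshold. The returned pair lines up with the "breaking-point step and trigger" wording in the exit criteria.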

Exit criteria

  • Raw orchestrator + observer outputs archived for each DUT.
  • Breaking-point step and trigger identified for each DUT.
  • New MaxUserTunnelSlots default merged with a headroom-based justification.
  • Both DUTs restored to standalone EOS; pre/post show running-config diff is empty (or fully explained).
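
The last two criteria can be sketched as two small checks. Both helpers are hypothetical. Deriving the default from the lowest breaking point across DUTs and applying a fractional headroom are assumptions made for illustration; the issue does not state how the headroom is computed. The diff helper assumes the pre-study and post-study `show running-config` outputs have already been captured as text.

```python
import difflib
import math


def default_with_headroom(breaking_points, headroom_fraction):
    """Derive a candidate MaxUserTunnelSlots default.

    breaking_points: DUT name -> tunnel count at which that DUT broke.
    headroom_fraction: share of the ceiling kept in reserve (0 <= f < 1).
    Assumption: the most constrained DUT sets the ceiling.
    """
    if not breaking_points:
        raise ValueError("need at least one DUT breaking point")
    if not 0 <= headroom_fraction < 1:
        raise ValueError("headroom_fraction must be in [0, 1)")
    ceiling = min(breaking_points.values())
    return math.floor(ceiling * (1 - headroom_fraction))


def running_config_diff(pre, post):
    """Return unified-diff lines between two running-config captures.

    An empty list means the DUT's config matches its pre-study state.
    """
    return list(
        difflib.unified_diff(
            pre.splitlines(),
            post.splitlines(),
            fromfile="pre-study",
            tofile="post-study",
            lineterm="",
        )
    )
```

A non-empty result from `running_config_diff` is what the "or fully explained" clause covers: each remaining line would need a written justification before the criterion is met.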

Child issues (this batch)

Deferred (not yet filed)

  • Phase 1 execution run — 7130LBR (dzd8)
  • Phase 2 execution run — 7280CR3A (dzd10)
  • Adopt new MaxUserTunnelSlots default

Related but separate work on this milestone

Out of scope (deferred — see design doc § "Future work")

  • tools/stress/device-report post-run analyzer
  • Multicast boundary list scale sweep
  • Agent poll-interval sweep
  • Full-fabric scale sweep
  • Regression test added to CI
  • Redesign of the agent's full-config-every-5s pull
  • Control-plane-to-client and data-plane-to-client testing
