Skip to content

fix(l1): fix discv5 Hive test failures#6401

Merged
iovoid merged 7 commits intomainfrom
fix/discv5-hive-test-failures
Mar 30, 2026
Merged

fix(l1): fix discv5 Hive test failures#6401
iovoid merged 7 commits intomainfrom
fix/discv5-hive-test-failures

Conversation

@edg-l
Copy link
Copy Markdown
Contributor

@edg-l edg-l commented Mar 24, 2026

Summary

Fixes all 7 discv5 Hive test failures (Ping, PingLargeRequestID, PingMultiIP, HandshakeResend, TalkRequest, FindnodeZeroDistance, FindnodeResults).

Root causes and fixes

Rate limiting:

  • WHOAREYOU rate limit was keyed per IP, blocking distinct nodes behind the same IP (Docker/NAT). Changed to key on (IP, node_id).
  • Skip WHOAREYOU rate limit entirely for private/local IPs since amplification attacks aren't a concern on local networks.
  • Bypass rate limit on session IP mismatch since the sender proved identity via successful decryption.

Session key handling:

  • set_session_info used and_modify which silently did nothing if the contact didn't exist yet, causing responses to be encrypted with a zeroed key. Fixed by passing the outbound key directly from the handshake through to message handlers.
  • Sessions now stored independently of contacts so they're retrievable even when no contact ENR is known yet.

Protocol compliance:

  • TalkReq was silently ignored; spec requires an empty TalkRes response.
  • HandshakeResend: store encoded WHOAREYOU bytes and resend verbatim when peer retransmits before completing handshake.
  • PingMultiIP: track session source IP and send WHOAREYOU on IP mismatch.
  • PingLargeRequestID: drop messages with req_id > 8 bytes per spec.
  • FindnodeZeroDistance: include local ENR when distance 0 is requested.
  • FindnodeResults: use spec-correct log-distance (bits of XOR) instead of off-by-one.

CI:

  • Enabled discv5 in PR CI (discv4|discv5|eth|snap).

Test plan

  • All existing discv5 unit tests pass (30/30)
  • make run-hive SIMULATION=devp2p TEST_PATTERN=discv5 passes all 7 tests
  • PR CI Devp2p Hive job passes with discv5 included

@edg-l edg-l changed the title fix(p2p): fix discv5 Hive test failures (Ping, MultiIP, Handshake, TalkReq, FindNode) fix(l1): fix discv5 Hive test failures (Ping, MultiIP, Handshake, TalkReq, FindNode) Mar 24, 2026
@edg-l edg-l changed the title fix(l1): fix discv5 Hive test failures (Ping, MultiIP, Handshake, TalkReq, FindNode) fix(l1): fix discv5 Hive test failures Mar 24, 2026
@github-actions
Copy link
Copy Markdown

github-actions bot commented Mar 24, 2026

Lines of code report

Total lines added: 137
Total lines removed: 0
Total lines changed: 137

Detailed view
+-----------------------------------------------+-------+------+
| File                                          | Lines | Diff |
+-----------------------------------------------+-------+------+
| ethrex/crates/networking/p2p/discv5/server.rs | 974   | +125 |
+-----------------------------------------------+-------+------+
| ethrex/crates/networking/p2p/peer_table.rs    | 1334  | +12  |
+-----------------------------------------------+-------+------+

@github-actions github-actions bot added the L1 Ethereum client label Mar 24, 2026
@github-actions
Copy link
Copy Markdown

🤖 Codex Code Review

  1. Medium: the new standalone discv5 session state is never reclaimed, so this PR introduces an unbounded memory-growth path. PeerTableServer::prune() only removes contacts, and RemovePeer only removes peers; neither clears sessions. On the server side, session_ips is also append-only. A remote peer can repeatedly complete handshakes under fresh node IDs and retain both symmetric keys and IP metadata indefinitely. Refs: peer_table.rs#L764, peer_table.rs#L814, peer_table.rs#L1448, server.rs#L129, server.rs#L915

  2. Medium: the new IP-change invalidation is only recorded for handshakes where we are the recipient. In the initiator flow (handle_who_are_you), the code stores the session but never writes session_ips, so later packets on that session bypass the new IP-mismatch check entirely. That means the PingMultiIP hardening is only half-applied. Refs: server.rs#L252, server.rs#L341, server.rs#L445

  3. Medium: discv5 distance semantics are now inconsistent between requester and responder. get_nodes_at_distances() was changed to use spec distance bits(XOR) with 0 reserved for self, but get_random_find_node_message() still uses the old helper returning bits(XOR) - 1. That shifts outbound FINDNODE queries by one bucket and can quietly degrade lookup coverage/correctness. Refs: server.rs#L500, peer_table.rs#L927

I couldn’t run the Rust tests in this sandbox because cargo/rustup need writable home directories and network-backed dependency access. Aside from the three points above, the TALKREQ/TALKRESP handling and req_id validation changes look reasonable.


Automated review by OpenAI Codex · gpt-5.4 · custom prompt

@github-actions
Copy link
Copy Markdown

🤖 Claude Code Review

Now I have everything I need for a thorough review.


PR Review: fix(p2p): fix discv5 Hive test failures

The PR correctly identifies and addresses all five root causes. The changes are well-reasoned and the comments are thorough. A few issues need attention before merge.


Bugs / Correctness

1. session_ips is never cleaned up (server.rs)

session_ips is populated in handle_handshake but there is no corresponding removal in cleanup_stale_entries, contact eviction, or peer removal paths. Over time in a long-running node this map grows without bound. It should be pruned alongside pending_challenges or keyed to session lifetime.

// cleanup_stale_entries does not touch session_ips at all
self.pending_challenges.retain(...);
self.whoareyou_rate_limit.retain(...);
// session_ips: no cleanup

2. sessions map in PeerTableServer is never cleaned up (peer_table.rs)

sessions.insert(node_id, ...) is called on every SetSessionInfo cast, but RemovePeer only removes from self.peers and contact eviction in update_contacts_liveness only removes from self.contacts and discarded_contacts. Neither touches sessions. Same unbounded-growth problem as Point 1.

CastMessage::RemovePeer { node_id } => {
    self.peers.swap_remove(&node_id);
    // self.sessions.remove(&node_id); ← missing
}

3. Silent zeroed-key fallback in resolve_outbound_key (server.rs, line ~744)

.map_or([0; 16], |s| s.outbound_key)

If no session exists, the outbound key is silently set to all-zeros. This produces a ciphertext the peer cannot decrypt, causing a silent protocol failure that is hard to diagnose. At minimum this should emit a warn!; ideally it should return an Err so callers can handle the missing-session case explicitly. The same pre-existing issue exists in send_ordinary at line ~715.

4. Rate-limit check for HandshakeResend bypasses expiry of pending challenge (server.rs, send_who_are_you)

if let Some((_, _, raw_bytes)) = self.pending_challenges.get(&src_id) {
    self.udp_socket.send_to(raw_bytes, addr).await?;
    return Ok(());
}

The timestamp in the tuple (second element) is ignored here. If cleanup_stale_entries hasn't run yet and the challenge is older than MESSAGE_CACHE_TIMEOUT, the stale bytes are re-sent unconditionally. The peer will reject the handshake (since the challenge data is expired on our end), but we still spend bandwidth. Consider checking now.duration_since(*timestamp) < MESSAGE_CACHE_TIMEOUT before re-sending, consistent with how you guard other stale entries.


Design / Minor Issues

5. Double encoding in send_who_are_you (server.rs)

The packet is encoded twice: first by who_are_you.encode(...) (to build challenge_data), then again by packet.encode(&mut raw_buf, &src_id)? to produce the bytes that are actually sent. Previously send_packet handled transmission; now the caller takes over. This adds an unnecessary second encoding pass. Consider building raw_bytes once and deriving challenge_data from those bytes (or passing the raw buffer through build_challenge_data).

6. handle_ping adds PONG to pending_by_nonce in one code path but not the other (server.rs)

When outbound_key.is_none() and a contact exists, the code uses send_ordinary, which appends the PONG to pending_by_nonce. For the post-handshake path (or unknown-contact path) it uses send_ordinary_to, which does not. PONGs are responses and shouldn't need retry tracking, so the send_ordinary_to path is more correct. The send_ordinary path for known contacts silently accumulates stale PONG entries in pending_by_nonce. This is pre-existing but the asymmetry is now more visible.

7. pending_challenges tuple readability (server.rs)

The tuple (Vec<u8>, Instant, Vec<u8>) now carries three elements where two are both Vec<u8> with very different semantics (challenge_data vs raw_packet). A small named struct would prevent argument-order mistakes:

struct PendingChallenge {
    challenge_data: Vec<u8>,
    sent_at: Instant,
    raw_packet: Vec<u8>,
}

What Looks Good

  • The (IpAddr, H256) rate-limit key is the correct fix for the Docker/Hive IP-sharing problem and is well-documented.
  • The set_session_info and_modify → always-insert fix is correct; the standalone sessions map cleanly decouples session state from contact existence.
  • The discv5 log-distance correction (U256::bits() of XOR) matches the spec definition exactly, and distance-0 local-ENR inclusion is handled safely outside get_nodes_at_distances.
  • TalkReq → empty TalkRes is spec-compliant.
  • req_id > 8 bytes guards are correct per the spec.
  • The HandshakeResend verbatim re-send (keeping the echoed nonce consistent) is the right approach.
  • UnknownContact being non-fatal in handle_find_node is safe: the message was already decrypted with a valid session key, proving identity.
  • Test update for rate-limiting correctly reflects the new (IP, node_id) semantics.

Summary

Points 1 and 2 (memory leaks in session_ips and sessions) are the most important to fix before merge — both maps grow without bound in a production node. Point 3 (silent zeroed-key fallback) is a latent debugging hazard. Points 4–7 are lower-priority improvements.


Automated review by Claude (Anthropic) · sonnet · custom prompt

@edg-l edg-l marked this pull request as ready for review March 24, 2026 18:50
@edg-l edg-l requested a review from a team as a code owner March 24, 2026 18:50
@github-actions
Copy link
Copy Markdown

🤖 Codex Code Review

  1. High: the WHOAREYOU resend path can reflect packets to a different address than the one that triggered the challenge. In crates/networking/p2p/discv5/server.rs:828 the cached challenge is keyed only by src_id, and on a retry it blindly does send_to(raw_bytes, addr). Because the original address is not stored with the pending challenge, any later spoofed packet using the same src_id can make us resend that cached WHOAREYOU to an arbitrary victim IP while bypassing the normal rate-limit branch. Store the original SocketAddr alongside the cached bytes and only resend when the source matches; otherwise rate-limit or mint a fresh challenge.

  2. High: the new standalone sessions map is never cleaned up, so pruning a contact no longer actually revokes its discv5 session. prune() removes only contacts in crates/networking/p2p/peer_table.rs:814, but GetSessionInfo now prefers the separate sessions store in crates/networking/p2p/peer_table.rs:1408 and SetSessionInfo always inserts into it in crates/networking/p2p/peer_table.rs:1448. That means a discarded peer can still decrypt ordinary packets via crates/networking/p2p/discv5/server.rs:240 and still get responses on the new unknown-contact paths, e.g. PONG/TALKRESP in crates/networking/p2p/discv5/server.rs:555 and crates/networking/p2p/discv5/server.rs:1118. This is both a memory leak and a correctness/security regression. Sessions need lifecycle cleanup when contacts are pruned/discarded or explicitly invalidated.

  3. Medium: the new IP-change protection is only installed for one half of the handshake flow. handle_ordinary() checks session_ips before accepting a packet in crates/networking/p2p/discv5/server.rs:252, but session_ips.insert(...) only happens in handle_handshake() at crates/networking/p2p/discv5/server.rs:444. When we are the initiator, handle_who_are_you() stores the session in crates/networking/p2p/discv5/server.rs:345 and never records the peer IP, so later packets from a different IP will skip the new invalidation logic entirely. If PingMultiIP behavior is intended generally, the initiator path needs to populate session_ips too.

I couldn’t run the Rust tests in this environment because cargo tried to write temp files under read-only /home/runner/.rustup.


Automated review by OpenAI Codex · gpt-5.4 · custom prompt

@github-actions
Copy link
Copy Markdown

🤖 Claude Code Review

Now I have a thorough picture of the PR. Let me write the review.


Code Review: fix(l1): fix discv5 Hive test failures (PR #6401)

This PR fixes 7 discv5 Hive test failures across rate limiting, session key handling, and protocol compliance. The overall approach is sound and the fixes are well-motivated. Several issues warrant attention before merge.


Memory Leaks (Unbounded Growth)

session_ips in server.rs — never reclaimed

server.rs adds session_ips: FxHashMap<H256, IpAddr> to track per-session source IPs, but the cleanup loop in revalidate / cleanup only retains entries in pending_challenges and whoareyou_rate_limit. session_ips grows forever. A session entry should be removed when the session itself is evicted or when the contact is removed.

sessions in peer_table.rs — never reclaimed

PeerTableServer gets a standalone sessions: FxHashMap<H256, Session> field. There is no corresponding removal path when a contact is pruned (e.g. CastMessage::RemovePeer, kbucket eviction). The entry accumulates a dead session indefinitely. At minimum, sessions.remove(&node_id) should mirror any contact removal.


Protocol Correctness

resolve_outbound_key silently falls back to the zero key

server.rs (around line 740 in the new file):

Ok(self.peer_table.get_session_info(*node_id).await?
    .map_or([0; 16], |s| s.outbound_key))

A zeroed 16-byte key is a valid [u8; 16] — any message "encrypted" with it will decode as garbage on the other side, causing a silent protocol failure that is extremely hard to diagnose. The function should return Err(...) (or at minimum a warn! log) when no session exists. The callers (handle_ping, handle_find_node, handle_message for TalkReq) can then decide how to proceed.

Distance computation inconsistency

get_nodes_at_distances is now correctly fixed to use U256::bits() (spec log-distance), but get_random_find_node_message still calls the old distance() helper which is off by one. This means outbound FINDNODE queries will ask for nodes in the wrong bucket, quietly degrading routing table construction even after this fix. The old helper should either be updated or removed.

IP-change detection only wired on the recipient handshake path

handle_handshake correctly stores session_ips.insert(src_id, addr.ip()), but handle_who_are_you (the initiator path, where we sent the WHOAREYOU and now receive the Handshake response) does not record a session IP. If the node we initiated with responds from a different IP, the mismatch will never be detected for that session. For the current Hive tests this may not matter, but it is half a fix if the intent is to cover both directions.

pending_by_nonce not updated in the send_ordinary_to path

send_ordinary (the normal path) registers the outgoing message in pending_by_nonce, which is used to retry or correlate responses. send_ordinary_to (the new keyless path) does not. Any PONG or NODES response to a message sent via send_ordinary_to will not be matched by the pending-nonce lookup, potentially silently losing the response. This should be documented at minimum, or the tracking added.

HandshakeResend re-sends without staleness check

In send_who_are_you:

if let Some((_, _, raw_bytes)) = self.pending_challenges.get(&src_id) {
    self.udp_socket.send_to(raw_bytes, addr).await?;
    return Ok(());
}

There is no check against MESSAGE_CACHE_TIMEOUT. If a peer retransmits just before the challenge expires, this re-sends a challenge that will be evicted momentarily (or has already been evicted between iterations). The timestamp stored in the tuple should be checked here.


Code Quality

pending_challenges tuple has two Vec<u8> with unrelated semantics

pub pending_challenges: FxHashMap<H256, (Vec<u8>, Instant, Vec<u8>)>,

Two Vec<u8> fields — one is challenge_data (for HMAC/signature verification), the other is raw_bytes (for verbatim resend). A named struct would make this significantly clearer and reduce destructuring noise throughout:

struct PendingChallenge {
    challenge_data: Vec<u8>,
    sent_at: Instant,
    raw_packet: Vec<u8>,
}

handle_ping control flow is hard to follow

The condition:

if outbound_key.is_none()
    && let Some(contact) = self.peer_table.get_contact(sender_id).await?
{
    return self.send_ordinary(pong, &contact.node).await;
}
let key = self.resolve_outbound_key(...).await?;
self.send_ordinary_to(pong, &sender_id, sender_addr, &key).await?;

The logic "if we have no outbound key AND we know the contact, use the contact path; otherwise use direct send" is non-obvious and has a subtle asymmetry: when outbound_key.is_none() but contact does not exist, resolve_outbound_key will reach its zero-key fallback. A comment at minimum would help; restructuring into two explicit branches (known contact vs. direct send) would be cleaner.


What Looks Good

  • (IpAddr, H256) rate-limit key correctly avoids blocking distinct nodes behind shared IPs (Docker/NAT).
  • Standalone sessions map in PeerTable fixing the and_modify silent no-op is the right structural fix.
  • Spec-correct U256::bits() log-distance in get_nodes_at_distances.
  • Empty TalkRes response for TalkReq per spec.
  • req_id > 8 bytes guard applied consistently to Ping, FindNode, and TalkReq.
  • Verbatim WHOAREYOU resend for HandshakeResend is the correct approach.
  • UnknownContact treated as non-fatal in handle_find_node post-handshake.
  • CI addition of discv5 to the devp2p Hive job.
  • Updated test_whoareyou_rate_limiting correctly uses a public IP and validates the new (IP, node_id) semantics.

Summary of Action Items

  • Fix session_ips and sessions cleanup to prevent unbounded memory growth.
  • Make resolve_outbound_key return an error (or log a warning) instead of silently using [0; 16].
  • Fix the distance formula in get_random_find_node_message to match the corrected get_nodes_at_distances.
  • Add sent_at staleness check in the HandshakeResend path.
  • Consider a named struct for PendingChallenge to replace the untyped tuple.
  • Record session IP on the initiator path (handle_who_are_you) for symmetric IP-change detection.

Automated review by Claude (Anthropic) · sonnet · custom prompt

@greptile-apps
Copy link
Copy Markdown

greptile-apps bot commented Mar 24, 2026

Greptile Summary

This PR fixes all 7 failing discv5 Hive tests by addressing a cluster of interconnected protocol compliance bugs in the discv5 server and peer table. CI is updated to include discv5 in the devp2p Hive filter.

Key changes:

  • The WHOAREYOU rate-limit map key is changed from IpAddr to (IpAddr, H256), and private IPs are exempt to accommodate Docker environments where many nodes share an address.
  • pending_challenges is extended with raw encoded packet bytes so the original challenge can be re-sent verbatim on HandshakeResend.
  • A new session_ips map tracks the source IP for each session; an IP change triggers a fresh challenge (PingMultiIP fix).
  • PeerTableServer gains a standalone sessions map so sessions persist even when the contact ENR is absent. A dedicated GetSessionInfo call message is added.
  • handle_message gains an outbound_key parameter so a just-completed handshake passes the encryption key directly without a round-trip to the peer table.
  • get_nodes_at_distances now uses U256::bits() on the XOR value, matching the spec log-distance definition; the local ENR is included when distance 0 is requested.
  • TalkReq now responds with an empty TalkRes as required by the spec, and oversized req_id fields are dropped across Ping, FindNode, and TalkReq.

Confidence Score: 4/5

  • Safe to merge; all identified issues are non-blocking memory-growth concerns rather than correctness bugs.
  • The PR correctly addresses seven distinct protocol compliance failures with well-scoped, targeted changes. The logic for HandshakeResend, PingMultiIP, session-key threading, and distance calculation is sound. Two unbounded maps (session_ips in the server and sessions in PeerTableServer) grow without cleanup, which matters for long-running nodes but does not break the fixed test scenarios. The silent zeroed-key fallback in resolve_outbound_key is pre-existing and partially mitigated. These are suitable follow-up issues rather than blockers.
  • crates/networking/p2p/discv5/server.rs (session_ips cleanup) and crates/networking/p2p/peer_table.rs (sessions map cleanup on RemovePeer).

Important Files Changed

Filename Overview
crates/networking/p2p/discv5/server.rs Core discv5 server: fixes WHOAREYOU rate-limit key, session-IP tracking, HandshakeResend re-send, TalkReq empty response, req_id length validation, FindNode distance-0 inclusion, and IP-mismatch re-challenge. Two unbounded maps (session_ips, fallback zeroed key) are minor follow-up concerns.
crates/networking/p2p/peer_table.rs Adds a standalone sessions FxHashMap so sessions survive even without a contact ENR, fixes set_session_info silent no-op, adds GetSessionInfo call message, and fixes get_nodes_at_distances to use spec-correct XOR bit-length distance. The sessions map is not pruned on RemovePeer.
test/tests/p2p/discovery/discv5_server_tests.rs Unit test updated to match new (IP, node_id)-keyed rate limit and private-IP exemption; behaviour assertions are mostly correct but one comment misidentifies the HandshakeResend path as rate limiting.
.github/workflows/pr-main_l1.yaml Adds discv5 to the devp2p Hive test filter so the 7 newly fixed tests run in CI.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Packet arrives] --> B{Has session?}
    B -- No --> C[send_who_are_you]
    B -- Yes --> D{Decryption OK?}
    D -- No --> C
    D -- Yes --> E{IP matches stored?}
    E -- No --> F[Clear rate entry, send WAY]
    E -- Yes --> G[handle_message outbound=None]

    C --> H{pending_challenges?}
    H -- Yes --> I[Resend original bytes verbatim]
    H -- No --> J{Private addr?}
    J -- Yes --> K[Send WAY + store raw bytes]
    J -- No --> L{Rate limited?}
    L -- Yes --> M[Drop]
    L -- No --> K

    subgraph Handshake flow
    N[Handshake pkt] --> O[Verify + derive keys]
    O --> P[store session in sessions map]
    P --> Q[record session_ips entry]
    Q --> R[handle_message outbound=Some]
    end

    G --> S{Message}
    R --> S
    S -- Ping --> T[Reply PONG]
    S -- FindNode --> U[Reply NODES with XOR dist + local ENR at dist 0]
    S -- TalkReq --> V[Reply empty TALKRES]
Loading

Comments Outside Diff (1)

  1. crates/networking/p2p/peer_table.rs, line 1456-1458 (link)

    P2 Standalone sessions map leaks entries on peer removal

    RemovePeer removes from self.peers but not from the newly added self.sessions. Because SetSessionInfo always inserts into sessions, once a session is created its entry persists indefinitely even after the contact and peer are evicted.

    A matching removal should be added here:

    Prompt To Fix With AI
    This is a comment left during a code review.
    Path: crates/networking/p2p/peer_table.rs
    Line: 1456-1458
    
    Comment:
    **Standalone `sessions` map leaks entries on peer removal**
    
    `RemovePeer` removes from `self.peers` but not from the newly added `self.sessions`. Because `SetSessionInfo` always inserts into `sessions`, once a session is created its entry persists indefinitely even after the contact and peer are evicted.
    
    A matching removal should be added here:
    
    
    
    How can I resolve this? If you propose a fix, please make it concise.
Prompt To Fix All With AI
This is a comment left during a code review.
Path: crates/networking/p2p/discv5/server.rs
Line: 129

Comment:
**`session_ips` map is never pruned**

`session_ips` is written to on every successful handshake (`handle_handshake` line ~449) but there is no corresponding cleanup in `cleanup_stale_entries`. Over time, a long-lived node will accumulate an entry for every peer it has ever established a session with, growing without bound.

Consider adding a retention pass in `cleanup_stale_entries` tied to session lifetime, or removing the entry when the session is evicted.

```suggestion
    pub session_ips: FxHashMap<H256, IpAddr>,
```

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: crates/networking/p2p/peer_table.rs
Line: 1456-1458

Comment:
**Standalone `sessions` map leaks entries on peer removal**

`RemovePeer` removes from `self.peers` but not from the newly added `self.sessions`. Because `SetSessionInfo` always inserts into `sessions`, once a session is created its entry persists indefinitely even after the contact and peer are evicted.

A matching removal should be added here:

```suggestion
            CastMessage::RemovePeer { node_id } => {
                self.peers.swap_remove(&node_id);
                self.sessions.remove(&node_id);
            }
```

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: crates/networking/p2p/discv5/server.rs
Line: 747-751

Comment:
**Silent fallback to zeroed key on missing session**

If `get_session_info` returns `None` (e.g., session was evicted between the ordinary packet being decrypted and this call), `resolve_outbound_key` silently returns `[0u8; 16]`. Any response (PONG, NODES, TALKRES) encrypted with a zeroed key will be unreadable by the peer, causing silent communication failures rather than a visible error.

Consider logging a warning or returning an error instead:

```suggestion
        Ok(self
            .peer_table
            .get_session_info(*node_id)
            .await?
            .map(|s| s.outbound_key)
            .unwrap_or_else(|| {
                trace!(protocol = "discv5", node_id = %node_id, "No session found, using zeroed key");
                [0; 16]
            }))
```

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: test/tests/p2p/discovery/discv5_server_tests.rs
Line: 79-81

Comment:
**Misleading comment — not actually rate limited**

The comment says `src_id1` is "rate limited", but the real reason the second call short-circuits is the HandshakeResend path: `pending_challenges` already contains an entry for `src_id1`, so `send_who_are_you` re-sends the original WHOAREYOU verbatim and returns before the rate-limit check. The rate-limit key `(addr.ip(), src_id1)` would also block a fresh WHOAREYOU, but that branch is never reached here.

Consider updating the comment and adding an assertion to capture the intended behaviour:

```suggestion
        // Second call for src_id1 returns early via the HandshakeResend path
        // (pending_challenges already has an entry) — not via rate limiting.
        let _ = server.send_who_are_you(nonce, src_id1, addr).await;
        // Rate limit entry count is unchanged because we returned early
        assert_eq!(server.whoareyou_rate_limit.len(), 2);
```

How can I resolve this? If you propose a fix, please make it concise.

Reviews (1): Last reviewed commit: "fix(p2p): skip WHOAREYOU rate limit for ..." | Re-trigger Greptile

/// Tracks the source IP that each session was established from.
/// Used to detect IP changes: if a packet arrives from a different IP than the session was
/// established with, we invalidate the session by sending WHOAREYOU (PingMultiIP behaviour).
pub session_ips: FxHashMap<H256, IpAddr>,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 session_ips map is never pruned

session_ips is written to on every successful handshake (handle_handshake line ~449) but there is no corresponding cleanup in cleanup_stale_entries. Over time, a long-lived node will accumulate an entry for every peer it has ever established a session with, growing without bound.

Consider adding a retention pass in cleanup_stale_entries tied to session lifetime, or removing the entry when the session is evicted.

Suggested change
pub session_ips: FxHashMap<H256, IpAddr>,
pub session_ips: FxHashMap<H256, IpAddr>,
Prompt To Fix With AI
This is a comment left during a code review.
Path: crates/networking/p2p/discv5/server.rs
Line: 129

Comment:
**`session_ips` map is never pruned**

`session_ips` is written to on every successful handshake (`handle_handshake` line ~449) but there is no corresponding cleanup in `cleanup_stale_entries`. Over time, a long-lived node will accumulate an entry for every peer it has ever established a session with, growing without bound.

Consider adding a retention pass in `cleanup_stale_entries` tied to session lifetime, or removing the entry when the session is evicted.

```suggestion
    pub session_ips: FxHashMap<H256, IpAddr>,
```

How can I resolve this? If you propose a fix, please make it concise.

Comment thread crates/networking/p2p/discv5/server.rs Outdated
Comment on lines +747 to +751
Ok(self
.peer_table
.get_session_info(*node_id)
.await?
.map_or([0; 16], |s| s.outbound_key))
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Silent fallback to zeroed key on missing session

If get_session_info returns None (e.g., session was evicted between the ordinary packet being decrypted and this call), resolve_outbound_key silently returns [0u8; 16]. Any response (PONG, NODES, TALKRES) encrypted with a zeroed key will be unreadable by the peer, causing silent communication failures rather than a visible error.

Consider logging a warning or returning an error instead:

Suggested change
Ok(self
.peer_table
.get_session_info(*node_id)
.await?
.map_or([0; 16], |s| s.outbound_key))
Ok(self
.peer_table
.get_session_info(*node_id)
.await?
.map(|s| s.outbound_key)
.unwrap_or_else(|| {
trace!(protocol = "discv5", node_id = %node_id, "No session found, using zeroed key");
[0; 16]
}))
Prompt To Fix With AI
This is a comment left during a code review.
Path: crates/networking/p2p/discv5/server.rs
Line: 747-751

Comment:
**Silent fallback to zeroed key on missing session**

If `get_session_info` returns `None` (e.g., session was evicted between the ordinary packet being decrypted and this call), `resolve_outbound_key` silently returns `[0u8; 16]`. Any response (PONG, NODES, TALKRES) encrypted with a zeroed key will be unreadable by the peer, causing silent communication failures rather than a visible error.

Consider logging a warning or returning an error instead:

```suggestion
        Ok(self
            .peer_table
            .get_session_info(*node_id)
            .await?
            .map(|s| s.outbound_key)
            .unwrap_or_else(|| {
                trace!(protocol = "discv5", node_id = %node_id, "No session found, using zeroed key");
                [0; 16]
            }))
```

How can I resolve this? If you propose a fix, please make it concise.

Comment on lines +79 to +81
// Same node_id and same IP should be rate limited
let _ = server.send_who_are_you(nonce, src_id1, addr).await;
// pending_challenges entry for src_id1 should not be updated (still the first one)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Misleading comment — not actually rate limited

The comment says src_id1 is "rate limited", but the real reason the second call short-circuits is the HandshakeResend path: pending_challenges already contains an entry for src_id1, so send_who_are_you re-sends the original WHOAREYOU verbatim and returns before the rate-limit check. The rate-limit key (addr.ip(), src_id1) would also block a fresh WHOAREYOU, but that branch is never reached here.

Consider updating the comment and adding an assertion to capture the intended behaviour:

Suggested change
// Same node_id and same IP should be rate limited
let _ = server.send_who_are_you(nonce, src_id1, addr).await;
// pending_challenges entry for src_id1 should not be updated (still the first one)
// Second call for src_id1 returns early via the HandshakeResend path
// (pending_challenges already has an entry) — not via rate limiting.
let _ = server.send_who_are_you(nonce, src_id1, addr).await;
// Rate limit entry count is unchanged because we returned early
assert_eq!(server.whoareyou_rate_limit.len(), 2);
Prompt To Fix With AI
This is a comment left during a code review.
Path: test/tests/p2p/discovery/discv5_server_tests.rs
Line: 79-81

Comment:
**Misleading comment — not actually rate limited**

The comment says `src_id1` is "rate limited", but the real reason the second call short-circuits is the HandshakeResend path: `pending_challenges` already contains an entry for `src_id1`, so `send_who_are_you` re-sends the original WHOAREYOU verbatim and returns before the rate-limit check. The rate-limit key `(addr.ip(), src_id1)` would also block a fresh WHOAREYOU, but that branch is never reached here.

Consider updating the comment and adding an assertion to capture the intended behaviour:

```suggestion
        // Second call for src_id1 returns early via the HandshakeResend path
        // (pending_challenges already has an entry) — not via rate limiting.
        let _ = server.send_who_are_you(nonce, src_id1, addr).await;
        // Rate limit entry count is unchanged because we returned early
        assert_eq!(server.whoareyou_rate_limit.len(), 2);
```

How can I resolve this? If you propose a fix, please make it concise.

Copy link
Copy Markdown
Contributor

@ElFantasma ElFantasma left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, maybe for item 1 below we can open a new ticket instead ok addressing it in this PR.

On the bot comments:

  1. session_ips never pruned (greptile) — Valid but low severity. The map is keyed by node_id (one entry per peer, overwritten on re-handshake), so growth is bounded by the number of unique peers ever seen. At ~52 bytes/entry, even 100K peers is ~5MB. Still, adding a session_ips.retain(|id, _| self.contacts.contains_key(id) || self.pending_challenges.contains_key(id)) in cleanup_stale_entries would be a clean fix.

  2. Zeroed key fallback in resolve_outbound_key (greptile) — Valid concern, but unlikely to trigger now that sessions are stored independently of contacts. If get_session_info returns None after a successful handshake, something is very wrong. A warn! log instead of silent [0; 16] would help debug if it ever happens.

  3. Misleading test comment (greptile) — Valid. The second send_who_are_you(src_id1) call hits the HandshakeResend early return (pending_challenges already has src_id1), not the rate limit check. The comment should reflect this.

Comment thread crates/networking/p2p/peer_table.rs Outdated
edg-l and others added 5 commits March 26, 2026 11:06
- Rate limit WHOAREYOU per (IP, node_id) instead of per IP, so distinct
  nodes behind the same IP (Docker, NAT) can each complete handshakes
- Pass outbound session key directly through handle_message after
  handshake completion, since the peer table may not have the contact
  yet when send_ordinary_to needs to encrypt the response
- Respond to TalkReq with empty TalkRes as required by the spec
- HandshakeResend: store encoded WHOAREYOU bytes in pending_challenges and
  re-send them verbatim when the peer retransmits its ordinary packet before
  completing the handshake, ensuring the echoed nonce matches what the peer expects
- PingMultiIP: track session source IP in session_ips and send WHOAREYOU when
  an ordinary packet arrives from a different IP than the session was established on
- PingLargeRequestID: silently drop PING/FINDNODE/TALKREQ with req_id > 8 bytes
- FindnodeZeroDistance: include local ENR in response when distance 0 is requested
- FindnodeResults: use spec-correct log-distance (bits(XOR)) instead of off-by-one
- TalkRequest: store sessions independently of contacts so sessions are retrievable
  even when no contact ENR is known yet
When a node sends a packet from a different IP than its session was
established on, the IP mismatch triggers a WHOAREYOU. But the rate
limit could suppress it if we recently sent a WHOAREYOU to the same
(IP, node_id). Since the sender proved identity via successful
decryption, this is not an amplification attack, so clear the rate
limit entry before sending.
Amplification attacks are not a concern on local networks. Skip the
rate limit for private, loopback, and link-local IPs. Update the rate
limit test to use public IPs so it still exercises the rate limiter.
@edg-l
Copy link
Copy Markdown
Contributor Author

edg-l commented Mar 26, 2026

LGTM, maybe for item 1 below we can open a new ticket instead ok addressing it in this PR.

On the bot comments:

1. **`session_ips` never pruned** (greptile) — Valid but low severity. The map is keyed by node_id (one entry per peer, overwritten on re-handshake), so growth is bounded by the number of unique peers ever seen. At ~52 bytes/entry, even 100K peers is ~5MB. Still, adding a `session_ips.retain(|id, _| self.contacts.contains_key(id) || self.pending_challenges.contains_key(id))` in `cleanup_stale_entries` would be a clean fix.

2. **Zeroed key fallback in `resolve_outbound_key`** (greptile) — Valid concern, but unlikely to trigger now that sessions are stored independently of contacts. If `get_session_info` returns None after a successful handshake, something is very wrong. A `warn!` log instead of silent `[0; 16]` would help debug if it ever happens.

3. **Misleading test comment** (greptile) — Valid. The second `send_who_are_you(src_id1)` call hits the HandshakeResend early return (pending_challenges already has src_id1), not the rate limit check. The comment should reflect this.

opened an issue #6404

- Fix utils::distance to use bits() instead of bits().saturating_sub(1),
  matching the discv5 spec and go-ethereum implementation
- Use utils::distance in get_nodes_at_distances instead of inline calc
- Add warn! logs when falling back to zeroed encryption key
@edg-l edg-l requested a review from iovoid March 26, 2026 10:18
@iovoid iovoid added this pull request to the merge queue Mar 30, 2026
Merged via the queue into main with commit bda16af Mar 30, 2026
54 checks passed
@iovoid iovoid deleted the fix/discv5-hive-test-failures branch March 30, 2026 18:43
edg-l added a commit that referenced this pull request Apr 8, 2026
## Summary

Fixes all 7 discv5 Hive test failures (Ping, PingLargeRequestID,
PingMultiIP, HandshakeResend, TalkRequest, FindnodeZeroDistance,
FindnodeResults).

### Root causes and fixes

**Rate limiting:**
- WHOAREYOU rate limit was keyed per IP, blocking distinct nodes behind
the same IP (Docker/NAT). Changed to key on `(IP, node_id)`.
- Skip WHOAREYOU rate limit entirely for private/local IPs since
amplification attacks aren't a concern on local networks.
- Bypass rate limit on session IP mismatch since the sender proved
identity via successful decryption.

**Session key handling:**
- `set_session_info` used `and_modify` which silently did nothing if the
contact didn't exist yet, causing responses to be encrypted with a
zeroed key. Fixed by passing the outbound key directly from the
handshake through to message handlers.
- Sessions now stored independently of contacts so they're retrievable
even when no contact ENR is known yet.

**Protocol compliance:**
- TalkReq was silently ignored; spec requires an empty TalkRes response.
- HandshakeResend: store encoded WHOAREYOU bytes and resend verbatim
when peer retransmits before completing handshake.
- PingMultiIP: track session source IP and send WHOAREYOU on IP
mismatch.
- PingLargeRequestID: drop messages with req_id > 8 bytes per spec.
- FindnodeZeroDistance: include local ENR when distance 0 is requested.
- FindnodeResults: use spec-correct log-distance (bits of XOR) instead
of off-by-one.

**CI:**
- Enabled discv5 in PR CI (`discv4|discv5|eth|snap`).

## Test plan

- [x] All existing discv5 unit tests pass (30/30)
- [ ] `make run-hive SIMULATION=devp2p TEST_PATTERN=discv5` passes all 7
tests
- [ ] PR CI Devp2p Hive job passes with discv5 included

---------

Co-authored-by: Esteve Soler Arderiu <azteca1998@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

L1 Ethereum client

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants