🤖 Codex Code Review
I couldn't run the Rust tests in this sandbox. (Automated review by OpenAI Codex · gpt-5.4 · custom prompt)

🤖 Claude Code Review
Greptile Summary
This PR fixes all 7 failing discv5 Hive tests by addressing a cluster of interconnected protocol compliance bugs in the discv5 server and peer table. CI is updated to include discv5 in the Hive test filter.

Key changes:
Confidence Score: 4/5
| Filename | Overview |
|---|---|
| crates/networking/p2p/discv5/server.rs | Core discv5 server: fixes WHOAREYOU rate-limit key, session-IP tracking, HandshakeResend re-send, TalkReq empty response, req_id length validation, FindNode distance-0 inclusion, and IP-mismatch re-challenge. Two unbounded maps (session_ips, fallback zeroed key) are minor follow-up concerns. |
| crates/networking/p2p/peer_table.rs | Adds a standalone sessions FxHashMap so sessions survive even without a contact ENR, fixes set_session_info silent no-op, adds GetSessionInfo call message, and fixes get_nodes_at_distances to use spec-correct XOR bit-length distance. The sessions map is not pruned on RemovePeer. |
| test/tests/p2p/discovery/discv5_server_tests.rs | Unit test updated to match new (IP, node_id)-keyed rate limit and private-IP exemption; behaviour assertions are mostly correct but one comment misidentifies the HandshakeResend path as rate limiting. |
| .github/workflows/pr-main_l1.yaml | Adds discv5 to the devp2p Hive test filter so the 7 newly fixed tests run in CI. |
Flowchart
```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Packet arrives] --> B{Has session?}
    B -- No --> C[send_who_are_you]
    B -- Yes --> D{Decryption OK?}
    D -- No --> C
    D -- Yes --> E{IP matches stored?}
    E -- No --> F[Clear rate entry, send WAY]
    E -- Yes --> G[handle_message outbound=None]
    C --> H{pending_challenges?}
    H -- Yes --> I[Resend original bytes verbatim]
    H -- No --> J{Private addr?}
    J -- Yes --> K[Send WAY + store raw bytes]
    J -- No --> L{Rate limited?}
    L -- Yes --> M[Drop]
    L -- No --> K
    subgraph Handshake flow
        N[Handshake pkt] --> O[Verify + derive keys]
        O --> P[store session in sessions map]
        P --> Q[record session_ips entry]
        Q --> R[handle_message outbound=Some]
    end
    G --> S{Message}
    R --> S
    S -- Ping --> T[Reply PONG]
    S -- FindNode --> U[Reply NODES with XOR dist + local ENR at dist 0]
    S -- TalkReq --> V[Reply empty TALKRES]
```
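The decision flow above can be condensed into a small standalone sketch. `Verdict` and `classify` are hypothetical names for illustration; the real server works on packets and async state rather than pre-computed booleans.

```rust
// Hedged sketch of the flowchart's dispatch decisions. `Verdict` and
// `classify` are illustrative names, not the actual server API.
#[derive(Debug, PartialEq)]
enum Verdict {
    Handle,          // decrypt OK, IP matches: handle_message
    SendWhoAreYou,   // start (or restart) a handshake challenge
    ResendChallenge, // re-send the stored WHOAREYOU bytes verbatim
    Drop,            // rate limited
}

fn classify(
    has_session: bool,
    decrypt_ok: bool,
    ip_matches: bool,
    pending_challenge: bool,
    private_addr: bool,
    rate_limited: bool,
) -> Verdict {
    if has_session && decrypt_ok {
        if ip_matches {
            return Verdict::Handle;
        }
        // IP mismatch: the sender proved its identity by decrypting, so the
        // rate limit entry is cleared and a fresh WHOAREYOU is sent.
        return Verdict::SendWhoAreYou;
    }
    // No session or failed decryption: challenge the sender.
    if pending_challenge {
        return Verdict::ResendChallenge;
    }
    if private_addr || !rate_limited {
        return Verdict::SendWhoAreYou;
    }
    Verdict::Drop
}
```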
Comments Outside Diff (1)
- crates/networking/p2p/peer_table.rs, lines 1456-1458 — **Standalone `sessions` map leaks entries on peer removal**

  `RemovePeer` removes from `self.peers` but not from the newly added `self.sessions`. Because `SetSessionInfo` always inserts into `sessions`, once a session is created its entry persists indefinitely even after the contact and peer are evicted. A matching removal should be added here:

  ```suggestion
  CastMessage::RemovePeer { node_id } => {
      self.peers.swap_remove(&node_id);
      self.sessions.remove(&node_id);
  }
  ```
Reviews (1): Last reviewed commit: "fix(p2p): skip WHOAREYOU rate limit for ..."
crates/networking/p2p/discv5/server.rs, line 129:

```rust
/// Tracks the source IP that each session was established from.
/// Used to detect IP changes: if a packet arrives from a different IP than the session was
/// established with, we invalidate the session by sending WHOAREYOU (PingMultiIP behaviour).
pub session_ips: FxHashMap<H256, IpAddr>,
```

**`session_ips` map is never pruned**

`session_ips` is written to on every successful handshake (`handle_handshake` line ~449) but there is no corresponding cleanup in `cleanup_stale_entries`. Over time, a long-lived node will accumulate an entry for every peer it has ever established a session with, growing without bound.

Consider adding a retention pass in `cleanup_stale_entries` tied to session lifetime, or removing the entry when the session is evicted.

crates/networking/p2p/discv5/server.rs, lines 747-751:

```rust
Ok(self
    .peer_table
    .get_session_info(*node_id)
    .await?
    .map_or([0; 16], |s| s.outbound_key))
```

**Silent fallback to zeroed key on missing session**

If `get_session_info` returns `None` (e.g., the session was evicted between the ordinary packet being decrypted and this call), `resolve_outbound_key` silently returns `[0u8; 16]`. Any response (PONG, NODES, TALKRES) encrypted with a zeroed key will be unreadable by the peer, causing silent communication failures rather than a visible error.

Consider logging a warning or returning an error instead:

```suggestion
Ok(self
    .peer_table
    .get_session_info(*node_id)
    .await?
    .map(|s| s.outbound_key)
    .unwrap_or_else(|| {
        trace!(protocol = "discv5", node_id = %node_id, "No session found, using zeroed key");
        [0; 16]
    }))
```

test/tests/p2p/discovery/discv5_server_tests.rs, lines 79-81:

```rust
// Same node_id and same IP should be rate limited
let _ = server.send_who_are_you(nonce, src_id1, addr).await;
// pending_challenges entry for src_id1 should not be updated (still the first one)
```

**Misleading comment — not actually rate limited**

The comment says `src_id1` is "rate limited", but the real reason the second call short-circuits is the HandshakeResend path: `pending_challenges` already contains an entry for `src_id1`, so `send_who_are_you` re-sends the original WHOAREYOU verbatim and returns before the rate-limit check. The rate-limit key `(addr.ip(), src_id1)` would also block a fresh WHOAREYOU, but that branch is never reached here.

Consider updating the comment and adding an assertion to capture the intended behaviour:

```suggestion
// Second call for src_id1 returns early via the HandshakeResend path
// (pending_challenges already has an entry) — not via rate limiting.
let _ = server.send_who_are_you(nonce, src_id1, addr).await;
// Rate limit entry count is unchanged because we returned early
assert_eq!(server.whoareyou_rate_limit.len(), 2);
```
ElFantasma left a comment:

LGTM, maybe for item 1 below we can open a new ticket instead of addressing it in this PR.
On the bot comments:

- **`session_ips` never pruned** (greptile) — Valid but low severity. The map is keyed by node_id (one entry per peer, overwritten on re-handshake), so growth is bounded by the number of unique peers ever seen. At ~52 bytes/entry, even 100K peers is ~5MB. Still, adding a `session_ips.retain(|id, _| self.contacts.contains_key(id) || self.pending_challenges.contains_key(id))` in `cleanup_stale_entries` would be a clean fix.
- **Zeroed key fallback in `resolve_outbound_key`** (greptile) — Valid concern, but unlikely to trigger now that sessions are stored independently of contacts. If `get_session_info` returns None after a successful handshake, something is very wrong. A `warn!` log instead of a silent `[0; 16]` would help debug if it ever happens.
- **Misleading test comment** (greptile) — Valid. The second `send_who_are_you(src_id1)` call hits the HandshakeResend early return (`pending_challenges` already has src_id1), not the rate limit check. The comment should reflect this.
- Rate limit WHOAREYOU per (IP, node_id) instead of per IP, so distinct nodes behind the same IP (Docker, NAT) can each complete handshakes
- Pass outbound session key directly through handle_message after handshake completion, since the peer table may not have the contact yet when send_ordinary_to needs to encrypt the response
- Respond to TalkReq with empty TalkRes as required by the spec
- HandshakeResend: store encoded WHOAREYOU bytes in pending_challenges and re-send them verbatim when the peer retransmits its ordinary packet before completing the handshake, ensuring the echoed nonce matches what the peer expects
- PingMultiIP: track session source IP in session_ips and send WHOAREYOU when an ordinary packet arrives from a different IP than the session was established on
- PingLargeRequestID: silently drop PING/FINDNODE/TALKREQ with req_id > 8 bytes
- FindnodeZeroDistance: include local ENR in response when distance 0 is requested
- FindnodeResults: use spec-correct log-distance (bits(XOR)) instead of off-by-one
- TalkRequest: store sessions independently of contacts so sessions are retrievable even when no contact ENR is known yet
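The FindnodeZeroDistance behaviour above can be sketched on its own: per the discv5 spec, distance 0 refers to the recipient's own record, so the local ENR is returned for it. `nodes_for_distances` and `buckets` are illustrative names, with ENRs shown as plain strings rather than real records.

```rust
use std::collections::HashMap;

// Hedged sketch of FINDNODE distance handling: distance 0 yields the
// local node's own ENR; other distances are looked up in the table.
fn nodes_for_distances(
    local_enr: &str,
    buckets: &HashMap<u32, Vec<String>>,
    distances: &[u32],
) -> Vec<String> {
    let mut out = Vec::new();
    for &d in distances {
        if d == 0 {
            // FindnodeZeroDistance: include our own record.
            out.push(local_enr.to_string());
        } else if let Some(nodes) = buckets.get(&d) {
            out.extend(nodes.iter().cloned());
        }
    }
    out
}
```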
When a node sends a packet from a different IP than its session was established on, the IP mismatch triggers a WHOAREYOU. But the rate limit could suppress it if we recently sent a WHOAREYOU to the same (IP, node_id). Since the sender proved identity via successful decryption, this is not an amplification attack, so clear the rate limit entry before sending.
Amplification attacks are not a concern on local networks. Skip the rate limit for private, loopback, and link-local IPs. Update the rate limit test to use public IPs so it still exercises the rate limiter.
Opened an issue: #6404
- Fix utils::distance to use bits() instead of bits().saturating_sub(1), matching the discv5 spec and go-ethereum implementation
- Use utils::distance in get_nodes_at_distances instead of inline calc
- Add warn! logs when falling back to zeroed encryption key
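The corrected log-distance can be shown directly: the distance between two node ids is the bit length of their XOR, so ids differing only in the lowest bit are at distance 1 and distance 0 means identical ids (the old `saturating_sub(1)` version was off by one). This standalone sketch operates on raw 32-byte ids rather than the codebase's H256 type.

```rust
// Hedged sketch of spec log-distance: bit length of (a XOR b) for
// 32-byte node ids, i.e. a value in 0..=256.
fn log_distance(a: &[u8; 32], b: &[u8; 32]) -> u32 {
    // Scan from the most-significant byte for the first difference.
    for (i, (x, y)) in a.iter().zip(b.iter()).enumerate() {
        let xor = x ^ y;
        if xor != 0 {
            // Bits remaining from this byte down, minus this byte's
            // leading zeros, gives the XOR's overall bit length.
            return (32 - i as u32) * 8 - xor.leading_zeros();
        }
    }
    0 // identical ids => distance 0
}
```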
## Summary

Fixes all 7 discv5 Hive test failures (Ping, PingLargeRequestID, PingMultiIP, HandshakeResend, TalkRequest, FindnodeZeroDistance, FindnodeResults).

### Root causes and fixes

**Rate limiting:**
- WHOAREYOU rate limit was keyed per IP, blocking distinct nodes behind the same IP (Docker/NAT). Changed to key on `(IP, node_id)`.
- Skip WHOAREYOU rate limit entirely for private/local IPs since amplification attacks aren't a concern on local networks.
- Bypass rate limit on session IP mismatch since the sender proved identity via successful decryption.

**Session key handling:**
- `set_session_info` used `and_modify` which silently did nothing if the contact didn't exist yet, causing responses to be encrypted with a zeroed key. Fixed by passing the outbound key directly from the handshake through to message handlers.
- Sessions now stored independently of contacts so they're retrievable even when no contact ENR is known yet.

**Protocol compliance:**
- TalkReq was silently ignored; spec requires an empty TalkRes response.
- HandshakeResend: store encoded WHOAREYOU bytes and resend verbatim when peer retransmits before completing handshake.
- PingMultiIP: track session source IP and send WHOAREYOU on IP mismatch.
- PingLargeRequestID: drop messages with req_id > 8 bytes per spec.
- FindnodeZeroDistance: include local ENR when distance 0 is requested.
- FindnodeResults: use spec-correct log-distance (bits of XOR) instead of off-by-one.

**CI:**
- Enabled discv5 in PR CI (`discv4|discv5|eth|snap`).

## Test plan

- [x] All existing discv5 unit tests pass (30/30)
- [ ] `make run-hive SIMULATION=devp2p TEST_PATTERN=discv5` passes all 7 tests
- [ ] PR CI Devp2p Hive job passes with discv5 included

---------

Co-authored-by: Esteve Soler Arderiu <azteca1998@users.noreply.github.com>