
fix(pd): add timeout and null-safety to getLeaderGrpcAddress()#2961

Merged
imbajin merged 6 commits into apache:master from bitflicker64:fix/raft-engine-leader-grpc-address
Mar 20, 2026

Conversation

Contributor

@bitflicker64 bitflicker64 commented Mar 5, 2026

Purpose of the PR

In a 3-node PD cluster running in Docker bridge network mode, getLeaderGrpcAddress() makes a bolt RPC call to discover the leader's gRPC address when the current node is a follower. In bridge mode this call fails: the TCP connection is established, but the bolt RPC response never arrives, so CompletableFuture.get() returns null and the subsequent dereference throws a NullPointerException.
This causes:

  1. redirectToLeader() fails with an NPE
  2. Store registration requests landing on follower PDs are never forwarded
  3. Stores register, but partitions are never distributed (partitionCount:0)
  4. HugeGraph servers are stuck indefinitely in a DEADLINE_EXCEEDED retry loop

The cluster only works when pd0 wins the raft leader election (since isLeader() returns true and the broken code path is skipped); if pd1 or pd2 wins, the NPE fires on every redirect attempt.

Related PR: #2952

Main Changes

  • Cache leader PeerId after waitingForLeader() and null-check to prevent NPE when leader election times out
  • Wire config.getRpcTimeout() into RaftRpcClient's RpcOptions so Bolt transport timeout is consistent with future.get() caller timeout
  • Add bounded timeout to the bolt RPC call using config.getRpcTimeout() instead of unbounded .get()
  • Split catch (TimeoutException | ExecutionException) into separate blocks to avoid double-wrapping root cause
  • Restore best-effort fallback to leaderIp + grpcPort with warn logging when RPC fails, times out, or returns null
  • Improve waitingForLeader() to respect sub-second timeouts using Math.min(1000, remaining)
  • Add unit tests covering all failure paths with mocked RaftRpcClient
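The changes above can be sketched as a single flow. This is a minimal, self-contained approximation with hypothetical names — `RpcClient`, `resolveLeaderAddress`, and the flattened parameter list stand in for RaftEngine's actual fields and the real bolt client — not the PD code itself:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class LeaderAddressSketch {

    // Hypothetical stand-in for the bolt RPC client used by RaftEngine
    interface RpcClient {
        CompletableFuture<String> getLeaderGrpcAddress(String peerId);
    }

    static String resolveLeaderAddress(RpcClient client, String leaderPeerId,
                                       String leaderIp, int grpcPort, long rpcTimeoutMs) {
        // Null-check the cached leader PeerId: leader election may have timed out
        if (leaderPeerId == null) {
            return null;
        }
        try {
            // Bounded wait instead of an unbounded .get()
            String addr = client.getLeaderGrpcAddress(leaderPeerId)
                                .get(rpcTimeoutMs, TimeUnit.MILLISECONDS);
            if (addr != null) {
                return addr;
            }
        } catch (TimeoutException e) {
            // RPC stalled (e.g. the bridge-mode hang): fall through to fallback
        } catch (ExecutionException e) {
            // RPC failed outright: fall through to fallback
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        // Best-effort fallback: derive from raft endpoint IP + local gRPC port
        return leaderIp + ":" + grpcPort;
    }

    public static void main(String[] args) {
        // Simulate a stalled RPC whose future never completes
        RpcClient stalled = peer -> new CompletableFuture<>();
        System.out.println(resolveLeaderAddress(stalled, "pd1", "172.20.0.10", 8686, 100));
        // falls back to "172.20.0.10:8686" after the 100 ms bounded wait
    }
}
```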

Verifying these changes

  • Trivial rework / code cleanup without any test coverage. (No Need)
  • Already covered by existing tests, such as (please modify tests here).
  • Done testing and can be verified as follows:
    • Deploy 3-node PD cluster in Docker bridge network mode
    • Verify cluster works regardless of which PD node wins raft leader election
    • Confirm stores show partitionCount:12 on all 3 nodes when pd1 or pd2 is leader
    • Confirm no NPE in pd logs at getLeaderGrpcAddress
    • Unit tests: RaftEngineLeaderAddressTest covers timeout, RPC exception, null response, and null leader branches

Does this PR potentially affect the following parts?

Documentation Status

  • Doc - TODO
  • Doc - Done
  • Doc - No Need

@bitflicker64
Contributor Author

How I tested:

  1. Built a local Docker image from source with this fix applied
  2. Brought up the 3-node cluster (3 PD + 3 Store + 3 Server) in bridge network mode
  3. Confirmed cluster was healthy with pd0 as initial leader
  4. Restarted pd0 to force a new leader election — pd1 won
  5. Checked partition distribution and cluster health with pd1 as leader

Results with pd1 as leader:

partitionCount:12 on all 3 stores ✅
leaderCount:12 on all 3 stores ✅
{"graphs":["hugegraph"]} ✅
All 9 containers healthy ✅

Confirmed fallback triggered in pd1 logs:

[WARN] RaftEngine - Failed to get leader gRPC address via RPC, falling back to endpoint derivation
java.util.concurrent.ExecutionException: com.alipay.remoting.exception.RemotingException:
Create connection failed. The address is 172.20.0.10:8610
    at RaftEngine.getLeaderGrpcAddress(RaftEngine.java:247)
    at PDService.redirectToLeader(PDService.java:1275)

Before this fix: the RPC returns null → NPE → follower PDs can't redirect requests to the leader → the cluster only worked when pd0 won leader election, since it never hit the broken code path.

After this fix: the RPC failure is caught within a bounded timeout → fallback to endpoint IP + gRPC port derivation → follower PDs correctly redirect to the leader regardless of which PD node wins election.
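The "caught with bounded timeout → fallback" flow hinges on splitting the catch blocks so the logged root cause is the real transport error rather than a double-wrapped ExecutionException. A minimal sketch, where `fallback()` and the class name are hypothetical stand-ins for the endpoint derivation:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class SplitCatchSketch {

    static String fallback() {
        return "172.20.0.10:8686"; // hypothetical derived endpoint IP + gRPC port
    }

    static String call(CompletableFuture<String> future, long timeoutMs) {
        try {
            return future.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // The timeout itself is the root cause; there is nothing to unwrap
            System.out.println("timed out after " + timeoutMs + " ms");
        } catch (ExecutionException e) {
            // Log the wrapped transport error, not the ExecutionException shell
            System.out.println("rpc failed: " + e.getCause().getMessage());
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return fallback();
    }

    public static void main(String[] args) {
        CompletableFuture<String> failed = new CompletableFuture<>();
        failed.completeExceptionally(new RuntimeException("Create connection failed"));
        System.out.println(call(failed, 100));
        // prints "rpc failed: Create connection failed", then the fallback address
    }
}
```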

Related docker bridge networking PR: #2952

Contributor

Copilot AI left a comment


Pull request overview

This PR addresses PD follower redirects failing in Docker bridge mode by making RaftEngine.getLeaderGrpcAddress() resilient to stalled/failed bolt RPC lookups, preventing NPEs and enabling store registration/partition distribution regardless of which PD becomes Raft leader.

Changes:

  • Add a bounded timeout (config.getRpcTimeout()) to the bolt RPC CompletableFuture.get(...).
  • Add null-safety around the RPC response before reading getGrpcAddress().
  • Add a fallback that derives the leader gRPC address from the Raft endpoint IP/host plus the configured gRPC port.



// Fallback: derive from raft endpoint IP + local gRPC port (best effort)
String leaderIp = raftNode.getLeaderId().getEndpoint().getIp();
return leaderIp + ":" + config.getGrpcPort();
Member


‼️ This fallback is still incorrect for clusters where PD nodes use different grpc.port values. In this repo's own multi-node test configs, application-server1.yml, application-server2.yml, and application-server3.yml advertise 8686, 8687, and 8688 respectively, so a follower on 8687 will redirect to leader-ip:8687 even when the elected leader is actually listening on 8686 or 8688. That turns the original NPE into a silent misroute.

If we can't recover the leader's advertised gRPC endpoint here, I think it's safer to fail fast than to synthesize an address from the local port, for example:

Suggested change
return leaderIp + ":" + config.getGrpcPort();
} catch (TimeoutException | ExecutionException e) {
    throw new ExecutionException(
        String.format("Failed to resolve leader gRPC address for %s", raftNode.getLeaderId()),
        e);
}

A more complete fix would need a source of truth for the leader's actual grpcAddress, not the local node's port.

Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 3 comments.




The bolt RPC call in getLeaderGrpcAddress() returns null in Docker
bridge network mode, causing NPE when a follower PD node attempts
to discover the leader's gRPC address. This breaks store registration
and partition distribution when any node other than pd0 wins the
raft leader election.

Add a bounded timeout using the configured rpc-timeout, null-check
the RPC response, and fall back to deriving the address from the
raft endpoint IP when the RPC fails.

Closes apache#2959
@bitflicker64 bitflicker64 force-pushed the fix/raft-engine-leader-grpc-address branch from ed30777 to 8576da0 Compare March 17, 2026 19:05
@imbajin imbajin requested a review from Copilot March 18, 2026 06:12
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 2 comments.




@bitflicker64 bitflicker64 requested a review from Copilot March 19, 2026 08:24
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.



@bitflicker64 bitflicker64 requested a review from Copilot March 19, 2026 14:12
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.



…log verbosity

- Use Math.min(1000, remaining) so waitingForLeader() respects sub-second timeouts
- Safe change: all callers use 5000ms+ and no notifyAll() exists on RaftEngine
- Downgrade derived-address log from warn to info to reduce noise on hot paths
- Fix misleading comment in test setUp()
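The Math.min(1000, remaining) change can be sketched like this — a simplified polling loop with hypothetical names (the real waitingForLeader() checks raftNode.getLeaderId() on a RaftEngine instance), not the actual implementation:

```java
public class WaitLeaderSketch {

    // Hypothetical stand-in for raftNode.getLeaderId(); set by another thread
    static volatile String leaderId = null;

    // Poll for an elected leader, sleeping at most 1 s per round;
    // Math.min(1000, remaining) lets sub-second timeouts expire on time
    static boolean waitingForLeader(long timeoutMs) throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (leaderId == null) {
            long remaining = deadline - System.currentTimeMillis();
            if (remaining <= 0) {
                return false; // timed out without a leader
            }
            Thread.sleep(Math.min(1000, remaining));
        }
        return true;
    }

    public static void main(String[] args) throws InterruptedException {
        long start = System.currentTimeMillis();
        boolean elected = waitingForLeader(200); // sub-second timeout
        long elapsed = System.currentTimeMillis() - start;
        // With no leader, this returns after ~200 ms rather than a full 1 s tick
        System.out.println(elected + ", waited < 1s: " + (elapsed < 1000));
    }
}
```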
@bitflicker64 bitflicker64 requested a review from imbajin March 19, 2026 14:33
Member

@imbajin imbajin left a comment


THX

@imbajin imbajin merged commit ab12b35 into apache:master Mar 20, 2026
13 checks passed


Development

Successfully merging this pull request may close these issues.

[Bug] 3-node PD cluster fails when pd0 is not raft leader — getLeaderGrpcAddress() NPE in bridge network mode

3 participants