Search before asking
Fluss version
0.8.0 (latest release)
Please describe the bug 🐞
After restarting the tablet server, it stayed in the initialization state because the ReplicaFetcherThread was hanging. We have a 30-second timeout configured, but it appears to have no effect in this scenario. When I reviewed the logs and thread dumps on server-1, I found no connection attempts or stack traces related to this issue.
As I understand it, the fetcher's job is to keep the follower in sync with the latest log records from the leader. (Before the restart, this machine was the leader for the table.) Is that correct?
jstack output for the hanging thread:
"ReplicaFetcherThread-0-1" #76 prio=5 os_prio=0 cpu=1001.56ms elapsed=3682.24s tid=0x00007f7c78030f10 nid=0x2c77c3 waiting on condition [0x00007f7e9ea1d000]
java.lang.Thread.State: TIMED_WAITING (parking)
at jdk.internal.misc.Unsafe.park(java.base@17.0.16/Native Method)
- parking to wait for <0x00000000c30285e0> (a java.util.concurrent.CompletableFuture$Signaller)
at java.util.concurrent.locks.LockSupport.parkNanos(java.base@17.0.16/LockSupport.java:252)
at java.util.concurrent.CompletableFuture$Signaller.block(java.base@17.0.16/CompletableFuture.java:1866)
at java.util.concurrent.ForkJoinPool.unmanagedBlock(java.base@17.0.16/ForkJoinPool.java:3465)
at java.util.concurrent.ForkJoinPool.managedBlock(java.base@17.0.16/ForkJoinPool.java:3436)
at java.util.concurrent.CompletableFuture.timedGet(java.base@17.0.16/CompletableFuture.java:1939)
at java.util.concurrent.CompletableFuture.get(java.base@17.0.16/CompletableFuture.java:2095)
at org.apache.fluss.server.replica.fetcher.ReplicaFetcherThread.processFetchLogRequest(ReplicaFetcherThread.java:227)
at org.apache.fluss.server.replica.fetcher.ReplicaFetcherThread$$Lambda$546/0x00007f7e504ccb88.accept(Unknown Source)
at java.util.Optional.ifPresent(java.base@17.0.16/Optional.java:178)
at org.apache.fluss.server.replica.fetcher.ReplicaFetcherThread.maybeFetch(ReplicaFetcherThread.java:153)
at org.apache.fluss.server.replica.fetcher.ReplicaFetcherThread.doWork(ReplicaFetcherThread.java:124)
at org.apache.fluss.utils.concurrent.ShutdownableThread.run(ShutdownableThread.java:96)
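One observation on the dump: the thread is in TIMED_WAITING inside CompletableFuture.timedGet, so the get call was made with a timeout. One possible reading (an assumption, not verified against the Fluss source) is that each wait does time out, but the fetch loop immediately retries, so every thread dump catches the thread in the same frame and the timeout looks ineffective. The sketch below reproduces only that pattern with plain JDK classes; the class and method names (TimedGetRetrySketch, fetchOnce, countTimeouts) are hypothetical and not part of Fluss.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TimedGetRetrySketch {

    // Hypothetical stand-in for one fetch round: wait up to timeoutMs for a
    // response. Returns false if the wait timed out.
    static boolean fetchOnce(CompletableFuture<String> response, long timeoutMs)
            throws Exception {
        try {
            response.get(timeoutMs, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            // The timeout fired; the caller decides whether to retry.
            return false;
        }
    }

    // Runs several rounds against futures that are never completed, the way a
    // fetcher would behave if the leader never answered. While this loop runs,
    // a thread dump shows TIMED_WAITING in CompletableFuture.timedGet.
    static int countTimeouts(int rounds, long timeoutMs) throws Exception {
        int timeouts = 0;
        for (int i = 0; i < rounds; i++) {
            CompletableFuture<String> neverCompleted = new CompletableFuture<>();
            if (!fetchOnce(neverCompleted, timeoutMs)) {
                timeouts++;
            }
        }
        return timeouts;
    }

    public static void main(String[] args) throws Exception {
        // Short timeout for the demo; the issue describes a 30-second timeout.
        System.out.println("timeouts: " + countTimeouts(3, 50));
    }
}
```

If this reading applies, two dumps taken a few seconds apart would show the same stack but a different parking address in the "parking to wait for" line, because each retry waits on a new future.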
Solution
No response
Are you willing to submit a PR?