fix(replication): prevent WAL exhaustion from slow consumers
The replication feed thread could block indefinitely when sending data
to a slow replica. If the replica wasn't consuming data fast enough,
the TCP send buffer would fill and the feed thread would block on
write() with no timeout. During this time, WAL files would rotate and
be pruned, leaving the replica's sequence unavailable when the thread
eventually unblocked or the connection dropped.
This commit adds three mechanisms to address the issue:
1. Socket send timeout: New SockSendWithTimeout() function that uses
poll() to wait for socket writability with a configurable timeout
(default 30 seconds). This prevents indefinite blocking.
2. Replication lag detection: At the start of each loop iteration,
check if the replica has fallen too far behind (configurable via
max-replication-lag). If exceeded, disconnect the slow consumer
before the WAL it still needs is pruned, so that it can resume with a
partial sync (psync) on reconnect. Disabled by default (0); set to a
positive value to enable.
3. Exponential backoff on reconnection: When a replica is disconnected,
it now waits with exponential backoff (1s, 2s, 4s... up to 60s) before
reconnecting. This prevents rapid reconnection loops for persistently
slow replicas. The backoff resets on successful psync or fullsync.
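For illustration, the poll()-based send in mechanism 1 could look roughly like the sketch below. This is not the code from this commit: the name SendAllWithTimeout, its signature, and its return convention are assumptions made for the example, and the real SockSendWithTimeout() may differ. The timeout here bounds the whole call rather than each individual write.

```cpp
#include <poll.h>
#include <sys/socket.h>
#include <sys/types.h>

#include <cerrno>
#include <chrono>
#include <cstddef>

// Hypothetical sketch, not the actual SockSendWithTimeout() from this commit.
// Sends all `len` bytes or gives up once `timeout_ms` has elapsed in total.
// Returns 0 on success, -1 on failure (errno is ETIMEDOUT on timeout).
// MSG_NOSIGNAL is Linux-specific; other platforms need a different SIGPIPE guard.
int SendAllWithTimeout(int fd, const char *data, size_t len, int timeout_ms) {
  using Clock = std::chrono::steady_clock;
  const auto deadline = Clock::now() + std::chrono::milliseconds(timeout_ms);
  size_t sent = 0;

  while (sent < len) {
    const auto remaining =
        std::chrono::duration_cast<std::chrono::milliseconds>(deadline - Clock::now()).count();
    if (remaining <= 0) {
      errno = ETIMEDOUT;
      return -1;
    }

    // Wait until the kernel send buffer has room, but never past the deadline.
    pollfd pfd{};
    pfd.fd = fd;
    pfd.events = POLLOUT;
    const int rc = poll(&pfd, 1, static_cast<int>(remaining));
    if (rc == 0) {
      errno = ETIMEDOUT;
      return -1;
    }
    if (rc < 0) {
      if (errno == EINTR) continue;  // interrupted by a signal: retry
      return -1;
    }
    if (pfd.revents & (POLLERR | POLLHUP | POLLNVAL)) {
      errno = EPIPE;
      return -1;
    }

    // Non-blocking send: writes what fits and returns instead of stalling.
    const ssize_t n = send(fd, data + sent, len - sent, MSG_DONTWAIT | MSG_NOSIGNAL);
    if (n < 0) {
      if (errno == EAGAIN || errno == EWOULDBLOCK || errno == EINTR) continue;
      return -1;
    }
    sent += static_cast<size_t>(n);
  }
  return 0;
}
```

With a helper of this shape, a replica that stops reading causes the call to fail with ETIMEDOUT after the configured interval, and the feed thread can drop the connection instead of blocking while WAL files rotate.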
New configuration options:
- max-replication-lag: Maximum sequence lag before disconnecting (default: 0 = disabled)
- replication-send-timeout-ms: Socket send timeout in ms (default: 30000)
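As a rough sketch of how max-replication-lag and the reconnection backoff described above behave, the two decisions can be written as small pure functions. The names ExceedsMaxLag and BackoffSeconds, and the idea of counting consecutive failed attempts, are assumptions for illustration and are not taken from this commit.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical helper: should the feed thread disconnect this replica?
// max_lag == 0 means the check is disabled, matching the documented default.
bool ExceedsMaxLag(uint64_t latest_seq, uint64_t replica_seq, uint64_t max_lag) {
  if (max_lag == 0) return false;
  if (replica_seq >= latest_seq) return false;  // replica is caught up
  return latest_seq - replica_seq > max_lag;
}

// Hypothetical helper: delay in seconds before the next reconnect attempt.
// `failures` counts consecutive failed attempts and would be reset to 0
// after a successful psync or fullsync. Yields 1, 2, 4, ... capped at 60.
uint32_t BackoffSeconds(uint32_t failures) {
  constexpr uint32_t kMaxDelaySeconds = 60;
  if (failures >= 6) return kMaxDelaySeconds;  // 2^6 = 64 already exceeds the cap
  return std::min<uint32_t>(1u << failures, kMaxDelaySeconds);
}
```

Under these assumptions, a replica 5000 sequences behind with max-replication-lag set to 1000 would be disconnected, and its reconnect attempts would then be spaced 1s, 2s, 4s, 8s, 16s, 32s, 60s, 60s, and so on until a sync succeeds.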
Fixes #3356