-
Notifications
You must be signed in to change notification settings - Fork 185
Description
Describe the bug
The retry mechanism introduced in PR #336 has a race condition that causes unnecessary shutdown of a newly promoted master during a normal Sentinel failover.
When a pod is in the elif branch of the split-brain check (meaning it's a replica) and becomes the new master during the sleep period, the script incorrectly triggers reinit() → shutdown because identify_redis_master() returns an empty string for master nodes (masters don't have master_host in redis-cli info output).
This results in a second unnecessary failover immediately after a successful one.
To Reproduce
- Deploy redis-ha with 3 replicas and
splitBrainDetection.enabled: true - Wait for cluster to stabilize
- Trigger a node eviction or delete the master pod
- Sentinel promotes a replica to master (first failover - correct)
- If
split-brain-fixon the newly promoted master happens to be in theelifbranch during failover, it will shutdown the new master ~10 seconds later - Sentinel triggers a second failover (unnecessary)
Expected behavior
When a pod becomes master during the retry wait period, the script should recognize this and skip the reinit(). There should be only one failover, not two.
Additional context
Root Cause Analysis
The problem is in the elif branch of fix-split-brain.sh:
elif [ "${MASTER}" ]; then
identify_redis_master
if [ "$REDIS_MASTER" != "$MASTER" ]; then
echo "Redis master and local master are not the same. waiting."
sleep {{ .Values.splitBrainDetection.retryInterval }}
identify_master # MASTER now points to THIS pod (new master)
identify_redis_master # REDIS_MASTER = "" (empty!)
if [ "${REDIS_MASTER}" != "${MASTER}" ]; then # "" != "172.20.x.x" → TRUE
reinit # ← INCORRECTLY SHUTS DOWN THE NEW MASTER
fi
fi
fiThe function identify_redis_master() extracts master_host from redis-cli info:
REDIS_MASTER=$(redis-cli -p "${REDIS_PORT}" info | grep master_host | sed 's/master_host://' | sed 's/\r//')A master node doesn't have a master_host field, so REDIS_MASTER is set to an empty string. The comparison "" != "172.20.191.124" evaluates to true, triggering an unnecessary reinit().
Evidence from logs
Sentinel logs (first failover completed successfully):
17:05:53.715 # +sdown master mymaster 172.20.117.83 6379
17:05:53.774 # +odown master mymaster 172.20.117.83 6379 #quorum 2/2
17:05:54.214 # +promoted-slave slave 172.20.191.124:6379 @ mymaster 172.20.117.83 6379
17:05:54.851 # +switch-master mymaster 172.20.117.83 6379 172.20.191.124 6379
split-brain-fix logs (showing the bug):
17:05:54 - Found redis master (172.20.117.83) ← stale, old master
17:05:54 - Redis master and local master are not the same. waiting.
↓ sleep 10 ↓
17:06:04 - Found redis master (172.20.191.124) ← this pod is now master
17:06:04 - Redis master is 172.20.191.124, expected master is . No need to reinitialize.
17:06:04 - Redis master is 172.20.191.124, expected master is , reinitializing ← BUG!
17:06:04 - Start...
Note the empty string after "expected master is" - this is REDIS_MASTER being empty because the pod is now a master.
Redis logs (unnecessary shutdown):
17:05:53.947 * MASTER MODE enabled (user request from sentinel)
17:06:04.373 * User requested shutdown...
17:06:27.150 # Redis is now ready to exit, bye bye...
Sentinel logs (second unnecessary failover):
17:06:10.550 # +sdown master mymaster 172.20.191.124 6379
17:06:10.634 # +odown master mymaster 172.20.191.124 6379 #quorum 2/2
17:06:11.785 # +switch-master mymaster 172.20.191.124 6379 172.20.177.155 6379
Proposed Fix
After the sleep and re-identification, check if this pod has become the master:
elif [ "${MASTER}" ]; then
identify_redis_master
if [ "$REDIS_MASTER" != "$MASTER" ]; then
echo "Redis master and local master are not the same. waiting."
sleep {{ .Values.splitBrainDetection.retryInterval }}
identify_master
# Check if we became master during the wait
if [ "$MASTER" = "$ANNOUNCE_IP" ]; then
echo "This pod became master during wait, skipping reinit"
continue
fi
identify_redis_master
if [ "${REDIS_MASTER}" != "${MASTER}" ]; then
echo "Redis master is ${MASTER}, expected master is ${REDIS_MASTER}, reinitializing"
reinit
fi
fi
fiChart Version
redis-ha-4.35.7