Skip to content

[chart/redis-ha][BUG] split-brain-fix causes unnecessary master shutdown during failover #383

@peter-plochan

Description

@peter-plochan

Describe the bug

The retry mechanism introduced in PR #336 has a race condition that causes unnecessary shutdown of a newly promoted master during a normal Sentinel failover.

When a pod is in the elif branch of the split-brain check (meaning it's a replica) and becomes the new master during the sleep period, the script incorrectly triggers reinit()shutdown because identify_redis_master() returns an empty string for master nodes (masters don't have master_host in redis-cli info output).

This results in a second unnecessary failover immediately after a successful one.

To Reproduce

  1. Deploy redis-ha with 3 replicas and splitBrainDetection.enabled: true
  2. Wait for cluster to stabilize
  3. Trigger a node eviction or delete the master pod
  4. Sentinel promotes a replica to master (first failover - correct)
  5. If split-brain-fix on the newly promoted master happens to be in the elif branch during failover, it will shutdown the new master ~10 seconds later
  6. Sentinel triggers a second failover (unnecessary)

Expected behavior

When a pod becomes master during the retry wait period, the script should recognize this and skip the reinit(). There should be only one failover, not two.

Additional context

Root Cause Analysis

The problem is in the elif branch of fix-split-brain.sh:

elif [ "${MASTER}" ]; then
    identify_redis_master
    if [ "$REDIS_MASTER" != "$MASTER" ]; then
        echo "Redis master and local master are not the same. waiting."
        sleep {{ .Values.splitBrainDetection.retryInterval }}
        identify_master          # MASTER now points to THIS pod (new master)
        identify_redis_master    # REDIS_MASTER = "" (empty!)
        if [ "${REDIS_MASTER}" != "${MASTER}" ]; then   # "" != "172.20.x.x" → TRUE
            reinit   # ← INCORRECTLY SHUTS DOWN THE NEW MASTER
        fi
    fi
fi

The function identify_redis_master() extracts master_host from redis-cli info:

REDIS_MASTER=$(redis-cli -p "${REDIS_PORT}" info | grep master_host | sed 's/master_host://' | sed 's/\r//')

A master node doesn't have a master_host field, so REDIS_MASTER is set to an empty string. The comparison "" != "172.20.191.124" evaluates to true, triggering an unnecessary reinit().

Evidence from logs

Sentinel logs (first failover completed successfully):

17:05:53.715 # +sdown master mymaster 172.20.117.83 6379
17:05:53.774 # +odown master mymaster 172.20.117.83 6379 #quorum 2/2
17:05:54.214 # +promoted-slave slave 172.20.191.124:6379 @ mymaster 172.20.117.83 6379
17:05:54.851 # +switch-master mymaster 172.20.117.83 6379 172.20.191.124 6379

split-brain-fix logs (showing the bug):

17:05:54 - Found redis master (172.20.117.83)           ← stale, old master
17:05:54 - Redis master and local master are not the same. waiting.
           ↓ sleep 10 ↓
17:06:04 - Found redis master (172.20.191.124)          ← this pod is now master
17:06:04 - Redis master is 172.20.191.124, expected master is . No need to reinitialize.
17:06:04 - Redis master is 172.20.191.124, expected master is , reinitializing   ← BUG!
17:06:04 - Start...

Note the empty string after "expected master is" - this is REDIS_MASTER being empty because the pod is now a master.

Redis logs (unnecessary shutdown):

17:05:53.947 * MASTER MODE enabled (user request from sentinel)
17:06:04.373 * User requested shutdown...
17:06:27.150 # Redis is now ready to exit, bye bye...

Sentinel logs (second unnecessary failover):

17:06:10.550 # +sdown master mymaster 172.20.191.124 6379
17:06:10.634 # +odown master mymaster 172.20.191.124 6379 #quorum 2/2
17:06:11.785 # +switch-master mymaster 172.20.191.124 6379 172.20.177.155 6379

Proposed Fix

After the sleep and re-identification, check if this pod has become the master:

elif [ "${MASTER}" ]; then
    identify_redis_master
    if [ "$REDIS_MASTER" != "$MASTER" ]; then
        echo "Redis master and local master are not the same. waiting."
        sleep {{ .Values.splitBrainDetection.retryInterval }}
        identify_master
        
        # Check if we became master during the wait
        if [ "$MASTER" = "$ANNOUNCE_IP" ]; then
            echo "This pod became master during wait, skipping reinit"
            continue
        fi
        
        identify_redis_master
        if [ "${REDIS_MASTER}" != "${MASTER}" ]; then
            echo "Redis master is ${MASTER}, expected master is ${REDIS_MASTER}, reinitializing"
            reinit
        fi
    fi
fi

Chart Version

  • redis-ha-4.35.7

Metadata

Metadata

Labels

bugSomething isn't working

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions