Split-Brain Recovery
What is Split-Brain?
A network partition where multiple nodes believe they are the leader, potentially accepting conflicting writes.
Symptoms
- Multiple nodes reporting
raft_state="leader"in metrics - Clients seeing different data depending on which node they connect to
tensor_chain_partition_detectedmetric > 0- Gossip reporting different membership views
How tensor_chain Prevents Split-Brain
Raft consensus requires majority quorum:
- 3 nodes: 2 required (only 1 partition can have leader)
- 5 nodes: 3 required
Split-brain can only occur with symmetric partition where old leader is isolated but doesn’t realize it.
Automatic Recovery (Partition Merge Protocol)
When partitions heal, tensor_chain automatically reconciles:
Phase 1: Detection
- Gossip detects new reachable nodes
- Compare Raft terms and log lengths
Phase 2: Leader Resolution
- Higher term wins
- If same term, longer log wins
- Losing leader steps down
Phase 3: State Reconciliation
- Semantic conflict detection on divergent entries
- Orthogonal changes: vector-add merge
- Conflicting changes: reject newer (requires manual resolution)
Phase 4: Log Synchronization
- Follower truncates divergent suffix
- Leader replicates correct entries
Phase 5: Membership Merge
- Gossip merges LWW membership states
- Higher incarnation wins for each node
Phase 6: Checkpoint
- Create snapshot post-merge for fast recovery
Manual Intervention (When Automatic Fails)
Scenario: Conflicting Writes
# 1. Identify conflicts
neumann-admin conflicts list --since "2h ago"
# 2. Export conflicting transactions
neumann-admin conflicts export --tx-id 12345 --output conflict.json
# 3. Choose resolution
neumann-admin conflicts resolve --tx-id 12345 --keep-version A
# 4. Or merge manually
neumann-admin conflicts resolve --tx-id 12345 --merge-custom merge.json
Scenario: Completely Diverged State
# 1. Stop all nodes
systemctl stop neumann
# 2. Identify authoritative node (longest log, highest term)
for node in node1 node2 node3; do
ssh $node "neumann-admin raft-info"
done
# 3. On non-authoritative nodes, clear state
rm -rf /var/lib/neumann/raft/*
# 4. Restart authoritative node first
systemctl start neumann
# 5. Restart other nodes (will sync from leader)
systemctl start neumann
Post-Recovery Verification
# Verify single leader
curl -s http://node1:9090/metrics | grep 'raft_state{state="leader"}'
# Verify all nodes in sync
neumann-admin cluster-status
# Check for unresolved conflicts
neumann-admin conflicts list
# Verify recent transactions
neumann-admin tx-log --last 100
Prevention
- Network design: Avoid symmetric partitions
- Monitoring: Alert on partition detection
- Testing: Regularly run chaos engineering tests
- Backups: Regular snapshots enable point-in-time recovery