Node Recovery

Recovery Scenarios

Scenario	Recovery Method	Data Loss Risk
Process crash	WAL replay	None
Node reboot	WAL replay	None
Disk failure	Snapshot + log from leader	Possible (uncommitted)
Data corruption	Snapshot from leader	Possible (uncommitted)

Automatic Recovery Flow

flowchart TD
    A[Node Starts] --> B{WAL Exists?}
    B -->|Yes| C[Replay WAL]
    B -->|No| D[Request Snapshot]
    C --> E{Caught Up?}
    E -->|Yes| F[Join as Follower]
    E -->|No| D
    D --> G[Install Snapshot]
    G --> H[Replay Logs After Snapshot]
    H --> F
    F --> I[Healthy]
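
Before restarting a node, it can be useful to predict which branch of this flow it will take first. The sketch below only checks whether the WAL directory contains anything. The function name `predict_recovery_branch` is hypothetical, and treating an empty or missing directory as "no WAL" is an assumption about how the node decides, not confirmed behavior.

```shell
# Predict the first branch of the recovery flow for a given WAL directory.
# Prints "replay-wal" if the directory holds at least one entry,
# otherwise "request-snapshot" (assumption: empty or missing means no WAL).
predict_recovery_branch() {
    wal_dir=$1
    if [ -d "$wal_dir" ] && [ -n "$(ls -A "$wal_dir" 2>/dev/null)" ]; then
        echo "replay-wal"
    else
        echo "request-snapshot"
    fi
}

# Example:
#   predict_recovery_branch /var/lib/neumann/raft/wal
```

A "replay-wal" result does not rule out a later snapshot request, since the flow falls back to a snapshot when the replayed WAL leaves the node behind.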

Manual Recovery Steps

1. Crash Recovery (WAL Intact)

# Just restart - WAL replay is automatic
systemctl start neumann

# Monitor recovery
journalctl -u neumann -f | grep -E "(recovery|replay|caught_up)"
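
For unattended restarts, following the journal by hand is not always practical. A minimal polling helper along these lines could block until a check succeeds or a timeout expires. `wait_until` is a hypothetical name, not part of Neumann or systemd.

```shell
# wait_until TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
# Re-runs COMMAND until it succeeds. Returns 0 on success, or 1 once
# at least TIMEOUT_SECONDS of sleeping have accumulated without success.
wait_until() {
    wu_timeout=$1
    wu_interval=$2
    shift 2
    wu_elapsed=0
    until "$@"; do
        if [ "$wu_elapsed" -ge "$wu_timeout" ]; then
            return 1
        fi
        sleep "$wu_interval"
        wu_elapsed=$((wu_elapsed + wu_interval))
    done
    return 0
}

# Example: wait up to 10 minutes for the unit to report active.
#   wait_until 600 5 systemctl is-active --quiet neumann
```

Note that `systemctl is-active` only confirms the process is running, not that replay has finished, so in practice the command passed in would more likely be a lag check against the metrics endpoint.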

2. Recovery from Snapshot

# 1. Stop node
systemctl stop neumann

# 2. Clear corrupted state
rm -rf /var/lib/neumann/raft/wal/*

# 3. Inspect local snapshots (keep them if valid, otherwise clear them)
ls -la /var/lib/neumann/raft/snapshots/

# 4. Restart - will fetch snapshot from leader
systemctl start neumann

# 5. Monitor snapshot transfer
watch -n1 'curl -s localhost:9090/metrics | grep snapshot_transfer'

3. Full State Rebuild

# 1. Stop node
systemctl stop neumann

# 2. Clear all Raft state
rm -rf /var/lib/neumann/raft/*

# 3. Clear tensor store (will be rebuilt)
rm -rf /var/lib/neumann/store/*

# 4. Restart
systemctl start neumann
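
Because a full rebuild deletes everything on the node, a more conservative variant is to move the state aside instead of removing it, so it can still be inspected or restored. The sketch below assumes GNU or BSD `find` (for `-mindepth`/`-maxdepth`). `set_aside_state` and the backup location `/var/backups/neumann` are hypothetical.

```shell
# set_aside_state SRC_DIR BACKUP_ROOT
# Moves everything inside SRC_DIR into a new timestamped directory under
# BACKUP_ROOT, leaving SRC_DIR itself in place but empty.
# Prints the path of the backup directory.
set_aside_state() {
    sa_src=$1
    sa_root=$2
    sa_dest="$sa_root/$(basename "$sa_src")-$(date +%Y%m%dT%H%M%S)"
    mkdir -p "$sa_dest" || return 1
    # find also picks up dotfiles, which a plain "$sa_src"/* glob would skip.
    find "$sa_src" -mindepth 1 -maxdepth 1 -exec mv {} "$sa_dest"/ \; || return 1
    echo "$sa_dest"
}

# Example (hypothetical backup location), with the node stopped:
#   set_aside_state /var/lib/neumann/raft  /var/backups/neumann
#   set_aside_state /var/lib/neumann/store /var/backups/neumann
```

If the backup root is on a different filesystem, `mv` copies the data, so it needs enough free space for a full copy.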

Monitoring Recovery Progress

# Check sync status
curl -s localhost:9090/metrics | grep -E "(commit_index|applied_index|leader_commit)"

# Calculate lag (match sample lines only, so "# HELP"/"# TYPE" comments are skipped)
LEADER_COMMIT=$(curl -s http://leader:9090/metrics | awk '/^tensor_chain_commit_index/ {print $2; exit}')
MY_APPLIED=$(curl -s localhost:9090/metrics | awk '/^tensor_chain_applied_index/ {print $2; exit}')
echo "Lag: $((LEADER_COMMIT - MY_APPLIED)) entries"

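# Watch applied_index advance; the change between refreshes divided by 5
# gives the replay rate (entries/sec), and lag / rate the time to catch up
watch -n5 'curl -s localhost:9090/metrics | grep tensor_chain_applied_index'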
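
The lag and catch-up estimate can also be computed rather than read off the screen. The sketch below assumes the metrics endpoint serves the Prometheus text format with integer-valued samples and no spaces inside label values. `metric_value` and `catchup_eta` are hypothetical helper names.

```shell
# metric_value NAME
# Reads Prometheus text format on stdin and prints the value of the first
# sample whose metric name is exactly NAME. Comment lines are skipped.
metric_value() {
    awk -v name="$1" '
        /^#/ { next }
        {
            metric = $1
            sub(/\{.*/, "", metric)   # drop any {label="..."} suffix
            if (metric == name) { print $2; exit }
        }'
}

# catchup_eta LAG APPLIED_BEFORE APPLIED_AFTER INTERVAL_SECONDS
# Estimates time to catch up from two applied_index samples taken
# INTERVAL_SECONDS apart, using integer arithmetic.
catchup_eta() {
    ce_lag=$1
    ce_delta=$(( $3 - $2 ))
    if [ "$ce_delta" -le 0 ]; then
        echo "no progress in ${4}s; ETA unknown"
    else
        echo "applied $ce_delta entries in ${4}s; ETA about $(( ce_lag * $4 / ce_delta ))s"
    fi
}

# Example:
#   before=$(curl -s localhost:9090/metrics | metric_value tensor_chain_applied_index)
#   sleep 5
#   after=$(curl -s localhost:9090/metrics | metric_value tensor_chain_applied_index)
#   leader=$(curl -s http://leader:9090/metrics | metric_value tensor_chain_commit_index)
#   catchup_eta "$((leader - after))" "$before" "$after" 5
```

The estimate treats the leader's commit index as fixed, so on a cluster that is still accepting writes it is a lower bound.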

Troubleshooting

Recovery Stuck

Symptom: Node not catching up, applied_index not increasing
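
To confirm this symptom rather than judge it by eye, a few consecutive samples of applied_index can be compared. The sketch below is one way to do that. `is_stalled` is a hypothetical helper, and the number of samples that counts as "stuck" is a judgment call.

```shell
# is_stalled MIN_SAMPLES
# Reads applied_index samples on stdin, one per line. Succeeds (exit 0)
# only if at least MIN_SAMPLES were read and the last MIN_SAMPLES of them
# are all equal; otherwise exits 1.
is_stalled() {
    awk -v n="$1" '
        NF { v[++count] = $1 }
        END {
            if (count < n) exit 1
            for (i = count - n + 2; i <= count; i++)
                if (v[i] != v[count - n + 1]) exit 1
            exit 0
        }'
}

# Example: take 6 samples, 10 seconds apart, then test them.
#   for i in 1 2 3 4 5 6; do
#       curl -s localhost:9090/metrics | awk '/^tensor_chain_applied_index/ {print $2; exit}'
#       sleep 10
#   done | is_stalled 6 && echo "applied_index is not advancing"
```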

Causes:

Network issue to leader
Leader overloaded
Snapshot transfer failing

Solution:

# Check leader connectivity
curl -v http://leader:7878/health

# Check snapshot transfer errors
grep "snapshot" /var/log/neumann/tensor_chain.log | grep -i error

# Manually trigger snapshot
curl -X POST http://leader:9090/admin/snapshot

Repeated Crashes During Recovery

Symptom: Node crashes while replaying WAL

Causes:

Corrupted WAL entry
Out of memory during replay
Incompatible schema

Solution:

# Skip corrupted entries (data loss!)
neumann-admin wal-repair --skip-corrupted

# Or full rebuild (stop the node before clearing state)
systemctl stop neumann
rm -rf /var/lib/neumann/raft/*
systemctl start neumann
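
Before choosing either option, it is worth ruling out the out-of-memory cause, since repairing or rebuilding the WAL will not help if the kernel is killing the process. The sketch below counts OOM-killer entries for a process name in kernel log text. `oom_kills_for` is a hypothetical helper, and the process name `neumann` is an assumption based on the service name.

```shell
# oom_kills_for PROCESS_NAME
# Reads kernel log text on stdin and prints how many OOM-killer lines
# name PROCESS_NAME. Exits non-zero when the count is 0 (grep behavior).
oom_kills_for() {
    grep -Eic "out of memory: kill(ed)? process [0-9]+ \($1\)"
}

# Example:
#   journalctl -k --no-pager | oom_kills_for neumann
```

A non-zero count points to memory pressure during replay rather than a corrupted entry. In that case giving the node more memory is likely a better first step than skipping entries.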
