Cluster Upgrade
This runbook covers upgrading tensor_chain clusters with minimal downtime.
Upgrade Types
| Type | Downtime | Complexity | Use Case |
|---|---|---|---|
| Rolling | None | Low | Minor version upgrades |
| Blue-Green | Minimal | Medium | Major version upgrades |
| Canary | None | High | Risk-sensitive environments |
Rolling Upgrade
Upgrade nodes one at a time while maintaining cluster availability.
Prerequisites
- Cluster has 3+ nodes so quorum survives one node restarting at a time (see the check below)
- New version is backwards compatible with current version
- Upgrade tested in staging environment
- Backup of cluster state completed
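The node-count prerequisite can be verified mechanically before starting; a minimal sketch, assuming `neumann cluster members --format json` returns a JSON array (the same format the automation script at the end of this runbook relies on):
# A 3-node cluster keeps quorum while one node restarts; refuse otherwise
NODE_COUNT=$(neumann cluster members --format json | jq length)
if [ "$NODE_COUNT" -lt 3 ]; then
  echo "refusing rolling upgrade: $NODE_COUNT nodes, need 3+ for quorum" >&2
  exit 1
fi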
Symptoms (Why Upgrade)
- Security patches available
- New features required
- Bug fixes needed
- Performance improvements available
Upgrade Sequence
sequenceDiagram
participant F1 as Follower 1
participant F2 as Follower 2
participant L as Leader
participant A as Admin
Note over A: Start rolling upgrade
A->>F1: upgrade
F1->>F1: restart with new version
F1->>L: rejoin cluster
Note over F1: Follower 1 upgraded
A->>F2: upgrade
F2->>F2: restart with new version
F2->>L: rejoin cluster
Note over F2: Follower 2 upgraded
A->>L: transfer leadership
L->>F1: leadership transferred
A->>L: upgrade (now follower)
L->>F1: rejoin cluster
Note over L: All nodes upgraded
Procedure
Step 1: Pre-upgrade checks
# Verify cluster health
neumann cluster health
# Check current versions
neumann cluster versions
# Verify backup is current
neumann backup status
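These three checks can be combined into a single gate that stops the upgrade on the first failure; a sketch, assuming each subcommand exits non-zero when its check fails:
# Run all pre-flight checks and abort the upgrade run on the first failure
for check in "cluster health" "cluster versions" "backup status"; do
  if ! neumann $check; then
    echo "pre-flight failed: neumann $check" >&2
    exit 1
  fi
done
echo "pre-upgrade checks passed"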
Step 2: Upgrade followers first
# For each follower node:
# 1. Drain the node
neumann node drain node2
# 2. Stop the service
ssh node2 "systemctl stop neumann"
# 3. Upgrade the binary
ssh node2 "cargo install neumann --version X.Y.Z"
# 4. Start the service
ssh node2 "systemctl start neumann"
# 5. Verify rejoin
neumann cluster members
# 6. Wait for replication catch-up
neumann metrics node2 | grep replication_lag
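The catch-up wait in step 6 can be scripted rather than eyeballed; a sketch, assuming `neumann metrics node2` prints `replication_lag` as a name/value pair in milliseconds (the exact output format is an assumption):
# Poll until replication lag drops below 100ms, giving up after ~2 minutes
for attempt in $(seq 1 24); do
  LAG=$(neumann metrics node2 | awk '/replication_lag/ {print $2}')
  # Strip any decimal part so the shell integer comparison works
  if [ -n "$LAG" ] && [ "${LAG%.*}" -lt 100 ]; then
    echo "node2 caught up (replication_lag=$LAG)"
    break
  fi
  sleep 5
done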
Step 3: Upgrade the leader
# Transfer leadership to an upgraded follower
neumann raft transfer-leadership --to node2
# Verify leadership transferred
neumann raft status
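# Leadership transfer is asynchronous in most Raft implementations; poll
# until another node reports as leader before stopping node1 (a sketch,
# reusing the JSON status format from the automation script below)
for attempt in $(seq 1 12); do
  CURRENT_LEADER=$(neumann raft status --format json | jq -r '.leader')
  [ "$CURRENT_LEADER" != "node1" ] && break
  sleep 5
done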
# Now upgrade the old leader (same steps as followers)
neumann node drain node1
ssh node1 "systemctl stop neumann"
ssh node1 "cargo install neumann --version X.Y.Z"
ssh node1 "systemctl start neumann"
Step 4: Post-upgrade verification
# Verify all nodes on new version
neumann cluster versions
# Expected output:
# ID VERSION
# node1 X.Y.Z
# node2 X.Y.Z
# node3 X.Y.Z
# Run health checks
neumann cluster health
# Verify functionality with test transactions
neumann test-transaction
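The version check can be an assertion instead of a visual comparison; a sketch that parses the two-column table shown above (X.Y.Z stands for the target version, as elsewhere in this runbook):
# Fail if any node still reports a version other than the target
TARGET=X.Y.Z
STRAGGLERS=$(neumann cluster versions | awk -v v="$TARGET" 'NR > 1 && $2 != v {print $1}')
if [ -n "$STRAGGLERS" ]; then
  echo "nodes not on $TARGET: $STRAGGLERS" >&2
  exit 1
fi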
Version Compatibility
Compatibility Matrix
| From Version | To Version | Compatible | Notes |
|---|---|---|---|
| 0.9.x | 0.10.x | Yes | Rolling upgrade supported |
| 0.10.x | 0.11.x | Yes | Rolling upgrade supported |
| 0.8.x | 0.10.x | No | Blue-green required |
| 0.x.x | 1.0.x | No | Blue-green required |
Version Skew Policy
- Maximum skew: 1 minor version during rolling upgrades (see the check below)
- Leader version: Must be <= follower versions while a rollout is in progress, since followers are upgraded first
- Upgrade order: Always followers first, then leader
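The skew rule can be enforced in scripts before attempting a rolling upgrade. A minimal sketch that compares plain MAJOR.MINOR.PATCH strings (pre-release suffixes are not handled):
# Succeeds only when the target is the same major and at most one minor ahead
skew_ok() {
  local from_major=${1%%.*} to_major=${2%%.*}
  local from_minor to_minor
  from_minor=$(echo "$1" | cut -d. -f2)
  to_minor=$(echo "$2" | cut -d. -f2)
  [ "$from_major" = "$to_major" ] &&
    [ "$((to_minor - from_minor))" -ge 0 ] &&
    [ "$((to_minor - from_minor))" -le 1 ]
}
skew_ok 0.10.3 0.11.0 && echo "rolling upgrade OK" || echo "blue-green required"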
Rollback Procedure
If issues are discovered after upgrade:
Symptoms Requiring Rollback
- Transaction failures after upgrade
- Performance degradation
- Consensus failures
- Data corruption detected
Rollback Steps
# 1. Stop accepting new requests
neumann cluster pause
# 2. Identify problematic nodes
neumann cluster health --verbose
# 3. Rollback affected nodes
ssh node1 "cargo install neumann --version X.Y.Z-OLD"
ssh node1 "systemctl restart neumann"
# 4. Verify rollback
neumann cluster versions
# 5. Resume operations
neumann cluster resume
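When more than one node is affected, steps 3-4 can be applied in a loop, with step 5 run once at the end; a sketch where the node list and previous version are operator-supplied placeholders:
# Roll back each affected node in turn, letting the cluster settle between nodes
OLD_VERSION=X.Y.Z-OLD
for node in node1 node2; do
  ssh $node "cargo install neumann --version $OLD_VERSION"
  ssh $node "systemctl restart neumann"
  neumann cluster wait-healthy --timeout 120
done
neumann cluster versions
neumann cluster resume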
Rollback Limitations
- Cannot rollback if schema changes were applied
- Cannot rollback if new features were used
- Always test rollback in staging first
Canary Upgrade
For risk-sensitive environments, upgrade a single node first and monitor.
Procedure
# 1. Select canary node (typically a follower)
CANARY=node3
# 2. Upgrade canary
neumann node drain $CANARY
ssh $CANARY "cargo install neumann --version X.Y.Z"
ssh $CANARY "systemctl restart neumann"
# 3. Monitor canary for 24-48 hours
neumann metrics $CANARY --watch
# 4. Compare metrics with non-canary nodes
neumann metrics compare $CANARY node1
# 5. If healthy, proceed with rolling upgrade
# If unhealthy, rollback canary
Canary Success Criteria
| Metric | Threshold | Action if Exceeded |
|---|---|---|
| Error rate | < 0.1% | Rollback |
| Latency p99 | < 2x baseline | Investigate |
| Replication lag | < 100ms | Investigate |
| Memory usage | < 1.5x baseline | Investigate |
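During the 24-48 hour soak these criteria can be evaluated mechanically rather than by eyeball. A sketch, assuming a hypothetical JSON metrics output with `error_rate` (fractional) and `replication_lag_ms` fields; neither field name is documented, so adjust to the real output:
# Flag canary threshold breaches from the table above (0.1% = 0.001)
METRICS=$(neumann metrics $CANARY --format json)
if [ "$(echo "$METRICS" | jq '.error_rate > 0.001')" = "true" ]; then
  echo "canary error rate above 0.1%: roll back" >&2
fi
if [ "$(echo "$METRICS" | jq '.replication_lag_ms > 100')" = "true" ]; then
  echo "canary replication lag above 100ms: investigate" >&2
fi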
Automated Upgrade Script
#!/bin/bash
# rolling-upgrade.sh - Automated rolling upgrade script
set -e
NEW_VERSION=$1
# Abort early if no target version was supplied
if [ -z "$NEW_VERSION" ]; then
  echo "usage: $0 NEW_VERSION" >&2
  exit 1
fi
NODES=$(neumann cluster members --format json | jq -r '.[] | .id')
LEADER=$(neumann raft status --format json | jq -r '.leader')
echo "Upgrading cluster to version $NEW_VERSION"
# Upgrade followers first
for node in $NODES; do
if [ "$node" == "$LEADER" ]; then
continue
fi
echo "Upgrading follower: $node"
neumann node drain $node
ssh $node "cargo install neumann --version $NEW_VERSION"
ssh $node "systemctl restart neumann"
# Wait for rejoin
sleep 10
neumann cluster wait-healthy --timeout 120
done
# Transfer leadership and upgrade old leader
echo "Transferring leadership from $LEADER"
# Exact-match filter so e.g. node1 does not also exclude node10
NEW_LEADER=$(echo "$NODES" | grep -Fxv "$LEADER" | head -n 1)
neumann raft transfer-leadership --to $NEW_LEADER
sleep 5
echo "Upgrading old leader: $LEADER"
neumann node drain $LEADER
ssh $LEADER "cargo install neumann --version $NEW_VERSION"
ssh $LEADER "systemctl restart neumann"
# Final verification
neumann cluster wait-healthy --timeout 120
neumann cluster versions
echo "Upgrade complete"