Cluster Upgrade

This runbook covers upgrading tensor_chain clusters with minimal downtime.

Upgrade Types

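| Type       | Downtime | Complexity | Use Case                    |
|------------|----------|------------|-----------------------------|
| Rolling    | None     | Low        | Minor version upgrades      |
| Blue-Green | Minimal  | Medium     | Major version upgrades      |
| Canary     | None     | High       | Risk-sensitive environments |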

Rolling Upgrade

Upgrade nodes one at a time while maintaining cluster availability.

Prerequisites

  • Cluster has 3+ nodes for quorum during upgrades
  • New version is backwards compatible with current version
  • Upgrade tested in staging environment
  • Backup of cluster state completed
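
The 3+ node requirement follows from majority quorum: a cluster of N voting nodes needs floor(N/2)+1 of them to agree, so it can take one node offline only if N-1 still meets that number. A minimal sketch of the arithmetic, assuming standard majority quorum; the function names are illustrative and not part of the neumann CLI:

```shell
#!/bin/bash
# Majority quorum size for a cluster of N voting nodes.
quorum_size() {
    echo $(( $1 / 2 + 1 ))
}

# Succeeds if the cluster keeps quorum with one node offline.
can_upgrade_one_at_a_time() {
    local nodes=$1
    [ $(( nodes - 1 )) -ge "$(quorum_size "$nodes")" ]
}

for n in 2 3 5; do
    if can_upgrade_one_at_a_time "$n"; then
        echo "$n nodes: rolling upgrade keeps quorum"
    else
        echo "$n nodes: taking a node down loses quorum"
    fi
done
```

With two nodes the quorum is also two, so stopping either node halts the cluster; three is the smallest size that tolerates one node being down.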

Symptoms (Why Upgrade)

  • Security patches available
  • New features required
  • Bug fixes needed
  • Performance improvements available

Upgrade Sequence

sequenceDiagram
    participant F1 as Follower 1
    participant F2 as Follower 2
    participant L as Leader
    participant A as Admin

    Note over A: Start rolling upgrade
    A->>F1: upgrade
    F1->>F1: restart with new version
    F1->>L: rejoin cluster
    Note over F1: Follower 1 upgraded

    A->>F2: upgrade
    F2->>F2: restart with new version
    F2->>L: rejoin cluster
    Note over F2: Follower 2 upgraded

    A->>L: transfer leadership
    L->>F1: leadership transferred
    A->>L: upgrade (now follower)
    L->>F1: rejoin cluster
    Note over L: All nodes upgraded
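
The ordering in the diagram (every follower, then the leader) can be computed from a node list and the current leader. The sketch below is illustrative only: `upgrade_order` is a hypothetical helper, and the node names are sample values.

```shell
#!/bin/bash
# Print nodes in upgrade order: every follower first, the leader last.
# Usage: upgrade_order LEADER NODE...
upgrade_order() {
    local leader=$1
    shift
    local node
    for node in "$@"; do
        # Exact string comparison, so "node1" never matches "node10".
        [ "$node" = "$leader" ] || echo "$node"
    done
    echo "$leader"
}

upgrade_order node1 node1 node2 node3
```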

Procedure

Step 1: Pre-upgrade checks

# Verify cluster health
neumann cluster health

# Check current versions
neumann cluster versions

# Verify backup is current
neumann backup status

Step 2: Upgrade followers first

# For each follower node:

# 1. Drain the node
neumann node drain node2

# 2. Stop the service
ssh node2 "systemctl stop neumann"

# 3. Upgrade the binary
ssh node2 "cargo install neumann --version X.Y.Z"

# 4. Start the service
ssh node2 "systemctl start neumann"

# 5. Verify rejoin
neumann cluster members

# 6. Wait for replication catch-up
neumann metrics node2 | grep replication_lag
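
The last item in Step 2 is a wait rather than a single check. A generic polling helper can express that: the sketch below retries any command until it succeeds or a timeout expires. The name `wait_for` is an assumption, and because the output format of `neumann metrics` is not shown here, the actual lag test is left to the caller.

```shell
#!/bin/bash
# Retry a command until it succeeds or TIMEOUT seconds have passed.
# Usage: wait_for TIMEOUT INTERVAL COMMAND [ARGS...]
wait_for() {
    local timeout=$1 interval=$2
    shift 2
    local deadline=$(( SECONDS + timeout ))
    until "$@"; do
        if [ "$SECONDS" -ge "$deadline" ]; then
            echo "timed out after ${timeout}s: $*" >&2
            return 1
        fi
        sleep "$interval"
    done
}

# Stand-in check that fails twice and then succeeds.
attempts=0
ready_on_third_try() {
    attempts=$(( attempts + 1 ))
    [ "$attempts" -ge 3 ]
}

wait_for 10 1 ready_on_third_try && echo "ready after $attempts attempts"
```

In the procedure, the stand-in would be replaced by a function that inspects the `replication_lag` value for the node being upgraded.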

Step 3: Upgrade the leader

# Transfer leadership to an upgraded follower
neumann raft transfer-leadership --to node2

# Verify leadership transferred
neumann raft status

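# Now upgrade the old leader (same steps as followers)
neumann node drain node1
ssh node1 "systemctl stop neumann"
ssh node1 "cargo install neumann --version X.Y.Z"
ssh node1 "systemctl start neumann"

# Verify rejoin and replication catch-up
neumann cluster members
neumann metrics node1 | grep replication_lag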

Step 4: Post-upgrade verification

# Verify all nodes on new version
neumann cluster versions

# Expected output:
# ID     VERSION
# node1  X.Y.Z
# node2  X.Y.Z
# node3  X.Y.Z

# Run health checks
neumann cluster health

# Verify functionality with test transactions
neumann test-transaction
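
Comparing versions by eye gets error-prone on larger clusters. The sketch below checks the listing automatically; it assumes the two-column ID/VERSION layout shown in the expected output above, and `all_nodes_on_version` is an illustrative name rather than a neumann command.

```shell
#!/bin/bash
# Read `neumann cluster versions` output on stdin and succeed only if
# every node row reports the expected version. Assumes one header line
# followed by "ID VERSION" rows.
all_nodes_on_version() {
    local expected=$1
    awk -v want="$expected" '
        NR == 1 { next }    # skip the ID/VERSION header
        NF == 0 { next }    # ignore blank lines
        { seen++ }
        # Appending "" forces a string comparison instead of a numeric one.
        ($2 "") != (want "") { print "mismatch: " $1 " is on " $2; bad++ }
        END { exit (bad > 0 || seen == 0) ? 1 : 0 }
    '
}

# Sample input standing in for real command output.
all_nodes_on_version 0.11.0 <<'EOF' && echo "all nodes upgraded"
ID     VERSION
node1  0.11.0
node2  0.11.0
node3  0.11.0
EOF
```

Against a live cluster this would be used as `neumann cluster versions | all_nodes_on_version X.Y.Z`.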

Version Compatibility

Compatibility Matrix

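| From Version | To Version | Compatible | Notes                     |
|--------------|------------|------------|---------------------------|
| 0.9.x        | 0.10.x     | Yes        | Rolling upgrade supported |
| 0.10.x       | 0.11.x     | Yes        | Rolling upgrade supported |
| 0.8.x        | 0.10.x     | No         | Blue-green required       |
| 0.x.x        | 1.0.x      | No         | Blue-green required       |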

Version Skew Policy

  • Maximum skew: 1 minor version during rolling upgrades
  • Leader version: Must be <= follower versions (the leader stays on the old version until every follower is upgraded)
  • Upgrade order: Always followers first, then leader
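
The matrix and the skew limit reduce to a simple rule: a rolling upgrade applies when the major version is unchanged and the minor version advances by at most one. A sketch of that check, inferred from the rows above rather than taken from neumann itself (`rolling_upgrade_ok` is an illustrative name):

```shell
#!/bin/bash
# Succeeds if FROM -> TO stays on the same major version and moves
# forward by at most one minor version. Versions are MAJOR.MINOR[.PATCH];
# the patch component is ignored, and downgrades are rejected.
rolling_upgrade_ok() {
    local from_major from_minor to_major to_minor
    IFS=. read -r from_major from_minor _ <<< "$1"
    IFS=. read -r to_major to_minor _ <<< "$2"
    [ "$from_major" -eq "$to_major" ] || return 1
    local skew=$(( to_minor - from_minor ))
    [ "$skew" -ge 0 ] && [ "$skew" -le 1 ]
}

for pair in "0.9.4 0.10.0" "0.8.2 0.10.0" "0.11.3 1.0.0"; do
    set -- $pair    # split the pair into $1 (from) and $2 (to)
    if rolling_upgrade_ok "$1" "$2"; then
        echo "$1 -> $2: rolling upgrade"
    else
        echo "$1 -> $2: blue-green required"
    fi
done
```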

Rollback Procedure

If issues are discovered after the upgrade, roll back the affected nodes.

Symptoms Requiring Rollback

  • Transaction failures after upgrade
  • Performance degradation
  • Consensus failures
  • Data corruption detected

Rollback Steps

# 1. Stop accepting new requests
neumann cluster pause

# 2. Identify problematic nodes
neumann cluster health --verbose

# 3. Roll back affected nodes
ssh node1 "cargo install neumann --version X.Y.Z-OLD"
ssh node1 "systemctl restart neumann"

# 4. Verify rollback
neumann cluster versions

# 5. Resume operations
neumann cluster resume

Rollback Limitations

  • Cannot roll back if schema changes were applied
  • Cannot roll back if new features were used
  • Always test the rollback in staging first

Canary Upgrade

For risk-sensitive environments, upgrade a single node first and monitor it before upgrading the rest of the cluster.

Procedure

# 1. Select canary node (typically a follower)
CANARY=node3

# 2. Upgrade canary
neumann node drain $CANARY
ssh $CANARY "cargo install neumann --version X.Y.Z"
ssh $CANARY "systemctl restart neumann"

# 3. Monitor canary for 24-48 hours
neumann metrics $CANARY --watch

# 4. Compare metrics with non-canary nodes
neumann metrics compare $CANARY node1

# 5. If healthy, proceed with rolling upgrade
# If unhealthy, roll back the canary

Canary Success Criteria

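| Metric          | Threshold       | Action if Exceeded |
|-----------------|-----------------|--------------------|
| Error rate      | < 0.1%          | Rollback           |
| Latency p99     | < 2x baseline   | Investigate        |
| Replication lag | < 100ms         | Investigate        |
| Memory usage    | < 1.5x baseline | Investigate        |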
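
The criteria can be applied mechanically to one metrics sample. The sketch below encodes the four thresholds from the table; the function name, argument order, and units (percent, milliseconds, megabytes) are assumptions, since the output format of `neumann metrics` is not shown here.

```shell
#!/bin/bash
# Print the action for one canary sample: rollback, investigate, or proceed.
# Usage: canary_verdict ERROR_PCT P99_MS BASE_P99_MS LAG_MS MEM_MB BASE_MEM_MB
canary_verdict() {
    awk -v err="$1" -v p99="$2" -v base_p99="$3" \
        -v lag="$4" -v mem="$5" -v base_mem="$6" 'BEGIN {
        if (err >= 0.1)            { print "rollback"; exit }
        if (p99 >= 2 * base_p99)   { print "investigate"; exit }
        if (lag >= 100)            { print "investigate"; exit }
        if (mem >= 1.5 * base_mem) { print "investigate"; exit }
        print "proceed"
    }'
}

canary_verdict 0.02 40 35 12 900 850    # healthy sample
canary_verdict 0.50 40 35 12 900 850    # error rate over threshold
canary_verdict 0.02 90 35 12 900 850    # p99 above 2x baseline
```

The error-rate test comes first because it is the only criterion whose action is a rollback; the others only call for investigation.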

Automated Upgrade Script

#!/bin/bash
# rolling-upgrade.sh - Automated rolling upgrade script
# Usage: ./rolling-upgrade.sh X.Y.Z

# Abort on errors, unset variables, and failures inside pipelines
set -euo pipefail

NEW_VERSION="${1:?usage: $0 NEW_VERSION}"
NODES=$(neumann cluster members --format json | jq -r '.[] | .id')
LEADER=$(neumann raft status --format json | jq -r '.leader')

# Drain, upgrade, and restart one node, then wait for a healthy cluster
upgrade_node() {
    local node=$1
    neumann node drain "$node"
    ssh "$node" "cargo install neumann --version $NEW_VERSION"
    ssh "$node" "systemctl restart neumann"

    # Wait for rejoin
    sleep 10
    neumann cluster wait-healthy --timeout 120
}

echo "Upgrading cluster to version $NEW_VERSION"

# Upgrade followers first; the first one upgraded becomes the
# leadership target (exact match, so node1 never matches node10)
NEW_LEADER=""
for node in $NODES; do
    if [ "$node" = "$LEADER" ]; then
        continue
    fi

    echo "Upgrading follower: $node"
    upgrade_node "$node"
    NEW_LEADER=${NEW_LEADER:-$node}
done

if [ -z "$NEW_LEADER" ]; then
    echo "No follower available to take leadership" >&2
    exit 1
fi

# Transfer leadership and upgrade old leader
echo "Transferring leadership from $LEADER to $NEW_LEADER"
neumann raft transfer-leadership --to "$NEW_LEADER"

sleep 5

echo "Upgrading old leader: $LEADER"
upgrade_node "$LEADER"

# Final verification
neumann cluster versions

echo "Upgrade complete"
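
Before pointing an automated script at production, it can help to print the commands it would run without executing them. A minimal sketch of that pattern follows; the `run` wrapper and the `DRY_RUN` variable are assumptions and not part of the script above.

```shell
#!/bin/bash
# Run a command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "[dry-run] $*"
    else
        "$@"
    fi
}

DRY_RUN=1
run neumann node drain node2
run ssh node2 "systemctl restart neumann"
```

Prefixing each state-changing command in the script with `run` would let the whole sequence be reviewed first, then executed with `DRY_RUN` unset.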

See Also