Node Management
This runbook covers adding and removing nodes from a tensor_chain cluster.
Adding a Node
Prerequisites Checklist
- New node has network connectivity to existing cluster members
- TLS certificates are configured (if using TLS)
- Node has sufficient disk space for snapshot transfer
- Firewall rules allow traffic on cluster port (default: 9100)
- DNS/hostname resolution configured for the new node
Symptoms (Why Add a Node)
- Cluster capacity insufficient for workload
- Need additional replicas for fault tolerance
- Geographic distribution requirements
- Performance scaling requirements
Procedure
Step 1: Prepare the new node
# Install Neumann on the new node
cargo install neumann --version X.Y.Z
# Create configuration directory
mkdir -p /etc/neumann
mkdir -p /var/lib/neumann/data
# Copy TLS certificates (if using TLS)
scp admin@existing-node:/etc/neumann/ca.crt /etc/neumann/
# Generate node-specific certificates
./scripts/generate-node-cert.sh node4
Step 2: Configure the new node
Create /etc/neumann/config.toml:
[node]
id = "node4"
data_dir = "/var/lib/neumann/data"
[cluster]
# Existing cluster members for initial discovery
seeds = ["node1:9100", "node2:9100", "node3:9100"]
port = 9100
[tls]
cert_path = "/etc/neumann/node4.crt"
key_path = "/etc/neumann/node4.key"
ca_cert_path = "/etc/neumann/ca.crt"
Step 3: Join the cluster
# Start the node in join mode
neumann start --join
# Monitor the join process
neumann status --watch
Step 4: Verify cluster membership
# On any existing node
neumann cluster members
# Expected output:
# ID ADDRESS STATE ROLE
# node1 10.0.1.1:9100 healthy leader
# node2 10.0.1.2:9100 healthy follower
# node3 10.0.1.3:9100 healthy follower
# node4 10.0.1.4:9100 healthy follower <-- new node
Post-Addition Verification
# Verify snapshot transfer completed
neumann status node4 --verbose
# Check replication lag
neumann metrics node4 | grep replication_lag
# Verify the node participates in consensus
neumann raft status
Removing a Node
Prerequisites Checklist
- Cluster will maintain quorum after removal
- Node is not the current leader (trigger election first)
- Data has been replicated to other nodes
- No in-flight transactions involving this node
Symptoms (Why Remove a Node)
- Hardware failure requiring decommission
- Cluster right-sizing
- Node relocation to different region
- Maintenance requiring extended downtime
Pre-Removal Verification
# Check current cluster state
neumann cluster members
# Verify quorum will be maintained
# For N nodes, quorum = (N/2) + 1
# 5 nodes -> quorum = 3, can remove 2
# 3 nodes -> quorum = 2, can remove 1
Procedure
Step 1: Drain the node (graceful removal)
# Mark node as draining (stops accepting new requests)
neumann node drain node3
# Wait for in-flight transactions to complete
neumann node wait-drain node3 --timeout 300
Step 2: Transfer leadership if necessary
# Check if node is leader
neumann raft status
# If leader, trigger election
neumann raft transfer-leadership --to node1
Step 3: Remove from cluster
# Remove the node from cluster configuration
neumann cluster remove node3
# Verify removal
neumann cluster members
Step 4: Stop the node
# On the removed node
neumann stop
# Clean up data (optional)
rm -rf /var/lib/neumann/data/*
Post-Removal Verification
# Verify cluster health
neumann cluster health
# Check that remaining nodes have correct membership
neumann cluster members
# Verify no pending transactions for removed node
neumann transactions pending
Emergency Removal
Use emergency removal only when a node is unresponsive and cannot be drained gracefully.
Symptoms
- Node is unreachable (network partition, hardware failure)
- Node is unresponsive (hung process, resource exhaustion)
- Need to restore quorum quickly
Procedure
# Force remove unresponsive node
neumann cluster remove node3 --force
# The cluster will:
# 1. Remove node from membership
# 2. Abort any transactions involving the node
# 3. Re-elect leader if necessary
Resolution
After emergency removal:
- Investigate root cause of node failure
- Repair or replace hardware if needed
- Re-add node using the addition procedure above
Prevention
- Monitor node health with alerting
- Configure appropriate timeouts
- Maintain sufficient cluster size for fault tolerance
Quorum Considerations
| Cluster Size | Quorum | Fault Tolerance | Notes |
|---|---|---|---|
| 1 | 1 | 0 | Development only |
| 2 | 2 | 0 | Not recommended |
| 3 | 2 | 1 | Minimum for production |
| 5 | 3 | 2 | Recommended for HA |
| 7 | 4 | 3 | Maximum practical size |
Quorum Formula
quorum = (cluster_size / 2) + 1
fault_tolerance = cluster_size - quorum
Best Practices
- Always maintain odd number of nodes
- Never remove nodes if it would violate quorum
- Plan node additions/removals during low-traffic periods
- Test failover scenarios regularly