Runbooks

These operational runbooks cover managing Neumann clusters, with a focus on tensor_chain distributed operations.

Available Runbooks

Runbook	Scenario	Severity
Leader Election	Cluster has no leader	Critical
Split-Brain Recovery	Network partition healed	Critical
Node Recovery	Node crash or disk failure	High
Backup and Restore	Data backup and disaster recovery	High
Capacity Planning	Resource sizing and scaling	Medium
Deadlock Resolution	Transaction deadlocks	Medium

How to Use These Runbooks

1. Identify the symptom from alerts or monitoring.
2. Find the matching runbook in the table above.
3. Follow the diagnostic steps to confirm the root cause.
4. Execute the resolution steps in order.
5. Verify recovery using the provided checks.

Alerting Rules

Each runbook includes Prometheus alerting rules. Deploy them to your monitoring stack:

# Copy alerting rules
cp docs/book/src/operations/alerting-rules.yml /etc/prometheus/rules/neumann.yml

# Reload Prometheus (the reload endpoint only works if Prometheus was started with --web.enable-lifecycle)
curl -X POST http://prometheus:9090/-/reload
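
Copying a malformed rules file and then reloading can leave Prometheus rejecting the new configuration, so it may help to check the file before installing it. The following is a minimal sketch, not part of Neumann: the helper name `install_rules` is hypothetical, and it assumes `promtool` (the checker that ships with Prometheus) may or may not be on the PATH, skipping the syntax check if it is missing.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with Neumann): install a rules file
# only if it exists and, when promtool is available, passes a syntax check.
#
# Usage: install_rules SRC DEST
install_rules() {
    src="$1"
    dest="$2"

    # Refuse to continue if the source rules file is missing.
    if [ ! -f "$src" ]; then
        echo "rules file not found: $src" >&2
        return 1
    fi

    # promtool ships with Prometheus; skip the check if it is not installed.
    if command -v promtool >/dev/null 2>&1; then
        promtool check rules "$src" || return 1
    else
        echo "promtool not found; skipping syntax check" >&2
    fi

    # Copy the rules file into place.
    cp "$src" "$dest"
}
```

With this helper defined, the deployment above could be run as `install_rules docs/book/src/operations/alerting-rules.yml /etc/prometheus/rules/neumann.yml && curl -X POST http://prometheus:9090/-/reload`, so that the reload is only requested when the copy succeeded.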

Emergency Contacts

For production incidents:

1. Page the on-call engineer.
2. Start an incident channel.
3. Follow the relevant runbook.
4. Document actions taken.
5. Schedule a post-incident review.
