Runbooks
Operational runbooks for managing Neumann clusters, focusing on tensor_chain distributed operations.
Available Runbooks
| Runbook | Scenario | Severity |
|---|---|---|
| Leader Election | Cluster has no leader | Critical |
| Split-Brain Recovery | Network partition healed | Critical |
| Node Recovery | Node crash or disk failure | High |
| Backup and Restore | Data backup and disaster recovery | High |
| Capacity Planning | Resource sizing and scaling | Medium |
| Deadlock Resolution | Transaction deadlocks | Medium |
How to Use These Runbooks
1. Identify the symptom from alerts or monitoring
2. Find the matching runbook in the table above
3. Follow the diagnostic steps to confirm root cause
4. Execute the resolution steps in order
5. Verify recovery using the provided checks
Alerting Rules
Each runbook includes Prometheus alerting rules. Deploy them to your monitoring stack:
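# Copy alerting rules into a directory matched by rule_files in prometheus.yml
cp docs/book/src/operations/alerting-rules.yml /etc/prometheus/rules/neumann.yml
# Reload Prometheus (the endpoint requires the --web.enable-lifecycle flag)
curl -X POST http://prometheus:9090/-/reload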
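A malformed rules file can cause the reload to fail, so it may be worth validating the file before copying it. The following is a minimal sketch, not part of Neumann itself: it assumes `promtool` (shipped with Prometheus) and `curl` are on the PATH, and the function name `deploy_rules` and the `PROM_URL` variable are hypothetical names chosen for illustration.

```shell
#!/bin/sh
# Hypothetical helper: validate, copy, then reload.
# PROM_URL is an assumption; it defaults to the host used in the commands above.
PROM_URL="${PROM_URL:-http://prometheus:9090}"

deploy_rules() {
    src="$1"    # rules file to deploy
    dest="$2"   # target path inside the Prometheus rules directory

    # Stop early if the source file is missing.
    [ -f "$src" ] || { echo "rules file not found: $src" >&2; return 1; }

    # 'promtool check rules' exits non-zero when the file is invalid.
    promtool check rules "$src" || return 1

    # Only copy after validation succeeded.
    cp "$src" "$dest" || return 1

    # -f makes curl return non-zero on an HTTP error response.
    curl -fsS -X POST "$PROM_URL/-/reload"
}
```

It could be invoked as `deploy_rules docs/book/src/operations/alerting-rules.yml /etc/prometheus/rules/neumann.yml`. Because validation runs before the copy, an invalid file never reaches the rules directory.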
Emergency Contacts
For production incidents:
1. Page the on-call engineer
2. Start an incident channel
3. Follow the relevant runbook
4. Document actions taken
5. Schedule post-incident review