Runbooks

Operational runbooks for managing Neumann clusters, focusing on tensor_chain distributed operations.

Available Runbooks

| Runbook | Scenario | Severity |
|---|---|---|
| Leader Election | Cluster has no leader | Critical |
| Split-Brain Recovery | Network partition healed | Critical |
| Node Recovery | Node crash or disk failure | High |
| Backup and Restore | Data backup and disaster recovery | High |
| Capacity Planning | Resource sizing and scaling | Medium |
| Deadlock Resolution | Transaction deadlocks | Medium |

How to Use These Runbooks

  1. Identify the symptom from alerts or monitoring
  2. Find the matching runbook in the table above
  3. Follow the diagnostic steps to confirm root cause
  4. Execute the resolution steps in order
  5. Verify recovery using the provided checks
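
Steps 1 and 2 amount to a lookup from symptom to runbook. A minimal sketch of that lookup is shown here, assuming a hypothetical helper named `runbook_for`. The runbook names and severities come from the table above; the keyword patterns are illustrative assumptions and would need tuning to match your actual alert text.

```shell
# Hypothetical helper: map a free-text symptom to the runbook and severity
# listed in the table above. The keyword patterns are assumptions.
runbook_for() {
    # Lowercase the input so matching is case-insensitive.
    symptom=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
    case "$symptom" in
        *"no leader"*|*election*)       echo "Leader Election (Critical)" ;;
        *split-brain*|*partition*)      echo "Split-Brain Recovery (Critical)" ;;
        *crash*|*"disk failure"*)       echo "Node Recovery (High)" ;;
        *backup*|*restore*|*disaster*)  echo "Backup and Restore (High)" ;;
        *capacity*|*sizing*|*scaling*)  echo "Capacity Planning (Medium)" ;;
        *deadlock*)                     echo "Deadlock Resolution (Medium)" ;;
        *)                              echo "No matching runbook"; return 1 ;;
    esac
}
```

For example, `runbook_for "Cluster has no leader"` prints `Leader Election (Critical)`. The function returns a non-zero status when nothing matches, so a calling script can fall back to paging the on-call engineer.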

Alerting Rules

Each runbook includes Prometheus alerting rules. Deploy them to your monitoring stack:

# Copy alerting rules
cp docs/book/src/operations/alerting-rules.yml /etc/prometheus/rules/neumann.yml

# Reload Prometheus (the reload endpoint is only served when Prometheus runs with --web.enable-lifecycle)
curl -X POST http://prometheus:9090/-/reload
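
A more defensive variant of the commands above might validate the rule file before copying it and stop at the first failure. The sketch below assumes `promtool` (shipped with Prometheus) is on the `PATH`. The `DRY_RUN` switch and the names `run` and `deploy_rules` are hypothetical, introduced only for illustration. The paths and URL are the ones used above.

```shell
# Defaults taken from the commands above; override via the environment.
RULES_SRC="${RULES_SRC:-docs/book/src/operations/alerting-rules.yml}"
RULES_DST="${RULES_DST:-/etc/prometheus/rules/neumann.yml}"
PROM_URL="${PROM_URL:-http://prometheus:9090}"

# Run a command, or only print it when DRY_RUN=1 (hypothetical switch).
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

deploy_rules() {
    # Validate syntax first so a broken file never reaches the rules directory.
    run promtool check rules "$RULES_SRC" || return 1
    run cp "$RULES_SRC" "$RULES_DST" || return 1
    # -f makes curl exit non-zero on an HTTP error status.
    run curl -fsS -X POST "$PROM_URL/-/reload" || return 1
}
```

Setting `DRY_RUN=1` before calling `deploy_rules` prints the three commands without executing them, which is a cheap way to review the paths before touching a production Prometheus.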

Emergency Contacts

For production incidents:

  1. Page the on-call engineer
  2. Start an incident channel
  3. Follow the relevant runbook
  4. Document actions taken
  5. Schedule post-incident review
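
Step 4 is easier to honor if every action is recorded as it happens. One possible approach is a small timestamped log, sketched below. The `log_action` helper, the `INCIDENT_LOG` variable, and the tab-separated line format are all assumptions for illustration, not part of Neumann.

```shell
# Hypothetical incident log; the location and format are assumptions.
INCIDENT_LOG="${INCIDENT_LOG:-./incident.log}"

# Append one line: UTC timestamp, operator, and the action taken.
log_action() {
    printf '%s\t%s\t%s\n' \
        "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
        "${USER:-unknown}" \
        "$*" >> "$INCIDENT_LOG"
}
```

A call such as `log_action "Followed Leader Election runbook, step 3"` appends a single line to the log, giving the post-incident review an ordered record of what was done and when.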