Monitoring

Metrics Endpoint

Each node exposes Prometheus metrics at http://node:9090/metrics, where node is the hostname of a cluster node.
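As a rough illustration of what a scrape returns, the sketch below parses a Prometheus text-format payload and derives the Raft state name and the commit/apply gap. The sample payload is hand-written for illustration, not captured output, and the sketch assumes the metrics are emitted without labels; `parse_metrics` and `RAFT_STATES` are hypothetical helper names.

```python
def parse_metrics(text: str) -> dict:
    """Parse unlabelled samples from Prometheus text exposition format."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        name, _, rest = line.partition(" ")
        # Labelled series (name{...}) need a full parser; this sketch skips them.
        if "{" in name or not rest:
            continue
        # The value may be followed by an optional timestamp.
        samples[name] = float(rest.split()[0])
    return samples


# Illustrative payload (assumed shape, not real output).
SAMPLE = """\
# HELP tensor_chain_raft_state Current state (follower=0, candidate=1, leader=2)
# TYPE tensor_chain_raft_state gauge
tensor_chain_raft_state 2
tensor_chain_term 42
tensor_chain_commit_index 12345
tensor_chain_applied_index 12340
"""

# Value encoding taken from the Raft Consensus table.
RAFT_STATES = {0: "follower", 1: "candidate", 2: "leader"}

metrics = parse_metrics(SAMPLE)
state = RAFT_STATES[int(metrics["tensor_chain_raft_state"])]
lag = metrics["tensor_chain_commit_index"] - metrics["tensor_chain_applied_index"]
print(state, int(lag))  # prints "leader 5"
```

To run this against a live node, the payload could be fetched with `urllib.request.urlopen("http://node:9090/metrics").read().decode()` and passed to `parse_metrics`.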

Key Metrics

Raft Consensus

| Metric | Type | Description |
| --- | --- | --- |
| tensor_chain_raft_state | Gauge | Current state (follower=0, candidate=1, leader=2) |
| tensor_chain_term | Gauge | Current Raft term |
| tensor_chain_commit_index | Gauge | Highest committed log index |
| tensor_chain_applied_index | Gauge | Highest applied log index |
| tensor_chain_elections_total | Counter | Total elections started |
| tensor_chain_append_entries_total | Counter | Total AppendEntries RPCs |

Transactions

| Metric | Type | Description |
| --- | --- | --- |
| tensor_chain_tx_active | Gauge | Currently active transactions |
| tensor_chain_tx_commits_total | Counter | Total committed transactions |
| tensor_chain_tx_aborts_total | Counter | Total aborted transactions |
| tensor_chain_tx_latency_seconds | Histogram | Transaction latency |
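Because the commit and abort metrics are counters, they are most useful as differences between two scrapes. The sketch below computes an abort ratio over an interval from two snapshots; `abort_ratio` is a hypothetical helper and the snapshot values are invented for illustration.

```python
from typing import Optional


def abort_ratio(prev: dict, curr: dict) -> Optional[float]:
    """Fraction of finished transactions that aborted between two scrapes."""
    commits = curr["tensor_chain_tx_commits_total"] - prev["tensor_chain_tx_commits_total"]
    aborts = curr["tensor_chain_tx_aborts_total"] - prev["tensor_chain_tx_aborts_total"]
    # A negative delta means the counter was reset (for example, a node restart).
    if commits < 0 or aborts < 0:
        return None
    finished = commits + aborts
    return aborts / finished if finished else 0.0


# Invented snapshots, taken some interval apart.
prev = {"tensor_chain_tx_commits_total": 1000, "tensor_chain_tx_aborts_total": 10}
curr = {"tensor_chain_tx_commits_total": 1900, "tensor_chain_tx_aborts_total": 110}

print(abort_ratio(prev, curr))  # prints 0.1
```

Returning `None` on a counter reset is one possible design choice; Prometheus's own `rate()` instead compensates for resets automatically.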

Deadlock Detection

| Metric | Type | Description |
| --- | --- | --- |
| tensor_chain_deadlocks_total | Counter | Total deadlocks detected |
| tensor_chain_deadlock_victims_total | Counter | Transactions aborted as victims |
| tensor_chain_wait_graph_size | Gauge | Current wait-for graph size |

Gossip

| Metric | Type | Description |
| --- | --- | --- |
| tensor_chain_gossip_members | Gauge | Known cluster members |
| tensor_chain_gossip_healthy | Gauge | Healthy members |
| tensor_chain_gossip_suspect | Gauge | Suspect members |
| tensor_chain_gossip_failed | Gauge | Failed members |

Storage

| Metric | Type | Description |
| --- | --- | --- |
| tensor_chain_entries_total | Gauge | Total stored entries |
| tensor_chain_memory_bytes | Gauge | Memory usage |
| tensor_chain_disk_bytes | Gauge | Disk usage |
| tensor_chain_wal_size_bytes | Gauge | WAL file size |

Prometheus Configuration

scrape_configs:
  - job_name: 'neumann'
    static_configs:
      - targets:
        - 'node1:9090'
        - 'node2:9090'
        - 'node3:9090'

Grafana Dashboard

Import the dashboard from deploy/grafana/neumann-dashboard.json.

Panels include:

  • Cluster overview (leader, term, members)
  • Transaction throughput and latency
  • Replication lag
  • Memory and disk usage
  • Deadlock rate

Alerting Rules

See docs/book/src/operations/runbooks/ for alert definitions. Example rules:

groups:
  - name: neumann
    rules:
      - alert: NoLeader
        # The gauge encodes state in its value (leader=2), so compare the value.
        expr: sum(tensor_chain_raft_state == bool 2) == 0
        for: 30s
        labels:
          severity: critical

      - alert: HighReplicationLag
        expr: tensor_chain_commit_index - tensor_chain_applied_index > 1000
        for: 1m
        labels:
          severity: warning

      - alert: HighDeadlockRate
        expr: rate(tensor_chain_deadlocks_total[5m]) > 1
        for: 5m
        labels:
          severity: warning
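The conditions behind these three rules can be sketched as plain predicates over per-node samples. This is only an illustration of the logic, assuming the gauge encoding from the Raft Consensus table (leader=2); `NodeSample` and the function names are hypothetical, and the `for:` hold durations are not modelled.

```python
from dataclasses import dataclass
from typing import List

LEADER = 2  # gauge encoding from the Raft Consensus table


@dataclass
class NodeSample:
    raft_state: int
    commit_index: int
    applied_index: int


def no_leader(nodes: List[NodeSample]) -> bool:
    """NoLeader: no node currently reports the leader state."""
    return not any(n.raft_state == LEADER for n in nodes)


def lagging_nodes(nodes: List[NodeSample], threshold: int = 1000) -> List[int]:
    """HighReplicationLag: nodes whose applied index trails commit by more than threshold."""
    return [i for i, n in enumerate(nodes)
            if n.commit_index - n.applied_index > threshold]


def deadlock_rate(prev_total: float, curr_total: float, window_seconds: float) -> float:
    """HighDeadlockRate: average deadlocks per second over the window."""
    return (curr_total - prev_total) / window_seconds


# Invented samples for a three-node cluster.
nodes = [
    NodeSample(raft_state=0, commit_index=5000, applied_index=3500),
    NodeSample(raft_state=2, commit_index=5000, applied_index=4990),
    NodeSample(raft_state=0, commit_index=5000, applied_index=4999),
]

print(no_leader(nodes))                 # prints False
print(lagging_nodes(nodes))             # prints [0]
print(deadlock_rate(100, 460, 300) > 1)  # prints True
```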

Health Endpoint

curl http://node:9090/health

Response:

{
  "status": "healthy",
  "raft_state": "leader",
  "term": 42,
  "commit_index": 12345,
  "members": 3,
  "healthy_members": 3
}
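A health probe might reduce this response to a few booleans. The sketch below assumes the response always has the fields shown above and treats a strict majority of healthy members as quorum; both are assumptions, and `summarize_health` is a hypothetical helper. Status values other than "healthy" are not documented here, so the sketch only checks for that one.

```python
import json


def summarize_health(body: str) -> dict:
    """Reduce a /health response body to a few booleans."""
    doc = json.loads(body)
    members = doc["members"]
    healthy = doc["healthy_members"]
    return {
        "ok": doc["status"] == "healthy",
        "is_leader": doc["raft_state"] == "leader",
        # Assumption: quorum means a strict majority of known members.
        "has_quorum": healthy > members // 2,
    }


# The example response from this section.
RESPONSE = """
{
  "status": "healthy",
  "raft_state": "leader",
  "term": 42,
  "commit_index": 12345,
  "members": 3,
  "healthy_members": 3
}
"""

print(summarize_health(RESPONSE))
# prints {'ok': True, 'is_leader': True, 'has_quorum': True}
```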

Logging

Set the log level with the RUST_LOG environment variable:

RUST_LOG=tensor_chain=debug neumann

Log levels: error, warn, info, debug, trace