Monitoring

Metrics Endpoint

Prometheus metrics are exposed at http://node:9090/metrics.

Key Metrics

Raft Consensus

Metric	Type	Description
`tensor_chain_raft_state`	Gauge	Current state (follower=0, candidate=1, leader=2)
`tensor_chain_term`	Gauge	Current Raft term
`tensor_chain_commit_index`	Gauge	Highest committed log index
`tensor_chain_applied_index`	Gauge	Highest applied log index
`tensor_chain_elections_total`	Counter	Total elections started
`tensor_chain_append_entries_total`	Counter	Total AppendEntries RPCs

Transactions

Metric	Type	Description
`tensor_chain_tx_active`	Gauge	Currently active transactions
`tensor_chain_tx_commits_total`	Counter	Total committed transactions
`tensor_chain_tx_aborts_total`	Counter	Total aborted transactions
`tensor_chain_tx_latency_seconds`	Histogram	Transaction latency

Deadlock Detection

Metric	Type	Description
`tensor_chain_deadlocks_total`	Counter	Total deadlocks detected
`tensor_chain_deadlock_victims_total`	Counter	Transactions aborted as victims
`tensor_chain_wait_graph_size`	Gauge	Current wait-for graph size

Gossip

Metric	Type	Description
`tensor_chain_gossip_members`	Gauge	Known cluster members
`tensor_chain_gossip_healthy`	Gauge	Healthy members
`tensor_chain_gossip_suspect`	Gauge	Suspect members
`tensor_chain_gossip_failed`	Gauge	Failed members

Storage

Metric	Type	Description
`tensor_chain_entries_total`	Gauge	Total stored entries
`tensor_chain_memory_bytes`	Gauge	Memory usage
`tensor_chain_disk_bytes`	Gauge	Disk usage
`tensor_chain_wal_size_bytes`	Gauge	WAL file size

Prometheus Configuration

scrape_configs:
  - job_name: 'neumann'
    static_configs:
      - targets:
        - 'node1:9090'
        - 'node2:9090'
        - 'node3:9090'

Grafana Dashboard

Import the dashboard from deploy/grafana/neumann-dashboard.json.

Panels include:

Cluster overview (leader, term, members)
Transaction throughput and latency
Replication lag
Memory and disk usage
Deadlock rate

Alerting Rules

See docs/book/src/operations/runbooks/ for alert definitions.

groups:
  - name: neumann
    rules:
      - alert: NoLeader
        expr: sum(tensor_chain_raft_state{state="leader"}) == 0
        for: 30s
        labels:
          severity: critical

      - alert: HighReplicationLag
        expr: tensor_chain_commit_index - tensor_chain_applied_index > 1000
        for: 1m
        labels:
          severity: warning

      - alert: HighDeadlockRate
        expr: rate(tensor_chain_deadlocks_total[5m]) > 1
        for: 5m
        labels:
          severity: warning

Health Endpoint

curl http://node:9090/health

Response:

{
  "status": "healthy",
  "raft_state": "leader",
  "term": 42,
  "commit_index": 12345,
  "members": 3,
  "healthy_members": 3
}

Logging

Configure log level:

RUST_LOG=tensor_chain=debug neumann

Log levels: error, warn, info, debug, trace

Keyboard shortcuts

Neumann