
Formal Verification

Neumann’s distributed protocols are formally specified in TLA+ and exhaustively model-checked with the TLC model checker. The specifications live in specs/tla/ and cover the three critical protocol layers in tensor_chain.

What Is Model Checked

TLC explores every reachable state of a bounded model, checking safety invariants and temporal properties at each state. Unlike testing (which samples executions), model checking is exhaustive: if TLC reports no errors, the properties hold for every possible interleaving within the model bounds.
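To make the contrast with testing concrete, here is a toy exhaustive checker in the same spirit as TLC (a sketch, not TLC itself): it breadth-first explores every reachable state of a tiny two-node election model and checks an "at most one leader" invariant in every state it visits.

```rust
// Toy exhaustive state exploration, TLC-style (illustrative only).
// State: per-node role (0 = Follower, 1 = Candidate, 2 = Leader).
use std::collections::{HashSet, VecDeque};

type State = [u8; 2];

fn next_states(s: State) -> Vec<State> {
    let mut out = Vec::new();
    for i in 0..2 {
        match s[i] {
            0 => { let mut t = s; t[i] = 1; out.push(t); } // start election
            1 => {
                if s[1 - i] != 2 {
                    let mut t = s; t[i] = 2; out.push(t); // win election
                }
                let mut t = s; t[i] = 0; out.push(t);     // lose, revert
            }
            _ => { let mut t = s; t[i] = 0; out.push(t); } // leader steps down
        }
    }
    out
}

// Safety invariant: at most one leader at any time.
fn invariant(s: State) -> bool {
    s.iter().filter(|&&r| r == 2).count() <= 1
}

// BFS over the full reachable state graph, checking the invariant
// in every distinct state -- exhaustive, unlike a sampled test run.
fn check() -> (usize, bool) {
    let init: State = [0, 0];
    let mut seen = HashSet::from([init]);
    let mut queue = VecDeque::from([init]);
    let mut ok = true;
    while let Some(s) = queue.pop_front() {
        ok &= invariant(s);
        for t in next_states(s) {
            if seen.insert(t) {
                queue.push_back(t);
            }
        }
    }
    (seen.len(), ok)
}

fn main() {
    let (states, ok) = check();
    println!("{} distinct states, invariant holds: {}", states, ok);
}
```

TLC does the same kind of exploration at vastly larger scale, adding state fingerprinting, multi-worker search, and temporal-property checking on top.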

Raft Consensus (Raft.tla)

Models leader election, log replication, and commit advancement for the Tensor-Raft protocol implemented in tensor_chain/src/raft.rs. Three configurations exercise different aspects of the protocol.

Properties verified (14):

| Property | Type | What It Means |
|----------|------|---------------|
| ElectionSafety | Invariant | At most one leader per term |
| LogMatching | Invariant | Same index + term implies same entry |
| StateMachineSafety | Invariant | No divergent committed entries |
| LeaderCompleteness | Invariant | Committed entries survive leader changes |
| VoteIntegrity | Invariant | Each node votes at most once per term |
| PreVoteSafety | Invariant | Pre-vote does not disrupt existing leaders |
| ReplicationInv | Invariant | Every committed entry exists on a quorum |
| TermMonotonicity | Temporal | Terms never decrease |
| CommittedLogAppendOnlyProp | Temporal | Committed entries never retracted |
| MonotonicCommitIndexProp | Temporal | `commitIndex` never decreases |
| MonotonicMatchIndexProp | Temporal | `matchIndex` monotonic per leader term |
| NeverCommitEntryPrevTermsProp | Temporal | Only current-term entries committed |
| StateTransitionsProp | Temporal | Valid state machine transitions |
| PermittedLogChangesProp | Temporal | Log changes only via valid paths |
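To illustrate what one of these invariants asserts, LogMatching can be stated as an executable check over two concrete logs (a sketch with a hypothetical `Entry` type, not the tensor_chain representation): if two logs hold the same term at some index, they must hold the same entry there and agree on the whole prefix up to that index.

```rust
// Sketch of Raft's Log Matching property (hypothetical Entry type).
#[derive(Clone, PartialEq, Debug)]
struct Entry {
    term: u64,
    cmd: &'static str,
}

fn log_matching(a: &[Entry], b: &[Entry]) -> bool {
    let n = a.len().min(b.len());
    for i in 0..n {
        // Same index + term but diverging content violates the property.
        if a[i].term == b[i].term && a[..=i] != b[..=i] {
            return false;
        }
    }
    true
}
```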

Result (Raft.cfg, 3 nodes): 6,641,341 states generated, 1,338,669 distinct states, depth 42, 2 min 24s. Zero errors.

Two-Phase Commit (TwoPhaseCommit.tla)

Models the 2PC protocol for cross-shard distributed transactions implemented in tensor_chain/src/distributed_tx.rs. Includes a fault model with message loss and participant crash/recovery.

Properties verified (6):

| Property | Type | What It Means |
|----------|------|---------------|
| Atomicity | Invariant | All participants commit or all abort |
| NoOrphanedLocks | Invariant | Completed transactions release locks |
| ConsistentDecision | Invariant | Coordinator decision matches outcomes |
| VoteIrrevocability | Temporal | Prepared votes cannot be retracted without coordinator |
| DecisionStability | Temporal | Coordinator decision never changes |

Fault model: DropMessage (network loss) and ParticipantRestart (crash with WAL-backed lock recovery).
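The Atomicity invariant rests on the coordinator's decision rule, which can be sketched like this (assumed names, not the actual distributed_tx.rs API): commit only when every participant voted Yes; any No, or any missing vote, aborts.

```rust
// Sketch of the 2PC coordinator decision rule (hypothetical types).
#[derive(Clone, Copy, PartialEq, Debug)]
enum Vote { Yes, No }

#[derive(PartialEq, Debug)]
enum Decision { Commit, Abort }

fn decide(votes: &[Option<Vote>]) -> Decision {
    // A missing vote (lost message, or a participant that crashed
    // before preparing) counts the same as an explicit No.
    if votes.iter().all(|v| *v == Some(Vote::Yes)) {
        Decision::Commit
    } else {
        Decision::Abort
    }
}
```

Treating an absent vote as No is what keeps the fault model's DropMessage and ParticipantRestart actions from ever splitting the decision.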

Result: 1,869,429,350 states generated, 190,170,601 distinct states, depth 29, 2 hr 55 min. Zero errors. Every reachable state under message loss and crash/recovery satisfies all properties.

SWIM Gossip Membership (Membership.tla)

Models the SWIM-based gossip protocol for cluster membership and failure detection implemented in tensor_chain/src/gossip.rs and tensor_chain/src/membership.rs.

Properties verified (3):

| Property | Type | What It Means |
|----------|------|---------------|
| NoFalsePositivesSafety | Invariant | No node marked Failed above its own incarnation |
| MonotonicEpochs | Temporal | Lamport timestamps never decrease |
| MonotonicIncarnations | Temporal | Incarnation numbers never decrease |

Result (2-node): 136,097 states generated, 54,148 distinct states, depth 17. Zero errors. Result (3-node): 16,513 states generated, 5,992 distinct states, depth 13. Zero errors.
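The incarnation-based refutation mechanism behind NoFalsePositivesSafety can be sketched as follows (assumed types and precedence rules in the style of SWIM, not the actual gossip.rs/membership.rs code): a rumor only wins if its incarnation is high enough, so a live node can refute a false suspicion by re-announcing itself Alive with a bumped incarnation.

```rust
// Sketch of SWIM rumor precedence (hypothetical types).
use std::cmp::Ordering;

#[derive(Clone, Copy, PartialEq, PartialOrd, Debug)]
enum Status { Alive, Suspect, Failed }

#[derive(Clone, Copy, PartialEq, Debug)]
struct MemberState {
    status: Status,
    incarnation: u64,
}

fn apply_rumor(local: MemberState, rumor: MemberState) -> MemberState {
    match rumor.incarnation.cmp(&local.incarnation) {
        Ordering::Greater => rumor, // newer incarnation always wins
        Ordering::Less => local,    // stale rumor ignored
        Ordering::Equal => {
            // Same incarnation: the "worse" status wins
            // (Alive < Suspect < Failed by declaration order).
            if rumor.status > local.status { rumor } else { local }
        }
    }
}
```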

Bugs Found by Model Checking

TLC discovered real protocol bugs that would be extremely difficult to find through testing alone:

  1. matchIndex response reporting (Raft): Follower reported matchIndex = Len(log) instead of prevLogIndex + Len(entries). A heartbeat response would falsely claim the full log matched the leader’s, enabling incorrect commits. Caught by ReplicationInv.

  2. Out-of-order matchIndex regression (Raft): Leader unconditionally set matchIndex from responses. A stale heartbeat response arriving after a replication response would regress the value. Fixed by taking the max. Caught by MonotonicMatchIndexProp.

  3. inPreVote not reset on step-down (Raft): When stepping down to a higher term, the inPreVote flag was not cleared. A node could remain in pre-vote state as a Follower. Caught by PreVoteSafety.

  4. Self-message processing (Raft): A leader could process its own AppendEntries heartbeat, truncating its own log.

  5. Heartbeat log wipe (Raft): Empty heartbeat messages with prevLogIndex = 0 computed an empty new log, destroying committed entries.

  6. Out-of-order AppendEntries (Raft): Stale messages could overwrite entries from newer messages. Fixed by implementing the conflict-resolution rules of Raft Section 5.3.

  7. Gossip fairness formula (Membership): Quantification over messages (a state variable) inside WF_vars is semantically invalid in TLA+.
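Fixes 1 and 2 above reduce to two small rules, sketched here with hypothetical helper names (the real code lives in tensor_chain/src/raft.rs):

```rust
// Fix 1: the follower's response vouches only for the entries this
// AppendEntries actually covered -- prevLogIndex plus the number of
// entries appended, NOT the full length of its log.
fn follower_match_index(prev_log_index: u64, entries_appended: usize) -> u64 {
    prev_log_index + entries_appended as u64
}

// Fix 2: the leader never lets a stale response regress matchIndex;
// within a leader term it only ever moves forward.
fn leader_update_match_index(current: u64, reported: u64) -> u64 {
    current.max(reported)
}
```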

How to Run

```sh
cd specs/tla

# Fast CI check (~3 minutes total)
make ci

# All configs including extensions
make all

# Individual specs
java -XX:+UseParallelGC -Xmx4g -jar tla2tools.jar \
  -deadlock -workers auto -config Raft.cfg Raft.tla
```

The `-deadlock` flag suppresses false deadlock reports on terminal states in bounded models. The `-workers auto` flag enables multi-threaded checking.

Relationship to Testing

| Technique | Coverage | Finds |
|-----------|----------|-------|
| Unit tests | Specific scenarios | Implementation bugs |
| Integration tests | Cross-crate workflows | Wiring bugs |
| Fuzz testing | Random inputs | Crash/panic bugs |
| Model checking | All interleavings | Protocol design bugs |

Model checking complements testing. It verifies the protocol design is correct (no possible interleaving violates safety), while tests verify the Rust implementation matches the design. Together they provide high confidence that the distributed protocols behave correctly.

Further Reading