The Raft Consensus Algorithm

Lecture Notes for CS 190
Winter 2021
John Ousterhout

Motivation: Replicated State Machines

  • Service that is replicated on multiple machines:
    • All instances produce the same behavior
    • Fault tolerant: Available for service as long as a majority of the servers are up

Raft Basics

  • Leader based:
    • Pick one server to be leader
    • It accepts client requests, manages replication
    • Pick a new leader if the current one fails
  • Server states:
    • Leader
    • Follower: totally passive (issues no requests on its own; just responds to RPCs from leaders and candidates)
    • Candidate: used for electing new leaders
  • Time divided into terms:
    • Terms have numbers that increase monotonically
    • Each term starts with an election
      • One or more candidates attempting to become leader
    • Winning candidate (if any) serves as leader for the rest of the term
    • Terms allow detection of stale information
      • Each server stores current term
      • Checked on every request: a server that sees a larger term updates its own, and requests carrying a stale term are rejected
      • Term numbers increase monotonically
    • Different servers may observe term transitions at different times (or may miss some terms entirely)
  • Request-response protocol between servers (remote procedure calls, or RPCs). 2 request types:
    • RequestVote
    • AppendEntries (also serves as heartbeat)
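  • As a concrete reference, here is a minimal sketch (in Go) of the per-server state and the arguments of the two request types; field names follow Figure 2 of the Raft paper, while the struct layout and the id field are illustrative assumptions:

    // Per-server Raft state and the arguments of the two RPC request types.
    type LogEntry struct {
        Term    int    // term in which a leader created the entry
        Command []byte // state-machine command (opaque to Raft)
    }

    type ServerState int

    const (
        Follower ServerState = iota
        Candidate
        Leader
    )

    type RaftServer struct {
        id int // this server's ID (illustrative)

        // Persistent state (see the Persistent Storage section)
        currentTerm int
        votedFor    int // candidate voted for in currentTerm, or -1 if none
        log         []LogEntry

        // Volatile state
        state       ServerState
        commitIndex int // highest log index known to be committed
        lastApplied int // highest log index applied to the state machine
    }

    type RequestVoteArgs struct {
        Term         int // candidate's term
        CandidateID  int
        LastLogIndex int // for the up-to-date check (see Safety below)
        LastLogTerm  int
    }

    type AppendEntriesArgs struct {
        Term         int        // leader's term
        LeaderID     int
        PrevLogIndex int        // consistency check: index of entry preceding Entries
        PrevLogTerm  int        // consistency check: term of that entry
        Entries      []LogEntry // empty for heartbeats
        LeaderCommit int        // leader's commitIndex
    }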

Leader Election

  • All servers start as followers
  • No heartbeat (AppendEntries) received within the election timeout? Start an election (sketched after this list):
    • Increment current term
    • Vote for self
    • Send RequestVote requests to all other servers
  • Election outcomes:
    • Receive votes from majority of cluster: become leader, start sending heartbeats to prevent additional elections
    • Receive AppendEntries from another server whose term is at least as large as the candidate's: recognize it as the new leader and revert to follower
    • Timeout: start a new election
  • Each server votes for at most one candidate in a given term
  • Election Safety: at most one server can be elected leader in a given term
  • Availability: randomized election timeouts reduce split votes
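  • A sketch of the candidate's side of an election, building on the RaftServer struct above; sendRequestVote and lastLogTerm are assumed helpers, and a real implementation would persist its state, send the requests concurrently, and re-check its term as replies arrive:

    // Election timer fired without hearing a heartbeat: become a candidate.
    func (s *RaftServer) startElection(peers []int) {
        s.currentTerm++   // increment current term
        s.state = Candidate
        s.votedFor = s.id // vote for self
        // (currentTerm and votedFor must be persisted before requesting votes.)
        votes := 1

        for _, peer := range peers {
            // sendRequestVote is an assumed helper that performs the RPC and
            // reports whether the vote was granted.
            granted := s.sendRequestVote(peer, RequestVoteArgs{
                Term:         s.currentTerm,
                CandidateID:  s.id,
                LastLogIndex: len(s.log) - 1,
                LastLogTerm:  s.lastLogTerm(), // assumed helper
            })
            if granted {
                votes++
            }
        }

        // Majority of the full cluster (peers plus this server) => leader.
        if votes > (len(peers)+1)/2 {
            s.state = Leader // immediately start sending heartbeats
        }
        // Otherwise an AppendEntries from a new leader converts this server
        // back to follower, or the election timer fires again and it retries
        // with a higher term.
    }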

Log Replication

  • Handled by leader
  • When client request arrives:
    • Append to local log
    • Replicate to other servers using AppendEntries requests
    • Committed: entry replicated by leader to majority of servers
    • Once committed, apply to local state machine
    • Return result to client
    • Notify other servers of committed entries in future AppendEntries requests
  • Log entries: index, term, command
  • Logs can become inconsistent after leader crashes
  • Raft maintains a high level of coherency between logs (Log Matching Property):
    • If entries in different logs have same term and index, then
      • They also have the same command
      • The two logs are identical up through that entry
  • AppendEntries requests carry a consistency check: the index and term of the entry immediately preceding the new entries; the follower rejects the request unless its log contains a matching entry. This check preserves the properties above.
  • Leader forces other logs to match its own (the follower's side is sketched after this list):
    • nextIndex for each follower (initialized to just past the end of the leader's log)
    • If AppendEntries fails the consistency check, reduce nextIndex for that follower and retry.
    • If the consistency check passes but the follower has existing entries that conflict with the new ones (same index, different term), it removes the conflicting entries and everything that follows them
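  • A sketch of the follower's side of AppendEntries, building on the structs above, showing the consistency check and the removal of conflicting entries; zero-based indexing and the boolean return value are assumptions of this sketch, and updating currentTerm / resetting the election timer are omitted:

    // Returns true if the entries were accepted; on false the leader
    // decrements nextIndex for this follower and retries.
    func (s *RaftServer) handleAppendEntries(args AppendEntriesArgs) bool {
        if args.Term < s.currentTerm {
            return false // request from a stale leader
        }
        // Consistency check: the log must contain an entry at PrevLogIndex
        // whose term matches PrevLogTerm (PrevLogIndex == -1 means "none").
        if args.PrevLogIndex >= len(s.log) ||
            (args.PrevLogIndex >= 0 && s.log[args.PrevLogIndex].Term != args.PrevLogTerm) {
            return false
        }
        // Delete existing entries that conflict with the new ones
        // (same index, different term) and everything after them, then append.
        next := args.PrevLogIndex + 1
        for i, e := range args.Entries {
            if next+i < len(s.log) && s.log[next+i].Term != e.Term {
                s.log = s.log[:next+i] // truncate the conflicting suffix
            }
            if next+i >= len(s.log) {
                s.log = append(s.log, args.Entries[i:]...)
                break
            }
            // Otherwise this entry is already present; keep it.
        }
        // Learn about newly committed entries from the leader.
        if args.LeaderCommit > s.commitIndex {
            s.commitIndex = args.LeaderCommit
            if last := len(s.log) - 1; last < s.commitIndex {
                s.commitIndex = last
            }
        }
        return true
    }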

Safety

  • Must ensure that the leader for a new term always holds all of the log entries committed in previous terms (Leader Completeness Property).
  • Step 1: restriction on elections: don't vote for a candidate unless candidate's log is at least as up-to-date as yours.
  • Compare the indexes and terms of the last log entries: the log whose last entry has the later term is more up-to-date; if the last terms are equal, the longer log is more up-to-date (sketched after this list).
  • Step 2: be very careful about when an entry is considered committed
    • As soon as it is replicated on a majority? Unsafe! An entry from an earlier term can be stored on a majority and still be overwritten by a later leader.
    • Committed => an entry from the leader's current term is replicated on a majority
      • All preceding entries are then also committed, because of the Log Matching Property
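  • A sketch of the Step 1 voting restriction, building on the structs above (terms are assumed to start at 1, so an empty log is treated as having last term 0):

    // True if a candidate whose last log entry has the given index and term
    // is at least as up-to-date as this server, so a vote may be granted.
    func (s *RaftServer) candidateIsUpToDate(lastLogIndex, lastLogTerm int) bool {
        myLastIndex := len(s.log) - 1
        myLastTerm := 0
        if myLastIndex >= 0 {
            myLastTerm = s.log[myLastIndex].Term
        }
        if lastLogTerm != myLastTerm {
            return lastLogTerm > myLastTerm // later last term wins
        }
        return lastLogIndex >= myLastIndex // same last term: longer log wins
    }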

Persistent Storage

  • Each server stores the following in persistent storage (e.g. disk or flash):
    • Current term
    • Most recent vote granted
    • Log
  • These must be recovered from persistent storage after a crash
  • If a server loses its persistent storage, it cannot participate in the cluster anymore
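  • A sketch of saving and restoring the persistent state, building on the structs above; the gob encoding and the io.Writer/io.Reader arguments are assumptions (imports "encoding/gob" and "io"), and the essential point is that this data must reach durable storage before the server responds to the RPC that changed it:

    // Write currentTerm, votedFor, and the log to durable storage.
    func (s *RaftServer) persist(w io.Writer) error {
        enc := gob.NewEncoder(w)
        if err := enc.Encode(s.currentTerm); err != nil {
            return err
        }
        if err := enc.Encode(s.votedFor); err != nil {
            return err
        }
        return enc.Encode(s.log)
    }

    // Reload the persistent state after a crash, before rejoining the cluster.
    func (s *RaftServer) restore(r io.Reader) error {
        dec := gob.NewDecoder(r)
        if err := dec.Decode(&s.currentTerm); err != nil {
            return err
        }
        if err := dec.Decode(&s.votedFor); err != nil {
            return err
        }
        return dec.Decode(&s.log)
    }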

Implementing Raft

  • Follow Figure 2 of the Raft paper exactly! Every tiny detail matters.

Client Interactions

  • Clients interact only with the leader
  • Initially, a client can send a request to any server
    • If that server is not the leader, it rejects the request and returns information about the most recent leader it knows of
    • The client retries until it reaches the leader
  • If leader crashes while executing a client request, the client retries (with a new randomly-chosen server) until the request succeeds
  • This can result in multiple executions of a command: not consistent!
  • Goal: linearizability: System behaves as if each operation is executed exactly once, atomically, sometime between sending of the request and receipt of the response.
  • Solution (sketched after this list):
    • Client generates unique serial number for each request (client id, request number)
    • When retrying, client reuses the original serial number
    • Servers track latest serial number processed for each client, plus associated response (must be stored persistently)
    • When leader detects duplicate, it sends old response without reexecuting request
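  • A sketch of the duplicate-detection table, assuming each request carries a (client id, serial number) pair as above and each client has at most one outstanding request; the StateMachine type and its execute helper are illustrative:

    type Session struct {
        LastSerial int    // highest serial number applied for this client
        LastReply  []byte // response that was returned for that request
    }

    type StateMachine struct {
        sessions map[int64]Session // latest request seen from each client
        // ... plus the actual replicated state
    }

    func newStateMachine() *StateMachine {
        return &StateMachine{sessions: make(map[int64]Session)}
    }

    // apply executes a committed command unless it is a duplicate, in which
    // case the saved response is returned without re-executing it.
    func (sm *StateMachine) apply(clientID int64, serial int, cmd []byte) []byte {
        if sess, ok := sm.sessions[clientID]; ok && serial <= sess.LastSerial {
            return sess.LastReply // duplicate retry: resend the old response
        }
        reply := sm.execute(cmd) // assumed helper that runs the command
        sm.sessions[clientID] = Session{LastSerial: serial, LastReply: reply}
        return reply
    }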

Other Issues

  • Cluster membership
  • Log compaction
  • See paper for details