The Raft Consensus Algorithm

Motivation: Replicated State Machines

Service that is replicated on multiple machines:

  • All instances produce the same behavior
  • Fault tolerant: Available for service as long as a majority of the servers are up

Mechanism:

  • Store incoming commands in a log
  • Replicate the log across all the servers
  • Apply a command to a state machine once it has been safely replicated
  • Consensus module is responsible for replicating the log

Desired properties:

  • Safety (don't do anything bad): if one machine applies a particular command in a particular log slot, no other machine will apply a different command for that slot
  • Availability (do something good, don't get stuck): continue operation in the face of failures (as long as a majority of servers are operational)

Raft Basics

Leader based:

  • Pick one server to be leader
  • It accepts client requests, manages replication
  • Pick a new leader if the current one fails

Server states:

  • Leader
  • Follower: totally passive (only responds to requests from leaders and candidates)
  • Candidate: used for electing new leaders
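The sketches in the rest of these notes use Go; here is a minimal representation of the three states (all names are illustrative, not prescribed by Raft):

  package raft

  // The three server states.
  type ServerState int

  const (
      Follower  ServerState = iota // totally passive: only responds to RPCs
      Candidate                    // trying to become leader
      Leader                       // handles client requests, drives replication
  )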

Time divided into terms:

  • Terms have numbers that increase monotonically
  • Each term starts with an election
    • One or more candidates attempting to become leader
  • Winning candidate (if any) serves as leader for the rest of the term
  • Terms allow detection of stale information
    • Each server stores its current term
    • Terms are exchanged and checked on every request
    • A server that sees a larger term adopts it (and, if leader or candidate, reverts to follower); requests with stale terms are rejected
  • Different servers may observe term transitions at different times (or miss some terms entirely)
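A sketch of the per-server state and of the term check performed on every request and reply, building on the ServerState type above. The field names currentTerm, votedFor, log, commitIndex, lastApplied, nextIndex, and matchIndex follow Figure 2 of the Raft paper; mu, me, peers, and the method name are illustrative:

  import "sync"

  // One log entry: an opaque state-machine command plus the term in which
  // the entry was created; its position in the log is its index.
  type LogEntry struct {
      Index   int
      Term    int
      Command []byte
  }

  // Per-server state used by the sketches in these notes. A constructor (not
  // shown) seeds log with a dummy entry at index 0, so real entries start at 1.
  type Raft struct {
      mu          sync.Mutex
      me          int   // this server's id
      peers       []int // ids of all servers in the cluster, including me
      state       ServerState
      currentTerm int
      votedFor    int // candidate voted for in currentTerm, -1 if none
      log         []LogEntry
      commitIndex int // highest index known to be committed
      lastApplied int // highest index applied to the state machine
      // Leader-only, reinitialized after each election:
      nextIndex  []int // per server: next log index to send
      matchIndex []int // per server: highest index known to be replicated there
  }

  // checkTerm is called, with rf.mu held, for every incoming request and every
  // reply: it rejects stale terms and steps down on seeing a newer one.
  func (rf *Raft) checkTerm(incomingTerm int) bool {
      if incomingTerm < rf.currentTerm {
          return false // stale: reject the request / ignore the reply
      }
      if incomingTerm > rf.currentTerm {
          rf.currentTerm = incomingTerm // adopt the newer term
          rf.votedFor = -1
          rf.state = Follower
      }
      return true
  }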

Servers communicate through a request-response protocol (remote procedure calls, or RPCs). Two request types:

  • RequestVote
  • AppendEntries (also serves as heartbeat)
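In Go, the two request types and their replies might look like this; the field names and their meanings follow Figure 2 of the Raft paper:

  // RequestVote: sent by candidates to gather votes.
  type RequestVoteArgs struct {
      Term         int // candidate's term
      CandidateId  int // candidate requesting the vote
      LastLogIndex int // index of candidate's last log entry
      LastLogTerm  int // term of candidate's last log entry
  }

  type RequestVoteReply struct {
      Term        int  // responder's current term, for the candidate to update itself
      VoteGranted bool // true means the candidate received this vote
  }

  // AppendEntries: sent by the leader to replicate entries; with no entries
  // it doubles as a heartbeat.
  type AppendEntriesArgs struct {
      Term         int        // leader's term
      LeaderId     int        // so followers can redirect clients
      PrevLogIndex int        // index of the entry immediately preceding the new ones
      PrevLogTerm  int        // term of that entry
      Entries      []LogEntry // empty for heartbeats
      LeaderCommit int        // leader's commitIndex
  }

  type AppendEntriesReply struct {
      Term    int  // responder's current term
      Success bool // true if the follower's log matched PrevLogIndex/PrevLogTerm
  }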

Leader Election

All servers start as followers

If a follower receives no heartbeat (AppendEntries) within its election timeout, it starts an election (candidate-side sketch after the outcome list below):

  • Switch to candidate state
  • Increment current term
  • Vote for self
  • Send RequestVote requests to all other servers

Election outcomes:

  • Receive votes from majority of cluster: become leader, start sending heartbeats to prevent additional elections
  • Receive AppendEntries from some other server in the new term: revert back to follower
  • Timeout: start a new election
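A candidate-side sketch of this flow, building on the types above; sendRequestVote is a hypothetical transport helper that performs the RPC and returns false if the peer cannot be reached:

  func (rf *Raft) startElection() {
      rf.mu.Lock()
      rf.state = Candidate
      rf.currentTerm++    // increment current term
      rf.votedFor = rf.me // vote for self
      term := rf.currentTerm
      last := len(rf.log) - 1
      args := RequestVoteArgs{
          Term:         term,
          CandidateId:  rf.me,
          LastLogIndex: last,
          LastLogTerm:  rf.log[last].Term,
      }
      rf.mu.Unlock()

      votes := 1 // our own vote
      for _, peer := range rf.peers {
          if peer == rf.me {
              continue
          }
          go func(peer int) {
              var reply RequestVoteReply
              if !rf.sendRequestVote(peer, &args, &reply) {
                  return // unreachable; the election timeout will fire again
              }
              rf.mu.Lock()
              defer rf.mu.Unlock()
              if !rf.checkTerm(reply.Term) {
                  return // reply from an old term
              }
              if rf.state != Candidate || rf.currentTerm != term || !reply.VoteGranted {
                  return // we stepped down, moved on, or the vote was denied
              }
              votes++ // votes is protected by rf.mu
              if votes > len(rf.peers)/2 {
                  rf.state = Leader
                  // Reinitialize nextIndex/matchIndex and start sending
                  // heartbeats here (omitted).
              }
          }(peer)
      }
  }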

Each server votes for at most one candidate in a given term

Election Safety: at most one server can be elected leader in a given term

Availability: randomized election timeouts reduce split votes
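For example, each server might pick a fresh random timeout in a fixed range before every wait; the 150-300 ms range below is the example used in the paper, not a requirement:

  import (
      "math/rand"
      "time"
  )

  // randomElectionTimeout returns a timeout chosen uniformly in [150, 300) ms,
  // so that one server usually times out and wins before the others start.
  func randomElectionTimeout() time.Duration {
      return time.Duration(150+rand.Intn(150)) * time.Millisecond
  }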

Log Replication

Handled by leader

When a client request arrives at the leader (a sketch follows this list):

  • Append to local log
  • Replicate to other servers using AppendEntries requests
  • Committed: entry replicated by leader to majority of servers
  • Once committed, apply to local state machine
  • Return result to client
  • Notify other servers of committed entries in future AppendEntries requests
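A leader-side sketch of accepting a command; replication (AppendEntries), commit, and apply then proceed asynchronously as sketched in the following subsections:

  // Submit is called (e.g. by an RPC layer, not shown) when a client command
  // reaches this server; only the leader accepts it.
  func (rf *Raft) Submit(command []byte) (index int, term int, isLeader bool) {
      rf.mu.Lock()
      defer rf.mu.Unlock()
      if rf.state != Leader {
          return -1, rf.currentTerm, false // client should retry elsewhere
      }
      entry := LogEntry{Index: len(rf.log), Term: rf.currentTerm, Command: command}
      rf.log = append(rf.log, entry) // append locally; persist, then replicate
      return entry.Index, entry.Term, true
  }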

Log entries: index, term, command

Logs can become inconsistent after leader crashes

Raft maintains a high level of coherency between logs (Log Matching Property):

  • If entries in different logs have same term and index, then
    • They also have the same command
    • The two logs are identical up through that entry

The AppendEntries consistency check preserves the above properties: each request carries the index and term of the entry immediately preceding the new ones, and the follower rejects the request unless its log contains a matching entry at that position.
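A follower-side sketch of the AppendEntries handler, showing the consistency check and the conflict handling described in the list that follows:

  func (rf *Raft) AppendEntries(args *AppendEntriesArgs, reply *AppendEntriesReply) {
      rf.mu.Lock()
      defer rf.mu.Unlock()
      if args.Term < rf.currentTerm {
          reply.Term, reply.Success = rf.currentTerm, false
          return // stale leader
      }
      rf.checkTerm(args.Term) // adopt a newer term if necessary
      rf.state = Follower     // a valid AppendEntries means there is a current leader
      // (this also counts as a heartbeat: reset the election timer, omitted here)
      reply.Term = rf.currentTerm

      // Consistency check: our log must already contain an entry at
      // PrevLogIndex whose term is PrevLogTerm.
      if args.PrevLogIndex >= len(rf.log) ||
          rf.log[args.PrevLogIndex].Term != args.PrevLogTerm {
          reply.Success = false
          return
      }

      // Delete any entries that conflict with the new ones, then append
      // whatever we do not already have.
      for i := range args.Entries {
          idx := args.PrevLogIndex + 1 + i
          if idx < len(rf.log) && rf.log[idx].Term != args.Entries[i].Term {
              rf.log = rf.log[:idx] // remove the conflicting suffix
          }
          if idx >= len(rf.log) {
              rf.log = append(rf.log, args.Entries[i:]...)
              break
          }
      }

      // Learn which entries the leader has committed.
      if args.LeaderCommit > rf.commitIndex {
          rf.commitIndex = args.LeaderCommit
          if last := len(rf.log) - 1; last < rf.commitIndex {
              rf.commitIndex = last
          }
      }
      reply.Success = true
  }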

Leader forces other logs to match its own:

  • Leader keeps a nextIndex for each follower: the index of the next entry to send (initialized to just past the leader's last log entry)
  • If AppendEntries fails the consistency check, the leader decrements nextIndex for that follower and retries
  • If the consistency check passes but the follower holds entries that conflict with the new ones (same index, different term), it deletes the conflicting entries and everything after them, then appends the new entries
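The matching leader-side loop for one follower; sendAppendEntries is a hypothetical transport helper (like sendRequestVote above) that returns false if the follower cannot be reached:

  func (rf *Raft) replicateTo(peer int) {
      for {
          rf.mu.Lock()
          if rf.state != Leader {
              rf.mu.Unlock()
              return
          }
          next := rf.nextIndex[peer]
          args := AppendEntriesArgs{
              Term:         rf.currentTerm,
              LeaderId:     rf.me,
              PrevLogIndex: next - 1,
              PrevLogTerm:  rf.log[next-1].Term,
              Entries:      append([]LogEntry(nil), rf.log[next:]...),
              LeaderCommit: rf.commitIndex,
          }
          rf.mu.Unlock()

          var reply AppendEntriesReply
          if !rf.sendAppendEntries(peer, &args, &reply) {
              return // unreachable; the next heartbeat will try again
          }

          rf.mu.Lock()
          if !rf.checkTerm(reply.Term) || rf.state != Leader {
              rf.mu.Unlock()
              return // old reply, or we stepped down to follower
          }
          if reply.Success {
              rf.matchIndex[peer] = args.PrevLogIndex + len(args.Entries)
              rf.nextIndex[peer] = rf.matchIndex[peer] + 1
              rf.mu.Unlock()
              return
          }
          rf.nextIndex[peer]-- // consistency check failed: back up and retry
          rf.mu.Unlock()
      }
  }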

Safety

Must ensure that the leader for new term always holds all of the log entries committed in previous terms (Leader Completeness Property).

Step 1: restriction on elections: don't vote for a candidate unless candidate's log is at least as up-to-date as yours.

Compare the last log entries: the log whose last entry has the higher term is more up-to-date; if the last terms are equal, the longer log is more up-to-date.
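A follower-side sketch of the voting rule, combining the one-vote-per-term check with the up-to-date comparison:

  func (rf *Raft) RequestVote(args *RequestVoteArgs, reply *RequestVoteReply) {
      rf.mu.Lock()
      defer rf.mu.Unlock()
      if args.Term < rf.currentTerm {
          reply.Term, reply.VoteGranted = rf.currentTerm, false
          return // stale candidate
      }
      rf.checkTerm(args.Term) // adopt a newer term if necessary
      reply.Term = rf.currentTerm

      // Candidate's log is at least as up-to-date as ours: higher last term
      // wins; equal last terms are broken by log length.
      last := len(rf.log) - 1
      upToDate := args.LastLogTerm > rf.log[last].Term ||
          (args.LastLogTerm == rf.log[last].Term && args.LastLogIndex >= last)

      if (rf.votedFor == -1 || rf.votedFor == args.CandidateId) && upToDate {
          rf.votedFor = args.CandidateId // must be persisted before replying
          reply.VoteGranted = true
      }
  }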

Step 2: be very careful about when an entry is considered committed

  • As soon as it is replicated on a majority? Unsafe: an entry from an earlier term can be overwritten by a later leader even after reaching a majority
  • Committed => an entry from the leader's current term is replicated on a majority
    • All preceding entries also committed because of Log Matching Property
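A leader-side sketch of this rule, using the matchIndex values it tracks for each server (names as in the earlier Raft struct sketch):

  // Called with rf.mu held after matchIndex changes. An index N becomes
  // committed once log[N] is from the current term and a majority of servers
  // hold it; everything before N is then committed as well (Log Matching).
  func (rf *Raft) advanceCommitIndex() {
      for n := len(rf.log) - 1; n > rf.commitIndex; n-- {
          if rf.log[n].Term != rf.currentTerm {
              break // never count majorities for older-term entries directly
          }
          count := 1 // the leader itself holds every entry in its log
          for _, peer := range rf.peers {
              if peer != rf.me && rf.matchIndex[peer] >= n {
                  count++
              }
          }
          if count > len(rf.peers)/2 {
              rf.commitIndex = n
              break
          }
      }
  }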

Persistent Storage

Each server stores the following in persistent storage (e.g. disk or flash):

  • Current term
  • Most recent vote granted
  • Log

These must be recovered from persistent storage after a crash

If a server loses its persistent storage, it cannot participate in the cluster anymore
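A sketch of saving and restoring this state with Go's encoding/gob; saveToDisk is a hypothetical stable-storage write, which must be made durable (e.g. fsync'd) before the server replies to any RPC that changed the state:

  import (
      "bytes"
      "encoding/gob"
  )

  // persist encodes the three pieces of persistent state; error handling is
  // omitted for brevity.
  func (rf *Raft) persist() {
      var buf bytes.Buffer
      enc := gob.NewEncoder(&buf)
      enc.Encode(rf.currentTerm) // current term
      enc.Encode(rf.votedFor)    // most recent vote granted
      enc.Encode(rf.log)         // the log
      saveToDisk(buf.Bytes())    // hypothetical durable write
  }

  // readPersist restores the state after a restart.
  func (rf *Raft) readPersist(data []byte) {
      if len(data) == 0 {
          return // first boot: nothing to restore
      }
      dec := gob.NewDecoder(bytes.NewBuffer(data))
      dec.Decode(&rf.currentTerm)
      dec.Decode(&rf.votedFor)
      dec.Decode(&rf.log)
  }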

Implementing Raft

Follow Figure 2 of the Raft paper exactly! Every tiny detail matters.

Client Interactions

Clients interact only with the leader

Initially, a client can send a request to any server

  • If that server is not the leader, it rejects the request and returns the identity of the most recent leader it knows of
  • Client retries until it reaches leader

If leader crashes while executing a client request, the client retries (with a new randomly-chosen server) until the request succeeds

This can result in multiple executions of a command: not consistent!

Goal: linearizability: System behaves as if each operation is executed exactly once, atomically, sometime between sending of the request and receipt of the response.

Solution:

  • Client generates unique serial number for each request (client id, request number)
  • When retrying, client reuses the original serial number
  • Servers track latest serial number processed for each client, plus associated response (must be stored persistently)
  • When leader detects duplicate, it sends old response without reexecuting request
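A minimal sketch of this bookkeeping on the server side; the session table here is a plain in-memory map, whereas a real system would persist it (or rebuild it from the log) so duplicates are still caught after a crash:

  // clientRecord remembers the last request processed for one client.
  type clientRecord struct {
      LastSerial int    // highest request number applied for this client
      Response   []byte // response that was produced for that request
  }

  // sessions maps client id -> record of the latest processed request.
  var sessions = map[int64]clientRecord{}

  // applyOnce executes a committed command unless it is a duplicate, in which
  // case the stored response is returned without re-executing.
  func applyOnce(clientId int64, serial int, execute func() []byte) []byte {
      if rec, ok := sessions[clientId]; ok && rec.LastSerial == serial {
          return rec.Response // duplicate retry: reply with the old response
      }
      resp := execute()
      sessions[clientId] = clientRecord{LastSerial: serial, Response: resp}
      return resp
  }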

Other Issues

Cluster membership

Log compaction

See paper for details