Soffio Admin
Dashboard
Posts
Pages
Tags
Navigation
Uploads
Jobs
Audit
API keys
Site settings
View site
Edit Post: Consensus in Distributed Systems: From Paxos to Raft
Title
Excerpt
Achieving agreement in distributed systems is one of the fundamental challenges in computer science. This article explores the evolution of consensus algorithms, from the theoretical elegance of Paxos to the practical clarity of Raft, examining why distributed consensus matters and how modern systems achieve it.
Body Markdown
# Consensus in Distributed Systems: From Paxos to Raft  ## The Fundamental Problem In a world where single points of failure are unacceptable, distributed systems reign supreme. Yet with distribution comes a profound challenge: **how do independent nodes agree on a single truth?** This is the consensus problem, and it lies at the heart of every reliable distributed system—from databases to blockchains, from cloud orchestration to distributed locks. The consensus problem asks: given a collection of processes that can fail and communicate over an unreliable network, how can they agree on a single value? This seemingly simple question has profound implications for consistency, availability, and partition tolerance. ## The CAP Theorem and Consensus Before diving into consensus algorithms, we must understand the constraints. The CAP theorem tells us that in the presence of network partitions, we must choose between consistency and availability. Consensus algorithms are fundamentally about maintaining consistency—ensuring all nodes agree on the same sequence of operations. ```python # The consensus problem in pseudo-code class ConsensusNode: def __init__(self, node_id, peers): self.node_id = node_id self.peers = peers self.accepted_value = None def propose(self, value): """ Propose a value for consensus. Challenge: How do we ensure all nodes accept the same value, even in the face of failures and network partitions? """ pass def learn(self, value): """ Learn the chosen value. All correct nodes must learn the same value. """ self.accepted_value = value ``` ## Paxos: The Theoretical Foundation Leslie Lamport's Paxos algorithm, introduced in 1989 (though not published until 1998), is the theoretical foundation of distributed consensus. Paxos is famously difficult to understand, leading Lamport to write multiple explanations, including the whimsical "The Part-Time Parliament." ### The Paxos Protocol Paxos operates with three roles: 1. **Proposers**: Propose values for consensus 2. **Acceptors**: Vote on proposed values 3. **Learners**: Learn the chosen value The protocol proceeds in two phases: #### Phase 1: Prepare ```go // Paxos Phase 1: Prepare type Proposer struct { proposalNumber int acceptors []Acceptor } func (p *Proposer) Prepare() PrepareResponse { p.proposalNumber++ // Send prepare(n) to majority of acceptors responses := make(chan PrepareResponse, len(p.acceptors)) for _, acceptor := range p.acceptors { go func(a Acceptor) { responses <- a.ReceivePrepare(p.proposalNumber) }(acceptor) } // Wait for majority majority := len(p.acceptors)/2 + 1 promises := []PrepareResponse{} for i := 0; i < majority; i++ { promises = append(promises, <-responses) } return p.findHighestAccepted(promises) } type Acceptor struct { promisedNumber int acceptedNumber int acceptedValue interface{} } func (a *Acceptor) ReceivePrepare(n int) PrepareResponse { if n > a.promisedNumber { a.promisedNumber = n return PrepareResponse{ Promise: true, AcceptedNumber: a.acceptedNumber, AcceptedValue: a.acceptedValue, } } return PrepareResponse{Promise: false} } ``` #### Phase 2: Accept ```go // Paxos Phase 2: Accept func (p *Proposer) Accept(value interface{}) bool { // Send accept(n, value) to majority of acceptors responses := make(chan bool, len(p.acceptors)) for _, acceptor := range p.acceptors { go func(a Acceptor) { responses <- a.ReceiveAccept(p.proposalNumber, value) }(acceptor) } // Wait for majority majority := len(p.acceptors)/2 + 1 accepted := 0 for i := 0; i < majority; i++ { if <-responses { accepted++ } } return accepted >= majority } func (a *Acceptor) ReceiveAccept(n int, value interface{}) bool { if n >= a.promisedNumber { a.promisedNumber = n a.acceptedNumber = n a.acceptedValue = value return true } return false } ``` ### The Complexity of Paxos Paxos provides strong safety guarantees: - **Validity**: Only proposed values can be chosen - **Agreement**: At most one value is chosen - **Termination**: Eventually, a value is chosen (under reasonable assumptions) However, Paxos is notoriously difficult to implement correctly. The protocol handles multiple proposers competing simultaneously, message reordering, and partial failures. This complexity led to many incorrect implementations and the search for alternatives.  ## Multi-Paxos: Toward Practical Systems Basic Paxos achieves consensus on a single value. Real systems need to agree on a sequence of values—a replicated log. Multi-Paxos extends basic Paxos by electing a stable leader who can skip Phase 1 for subsequent proposals. ```rust // Multi-Paxos with leader election struct MultiPaxos { current_leader: NodeId, log: Vec<LogEntry>, commit_index: usize, } impl MultiPaxos { fn append_entry(&mut self, value: Value) -> Result<(), Error> { if self.is_leader() { // Leader can skip Phase 1 (prepare) let index = self.log.len(); let entry = LogEntry { term: self.current_term, index, value, }; // Directly send accept to followers self.replicate_to_followers(entry)?; self.log.push(entry); Ok(()) } else { // Redirect to leader Err(Error::NotLeader(self.current_leader)) } } fn replicate_to_followers(&self, entry: LogEntry) -> Result<(), Error> { // Send to majority of followers // Wait for acknowledgment // Update commit index todo!() } } ``` ## Raft: Understandability as a Design Goal In 2013, Diego Ongaro and John Ousterhout introduced Raft with an explicit goal: **understandability**. They recognized that consensus algorithms are implemented by humans, and human understanding is crucial for correctness. Raft decomposes consensus into three independent subproblems: 1. **Leader Election**: How to select a leader when one fails 2. **Log Replication**: How the leader replicates its log to followers 3. **Safety**: How to ensure consistency across failures ### Leader Election Raft nodes exist in three states: follower, candidate, or leader. ```typescript enum NodeState { Follower, Candidate, Leader } class RaftNode { private state: NodeState = NodeState.Follower; private currentTerm: number = 0; private votedFor: string | null = null; private log: LogEntry[] = []; private commitIndex: number = 0; private lastApplied: number = 0; // Election timeout triggers leader election private electionTimeout: number = randomRange(150, 300); startElection(): void { // Transition to candidate this.state = NodeState.Candidate; this.currentTerm++; this.votedFor = this.nodeId; let votesReceived = 1; // Vote for self // Request votes from all peers for (const peer of this.peers) { const response = peer.requestVote({ term: this.currentTerm, candidateId: this.nodeId, lastLogIndex: this.log.length - 1, lastLogTerm: this.log[this.log.length - 1]?.term || 0 }); if (response.voteGranted) { votesReceived++; } // Check if we have majority if (votesReceived > this.peers.length / 2) { this.becomeLeader(); return; } } // Election failed, restart timeout this.resetElectionTimeout(); } requestVote(request: VoteRequest): VoteResponse { // Reply false if term < currentTerm if (request.term < this.currentTerm) { return { term: this.currentTerm, voteGranted: false }; } // If votedFor is null or candidateId, and candidate's log is // at least as up-to-date as receiver's log, grant vote if ((this.votedFor === null || this.votedFor === request.candidateId) && this.isLogUpToDate(request.lastLogIndex, request.lastLogTerm)) { this.votedFor = request.candidateId; this.resetElectionTimeout(); return { term: this.currentTerm, voteGranted: true }; } return { term: this.currentTerm, voteGranted: false }; } } ``` ### Log Replication Once elected, the leader handles all client requests and replicates its log to followers. ```typescript class RaftNode { // Leader state private nextIndex: Map<string, number> = new Map(); private matchIndex: Map<string, number> = new Map(); appendEntries(request: AppendEntriesRequest): AppendEntriesResponse { // Reply false if term < currentTerm if (request.term < this.currentTerm) { return { term: this.currentTerm, success: false }; } // Reset election timeout - we heard from leader this.resetElectionTimeout(); // Reply false if log doesn't contain an entry at prevLogIndex // whose term matches prevLogTerm if (request.prevLogIndex >= 0) { const entry = this.log[request.prevLogIndex]; if (!entry || entry.term !== request.prevLogTerm) { return { term: this.currentTerm, success: false }; } } // If an existing entry conflicts with a new one, // delete the existing entry and all that follow it for (let i = 0; i < request.entries.length; i++) { const index = request.prevLogIndex + 1 + i; const newEntry = request.entries[i]; if (this.log[index] && this.log[index].term !== newEntry.term) { this.log = this.log.slice(0, index); } if (index >= this.log.length) { this.log.push(newEntry); } } // Update commit index if (request.leaderCommit > this.commitIndex) { this.commitIndex = Math.min( request.leaderCommit, this.log.length - 1 ); } return { term: this.currentTerm, success: true }; } replicateLog(): void { if (this.state !== NodeState.Leader) return; for (const peer of this.peers) { const nextIdx = this.nextIndex.get(peer.id) || 0; const prevLogIndex = nextIdx - 1; const prevLogTerm = prevLogIndex >= 0 ? this.log[prevLogIndex].term : 0; const response = peer.appendEntries({ term: this.currentTerm, leaderId: this.nodeId, prevLogIndex, prevLogTerm, entries: this.log.slice(nextIdx), leaderCommit: this.commitIndex }); if (response.success) { this.nextIndex.set(peer.id, this.log.length); this.matchIndex.set(peer.id, this.log.length - 1); this.updateCommitIndex(); } else if (response.term > this.currentTerm) { // Discovered higher term, step down this.becomeFollower(response.term); } else { // Decrement nextIndex and retry this.nextIndex.set(peer.id, nextIdx - 1); } } } } ```  ### Safety Properties Raft's safety guarantees ensure that: 1. **Election Safety**: At most one leader per term 2. **Leader Append-Only**: Leaders never overwrite or delete entries 3. **Log Matching**: If two logs contain an entry with the same index and term, all preceding entries are identical 4. **Leader Completeness**: If a log entry is committed in a given term, it will be present in the leaders' logs for all higher terms 5. **State Machine Safety**: If a server has applied a log entry at a given index, no other server will apply a different entry for that index ## Comparing Paxos and Raft | Aspect | Paxos | Raft | |--------|-------|------| | **Understandability** | Complex, multiple explanations exist | Designed for understandability | | **Leader** | Multiple proposers can compete | Single leader, stronger leadership | | **Log structure** | Allows gaps in log | Logs are contiguous, no gaps | | **Implementation** | Many subtle variations | More prescriptive specification | | **Performance** | Theoretical optimum | Practical and efficient | | **Adoption** | Chubby, Spanner | etcd, Consul, CockroachDB | ## Real-World Implementations ### etcd: Raft for Kubernetes etcd, the distributed key-value store backing Kubernetes, uses Raft for consensus. ```go // Simplified etcd Raft usage import "go.etcd.io/etcd/raft/v3" type EtcdNode struct { raftNode raft.Node storage *raft.MemoryStorage proposeC chan string commitC chan *commit } func (n *EtcdNode) processCommits() { for commit := range n.commitC { if commit.data != nil { // Apply committed entry to state machine n.applyToStateMachine(commit.data) } } } func (n *EtcdNode) propose(data []byte) { // Propose to Raft n.raftNode.Propose(context.TODO(), data) } ``` ### CockroachDB: Multi-Raft CockroachDB uses a "multi-Raft" approach, running one Raft group per range of keys. ```sql -- In CockroachDB, consensus happens per range -- Each range has its own Raft group SHOW RANGES FROM TABLE users; -- Output shows multiple ranges, each with Raft replicas: -- range_id | start_key | end_key | replicas -- 1 | /Min | /Table/50 | {1,2,3} -- 2 | /Table/50 | /Table/51 | {2,3,4} ``` ## Beyond Raft: Modern Variants ### Flexible Paxos Recent research shows that Paxos doesn't require majorities in both phases—only the intersection must form a majority. This "Flexible Paxos" can improve availability. ```python class FlexiblePaxos: def __init__(self, n_acceptors): self.n_acceptors = n_acceptors # Phase 1 quorum: 2 nodes # Phase 2 quorum: 3 nodes # Intersection: at least 1 node # 2 + 3 > 4, so guaranteed intersection self.phase1_quorum = 2 self.phase2_quorum = 3 def can_commit(self, phase1_responses, phase2_responses): return (len(phase1_responses) >= self.phase1_quorum and len(phase2_responses) >= self.phase2_quorum) ``` ### EPaxos: Leaderless Consensus Egalitarian Paxos (EPaxos) eliminates the leader bottleneck by allowing any replica to act as leader for its commands. ### Viewstamped Replication VR, developed in the 1980s (before Paxos was published), uses a similar approach to Raft but was less well known. ## The Philosophy of Consensus Beyond the algorithms, consensus raises profound questions: **What is truth in a distributed system?** In a centralized system, the database state is the truth. In a distributed system, truth is what the majority agrees upon. This is a philosophical shift from absolute to consensus-based truth. **The cost of agreement**: Consensus requires communication, and communication requires time. The speed of light imposes a lower bound on consensus latency. We cannot escape physics. **Byzantine failures**: All algorithms discussed assume non-Byzantine failures—nodes may crash but won't lie. Byzantine fault tolerance (BFT) algorithms like PBFT handle malicious nodes but with higher overhead. ```javascript // The fundamental tension in distributed systems const CAP_THEOREM = { choose_two_of_three: ['Consistency', 'Availability', 'Partition Tolerance'], reality: 'Partitions happen. You must choose between C and A.', consensus_choice: 'Consensus algorithms choose Consistency over Availability', philosophical_insight: 'Agreement is more valuable than availability when correctness matters.' }; ```  ## Practical Considerations ### When to Use Consensus Consensus is necessary when: - **Strong consistency** is required (financial transactions, distributed locks) - **Coordination** is needed (leader election, configuration management) - **Ordering** matters (event sequencing, log replication) Consensus is overkill when: - **Eventual consistency** suffices (DNS, caches, analytics) - **Commutativity** allows for conflict-free replication (CRDTs) - **Single writer** pattern can be used ### Performance Implications ```python # Consensus latency components import time class ConsensusLatency: def estimate_latency(self): # Network RTT to majority of nodes network_rtt = 50 # ms, depends on geography # Disk write latency (fsync for durability) disk_write = 10 # ms, depends on storage # Processing time processing = 1 # ms # Minimum latency for one round return network_rtt + disk_write + processing # For cross-datacenter deployment: # 3 datacenters, 5 nodes total # RTT between DCs: 100ms # Total latency: ~110ms per consensus round # For single datacenter: # RTT: 1ms # Total latency: ~12ms per consensus round ``` ## Conclusion: The Price of Agreement Consensus algorithms represent one of distributed systems' greatest achievements. From Paxos's theoretical elegance to Raft's practical clarity, these algorithms enable the reliable distributed systems we depend on daily. Yet consensus is not free. It requires: - **Time**: Multiple network round trips - **Space**: Replicated storage across nodes - **Complexity**: Careful implementation and testing The choice to use consensus is a choice to prioritize **correctness over latency**, **safety over availability**. In a world of distributed systems, this choice is often the right one. As Leslie Lamport noted, "A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable." Consensus algorithms are our defense against such chaos—they transform unreliable components into reliable systems, probabilistic networks into deterministic guarantees. The journey from Paxos to Raft to modern variants continues. Each iteration brings us closer to the ideal: consensus algorithms that are correct, efficient, and—perhaps most importantly—understandable by the engineers who must implement and maintain them. --- *The quest for distributed consensus mirrors a deeper human need: in a world of uncertainty, we seek agreement. In distributed systems, as in life, consensus is how we transform individual knowledge into collective truth.*
Summary Markdown
An exploration of distributed consensus algorithms, from the theoretical foundation of Paxos to the practical clarity of Raft, examining how modern systems achieve agreement in the face of failures and network partitions.
Status
Draft
Published
Tags
Selected
Paxos
#paxos
Raft
#raft
Available
AI
#ai
2
AI Systems
#ai-systems
0
Performance
#performance
5
Programming Languages
#programming-languages
4
Architecture
#architecture
2
Compilation
#compilation
2
Database
#database
2
Distributed Systems
#distributed-systems
2
Raft
#raft
2
Standards
#standards
2
Type Systems
#type-systems
2
B-Tree
#b-tree
1
Consensus
#consensus
1
CQRS
#cqrs
1
Deep Learning
#deep-learning
1
Design Patterns
#design-patterns
1
Edge Computing
#edge-computing
1
Embeddings
#embeddings
1
Event Sourcing
#event-sourcing
1
Event-Driven
#event-driven
1
Functional Programming
#functional-programming
1
Kubernetes
#kubernetes
1
Mathematics
#mathematics
1
Optimization
#optimization
1
Paradigms
#paradigms
1
Paxos
#paxos
1
Scalability
#scalability
1
Search
#search
1
Security
#security
1
Storage Engine
#storage-engine
1
Transformer
#transformer
1
Web
#web
1
WebAssembly
#webassembly
1
Zero Trust
#zero-trust
1
ACID
#acid
0
Async
#async
0
Blockchain
#blockchain
0
Build Tools
#build-tools
0
Bundlers
#bundlers
0
Cache
#cache
0
CAP Theorem
#cap-theorem
0
Compilers
#compilers
0
Computer Vision
#computer-vision
0
Concurrency
#concurrency
0
Consistency
#consistency
0
CPU
#cpu
0
Cryptography
#cryptography
0
Distributed Database
#distributed-database
0
Domain-Driven Design
#domain-driven-design
0
Emergence
#emergence
0
FP
#fp
0
Frontend
#frontend
0
Hardware
#hardware
0
Interoperability
#interoperability
0
IO Multiplexing
#io-multiplexing
0
IoT
#iot
0
Istio
#istio
0
Language Design
#language-design
0
LLM
#llm
0
LLVM
#llvm
0
LSM-Tree
#lsm-tree
0
Machine Learning
#machine-learning
0
Memory Safety
#memory-safety
0
Microservices
#microservices
0
Monitoring
#monitoring
0
Multimodal
#multimodal
0
Network
#network
0
Networking
#networking
0
Neural Networks
#neural-networks
0
NLP
#nlp
0
Observability
#observability
0
Patterns
#patterns
0
Production
#production
0
Programming Models
#programming-models
0
React
#react
0
Reactivity
#reactivity
0
Redis
#redis
0
RL
#rl
0
Rust
#rust
0
Service Mesh
#service-mesh
0
Signals
#signals
0
Static Analysis
#static-analysis
0
Systems Programming
#systems-programming
0
Time-Series
#time-series
0
Tracing
#tracing
0
Transactions
#transactions
0
Type Theory
#type-theory
0
Vector Database
#vector-database
0
Vite
#vite
0
Vue
#vue
0
Web Components
#web-components
0
Webpack
#webpack
0
Pin to top of feeds
Publication
Last Published:
2025/11/02 17:46:11
Save Changes