
Many people may be unfamiliar with raftexample: "I know raft, and I know example, but where does raftexample come from?" Here is a brief introduction. raftexample is a small example that demonstrates how to use the raft module in etcd. It implements a simple kvstore storage cluster on top of the raft protocol and exposes a REST API for clients.

raftexample consists of only a handful of files, a few hundred lines of code in total. Reading it helps us understand the raft protocol much better, with a very high return on the time invested.


Demonstration animation

Officially recommended animation demo address: http://thesecretlivesofdata.com/raft/

The animation walks through both leader election and log replication.

Concept analysis

logical clock

The logical clock is essentially a timer (time.Tick) that fires at a fixed interval and is the core driver of raft's election and heartbeat logic. Several related fields are involved.

electionElapsed: the logical-clock advance count; every time a follower or the leader advances the logical clock, this value is incremented by 1. A follower resets it to 0 when it receives a heartbeat message from the leader; the leader resets it to 0 when its probe condition is reached (see electionTimeout below)

heartbeatElapsed: the leader's logical-clock advance count; each tick on the leader increments both electionElapsed and heartbeatElapsed

heartbeatTimeout: when the leader's heartbeatElapsed count reaches the predefined heartbeatTimeout, the leader sends a heartbeat to every node

electionTimeout: when the leader's electionElapsed count reaches this value and checkQuorum is enabled, the leader checks the survival status of every node in the current cluster. This probe does not issue network requests; it relies on the per-node state the leader already tracks

randomizedElectionTimeout: a randomly generated value; when a follower's electionElapsed count reaches it, the follower initiates a new round of election

So the logical clock mainly drives the leader's heartbeats and liveness probes, and the followers' elections.

Condition for follower to initiate election: electionElapsed >= randomizedElectionTimeout

Condition for leader to send heartbeat: heartbeatElapsed >= heartbeatTimeout

Condition for leader to probe cluster nodes: electionElapsed >= electionTimeout
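To make this concrete, here is a minimal sketch (with assumed field and function names, not the actual etcd code) of how a follower's logical clock drives an election:

 type followerClock struct {
    electionElapsed           int
    randomizedElectionTimeout int
}

// tick is called on every logical-clock advance; startElection stands in for
// raft sending itself a local MsgHup message.
func (c *followerClock) tick(startElection func()) {
    c.electionElapsed++
    if c.electionElapsed >= c.randomizedElectionTimeout {
        c.electionElapsed = 0
        startElection()
    }
}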

Here is a question: why does the leader keep two counts, electionElapsed and heartbeatElapsed? Isn't one of them enough?

It is because electionElapsed is reset to 0 whenever the leader performs a probe (i.e. the probe condition is met), so on the leader the values of electionElapsed and heartbeatElapsed are not kept consistent or in sync, and a single counter cannot serve both purposes.
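A simplified sketch of the leader's tick (again with assumed names, not the real implementation) shows how the two counters advance together but are reset at different moments:

 type leaderClock struct {
    electionElapsed, heartbeatElapsed int
    electionTimeout, heartbeatTimeout int
    checkQuorum                       bool
    probeQuorum, broadcastHeartbeat   func()
}

func (r *leaderClock) tick() {
    r.electionElapsed++
    r.heartbeatElapsed++

    if r.electionElapsed >= r.electionTimeout {
        r.electionElapsed = 0
        if r.checkQuorum {
            r.probeQuorum() // uses locally tracked node state, no network requests
        }
    }
    if r.heartbeatElapsed >= r.heartbeatTimeout {
        r.heartbeatElapsed = 0
        r.broadcastHeartbeat() // corresponds to a local MsgBeat in raft
    }
}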

raft.Peer

Peer: A node in the cluster, each application process that joins the cluster is a node

 type Peer struct {
    ID      uint64
    Context []byte
}

ID: The id of the node. When the process is initialized, it will assign an ID to each node in the cluster

Context: context

raftpb.Message

Message: an important structure in raftexample (hereinafter simply raft). It is the abstraction over all messages: elections, data writes, configuration changes and so on are all just different types of message.

 type Message struct {
    Type MessageType 
    To   uint64      
    From uint64      
    Term uint64      
    LogTerm    uint64   
    Index      uint64   
    Entries    []Entry  
    Commit     uint64   
    Snapshot   Snapshot 
    Reject     bool     
    RejectHint uint64   
    Context    []byte   
}

Type: message type, raft implements different logic processing according to different message types

  • MsgHup MessageType = 0 // When the follower node thinks the leader is down, it initiates an election
  • MsgBeat MessageType = 1 // A local message used only by the leader. A heartbeat timeout (in ticks) is defined when raft is initialized; once the logical clock has advanced that many times, this message is triggered and the leader then sends MsgHeartbeat messages to the followers
  • MsgProp MessageType = 2 // Message type for writing data or modifying cluster configuration
  • MsgApp MessageType = 3 // The leader broadcasts the message of appending the log to the follower
  • MsgAppResp MessageType = 4 // The message type of the follower responding to the leader's append log request
  • MsgVote MessageType = 5 // Message type a candidate sends to request votes in an election
  • MsgVoteResp MessageType = 6 // The message type of the remaining nodes responding to the candidate election command
  • MsgSnap MessageType = 7 // The message type for the leader to send snapshot data to the follower
  • MsgHeartbeat MessageType = 8 // Heartbeat message sent by leader to follower
  • MsgHeartbeatResp MessageType = 9 // Heartbeat message responded by follower
  • MsgUnreachable MessageType = 10 // message unreachable, local message
  • MsgSnapStatus MessageType = 11 // Report the status of the Snap message sent to the follower node, whether it is sent successfully
  • MsgCheckQuorum MessageType = 12 // This is also a local message, which is used by the leader to judge the status of the current surviving nodes. If less than half of the nodes survive, it will be reduced to a follower node
  • MsgTransferLeader MessageType = 13 // Transfer leader rights
  • MsgTimeoutNow MessageType = 14 // When the log of the follower sending MsgTransferLeader is consistent with the leader, the leader sends MsgTimeoutNow to the follower, and the follower starts to elect
  • MsgReadIndex MessageType = 15 // The follower node requests the leader node for the position of the commit index
  • MsgReadIndexResp MessageType = 16 // leader node responds to MsgReadIndex
  • MsgPreVote MessageType = 17 // Optional message sent before the Vote message. If preVote is enabled, a follower performs a pre-vote round before converting to candidate and voting; the Term is not increased at this stage, which avoids frequent elections from nodes in a partition holding fewer than 1/2 of the cluster
  • MsgPreVoteResp MessageType = 18 // other nodes' response to MsgPreVote

To: the ID of the node to which the message is sent

From: ID of the sender node of the message

Term: the current term

LogTerm: when the leader sends log entries to a follower and the follower rejects them, the follower works out a log position from which the leader could retry synchronization; LogTerm is the term of the entry at that position.

Index: The Index of the message in the log

Entries: the log entries carried by the message

Commit: The location of the log commit of the message sending node

Snapshot: Snapshot information when transferring a snapshot

Reject: Whether the node rejects the received message; for example, when the follower receives the leader's MsgApp message and finds that the log cannot be directly appended, it will reject the leader's log synchronization message

RejectHint: After the follower rejects the leader node message, the calculated index position that may match the leader log

Context: context
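For intuition, a follower rejecting an append from the leader might produce a response like the following (purely illustrative values):

 // Node 2 tells node 1 it cannot append the entry at Index 8 and hints that
// index 4 may be a better place to resume synchronization from.
reject := raftpb.Message{
    Type:       raftpb.MsgAppResp,
    From:       2,
    To:         1,
    Term:       5,
    Index:      8,
    Reject:     true,
    RejectHint: 4,
}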

raftpb.Entry

Entry: this is the "log entry" we keep talking about

 type Entry struct {
    Term  uint64
    Index uint64
    Type  EntryType
    Data  []byte
}

Term: the term of this log entry. Every entry is replicated by the leader, and each leader has a term; this is the leader's term at the time it synchronized the entry.

Index: the index, which also serves as the identifier of each log entry

Type: the type of the entry. EntryNormal means a regular log entry; EntryConfChange/EntryConfChangeV2 mean configuration-change entries

Data: the data carried by the log entry

Let's look at two quick examples:

 The case where the leader's and follower's logs are out of sync
idx               1 2 3 4 5 6 7 8 9
term (Leader)     1 3 3 3 5 5 5 5 5
term (Follower)   1 1 1 1 2 2

The case where the leader's and follower's logs are in sync
idx               1 2 3 4 5 6 7 8 9
term (Leader)     1 3 3 3 5 5 5 5 5
term (Follower)   1 3 3 3 5 5 5 5 5

Whether the leader's and a follower's logs are in sync is judged by whether the Index and Term of each Entry in Entries match.
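A minimal sketch of that comparison (simplified; the real check lives inside raftLog and also handles offsets and conflicts):

 // inSync reports whether two logs match entry by entry on Index and Term.
func inSync(leader, follower []raftpb.Entry) bool {
    if len(leader) != len(follower) {
        return false
    }
    for i := range leader {
        if leader[i].Index != follower[i].Index || leader[i].Term != follower[i].Term {
            return false
        }
    }
    return true
}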

Notice

  1. All subsequent data additions, deletions, revisions, inspections, configuration additions and deletions, elections and other messages are collectively referred to as message logs or logs.
  2. In log synchronization, we will ignore wal log and snapshot related content

Source code analysis

Structure introduction

raftNode

 type raftNode struct {
  // receives data logs to store from kvstore
    proposeC    <-chan string            // proposed messages (k,v)
  // receives configuration-change logs from the kv HTTP API
    confChangeC <-chan raftpb.ConfChange // proposed cluster config changes
  // committed logs are handed to kvstore through this channel
    commitC     chan<- *commit           // entries committed to log (k,v)
  // channel for reporting errors between modules
    errorC      chan<- error             // errors from raft session

  // id of this node
    id          int      // client ID for raft session
  // ip:port of every node in the current cluster
    peers       []string // raft peer URLs
  // whether this node is joining an existing cluster; on startup this decides whether we restart an old node or start a new one
    join        bool     // node is joining an existing cluster
  // WAL directory
    waldir      string   // path to WAL directory
  // snapshot directory
    snapdir     string   // path to snapshot directory
  // function used to obtain a snapshot
    getSnapshot func() ([]byte, error)

  // cluster configuration state
    confState     raftpb.ConfState
  // Index of the log entry covered by the snapshot
    snapshotIndex uint64
  // Index of the highest applied log entry
    appliedIndex  uint64

    // raft backing for the commit/error channel
  // node instance, implements the methods of the Node interface
    node        raft.Node
  // storage instance
    raftStorage *raft.MemoryStorage
  // WAL instance
    wal         *wal.WAL

  // snapshotter instance, manages snapshots
    snapshotter      *snap.Snapshotter
  // channel for interacting with the snapshotter; signals whether the snapshotter instance has been created
    snapshotterReady chan *snap.Snapshotter // signals when snapshotter is ready

  // minimum number of applied entries between two snapshots
  // once the entries applied since the last snapshot exceed this number, a new snapshot is taken, releasing storage pressure in the WAL and raftLog in time
    snapCount uint64
  // network component
    transport *rafthttp.Transport
  // channel that tells serveChannels to stop and shuts down the network component
    stopc     chan struct{} // signals proposal channel closed
  // shuts down the raft node's HTTP server
    httpstopc chan struct{} // signals http server to shutdown
  // notifies other modules once the raft node's HTTP server has shut down
    httpdonec chan struct{} // signals http server shutdown complete

  // logging component
    logger *zap.Logger
}

raft

 type raft struct {
  // node id
    id uint64

  // the current node's Term
    Term uint64
  // who this node voted for in the current election; 0 at initialization, i.e. voted for nobody
    Vote uint64

  // related to readIndex requests; not covered in detail here
    readStates []ReadState

    // the log
    raftLog *raftLog

  // maximum size of a single message
    maxMsgSize         uint64
  // maximum amount of uncommitted log entries; once exceeded, no more entries are appended
    maxUncommittedSize uint64
    // TODO(tbg): rename to trk.
  // state of every node in the cluster, including log replication progress; introduced separately below
    prs tracker.ProgressTracker

  // state: follower, candidate, leader, etc.
    state StateType

    // isLearner is true if the local raft node is a learner.
    isLearner bool

  // messages waiting to be sent by this node; they are consumed promptly elsewhere
    msgs []pb.Message

    // the leader id
    lead uint64
    // leadTransferee is id of the leader transfer target when its value is not zero.
    // Follow the procedure defined in raft thesis 3.10.
  // id of the leadership transfer target
    leadTransferee uint64
    // Only one conf change may be pending (in the log, but not yet
    // applied) at a time. This is enforced via pendingConfIndex, which
    // is set to a value >= the log index of the latest pending
    // configuration change (if any). Config changes are only allowed to
    // be proposed if the leader's applied index is greater than this
    // value.
    pendingConfIndex uint64
    // an estimate of the size of the uncommitted tail of the Raft log. Used to
    // prevent unbounded log growth. Only maintained by the leader. Reset on
    // term changes.
  // amount of uncommitted log entries
    uncommittedSize uint64

    readOnly *readOnly

    // number of ticks since it reached last electionTimeout when it is leader
    // or candidate.
    // number of ticks since it reached last electionTimeout or received a
    // valid message from current leader when it is a follower.
    electionElapsed int

    // number of ticks since it reached last heartbeatTimeout.
    // only leader keeps heartbeatElapsed.
    heartbeatElapsed int

  // whether liveness probing (checkQuorum) is enabled
    checkQuorum bool
  // whether preVote is enabled
    preVote     bool

    heartbeatTimeout int
    electionTimeout  int
    // randomizedElectionTimeout is a random number between
    // [electiontimeout, 2 * electiontimeout - 1]. It gets reset
    // when raft changes its state to follower or candidate.
    randomizedElectionTimeout int
  // whether forwarding proposals to the leader is disabled; if set, a follower simply drops the data logs it receives instead of forwarding them to the leader
    disableProposalForwarding bool

  // logical-clock method: tickHeartbeat for the leader, tickElection for followers and candidates
    tick func()
  // message handling method: stepLeader for the leader, stepFollower for followers, stepCandidate for candidates
    step stepFunc

    logger Logger

    // pendingReadIndexMessages is used to store messages of type MsgReadIndex
    // that can't be answered as new leader didn't committed any log in
    // current term. Those will be handled as fast as first log is committed in
    // current term.
    pendingReadIndexMessages []pb.Message
}

Here are two points to explain:

checkQuorum: this switch lets the leader judge the liveness of each node. The leader records an active state for a follower when it exchanges messages with it. If a network partition leaves the leader in a partition with fewer than 1/2 of the nodes, that leader is no longer useful, and it can proactively demote itself to a follower node.

preVote: preVote also exists for network partitions. When the network is partitioned and one partition holds fewer than 1/2 of the nodes, its followers keep initiating elections, incrementing their Term each time, but the elections can never succeed, so they fall into a loop: initiate election -> Term+1 -> election fails -> initiate election -> Term+1... By the time the partitions reconnect, such a Term may already be very large, possibly larger than the real leader's Term, which would disturb the correctness of the log. Therefore nodes first initiate a preVote, during which the Term stays unchanged, and only after the preVote succeeds do they start the actual election.
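Both switches are ordinary fields on raft.Config. raftexample leaves them at their zero values, but enabling them would look roughly like this (a sketch based on the Config used later in startRaft):

 c := &raft.Config{
    ID:              0x01,
    ElectionTick:    10,
    HeartbeatTick:   1,
    Storage:         raft.NewMemoryStorage(),
    MaxSizePerMsg:   1024 * 1024,
    MaxInflightMsgs: 256,
    CheckQuorum:     true, // leader demotes itself if it loses contact with a quorum
    PreVote:         true, // run a pre-vote round before incrementing Term
}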

raftLog

raftLog is the component that manages logs in the node

 type raftLog struct {
    // storage contains all stable entries since the last snapshot.
  // storage component; holds all stable log entries since the last snapshot
    storage Storage

    // unstable contains all unstable entries and snapshot.
    // they will be saved into storage.
  // holds all unstable log entries and the unstable snapshot
    unstable unstable

    // committed is the highest log position that is known to be in
    // stable storage on a quorum of nodes.
  // highest Index of committed entries, tracked against the storage component
    committed uint64
    // applied is the highest log position that the application has
    // been instructed to apply to its state machine.
    // Invariant: applied <= committed
  // highest Index of applied entries
    applied uint64

    logger Logger

    // maxNextEntsSize is the maximum number aggregate byte size of the messages
    // returned from calls to nextEnts.
    maxNextEntsSize uint64
}

// Invariant: applied <= committed

The official comment on the applied field contains the invariant applied <= committed, i.e. it always holds. Why is that?

Let's first briefly understand the life cycle of the log, and then we will introduce it in detail later when we track the synchronization of the log.

A user submits a request to create data -> kvstore generates a data log entry -> the entry is placed in the unstable structure -> the entry is appended to storage -> the entry is replicated to the other nodes -> once enough nodes have replicated it, the entry is committed and the committed position is updated -> kvstore stores the data -> the applied position is updated. Since an entry is applied only after it has been committed, applied can never get ahead of committed.
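A toy sketch of that last step (assumed types and accessor, not etcd code) makes the ordering explicit: entries are handed to the state machine only out of the already-committed range, so applied cannot pass committed.

 type logView struct {
    applied, committed uint64
    entryAt            func(uint64) raftpb.Entry // assumed accessor
}

func (l *logView) applyCommitted(apply func(raftpb.Entry)) {
    for l.applied < l.committed {
        l.applied++
        apply(l.entryAt(l.applied))
    }
}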

start process

 func main() {
    cluster := flag.String("cluster", "http://127.0.0.1:9021", "comma separated cluster peers")
    id := flag.Int("id", 1, "node ID")
    kvport := flag.Int("port", 9121, "key-value server port")
    join := flag.Bool("join", false, "join an existing cluster")
    flag.Parse()

  // create the proposeC and confChangeC channels; they are used for interaction between the kvstore, raftNode and http modules, so they are created out here
    proposeC := make(chan string)
    defer close(proposeC)
    confChangeC := make(chan raftpb.ConfChange)
    defer close(confChangeC)

    // raft provides a commit stream for the proposals from the http api
    var kvs *kvstore
    getSnapshot := func() ([]byte, error) { return kvs.getSnapshot() }
  // start the raftNode module
    commitC, errorC, snapshotterReady := newRaftNode(*id, strings.Split(*cluster, ","), *join, getSnapshot, proposeC, confChangeC)

  // start the kvstore module
    kvs = newKVStore(<-snapshotterReady, proposeC, commitC, errorC)

    // the key-value http handler will propose updates to raft
  // start the httpKVAPI module, which receives requests
    serveHttpKVAPI(kvs, *kvport, confChangeC, errorC)
}

The startup process of raftexample is fairly simple: every module is started, and interaction channels are created between the modules, which makes communication convenient and keeps the individual modules decoupled.

  1. raftNode: The core module of raftexample, which provides the raft protocol, master selection, log synchronization and other capabilities
  2. kvstore: The storage module of raftexample, the final submitted log will be stored in it. In raftexample, this storage system is essentially a map
  3. httpKvApi: Provides an api for interacting with the outside world. You can create, modify and delete stored data and add and delete cluster nodes through httpapi.

We will analyze them from simple to deep, in the order httpKvApi -> kvstore -> raftNode.

httpKVAPI module

httpKvApi provides an HTTP service that operates on different objects (data and cluster nodes) depending on the HTTP method, implementing create, update, delete and query for both data and nodes.

start up

 func serveHttpKVAPI(kv *kvstore, port int, confChangeC chan<- raftpb.ConfChange, errorC <-chan error) {
    // define the http server
  srv := http.Server{
        Addr: ":" + strconv.Itoa(port),
        Handler: &httpKVAPI{
            store:       kv,
            confChangeC: confChangeC,
        },
    }
    go func() {
    // start the http server
        if err := srv.ListenAndServe(); err != nil {
            log.Fatal(err)
        }
    }()

    // exit when raft goes down
  // communicate with the raft layer below; if raft dies, this module exits too
    if err, ok := <-errorC; ok {
        log.Fatal(err)
    }
}

The logic of startup is relatively simple

It starts the HTTP service asynchronously and then blocks reading the error channel: for one thing this keeps the process from exiting, and for another, once the raft module exits, the HTTP module can also exit in time.

request processing

Request processing is unified in the ServeHTTP method

 func (h *httpKVAPI) ServeHTTP(w http.ResponseWriter, r *http.Request) {
    key := r.RequestURI
    defer r.Body.Close()
    switch r.Method {
    // PUT means adding or modifying data
    case http.MethodPut:
        v, err := io.ReadAll(r.Body)
        if err != nil {
            log.Printf("Failed to read on PUT (%v)\n", err)
            http.Error(w, "Failed on PUT", http.StatusBadRequest)
            return
        }

        h.store.Propose(key, string(v))

        // Optimistic-- no waiting for ack from raft. Value is not yet
        // committed so a subsequent GET on the key may return old value
    // return immediately; the lower layers still have to run the consistency machinery, so an immediate GET may still return the old value
        w.WriteHeader(http.StatusNoContent)
    case http.MethodGet:
    // GET reads the value directly from kvstore
        if v, ok := h.store.Lookup(key); ok {
            w.Write([]byte(v))
        } else {
            http.Error(w, "Failed to GET", http.StatusNotFound)
        }
    case http.MethodPost:
    // POST adds a cluster node: parse the nodeId and pass it to raftNode via confChangeC
        url, err := io.ReadAll(r.Body)
        if err != nil {
            log.Printf("Failed to read on POST (%v)\n", err)
            http.Error(w, "Failed on POST", http.StatusBadRequest)
            return
        }

        nodeId, err := strconv.ParseUint(key[1:], 0, 64)
        if err != nil {
            log.Printf("Failed to convert ID for conf change (%v)\n", err)
            http.Error(w, "Failed on POST", http.StatusBadRequest)
            return
        }

        cc := raftpb.ConfChange{
            Type:    raftpb.ConfChangeAddNode,
            NodeID:  nodeId,
            Context: url,
        }
        h.confChangeC <- cc
        // As above, optimistic that raft will apply the conf change
        w.WriteHeader(http.StatusNoContent)
    case http.MethodDelete:
    // DELETE removes a cluster node, also handed to raftNode via confChangeC
        nodeId, err := strconv.ParseUint(key[1:], 0, 64)
        if err != nil {
            log.Printf("Failed to convert ID for conf change (%v)\n", err)
            http.Error(w, "Failed on DELETE", http.StatusBadRequest)
            return
        }

        cc := raftpb.ConfChange{
            Type:   raftpb.ConfChangeRemoveNode,
            NodeID: nodeId,
        }
        h.confChangeC <- cc

        // As above, optimistic that raft will apply the conf change
        w.WriteHeader(http.StatusNoContent)
    default:
        w.Header().Set("Allow", http.MethodPut)
        w.Header().Add("Allow", http.MethodGet)
        w.Header().Add("Allow", http.MethodPost)
        w.Header().Add("Allow", http.MethodDelete)
        http.Error(w, "Method not allowed", http.StatusMethodNotAllowed)
    }
}

That is all of httpKvApi's request handling:

  1. PUT request: adds or modifies data stored in kvstore. This is not a direct modification; it calls h.store.Propose (i.e. kvstore.Propose), which sends the data to raftNode through proposeC so that consistency is ensured before the data is committed and saved into kvstore. This is the core of raft and is analyzed in detail later (see the client sketch after this list).
  2. GET request: relatively simple; it reads data directly from kvstore. Since PUT does not modify the store directly, a GET issued immediately after a PUT may find that raftNode has not yet processed and committed the change, and will still read the old data.
  3. POST request: adds a cluster node; the leader learns about the node and propagates it to the followers.
  4. DELETE request: removes a node; this is also handed to the leader for processing.
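If you want to poke at the API by hand, a small client sketch (assuming a node is serving the kv API on port 9121, as in main above) could look like this:

 package main

import (
    "bytes"
    "fmt"
    "io"
    "net/http"
)

func main() {
    base := "http://127.0.0.1:9121"

    // PUT /my-key proposes a value for the key "my-key".
    req, _ := http.NewRequest(http.MethodPut, base+"/my-key", bytes.NewBufferString("hello"))
    if _, err := http.DefaultClient.Do(req); err != nil {
        panic(err)
    }

    // GET /my-key reads it back; it may still return the old value because
    // the PUT is applied asynchronously through raft.
    resp, err := http.Get(base + "/my-key")
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()
    body, _ := io.ReadAll(resp.Body)
    fmt.Println(string(body))
}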

kvstore module

Create kvstore

 func newKVStore(snapshotter *snap.Snapshotter, proposeC chan<- string, commitC <-chan *commit, errorC <-chan error) *kvstore {
  // 实例化kvstore,example里面存储系统就是一个map
    s := &kvstore{proposeC: proposeC, kvStore: make(map[string]string), snapshotter: snapshotter}
  // 加载snapshot
    snapshot, err := s.loadSnapshot()
    if err != nil {
        log.Panic(err)
    }
  // 如果snapshot不为空,则先从snapshot里面把数据恢复
    if snapshot != nil {
        log.Printf("loading snapshot at term %d and index %d", snapshot.Metadata.Term, snapshot.Metadata.Index)
        if err := s.recoverFromSnapshot(snapshot.Data); err != nil {
            log.Panic(err)
        }
    }
    // read commits from raft into kvStore map until error
  // 开启一个goroutine,读取从raftNode里面提交的commit数据
    go s.readCommits(commitC, errorC)
    return s
}

// 加载snapshot
func (s *kvstore) loadSnapshot() (*raftpb.Snapshot, error) {
    snapshot, err := s.snapshotter.Load()
  // ErrNoSnapshot 表示没有snapshot,这里消化掉err,避免上层模块退出
    if err == snap.ErrNoSnapshot {
        return nil, nil
    }
    if err != nil {
        return nil, err
    }
  // 找到了,则返回
    return snapshot, nil
}

func (s *Snapshotter) Load() (*raftpb.Snapshot, error) {
    return s.loadMatching(func(*raftpb.Snapshot) bool { return true })
}

func (s *Snapshotter) loadMatching(matchFn func(*raftpb.Snapshot) bool) (*raftpb.Snapshot, error) {
  // 这里返回snapshot的文件列表,按照时间排序,也就是从新到旧排序
    names, err := s.snapNames()
    if err != nil {
        return nil, err
    }
    var snap *raftpb.Snapshot
  // 遍历snapshot文件,其实也就是读取最新的文件
    for _, name := range names {
    // loadSnap就不追了,就是读取snapshot文件的数据,并反序列化为结构体对象
        if snap, err = loadSnap(s.lg, s.dir, name); err == nil && matchFn(snap) {
            return snap, nil
        }
    }
  // 没找到获取没有合适的snapshot,就返回ErrNoSnapshot
    return nil, ErrNoSnapshot
}

// 从snapshot里面恢复数据到kvstore,这里的入参是上面snapshot.Data,也就是[]byte
func (s *kvstore) recoverFromSnapshot(snapshot []byte) error {
    var store map[string]string
    if err := json.Unmarshal(snapshot, &store); err != nil {
        return err
    }
  // 反序列化为map后,直接赋值给kvstore
    s.mu.Lock()
    defer s.mu.Unlock()
    s.kvStore = store
    return nil
}

Created logic:

  1. First instantiate the kvstore structure, and instantiate the storage system, which is a map in the example
  2. Then locally find if there is a snapshot, read the latest snapshot, and deserialize it into a snapshot structure
  3. If the snapshot is found, the snapshot.Data is deserialized into a map and given to the kvstore, and the stored data of the kvstore is restored from the snapshot.
  4. Finally, start a goroutine, interact with raftNode through commitC, read data and store it

data storage

As mentioned above, kvstore uses a channel to communicate with raftNode, reading committed data from it and storing it. Here is the specific logic.

 func (s *kvstore) readCommits(commitC <-chan *commit, errorC <-chan error) {
    for commit := range commitC {
    // 如果读取到nil,则从snapshot里面再次恢复数据
        if commit == nil {
            // signaled to load snapshot
            snapshot, err := s.loadSnapshot()
            if err != nil {
                log.Panic(err)
            }
            if snapshot != nil {
                log.Printf("loading snapshot at term %d and index %d", snapshot.Metadata.Term, snapshot.Metadata.Index)
                if err := s.recoverFromSnapshot(snapshot.Data); err != nil {
                    log.Panic(err)
                }
            }
            continue
        }

    // 读取数据
        for _, data := range commit.data {
            var dataKv kv
            dec := gob.NewDecoder(bytes.NewBufferString(data))
            if err := dec.Decode(&dataKv); err != nil {
                log.Fatalf("raftexample: could not decode message (%v)", err)
            }
            s.mu.Lock()
            s.kvStore[dataKv.Key] = dataKv.Val
            s.mu.Unlock()
        }
    // 关闭commit的chan,以通知上层模块存储完成
        close(commit.applyDoneC)
    }
  // 如果遇到error chan,跟httpkvapi模块一样,退出当前模块
    if err, ok := <-errorC; ok {
        log.Fatal(err)
    }
}

The commitC channel and processing logic are relatively simple

The goroutine blocks on this channel; when data arrives it is processed, and the upper module is then notified that processing is complete.

If raftNode encounters an exception and exits, it will be notified by error chan, and the kvstore module will also exit

data lookup

 func (s *kvstore) Lookup(key string) (string, bool) {
    s.mu.RLock()
    defer s.mu.RUnlock()
    v, ok := s.kvStore[key]
    return v, ok
}

The search is relatively simple, the data structure is a map, just read the data directly from the map

data submission

I mentioned data storage earlier, so what is the relationship between data submission and data storage, and why are there two ways to process data?

In the raft protocol, when a user submits a request to create or modify data, the data is first proposed; raftNode then replicates it to the other nodes, and only after more than half of the nodes have received it is the data actually stored in the storage system.

 func (s *kvstore) Propose(k string, v string) {
    var buf bytes.Buffer
    if err := gob.NewEncoder(&buf).Encode(kv{k, v}); err != nil {
        log.Fatal(err)
    }
    s.proposeC <- buf.String()
}

So data submission here actually sends the data to raftNode through proposeC; after raftNode has processed it, the data is sent to kvstore through commitC for storage.

Summary & Notes

  1. The map+lock approach here is less efficient; using sync.Map directly would be better (a sketch follows this list). Of course, this is just an example, so it does not need to go that far.
  2. Data submission and storage flow: when a user issues a data create/modify request through httpkvapi, kvstore first forwards the data to raftNode through proposeC; after raftNode finishes processing, the data is stored in the database (that is, kvstore's own map).
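For reference, the lookup/store pair with sync.Map would look roughly like this (a sketch, not part of raftexample; it needs the sync package):

 // kvstoreSync replaces the map+RWMutex combination with a sync.Map.
type kvstoreSync struct {
    kvStore sync.Map // key -> value, both strings
}

func (s *kvstoreSync) Lookup(key string) (string, bool) {
    v, ok := s.kvStore.Load(key)
    if !ok {
        return "", false
    }
    return v.(string), true
}

func (s *kvstoreSync) store(key, val string) {
    s.kvStore.Store(key, val)
}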

raftNode module

start raftNode

 func newRaftNode(id int, peers []string, join bool, getSnapshot func() ([]byte, error), proposeC <-chan string,
    confChangeC <-chan raftpb.ConfChange) (<-chan *commit, <-chan error, <-chan *snap.Snapshotter) {

    commitC := make(chan *commit)
    errorC := make(chan error)

    rc := &raftNode{
        proposeC:    proposeC,
        confChangeC: confChangeC,
        commitC:     commitC,
        errorC:      errorC,
        id:          id,
        peers:       peers,
        join:        join,
        waldir:      fmt.Sprintf("raftexample-%d", id),
        snapdir:     fmt.Sprintf("raftexample-%d-snap", id),
        getSnapshot: getSnapshot,
        snapCount:   defaultSnapshotCount,
        stopc:       make(chan struct{}),
        httpstopc:   make(chan struct{}),
        httpdonec:   make(chan struct{}),

        logger: zap.NewExample(),

        snapshotterReady: make(chan *snap.Snapshotter, 1),
        // rest of structure populated after WAL replay
    }
  // 异步启动raft
    go rc.startRaft()
  // 返回commitC,errorC rc.snapshotterReady,用于跟其他模块通信
    return commitC, errorC, rc.snapshotterReady
}

func (rc *raftNode) startRaft() {
    if !fileutil.Exist(rc.snapdir) {
        if err := os.Mkdir(rc.snapdir, 0750); err != nil {
            log.Fatalf("raftexample: cannot create dir for snapshot (%v)", err)
        }
    }
    rc.snapshotter = snap.New(zap.NewExample(), rc.snapdir)

    oldwal := wal.Exist(rc.waldir)
    rc.wal = rc.replayWAL()

    // signal replay has finished
  // 通知其他模块 snapshotter初始化完成
    rc.snapshotterReady <- rc.snapshotter

    rpeers := make([]raft.Peer, len(rc.peers))
    for i := range rpeers {
        rpeers[i] = raft.Peer{ID: uint64(i + 1)}
    }
  // 初始化raft 的相关配置,用于启动raft
    c := &raft.Config{
        ID:                        uint64(rc.id),
        ElectionTick:              10,
        HeartbeatTick:             1,
        Storage:                   rc.raftStorage,
        MaxSizePerMsg:             1024 * 1024,
        MaxInflightMsgs:           256,
        MaxUncommittedEntriesSize: 1 << 30,
    }

  // 根据oldwal 和 join,判断是新启动的节点还是重启的节点
  // 如果是重启,则可以从snapshot里面恢复数据
    if oldwal || rc.join {
        rc.node = raft.RestartNode(c)
    } else {
        rc.node = raft.StartNode(c, rpeers)
    }

  // 初始化网络组件
    rc.transport = &rafthttp.Transport{
        Logger:      rc.logger,
        ID:          types.ID(rc.id),
        ClusterID:   0x1000,
        Raft:        rc,
        ServerStats: stats.NewServerStats("", ""),
        LeaderStats: stats.NewLeaderStats(zap.NewExample(), strconv.Itoa(rc.id)),
        ErrorC:      make(chan error),
    }

    rc.transport.Start()
  // 启动与各个节点的网络pipeline通信通道
    for i := range rc.peers {
        if i+1 != rc.id {
            rc.transport.AddPeer(types.ID(i+1), []string{rc.peers[i]})
        }
    }

    go rc.serveRaft()
    go rc.serveChannels()
}

raftNode startup process:

  1. Start raft and return commitC (used to communicate with kvstore), errorC (used to notify other modules of errors so they can exit in time), and the snapshotterReady channel (used to tell kvstore that snapshot initialization is complete)
  2. Initialize snapshotter, playback wal log
  3. Initialize raft.Config, and judge whether it is a new node or an old node according to whether the wal directory exists and the join configuration, and the old node can restore the data from the snapshot
  4. Initialize network components, communicate with each node, and initialize the pipeline channel of each node
  5. Start the raft server service
  6. Start each channel of raftNode and communicate with each module
start node
 func StartNode(c *Config, peers []Peer) Node {
    if len(peers) == 0 {
        panic("no peers given; use RestartNode instead")
    }
  // 初始化node节点,并初始化配置
    rn, err := NewRawNode(c)
    if err != nil {
        panic(err)
    }
  // 将各个节点的信息存储到raftLog里面,等待日志同步,然后周知各个节点配置变更的消息
    err = rn.Bootstrap(peers)
    if err != nil {
        c.Logger.Warningf("error occurred during starting a new node: %v", err)
    }

  // 实例化node节点
    n := newNode(rn)

  // node就开始在后台异步运行了,直至退出
    go n.run()
    return &n
}

func NewRawNode(config *Config) (*RawNode, error) {
  // 根据配置,实例化raft
    r := newRaft(config)
    rn := &RawNode{
        raft: r,
    }
  // 初始化Leader, State, Term Vote CommitIndex 等raft自身属性
    rn.prevSoftSt = r.softState()
    rn.prevHardSt = r.hardState()
    return rn, nil
}

func newRaft(c *Config) *raft {
    if err := c.validate(); err != nil {
        panic(err.Error())
    }
    raftlog := newLogWithSize(c.Storage, c.Logger, c.MaxCommittedSizePerReady)
    hs, cs, err := c.Storage.InitialState()
    if err != nil {
        panic(err) // TODO(bdarnell)
    }

    r := &raft{
        id:                        c.ID,
        lead:                      None,
        isLearner:                 false,
        raftLog:                   raftlog,
        maxMsgSize:                c.MaxSizePerMsg,
        maxUncommittedSize:        c.MaxUncommittedEntriesSize,
        prs:                       tracker.MakeProgressTracker(c.MaxInflightMsgs),
        electionTimeout:           c.ElectionTick,
        heartbeatTimeout:          c.HeartbeatTick,
        logger:                    c.Logger,
        checkQuorum:               c.CheckQuorum,
        preVote:                   c.PreVote,
        readOnly:                  newReadOnly(c.ReadOnlyOption),
        disableProposalForwarding: c.DisableProposalForwarding,
    }

  // 初始化了各个节点在当前节点的状态及存储信息等,包括投票信息,日志commit节点,是否活跃等
    cfg, prs, err := confchange.Restore(confchange.Changer{
        Tracker:   r.prs,
        LastIndex: raftlog.lastIndex(),
    }, cs)
    if err != nil {
        panic(err)
    }
    assertConfStatesEquivalent(r.logger, cs, r.switchToConfig(cfg, prs))

    if !IsEmptyHardState(hs) {
        r.loadState(hs)
    }
  // 更新raftLog的apply index
    if c.Applied > 0 {
        raftlog.appliedTo(c.Applied)
    }
  // 启动后,就会变更集群中的follower节点
    r.becomeFollower(r.Term, None)

    var nodesStrs []string
    for _, n := range r.prs.VoterNodes() {
        nodesStrs = append(nodesStrs, fmt.Sprintf("%x", n))
    }

    r.logger.Infof("newRaft %x [peers: [%s], term: %d, commit: %d, applied: %d, lastindex: %d, lastterm: %d]",
        r.id, strings.Join(nodesStrs, ","), r.Term, r.raftLog.committed, r.raftLog.applied, r.raftLog.lastIndex(), r.raftLog.lastTerm())
    return r
}

// newNode就初始化了各个chan,与各个模块进行通信
func newNode(rn *RawNode) node {
    return node{
        propc:      make(chan msgWithResult),
        recvc:      make(chan pb.Message),
        confc:      make(chan pb.ConfChangeV2),
        confstatec: make(chan pb.ConfState),
        readyc:     make(chan Ready),
        advancec:   make(chan struct{}),
        // make tickc a buffered chan, so raft node can buffer some ticks when the node
        // is busy processing raft messages. Raft node will resume process buffered
        // ticks when it becomes idle.
        tickc:  make(chan struct{}, 128),
        done:   make(chan struct{}),
        stop:   make(chan struct{}),
        status: make(chan chan Status),
        rn:     rn,
    }
}

The startup process of the Node node:

  1. Instantiate the node, and initialize softState and hardState, that is, raft's own properties such as Leader, State, Term, Vote, CommitIndex, etc.
  2. Instantiate raft, initialize raftLog, and then retrieve the hardState and configuration information of the storage
  3. Initializes the status and storage information of each node in the current node, including voting information, log commit node, whether it is active, etc.
  4. Update the hardState of the node according to the hardState retrieved from the storage
  5. downgrade to follower
start serverRaft
 func (rc *raftNode) serveRaft() {
    url, err := url.Parse(rc.peers[rc.id-1])
    if err != nil {
        log.Fatalf("raftexample: Failed parsing URL (%v)", err)
    }

    ln, err := newStoppableListener(url.Host, rc.httpstopc)
    if err != nil {
        log.Fatalf("raftexample: Failed to listen rafthttp (%v)", err)
    }

    err = (&http.Server{Handler: rc.transport.Handler()}).Serve(ln)
    select {
    case <-rc.httpstopc:
    default:
        log.Fatalf("raftexample: Failed to serve rafthttp (%v)", err)
    }
    close(rc.httpdonec)
}

serveRaft starts an HTTP service used for communication between nodes. Unlike httpKvApi, which is only for clients talking to the cluster, the raft server carries the inter-node traffic: leader election, log synchronization and so on.

With the startup process covered, every object is initialized and ready, and now the real work begins: leader election and log synchronization.

leader election

Leader election in raftNode is driven by the logical clock, whose logic was introduced earlier. Let's follow the code to see the concrete implementation.

The leader sends heartbeats to the followers periodically, and each follower uses a timer to check the heartbeat interval; if the interval exceeds the configured time, it triggers a leader election.

timer trigger
 func (rc *raftNode) serveChannels() {    
    ticker := time.NewTicker(100 * time.Millisecond)
    defer ticker.Stop()
  ......
  for {
        select {
        case <-ticker.C:
            rc.node.Tick()
    ......
    }
  }
}

func (n *node) Tick() {
    select {
    case n.tickc <- struct{}{}:
    case <-n.done:
    default:
        n.rn.raft.logger.Warningf("%x A tick missed to fire. Node blocks too long!", n.rn.raft.id)
    }
}
Timer reception processing

The timer trigger is implemented in serveChannels , and then it will be given to the node.run method through tickc for processing

 func (n *node) run() {
  ......
  switch {
    ......
    case <-n.tickc:
            n.rn.Tick()
    ......
  }
}

func (rn *RawNode) Tick() {
    rn.raft.tick()
}

func (r *raft) tickElection() {
    r.electionElapsed++

  // check whether the heartbeat has timed out: every tick does electionElapsed++, and once electionElapsed >= the randomizedElectionTimeout chosen at raft initialization
  // the heartbeat is considered timed out and a leader election can be started
    if r.promotable() && r.pastElectionTimeout() {
        r.electionElapsed = 0
        if err := r.Step(pb.Message{From: r.id, Type: pb.MsgHup}); err != nil {
            r.logger.Debugf("error occurred during election: %v", err)
        }
    }
}

// heartbeat timeout check
func (r *raft) pastElectionTimeout() bool {
    return r.electionElapsed >= r.randomizedElectionTimeout
}

Every time the logical clock advances, the follower increments electionElapsed. When raft is initialized, a randomizedElectionTimeout is chosen; once the number of logical-clock advances reaches randomizedElectionTimeout, the leader-election process is triggered.

So when will electionElapsed be reset?

 func stepFollower(r *raft, m pb.Message) error {
    switch m.Type {
    ......
    case pb.MsgHeartbeat:
        r.electionElapsed = 0
        r.lead = m.From
        r.handleHeartbeat(m)
  ......
}

When the node starts, it will switch to the follower state. When the follower receives the leader's MsgHeartbeat, it will reset electionElapsed

Here, once electionElapsed >= randomizedElectionTimeout, i.e. the leader's heartbeat has timed out, the voting process for electing a leader is initiated.

Poll

For leader election, we will not consider the log handling that accompanies the voting messages here.

 func (r *raft) Step(m pb.Message) error {
    ......
    switch m.Type {
    case pb.MsgHup:
    // preVote 是个开关,用于选举前置判断,例如当前节点被网络分区等情况造成的异常,只有preVote通过后才会发起投票
    // preVote 和 vote逻辑差不多,就不多分析了
        if r.preVote {
            r.hup(campaignPreElection)
        } else {
            r.hup(campaignElection)
        }
  ......
  }
}


func (r *raft) hup(t CampaignType) {
  // 如果当前节点是leader,则忽略
    if r.state == StateLeader {
        r.logger.Debugf("%x ignoring MsgHup because already leader", r.id)
        return
    }
    // 判断当前节点是否可以晋升为leader
    if !r.promotable() {
        r.logger.Warningf("%x is unpromotable and can not campaign", r.id)
        return
    }
  // 获取未提交的raftLog
    ents, err := r.raftLog.slice(r.raftLog.applied+1, r.raftLog.committed+1, noLimit)
    if err != nil {
        r.logger.Panicf("unexpected error getting unapplied entries (%v)", err)
    }
    if n := numOfPendingConf(ents); n != 0 && r.raftLog.committed > r.raftLog.applied {
        r.logger.Warningf("%x cannot campaign at term %d since there are still %d pending configuration changes to apply", r.id, r.Term, n)
        return
    }

  // 开始竞选
    r.logger.Infof("%x is starting a new election at term %d", r.id, r.Term)
    r.campaign(t)
}
 func (r *raft) campaign(t CampaignType) {
    if !r.promotable() {
        // This path should not be hit (callers are supposed to check), but
        // better safe than sorry.
        r.logger.Warningf("%x is unpromotable; campaign() should have been called", r.id)
    }
    var term uint64
    var voteMsg pb.MessageType
  // 更新raft的状态
    if t == campaignPreElection {
        r.becomePreCandidate()
        voteMsg = pb.MsgPreVote
        // PreVote RPCs are sent for the next term before we've incremented r.Term.
        term = r.Term + 1
    } else {
        r.becomeCandidate()
        voteMsg = pb.MsgVote
        term = r.Term
    }
  // 给自身投票,并记录投票结果,判断是否能晋级
    if _, _, res := r.poll(r.id, voteRespMsgType(voteMsg), true); res == quorum.VoteWon {
        // We won the election after voting for ourselves (which must mean that
        // this is a single-node cluster). Advance to the next state.
    // 这里则是preVote成功了,则是发起真正的投票环节
        if t == campaignPreElection {
            r.campaign(campaignElection)
        } else {
      // 投票成功了,则晋升为leader
            r.becomeLeader()
        }
        return
    }
  
  // 投票结果未满足晋升结果,则开始周知其他节点进行投票
    var ids []uint64
    {

        idMap := r.prs.Voters.IDs()
        ids = make([]uint64, 0, len(idMap))
        for id := range idMap {
            ids = append(ids, id)
        }
        sort.Slice(ids, func(i, j int) bool { return ids[i] < ids[j] })
    }
    for _, id := range ids {
        if id == r.id {
            continue
        }
        r.logger.Infof("%x [logterm: %d, index: %d] sent %s request to %x at term %d",
            r.id, r.raftLog.lastTerm(), r.raftLog.lastIndex(), voteMsg, id, r.Term)

        var ctx []byte
        if t == campaignTransfer {
            ctx = []byte(t)
        }
    // 将投票的消息发送给其他节点,这里只是把消息存储起来,等待其他goroutine消费发送
        r.send(pb.Message{Term: term, To: id, Type: voteMsg, Index: r.raftLog.lastIndex(), LogTerm: r.raftLog.lastTerm(), Context: ctx})
    }
}

Note here: with preVote, the campaign type is campaignPreElection and the node becomes a PreCandidate; otherwise the campaign type is campaignElection and the node is promoted to Candidate. Let's see what these two roles do respectively.

 func (r *raft) becomeCandidate() {
    // TODO(xiangli) remove the panic when the raft implementation is stable
    if r.state == StateLeader {
        panic("invalid transition [leader -> candidate]")
    }
    r.step = stepCandidate
    r.reset(r.Term + 1)
    r.tick = r.tickElection
    r.Vote = r.id
    r.state = StateCandidate
    r.logger.Infof("%x became candidate at term %d", r.id, r.Term)
}

func (r *raft) becomePreCandidate() {
    // TODO(xiangli) remove the panic when the raft implementation is stable
    if r.state == StateLeader {
        panic("invalid transition [leader -> pre-candidate]")
    }
    // Becoming a pre-candidate changes our step functions and state,
    // but doesn't change anything else. In particular it does not increase
    // r.Term or change r.Vote.
    r.step = stepCandidate
    r.prs.ResetVotes()
    r.tick = r.tickElection
    r.lead = None
    r.state = StatePreCandidate
    r.logger.Infof("%x became pre-candidate at term %d", r.id, r.Term)
}

It can be seen that for a PreCandidate the Term does not change, while a Candidate increments Term by 1. This matches what we said earlier about the role of preVote: avoiding frequent elections caused by network partitions.

The poll function records the vote from a given node id and then tallies whether the overall vote has been decided.

 func (r *raft) poll(id uint64, t pb.MessageType, v bool) (granted int, rejected int, result quorum.VoteResult) {
    if v {
        r.logger.Infof("%x received %s from %x at term %d", r.id, t, id, r.Term)
    } else {
        r.logger.Infof("%x received %s rejection from %x at term %d", r.id, t, id, r.Term)
    }
    r.prs.RecordVote(id, v)
    granted, rejected, result = r.prs.TallyVotes()
    r.logger.Infof("%d check poll result, votes = %d, rejected = %d, voteResult = %d, record = %v, voters = %v",
        r.id, granted, rejected, result, r.prs.Votes, r.prs.Voters)
    return
}
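poll delegates the win/lose decision to the tracker and quorum packages; conceptually the rule is a simple majority. A simplified version of that decision (not the actual quorum implementation) looks like this:

 // tally returns the election outcome for granted/rejected votes out of n voters.
func tally(granted, rejected, voters int) string {
    quorum := voters/2 + 1
    switch {
    case granted >= quorum:
        return "VoteWon"
    case voters-rejected < quorum: // even with all remaining votes, no quorum
        return "VoteLost"
    default:
        return "VotePending" // keep waiting for more responses
    }
}

The send function below then only queues the outgoing vote messages: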
 func (r *raft) send(m pb.Message) {
    if m.From == None {
        m.From = r.id
    }
    ......
  // 把消息追加到slice里面,等待其他goroutine消费
    r.msgs = append(r.msgs, m)
}

The logic of the leader-election process is as follows:

  1. First check whether this node is already the leader and whether it can be promoted
  2. Check whether preVote is enabled; if so, go through the preVote process first, which is similar to the vote process
  3. Then update this node's state and information and promote it to Candidate
  4. Vote for itself and check whether that vote already wins the election; if it does, promote directly to leader and start the leader-related flow
  5. If the election is not yet decided, start sending vote requests to the other nodes
  6. The vote messages are only appended to the slice raft.msgs here, waiting for another goroutine to consume and send them

So at this point the voting process on this node is done, but the vote messages have not actually been sent yet and no results have been received. For that part of the logic we have to go back to raftNode.serveChannels and node.run.

Voting message sending
 func (n *node) run() {
    var propc chan msgWithResult
    var readyc chan Ready
    var advancec chan struct{}
    var rd Ready

    r := n.rn.raft

    lead := None

    for {
        if advancec != nil {
            readyc = nil
        } else if n.rn.HasReady() {
      // 判断是否有消息需要处理
            // Populate a Ready. Note that this Ready is not guaranteed to
            // actually be handled. We will arm readyc, but there's no guarantee
            // that we will actually send on it. It's possible that we will
            // service another channel instead, loop around, and then populate
            // the Ready again. We could instead force the previous Ready to be
            // handled first, but it's generally good to emit larger Readys plus
            // it simplifies testing (by emitting less frequently and more
            // predictably).
      // 获取msg和raft状态,组装成ready结构
            rd = n.rn.readyWithoutAccept()
            readyc = n.readyc
        }

        .......

        select {
        // TODO: maybe buffer the config propose if there exists one (the way
        // described in raft dissertation)
        // Currently it is dropped in Step silently.
        case pm := <-propc:
            m := pm.m
            m.From = r.id
            err := r.Step(m)
            if pm.result != nil {
                pm.result <- err
                close(pm.result)
            }
    // 接收到follower的投票结果
    // 这里是网络组件相关的处理,当节点收到其他的网络请求的时候,会通过recvc通道传递过来处理
        case m := <-n.recvc:
            // filter out response message from unknown From.
            if pr := r.prs.Progress[m.From]; pr != nil || !IsResponseMsg(m.Type) {
        // 重新进入Step函数处理,注意这里的m.Type == MsgVoteResp
                r.Step(m)
            }
        ......
    // 在这里,将上面组装好的ready结构体,通过readyc传给raftNode.serveChannels处理
        case readyc <- rd:
      // 更新raft状态和删除msgs
            n.rn.acceptReady(rd)
            advancec = n.advancec
        case <-advancec:
            n.rn.Advance(rd)
            rd = Ready{}
            advancec = nil
        case c := <-n.status:
            c <- getStatus(r)
        case <-n.stop:
            close(n.done)
            return
        }
    }
}
Processing of voting results
 func (r *raft) Step(m pb.Message) error {
    ......

    switch m.Type {
    .....

    default:
        err := r.step(r, m)
        if err != nil {
            return err
        }
    }
    return nil
}

Here we re-enter the Step function to process the message, but this time the received message type is MsgVoteResp, so the default branch is executed directly.

Meanwhile, this node has already started the election and been promoted to Candidate, so r.step points to the stepCandidate function.

 func stepCandidate(r *raft, m pb.Message) error {
    ......
    case myVoteRespType:
    // 记录follower节点的投票结果,并计算投票的结果(输或赢)
        gr, rj, res := r.poll(m.From, m.Type, !m.Reject)
        r.logger.Infof("%x has received %d %s votes and %d vote rejections", r.id, gr, m.Type, rj)
        switch res {
    // 赢得选举,则晋升为leader节点
        case quorum.VoteWon:
            if r.state == StatePreCandidate {
                r.campaign(campaignElection)
            } else {
                r.becomeLeader()
                r.bcastAppend()
            }
    // 输掉选择,则降级为follower
        case quorum.VoteLost:
            // pb.MsgPreVoteResp contains future term of pre-candidate
            // m.Term > r.Term; reuse r.Term
            r.becomeFollower(r.Term, None)
    // 其余情况下,就是票数还不足与判断选举结果,则继续等待其他节点的投票
        }
    ......
    return nil
}

stepCandidate records the vote passed back by a follower and re-tallies the result of the election. If more than half of the votes have been obtained, the node is promoted to leader and notifies every node; if it loses the election, it is downgraded to follower; in the remaining cases the election is still undecided because some nodes have not voted yet, so it keeps waiting for their votes.

Summarize
  1. There will be a timer in raft, which will be triggered regularly to promote the operation of the main selection logic.
  2. Each time the timer fires, electionElapsed is incremented. When electionElapsed reaches the randomizedElectionTimeout created during initialization, the leader's heartbeat is considered to have timed out and the election logic is triggered.
  3. The follower checks whether preVote is set to true; if so, the preVote logic runs first. This avoids frequent elections from nodes that a network partition has left with fewer than 1/2 of the cluster.
  4. The follower node sets itself as Candidate and starts voting for itself
  5. Determine if the voting result is enough to win the election, if so, promote to leader and issue a notification
  6. When the voting result is not enough to win the election, distribute the voting task to other nodes
  7. The voting results of other nodes are passed to raftNode through the node.recvc channel, and processed in the run function
  8. After receiving the voting results of other nodes, re-execute the voting judgment to know whether to win the election or lose the election

log synchronization

data processing request

Data processing goes through httpKVAPI: a PUT request from the user creates or modifies data.

 func (h *httpKVAPI) ServeHTTP(w http.ResponseWriter, r *http.Request) {
    key := r.RequestURI
    defer r.Body.Close()
    switch r.Method {
    case http.MethodPut:
        v, err := io.ReadAll(r.Body)
        if err != nil {
            log.Printf("Failed to read on PUT (%v)\n", err)
            http.Error(w, "Failed on PUT", http.StatusBadRequest)
            return
        }

        h.store.Propose(key, string(v))

        // Optimistic-- no waiting for ack from raft. Value is not yet
        // committed so a subsequent GET on the key may return old value
        w.WriteHeader(http.StatusNoContent)
......
}
 func (s *kvstore) Propose(k string, v string) {
    var buf bytes.Buffer
    if err := gob.NewEncoder(&buf).Encode(kv{k, v}); err != nil {
        log.Fatal(err)
    }
    s.proposeC <- buf.String()
}

Requests for creating or modifying data are not directly stored in kvstore, but are transferred to raftNode through proposeC for processing

raftNode receives data
 func (rc *raftNode) serveChannels() {
  ......
    // send proposals over raft
    go func() {
        confChangeCount := uint64(0)

        for rc.proposeC != nil && rc.confChangeC != nil {
            select {
      // 接收kvstore传递过来的数据
            case prop, ok := <-rc.proposeC:
                if !ok {
                    rc.proposeC = nil
                } else {
                    // blocks until accepted by raft state machine
                    rc.node.Propose(context.TODO(), []byte(prop))
                }
      ......
            }
        }
        // client closed channel; shutdown raft if not already
        close(rc.stopc)
    }()
 func (n *node) Propose(ctx context.Context, data []byte) error {
   return n.stepWait(ctx, pb.Message{Type: pb.MsgProp, Entries: []pb.Entry{{Data: data}}})
}

func (n *node) stepWait(ctx context.Context, m pb.Message) error {
    return n.stepWithWaitOption(ctx, m, true)
}

func (n *node) stepWithWaitOption(ctx context.Context, m pb.Message, wait bool) error {
    ......
    ch := n.propc
    pm := msgWithResult{m: m}
    if wait {
        pm.result = make(chan error, 1)
    }
    select {
  // 将数据封装好后,传给node.propc这个通道,交给node.run方法处理
    case ch <- pm:
    // 这里是true,也就是不return
        if !wait {
            return nil
        }
    case <-ctx.Done():
        return ctx.Err()
    case <-n.done:
        return ErrStopped
    }
  // block直到有结果
    select {
    case err := <-pm.result:
        if err != nil {
            return err
        }
    case <-ctx.Done():
        return ctx.Err()
    case <-n.done:
        return ErrStopped
    }
    return nil
}

raftNode.serveChannels starts a goroutine that loops waiting for data messages; when a message arrives it is wrapped up and passed to node.run for processing, and the caller blocks until node.run returns the processing result.

Next, look at the processing of node.run

 func (n *node) run() {
    var propc chan msgWithResult
    var readyc chan Ready
    var advancec chan struct{}
    var rd Ready

    r := n.rn.raft

    lead := None

    for {
        ......

        select {
        // TODO: maybe buffer the config propose if there exists one (the way
        // described in raft dissertation)
        // Currently it is dropped in Step silently.
    // 这里就跟上面的选主一样,选主也是一个消息,日志同步也是一个消息,所以都走到了这里
        case pm := <-propc:
            m := pm.m
            m.From = r.id
            err := r.Step(m)
      // 处理完成后,将结果通过通道传递给上面,释放前面的block逻辑
            if pm.result != nil {
                pm.result <- err
                close(pm.result)
            }
        ......
        }
    }
}
 func (r *raft) Step(m pb.Message) error {
    // Handle the message term, which may result in our stepping down to a follower.
    ......

    default:
    // 调用自身注册的step方法来处理
        err := r.step(r, m)
        if err != nil {
            return err
        }
    }
    return nil
}

Here, r.step() will be called to process, and the step method has different implementations for different roles

follower corresponds to stepFollower

leader corresponds to stepLeader

Let's first look at stepFollower

 func stepFollower(r *raft, m pb.Message) error {
    switch m.Type {
    case pb.MsgProp:
        if r.lead == None {
            r.logger.Infof("%x no leader at term %d; dropping proposal", r.id, r.Term)
            return ErrProposalDropped
        } else if r.disableProposalForwarding {
            r.logger.Infof("%x not forwarding to leader %x at term %d; dropping proposal", r.id, r.lead, r.Term)
            return ErrProposalDropped
        }
        m.To = r.lead
    // 将消息转发给leader
        r.send(m)
    ......
    return nil
}

According to the logic of stepFollower , it can be seen that when the follower gets the message of creating and modifying data, it directly forwards the request to the leader for processing

Next, let's look at stepLeader, the leader's handler when it receives a message that creates or modifies data.

 func stepLeader(r *raft, m pb.Message) error {
    ......
    case pb.MsgProp:
        if len(m.Entries) == 0 {
            r.logger.Pa
