MIT6.824-Lab2C

Lab overview

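The lab code builds on the refactored code; see Lab2B.

In 2C a node can crash at any time, so we use the Persister to persist some state: log[], currentTerm, and votedFor. This is mostly a matter of completing the readPersist() and persist() methods, which the code comments already walk through. Make() restores the pre-crash state immediately, and the node then starts as a FOLLOWER.
2C also simulates an unreliable network. Both RPC callers and RPC handlers must account for RPCs being reordered, lost, duplicated, or delayed, so the handling logic from 2B needs further hardening.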

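Persistence

raft/persister.go: serialization uses labgob, 6.824's wrapper around gob. Follow the comments to complete persist() and readPersist() in raft.go.
When should state be persisted? Whenever currentTerm, votedFor, or log changes.
rf.log must be persisted in its entirety.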
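
The round trip described above can be sketched as follows. This is a minimal illustration, not the lab's code: it uses the standard encoding/gob package in place of labgob so that it runs on its own, and the names LogEntry, persistentState, encodeState, and decodeState are hypothetical. The Command field is an int here for simplicity.

```go
package main

import (
	"bytes"
	"encoding/gob"
	"fmt"
)

// LogEntry is a hypothetical entry type; Command is an int only to keep the sketch small.
type LogEntry struct {
	Term    int
	Command int
}

// persistentState groups the three fields that must survive a crash.
type persistentState struct {
	CurrentTerm int
	VotedFor    int
	Log         []LogEntry
}

// encodeState serializes the whole state, including the entire log,
// not just the prefix up to commitIndex.
func encodeState(s persistentState) ([]byte, error) {
	var buf bytes.Buffer
	if err := gob.NewEncoder(&buf).Encode(s); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// decodeState restores the state. Empty data means a first boot,
// so it returns a fresh state with no vote cast (-1 is an assumed sentinel).
func decodeState(data []byte) (persistentState, error) {
	var s persistentState
	if len(data) == 0 {
		s.VotedFor = -1
		return s, nil
	}
	if err := gob.NewDecoder(bytes.NewReader(data)).Decode(&s); err != nil {
		return persistentState{}, err
	}
	return s, nil
}

func main() {
	before := persistentState{
		CurrentTerm: 3,
		VotedFor:    1,
		Log:         []LogEntry{{Term: 1, Command: 10}, {Term: 3, Command: 20}},
	}
	data, err := encodeState(before)
	if err != nil {
		panic(err)
	}
	after, err := decodeState(data)
	if err != nil {
		panic(err)
	}
	fmt.Printf("restored term=%d votedFor=%d entries=%d\n",
		after.CurrentTerm, after.VotedFor, len(after.Log))
}
```

In the lab itself, persist() would hand the encoded bytes to the Persister and readPersist() would decode them inside Make(); the sketch only shows the encode and decode halves.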

Problems encountered

TestFigure8()

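The third test, TestFigure82C(), exercises the scenario in Figure 8 of the short paper (Figure 3.7 of the dissertation), namely that a Leader may only commit log entries from its current Term.

With some probability (<20%) it fails with this error: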

// config.go, line 180
// some server has already committed a different value for this entry!
err_msg = fmt.Sprintf("commit index=%v server=%v %v != server=%v %v",
    m.CommandIndex, i, m.Command, j, old)

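Reviewing the code showed that only rf.log[:rf.commitIndex+1] was being persisted; I had not fully understood what the paper requires. Persisting the entire log fixed it.

Why is persisting only up to rf.commitIndex not enough? Suppose an entry has been replicated to a majority but is not yet committed, and a node crashes. After recovery, every entry past rf.commitIndex is gone from that node, yet those entries may well be committed by a leader later.

Raft guarantees that a committed entry is durable and will eventually be executed by the state machine.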

TestUnreliableFigure8()

TestFigure8Unreliable2C() simulates an unreliable network, so RPC handlers need to watch out for stale requests and replies.
This is what I saw while testing:

tsujo@masterTsujo[23:14:55]:~/mycode/mit_6824/src/raft$ grep "apply error" ./test_res/* | grep -n ""
1:./test_res/figure8_10.txt:2020/01/18 22:14:40 apply error: commit index=95 server=2 354 != server=0 324
2:./test_res/figure8_11.txt:2020/01/18 22:15:18 apply error: commit index=32 server=4 517 != server=2 210
3:./test_res/figure8_19.txt:2020/01/18 22:20:32 apply error: commit index=83 server=0 376 != server=3 346
4:./test_res/figure8_23.txt:2020/01/18 22:23:07 apply error: commit index=300 server=1 669 != server=3 542
5:./test_res/figure8_26.txt:2020/01/18 22:24:56 apply error: commit index=179 server=0 528 != server=3 482
6:./test_res/figure8_28.txt:2020/01/18 22:26:04 apply error: commit index=160 server=3 692 != server=1 673
7:./test_res/figure8_31.txt:2020/01/18 22:27:59 apply error: commit index=70 server=4 288 != server=3 207
8:./test_res/figure8_41.txt:2020/01/18 22:33:51 apply error: commit index=74 server=3 753 != server=2 618
9:./test_res/figure8_47.txt:2020/01/18 22:37:18 apply error: commit index=27 server=2 463 != server=0 310
10:./test_res/figure8_4.txt:2020/01/18 22:11:01 apply error: commit index=24 server=2 609 != server=4 557
11:./test_res/figure8_72.txt:2020/01/18 22:52:21 apply error: commit index=119 server=3 307 != server=0 228
12:./test_res/figure8_74.txt:2020/01/18 22:53:38 apply error: commit index=115 server=0 700 != server=3 411
13:./test_res/figure8_78.txt:2020/01/18 22:55:39 apply error: commit index=173 server=3 633 != server=1 595
14:./test_res/figure8_8.txt:2020/01/18 22:13:27 apply error: commit index=329 server=3 939 != server=4 778
15:./test_res/figure8_99.txt:2020/01/18 23:08:16 apply error: commit index=81 server=2 599 != server=4 288

tsujo@masterTsujo[23:15:02]:~/mycode/mit_6824/src/raft$ grep "one(" ./test_res/* | grep -n ""
1:./test_res/figure8_13.txt:    config.go:480: one(23336666) failed to reach agreement, expected index = 134  
2:./test_res/figure8_59.txt:    config.go:480: one(23336666) failed to reach agreement, expected index = 117  

tsujo@masterTsujo[23:15:15]:~/mycode/mit_6824/src/raft$

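Two kinds of errors show up. The first one had also appeared in TestFigure82C(), but when I went back and reran TestFigure82C() it ran cleanly with no errors. The likely cause is that the RPC handlers do not correctly handle requests that have been through network anomalies.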

What does apply error mean

How the test code works: once a node has sent an entry on applyCh, then at that Index every other node has either not sent anything yet or has sent the same entry.
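
The invariant above can be sketched as a small checker. This is a simplified, hypothetical model of the idea and not the code in config.go: the type applyChecker and its record method are made up for illustration, and commands are plain ints.

```go
package main

import "fmt"

// applyChecker remembers, per server, which command was applied at each index.
type applyChecker struct {
	applied []map[int]int // applied[server][index] = command
}

func newApplyChecker(servers int) *applyChecker {
	c := &applyChecker{applied: make([]map[int]int, servers)}
	for i := range c.applied {
		c.applied[i] = make(map[int]int)
	}
	return c
}

// record notes that server applied command at index. It returns an error
// if any server has already applied a different command at that index.
func (c *applyChecker) record(server, index, command int) error {
	for other, log := range c.applied {
		if old, ok := log[index]; ok && old != command {
			return fmt.Errorf("commit index=%v server=%v %v != server=%v %v",
				index, server, command, other, old)
		}
	}
	c.applied[server][index] = command
	return nil
}

func main() {
	c := newApplyChecker(3)
	fmt.Println(c.record(0, 1, 100)) // first apply at index 1
	fmt.Println(c.record(1, 1, 100)) // same command, still consistent
	fmt.Println(c.record(2, 1, 200)) // different command at the same index
}
```

The third call reports a conflict, which is the situation the apply error message describes.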

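An apply error means that node s1 committed entry A first and another node s2 then committed a different entry B at the same index. What could cause this? After printing rf.matchIndex on every call to countingReplicas(), I suspected a bug in how matchIndex is updated, but could not pin down the exact cause.

The Student's guide mentions exactly this example of updating rf.matchIndex: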

// My original approach
// Unsafe: on an unreliable network, rf.nextIndex may have been modified by another RPC in the meantime
rf.matchIndex[server] = rf.nextIndex[server] - 1

// Correct approach
rf.matchIndex[server] = prevIndex + len(args.Entries)

After this change the apply error no longer appeared.
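
To show the fix in context, here is a minimal sketch of a reply handler built around that update. It rests on assumptions that are not in the original code: the types appendArgs, appendReply, and leader are hypothetical, entries are plain ints, and the step-down on a higher term and the nextIndex back-off on failure are left out. The check that matchIndex only moves forward is an extra guard against delayed replies, not something the post states.

```go
package main

import "fmt"

// appendArgs and appendReply are stripped-down, hypothetical RPC types.
type appendArgs struct {
	Term      int
	PrevIndex int
	Entries   []int
}

type appendReply struct {
	Term    int
	Success bool
}

type leader struct {
	currentTerm int
	nextIndex   []int
	matchIndex  []int
}

// handleReply processes one AppendEntries reply and reports whether it was used.
func (l *leader) handleReply(server int, args appendArgs, reply appendReply) bool {
	// Ignore replies to requests sent in an older term, and replies from a
	// newer term (a real leader would step down there; omitted in this sketch).
	if args.Term != l.currentTerm || reply.Term > l.currentTerm {
		return false
	}
	if !reply.Success {
		return false // nextIndex back-off omitted in this sketch
	}
	// Derive the match point from the request that was actually sent,
	// not from nextIndex, which other replies may have changed since.
	match := args.PrevIndex + len(args.Entries)
	if match > l.matchIndex[server] {
		l.matchIndex[server] = match
		l.nextIndex[server] = match + 1
	}
	return true
}

func main() {
	l := &leader{currentTerm: 2, nextIndex: []int{1, 1, 1}, matchIndex: []int{0, 0, 0}}
	ok := appendReply{Term: 2, Success: true}

	// A reply for a three-entry request arrives first.
	l.handleReply(1, appendArgs{Term: 2, PrevIndex: 0, Entries: []int{7, 8, 9}}, ok)
	// A delayed reply for an earlier one-entry request arrives afterwards.
	l.handleReply(1, appendArgs{Term: 2, PrevIndex: 0, Entries: []int{7}}, ok)

	fmt.Println(l.matchIndex[1], l.nextIndex[1])
}
```

Because each reply is interpreted against its own request, the delayed reply cannot move matchIndex backwards or push it past what the follower actually acknowledged.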

fail to reach agreement

This may be caused by split votes going on for too long.
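
One common way to shorten split votes is to randomize each node's election timeout so that candidates rarely time out together. The sketch below only illustrates that idea; the 300 to 600 ms range and the names randomElectionTimeout and newSeededRand are assumptions, not values or functions from the original code.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// Assumed bounds for illustration; tune them against the heartbeat interval.
const (
	electionTimeoutMin = 300 * time.Millisecond
	electionTimeoutMax = 600 * time.Millisecond
)

// newSeededRand returns a private random source so that each node can be
// seeded differently (for example from its id and the current time).
func newSeededRand(seed int64) *rand.Rand {
	return rand.New(rand.NewSource(seed))
}

// randomElectionTimeout picks a duration in [electionTimeoutMin, electionTimeoutMax).
// It should be re-drawn every time the election timer is reset.
func randomElectionTimeout(r *rand.Rand) time.Duration {
	span := int64(electionTimeoutMax - electionTimeoutMin)
	return electionTimeoutMin + time.Duration(r.Int63n(span))
}

func main() {
	r := newSeededRand(time.Now().UnixNano())
	for i := 0; i < 3; i++ {
		fmt.Println(randomElectionTimeout(r))
	}
}
```

A wider range makes simultaneous candidacies less likely at the cost of slower failover, so the bounds are a trade-off rather than fixed constants.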

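Or countingReplicas() may be too slow. My current algorithm increments rf.commitIndex by 1 each time, with complexity O(N*M), where N is the range of Terms and M is the number of nodes in the cluster.

In TestFigure8Unreliable2C() the log has around 1000 entries, and a recovered node starts with rf.commitIndex at 0, which is when this algorithm gets expensive. Another approach is to sort rf.matchIndex (excluding the leader itself) and take the value at the middle node, but my current implementation of it causes apply error.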
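
For reference, here is one way the sorting idea could look. This is a sketch under stated assumptions and not the implementation mentioned above: unlike that variant, matchIndex here includes the leader's own last log index, logTerms is a hypothetical slice holding the term of each entry (index 0 is a dummy), and the names majorityMatch and advanceCommit are made up. It also applies the Figure 8 rule by refusing to commit an entry from an earlier term.

```go
package main

import (
	"fmt"
	"sort"
)

// majorityMatch returns the largest index stored on a majority of servers.
// matchIndex must contain one value per server, the leader included.
func majorityMatch(matchIndex []int) int {
	sorted := append([]int(nil), matchIndex...)
	sort.Ints(sorted)
	// In ascending order, the element at (n-1)/2 is less than or equal to
	// a majority of the values, so a majority has replicated up to it.
	return sorted[(len(sorted)-1)/2]
}

// advanceCommit returns the new commit index. It only moves forward, and only
// to an entry from currentTerm, so older-term entries are committed indirectly.
func advanceCommit(commitIndex, currentTerm int, matchIndex, logTerms []int) int {
	n := majorityMatch(matchIndex)
	if n > commitIndex && n < len(logTerms) && logTerms[n] == currentTerm {
		return n
	}
	return commitIndex
}

func main() {
	matchIndex := []int{5, 3, 4, 1, 5}  // leader is server 0 with last index 5
	logTerms := []int{0, 1, 1, 2, 2, 3} // term of each entry, index 0 unused

	// Majority index is 4, whose term is 2.
	fmt.Println(advanceCommit(1, 2, matchIndex, logTerms)) // leader in term 2
	fmt.Println(advanceCommit(1, 3, matchIndex, logTerms)) // leader in term 3
}
```

This replaces the one-step-at-a-time scan with a single sort of M values per call. Since terms never decrease along the log, if the majority index holds an older-term entry then no lower index can hold a current-term entry, so there is nothing to scan downward for.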

Both points could be improved further.

Conclusion

Over 90 consecutive test runs, Lab2C produced no apply error and 1 fail to reach agreement.

That problem has not been fixed yet.

Tsukami
