【分布式系统设计简卷（0）】MapReduce

前言

本篇是关于 2022-MIT 6.824 的关于 MapReduce 的课程笔记及其实验记录；
课堂笔记部分基本摘自 Lecture，完全可以跳过；
如果发现内容上的纰漏，请不要吝啬您的键盘。

正文

课堂笔记

You should try everything else, before you try to build a distributed system, because they're not simpler.

Lab's Goal?

Lab 1: MapReduce
- implement your own version of the paper
Lab 2: replication for fault-tolerance using Raft
Lab 3: fault-tolerant key/value store
- use your Raft implementation to build a replicated K/V server
Lab 4: sharded key/value store
- clone the K/V server into a number of independent groups and split the data in your K/V storage system to get parallel speed-up and moving chunks of data between servers as they come and go without dropping any balls

Why do people build distributed systems?

to increase capacity via parallelism
- to get high-performance, lots of CPUs, memories, disk arms moving in parallel...
to tolerate faults via replication
- have two computers do the exact same things and if one of them fails you can cut over to the other one
to place computing physically close to external entities
- 美国西海岸和东海岸的两个服务器之间的通信问题
to achieve security via isolation

Challenges:

many concurrent parts, complex interactions
must cope with partial failure
tricky to realize performance potential

Big Goal?

This is a course about infrastructure for applications.

Storage system.
Communication sytem.
Computation system.

我们要尽可能地做 abstractions，使这些部分像是一个 non-distributed sytem 的部分。比如将 Storage system 抽象成熟悉的 file sytem，但背后实际上还是很复杂的 parrallel, fault-tolerated distributed system。因为这很难做到，所以这是我们的 big goal。

Topic:

implementation.
- here are three tools you'll need to build such a distributed system.
- RPC(remote procedure call), whose goal is to mask the fact that we're communicating over an unreliable network
- threads
- concurrenty control
performance.
- the high-level goal of building a distributed sytem is to get what people call scalable speed-up.
- scalability: 2x computers -> 2x throughput
fault tolerance
- 1000s of servers, big network -> always something broken
- We'd like to hide these failures from the application.(abstraction)
- Availability -- app can make progress despite failures
  - 有挂掉的也能继续正常工作
- Recoverability -- app will come back to life when failures are repaired
  - 修护后也能继续正常工作
  - non-volatile storage（但严重影响性能，通常不会选择）
  - Replication
consistency
- 假设当前 distributed system 提供的是 K/V Service
- 操作只有两种：Put(k, v) and Get(k) return v
- 若在 non-distributed sytem 中，这两种操作都只有一种语义；
- 但在 distributed system，由于通常会有 replication 或 caching 的存在，数据(kv pair) 会有多个拷贝。当你在更新这些拷贝的途中出现 failure 了，如果保证数据的 consistency 将会是件很重要的事。
- Strong consistency 可以使你访问到 the most recently updated data，但通讯的开销太大；
- Weak consistency 尽力而为，但通常能 acceptable，取决于你的 trade-off。

实验部分

实验目标

Lab 1 的目标是把 MapReduce 这篇 2004 年 Google 的经典论文的内容简单复现一遍（玩具），并通过测试。

MapReduce 设计了一个框架，这个框架能够屏蔽分布式系统内部的复杂细节，用户只需要编写 map function 和 reduce function 就可以享受分布式并行计算带来的高性能。完成着重阅读论文第 2 节 Programming Model 和第 3 节 Implementation 的部分。这两个部分直接关系到实验的实现。

实验技巧

asynchronous programming == event-driven programming（异步编程 == 事件驱动编程）？

single thread of control, sits an event-driven loop, waits for inputs, whenever it gets an input like a packet, it figures out which client did this packet come from, and it'll had a table of sort of what the state is of whatever activity it's managing for that client. it'll find I was in the middle of reading such-and-such a file, and now it's asked me to read the next block I'll go and be the next block I'll return it.

When to use sharing and locks, versus channels?

Most problems can be solved in either style
- state -- sharing and locks
- communication -- channels
For the 6.824 labs, I recommend sharing+locks for state, and sync.Cond or channels or time.Sleep() for waiting/notification.

实验过程

配置好虚拟机，安装最新版本的 Golang，安装一个 Goland，用 Git 把实验代码仓库克隆下来，实验环境就没了。这相比于之前 6.S081 去安装 qemu 简单太多。试过用 VScode 写 Golang，但发现安装了语法检查和自动补全的组件后直接代码编辑区直接爆红，暂时找不到解决方案就直接上手 Goland 了。

先前没有接触过 Go，所以遵循实验指导书的提示把 Go 官网的 Tutorial 过了一遍，效果真的很不错。刷完后至少能把实验的源码看得懂了。

关于实验的实现，其实仅根据论文的描述，尤其是论文里的那张彩图，对照着做就好了。亲测仅依赖论文和实验指导书的内容就可以独立完成实验 1，自己的实现也就 500 行左右的代码，况且难度还只是 moderate，好日子还在后头呢。

据说这门课直接贴具体代码是违规的，所以这里就只贴三个文件的 abstract。

src/mr/coordinator.go：

src/mr/worker.c：

src/mr/rpc.c：

实验踩坑

判断哪些是共享变量
- 有些线程安全问题 -race option 都有可能检测不出来，因此可能会出现数据不一致的问题。比如我原来在 createWorkerId() 时，“workerId 的分配”和“将 worker 结构体添加到 workers 切片内”的操作不是绑定的的，因此后期可能会出现数据的不一致问题。
慎用 append
- 将一个元素 append 一个切片后，当长度超过容量时，Go 会将切片中的所有元素拷贝到另一个更大的空闲区域。在这之前若有线程持有在切片某个元素的指针时，这个指针会指向原来的切片的位置，造成数据的不一致性。我的解决方法比较粗暴，初始化切片时直接给它分配 1000 的长度，再也不用担心切片扩容了。

后记

个人理解的一个分布式系统开发周期，要分三个阶段，各阶段不是严格串行执行的，每个阶段内可做数次迭代：

架构：做正确的事，系统分析与设计（预计花费 60% 的时间）

阅读相关材料、分析，理解；
画状态时序图，设计数据结构和算法。

稳定：正确地做事，测试驱动开发（预计花费 40% 的时间）

先保证一个用例运行起来程序不报错；
再保证一个用例能够可以输出正确的结果；
最后保证所有用例都能输出正确的结果，而且无论何时都能输出正确的结果；
迭代式地添加功能，按核心重要程度依次添加，每添加一个完整的功能就要做一次回归测试；
当资源充足并可行时，为每个功能模块做一次单元测试。
进行必要的重构，提高可维护性，实现新功能。

性能：精益求精，提升客户体验（规格外，无法预估）

降低资源使用门槛，压榨硬件性能，直到达到架构设限的天花板

最后说一下 Coding 的一些习惯：

先不考虑代码风格和质量等这些高级话题，就说最常见的 Debug。写好程序后跑测试挂掉了，然后翻日志不断进行比对，然后经过不知道多长时间后终于发现了数据不一致或不正确的原因。且不说这个过程有多么地浪费时间且低效，它还会对你的眼睛和颈椎造成伤害。

所以于其“浪费时间” Debug，要保证一开始写的时候就要考虑各种各样的 corner case，多考虑考虑这么写会有什么副作用，就算短时间实现不了也要留个注释备注一下。就算真出现 Bug 了也要尽快缩短排错时间，Error 信息要规范，要完备。我知道这听起来像是废话，但我的意思 Coding 的时候要集中注意力，带耳机听歌什么的还是不要了，不然你的那一点“马虎”所付出的代价可能是极其痛苦的，特别是在这种分布式系统场景下。对自己的代码负责的同时，更要对自己的时间负责。

一句话概括：有意识地缩短 Debug 的时间，不管是从源头上还是在过程上。

【分布式系统设计简卷（0）】MapReduce

前言

正文

课堂笔记

实验部分

实验目标

实验技巧

实验过程

实验踩坑

后记

Thinkingear

引用和评论

【分布式系统设计简卷（2）】KV Raft

Paxos 协议三阶段解密：原理剖析与 Java 实现

Dolphinscheduler 之 MR任务