分布式系统 - Ceph的Paxos源码注释 - Phase 1 - 存储老小伙的专栏

欢迎关注存储老小伙的博客。

上篇是Leader 选举部分。这篇是Ceph的Paxos协议的Phase1(Prepare)，其目的是就PN达成一致。

1. 几个要点说明

1.1 Epoch

每次选举产生新的leader，也会产生新的epoch。不选举则不会修改epoch。
一个leader当选期间，发送的所有消息，都会带有这个epoch。
如果由于网络分割等现象，有新的选举发生，则根据epoch就发现leader已经变了。
注意，按照paxos论文描述，没有Leader也是可以正常运行的，只是可能降低效率。
没有leader则不需要epoch

1.2 PN (Proposal Number)

Leader当选后，会首先执行一次phase 1过程，以确定PN。在其为leader期间，
所有的phase 2操作都共用一个PN。所以省略了大量的phase 1操作，这也是
paxos能够减小网络开销的原因。 "Paxos made simple"文中说：
"A newly chosen leader executes phase 1 for infinitely many
instances of the consensus algorithm"。
PN是必须的，无论是否有leader，都必须有PN

1.3 Version

可以理解成Paxos 的instance ID，或者raft的logID。

1.4 持久化

对比Raft，虽然ceph的复制也可以看成一个个log的追加，
但是所有信息都写在k/v中，而不是写log文件, 比如，instanceID为X的log，
在k/v存储中，其key是X，value是log内容。
其他各种需要持久化的值，都写在k/v存储中。

1.5其他需要持久化的数据结构

除了log以外，每个paxos成员，都维护以下几个需要持久化的变量。
大家可以跟raft的paper做些简单对比。

名称	含义	其他
last_pn	上次当选leader后生成的PN	get_new_proposal_number()使用，下次当选后，接着生成
accepted_pn	我接受过的PN，可能是别的leader提议的PN	peon根据这个值拒绝较小的PN
first_committed	本节点记录的第一个被commit的版本	更早的版本(日志)，本节点没有了
last_committed	本节点记录的最后一次被commit的版本	往后的版本，未被commit，可能有一个
uncommitted_v	本节点记录的未commit的版本，如果有，只能等于last_commit+1	ceph只允许有一个未commit的版本
uncommitted_pn	未commit的版本对应的PN	与uncommitted_v，uncommitted_value在一个事务中记录
uncommitted_value	为commit的版本的内容	见上面

注意，上述三个"uncommitted"开头的值，可能压根就不存在，比如正常关机，全部都commit了。

1.6 Phase 1交互过程简介

Phase 1就是 paxos协议的Propose阶段，包括三个步骤，如下表：

步骤	Leader	Peons	备注
1	collect() =>		Leader给quorum中各个peon发送PN以及其他附带信息
2		<=handle_collect()	Peon同意或者拒绝PN。并中间可能分享已经commit的数据
3	handle_last()		Quorum中peon全部同意leader的PN，才算成功

2. 代码

2.1 初始化

void Paxos::init()
{ //几个持久化的变量，加载时即从从kv读出。
  // load paxos variables from stable storage
  //上次产生的PN
  last_pn = get_store()->get(get_name(), "last_pn");
  //上次接受的pn
  accepted_pn = get_store()->get(get_name(), "accepted_pn");
  //最近或最后一个被commit的verion，实际上是paxos 的instance ID。
  last_committed = get_store()->get(get_name(), "last_committed");
  //保存的最早被commit的版本(log)。更早的log可能已经被truncate掉了
  first_committed = get_store()->get(get_name(), "first_committed");

  //paxos的 first_committed，并不是某个monitor的first_committed，各个monitor
  //对应值可能都是不一样的。
  assert(is_consistent());
}

2.2 Leader发起的collect

// PHASE 1:  collect和handle_collect基本能对应paxos的phase 1
//这是leader的当选后执行函数，用于确定新的PN。
//collect过程，相当于完成当选期间所有提议的phase 1。
//在其当选期间，会一直使用这个PN
void Paxos::collect(version_t oldpn)
{
 // we're recoverying, it seems!
  state = STATE_RECOVERING;
  assert(mon->is_leader());

  // reset the number of lasts received
  uncommitted_v = 0; //新当选，初始化
  uncommitted_pn = 0;
  uncommitted_value.clear();
  peer_first_committed.clear();
  peer_last_committed.clear();
  
  //ceph的实现中，只允许有一个proposal处于pending状态(跟raft相同)。
  //如果新leader当选后发现有pending的提议，那么其instanceID/version
  //只能是last_committed+1
  if (get_store()->exists(get_name(), last_committed+1)) {
    /*pending_v, pending_pn和last_committed+1是一个事务写的。
     所以一起检查 */
    version_t v = get_store()->get(get_name(), "pending_v");  
    version_t pn = get_store()->get(get_name(), "pending_pn"); 
    if (v && pn && v == last_committed + 1){//这个是正常分支
      uncommitted_pn = pn;
    } else {
      dout(10) << "WARNING: no pending_pn on disk, using previous accepted_pn " << accepted_pn  << " and crossing our fingers" << dendl;
      uncommitted_pn = accepted_pn;
    }
    uncommitted_v = last_committed+1;
    //找到uncommitted_v (这个key)对应的value
    get_store()->get(get_name(), last_committed+1, uncommitted_value);
    //uncommitted_v存在，要求uncommitted_value必须存在。    
    assert(uncommitted_value.length()); 
    logger->inc(l_paxos_collect_uncommitted);
  }

  //生成一个新的更大的PN，并自己先accept
  accepted_pn = get_new_proposal_number(MAX(accepted_pn, oldpn));
  accepted_pn_from = last_committed;
  num_last = 1;//1, 表示自己已经投票了

  //给quorum中各个成员发送
  for (set<int>::const_iterator p = mon->get_quorum().begin();
       p != mon->get_quorum().end();
       ++p) {
    //跳过自己，已经算投过了并修改了accepted_pn       
    if (*p == mon->rank) continue; 

    //epoch的用意: 如果网络分割，别人又发起了选举，现任leader不知道，接收方会发现epoch不对    
    MMonPaxos *collect = new MMonPaxos(mon->get_epoch(), MMonPaxos::OP_COLLECT,

    collect->last_committed = last_committed; //用来与peer比较的
    collect->first_committed = first_committed;
    //这个操作本身带的PN是刚生成的。
    collect->pn = accepted_pn; 
    mon->messenger->send_message(collect, mon->monmap->get_inst(*p));
  }

  //设置超时处理
  collect_timeout_event = new C_CollectTimeout(this);
  mon->timer.add_event_after(g_conf->mon_accept_timeout, collect_timeout_event);
}

2.3 Peon处理collect请求

//Peon，可以对应Raft的follower
void Paxos::handle_collect(MonOpRequestRef op)
{
  op->mark_paxos_event("handle_collect");

  MMonPaxos *collect = static_cast<MMonPaxos*>(op->get_req());

  assert(mon->is_peon()); // mon epoch filter should catch strays

  // we're recoverying, it seems!
  state = STATE_RECOVERING;

  //我落后的太远，中间相差的已无法通过log补齐，只有bootstrap(自举)了。
  if (collect->first_committed > last_committed+1) {
    dout(5) << __func__
            << " leader's lowest version is too high for our last committed"
            << " (theirs: " << collect->first_committed
            << "; ours: " << last_committed << ") -- bootstrap!" << dendl;
    op->mark_paxos_event("need to bootstrap");
    mon->bootstrap();
    return;
  }

  // reply
  MMonPaxos *last = new MMonPaxos(mon->get_epoch(), MMonPaxos::OP_LAST,
                  ceph_clock_now(g_ceph_context));
  //本地保存的两个committed，返回给leader                  
  last->last_committed = last_committed;
  last->first_committed = first_committed;
  
  version_t previous_pn = accepted_pn;//这个是本地记录的以前的accepted_pn

  //这个是标准的paxos PN比较，如果收到的PN大于我之前接受过的PN ,则同意
  if (collect->pn > accepted_pn) {
    accepted_pn = collect->pn;
    accepted_pn_from = collect->pn_from;
    dout(10) << "accepting pn " << accepted_pn << " from " 
         << accepted_pn_from << dendl;
  
    MonitorDBStore::TransactionRef t(new MonitorDBStore::Transaction);
    //需要先持久化，然后再回复
    t->put(get_name(), "accepted_pn", accepted_pn);
    dout(30) << __func__ << " transaction dump:\n";
    JSONFormatter f(true);
    t->dump(&f);
    f.flush(*_dout);
    *_dout << dendl;

    logger->inc(l_paxos_collect);
    logger->inc(l_paxos_collect_keys, t->get_keys());
    logger->inc(l_paxos_collect_bytes, t->get_bytes());
    utime_t start = ceph_clock_now(NULL);
    get_store()->apply_transaction(t);

    utime_t end = ceph_clock_now(NULL);
    logger->tinc(l_paxos_collect_latency, end - start);
  } else {//其他情况，不接受
    // don't accept!
    dout(10) << "NOT accepting pn " << collect->pn << " from " << collect->pn_from
         << ", we already accepted " << accepted_pn
         << " from " << accepted_pn_from << dendl;
  }
  //如果collect->pn(对方发过来的pn)小于我的PN，那么这个回复，就是拒绝。
  last->pn = accepted_pn;
  last->pn_from = accepted_pn_from;

  // share whatever committed values we have
  /*已经committed的数据都是可以信任的，如果对方的last_committed比我的小，
    那么我把我知道的已经commit的都分享做同步。share_state时，
    对方的处理函数是store_stat() 。完成后，对方也会修改了last_committed*/
  share_state(last, collect->first_committed, collect->last_committed);

  // do we have an accepted but uncommitted value?
  //  (it'll be at last_committed+1)
  bufferlist bl;
  if (collect->last_committed <= last_committed &&
      get_store()->exists(get_name(), last_committed+1)) {
      //前面提过，last_committed+1这个版本如果存在，那是一个未决的提议，
      //需要告诉leader。
    get_store()->get(get_name(), last_committed+1, bl);
    assert(bl.length() > 0);
    dout(10) << " sharing our accepted but uncommitted value for " 
         << last_committed+1 << " (" << bl.length() << " bytes)" << dendl;
    last->values[last_committed+1] = bl;

    version_t v = get_store()->get(get_name(), "pending_v");
    version_t pn = get_store()->get(get_name(), "pending_pn");
    
    if (v && pn && v == last_committed + 1) {
      /*如果有pending_pn，那么返回的uncommitted_pn就是 pending_pn，
      否则就在下面直接用previous_pn代替了*/
      last->uncommitted_pn = pn; 
    } else {
      // previously we didn't record which pn a value was accepted
      // under!  use the pn value we just had...  :(
      dout(10) << "WARNING: no pending_pn on disk, using previous accepted_pn " << previous_pn
           << " and crossing our fingers" << dendl;
      last->uncommitted_pn = previous_pn;
    }

    logger->inc(l_paxos_collect_uncommitted);
  }

  //reply可能是拒绝，如果我的pn比leader给的大
  collect->get_connection()->send_message(last);
}

2.4 分享已经commit的数据的两个函数

/**对方的处理函数是： store_state。share的是二者last_committed之间的各个版本对应的value。
 * @note This is Okay. We share our versions between peer_last_committed and
 *     our last_committed (inclusive), and add their bufferlists to the
 *     message. It will be the peer's job to apply them to its store, as
 *     these bufferlists will contain raw transactions.
 *     This function is called by both the Peon and the Leader. The Peon will
 *     share the state with the Leader during handle_collect(), sharing any
 *     values the leader may be missing (i.e., the leader's last_committed is
 *     lower than the peon's last_committed). The Leader will share the state
 *     with the Peon during handle_last(), if the peon's last_committed is
 *     lower than the leader's last_committed.
 */
void Paxos::share_state(MMonPaxos *m, version_t peer_first_committed,
            version_t peer_last_committed)
{
  assert(peer_last_committed < last_committed);

  dout(10) << "share_state peer has fc " << peer_first_committed 
       << " lc " << peer_last_committed << dendl;
  version_t v = peer_last_committed + 1;

  // include incrementals
  uint64_t bytes = 0;
  for ( ; v <= last_committed; v++) {
    /*注意这里面并没有进行消息传递，只是把两个版本之间的内容给打包
      进了msg，随着msg的其他内容一起发送*/
    if (get_store()->exists(get_name(), v)) {
      get_store()->get(get_name(), v, m->values[v]);
      assert(m->values[v].length());
      dout(10) << " sharing " << v << " ("
           << m->values[v].length() << " bytes)" << dendl;
      bytes += m->values[v].length() + 16;  // paxos_ + 10 digits = 16
    }
  }
  logger->inc(l_paxos_share_state);
  logger->inc(l_paxos_share_state_keys, m->values.size());
  logger->inc(l_paxos_share_state_bytes, bytes);

  m->last_committed = last_committed;
}

/**
 * Store on disk a state that was shared with us
 *
 * Basically, we received a set of version. Or just one. It doesn't matter.
 * What matters is that we have to stash it in the store. So, we will simply
 * write every single bufferlist into their own versions on our side (i.e.,
 * onto paxos-related keys), and then we will decode those same bufferlists
 * we just wrote and apply the transactions they hold. We will also update
 * our first and last committed values to point to the new values, if need
 * be. All all this is done tightly wrapped in a transaction to ensure we
 * enjoy the atomicity guarantees given by our awesome k/v store.
 */
bool Paxos::store_state(MMonPaxos *m)
{
  MonitorDBStore::TransactionRef t(new MonitorDBStore::Transaction);
  map<version_t,bufferlist>::iterator start = m->values.begin();
  bool changed = false;

  // build map of values to store
  // we want to write the range [last_committed, m->last_committed] only. 
  //对方状态比我快太多，没法根据收到的值去catchup
  if (start != m->values.end() &&
      start->first > last_committed + 1) {
    // ignore everything if values start in the future.
    dout(10) << "store_state ignoring all values, they start at " << start->first
         << " > last_committed+1" << dendl;
    start = m->values.end();
  }

  // push forward the start position on the message's values iterator, up until
  // we run out of positions or we find a position matching 'last_committed'.
  while (start != m->values.end() && start->first <= last_committed) {
  //移到我的last_committed开始
    ++start;
  }

  // make sure we get the right interval of values to apply by pushing forward
  // the 'end' iterator until it matches the message's 'last_committed'.
  map<version_t,bufferlist>::iterator end = start;
  while (end != m->values.end() && end->first <= m->last_committed) {
    last_committed = end->first;//内存中先修改
    ++end;
  }

  if (start == end) {
    dout(10) << "store_state nothing to commit" << dendl;
  } else {
    dout(10) << "store_state [" << start->first << ".." 
         << last_committed << "]" << dendl;
    //用一个事务，写入所有变化，包括last_committed和各个version
    t->put(get_name(), "last_committed", last_committed);

    // we should apply the state here -- decode every single bufferlist in the
    // map and append the transactions to 't'.
    map<version_t,bufferlist>::iterator it;
    for (it = start; it != end; ++it) {
      // write the bufferlist as the version's value
      //要store的version和相应value，先推入t
      t->put(get_name(), it->first, it->second);
      // decode the bufferlist and append it to the transaction we will shortly
      // apply.
      decode_append_transaction(t, it->second);
    }

    // discard obsolete uncommitted value?
    if (uncommitted_v && uncommitted_v <= last_committed) {
      dout(10) << " forgetting obsolete uncommitted value " << uncommitted_v
           << " pn " << uncommitted_pn << dendl;
      uncommitted_v = 0;
      uncommitted_pn = 0;
      uncommitted_value.clear();
    }
  }
  if (!t->empty()) {//t非空，说明有值要写

    logger->inc(l_paxos_store_state);
    logger->inc(l_paxos_store_state_bytes, t->get_bytes());
    logger->inc(l_paxos_store_state_keys, t->get_keys());
    utime_t start = ceph_clock_now(NULL);

    /*事务提交，包括last_committed和一些version及values。
      这个函数实际上会等待事务完成。*/
    get_store()->apply_transaction(t); 

    utime_t end = ceph_clock_now(NULL);
    logger->tinc(l_paxos_store_state_latency, end - start);

    //first_committed可能在事务执行过程中trim被修改了(log被trim了)，刷新下
    first_committed = get_store()->get(get_name(), "first_committed");

    _sanity_check_store();
    changed = true;//说明有修改的值
  }

  remove_legacy_versions();//erase掉比first_committed更早的

  return changed;
}

2.5 Leader处理Peon的回复

/*Leader收到回复后的处理。
  在 Ceph的election过程中，用预设的rank作为优先级。
  当选的leader不一定持有最新的数据，因此collection过程中，
  Leader需要更新下自己的数据。这些更新，都是根据已经"commit"的数据。*/
void Paxos::handle_last(MonOpRequestRef op)
{
  op->mark_paxos_event("handle_last");
  MMonPaxos *last = static_cast<MMonPaxos*>(op->get_req());
  bool need_refresh = false;
  
  //from是对方的编号
  int from = last->get_source().num();

  dout(10) << "handle_last " << *last << dendl;

  if (!mon->is_leader()) {
    dout(10) << "not leader, dropping" << dendl;
    return;
  }

  // note peer's first_ and last_committed, in case we learn a new
  // commit and need to push it to them.
  
  //本次返回的结果，插入map。
  peer_first_committed[from] = last->first_committed;
  peer_last_committed[from] = last->last_committed;
  
  //跟peer相比，自己落后很多，以至于别人也没有保留当时的各个版本的raw transaction信息。
  //只有直接走bootstrap流程，做完全同步。
  if (last->first_committed > last_committed + 1) {
    dout(5) << __func__
            << " mon." << from
        << " lowest version is too high for our last committed"
            << " (theirs: " << last->first_committed
            << "; ours: " << last_committed << ") -- bootstrap!" << dendl;
    op->mark_paxos_event("need to bootstrap");
    mon->bootstrap();
    return;
  }

  assert(g_conf->paxos_kill_at != 1);
  
  /*对应handle_collect 内部的share_state，对方可能给我共享了
    一部分更新的已commit数据(leader的状态比较旧)*/
  need_refresh = store_state(last);

  assert(g_conf->paxos_kill_at != 2);
  
  //store_state()会改变leader的last_committed和first_committed。
  //然后就可能发现某个peon也需要被更新
  for (map<int,version_t>::iterator p = peer_last_committed.begin();
       p != peer_last_committed.end();
       ++p) {
    if (p->second + 1 < first_committed && first_committed > 1) {
    //对方版本太旧，没法同步了。
      dout(5) << __func__
          << " peon " << p->first
          << " last_committed (" << p->second
          << ") is too low for our first_committed (" << first_committed
          << ") -- bootstrap!" << dendl;
      op->mark_paxos_event("need to bootstrap");
      mon->bootstrap();
      return;
    }
    
    //对方比我旧，但是还在可同步范围。
    if (p->second < last_committed) {
      // share committed values
      dout(10) << " sending commit to mon." << p->first << dendl;
      MMonPaxos *commit = new MMonPaxos(mon->get_epoch(),
      ceph_clock_now(g_ceph_context));
      //构造一条commit消息，给peon分享已经commit的数据
      share_state(commit, peer_first_committed[p->first], p->second);
      mon->messenger->send_message(commit, mon->monmap->get_inst(p->first));
    }
  }

  //Peon接受过的PN比Leader生成的PN大，按照paxos协议，提高PN，重试！
  if (last->pn > accepted_pn) {
    // no, try again.
    dout(10) << " they had a higher pn than us, picking a new one." << dendl; 

    // cancel timeout event
    mon->timer.cancel_event(collect_timeout_event);
    collect_timeout_event = 0;
    //注意，这次用新的PN继续collect。但不是重新选举。
    collect(last->pn);
  } else if (last->pn == accepted_pn) {//对方接受了
    // yes, they accepted our pn.  great.
    num_last++;

    // did this person send back an accepted but uncommitted value?
    if (last->uncommitted_pn) {
    //last_commited对应的肯定在此前达成quorum一致的，
    //而uncommitted则是认为没有形成quorum一致的，需要处理。
    //保证有大的uncommitted_pn，才符合paxos只直接受更大PN原则
      if (last->uncommitted_pn >= uncommitted_pn && 
      last->last_committed >= last_committed &&
      last->last_committed + 1 >= uncommitted_v) {
      //这个比较，是因为Leader会收到多个peon的uncommitted_v，取大的。
    uncommitted_v = last->last_committed+1; 
    // uncommitted_v 会一直朝大的变化
    uncommitted_pn = last->uncommitted_pn;
    uncommitted_value = last->values[uncommitted_v];
    dout(10) << "we learned an uncommitted value for " << uncommitted_v
         << " pn " << uncommitted_pn
         << " " << uncommitted_value.length() << " bytes"
         << dendl;
      } else {
    dout(10) << "ignoring uncommitted value for " << (last->last_committed+1)
         << " pn " << last->uncommitted_pn
         << " " << last->values[last->last_committed+1].length() << " bytes"
         << dendl;
      }
    }
    
    // is that everyone?
    if (num_last == mon->get_quorum().size()) {
      //这里要求quorum成员全体都响应
      // cancel timeout event
      mon->timer.cancel_event(collect_timeout_event);
      collect_timeout_event = 0;
      peer_first_committed.clear();
      peer_last_committed.clear();

      // almost...

      // did we learn an old value?
      if (uncommitted_v == last_committed+1 &&  //只允许差1
      //消息中携带的，上面刚刚赋了值
    dout(10) << "that's everyone.  begin on old learned value" << dendl;
    //选举结束，但是发现选举前，有未commit的value。
    //之前的value不一定形成了多数派，所以要重新走一次accept过程。
    state = STATE_UPDATING_PREVIOUS;
    
    //这个value可能只形成了少数派，不能直接commit。
    //而是用原来的PN最大的value，使用新的PN，重新走一次phase 2。
    begin(uncommitted_value);
      } else {//这个分支，实际上存在少数派宕机重启的不确定性问题 
    // active!
    dout(10) << "that's everyone.  active!" << dendl;
    extend_lease();

    need_refresh = false;
    if (do_refresh()) {
      finish_round();
    }
      }
    }
  } else {
    // no, this is an old message, discard
    dout(10) << "old pn, ignoring" << dendl;
  }

  if (need_refresh)
    (void)do_refresh();
}

2.6 超时处理函数

/*collect的超时处理：直接调用bootstrap。同步monitor信息，并重新选举leader*/
void Paxos::collect_timeout()
{
  dout(1) << "collect timeout, calling fresh election" << dendl;
  collect_timeout_event = 0;
  logger->inc(l_paxos_collect_timeout);
  assert(mon->is_leader());
  mon->bootstrap();
}

Ceph的Paxos源码注释 - Phase 1

1. 几个要点说明

1.1 Epoch

1.2 PN (Proposal Number)

1.3 Version

1.4 持久化

1.5其他需要持久化的数据结构

1.6 Phase 1交互过程简介

2. 代码

2.1 初始化

2.2 Leader发起的collect

2.3 Peon处理collect请求

2.4 分享已经commit的数据的两个函数

2.5 Leader处理Peon的回复

2.6 超时处理函数

存储老小伙

引用和评论

Paxos与Raft的简明比较

如何基于 Go 语言设计一个简洁优雅的分布式任务系统

融合AMD与NVIDIA GPU集群的MLOps：异构计算环境中的分布式训练架构实践

本地？线上？分布式系统前后端架构、部署、联调指南，突破技术

【微服务架构】从链路追踪到日志关联：打造分布式系统问题定位利器

CAP 理论：分布式系统的三选二原则与 Java 实战

Ceph存储1：AI大模型各阶段对存储需求是怎样的？