Introduction to MPI

MPI comes up often in distributed systems. These are my notes from working through its basic usage.

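Tutorial

Communicator. A communicator defines a group of processes that can send messages to one another. Each process in the group is assigned an integer identifier called its rank.

Point-to-point communication

The sender first writes the data it wants to send into a buffer. The buffer is a region of memory holding count elements of the type described by the datatype argument (for example, an int array for MPI_INT), and it is passed to MPI_Send as a void *.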

MPI_Send(
    void* data,
    int count,
    MPI_Datatype datatype,
    int destination,
    int tag,
    MPI_Comm communicator)
MPI_Recv(
    void* data,
    int count,
    MPI_Datatype datatype,
    int source,
    int tag,
    MPI_Comm communicator,
    MPI_Status* status)

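Notes

  • The source argument of MPI_Recv can be MPI_ANY_SOURCE, which accepts a message from any sender. MPI_Send has no such wildcard: destination must name one specific rank.
  • If the status of MPI_Recv is not needed, pass MPI_STATUS_IGNORE.

If MPI_Recv is given an MPI_Status argument, say one named stat, MPI fills it with information about the received message, mainly these three items:

  • The rank of the sender: read it from stat.MPI_SOURCE;
  • The tag of the message: stat.MPI_TAG;
  • The length of the message: this is not a field of stat that can be read directly; call the following function instead: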
MPI_Get_count(
    MPI_Status* status,
    MPI_Datatype datatype,
    int* count)

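The count variable is the total number of datatype elements that were received.
This raises a question: why are these three pieces of information needed?

  • Doesn't MPI_Recv already have a count parameter?
    The count passed to MPI_Recv is the maximum number of datatype elements to receive, whereas the count obtained from the MPI_Status is the number actually received.
  • Doesn't MPI_Recv already have a tag parameter?
    In the common case tag is a fixed value, but MPI_ANY_TAG can be passed to accept a message with any tag. The status is then the only way to tell which tag the received message carried.
  • Similarly, MPI_Recv can be given MPI_ANY_SOURCE to accept a message from any sender, and the status is then the only way to tell which sender the message came from.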

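A call to MPI_Recv needs a buffer to store the incoming message, but the size of the message is often unknown before it arrives. The usual approach is to call MPI_Probe first to inspect the pending message, size the buffer from the returned status, and then call MPI_Recv to actually receive it.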

MPI_Probe(
    int source,
    int tag,
    MPI_Comm comm,
    MPI_Status* status)

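Collective communication

MPI_Barrier(MPI_Comm communicator): this is the barrier from BSP.

One last thing to keep in mind about synchronization: every process in the communicator has to take part in every collective call, so treat each collective call as a possible synchronization point. If the processes cannot all complete an MPI_Barrier at some point, they cannot complete any other collective call there either. If MPI_Barrier is called without making sure that every process in the communicator also calls it, the program hangs. This is confusing for beginners, so watch out for it.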

MPI_Bcast

MPI_Bcast(
    void* data,
    int count,
    MPI_Datatype datatype,
    int root,
    MPI_Comm communicator)

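The sender and the receivers all call the same MPI_Bcast, with the same root. This differs from point-to-point communication, where the two sides call different functions.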

Question: what is the second MPI_Barrier in compare_bcast.c for?

MPI Scatter, Gather, and Allgather

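Differences between MPI_Bcast and MPI_Scatter:

  • Bcast sends the same data to every process, whereas Scatter sends a different slice of the data to each process, so each process gets only part of it;
  • Bcast takes a single data pointer, while Scatter takes two buffers. In Bcast the root already holds the data, so data acts as the send buffer on the root and as the receive buffer on every other process; the root has nothing to receive, and after the call every process, root included, holds the same data. In Scatter the root is also a receiver: its own slice is copied from send_data into its recv_data, just as for every other process in the communicator, which is why two buffers are needed.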
MPI_Scatter(
    void* send_data,
    int send_count,
    MPI_Datatype send_datatype,
    void* recv_data,
    int recv_count,
    MPI_Datatype recv_datatype,
    int root,
    MPI_Comm communicator)

MPI Reduce and Allreduce

MPI_Reduce(
    void* send_data,
    void* recv_data,
    int count,
    MPI_Datatype datatype,
    MPI_Op op,
    int root,
    MPI_Comm communicator)
MPI_Allreduce(
    void* send_data,
    void* recv_data,
    int count,
    MPI_Datatype datatype,
    MPI_Op op,
    MPI_Comm communicator)

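MPI_Allreduce relates to MPI_Reduce the way MPI_Allgather relates to MPI_Gather. A plain MPI_Gather collects the result on one process, whereas MPI_Allgather delivers it to all processes. Likewise, MPI_Allreduce makes the result of the reduction available to every process instead of only the root.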

Groups and Communicators

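The examples so far either talk to one process or talk to all processes, and use only the default communicator. As a program grows, it may need to communicate with only a subset of the processes. This is what groups are for, with each group corresponding to its own communicator. So how are additional communicators created?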

MPI_Comm_split(
    MPI_Comm comm,
    int color,
    int key,
    MPI_Comm* newcomm)

MPI_Comm_split creates new communicators by “splitting” a communicator into a group of sub-communicators based on the input values color and key.

The first argument, comm, is the communicator that will be used as the basis for the new communicators. This could be MPI_COMM_WORLD, but it could be any other communicator as well.

The second argument, color, determines to which new communicator each process will belong. All processes that pass in the same value for color are assigned to the same communicator. If the color is MPI_UNDEFINED, that process won’t be included in any of the new communicators.

The third argument, key, determines the ordering (rank) within each new communicator. The process which passes in the smallest value for key will be rank 0, the next smallest will be rank 1, and so on. If there is a tie, the process that had the lower rank in the original communicator will be first.

When you print from an MPI program, each process has to send its output back to the place where the MPI job was launched before it can appear on the screen. As a result the ordering tends to get jumbled: printing in a specific rank order does not guarantee that the output shows up in that order.

MPI has a limited number of objects that it can create at a time and not freeing your objects could result in a runtime error if MPI runs out of allocatable objects.

Additional questions

  1. When initializing MPI: MPI_Init or MPI_Init_thread?
int MPI_Init_thread(int *argc, char *((*argv)[]), int required, int *provided) 
int MPI::Init_thread(int& argc, char**& argv, int required) 
int MPI::Init_thread(int required) 
  • argc and argv are optional. In C this is done by passing NULL; in C++ it is done through the two overloads of MPI::Init_thread shown above.
  • required: the desired level of thread support. Possible values:

    • MPI_THREAD_SINGLE: Only one thread will execute.
    • MPI_THREAD_FUNNELED: The process may be multi-threaded, but only the main thread will make MPI calls (all MPI calls are "funneled" to the main thread).
    • MPI_THREAD_SERIALIZED: The process may be multi-threaded, and multiple threads may make MPI calls, but only one at a time: MPI calls are not made concurrently from two distinct threads (all MPI calls are "serialized").
    • MPI_THREAD_MULTIPLE: Multiple threads may call MPI, with no restrictions.
  • The call returns in provided information about the actual level of thread support that will be provided by MPI. It can be one of the four values listed above.

Vendors may provide (implementation dependent) means to specify the level(s) of thread support available when the MPI program is started, e.g., with arguments to mpiexec. This will affect the outcome of calls to MPI_INIT and MPI_INIT_THREAD.

Suppose, for example, that an MPI program has been started so that only MPI_THREAD_MULTIPLE is available. Then MPI_INIT_THREAD will return provided = MPI_THREAD_MULTIPLE, irrespective of the value of required; a call to MPI_INIT will also initialize the MPI thread support level to MPI_THREAD_MULTIPLE.

Suppose, on the other hand, that an MPI program has been started so that all four levels of thread support are available. Then, a call to MPI_INIT_THREAD will return provided = required; on the other hand, a call to MPI_INIT will initialize the MPI thread support level to MPI_THREAD_SINGLE. In other words, calling MPI_Init_thread with required set to MPI_THREAD_SINGLE has the same effect as calling MPI_Init.

https://www.mcs.anl.gov/research/projects/mpi/mpi-standard/mpi-report-2.0/node165.htm


windkl