Crash MPI
The following are basically my notes from going through https://www.hlrs.de/training/self-study-materials. This tutorial is highly recommended, as it contains plenty of informative code exercises!
Overview
Sequential vs Parallel
Sequential - One memory, one processor; all the work is done in linear order
Parallel - Many memories, many processors; depending on the level, processors may share memory or communicate via MPI (Socket - Node - Cluster)
SPMD
single (sub-)program, multiple data
MPMD (multiple program, multiple data) can be emulated within SPMD by branching on ranks.
rank in MPI
The rank of each process determines its behavior / role in the program
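For instance, a minimal sketch of branching on rank to emulate MPMD (the role names are just illustrative):

```python
from mpi4py import MPI

rank = MPI.COMM_WORLD.Get_rank()

# One source file, but each process takes a different role based on its rank
if rank == 0:
    print("rank 0: acting as the coordinator")
else:
    print(f"rank {rank}: acting as a worker")
```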
First MPI example
```python
from mpi4py import MPI
```
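A minimal sketch of a complete version, assuming the exercise reads the number of elements n on rank 0, shares it via comm.bcast, and has every rank report its portion (the prompt string matches the output below; the bcast and the work split are assumptions):

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

if rank == 0:
    n = int(input("Enter the number of elements (n): "))
else:
    n = None
n = comm.bcast(n, root=0)  # assumption: n is shared with all ranks

print(f"I am process {rank} of {size}, responsible for ~{n // size} elements")
```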
Execute with `mpirun -np 4 python first-example.py`; the output is
```
Enter the number of elements (n): 100
```
Bottom lines
- The output of each process is in the well-defined sequence of its program
- The output from different processes can be intermixed in any sequence
- A sub-program needs to be connected to a message passing system
- The total program (i.e., all sub-programs of the program) must be started with the MPI startup tool
Messages among processors
- process id (rank)
- location (sender / receiver)
- data type (sender / receiver)
- data size (sender / receiver)
P2P Communication
From one process to another;
Two types: synchronous / asynchronous (buffered)
Synchronous send
Sender and receiver work at the same time; the sender gets a confirmation when the transfer has finished.
This operation blocks the sender until the receive is posted; the receive operation blocks until the message has been sent.
Asynchronous send
The sender puts the data into a buffer and knows nothing about whether it has been received
Blocking / Nonblocking
Some operations may block until another process acts;
Non-blocking: returns immediately, allowing the sub-program to continue its work
All nonblocking procedures must have a matching wait (or test) procedure.
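A minimal non-blocking sketch in mpi4py, showing the matching wait (array size and tag are arbitrary):

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    buf = np.arange(10, dtype='d')
    req = comm.Isend(buf, dest=1, tag=0)  # returns immediately
    # ... other work can overlap with the communication here ...
    req.Wait()                            # matching wait: the send is now complete
elif rank == 1:
    buf = np.empty(10, dtype='d')
    req = comm.Irecv(buf, source=0, tag=0)
    req.Wait()                            # or req.Test() to poll without blocking
```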
Amdahl’s Law
The serial (non-parallelizable) fraction $f$ of a program limits the achievable speedup on $p$ processes: $S(p) = \frac{1}{f + (1-f)/p} \le 1/f$. For example, with $f = 5\%$ the speedup can never exceed 20, no matter how many processes are used.
Process Model and Language Bindings
Headers
```python
from mpi4py import MPI
```
Functions
```python
comm_world = MPI.COMM_WORLD  # handle to the predefined world communicator
```
Initialize
```python
from mpi4py import MPI  # MPI_Init is called automatically at import time
```
Finalize
```python
# MPI.Finalize() -- mpi4py calls this automatically at interpreter exit
```
No more MPI calls are allowed after this (including re-initialization)
Start MPI program
With either mpirun or mpiexec; the latter is recommended because it is the form defined by the MPI standard (mpirun is implementation-specific).
MPI_COMM_WORLD
This is a pre-defined handle covering all the processes; each process is assigned a rank (0, 1, …, np - 1)
Handles
Handles are the identification of MPI objects;
```python
# predefined constants are handles too, e.g. MPI.COMM_WORLD
```
Rank
Ranks are the identification of processes within a communicator
```python
rank = comm.Get_rank()  # rank of the calling process within comm
```
Size
Total number of processes in the communicator (= max(rank) + 1)
```python
size = comm.Get_size()  # number of processes in comm
```
is Parallel ! awesome
We cannot guarantee the order of the output unless the processes coordinate before printing.
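A small sketch that makes the scrambling visible (the file name and word list are just for illustration):

```python
from mpi4py import MPI

rank = MPI.COMM_WORLD.Get_rank()

# Run with: mpirun -np 4 python order.py
# The four words can appear in any interleaving,
# e.g. "is Parallel ! awesome" instead of "Parallel is awesome !".
words = ["Parallel", "is", "awesome", "!"]
print(words[rank % len(words)], flush=True)
```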
Messages and Point-to-Point Communication
Messages in Python
There are two ways of messaging in Python: the uppercase methods work on buffer-like objects (fast), while the lowercase methods pickle arbitrary Python objects.
```python
# Fast way, only for numpy (buffer-like objects): comm.Send / comm.Recv
# General way, for any picklable object: comm.send / comm.recv
```
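A minimal sketch of both paths (array size and payload are arbitrary):

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    comm.Send(np.arange(5, dtype='d'), dest=1)  # fast: buffer protocol
    comm.send({"any": "object"}, dest=1)        # general: pickled
elif rank == 1:
    arr = np.empty(5, dtype='d')
    comm.Recv(arr, source=0)   # uppercase must match uppercase
    obj = comm.recv(source=0)  # lowercase must match lowercase
```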
Sending
```python
comm.Send(buf, dest, tag=0)   # buffer version
comm.send(obj, dest, tag=0)   # generic object version
```
buf is an ndarray (or anything implementing the Python buffer protocol), while obj can be any picklable object;
dest is the rank of the destination process;
tag is an int with additional info from the sender (usually used to specify the message type)
Various sends
ssend for synchronous send - Completes only when the receive has started
bsend for buffered / async send - Always completes; the user must attach the buffer space it uses
send for standard send - Either buffered or synchronous; uses an internal buffer
rsend for ready send - May only be started after the matching receive has been posted
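A sketch of these variants in mpi4py (ready send omitted, since it is only safe when the receive is guaranteed to be posted first; sizes are illustrative):

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
data = np.arange(4, dtype='i')

if rank == 0:
    comm.Send(data, dest=1)   # standard: MPI chooses buffered or synchronous
    comm.Ssend(data, dest=1)  # synchronous: completes once the receive has started

    # Buffered send requires user-attached buffer space:
    MPI.Attach_buffer(bytearray(data.nbytes + MPI.BSEND_OVERHEAD))
    comm.Bsend(data, dest=1)
    MPI.Detach_buffer()
elif rank == 1:
    out = np.empty(4, dtype='i')
    for _ in range(3):
        comm.Recv(out, source=0)
```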
Which is better?
Synchronous sends require waiting: high latency but best bandwidth.
Asynchronous sends: low latency but worse bandwidth (the data takes an extra trip through the buffer).
Receiving
```python
comm.Recv(buf, source=MPI.ANY_SOURCE, tag=MPI.ANY_TAG, status=None)
```
buf is the receive buffer;
source is the rank of the source process;
tag must match the sender's tag;
In mpi4py, you need to match Send with Recv and send with recv (uppercase with uppercase, lowercase with lowercase)
One receive
recv can accept all kinds of send - Completes when a message has arrived
On Python's print function
The following is the ping-pong example:
```python
from mpi4py import MPI
```
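A minimal sketch of the full ping-pong, assuming rank 0 sends "ping" to rank 1 and rank 1 answers "pong", with prints around each call (matching the output lines below; the exact message text and timestamping are assumptions):

```python
from mpi4py import MPI
import time

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    print("I am", rank, "before send ping", time.time())
    comm.send("ping", dest=1)
    msg = comm.recv(source=1)
    print("I am", rank, "after recv", msg, time.time())
elif rank == 1:
    msg = comm.recv(source=0)
    print("I am", rank, "after recv", msg, time.time())
    comm.send("pong", dest=0)
```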
My expected output is
```
I am 0 before send ping
```
which would demonstrate the order of the send / recv operations; however, the real output is
```
I am 1 after recv ping
```
This is actually due to the behavior of print rather than MPI: if we add time info to the output, we get
```
I am 1 after recv ping 1680684130.3630078
```
which means the order really is #0 send -> #1 recv -> #1 send -> #0 recv. So each process's output buffer is only flushed at the end of the run; printing with flush=True (or running Python with -u) makes the output appear immediately.
Deadlocks
If synchronous sends are used, a deadlock forms when two (or more) processes wait for each other.
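A classic sketch of such a deadlock (intentionally broken; sizes are illustrative):

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
other = 1 - rank  # assumes exactly 2 processes

sendbuf = np.zeros(1000, dtype='d')
recvbuf = np.empty(1000, dtype='d')

# Both ranks post a synchronous send first; each Ssend blocks until the
# other side posts its receive, which never happens -> deadlock.
comm.Ssend(sendbuf, dest=other)
comm.Recv(recvbuf, source=other)  # never reached
```

Reordering the calls on one rank (send first on rank 0, receive first on rank 1) or using comm.Sendrecv breaks the cycle.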
Status
The status object is filled in on the receiving side; it contains the metadata of the communication (source, tag, message size)
```python
status.Get_source()  # rank of the sender
status.Get_tag()     # tag of the received message
```
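A short usage sketch (tag value and array size are arbitrary):

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    comm.Send(np.arange(3, dtype='d'), dest=1, tag=7)
elif rank == 1:
    buf = np.empty(3, dtype='d')
    status = MPI.Status()
    comm.Recv(buf, source=MPI.ANY_SOURCE, tag=MPI.ANY_TAG, status=status)
    print(status.Get_source(), status.Get_tag())  # who sent it, with which tag
    print(status.Get_count(MPI.DOUBLE))           # number of elements received
```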
Message Order Preservation
One link, one order: messages from the same sender to the same receiver do not overtake each other; they are received in the order they were sent. There is no such guarantee across different links.
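A small sketch of this non-overtaking rule (message contents are illustrative):

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    comm.send("first", dest=1, tag=0)
    comm.send("second", dest=1, tag=0)
elif rank == 1:
    # Same sender, same receiver: the order is preserved.
    print(comm.recv(source=0, tag=0))  # always "first"
    print(comm.recv(source=0, tag=0))  # always "second"
```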