MPI and Sarus

Now we are jumping into HPC runtimes. It’s important to understand what we are about to do… :)

This YouTube video from HPCW21 provides an overview of how PMI and MPI play together to allow the hardware-specific MPI libraries to be used.

That’s a longer discussion, but in short: the two money-quote slides are these.

Driver backwards/forward compatibility

As long as the kernel driver on the host is able to deal with the user-land driver, you should be in the green.

ABI compatibility / BYOM

If you know which MPI you need, you can package that MPI into the container and be prepared for the system you expect to run on. The big drawback is that this does not provide portability: you have locked yourself into the MPI you chose.
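A minimal sketch of the BYOM approach, assuming a Debian base image and a plain MPICH install from the distribution packages (image name, base, and source file are illustrative):

```dockerfile
# Illustrative BYOM image: bake a specific MPI (here MPICH) into the container.
FROM debian:bookworm-slim
RUN apt-get update \
    && apt-get install -y --no-install-recommends build-essential mpich libmpich-dev \
    && rm -rf /var/lib/apt/lists/*
# hello.c is assumed to be a standard MPI hello-world next to this Dockerfile.
COPY hello.c /src/hello.c
RUN mpicc -o /usr/local/bin/hello /src/hello.c
```

The resulting binary is tied to MPICH’s ABI; it will run well on systems whose MPI is compatible with what you baked in, and poorly (or not at all) elsewhere.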

For MVAPICH-based MPI implementations you can leverage ABI compatibility: you can swap out the underlying libraries and the binary will pick up the MPI that was swapped in. That is the way Sarus works.

OCI hook

Sarus picks up OCI hooks to deal with MPI. The hooks directory is defined in the Sarus configuration.

sudo cat /opt/sarus/etc/sarus.json

The OCI hook for MPI has the following configuration.

sudo cat /opt/sarus/etc/hooks.d/01-mpi-hook.json

The MPI hook is instance specific and configured by the system administrator. Since ParallelCluster comes with Intel MPI, this instance will attempt to replace the container’s MPI with the libraries listed in the hook. It will also bind-mount the EFA device /dev/infiniband/uverbs0 so the high-performance interconnect on the instances can be used for communication.
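The file printed above follows the OCI hooks schema; a Sarus MPI hook entry typically looks roughly like the following sketch (hook path, library paths, and bind mounts are illustrative — on this cluster they would point at the Intel MPI libraries and the EFA device):

```json
{
  "version": "1.0.0",
  "hook": {
    "path": "/opt/sarus/bin/mpi_hook",
    "env": [
      "LDCONFIG_PATH=/sbin/ldconfig",
      "MPI_LIBS=/opt/mpi/lib/libmpi.so.12",
      "MPI_DEPENDENCY_LIBS=",
      "BIND_MOUNTS=/dev/infiniband/uverbs0"
    ]
  },
  "when": { "annotations": { "^com.hooks.mpi.enabled$": "^true$" } },
  "stages": ["createContainer"]
}
```

The `when.annotations` condition is what ties the hook to the `--mpi` flag: running with `sarus run --mpi` sets the matching annotation, which activates the hook for that container.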


sarus pull qnib/ethcscs-hellompi:debian
sarus images


cat > sarus-hello-run.sbatch << EOF
#!/bin/bash
#SBATCH --job-name=sarus-hello-run
#SBATCH --ntasks=4
#SBATCH --output=/fsx/logs/%x_%j.out

module load intelmpi
set -x
mpirun sarus run --mpi qnib/ethcscs-hellompi:debian /fsx/bin/hello
EOF

We’ll submit the job with two nodes.

sbatch -N2 sarus-hello-run.sbatch

Once the job has finished, the output should look like this.

cat /fsx/logs/sarus-hello-run_*.out