Now we are jumping into HPC runtimes. It’s important to understand what we are about to do… :)
This YouTube video from HPCW21 provides an overview of how PMI and MPI play together to allow hardware-specific MPI libraries to be used.
That's a longer discussion, but in short: these are the two money-quote slides.
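To summarize the PMI part: a process manager starts N copies of the application and hands each copy its rank and the world size, typically via the environment, which is what lets the MPI library inside each process find its peers. The sketch below imitates that handshake in plain shell; the variable names PMI_RANK and PMI_SIZE are used by some PMI implementations, but treat them as illustrative here.

```shell
# Sketch: a launcher starting 4 "ranks" and passing identity via the
# environment, the way a PMI-style process manager does.
SIZE=4
out=$(for rank in $(seq 0 $((SIZE - 1))); do
  # each process reads its rank/size from the environment it was given
  PMI_RANK=$rank PMI_SIZE=$SIZE sh -c 'echo "rank $PMI_RANK of $PMI_SIZE"'
done)
echo "$out"
```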
As long as the kernel driver on the host is able to deal with the user-land driver, you should be in the green.
If you know which MPI you need, you can package that MPI into the container and you will be prepared for the system you expect to run on. The big drawback is that this won't provide portability, because you have locked yourself into the MPI you chose.
For MVAPICH-based MPIs you can leverage ABI compatibility: you can swap out the underlying libraries and the binary will pick up whichever MPI was swapped in. That is how Sarus works.
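The mechanism behind this is that ABI-compatible MPI builds export the same shared-library soname (libmpi.so.12 in the MPICH family), so the file behind that soname can be replaced without recompiling the binary. The sketch below fakes the two libraries with text files just to show the swap; the build labels are made up for illustration.

```shell
# Sketch: two MPI "builds" exposing the same soname can be swapped
# underneath a binary. Real hooks swap real .so files; we use text files.
set -e
work=$(mktemp -d)
mkdir -p "$work/container-mpi" "$work/host-mpi"
echo "container MPI build" > "$work/container-mpi/libmpi.so.12"
echo "host MPI build" > "$work/host-mpi/libmpi.so.12"

# At first the "binary" resolves the container's copy of libmpi.so.12 ...
ln -sfn "$work/container-mpi/libmpi.so.12" "$work/libmpi.so.12"
before=$(cat "$work/libmpi.so.12")

# ... then a hook swaps the file behind the same soname, and the binary
# picks up the host MPI the next time it starts.
ln -sfn "$work/host-mpi/libmpi.so.12" "$work/libmpi.so.12"
after=$(cat "$work/libmpi.so.12")

echo "before swap: $before"
echo "after swap:  $after"
rm -rf "$work"
```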
Sarus picks up OCI hooks to deal with MPI. The hooks directory is defined in the Sarus configuration.
sudo cat /opt/sarus/etc/sarus.json
The OCI hook for MPI has the following configuration.
sudo cat /opt/sarus/etc/hooks.d/01-mpi-hook.json
The MPI hook is instance-specific and configured by the system administrator. Since ParallelCluster comes with Intel MPI, this instance will attempt to replace the container's MPI with the libraries listed in the hook. It will also bind-mount the EFA device
/dev/infiniband/uverbs0 so the container can leverage the high-performance interconnect on the instances for communication.
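For orientation, an MPI hook configuration of this kind follows the OCI hook JSON format and looks roughly like the sketch below. The paths, environment entries, and annotation are illustrative, not copied from this cluster; the cat command above shows the real values on your instance.

```json
{
  "version": "1.0.0",
  "hook": {
    "path": "/opt/sarus/bin/mpi_hook",
    "env": [
      "LDCONFIG_PATH=/sbin/ldconfig",
      "MPI_LIBS=/opt/intel/mpi/lib/release/libmpi.so.12",
      "BIND_MOUNTS=/dev/infiniband/uverbs0"
    ]
  },
  "when": {
    "annotations": {
      "^com.hooks.mpi.enabled$": "^true$"
    }
  },
  "stages": ["prestart"]
}
```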
sarus pull qnib/ethcscs-hellompi:debian
sarus images
cat > sarus-hello-run.sbatch << EOF
#!/bin/bash
#SBATCH --job-name=sarus-hello-run
#SBATCH --ntasks=4
#SBATCH --output=/fsx/logs/%x_%j.out

module load intelmpi
# let mpirun use its own process launcher instead of Slurm's PMI library
unset I_MPI_PMI_LIBRARY
set -x
mpirun sarus run --mpi qnib/ethcscs-hellompi:debian /fsx/bin/hello
EOF
We’ll submit the job with two nodes.
sbatch -N2 sarus-hello-run.sbatch
Once the job has finished, the output should look like this.