G4dn

Community Images

To run this benchmark we fetch community images from gallery.ecr.aws/hpc.
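
If you want to fetch an image ahead of the run, a pull is a one-liner. The registry alias follows from the gallery URL above, but the repository and tag below are placeholders, not the workshop's actual names; check the gallery listing for the real ones:

sarus pull public.ecr.aws/hpc/<repository>:<tag>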

Custom Build Image

In case you built the image in a previous section, please paste your image name (as reported by sarus images) when prompted.

read -p "paste your image name (according to 'sarus images'): " SARUS_MNP_IMG
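
A quick sanity check, not strictly required, confirms the variable is set and that the name matches an entry in the local image list:

echo "Using image: ${SARUS_MNP_IMG}"
sarus images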

1 Rank / 16 Threads

For GPU-accelerated runs we stick to one rank with 16 threads per node.

mkdir -p ~/slurm/
cat > ~/slurm/gromacs-sarus-g4dn-cuda-mpich-1x16.sbatch << EOF
#!/bin/bash
#SBATCH --job-name=gromacs-sarus-g4dn-cuda-mpich-1x16
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=16
#SBATCH --gpus-per-node=1
#SBATCH --exclusive
#SBATCH --output=/fsx/logs/%x_%j.out

module load intelmpi
# ParallelCluster points I_MPI_PMI_LIBRARY at Slurm's PMI for srun integration; unset it since this job launches via mpirun
unset I_MPI_PMI_LIBRARY
# per-job directory on the shared file system
mkdir -p /fsx/jobs/\${SLURM_JOBID}
# expose the full set of NVIDIA driver capabilities inside the container
export NVIDIA_DRIVER_CAPABILITIES=all

export INPUT=/fsx/input/gromacs/benchRIB.tpr

mpirun sarus run --mpi ${SARUS_MNP_IMG} -c "gmx_mpi mdrun -s \${INPUT} -resethway -pme cpu -ntomp 16"
EOF
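
Note that ${SARUS_MNP_IMG} is unescaped in the heredoc and therefore expanded while the script is written, whereas the escaped \${SLURM_JOBID} and \${INPUT} are left for job runtime. A quick grep on the generated file confirms the image name was baked in:

grep "sarus run" ~/slurm/gromacs-sarus-g4dn-cuda-mpich-1x16.sbatch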

1 Rank / 32 Threads

For GPU-accelerated runs we stick to one rank with 32 threads per node.

mkdir -p ~/slurm/
cat > ~/slurm/gromacs-sarus-g4dn-cuda-mpich-1x32.sbatch << EOF
#!/bin/bash
#SBATCH --job-name=gromacs-sarus-g4dn-cuda-mpich-1x32
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=32
#SBATCH --gpus-per-node=1
#SBATCH --exclusive
#SBATCH --output=/fsx/logs/%x_%j.out

module load intelmpi
unset I_MPI_PMI_LIBRARY
mkdir -p /fsx/jobs/\${SLURM_JOBID}
export INPUT=/fsx/input/gromacs/benchRIB.tpr
export NVIDIA_DRIVER_CAPABILITIES=all

mpirun sarus run --mpi ${SARUS_MNP_IMG} -c "gmx_mpi mdrun -s \${INPUT} -resethway -pme cpu -ntomp 32"
EOF

Submit Jobs
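
The submissions below assume your cluster exposes partitions named g4dn8xl-ht and g4dn8xl; if yours are named differently, adjust the -p arguments accordingly. A quick check before submitting:

sinfo -p g4dn8xl-ht,g4dn8xl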

With Hyper Threading

sbatch -N2 -p g4dn8xl-ht --job-name=gromacs-sarus-g4dn-ht-cuda-mpich-2x16 ~/slurm/gromacs-sarus-g4dn-cuda-mpich-1x16.sbatch
sbatch -N2 -p g4dn8xl-ht --job-name=gromacs-sarus-g4dn-ht-cuda-mpich-2x16 ~/slurm/gromacs-sarus-g4dn-cuda-mpich-1x16.sbatch
sbatch -N2 -p g4dn8xl-ht --job-name=gromacs-sarus-g4dn-ht-cuda-mpich-2x32 ~/slurm/gromacs-sarus-g4dn-cuda-mpich-1x32.sbatch
sbatch -N2 -p g4dn8xl-ht --job-name=gromacs-sarus-g4dn-ht-cuda-mpich-2x32 ~/slurm/gromacs-sarus-g4dn-cuda-mpich-1x32.sbatch
sbatch -N4 -p g4dn8xl-ht --job-name=gromacs-sarus-g4dn-ht-cuda-mpich-4x16 ~/slurm/gromacs-sarus-g4dn-cuda-mpich-1x16.sbatch
sbatch -N4 -p g4dn8xl-ht --job-name=gromacs-sarus-g4dn-ht-cuda-mpich-4x16 ~/slurm/gromacs-sarus-g4dn-cuda-mpich-1x16.sbatch
sbatch -N4 -p g4dn8xl-ht --job-name=gromacs-sarus-g4dn-ht-cuda-mpich-4x32 ~/slurm/gromacs-sarus-g4dn-cuda-mpich-1x32.sbatch
sbatch -N4 -p g4dn8xl-ht --job-name=gromacs-sarus-g4dn-ht-cuda-mpich-4x32 ~/slurm/gromacs-sarus-g4dn-cuda-mpich-1x32.sbatch

Without Hyper Threading

sbatch -N2 -p g4dn8xl --job-name=gromacs-sarus-g4dn-cuda-mpich-2x16 ~/slurm/gromacs-sarus-g4dn-cuda-mpich-1x16.sbatch
sbatch -N2 -p g4dn8xl --job-name=gromacs-sarus-g4dn-cuda-mpich-2x16 ~/slurm/gromacs-sarus-g4dn-cuda-mpich-1x16.sbatch
sbatch -N4 -p g4dn8xl --job-name=gromacs-sarus-g4dn-cuda-mpich-4x16 ~/slurm/gromacs-sarus-g4dn-cuda-mpich-1x16.sbatch
sbatch -N4 -p g4dn8xl --job-name=gromacs-sarus-g4dn-cuda-mpich-4x16 ~/slurm/gromacs-sarus-g4dn-cuda-mpich-1x16.sbatch
squeue

Afterwards you will see a number of jobs in the queue. Grab a coffee - these runs will take a couple of minutes to finish.
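
If you prefer not to re-run squeue by hand, a simple watch loop will do (a convenience snippet, not part of the workshop):

watch -n 30 squeue -u $USER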

Results

After those runs are done, we grep the logs for the performance results:

grep -B2 Performance /fsx/logs/gromacs-s*
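
If you only care about the throughput numbers, a small awk pass over the same logs condenses the output to one line per log file; GROMACS prints ns/day in the first numeric column of the Performance line:

grep Performance /fsx/logs/gromacs-s* | awk '{print $1, $2, "ns/day"}'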

This extends the results table started in the gromacs-on-pcluster workshop with the rank/thread decomposition runs.

Single-Node

|  # | scheduler | execution | spec                       | # * instance | Ranks x Threads | ns/day |
|----|-----------|-----------|----------------------------|--------------|-----------------|--------|
|  1 | slurm     | native    | gromacs@2021.1             | 1 * c5n.18xl | 18x4            | 4.7    |
|  2 | slurm     | native    | gromacs@2021.1             | 1 * c5n.18xl | 36x2            | 5.3    |
|  3 | slurm     | native    | gromacs@2021.1             | 1 * c5n.18xl | 72x1            | 5.5    |
|  4 | slurm     | native    | gromacs@2021.1 ^intel-mkl  | 1 * c5n.18xl | 36x2            | 5.4    |
|  5 | slurm     | native    | gromacs@2021.1 ^intel-mkl  | 1 * c5n.18xl | 72x1            | 5.5    |
|  6 | slurm     | native    | gromacs@2021.1 ~mpi        | 1 * c5n.18xl | 36x2            | 5.5    |
|  7 | slurm     | native    | gromacs@2021.1 ~mpi        | 1 * c5n.18xl | 72x1            | 5.7    |
|  8 | slurm     | native    | gromacs@2021.1 +cuda ~mpi  | 1 * g4dn.8xl | 1x32            | 6.3    |
|  9 | slurm     | sarus     | gromacs@2021.1 ~mpi        | 1 * c5n.18xl | 36x2            | 5.45   |
| 10 | slurm     | sarus     | gromacs@2021.1 ~mpi        | 1 * c5n.18xl | 72x1            | 5.65   |

Multi-Node CPU

|  # | scheduler | execution | spec                   | # * instance   | Ranks x Threads | ns/day |
|----|-----------|-----------|------------------------|----------------|-----------------|--------|
| 11 | slurm     | sarus     | gromacs@2021.1 ^mpich  |  2 * c5n.18xl  | 36x4            | 8.8    |
| 12 | slurm     | sarus     | gromacs@2021.1 ^mpich  |  2 * c5n.18xl  | 72x2            | 9.0    |
| 13 | slurm     | sarus     | gromacs@2021.1 ^mpich  |  2 * c5n.18xl  | 144x1           | 9.65   |
| 14 | slurm     | sarus     | gromacs@2021.1 ^mpich  |  4 * c5n.18xl  | 72x4            | 15.3   |
| 15 | slurm     | sarus     | gromacs@2021.1 ^mpich  |  4 * c5n.18xl  | 144x2           | 16.1   |
| 16 | slurm     | sarus     | gromacs@2021.1 ^mpich  |  4 * c5n.18xl  | 288x1           | 16.85  |
| 17 | slurm     | sarus     | gromacs@2021.1 ^mpich  |  8 * c5n.18xl  | 144x4           | 25.8   |
| 18 | slurm     | sarus     | gromacs@2021.1 ^mpich  |  8 * c5n.18xl  | 288x2           | 28     |
| 19 | slurm     | sarus     | gromacs@2021.1 ^mpich  |  8 * c5n.18xl  | 576x1           | 29     |
| 20 | slurm     | sarus     | gromacs@2021.1 ^mpich  | 16 * c5n.18xl  | 288x4           | 41.3   |
| 21 | slurm     | sarus     | gromacs@2021.1 ^mpich  | 16 * c5n.18xl  | 576x2           | 44.5   |
| 23 | slurm     | sarus     | gromacs@2021.1 ^mpich  | 32 * c5n.18xl  | 576x4           | 63.3   |

Multi-Node GPU

|  # | scheduler | execution | spec                         | # * instance | HT  | Ranks x Threads | ns/day |
|----|-----------|-----------|------------------------------|--------------|-----|-----------------|--------|
| 24 | slurm     | sarus     | gromacs@2021.1 +cuda ^mpich  | 2 * g4dn.8xl | off | 2x16            | 7.6    |
| 25 | slurm     | sarus     | gromacs@2021.1 +cuda ^mpich  | 2 * g4dn.8xl | on  | 2x16            | 7.5    |
| 26 | slurm     | sarus     | gromacs@2021.1 +cuda ^mpich  | 2 * g4dn.8xl | on  | 2x32            | 7.3    |
| 27 | slurm     | sarus     | gromacs@2021.1 +cuda ^mpich  | 4 * g4dn.8xl | off | 4x16            | 12.0   |
| 28 | slurm     | sarus     | gromacs@2021.1 +cuda ^mpich  | 4 * g4dn.8xl | on  | 4x16            | 12.1   |
| 29 | slurm     | sarus     | gromacs@2021.1 +cuda ^mpich  | 4 * g4dn.8xl | on  | 4x32            | 11     |

Please note that the containerized runs yield essentially the same performance as the native runs.