G4dn

In case you run this workshop in an AWS-provided environment (Event Engine), please ask your facilitator whether the vCPU limits for G instances in your account are high enough to run these benchmarks.
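
If you want to check the limit yourself, a quick sketch with the AWS CLI looks like this. The quota code for "Running On-Demand G and VT instances" is an assumption here; please verify it in the Service Quotas console for your region.

# Query the vCPU quota for G instances (quota code assumed, verify before relying on it)
aws service-quotas get-service-quota \
    --service-code ec2 \
    --quota-code L-DB2E81BA \
    --query 'Quota.Value'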

Community Images

To run this benchmark we fetch community images from gallery.ecr.aws/hpc.

We’ll use Thread-MPI images with baked-in settings for how many OpenMP threads should be spawned.

Two images with different tags are available; we will use the following one:

  1. g4dn_8xl_on: built for a g4dn.8xlarge with hyperthreading on. This image uses 1 OpenMP thread per rank.

sarus pull public.ecr.aws/hpc/spack/gromacs/2021.1/cuda-tmpi:g4dn_8xl_on_2021-04-29
SARUS_G4DN_SN_IMG=public.ecr.aws/hpc/spack/gromacs/2021.1/cuda-tmpi:g4dn_8xl_on_2021-04-29
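
For orientation, the baked-in settings roughly correspond to an mdrun invocation like the sketch below. This is purely illustrative; the real entrypoint and flags are fixed inside the image, and the rank count shown is an assumption based on the 32-vCPU g4dn.8xlarge.

# Illustrative only -- the actual command is baked into the container image.
# Thread-MPI ranks and OpenMP threads per rank are fixed at build time.
NTMPI=32   # assumed thread-MPI rank count for the g4dn_8xl_on tag
NTOMP=1    # OpenMP threads per rank, as stated above
gmx mdrun -ntmpi ${NTMPI} -ntomp ${NTOMP} -s ${INPUT}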

Custom Build Image

In case you built the image in a previous section, please paste your image name (according to sarus images).

read -p "paste your image name (according to 'sarus images'): " SARUS_G4DN_SN_IMG
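
Either way, a quick sanity check confirms that the variable points at an image Sarus actually knows about:

echo ${SARUS_G4DN_SN_IMG}
sarus images | grep gromacs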

32 Ranks

Our first job is equivalent to 32 ranks with 1 OpenMP thread each, even though Thread-MPI uses threads within a single process to mimic MPI ranks.

cat > gromacs-sarus-g4dn-cuda-tmpi-1x0.sbatch << EOF
#!/bin/bash
#SBATCH --job-name=gromacs-sarus-g4dn-cuda-tmpi-1x0
#SBATCH --exclusive
#SBATCH --output=/fsx/logs/%x_%j.out

# per-job working directory on the shared FSx file system
mkdir -p /fsx/jobs/\${SLURM_JOBID}
export INPUT=/fsx/input/gromacs/benchRIB.tpr
# make the GPU and driver capabilities available inside the container
export CUDA_VISIBLE_DEVICES=all
export NVIDIA_DRIVER_CAPABILITIES=all

# bind-mount /fsx so the input file and job directory are reachable inside the container
# (drop the --mount flag if your Sarus configuration already mounts /fsx)
sarus run --mount=type=bind,source=/fsx,destination=/fsx --workdir=/fsx/jobs/\${SLURM_JOBID} ${SARUS_G4DN_SN_IMG}
EOF
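
Since the heredoc expands ${SARUS_G4DN_SN_IMG} at creation time, it is worth confirming that the image name actually ended up in the script before submitting:

# The image name should appear at the end of the sarus run line;
# if it is missing, SARUS_G4DN_SN_IMG was not set when the script was created.
grep 'sarus run' gromacs-sarus-g4dn-cuda-tmpi-1x0.sbatch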

Let’s submit a job with a dependency on the sleep-inf job we started earlier to block the node. Afterwards, we add another sleep-inf job to the queue to make sure the node stays up and is not scaled down.

sbatch $(genSlurmDep g4dn2xl "sleep") -N1 -p g4dn2xl gromacs-sarus-g4dn-cuda-tmpi-1x0.sbatch
sbatch $(genSlurmDep g4dn2xl gromacs-sarus-g4dn) -N1 -p g4dn2xl sleep.sbatch
squeue -p g4dn2xl
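
genSlurmDep is the small helper presumably defined in an earlier section. If it is not available in your current shell, you can build the dependency by hand; a rough sketch, assuming the blocking job is named sleep and a simple after dependency is good enough:

# Manual stand-in for genSlurmDep: look up the sleep job's ID in the partition
# and chain the benchmark job after it.
SLEEP_JOBID=$(squeue -h -p g4dn2xl -n sleep -o %i | head -n1)
sbatch --dependency=after:${SLEEP_JOBID} -N1 -p g4dn2xl gromacs-sarus-g4dn-cuda-tmpi-1x0.sbatch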

Results

After those runs are done, we grep the performance results.

grep -B2 Performance /fsx/logs/gromacs-sarus-g4dn-cuda-tmpi-*
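
To pull out just the numbers, the ns/day value is the first figure on the Performance line of each log (assuming the usual mdrun log format):

# Print each log file together with its ns/day value
# (field 2, because grep -H fuses the file name and "Performance:" into field 1).
grep -H Performance /fsx/logs/gromacs-sarus-g4dn-cuda-tmpi-* | awk '{print $1, $2}'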

This extends the table of decomposition results started in the gromacs-on-pcluster workshop.

 #  execution  spec                                            instance  Ranks x Threads  ns/day
 1  native     gromacs@2021.1                                  c5n.18xl  18 x 4           4.7
 2  native     gromacs@2021.1                                  c5n.18xl  36 x 2           5.3
 3  native     gromacs@2021.1                                  c5n.18xl  72 x 1           5.5
 4  native     gromacs@2021.1 ^intel-mkl                       c5n.18xl  36 x 2           5.4
 5  native     gromacs@2021.1 ^intel-mkl                       c5n.18xl  72 x 1           5.5
 6  native     gromacs@2021.1 ~mpi                             c5n.18xl  36 x 2           5.5
 7  native     gromacs@2021.1 ~mpi                             c5n.18xl  72 x 1           5.7
 8  native     gromacs@2021.1 +cuda ~mpi                       g4dn.8xl  1 x 32           6.3
 9  sarus      gromacs@2021.1 ~mpi                             c5n.18xl  36 x 2           5.5
10  sarus      gromacs@2021.1 ~mpi                             c5n.18xl  72 x 1           5.7
11  sarus      gromacs@2021.1 +cuda ~mpi fftw precision=float  g4dn.8xl  1 x 32           6.3