Submit Job

Now it is time for the fun part: submitting a job!


First we create a job script.

Scaling Test

cd /fsx/jobs
export I_MPI_DEBUG=2
for N in 32 64 96; do
cat > strong_scaling_test_0${N}.sbatch << EOF
#!/bin/bash
#SBATCH --job-name=strong_scaling_test_0${N}
#SBATCH --ntasks=${N}
#SBATCH --ntasks-per-node=32
#SBATCH --output=/fsx/log/%x_%j.out
#SBATCH --exclusive

source /fsx/fds-smv/bin/
source /fsx/fds-smv/bin/

module load intelmpi

export I_MPI_PIN_DOMAIN=omp

mkdir -p /fsx/results/\${SLURM_JOB_NAME}_\${SLURM_JOBID}
cd /fsx/results/\${SLURM_JOB_NAME}_\${SLURM_JOBID}
cat /fsx/input/fds/strong_scaling_test_0${N}.fds \
   | sed -e 's/T_END=0.2/T_END=1.0/' > strong_scaling_test_0${N}.fds
time mpirun -genv I_MPI_DEBUG ${I_MPI_DEBUG} -ppn 32 -np ${N} fds strong_scaling_test_0${N}.fds
EOF
done
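One detail in the loop above deserves a closer look: because the heredoc delimiter `EOF` is unquoted, `${N}` is expanded while the script is being generated, whereas the escaped `\${SLURM_JOB_NAME}` is written out literally so that Slurm can expand it at run time. A minimal, self-contained sketch of that behavior (using a throwaway `/tmp/demo.sbatch`, not one of the real job scripts):

```shell
# Unquoted EOF: ${N} expands now; \${SLURM_JOB_NAME} stays literal for Slurm.
N=32
cat > /tmp/demo.sbatch << EOF
#SBATCH --ntasks=${N}
cd /fsx/results/\${SLURM_JOB_NAME}
EOF
cat /tmp/demo.sbatch
```

The generated file contains the concrete value `--ntasks=32` but a literal `${SLURM_JOB_NAME}`, which is exactly what the batch scripts rely on.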

Each job should take around 10 minutes. Let us submit the three jobs.

for N in 32 64 96; do
   sbatch strong_scaling_test_0${N}.sbatch
done
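On success, `sbatch` prints a line of the form `Submitted batch job <id>`. If you want to monitor a specific job with `squeue -j` or chain jobs with `--dependency=afterok:<id>`, you can strip the ID out of that line; a small sketch (the `out` variable stands in for the captured `sbatch` output):

```shell
# Capture the job ID from sbatch's "Submitted batch job <id>" output.
out="Submitted batch job 14"   # stand-in for: out=$(sbatch strong_scaling_test_032.sbatch)
jobid="${out##* }"             # strip everything up to the last space
echo "jobid=${jobid}"          # prints: jobid=14
```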

Job Results

These jobs will run on one, two and three nodes.

In the screenshot below, the cluster is configured to spin up enough nodes for the jobs to run concurrently.

Debug Scaling

If you are running this with an account that hits scaling limits, you can inspect the situation in the Auto Scaling Group (ASG - deep link).

As we can see in the overview, the ASG has a capacity of 7, even though the desired capacity is 10. Click on the ASG and head to the Activity tab for more details.

It shows that we hit our limit of 512 vCPUs, so the ASG cannot provision more capacity. Time to open a ticket and raise the limit. :)
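The arithmetic behind the stalled scale-out is easy to check. Assuming a 72-vCPU instance type (for example c5n.18xlarge; the workshop does not state the type, so this is an assumption), 7 instances fit under a 512-vCPU limit while 8 would not:

```shell
# Assumed 72-vCPU instances against a 512-vCPU account limit:
# 7 nodes fit (504), an 8th would exceed it (576).
awk 'BEGIN {
    vcpus = 72; limit = 512
    printf "7 nodes: %d vCPUs, 8 nodes: %d vCPUs (limit %d)\n", 7*vcpus, 8*vcpus, limit
}'
```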

Job Wallclock

The wallclock times are quite different: the 3-node job finishes in about 3 minutes.

$ egrep '(^real|MPI Processes)' strong_scaling_test_096_*                                              
strong_scaling_test_096_14.out: MPI Enabled;    Number of MPI Processes:      96
strong_scaling_test_096_14.out:real     2m43.704s

The others take a while longer.
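From the `real` times you can derive the usual strong-scaling metrics, speedup and parallel efficiency. A small sketch: only the 96-rank time (2m43.7s, roughly 164 s) comes from the log above; the single-node baseline of 480 s is an assumed placeholder, so substitute your own measured value:

```shell
t1=480   # ASSUMED single-node baseline in seconds; replace with your measurement
t3=164   # 3-node wallclock from the log above, rounded
awk -v t1="$t1" -v tn="$t3" -v n=3 'BEGIN {
    s = t1 / tn                   # speedup over the baseline
    printf "speedup=%.2f efficiency=%.2f\n", s, s / n
}'
```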