GPU Computing

1. Logging onto the cluster

To use the GPU machines, you first need to log in to the cluster. A good introduction to logging in is available here: https://www.alessandravalcarcel.com/blog/2019-04-23-ssh/

lpcgpu01, the host for the GPU machines, is accessible via the takim server. You can log in to takim by typing:

ssh -X <pennkey>@takim.pmacs.upenn.edu

*** Don’t read this if you are not familiar with GPU computing ***

For advanced users looking for additional GPU power, we have six additional GPU cores that are available exclusively to us. takim2 is a submit host (not an executable host) that can be used for GPU computing. You can access it by typing:

ssh -X <pennkey>@takim2.pmacs.upenn.edu

ssh -X <pennkey>@takim2

Note that takim2 is a submit host, not an executable host. You can use it directly for an interactive session, but you can’t submit a normal job to takim2.

2. Interactive Session Basics

If you intend to use an interactive session, consider using screen so that you don’t lose your work if the connection drops. You can open a screen by typing

screen -S <Screen-name>

and then start your work. Details on how to use screen are at: https://www.alessandravalcarcel.com/blog/2019-06-12-interactivesession1/

Once you are on the executable host, you can open an interactive session by typing:

bsub -Is -q lpcgpu -gpu "num=1" -n 1 "bash"

Make sure you request a GPU. The -gpu "num=1" option sets the number of GPUs, whereas -n 1 sets the number of CPU cores.

Next, load torch and tensorflow to activate CUDA:

module load torch
module load tensorflow/2.3-GPU

To check whether CUDA is working, start Python:

python

Then, in Python, run:

import torch
torch.cuda.is_available()
torch.cuda.device_count()
torch.cuda.current_device()
torch.cuda.device(0)
torch.cuda.get_device_name(0)

The output should look like this (the object address on the torch.cuda.device line will differ):

torch.cuda.is_available() : True
torch.cuda.device_count() : 1
torch.cuda.current_device() : 0
torch.cuda.device(0) : <torch.cuda.device object at 0x2ae68db7cb20>
torch.cuda.get_device_name(0) : 'NVIDIA GeForce RTX 2080 Ti'

Now, you are ready to use the GPU!
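
The checks above can also be collected into a small script, which may be handy at the start of a job. This is only a minimal sketch: the function name cuda_summary is hypothetical (not part of torch), and it assumes nothing beyond the torch.cuda calls already shown. If torch cannot be imported, it reports that instead of crashing.

```python
def cuda_summary():
    """Return a dict describing what PyTorch can see of the GPU.

    cuda_summary is a hypothetical helper name, not a torch function.
    """
    try:
        import torch
    except ImportError:
        # torch is not importable, e.g. `module load torch` was skipped.
        return {"torch_installed": False, "cuda_available": False, "devices": []}

    available = torch.cuda.is_available()
    count = torch.cuda.device_count() if available else 0
    devices = [torch.cuda.get_device_name(i) for i in range(count)]
    return {"torch_installed": True, "cuda_available": available, "devices": devices}


if __name__ == "__main__":
    info = cuda_summary()
    if not info["torch_installed"]:
        print("PyTorch is not importable; did you run `module load torch`?")
    elif not info["cuda_available"]:
        print("PyTorch loaded, but no CUDA device is visible.")
    else:
        for index, name in enumerate(info["devices"]):
            print(f"GPU {index}: {name}")
```

Inside an interactive session that requested one GPU, this should list a single device, matching the responses shown above.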

3. Normal Job Sessions

Once you are on the executable host, you can submit a normal job, usually with a bash file:

bsub -q lpcgpu -gpu "num=1" -n 1 -J "orig[1-3]" -o <where to save your log file> <location of your bash file>

For example, I save my log file at /home/ecbae/nnUNet.txt and my bash file at /home/ecbae/orig.sh. Note that the job array index is [1-3], so three jobs are submitted, each requesting 1 GPU and 1 CPU core, for a total of 3 GPUs and 3 CPU cores.
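
To make the resource arithmetic explicit, here is a minimal Python sketch that assembles the same bsub job-array command and totals what the array asks for. The helper name build_array_submission and its parameters are hypothetical and exist only for illustration; the bsub flags are the ones used above.

```python
import shlex


def build_array_submission(job_name, first, last, log_path, script_path,
                           queue="lpcgpu", gpus_per_job=1, cpus_per_job=1):
    """Assemble a bsub job-array command and the total resources it requests.

    Hypothetical helper for illustration; it only builds a string and
    does not submit anything.
    """
    n_jobs = last - first + 1  # an index range [first-last] is inclusive
    args = [
        "bsub",
        "-q", queue,
        "-gpu", f"num={gpus_per_job}",   # GPUs per array element
        "-n", str(cpus_per_job),         # CPU cores per array element
        "-J", f"{job_name}[{first}-{last}]",
        "-o", log_path,
        script_path,
    ]
    totals = {
        "jobs": n_jobs,
        "gpus": n_jobs * gpus_per_job,
        "cpus": n_jobs * cpus_per_job,
    }
    # shlex.join quotes arguments (such as the bracketed job name) for the shell.
    return shlex.join(args), totals


command, totals = build_array_submission(
    "orig", 1, 3, "/home/ecbae/nnUNet.txt", "/home/ecbae/orig.sh"
)
print(command)
print(totals)
```

Printing the totals before submitting is a quick way to check that an array will not ask for more GPUs than intended.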

My bash file looks like this:

# Load torch and tensorflow so that CUDA is available.
module load torch
module load tensorflow/2.3-GPU

# Job indices 1-3 map to nnUNet task IDs 150-152.
nnUNet_plan_and_preprocess -t $(( $LSB_JOBINDEX +149 ))
# Train folds 0 through 4, one after another.
nnUNet_train 2d nnUNetTrainerV2 $(( $LSB_JOBINDEX +149 )) 0 --npz
nnUNet_train 2d nnUNetTrainerV2 $(( $LSB_JOBINDEX +149 )) 1 --npz
nnUNet_train 2d nnUNetTrainerV2 $(( $LSB_JOBINDEX +149 )) 2 --npz
nnUNet_train 2d nnUNetTrainerV2 $(( $LSB_JOBINDEX +149 )) 3 --npz
nnUNet_train 2d nnUNetTrainerV2 $(( $LSB_JOBINDEX +149 )) 4 --npz
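
To see what each array element will run before submitting, the mapping from job index to task ID can be sketched in Python as a dry run. The helper name commands_for_index is hypothetical, and the default index of 1 outside LSF is an assumption made so the script can be tried on its own. It only prints the commands and does not call nnUNet.

```python
import os


def commands_for_index(job_index, offset=149, folds=range(5)):
    """List the commands the bash file runs for one array index.

    Hypothetical dry-run helper: it returns strings, nothing is executed.
    """
    task_id = job_index + offset  # same arithmetic as $(( $LSB_JOBINDEX +149 ))
    commands = [f"nnUNet_plan_and_preprocess -t {task_id}"]
    commands += [
        f"nnUNet_train 2d nnUNetTrainerV2 {task_id} {fold} --npz"
        for fold in folds
    ]
    return commands


if __name__ == "__main__":
    # LSF sets LSB_JOBINDEX inside each array element; fall back to 1
    # (an assumed default) so the dry run also works outside a job.
    index = int(os.environ.get("LSB_JOBINDEX", "1"))
    for line in commands_for_index(index):
        print(line)
```

With the index range [1-3] used above, the three array elements work on tasks 150, 151 and 152 respectively.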

Note that the lpcgpu01 host has 10 GPU cores. If you need more, you may want to use the takim2 host, discussed in the first section.

Also, note that I am running nnUNet, a Python package that is already installed on the cluster. To have a package installed, submit a ticket to PMACS or send an email to Martin Das.

4. Concluding remarks

We need more GPUs!