Running a job on epic

Check out basic info about logging in to epic and running jobs (with the queue system Slurm): https://www.hpc.ntnu.no/display/hpc/Getting+Started

Log in to epic and load the required modules

Log in with something like (on Mac/Linux): ssh epic.hpc.ntnu.no.

Modules needed to run on CPUs

Load the modules by typing the following (once you are at the login node):

  • module purge
  • module load GCC/7.3.0-2.30
  • module load OpenMPI/3.1.1
  • module load R/3.5.1
  • module load TensorFlow/1.12.0-Python-3.6.6

Starting R and installing keras

Then start R by typing R, install keras in R with

install.packages("keras")

and quit R by typing q().

NB: do not run library(keras); install_keras(). This is not needed and will most likely only produce an error.
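As a quick sanity check (my own suggestion, not part of the official guide), you can verify from the login node that the keras package loads against the TensorFlow module. The printed message is just illustrative text I chose:

```shell
# Run after loading the modules above; checks that library(keras) succeeds
Rscript -e 'library(keras); cat("keras loaded OK\n")'
```

If this prints an error instead, re-check that all the modules listed above are loaded.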

Test script for running on CPUs

Make a file called MNISTexSlurm.sh by copying the commands below, or copy my file on epic from /home/mettela/MA8701MNISTexSlurm.sh (if you have access). The file must be executable (chmod u+x MNISTexSlurm.sh).

#!/bin/sh
#SBATCH --partition=WORKQ
#SBATCH --time=00:15:00
#SBATCH --mem=8000
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --job-name="MNISTex"

module purge
module load GCC/7.3.0-2.30
module load OpenMPI/3.1.1
module load R/3.5.1
module load TensorFlow/1.12.0-Python-3.6.6

/usr/bin/time -v Rscript --vanilla MNISTex.R

Here we have set a maximum running time of 15 minutes and asked for 8 GB of RAM (remember to change these if you run a larger example).

The MNISTex job should take around 2.5 minutes and use about 2 GB of RAM.

Observe that in the Slurm script we only ask for one CPU (ntasks-per-node=1). Keras is parallelised and can run on many CPUs, but for this small example running with 20 CPUs (ntasks-per-node=20) will take longer than running on one CPU: the example does not really require any computing power, and the set-up time dominates. For your larger jobs, running with more CPUs is smart.
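For a larger job, the header of the Slurm script above could be changed along these lines. This is a sketch; the specific time, memory, and CPU numbers are placeholders you must adapt to your own job:

```shell
#!/bin/sh
#SBATCH --partition=WORKQ
#SBATCH --time=02:00:00          # longer wall time (placeholder value)
#SBATCH --mem=32000              # more memory, in MB (placeholder value)
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=20     # let Keras use 20 CPUs on the node
#SBATCH --job-name="mybigjob"
```

The module load lines and the Rscript line stay the same as in the script above.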

Test R script

Make a file called MNISTex.R by copying the commands below, or copy my file on epic from /home/mettela/MA8701MNISTex.R (if you have access). As you can see, this is a simple example: a network with one dense hidden layer, applied to a 10-class problem. It is partially taken from the R keras page at RStudio: https://keras.rstudio.com/

library(keras)
mnist <- dataset_mnist()
# Training data
train_images <- mnist$train$x
train_labels <- mnist$train$y

# Test data
test_images <- mnist$test$x
test_labels <- mnist$test$y

network <- keras_model_sequential() %>%
  layer_dense(units = 512, activation = "relu", input_shape = c(28*28)) %>%
  layer_dense(units = 10, activation = "softmax")

network %>% compile(
  optimizer = "rmsprop",
  loss = "categorical_crossentropy",
  metrics = c("accuracy")
)

train_images <- array_reshape(train_images, c(60000, 28 * 28))
train_images <- train_images / 255
train_labels <- to_categorical(train_labels)

test_images <- array_reshape(test_images, c(10000, 28 * 28))
test_images <- test_images / 255
test_labels <- to_categorical(test_labels)

fitted <- network %>% fit(train_images, train_labels,
                         epochs = 30, batch_size = 128,
                         validation_split = 0.2
)

network %>% evaluate(test_images,test_labels)
network %>% predict_classes(test_images)

Run on epic

Now run the script with slurm, and look at the output.

sbatch MNISTexSlurm.sh

You are given a job id; I got Submitted batch job 248476, and by typing more slurm-248476.out I could see the loss and accuracy:

$loss
[1] 0.1165949

$acc
[1] 0.982
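While the job is queued or running you can monitor it with standard Slurm commands (these are generic Slurm commands, not specific to epic; the job id 248476 is just the one from my run above):

```shell
# List your own queued and running jobs
squeue -u $USER

# After the job has finished: elapsed time, peak memory, and final state
sacct -j 248476 --format=JobID,JobName,Elapsed,MaxRSS,State
```

The MaxRSS column is a convenient way to check how much memory to request with --mem next time.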