Compiling Keras models
Now that we have a Keras model defined, we need to configure how this model will be trained.
Fully connected model for MNIST
Fully connected model for IMDB
Fully connected model for the Boston Housing dataset
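Each of the models above is compiled in the same way. A minimal sketch using the R interface to Keras (the layer sizes and input shape are illustrative, not the exact course models):

```r
library(keras)

# A fully connected model for MNIST-style input (sizes are illustrative)
model <- keras_model_sequential() %>%
  layer_dense(units = 512, activation = "relu", input_shape = c(28 * 28)) %>%
  layer_dense(units = 10, activation = "softmax")

# compile() configures how the model will be trained:
# the optimizer, the loss to minimize, and the metrics to monitor
model %>% compile(
  optimizer = "rmsprop",
  loss = "categorical_crossentropy",
  metrics = c("accuracy")
)
```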
Gradient-based optimization
Gradient descent
- A first-order iterative optimization algorithm for finding the minimum of a function f(x).
- Update rule: x = x - step * gradient.
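A minimal sketch of the update rule in plain R (the function and hyperparameters are a toy example, not from the book):

```r
# Gradient descent on f(x) = (x - 3)^2, whose gradient is 2 * (x - 3)
grad <- function(x) 2 * (x - 3)
x <- 0        # starting point
step <- 0.1   # learning rate
for (i in 1:100) {
  x <- x - step * grad(x)   # x = x - step * gradient
}
x  # converges towards the minimum at x = 3
```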
Mini-batch stochastic gradient descent (SGD) applied to a Deep Learning model (see the sketch after this list):
1. Draw a batch of training samples x and corresponding targets y.
2. Run the model on x to obtain predictions y' (forward pass).
3. Compute the loss of the model on the batch, a measure of the mismatch between y' and y.
4. Compute the gradient of the loss with regard to the model's parameters (backward pass).
5. Update the parameters: W = W - (step * gradient).
- Repeat steps 1-5 until convergence.
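The numbered steps map directly onto a plain-R loop. A self-contained toy sketch (a linear model; data and hyperparameters are assumed for illustration):

```r
# Mini-batch SGD by hand for a toy linear model y = 2x + 1
set.seed(1)
n <- 1000
x <- rnorm(n)
y <- 2 * x + 1 + rnorm(n, sd = 0.1)

w <- 0; b <- 0      # parameters to learn
step <- 0.05        # learning rate
batch_size <- 32

for (epoch in 1:20) {
  idx <- sample(n)                                  # random order: the "stochastic" part
  for (start in seq(1, n, by = batch_size)) {
    batch <- idx[start:min(start + batch_size - 1, n)]   # step 1: draw a batch
    xb <- x[batch]; yb <- y[batch]
    y_hat <- w * xb + b                             # step 2: forward pass
    err <- y_hat - yb                               # step 3: from the MSE loss (up to a factor of 2)
    grad_w <- mean(err * xb)                        # step 4: backward pass, done by hand
    grad_b <- mean(err)
    w <- w - step * grad_w                          # step 5: W = W - (step * gradient)
    b <- b - step * grad_b
  }
}
c(w, b)  # should approach c(2, 1)
```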
The algorithm defined above is called mini-batch SGD. The stochastic part comes from the fact that we randomly sample batches x from the training data. Strict stochastic gradient descent happens when the batch size equals 1.
Mini-batch SGD is a compromise between SGD (one sample per iteration) and full GD (the full dataset per iteration).
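In Keras the batch size is chosen when fitting the model, not when compiling it. A sketch, assuming x_train and y_train already exist:

```r
# batch_size selects the flavour of gradient descent:
# 1 = strict SGD, nrow(x_train) = full-batch GD, anything in between = mini-batch SGD
model %>% fit(x_train, y_train, epochs = 10, batch_size = 128)
```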
Backpropagation algorithm:
- DL models take advantage of the fact that all operations used in the model are differentiable.
- Combining the information above with the chain rule of differentiation leads to the Backpropagation algorithm.
- Backpropagation starts with the final loss value and works backward from the top layers to the bottom layers, applying the chain rule to compute the contribution that each parameter had in the loss value.
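Schematically (the notation here is assumed for illustration): for a two-layer model with hidden activations h = f(W_1 x), predictions y' = g(W_2 h), and loss L, the chain rule gives

\[
\frac{\partial L}{\partial W_2} = \frac{\partial L}{\partial y'} \frac{\partial y'}{\partial W_2},
\qquad
\frac{\partial L}{\partial W_1} = \frac{\partial L}{\partial y'} \frac{\partial y'}{\partial h} \frac{\partial h}{\partial W_1}
\]

The factor \(\partial L / \partial y'\) computed for the top layer is reused when moving down to W_1, which is why the pass runs backward from the loss.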
Variations of SGD
- There exist many variations of SGD (e.g., SGD with momentum, Adagrad, Adam).
- All our examples use rmsprop, which is claimed to be a good default choice.
RMSprop
- Divide the gradient by a running average of its recent magnitude (the root mean square of recent gradients):
ms = 0.9 * ms + 0.1 * gradient^2
W = W - (step * gradient / (sqrt(ms) + epsilon))
where ms tracks the mean of the squared gradient and epsilon is a small constant for numerical stability.
- Intuition will be given in the lecture.
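To tune the optimizer, it can be passed to compile() as an object rather than a string. A sketch (the `lr` argument name follows the book-era keras R API; newer releases call it `learning_rate`):

```r
# Same configuration as optimizer = "rmsprop", but with an explicit learning rate
model %>% compile(
  optimizer = optimizer_rmsprop(lr = 0.001),
  loss = "binary_crossentropy",
  metrics = c("accuracy")
)
```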
Further reading:
Loss functions
Loss function or objective function:
- The quantity that will be minimized during training.
- It represents a measure of success for the task at hand.
Common problem types and loss functions:
Problem Type | Last-layer activation | Loss function
---|---|---
Binary classification | sigmoid | binary_crossentropy
Multiclass classification | softmax | categorical_crossentropy
Regression | None | mse
Binary cross-entropy (for sample i, y_i is the true label, 0 or 1, and p_{i1} the predicted probability of class 1):
\[- y_i\log(p_{i1}) - (1-y_i) \log(1 - p_{i1})\]
Categorical cross-entropy (for sample i, y_{ij} is the one-hot target and p_{ij} the predicted probability of class j, with C classes):
\[- \sum_{j=1}^C y_{ij} \log(p_{ij})\]
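A hand computation of both losses in plain R, with assumed toy values, to make the formulas concrete:

```r
# Binary cross-entropy averaged over three samples
y <- c(1, 0, 1)              # true labels y_i
p <- c(0.9, 0.2, 0.6)        # predicted probabilities p_i1
mean(-y * log(p) - (1 - y) * log(1 - p))

# Categorical cross-entropy for one sample with C = 3 classes
y_onehot <- c(0, 1, 0)       # one-hot target y_ij
p_soft <- c(0.1, 0.7, 0.2)   # softmax output p_ij
-sum(y_onehot * log(p_soft))
```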
Some observations:
- It is not always possible to directly optimize for the metric that measures success on a problem.
- Loss functions, after all, need to be:
  - computable given only a mini-batch of data (ideally, given only a single data point);
  - differentiable.
Metrics
Metrics provide different ways to measure how well the predictions match the true values.
- accuracy: the average rate of correct classifications.
- mae: the mean absolute error.
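Both metrics are simple averages. A plain-R sketch with assumed toy values:

```r
# accuracy: fraction of predicted classes that match the true classes
y_true <- c(1, 0, 1, 1)
y_pred <- c(1, 0, 0, 1)
mean(y_true == y_pred)       # 0.75

# mae: mean absolute difference between predictions and true values
y_reg <- c(2.5, 0.0, 2.1)
y_hat <- c(3.0, -0.5, 2.0)
mean(abs(y_reg - y_hat))     # ~0.37
```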
Reference material
This lecture note is based on (Chollet and Allaire 2018) and the following material:
References
Chollet, F., and J. Allaire. 2018. Deep Learning with R. Manning Publications. https://books.google.no/books?id=xnIRtAEACAAJ.