The IMDB dataset

The objective here is to classify a movie review as either positive or negative.

Preparing the data

The data has already been preprocessed: the reviews (sequences of words) have been turned into sequences of integers, where each integer stands for a specific word in a dictionary.

  • The argument num_words = 10000 keeps only the top 10,000 most frequently occurring words in the training data.
library(keras)

imdb <- dataset_imdb(num_words = 10000)

train_data <- imdb$train$x
train_labels <- imdb$train$y
test_data <- imdb$test$x
test_labels <- imdb$test$y
  • Each review is a list of word indices.
  • The labels are lists of 0s and 1s, where 0 stands for negative and 1 stands for positive.
  • The first review in the list:
str(train_data[[1]])
##  int [1:218] 1 14 22 16 43 530 973 1622 1385 65 ...
train_labels[[1]]
## [1] 1

Turning sequences of integers back to English

Below is the code to turn a review from a sequence of integers back into English.

word_index <- dataset_imdb_word_index()        # named list mapping words to integer indices
reverse_word_index <- names(word_index)        # build the reverse mapping:
names(reverse_word_index) <- word_index        #   index (as a character string) -> word
decoded_review <- sapply(train_data[[1]], function(index) {
  # Indices 0, 1, and 2 are reserved for "padding", "start of sequence", and
  # "unknown", so the word indices are offset by 3.
  word <- if (index >= 3) reverse_word_index[[as.character(index - 3)]]
  if (!is.null(word)) word else "?"
})
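
To check the decoding, you can collapse the words back into readable text:

cat(paste(decoded_review, collapse = " "))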

Turning sequences of integers into tensors

  • The vectorize_sequences() function below produces a rank-2 tensor of shape (samples, features).
  • Each sample is represented by a feature vector the length of the dictionary, with a 1 at the position of every word that appears in the review and a 0 everywhere else.
vectorize_sequences <- function(sequences, dimension = 10000) {
  # Create an all-zero matrix of shape (number of reviews, dimension)
  results <- matrix(0, nrow = length(sequences), ncol = dimension)
  for (i in seq_along(sequences))
    results[i, sequences[[i]]] <- 1    # set the columns for the words in review i to 1
  results
}

x_train <- vectorize_sequences(train_data)
x_test <- vectorize_sequences(test_data)
y_train <- as.numeric(train_labels)
y_test <- as.numeric(test_labels)
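
As a quick sanity check, the vectorized training data should now have one row per review and one column per dictionary word:

dim(x_train)    # one row per review, 10,000 columns
str(y_train)    # a numeric vector of 0s and 1s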

Model definition

model <- keras_model_sequential() %>%
  layer_dense(units = 16, activation = "relu", input_shape = c(10000)) %>%
  layer_dense(units = 16, activation = "relu") %>%
  layer_dense(units = 1, activation = "sigmoid")
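
To check the layer shapes and parameter counts, you can print a summary of the model:

summary(model)
# The first dense layer alone has 10000 * 16 weights + 16 biases = 160,016 parameters.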

Model compilation

model %>% compile(
  optimizer = "rmsprop",
  loss = "binary_crossentropy",
  metrics = c("accuracy")
)
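
The strings above are shortcuts; if you need to configure the optimizer (for example its learning rate), you can pass optimizer, loss, and metric objects instead. A sketch, noting that the learning-rate argument is named learning_rate in recent keras versions and lr in older ones:

model %>% compile(
  optimizer = optimizer_rmsprop(learning_rate = 0.001),  # use lr = 0.001 on older keras versions
  loss = "binary_crossentropy",
  metrics = c("accuracy")
)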

Validating your approach

Create a validation set by setting apart 10,000 samples from the original training data.

val_indices <- 1:10000

x_val <- x_train[val_indices,]
partial_x_train <- x_train[-val_indices,]
y_val <- y_train[val_indices]
partial_y_train <- y_train[-val_indices]

history <- model %>% fit(
  partial_x_train,
  partial_y_train,
  epochs = 20,
  batch_size = 512,
  validation_data = list(x_val, y_val)
)

Note that the call to fit() returns a history object. Let’s take a look at it:

str(history)
## List of 2
##  $ params :List of 8
##   ..$ metrics           : chr [1:4] "loss" "acc" "val_loss" "val_acc"
##   ..$ epochs            : int 20
##   ..$ steps             : NULL
##   ..$ do_validation     : logi TRUE
##   ..$ samples           : int 15000
##   ..$ batch_size        : int 512
##   ..$ verbose           : int 1
##   ..$ validation_samples: int 10000
##  $ metrics:List of 4
##   ..$ acc     : num [1:20] 0.793 0.903 0.927 0.944 0.953 ...
##   ..$ loss    : num [1:20] 0.51 0.305 0.227 0.178 0.146 ...
##   ..$ val_acc : num [1:20] 0.875 0.855 0.885 0.885 0.889 ...
##   ..$ val_loss: num [1:20] 0.381 0.357 0.296 0.291 0.283 ...
##  - attr(*, "class")= chr "keras_training_history"

The history object includes parameters used to fit the model (history$params) as well as data for each of the metrics being monitored (history$metrics).

plot(history)

  • You can customize all of this behavior via various arguments to the plot() method.
  • We can create custom visualizations by calling the as.data.frame() method on the history to obtain a data frame with factors for each metric as well as for training versus validation data, as in the sketch after the output below:
history_df <- as.data.frame(history)
head(history_df)
##   epoch     value metric     data
## 1     1 0.7932000    acc training
## 2     2 0.9034667    acc training
## 3     3 0.9267333    acc training
## 4     4 0.9436667    acc training
## 5     5 0.9529333    acc training
## 6     6 0.9670000    acc training
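
For example, a custom plot of the loss curves built from this data frame (a sketch; assumes the ggplot2 package is installed):

library(ggplot2)
ggplot(subset(history_df, metric == "loss"),
       aes(x = epoch, y = value, colour = data)) +
  geom_line() +
  labs(y = "loss", colour = NULL)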

This fairly naive approach achieves a validation accuracy of about 88%. With state-of-the-art approaches, you should be able to get close to 95%.
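
You can also evaluate the fitted model on the held-out test set (a sketch; because the model above was trained for the full 20 epochs, its test accuracy will reflect the overfitting visible in the plot):

results <- model %>% evaluate(x_test, y_test)
results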

Predicting on new data

model %>% predict(x_test[1:10,])
##               [,1]
##  [1,] 5.316609e-03
##  [2,] 9.999976e-01
##  [3,] 8.015566e-01
##  [4,] 9.935709e-01
##  [5,] 9.974100e-01
##  [6,] 9.984455e-01
##  [7,] 6.220128e-01
##  [8,] 1.460681e-06
##  [9,] 9.731751e-01
## [10,] 9.999981e-01
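
These values are predicted probabilities that each review is positive; to obtain hard class labels you can threshold at 0.5:

preds <- model %>% predict(x_test[1:10, ])
as.integer(preds > 0.5)    # 1 = positive, 0 = negative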

Fighting overfitting

Reducing the network’s size

Let’s try a smaller network:

history <- keras_model_sequential() %>%
  layer_dense(units = 4, activation = "relu", input_shape = c(10000)) %>%
  layer_dense(units = 4, activation = "relu") %>%
  layer_dense(units = 1, activation = "sigmoid") %>%
  compile(optimizer = "rmsprop",
    loss = "binary_crossentropy",
    metrics = c("accuracy")) %>%
  fit(
    partial_x_train,
    partial_y_train,
    epochs = 20,
    batch_size = 512,
    validation_data = list(x_val, y_val))
plot(history)

And a bigger network:

history <- keras_model_sequential() %>%
  layer_dense(units = 512, activation = "relu", input_shape = c(10000)) %>%
  layer_dense(units = 512, activation = "relu") %>%
  layer_dense(units = 1, activation = "sigmoid") %>%
  compile(optimizer = "rmsprop",
    loss = "binary_crossentropy",
    metrics = c("accuracy")) %>%
  fit(
    partial_x_train,
    partial_y_train,
    epochs = 20,
    batch_size = 512,
    validation_data = list(x_val, y_val))
plot(history)
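
To compare the two runs directly instead of plotting each in isolation, one option is to keep each history under its own name and overlay the validation losses. The variable names history_small and history_big below are hypothetical (the code above reuses history), and ggplot2 is assumed to be installed:

library(ggplot2)
df_small <- as.data.frame(history_small); df_small$model <- "small"
df_big   <- as.data.frame(history_big);   df_big$model <- "big"
df <- rbind(df_small, df_big)
ggplot(subset(df, metric == "loss" & data == "validation"),
       aes(x = epoch, y = value, colour = model)) +
  geom_line() +
  labs(y = "validation loss")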

Adding weight regularization

Adding L2 weight regularization to the model:

history <- keras_model_sequential() %>%
  layer_dense(units = 16, kernel_regularizer = regularizer_l2(0.001),
              activation = "relu", input_shape = c(10000)) %>%
  layer_dense(units = 16, kernel_regularizer = regularizer_l2(0.001),
              activation = "relu") %>%
  layer_dense(units = 1, activation = "sigmoid") %>%
  compile(optimizer = "rmsprop",
    loss = "binary_crossentropy",
    metrics = c("accuracy")) %>%
  fit(
    partial_x_train,
    partial_y_train,
    epochs = 20,
    batch_size = 512,
    validation_data = list(x_val, y_val))
plot(history)
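
L2 is not the only option: the keras package also provides L1 and combined L1/L2 regularizers, which can be passed to the same kernel_regularizer argument:

regularizer_l1(0.001)                      # L1 regularization
regularizer_l1_l2(l1 = 0.001, l2 = 0.001)  # simultaneous L1 and L2 regularization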

Adding dropout

Let’s add two dropout layers to the IMDB network to see how well they do at reducing overfitting.

history <- keras_model_sequential() %>%
  layer_dense(units = 16, activation = "relu", input_shape = c(10000)) %>%
  layer_dropout(rate = 0.5) %>%
  layer_dense(units = 16, activation = "relu") %>%
  layer_dropout(rate = 0.5) %>%
  layer_dense(units = 1, activation = "sigmoid") %>%
  compile(optimizer = "rmsprop",
    loss = "binary_crossentropy",
    metrics = c("accuracy")) %>%
  fit(
    partial_x_train,
    partial_y_train,
    epochs = 20,
    batch_size = 512,
    validation_data = list(x_val, y_val))
plot(history)