The IMDB dataset

The objective here is to classify a movie review as either positive or negative.

Preparing the data

The data has already been preprocessed: the reviews (sequences of words) have been turned into sequences of integers, where each integer stands for a specific word in a dictionary.

  • The argument num_words = 10000 keeps only the top 10,000 most frequently occurring words in the training data.
  • Each review is a list of word indices.
  • The labels are vectors of 0s and 1s, where 0 stands for negative and 1 stands for positive.
  • The first review in the list:
##  int [1:218] 1 14 22 16 43 530 973 1622 1385 65 ...
## [1] 1
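
The output above can be reproduced with a short loading step. A minimal sketch, assuming the keras package is installed (dataset_imdb() and the %<-% multi-assignment operator are provided by keras and the zeallot package it imports):

```r
library(keras)

# Download (or load from cache) the preprocessed IMDB data,
# keeping only the 10,000 most frequent words
imdb <- dataset_imdb(num_words = 10000)
c(c(train_data, train_labels), c(test_data, test_labels)) %<-% imdb

str(train_data[[1]])  # the first review: a vector of word indices
train_labels[[1]]     # its label: 1 (positive)
```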

Turning sequences of integers into tensors

  • The vectorize_sequences() function below produces a rank-2 tensor of shape (samples, features).
  • Each sample is represented by a feature vector whose length equals the size of the dictionary, with a value of 1 if a particular word is present and 0 if it is absent.
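
One possible implementation of this one-hot encoding, consistent with the description above:

```r
# Turn a list of integer sequences into a (samples, dimension) matrix
# of 0s and 1s: row i gets a 1 in every column whose index appears
# in the i-th review
vectorize_sequences <- function(sequences, dimension = 10000) {
  results <- matrix(0, nrow = length(sequences), ncol = dimension)
  for (i in 1:length(sequences))
    results[i, sequences[[i]]] <- 1
  results
}

x_train <- vectorize_sequences(train_data)
x_test  <- vectorize_sequences(test_data)

# Labels are already 0/1; just convert to numeric vectors
y_train <- as.numeric(train_labels)
y_test  <- as.numeric(test_labels)
```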

Validating your approach

Create a validation set by setting apart 10,000 samples from the original training data.
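
A sketch of the split, together with a plausible model setup matching the training parameters shown in the history output below (20 epochs, batch size 512, 15,000 training and 10,000 validation samples); the layer sizes are an assumption:

```r
# Set aside the first 10,000 samples for validation
val_indices <- 1:10000

x_val <- x_train[val_indices, ]
partial_x_train <- x_train[-val_indices, ]
y_val <- y_train[val_indices]
partial_y_train <- y_train[-val_indices]

# A small stack of dense layers ending in a sigmoid
# (outputs a probability that the review is positive)
model <- keras_model_sequential() %>%
  layer_dense(units = 16, activation = "relu", input_shape = c(10000)) %>%
  layer_dense(units = 16, activation = "relu") %>%
  layer_dense(units = 1, activation = "sigmoid")

model %>% compile(
  optimizer = "rmsprop",
  loss = "binary_crossentropy",
  metrics = c("accuracy")
)

history <- model %>% fit(
  partial_x_train, partial_y_train,
  epochs = 20, batch_size = 512,
  validation_data = list(x_val, y_val)
)
```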

Note that the call to fit() returns a history object. Let’s take a look at it:

## List of 2
##  $ params :List of 8
##   ..$ metrics           : chr [1:4] "loss" "acc" "val_loss" "val_acc"
##   ..$ epochs            : int 20
##   ..$ steps             : NULL
##   ..$ do_validation     : logi TRUE
##   ..$ samples           : int 15000
##   ..$ batch_size        : int 512
##   ..$ verbose           : int 1
##   ..$ validation_samples: int 10000
##  $ metrics:List of 4
##   ..$ acc     : num [1:20] 0.789 0.9 0.927 0.945 0.956 ...
##   ..$ loss    : num [1:20] 0.5 0.302 0.221 0.171 0.139 ...
##   ..$ val_acc : num [1:20] 0.859 0.889 0.887 0.889 0.876 ...
##   ..$ val_loss: num [1:20] 0.383 0.3 0.287 0.274 0.32 ...
##  - attr(*, "class")= chr "keras_training_history"

The history object includes parameters used to fit the model (history$params) as well as data for each of the metrics being monitored (history$metrics).

  • You can customize all of this behavior via various arguments to the plot() method.
  • We can create a custom visualization by calling the as.data.frame() method on the history to obtain a data frame with factors for each metric as well as for training versus validation:
##   epoch     value metric     data
## 1     1 0.4997946   loss training
## 2     2 0.3020927   loss training
## 3     3 0.2211647   loss training
## 4     4 0.1707503   loss training
## 5     5 0.1389315   loss training
## 6     6 0.1142671   loss training
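
One way to turn that data frame into a custom plot, assuming ggplot2 is available (the faceting choice here is an illustration, not the only option):

```r
library(ggplot2)

history_df <- as.data.frame(history)

# One panel per metric, with training and validation curves overlaid
ggplot(history_df, aes(x = epoch, y = value, color = data)) +
  geom_line() +
  facet_grid(metric ~ ., scales = "free_y")
```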

This fairly naive approach achieves an accuracy of 88%. With state-of-the-art approaches, you should be able to get close to 95%.

Predicting on new data
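
The probabilities below can be generated with predict() on the vectorized test data; a minimal sketch:

```r
# Probability that each of the first 10 test reviews is positive
model %>% predict(x_test[1:10, ])
```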

##               [,1]
##  [1,] 0.0058080279
##  [2,] 1.0000000000
##  [3,] 0.7080981731
##  [4,] 0.9868260026
##  [5,] 0.9978235960
##  [6,] 0.9996370077
##  [7,] 0.6307815909
##  [8,] 0.0000171118
##  [9,] 0.9769185185
## [10,] 1.0000000000

Fighting overfitting