The Reuters dataset

The objective here is to classify short news stories (Reuters newswires) into one of 46 mutually exclusive topics.

Preparing the data

Here, we use the multi-assignment operator (%<-%) from the zeallot package to unpack the list into a set of distinct variables.

library(keras)

reuters <- dataset_reuters(num_words = 10000)  # keep only the 10,000 most frequent words
c(c(train_data, train_labels), c(test_data, test_labels)) %<-% reuters
length(train_data)
## [1] 8982
length(test_data)
## [1] 2246
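
For reference, here is the same unpacking written without %<-%. This is a minimal sketch that assumes the nested train/test, x/y list structure returned by dataset_reuters():

train_data <- reuters$train$x
train_labels <- reuters$train$y
test_data <- reuters$test$x
test_labels <- reuters$test$y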

As with the IMDB reviews, each example is a list of integers (word indices):

train_data[[1]]
##  [1]    1    2    2    8   43   10  447    5   25  207  270    5 3095  111
## [15]   16  369  186   90   67    7   89    5   19  102    6   19  124   15
## [29]   90   67   84   22  482   26    7   48    4   49    8  864   39  209
## [43]  154    6  151    6   83   11   15   22  155   11   15    7   48    9
## [57] 4579 1005  504    6  258    6  272   11   15   22  134   44   11   15
## [71]   16    8  197 1245   90   67   52   29  209   30   32  132    6  109
## [85]   15   17   12
train_labels[[1]]
## [1] 3
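
If you want to read the text behind these indices, you can decode a newswire with the word index that ships with the dataset. The sketch below mirrors the IMDB decoding recipe; indices 0, 1, and 2 are reserved for "padding", "start of sequence", and "unknown", hence the offset of 3:

word_index <- dataset_reuters_word_index()
reverse_word_index <- names(word_index)
names(reverse_word_index) <- word_index

decoded_newswire <- sapply(train_data[[1]], function(index) {
  # real words start at index 3; smaller indices are reserved tokens
  word <- if (index >= 3) reverse_word_index[[as.character(index - 3)]]
  if (!is.null(word)) word else "?"
})
cat(paste(decoded_newswire, collapse = " "))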

You can vectorize the data with the same code as in the IMDB example:

vectorize_sequences <- function(sequences, dimension = 10000) {
  # create an all-zero matrix of shape (length(sequences), dimension)
  results <- matrix(0, nrow = length(sequences), ncol = dimension)
  # set results[i, j] to 1 for every word index j in sample i
  for (i in 1:length(sequences))
    results[i, sequences[[i]]] <- 1
  results
}

x_train <- vectorize_sequences(train_data)    # vectorized training data
x_test <- vectorize_sequences(test_data)      # vectorized test data
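
A couple of quick sanity checks on the result (a small sketch; the expected values follow from the sample counts above and the 10,000-word vocabulary):

dim(x_train)  # 8982 samples x 10000 vocabulary columns
# each row should have 1s exactly at the word indices of that sample
all(which(x_train[1, ] == 1) == sort(unique(train_data[[1]])))  # should be TRUE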

Vectorize the labels:

one_hot_train_labels <- to_categorical(train_labels)
one_hot_test_labels <- to_categorical(test_labels)
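
to_categorical() performs the one-hot (categorical) encoding for you. Conceptually it is equivalent to the manual sketch below; note the + 1, because the labels run from 0 to 45 while R matrices are 1-indexed:

to_one_hot <- function(labels, dimension = 46) {
  results <- matrix(0, nrow = length(labels), ncol = dimension)
  for (i in 1:length(labels))
    results[i, labels[[i]] + 1] <- 1
  results
}

all(to_one_hot(train_labels) == one_hot_train_labels)  # should be TRUE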

Building the model

  • The dimensionality of the output space is much larger than in the IMDB example: 46 classes instead of 2.

Information bottleneck

  • Each layer can only access information present in the output of the previous layer.
  • Each layer can potentially become an information bottleneck.
  • A 16-dimensional intermediate layer may be too limited to learn to separate 46 different classes: such small layers can act as information bottlenecks, permanently dropping relevant information.

For this reason, we’ll use larger layers; let’s go with 64 units.

model <- keras_model_sequential() %>%
  layer_dense(units = 64, activation = "relu", input_shape = c(10000)) %>%
  layer_dense(units = 64, activation = "relu") %>%
  layer_dense(units = 46, activation = "softmax")  # probability distribution over the 46 classes
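
You can inspect the resulting architecture with summary(model). The parameter counts are just the usual dense-layer arithmetic (inputs x units + biases):

summary(model)
# layer 1: 10000 * 64 + 64 = 640064 parameters
# layer 2:    64 * 64 + 64 =   4160 parameters
# layer 3:    64 * 46 + 46 =   2990 parameters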

Compiling the model

The best loss function to use in this case is categorical_crossentropy: it measures the distance between the probability distribution output by the network and the true distribution of the labels.

model %>% compile(
  optimizer = "rmsprop",
  loss = "categorical_crossentropy",
  metrics = c("accuracy")
)
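
If you would rather keep the labels as plain integer vectors instead of one-hot encoding them, the only change is the loss function. A sketch of that alternative (everything else stays the same):

model %>% compile(
  optimizer = "rmsprop",
  loss = "sparse_categorical_crossentropy",  # expects integer labels in 0..45
  metrics = c("accuracy")
)
# you would then fit and evaluate with the integer label vectors
# (train_labels, test_labels) instead of the one-hot encoded versions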

Validating your approach

Let’s set aside 1,000 samples from the training data to use as a validation set.

val_indices <- 1:1000

x_val <- x_train[val_indices,]
partial_x_train <- x_train[-val_indices,]

y_val <- one_hot_train_labels[val_indices,]
partial_y_train <- one_hot_train_labels[-val_indices,]
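
Taking the first 1,000 samples is fine here; if you suspect any ordering in the data, a random split is a small change. A sketch, with a hypothetical seed chosen only for reproducibility:

set.seed(42)                                # hypothetical seed
val_indices <- sample(nrow(x_train), 1000)  # 1000 random rows instead of the first 1000

x_val <- x_train[val_indices,]
partial_x_train <- x_train[-val_indices,]
y_val <- one_hot_train_labels[val_indices,]
partial_y_train <- one_hot_train_labels[-val_indices,]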

Now, let’s train the network for 20 epochs.

history <- model %>% fit(
  partial_x_train,
  partial_y_train,
  epochs = 20,
  batch_size = 512,
  validation_data = list(x_val, y_val)
)
plot(history)
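
Besides the plot, you can inspect the recorded metrics directly to see where the validation loss bottoms out (metric names such as val_loss can vary slightly across keras versions):

str(history$metrics)                 # per-epoch training and validation metrics
which.min(history$metrics$val_loss)  # epoch with the lowest validation loss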

The network begins to overfit after nine epochs. Let’s train a new network from scratch for nine epochs and then evaluate it on the test set.

model <- keras_model_sequential() %>%
  layer_dense(units = 64, activation = "relu", input_shape = c(10000)) %>%
  layer_dense(units = 64, activation = "relu") %>%
  layer_dense(units = 46, activation = "softmax")

model %>% compile(
  optimizer = "rmsprop",
  loss = "categorical_crossentropy",
  metrics = c("accuracy")
)

history <- model %>% fit(
  partial_x_train,
  partial_y_train,
  epochs = 9,
  batch_size = 512,
  validation_data = list(x_val, y_val)
)
results <- model %>% evaluate(x_test, one_hot_test_labels)

results
## $loss
## [1] 1.019871
## 
## $acc
## [1] 0.7769368

This approach reaches an accuracy of ~78%. With a balanced binary classification problem, a purely random classifier would reach 50% accuracy; here, a random baseline is closer to 19%, so compared to that baseline the results look pretty good:

test_labels_copy <- test_labels
test_labels_copy <- sample(test_labels_copy)   # shuffle the labels at random
length(which(test_labels == test_labels_copy)) / length(test_labels)
## [1] 0.1954586

Predictions on new data

predictions <- model %>% predict(x_test)

Each row of predictions (one per test sample) is a vector of length 46:

dim(predictions)
## [1] 2246   46

The coefficients in this vector sum to 1:

sum(predictions[1,])
## [1] 1

The largest entry corresponds to the predicted class, that is, the class with the highest probability:

which.max(predictions[1,])
## [1] 4
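
Because the labels are 0-based while R indexing is 1-based, position 4 corresponds to class 3. To turn every probability vector into a class label and cross-check against the evaluated accuracy, a small sketch:

predicted_classes <- apply(predictions, 1, which.max) - 1  # back to 0-based labels
mean(predicted_classes == test_labels)                     # should roughly match the test accuracy above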