The Reuters dataset
The objective here is to classify short news stories into one of 46 mutually exclusive topics.
Preparing the data
Here, we use the multi-assignment operator (%<-%) from the zeallot package to unpack the list into a set of distinct variables.
library(keras)
library(zeallot)
reuters <- dataset_reuters(num_words = 10000)
c(c(train_data, train_labels), c(test_data, test_labels)) %<-% reuters
length(train_data)
## [1] 8982
length(test_data)
## [1] 2246
As with the IMDB reviews, each example is a list of integers (word indices):
train_data[[1]]
## [1] 1 2 2 8 43 10 447 5 25 207 270 5 3095 111
## [15] 16 369 186 90 67 7 89 5 19 102 6 19 124 15
## [29] 90 67 84 22 482 26 7 48 4 49 8 864 39 209
## [43] 154 6 151 6 83 11 15 22 155 11 15 7 48 9
## [57] 4579 1005 504 6 258 6 272 11 15 22 134 44 11 15
## [71] 16 8 197 1245 90 67 52 29 209 30 32 132 6 109
## [85] 15 17 12
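As in the IMDB example, the indices can be mapped back to words if you want to inspect a newswire. A minimal sketch, assuming the usual offset of 3 (indices 0, 1, and 2 are reserved for "padding", "start of sequence", and "unknown"):

word_index <- dataset_reuters_word_index()
reverse_word_index <- names(word_index)
names(reverse_word_index) <- word_index
# Decode the first newswire; reserved indices are shown as "?"
decoded_newswire <- sapply(train_data[[1]], function(index) {
  word <- if (index >= 3) reverse_word_index[[as.character(index - 3)]]
  if (!is.null(word)) word else "?"
})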
train_labels[[1]]
## [1] 3
You can vectorize the data with the same code as in the IMDB example:
vectorize_sequences <- function(sequences, dimension = 10000) {
  # Create an all-zero matrix of shape (length(sequences), dimension)
  results <- matrix(0, nrow = length(sequences), ncol = dimension)
  for (i in 1:length(sequences))
    results[i, sequences[[i]]] <- 1  # set the columns for the word indices in sequence i to 1
  results
}
x_train <- vectorize_sequences(train_data)
x_test <- vectorize_sequences(test_data)
Vectorize the labels:
one_hot_train_labels <- to_categorical(train_labels)
one_hot_test_labels <- to_categorical(test_labels)
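to_categorical() does the one-hot encoding for you, but the same encoding is easy to write by hand. A minimal sketch (to_one_hot is just an illustrative name; the + 1 converts the 0-based labels to R's 1-based column indices):

to_one_hot <- function(labels, dimension = 46) {
  results <- matrix(0, nrow = length(labels), ncol = dimension)
  for (i in 1:length(labels))
    results[i, labels[[i]] + 1] <- 1  # mark the column for this label
  results
}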
Building the model
- The dimensionality of the output space is much larger than in the IMDB example: 46 classes instead of 2.
Information bottleneck
- Each layer can only access information present in the output of the previous layer.
- Each layer can potentially become an information bottleneck.
- A 16-dimensional intermediate layer may be too limited to learn to separate 46 different classes:
- Such small layers may act as information bottlenecks, permanently dropping relevant information.
For this reason we will use larger layers. Let’s go with 64 units.
model <- keras_model_sequential() %>%
  layer_dense(units = 64, activation = "relu", input_shape = c(10000)) %>%
  layer_dense(units = 64, activation = "relu") %>%
  layer_dense(units = 46, activation = "softmax")
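For contrast only (this model is not used below), a version with a deliberately narrow intermediate layer would look like the sketch that follows; the 4-unit width is an arbitrary illustrative choice. Training it would typically give noticeably lower validation accuracy, because a 4-dimensional layer cannot carry enough information about 46 classes.

# Illustration of an information bottleneck; not used in the rest of this section
bottlenecked_model <- keras_model_sequential() %>%
  layer_dense(units = 64, activation = "relu", input_shape = c(10000)) %>%
  layer_dense(units = 4, activation = "relu") %>%
  layer_dense(units = 46, activation = "softmax")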
Compiling the model
The best loss function to use in this case is categorical_crossentropy.
model %>% compile(
  optimizer = "rmsprop",
  loss = "categorical_crossentropy",
  metrics = c("accuracy")
)
Validating your approach
Let’s set apart 1000 samples in the training data to use as a validation set.
val_indices <- 1:1000
x_val <- x_train[val_indices,]
partial_x_train <- x_train[-val_indices,]
y_val <- one_hot_train_labels[val_indices,]
partial_y_train <- one_hot_train_labels[-val_indices,]
Now, let’s train the network for 20 epochs.
history <- model %>% fit(
  partial_x_train,
  partial_y_train,
  epochs = 20,
  batch_size = 512,
  validation_data = list(x_val, y_val)
)
plot(history)
The network begins to overfit after nine epochs. Let’s train a new network from scratch for nine epochs and then evaluate it on the test set.
model <- keras_model_sequential() %>%
  layer_dense(units = 64, activation = "relu", input_shape = c(10000)) %>%
  layer_dense(units = 64, activation = "relu") %>%
  layer_dense(units = 46, activation = "softmax")
model %>% compile(
  optimizer = "rmsprop",
  loss = "categorical_crossentropy",
  metrics = c("accuracy")
)
history <- model %>% fit(
  partial_x_train,
  partial_y_train,
  epochs = 9,
  batch_size = 512,
  validation_data = list(x_val, y_val)
)
results <- model %>% evaluate(x_test, one_hot_test_labels)
results
## $loss
## [1] 1.019871
##
## $acc
## [1] 0.7769368
This approach reaches an accuracy of ~78%. For a balanced binary classification problem, the accuracy of a purely random classifier would be 50%; in this case the random baseline is closer to 19%, so the results look quite good by comparison:
test_labels_copy <- test_labels
test_labels_copy <- sample(test_labels_copy)
length(which(test_labels == test_labels_copy)) / length(test_labels)
## [1] 0.1954586
Predictions on new data
predictions <- model %>% predict(x_test)
Each entry in predictions is a vector of length 46:
dim(predictions)
## [1] 2246 46
The coefficients in this vector sum to 1:
sum(predictions[1,])
## [1] 1
The largest entry is the predicted class—the class with the highest probability:
which.max(predictions[1,])
## [1] 4
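Note that which.max() returns a 1-based position, while the Reuters labels are 0-based, so position 4 corresponds to label 3. A short sketch converting every prediction to a label and comparing against test_labels (this should roughly reproduce the test accuracy reported above):

# Convert each probability vector to a 0-based predicted label
predicted_classes <- apply(predictions, 1, which.max) - 1
mean(predicted_classes == test_labels)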
Recommended exercise 2
Use a vector of integers as labels instead of the one-hot encoding used above. Remember that this choice will impact the loss function used to train the model.
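As a hint, here is a minimal sketch of the pieces that would change, assuming you keep the integer labels exactly as returned by dataset_reuters(): the model and the vectorized inputs stay the same, but the loss becomes sparse_categorical_crossentropy, which accepts integer targets directly.

# Integer labels: no call to to_categorical()
y_train_int <- as.integer(train_labels)
y_val_int <- y_train_int[val_indices]
partial_y_train_int <- y_train_int[-val_indices]
model %>% compile(
  optimizer = "rmsprop",
  loss = "sparse_categorical_crossentropy",
  metrics = c("accuracy")
)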