In this post, we build a custom model and a custom training loop with TensorFlow 2.1 on the famous MNIST dataset. We perform multiclass classification with a basic convolutional neural network.

According to the official website, the MNIST dataset is a database of handwritten digits. It contains a training set and a test set. The digits have been size-normalized and centered in a fixed-size image.

It is a good database for people who want to try learning techniques and pattern recognition methods on real-world data while spending minimal effort on preprocessing and formatting.

This dataset contains 10 classes (digits from 0 to 9) and the images are grayscale (one channel). As a reminder, color images have three channels: red, green and blue.

Here is a link to the notebook.

Let’s get started!

Load useful packages

import tensorflow as tf
from tensorflow.keras.layers import Dense, Flatten, Conv2D, MaxPooling2D
from tensorflow.keras import Model

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.metrics import confusion_matrix
import seaborn as sns

Data preparation and exploration

Load the dataset

# Load
mnist_dataset = tf.keras.datasets.mnist.load_data()
# Unpack
(x_train, y_train), (x_test, y_test) = mnist_dataset

# Train dataset shapes
print('Train X shape ', x_train.shape)
print('Train Y shape ', y_train.shape)

#=> Train X shape  (60000, 28, 28)
#=> Train Y shape  (60000,)
# Test dataset shapes
print('Test X shape ', x_test.shape)
print('Test Y shape ', y_test.shape)

#=> Test X shape  (10000, 28, 28)
#=> Test Y shape  (10000,)

Explore images

print("pixel unique values:", len(np.unique(x_train)),
      "\nmin value:", np.min(x_train),
      "\nmax value:", np.max(x_train))

#=> pixel unique values: 256 
#=> min value: 0 
#=> max value: 255 

Explore labels

print("unique labels:", len(np.unique(y_train)),
      "\nmin value:", np.min(y_train),
      "\nmax value:", np.max(y_train))

#=> unique labels: 10
#=> min value: 0
#=> max value: 9

Inspect the label probability distribution over the training/test sets

see code ref: datasets_distribution, print_dataset_distributions

We plot a histogram of the label distributions to see how balanced the labels are across samples, and whether the training and test sets share similar distributions.

traintest = datasets_distribution(y_train, y_test)
print(traintest)
print_dataset_distributions(traintest)
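The helper functions live in the notebook; here is a minimal sketch of what datasets_distribution could look like, assuming it returns a pandas DataFrame of label percentages (the notebook's actual helper may differ):

def datasets_distribution(y_train, y_test):
  """
  A minimal sketch: percentage of each label in the training and test sets,
  returned as a pandas DataFrame.
  """
  labels = np.unique(y_train)
  train_pct = [np.mean(y_train == label) * 100 for label in labels]
  test_pct = [np.mean(y_test == label) * 100 for label in labels]
  return pd.DataFrame({"label": labels,
                       "train (%)": train_pct,
                       "test (%)": test_pct})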

Training set and test set label distributions


label  train (%)  test (%)
0       9.871667      9.80
1      11.236667     11.35
2       9.930000     10.32
3      10.218333     10.10
4       9.736667      9.82
5       9.035000      8.92
6       9.863333      9.58
7      10.441667     10.28
8       9.751667      9.74
9       9.915000     10.09

We can see that:

  • the distribution across labels is fairly uniform
  • the distributions of the training set and the test set are also very close.

Show me an image

see code ref: show_image

show_image(x_train, 0)
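show_image is defined in the notebook; a minimal sketch, assuming it simply displays one sample with matplotlib:

def show_image(images, index):
  """
  A minimal sketch of the notebook's helper: display one grayscale sample.
  The reshape makes it work before and after the channel dimension is added.
  """
  plt.imshow(images[index].reshape(28, 28), cmap="gray")
  plt.axis("off")
  plt.show()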

Number 5

Gather and prepare

see code ref: prepare_mnist_dataset

Here, we prepare the samples by converting them from integers (0-255) to floating-point numbers (0.0-1.0). Then we reshape the samples to fit the convolutional layer: we add one dimension, because the first convolutional layer expects a 4D tensor ([batch, in_height, in_width, in_channels]).

def prepare_mnist_dataset(mnist_dataset):
  """
  Get the data from the MNIST dataset
  http://yann.lecun.com/exdb/mnist/
  :param mnist_dataset: MNIST dataset
  :return: tuple containing (x_train, y_train), (x_test, y_test)
  """
  (x_train, y_train), (x_test, y_test) = mnist_dataset
  # Scale pixel values from integers (0-255) to floats (0.0-1.0)
  x_train, x_test = x_train / np.float32(255), x_test / np.float32(255)
  y_train, y_test = y_train.astype(np.int64), y_test.astype(np.int64)
  # Get the number of training and test samples
  m_train = x_train.shape[0]
  m_test = x_test.shape[0]
  # Get image dimensions
  height = x_test.shape[1]
  width = x_test.shape[2]
  # Reshape adding one dimension for the channel
  x_train = x_train.reshape(m_train, height, width, 1)
  x_test = x_test.reshape(m_test, height, width, 1)
  return (x_train, y_train), (x_test, y_test)

Define the model

see code ref: SimpleConvModel, tf.keras.layers.Conv2D, tf.keras.layers.MaxPool2D, tf.keras.layers.Flatten, tf.keras.Model

We define a custom TensorFlow model by creating a class that inherits from tf.keras.Model. In this class, we build a sequential neural network with the help of some predefined Keras layers:

  • Conv2D(32, (3,3), input_shape=(28, 28, 1), activation="relu")
    • a 2D convolutional layer for spatial convolution over an image
    • channel: only one, as the input is grayscale
    • convolutions: 32 filters, each a 3x3 grid
    • activation function: ReLU
  • MaxPooling2D(2, 2)
    • a MaxPooling layer that compresses the image while preserving the features highlighted by the convolution.
    • size: (2,2), which quarters the image area (each spatial dimension is halved).

class SimpleConvModel(Model):
  """
  Custom convolutional model
  """
  def __init__(self, image_height, image_width, channel_count):
    """
    Constructor
    :param self: self
    :param image_height: height of image in pixel
    :param image_width: width of image in pixel
    :param channel_count: channel count; for example, a color image has 3 channels and a grayscale image only one
    :return: void
    """
    super(SimpleConvModel, self).__init__()

    # Define sequential layers
    self.convolution = Conv2D(32, (3,3), input_shape=(image_height, image_width, channel_count), activation="relu")
    self.max_pooling = MaxPooling2D(2, 2)
    self.flatten = Flatten()
    self.dense = Dense(128, activation="relu")
    self.softmax = Dense(10, activation="softmax")

    # Keep convolutional layer output
    self.convolutional_output = tf.constant(0)
    # Keep max pooling layer output
    self.max_pooling_output = tf.constant(0)
    # Input signature for tf.saved_model.save()
    self.input_signature = tf.TensorSpec(shape=[None, image_height, image_width, channel_count], dtype=tf.float32, name='prev_img')

  def call(self, inputs):
    """
    Forward propagation
    :param self: self
    :param inputs: tensor of dimension [batch_size, image_height, image_width, channel_count]
    :return: predictions
    """
    self.convolutional_output = self.convolution(inputs)
    self.max_pooling_output = self.max_pooling(self.convolutional_output)
    x = self.flatten(self.max_pooling_output)
    x = self.dense(x)
    return self.softmax(x)

Then we instantiate a SimpleConvModel and print a summary of the network we’ve just built.

# Create a model instance
IMAGE_HEIGHT = 28
IMAGE_WIDTH = 28

model = SimpleConvModel(IMAGE_HEIGHT, IMAGE_WIDTH, 1)
model.build(input_shape=(None, IMAGE_HEIGHT, IMAGE_WIDTH, 1))
model.summary()

#=> Model: "simple_conv_model"
#=> _________________________________________________________________
#=> Layer (type)                 Output Shape              Param #   
#=> =================================================================
#=> conv2d (Conv2D)              multiple                  320       
#=> _________________________________________________________________
#=> max_pooling2d (MaxPooling2D) multiple                  0         
#=> _________________________________________________________________
#=> flatten (Flatten)            multiple                  0         
#=> _________________________________________________________________
#=> dense (Dense)                multiple                  692352    
#=> _________________________________________________________________
#=> dense_1 (Dense)              multiple                  1290      
#=> =================================================================
#=> Total params: 693,962
#=> Trainable params: 693,962
#=> Non-trainable params: 0
#=> _________________________________________________________________

For a better understanding of the TensorFlow output, here is a table with the input and output dimensions added. To simplify, we omit the batch/sample size.

Layer               Input Dimension  Output Dimension  Parameter Count
Conv2D              28x28            32x26x26          (3x3 + 1) x 32
MaxPooling2D        32x26x26         32x13x13          0
Flatten             32x13x13         5408              0
Dense (128)         5408             128               (5408 + 1) x 128
Dense softmax (10)  128              10                (128 + 1) x 10

For the first layer, we have 32 filters of square size 3x3, so each filter has 9 weights. The parameter count is therefore (9 + 1) x 32 = 320, because we add a bias unit for every filter.
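As a sanity check, we can recompute the counts from the table by hand:

# Recompute the parameter counts reported by model.summary()
conv_params = (3 * 3 * 1 + 1) * 32       # 320: 9 weights + 1 bias per filter, 32 filters
dense_params = (13 * 13 * 32 + 1) * 128  # 692,352: 5408 inputs + 1 bias, 128 units
softmax_params = (128 + 1) * 10          # 1,290: 128 inputs + 1 bias, 10 units
print(conv_params + dense_params + softmax_params)

#=> 693962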

Optimizer

see code ref: tf.keras.optimizers.Adam()

We choose the Adam optimizer (Adaptive Moment Estimation). It combines ideas from the AdaGrad and RMSProp algorithms.

optimizer_function = tf.keras.optimizers.Adam()

Loss

see code ref: tf.keras.losses.SparseCategoricalCrossentropy

This computes the cross-entropy loss between the labels and the predictions. Since our model outputs softmax probabilities, the default from_logits=False is what we want. For more information, go to the section “\(\mathcal{L}\) as Loss function and \(E\) as Error” in my previous post.

loss_function = tf.keras.losses.SparseCategoricalCrossentropy()

def compute_loss(labels, logits):
  """
  Compute loss
  :param labels: true labels (integer class indices)
  :param logits: predicted probabilities (softmax output)
  :return: loss
  """
  return loss_function(labels, logits)
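For example, the loss expects integer class indices as labels and probability vectors as predictions:

labels = tf.constant([2, 0])
probabilities = tf.constant([[0.05, 0.05, 0.9], [0.8, 0.1, 0.1]])
print(compute_loss(labels, probabilities).numpy())

#=> ~0.1643, the mean of -log(0.9) and -log(0.8)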

Accuracy

see code ref: tf.math.argmax, tf.math.reduce_mean, tf.cast, tf.math.equal

Accuracy is usually the first metric we compute. It represents the fraction of correct predictions.

def compute_accuracy(labels, logits):
  """
  Compute accuracy
  :param labels: true labels (integer class indices, int64)
  :param logits: predicted probabilities (softmax output)
  :return: accuracy of type float
  """
  predictions = tf.math.argmax(logits, axis=1)
  return tf.math.reduce_mean(tf.cast(tf.math.equal(predictions, labels), tf.float32))
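A quick check: the labels must be int64 here, since tf.math.argmax returns int64 indices.

labels = tf.constant([3, 1], dtype=tf.int64)
probabilities = tf.constant([[0.1, 0.1, 0.1, 0.7], [0.6, 0.2, 0.1, 0.1]])
print(compute_accuracy(labels, probabilities).numpy())

#=> 0.5, one of the two predictions is correct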

Now, action! :movie_camera:

Get the data prepared

see code ref: prepare_mnist_dataset, batch_dataset

# Prepare data
(x_train, y_train), (x_test, y_test) = prepare_mnist_dataset(mnist_dataset)

# Train
# Get a `TensorSliceDataset` object from `ndarray`s
train_dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train))
train_dataset_batch = batch_dataset(train_dataset,
                        take_count = 60000,
                        batch_count = 100)

# Test
test_dataset = tf.data.Dataset.from_tensor_slices((x_test, y_test))
test_dataset_batch = test_dataset.batch(100)
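batch_dataset also comes from the notebook; a minimal sketch, assuming batch_count is the batch size and that the helper shuffles before batching (the notebook's actual helper may differ):

def batch_dataset(dataset, take_count, batch_count):
  """
  A minimal sketch: shuffle the dataset, keep `take_count` samples
  and split them into batches of size `batch_count`.
  """
  return dataset.shuffle(take_count).take(take_count).batch(batch_count)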

Optimization loop

see code ref: save, tf.keras.metrics.Mean, tf.keras.metrics.SparseCategoricalAccuracy, tf.GradientTape

We write an optimization loop with the help of tf.GradientTape(), which lets us compute the gradients automatically. We stop the loop when the test accuracy has not improved for 2 consecutive epochs, and we save the model at the end of each epoch with the save helper.
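The save helper lives in the notebook; here is a minimal sketch, assuming it exports a SavedModel using the input_signature we stored on the model (the actual version may differ):

def save(model, path):
  """
  A minimal sketch of the notebook's `save` helper: export the model
  as a SavedModel.
  """
  # Trace `call` with the stored input signature so the export carries a
  # concrete serving function
  serving_fn = tf.function(model.call).get_concrete_function(model.input_signature)
  tf.saved_model.save(model, path, signatures=serving_fn)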

EPOCHS = 10

train_losses = []
train_accuracies = []

test_losses = []
test_accuracies = []

for epoch in range(EPOCHS):
  train_loss_aggregate = tf.keras.metrics.Mean(name="train_loss")
  test_loss_aggregate = tf.keras.metrics.Mean(name="test_loss")
  train_accuracy_aggregate = tf.keras.metrics.SparseCategoricalAccuracy(name="train_accuracy")
  test_accuracy_aggregate = tf.keras.metrics.SparseCategoricalAccuracy(name="test_accuracy")

  for train_images, train_labels in train_dataset_batch:
    with tf.GradientTape() as tape:
      # forward propagation
      predictions = model(train_images, training=True)
      # calculate loss
      loss = compute_loss(train_labels, predictions)
      
    # calculate gradients from model definition and loss
    gradients = tape.gradient(loss, model.trainable_variables)
    # update model from gradients
    optimizer_function.apply_gradients(zip(gradients, model.trainable_variables))

    train_loss_aggregate(loss)
    train_accuracy_aggregate(train_labels, predictions)

  for test_images, test_labels in test_dataset_batch:
    predictions = model(test_images, training=False)

    loss = compute_loss(test_labels, predictions)

    test_loss_aggregate(loss)
    test_accuracy_aggregate(test_labels, predictions)

  train_losses.append(train_loss_aggregate.result().numpy())
  train_accuracies.append(train_accuracy_aggregate.result().numpy()*100)

  test_losses.append(test_loss_aggregate.result().numpy())
  test_accuracies.append(test_accuracy_aggregate.result().numpy()*100)

  print('epoch', epoch,
        'train loss', train_losses[-1],
        'train accuracy', train_accuracies[-1],
        'test loss', test_losses[-1],
        'test accuracy', test_accuracies[-1])

  save(model, 'mnist/epoch/{0}'.format(epoch))
  
  if epoch > 1:
    if test_accuracies[-2] >= test_accuracies[-1] and test_accuracies[-3] >= test_accuracies[-2]:
      break

#=> epoch 0 train loss 0.2019603 train accuracy 94.14166808128357 test loss 0.07888014 test accuracy 97.54999876022339
#=> epoch 1 train loss 0.064292975 train accuracy 98.07999730110168 test loss 0.057503365 test accuracy 98.089998960495
#=> epoch 2 train loss 0.043002833 train accuracy 98.71166944503784 test loss 0.05409672 test accuracy 98.089998960495
#=> epoch 3 train loss 0.031002931 train accuracy 99.05166625976562 test loss 0.046595603 test accuracy 98.43999743461609
#=> epoch 4 train loss 0.022615613 train accuracy 99.31166768074036 test loss 0.04606781 test accuracy 98.43000173568726
#=> epoch 5 train loss 0.016122704 train accuracy 99.50500130653381 test loss 0.03901471 test accuracy 98.580002784729
#=> epoch 6 train loss 0.012739406 train accuracy 99.6150016784668 test loss 0.045439206 test accuracy 98.5700011253357
#=> epoch 7 train loss 0.009279974 train accuracy 99.72833395004272 test loss 0.047415126 test accuracy 98.5700011253357

It seems that the model begins to overfit after epoch 5: the train accuracy keeps growing while the test accuracy stalls and slightly decreases.

For visualization, we plot the losses and accuracies over epochs.

see doc ref: losses, accuracies
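The plotting code is in the notebook; a minimal sketch with matplotlib, using the lists filled during the loop:

fig, (ax_loss, ax_acc) = plt.subplots(1, 2, figsize=(12, 4))
ax_loss.plot(train_losses, label="train")
ax_loss.plot(test_losses, label="test")
ax_loss.set_xlabel("epoch")
ax_loss.set_ylabel("loss")
ax_loss.legend()
ax_acc.plot(train_accuracies, label="train")
ax_acc.plot(test_accuracies, label="test")
ax_acc.set_xlabel("epoch")
ax_acc.set_ylabel("accuracy (%)")
ax_acc.legend()
plt.show()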

Losses Accuracies

Model evaluation

Confusion matrix, accuracy, precisions and recalls

see doc ref: custom_confusion_matrix, print_confusion, confusion_matrix

In real-world problems, accuracy is often not a reliable metric of a classifier's performance. Indeed, an imbalanced dataset will produce skewed accuracy results.

To overcome this, we often use confusion matrices, also called error matrices. In a confusion matrix, the true/actual labels (by rows) are displayed against the predicted labels (by columns). In this configuration, true positives (\(tp\)) are on the diagonal; the false positives (\(fp\)) of a class lie in its column, off the diagonal; and its false negatives (\(fn\)) lie in its row.

           Pred. 0      Pred. 1      Pred. 2      ..  Pred. 9      Recall
Act. 0     Correct 0    P=1 but A=0  P=2 but A=0  ..  P=9 but A=0  Recall 0
Act. 1     P=0 but A=1  Correct 1    P=2 but A=1  ..  P=9 but A=1  Recall 1
Act. 2     P=0 but A=2  P=1 but A=2  Correct 2    ..  P=9 but A=2  Recall 2
..         ..           ..           ..           ..  ..           ..
Act. 9     P=0 but A=9  P=1 but A=9  P=2 but A=9  ..  Correct 9    Recall 9
Precision  Precision 0  Precision 1  Precision 2  ..  Precision 9  Accuracy

We added three other metrics to the matrix: precision, accuracy and recall.

The accuracy is the percentage of predictions that are correct, i.e. the sum of all correct predictions over the total number of samples. Referring to the confusion matrix, it is the trace of the matrix over the total number of examples \(m\).

\[accuracy = { Tr(confusion) \over m }\]

Precision: when the model predicts label k, how often is that the true label?

\[precision(label = k) = { tp(label = k) \over tp(label = k) + fp(label = k) }\]

Referring to the matrix, where each column k gathers everything predicted as k, the precisions can be calculated as follows

\[precisions = { diag(confusion) \over colsum(confusion) }\]

Recall, or “true positive rate”: when the actual label is k, how often does the model predict label k?

\[recall(label = k) = { tp(label = k) \over tp(label = k) + fn(label = k) }\]

Referring to the matrix, where each row k gathers everything actually labeled k, the recalls can be calculated as follows

\[recalls = { diag(confusion) \over rowsum(confusion) }\]
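These formulas translate directly into a few lines of numpy, using the sklearn confusion_matrix imported earlier (a minimal sketch; the notebook's custom_confusion_matrix helper may differ):

y_pred = np.argmax(model(x_test).numpy(), axis=1)
confusion = confusion_matrix(y_test, y_pred)  # rows: actual, columns: predicted

accuracy = np.trace(confusion) / confusion.sum()
precisions = np.diag(confusion) / confusion.sum(axis=0)  # column sums: everything predicted as k
recalls = np.diag(confusion) / confusion.sum(axis=1)     # row sums: everything actually labeled k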

As a reminder, here are the results of the optimization loop:

Epoch  Train loss   Train accuracy     Test loss    Test accuracy
0      0.2019603    94.14166808128357  0.07888014   97.54999876022339
1      0.064292975  98.07999730110168  0.057503365  98.089998960495
2      0.043002833  98.71166944503784  0.05409672   98.089998960495
3      0.031002931  99.05166625976562  0.046595603  98.43999743461609
4      0.022615613  99.31166768074036  0.04606781   98.43000173568726
5      0.016122704  99.50500130653381  0.03901471   98.580002784729
6      0.012739406  99.6150016784668   0.045439206  98.5700011253357
7      0.009279974  99.72833395004272  0.047415126  98.5700011253357

We focus on epochs 0 and 5, the two epochs that differ the most in terms of accuracy.

Here is the confusion matrix for epoch 0:

Confusion matrix at epoch 0

We can see that 19 images of the digit 7 are misclassified as 2.

And for epoch 5:

Confusion matrix at epoch 5

Here again, a significant number of 7s are misclassified as 2s. Indeed, label 7 has the poorest recall.

We also notice that, in general, true 2s are correctly classified.

On the other hand, label 9 has the poorest precision, because the model often misclassifies 4s, 7s and 8s as 9s.

Show me what the network sees

see doc ref: print_convolutions, print_max_poolings

Let’s pick a random image and look at the most significant convolutions.

sample_index = 10
show_image(x_test, sample_index)

prediction = model(x_test[sample_index].reshape(1, IMAGE_HEIGHT, IMAGE_WIDTH, 1))

print("label", y_test[sample_index])
print("prediction", tf.math.argmax(prediction, axis=1).numpy())

print_convolutions(model)
print_max_poolings(model)
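print_convolutions and print_max_poolings are notebook helpers; a minimal sketch of the first one, assuming it plots the feature maps kept in model.convolutional_output after the forward pass:

def print_convolutions(model, count=4):
  """
  A minimal sketch: display the first `count` feature maps produced by the
  convolutional layer during the last forward pass.
  """
  feature_maps = model.convolutional_output.numpy()[0]  # shape (26, 26, 32)
  fig, axes = plt.subplots(1, count, figsize=(3 * count, 3))
  for i, ax in enumerate(axes):
    ax.imshow(feature_maps[:, :, i], cmap="gray")
    ax.set_title("convolution {0}".format(i))
    ax.axis("off")
  plt.show()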

original 6th convolution 13th convolution 17th convolution

Error analysis

see doc ref: tf.math.top_k

Finally, we perform a short error analysis to see where the model goes wrong. For this, we pick the model saved at epoch 5, since it has the best test accuracy.

epoch = 5
savedModel = tf.saved_model.load("mnist/epoch/{0}".format(epoch))

m = x_test.shape[0]

# Calculate predictions with the restored epoch-5 model (assuming the restored
# object is directly callable, like the original Keras model)
test_predictions = savedModel(x_test.reshape(m, IMAGE_HEIGHT, IMAGE_WIDTH, 1)).numpy()
error_indices = y_test != np.argmax(test_predictions, axis=1)

error_images = x_test[error_indices]
error_labels = y_test[error_indices]
error_predictions = test_predictions[error_indices]
np.set_printoptions(suppress=True)

for i, error_prediction in enumerate(error_predictions):
  print("index", i)
  print("true label", error_labels[i])
  print("prediction", tf.math.argmax(error_prediction).numpy())

  top3 = tf.math.top_k(error_prediction, 3)
  print("top k", top3.indices.numpy(), top3.values.numpy())

  show_image(error_images, i)

Let’s take a 7 misclassified as a 2.

7 misclassified

In this case the model is really off, because none of the top 3 predictions is a 7!

We then have two other cases. The next one is when the handwritten digit is hard even for a human to read. Take this 3, for example:

3 misclassified

And this last one, which explains pretty well why CAPTCHAs are so difficult to solve automatically.

8 misclassified

While writing this article, I really enjoyed the following sources: