Experimenting with models on CIFAR-10

Abstract

This is my first attempt to build a successful classifier on the CIFAR-10 dataset through experimentation with various models and optimization techniques, and my first time tackling a machine learning problem without at least a little guidance. The goal is to take a pretrained model and, using transfer learning, apply it to a different dataset, hoping to achieve an evaluated accuracy of at least 87% on the new dataset. I will document the steps, along with my confusion and insights, throughout this writeup.

Introduction

I’m given instructions to train on the CIFAR-10 dataset using any Keras application of my choice, along with any additional layers or optimizations necessary to improve the model. The validation accuracy of the model, which is different from the accuracy during training, must reach at least 87%.

The idea of transfer learning is to transfer the knowledge of an existing, intelligent model to a new problem, taking advantage of what that model has already learned to reduce your training time. I like to think of it with a couple of analogies. First, I consider it a form of scaffolded learning: taking existing knowledge and building upon it to learn new things. Similarly, say I want to learn something new and a friend has knowledge in that area, though not exactly what I’m doing. Just before I start, my friend does a brain dump on me, giving me as much useful information as possible so I have a better baseline to start from. Another analogy: imagine wanting to learn to ride a bike before being able to walk very well. I might get there eventually, but if I already had well-developed leg muscles and stability, the learning curve would be much shorter.

So, in this machine learning context, we can take the weights of an existing model and use them as our starting point when tackling a new problem. Keras ships a set of applications built on the network architectures of the best models researchers have created over the years. We can take one of these applications and use its weights trained on the ImageNet dataset as our starting point. ImageNet is a dataset with over 14 million images, and the pretrained weights correspond to its 1000-class subset. With these pretrained weights as our starting point, we can build a usable model on another dataset much faster. It’s also a way to get better results with less data.
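To make this concrete, here is a minimal sketch (not my final code) of loading a Keras application with its pretrained ImageNet weights; the choice of ResNet50 and the input shape here are just placeholders:

```python
import tensorflow.keras as K

# Load a Keras application with weights pretrained on ImageNet.
# include_top=False drops the original 1000-class classifier so that
# a new head can be attached for a different dataset later on.
base_model = K.applications.ResNet50(weights='imagenet',
                                     include_top=False,
                                     input_shape=(224, 224, 3))
```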

Materials and Methods

While documenting the project, I ran into a plethora of issues that I didn’t remember to log. I also missed saving outputs from various attempts, so this section is not perfectly complete, but it demonstrates how I stumbled my way to a working solution through much trial and error.

A major stumbling block worth mentioning: I needed to train on a GPU, but I hit headaches with bugs and inconsistencies between different versions of tensorflow, tensorflow-gpu, and Python, various dependencies, running out of memory, CUDA incompatibilities, the kernel dying, and so on. The task example used tensorflow 1.12 with Python 3.5. Colab only gave me access to tensorflow 1.15 with Python 3.8, and Paperspace to 1.14. On Nimblebox I could choose versions, but I still ran into issues. During the week I went back and forth between Colab, Nimblebox, Paperspace, Vagrant (Ubuntu), and Anaconda (Windows). My final solution was an Anaconda virtual environment on my personal machine with tensorflow-gpu 1.12 and Python 3.5, which let me train on my RTX 2070 Super.

I simply started by creating a ResNet50 model in tensorflow, adding an Adam optimizer, compiling and fitting the model, and then saving it as cifar10.h5.

I initially chose ResNet50 because it’s not too large, hoping to keep my potential training time down while I stumbled through getting the model to at least compile. This being my first time, figuring out some seemingly simple steps had me spinning my wheels for a while.

A challenge I ran into was that the input shape expected by the pretrained network can be different from the shape of the dataset I’m trying to learn from in this new scenario. As I struggled through this problem I gave up on ResNet50 and tried my luck with VGG16, hoping it would be simpler to work with since it’s a simpler model. Both had the same issue, though, just with different shapes.

ResNet50 and VGG16 trained on ImageNet both expect (224, 224, 3) inputs by default, and the Inception-family models I used later expect (299, 299, 3), whereas my current dataset is (32, 32, 3). I needed to scale my input up to the dimensions the model required, but my input data contained many examples. My initial stabs using numpy.resize() and resize_images() directly failed, as my arrays included an extra batch dimension for the 10,000 images in the set I was using. When I finally figured out I could use a Lambda layer to apply resize_images() to every image, I discovered that this resizing method asks for a scaling factor, not the new dimensions. The factor is an integer, which means you can’t scale up evenly to just any size; coming from 32x32, I could only reach multiples of 32. Aiming for something close to 299x299x3, the best I could do was 288x288x3 by scaling up by a factor of 9. I discovered, though, that this was okay as long as the dimensions were greater than 75, although it did throw a warning.
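For reference, the approach that finally worked looked roughly like this; a sketch using the factor of 9 (32 up to 288) described above:

```python
import tensorflow.keras as K

# CIFAR-10 images come in as (32, 32, 3). K.backend.resize_images only
# scales by integer factors, so a factor of 9 gives (288, 288, 3).
inputs = K.Input(shape=(32, 32, 3))
resized = K.layers.Lambda(
    lambda img: K.backend.resize_images(img, 9, 9,
                                        data_format='channels_last'))(inputs)
```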

After solving this challenge, the next stumbling block was a similar scaling issue. The original ImageNet task has 1000 classes, whereas my current data has only 10. So the existing model has a final dense layer with softmax activation producing 1000 labels, but I need 10 instead. This last layer is called the top layer in the Keras application functions; setting include_top=False removes it, letting me replace it with a classifier layer valid for my dataset. Additionally, the output of VGG16’s last block won’t feed into the softmax classifier without first being flattened into a single dimension, which is resolved by adding a Flatten layer right before the final dense layer with softmax activation.
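Continuing the resize sketch above, the head replacement looked something like this (layer names and sizes are illustrative, not my exact final code):

```python
# Drop VGG16's original 1000-class top layer and attach a new
# classifier head for CIFAR-10's 10 classes.
base_model = K.applications.VGG16(weights='imagenet',
                                  include_top=False,
                                  input_shape=(288, 288, 3))
x = base_model(resized)
x = K.layers.Flatten()(x)   # collapse the conv feature maps to one dimension
outputs = K.layers.Dense(10, activation='softmax')(x)
model = K.models.Model(inputs=inputs, outputs=outputs)
```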

I attempted to add a Dense layer with 512 nodes, followed by a dropout layer with a rate of 0.4. This improved my accuracy marginally.

A smaller dropout rate of 0.2 increased the accuracy marginally, and training for 10 epochs gave the model more iterations before I saved it for evaluation.

Next, I added two more dense + dropout blocks with a 0.2 dropout rate, moving down from 512 to 256 to 128 nodes across the fully connected layers. This reduced my training accuracy considerably.

At this point I realized that to benefit from transfer learning, I don’t want the base model to update its weights, so I need to mark its layers as not trainable. This is the key insight from this project: not only did my model accuracy improve, but the performance during training increased significantly! Since the VGG16 layers no longer needed weight updates, only the final dense layers did, cutting out considerable computation that had been slowing training down.
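Freezing the base model is only a couple of lines; a sketch continuing from above (re-compiling afterwards so the change takes effect):

```python
# Freeze the pretrained VGG16 layers so only the new classifier head
# receives weight updates; the ImageNet weights stay fixed.
for layer in base_model.layers:
    layer.trainable = False

model.compile(optimizer=K.optimizers.Adam(),
              loss='categorical_crossentropy',
              metrics=['accuracy'])
```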

Maybe I spoke too soon: while the model started off with higher accuracy in the early epochs, after a few epochs it took a turn for the worse.

It seems like the model is overfitting and has updated its parameters away from a good minimum. Maybe my learning rate is too high? I could try reducing the learning rate, or implementing learning rate decay. First, though, I’ll try the built-in applications.vgg16.preprocess_input method: each model prefers a different scale of input values, so it’s best to use the preprocessing method that matches the model.
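The preprocessing call itself is straightforward; a sketch assuming the raw CIFAR-10 arrays are already loaded as x_train and x_test:

```python
# Each Keras application ships a matching preprocess_input that scales
# and re-centers pixel values the way the original network was trained.
x_train = K.applications.vgg16.preprocess_input(x_train.astype('float32'))
x_test = K.applications.vgg16.preprocess_input(x_test.astype('float32'))
```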

This kept the model from dropping off for another epoch or so, but the same problem remained. While reading about others’ attempts to use VGG16 for transfer learning, I noticed they were using the ReLU activation on their final fully connected layers, so I’m trying this next.

VAST improvement!!

My next thought was to rescale the input values. Since the image values range from 0 to 255, I simply divided each element of the array by 255, normalizing the values to between 0 and 1.

This had hardly any effect; I suspect the vgg16.preprocess_input function had already scaled the values appropriately for the VGG16 network, making my manual rescaling redundant. I’ll leave it in, though, as it did seem to provide a marginal improvement.

I was curious whether the number of nodes in each dense layer would have a significant effect, so I set them all to 256. This resulted in a very slight degradation in performance. Next I tried 128, 64, and 32: further degradation.

At 256, 128, 64, and then at 1024, 256, 32: slightly better, but now I’ll try adding a few more layers and going to 1024, 512, 256, 128, 64, 32.

Still worse. Removing the bottom few layers and going back to 512, 256, 128.

So far this is the best model. Next, I’ll try setting my batch size to 32, and shuffling the batches between each epoch.

Hardly an improvement. Now, I’ll try an average pooling layer before the fully connected layers.

Considerable loss of performance. For now I’ll try adding more epochs to train longer and see where I end up.

Roughly 19 epochs to meet the 87% hurdle.

At this point I was able to save the model, reload it, and evaluate it. I then realized that I had been training and testing on the same slice of the test data, which is why the model evaluated at 99%. After separating the training set from the testing set, I reran the evaluation to make sure my model generalized beyond the training data. I also reran the training and noticed my accuracy was much lower, around 77% after 19 epochs with 50,000 training examples per epoch. So my next thought was to try other applications. Inception and VGG16 gave similar results with everything else set up the same; DenseNet121 gave me somewhat better results.
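The fix was simply to keep the official CIFAR-10 splits separate from the start; a sketch of the corrected data handling (variable names are mine, with epochs set to the 19 mentioned above):

```python
# Load the official CIFAR-10 split: 50,000 training / 10,000 test images.
(x_train, y_train), (x_test, y_test) = K.datasets.cifar10.load_data()

# One-hot encode the labels for the 10-way softmax.
y_train = K.utils.to_categorical(y_train, 10)
y_test = K.utils.to_categorical(y_test, 10)

# Train only on the training split; evaluate only on the held-out test split.
model.fit(x_train, y_train, epochs=19, validation_data=(x_test, y_test))
loss, acc = model.evaluate(x_test, y_test)
```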

By now I had become frustrated by how long training takes, even on an RTX 2070 Super. I realized I could maybe reduce the computational cost by moving my flattening layer above the fully connected layers, right after the base model. Around the same time I came across an article about adding batch normalization before every layer, which normalizes the activations flowing into each layer; this stabilizes and speeds up training and also reduces the likelihood of overfitting and vanishing gradients. Adding these layers vastly improved the model.
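The head around this point looked roughly like the sketch below, reusing the inputs and resize layer from the earlier sketches; the base model in practice was DenseNet121, and the exact layer sizes varied between experiments:

```python
# Classifier head with batch normalization in front of each dense layer;
# flattening happens once, right after the (frozen) base model.
x = base_model(resized)
x = K.layers.Flatten()(x)
x = K.layers.BatchNormalization()(x)
x = K.layers.Dense(512, activation='relu')(x)
x = K.layers.Dropout(0.2)(x)
x = K.layers.BatchNormalization()(x)
x = K.layers.Dense(256, activation='relu')(x)
x = K.layers.Dropout(0.2)(x)
outputs = K.layers.Dense(10, activation='softmax')(x)
model = K.models.Model(inputs=inputs, outputs=outputs)
```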

With DenseNet on the proper training set, I achieved 84% after 10 epochs. Since the training set is 50,000 examples, it takes much longer to train than when I was mistakenly using the test set. So rather than let it train for 20 epochs or more, I decided to try RMSProp; I read in a Stack Overflow post that it’s useful for sparse problems. I’m not sure exactly what sparse refers to here, whether it means the size of the dataset or the number of connections between layers, but it seems to have improved the model considerably.
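Swapping optimizers is a one-line change at compile time; a minimal sketch assuming the same loss and metrics as before:

```python
# Swap Adam for RMSprop (default hyperparameters assumed).
model.compile(optimizer=K.optimizers.RMSprop(),
              loss='categorical_crossentropy',
              metrics=['accuracy'])
```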

When I decided to let 50 epochs run overnight, I remembered the ModelCheckpoint callback, so that the best weights seen during training would be the ones saved.
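The callback setup is only a couple of lines; a sketch assuming the model and data from earlier (under TF 1.x Keras the monitored metric is named val_acc):

```python
# Save only the best weights seen so far, judged by validation accuracy,
# so a crash partway through the run doesn't lose everything.
checkpoint = K.callbacks.ModelCheckpoint('cifar10.h5',
                                         monitor='val_acc',
                                         save_best_only=True)
model.fit(x_train, y_train, epochs=50,
          validation_data=(x_test, y_test),
          callbacks=[checkpoint])
```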

Woke up to a dead kernel :’(

This, though, is with 10 epochs, batch normalization before each layer, and RMSProp. Because I want to see how each change affected the model, I’m going to remove both (no batch norm, and back to Adam instead of RMSProp) and then re-add each to see how the performance changes.

DenseNet with RMSProp and batch norm: the evaluation was very poor, so my model is overfitting.

No batch norm and with Adam: I didn’t screenshot it, but training accuracy was around 77%, again with a very low evaluation accuracy.

Now, a week into the project, the curriculum team has released a video with some explanations. They explain that for small datasets, adding too many layers can result in overfitting, as small changes in features get exaggerated, and that we shouldn’t train over too many epochs for the same reason. So I’m removing the last two dense layers, along with the batch_norm layers before them, and trying again with only 4 epochs. Now that I think of it, I should have realized this when I was experimenting with the layers above and saw performance decrease as I added more fully connected layers; I should have thought to try fewer layers.

With a single dense layer of 512 nodes, ReLU activated, a dropout of 0.2, and batch norm in between, my training accuracy exceeds 90% but my validation accuracy is still lagging by about half, meaning I’m badly overfitting. The val_acc also decreased with each epoch, meaning the overfitting got worse as training went on.

I’ll try increasing the dropout rate to 0.5, hoping this keeps the nodes from co-adapting quite so much… This made it considerably worse, down to 0.3 val_acc.

Trying to replace my flattening layer with global average pooling. A dropout of 0.4 initially seemed to reduce the accuracy. I then tried adding multiple dense + dropout blocks, and finally dropout with 512, 256, 128 nodes and the base model frozen (trainable set to False).
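The Flatten vs. global average pooling swap itself is a single-layer change; a sketch of the two options side by side (the base model here is whichever application is being tested):

```python
# Flatten keeps every spatial position of every feature map as its own
# feature, so the next Dense layer can carry millions of parameters.
x_flat = K.layers.Flatten()(base_model.output)

# GlobalAveragePooling2D averages each feature map down to a single value,
# producing a much smaller vector and far fewer parameters to overfit with.
x_gap = K.layers.GlobalAveragePooling2D()(base_model.output)
```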

I kept experimenting with different applications, pooling vs. flattening layers, dropout rates, with and without batch_norm layers, and so on. With inception_resnet_v2, the evaluation wasn’t too bad, with a validation accuracy of 78%.

Removing the batch_norm layer reduced the validation accuracy slightly.

Trying 500 nodes for the dense layer instead of 512. I’m also going to try resizing the images to the original size the model used on ImageNet, which is 299. The project restricts us to a single import of keras, and since I used K.backend.resize_images I could only scale by an integer factor of the original size, so 288 was the closest value I could reach. Faculty mentioned reduced results with a shape of 150, so I wanted to see whether a shape close to the one the model trained on for ImageNet would improve the results.

Slightly better. Going to train on more epochs and call it a week.

Results

In the end I decided to keep inception_resnet_v2 and train over 10 epochs. I am certain there is a better model that could train more efficiently with fewer iterations, but for now I’m thankful for the learning experience. My final evaluated accuracy was:

Discussion

When we have smaller datasets, it’s better to use simpler models to prevent small changes in the feature set from being exaggerated and the model overfitting our data. The input shape at the start of the network, as well as between different layers, can have an effect not only on the output but also, considerably, on training time. Also, optimization techniques aren’t universal; they are better suited to some types of models than others. I hope to build better intuition about this as I gain more experience.

I also discovered that the model improved when I resized the inputs to the same size the model used when training on ImageNet. Doing that exactly would have violated one of our restrictions, though: we were only allowed a single import of keras as K, so using tf.image.resize_images() was disallowed (the tf version resizes to arbitrary dimensions, whereas the keras backend version only scales up by an integer factor).

Near the end of my training I discovered a neat graphic in an article that I’m sure I’ll reference again in the future when choosing which model to run; in it, inception_resnet_v2 sits near the center with the third-highest top-1 accuracy.

Acknowledgements

I received tips from, and had discussions with, a few fellow peers as well as Holberton staff. Many of the references below, and more I’ve forgotten to include, helped me think of things to try and troubleshoot potential issues.

One response from staff that I found especially insightful, regarding the effect of flattening vs. pooling layers, is worth including here:

“Third, global average pooling can be used to reduce the dimensionality of the feature maps output of the convolutional layer to minimize overfitting by reducing the total number of parameters in the model. The Flatten layer will always have at least as much parameters as the GlobalAveragePooling2D layer. If the final tensor shape before flattening is still large, for instance (16, 240, 240, 128), using Flatten will make an insane amount of parameters: 240*240*128 = 7,372,800. At that moment, GlobalAveragePooling2D might be preferred in most cases. If you used MaxPooling2D and Conv2D so much that your tensor shape before flattening is like (16, 1, 1, 128), it won’t make a difference.
If you’re overfitting, you might want to try GlobalAveragePooling2D.”

References

Brownlee, J. (2019). Transfer Learning in Keras with Computer Vision Models. Retrieved from https://machinelearningmastery.com/how-to-use-transfer-learning-when-developing-convolutional-neural-network-models/

fchollet. (2015). Transfer learning & fine-tuning. Retrieved from https://keras.io/guides/transfer_learning/

Hamdi, B. (2021). Transfer learning with Keras using DenseNet121. Retrieved from https://bouzouitina-hamdi.medium.com/transfer-learning-with-keras-using-densenet121-fffc6bb0c233

Keras. (2021). Keras Applications. Retrieved from https://keras.io/api/applications/

Leo, M. S. (2020). How to Choose the Best Keras Pre-Trained Model for Image Classification. Retrieved from https://towardsdatascience.com/how-to-choose-the-best-keras-pre-trained-model-for-image-classification-b850ca4428d4

Merhben, O. (2021). Transfer learning on CIFAR10. Retrieved from https://www.youtube.com/watch?v=xr46kbl2T8Q&feature=youtu.be

Sarkar, D. (2018). A Comprehensive Hands-on Guide to Transfer Learning with Real-World Applications in Deep Learning. Retrieved from https://towardsdatascience.com/a-comprehensive-hands-on-guide-to-transfer-learning-with-real-world-applications-in-deep-learning-212bf3b2f27a