A neural network is a collection of neurons arranged in a particular way to mimic how neurons function in the human body. It begins with a very small building block, the neuron, which is nothing but a unit that applies a simple mathematical transformation to its inputs; many of these units are then interlinked into a network. Every single neuron is a weak learner, but trained together as a whole network they can reach remarkably high training accuracies.
The very choice of a neural network over a regular machine learning algorithm is often itself an indication of the depth a problem statement demands.
What is a neural network made up of and why is it so effective?
Any neural network, whether a deep neural network, a convolutional neural network, a recurrent neural network, or another variant, is made up of a few basic elements. The first is the network itself: neurons arranged layer after layer, yielding a network of ‘N’ parameters. The other important elements are the activation functions, optimizers, loss functions, callbacks, and other hyperparameters.
An activation function adds non-linearity to the network, an optimizer gives the loss function a route to its minimum, and callbacks are add-on features that keep a check on the training process.
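To see why the non-linearity matters, consider a quick NumPy sketch (the weights here are random, purely for illustration): without an activation, two stacked linear layers collapse into a single linear map, and a ReLU in between is what breaks that collapse.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))
W2 = rng.normal(size=(2, 4))
x = rng.normal(size=3)

# Two stacked *linear* layers are equivalent to one linear layer:
no_activation = W2 @ (W1 @ x)
collapsed = (W2 @ W1) @ x
assert np.allclose(no_activation, collapsed)

# Inserting a ReLU between them breaks that equivalence, which is
# what lets a deeper network model non-linear functions.
def relu(z):
    return np.maximum(0.0, z)

with_activation = W2 @ relu(W1 @ x)  # almost surely differs from `collapsed`
```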
General approach to increasing the accuracy of a model-
“The more complex the data, the deeper the network” is what the general rule says. But is that always the case? The deeper the network, the less chance any individual weight or bias has to dominate the output on its own, which is good to have in a model: there are more players deciding the output probabilities.
Let us take the case of CNNs, convolutional neural networks, which are used heavily in computer vision. Say we have a classification problem where the data is complicated and there is no clear relation between the features and the labels. Here the depth of the CNN allows the model to extract more and more features from the images, eventually helping it train better on the given data.
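As a minimal sketch of what a single convolutional filter does, here is a valid-mode 2D convolution in NumPy with a hand-picked vertical-edge kernel standing in for weights that a real CNN would learn:

```python
import numpy as np

def conv2d(image, kernel):
    """Valid-mode 2D cross-correlation, as used inside a CNN layer."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Toy 6x6 "image": left half dark, right half bright.
image = np.zeros((6, 6))
image[:, 3:] = 1.0

# A vertical-edge kernel; in a real CNN these weights are learned.
kernel = np.array([[-1.0, 0.0, 1.0],
                   [-1.0, 0.0, 1.0],
                   [-1.0, 0.0, 1.0]])

response = conv2d(image, kernel)  # the response peaks along the edge
```

Stacking such layers lets later filters combine edge responses into progressively higher-level features.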
So having more parameters in a neural network tends to give good accuracy on the data. There are many pre-defined networks such as AlexNet, VGG, ResNet, and others; these have millions of parameters and hence are used extensively.
But every pro comes with a con: training such a large neural network takes a lot of computational time and resources, so this may not be feasible for every application.
Below is a case study giving an insight into how a model can perform well with a limited number of parameters in the network.
Training on MNIST data with fewer than 10,000 total parameters-
MNIST is a dataset, available on Kaggle and many other sources, containing images of handwritten digits from 0 to 9, with 60,000 training images and 10,000 test images. The best model available to date, which was also the competition winner on Kaggle, has a validation accuracy of 99.8% with around 200,000 parameters in total.
The challenge is to get comparably good accuracy using just 10,000 parameters. To start the training process, the network was built with different combinations of convolutional and dense layers, keeping the parameter count below 10,000. The finalized structure of the network is shown in the figure below.
As evident from the figure, the total number of parameters in the network is just 9,657. The challenge was then to increase the accuracy while keeping the network the same. The accuracy of the model after the first iteration was around 84%.
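The exact layer sizes from the figure are not reproduced here, but a parameter budget like this can be verified by hand: a conv layer costs (kernel height × kernel width × input channels + 1) × filters, and a dense layer costs (inputs + 1) × outputs, the +1 being the bias. The stack below is purely illustrative, not the author's architecture.

```python
def conv_params(kh, kw, in_ch, filters):
    # Each filter has kh*kw*in_ch weights plus one bias.
    return (kh * kw * in_ch + 1) * filters

def dense_params(n_in, n_out):
    # Each output unit has n_in weights plus one bias.
    return (n_in + 1) * n_out

# Hypothetical small stack for 28x28x1 MNIST inputs:
total = (
    conv_params(3, 3, 1, 8)      # 80
    + conv_params(3, 3, 8, 16)   # 1,168
    + conv_params(3, 3, 16, 16)  # 2,320
    + dense_params(16, 32)       # 544 (after global average pooling)
    + dense_params(32, 10)       # 330
)
print(total)  # 4442, comfortably under the 10,000-parameter budget
```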
Different image augmentation techniques were applied before passing the images through the network, but the results were nearly the same, as MNIST images are just simple handwritten digits without much complexity.
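The write-up does not list the exact augmentations tried; a typical one for digits is a small pixel shift, which can be sketched in NumPy as follows (the shift amounts are illustrative):

```python
import numpy as np

def shift(img, dy, dx):
    """Shift a 2D image by (dy, dx) pixels, padding the edges with zeros."""
    out = np.zeros_like(img)
    h, w = img.shape
    src_y = slice(max(-dy, 0), h - max(dy, 0))
    src_x = slice(max(-dx, 0), w - max(dx, 0))
    dst_y = slice(max(dy, 0), h - max(-dy, 0))
    dst_x = slice(max(dx, 0), w - max(-dx, 0))
    out[dst_y, dst_x] = img[src_y, src_x]
    return out

# Toy 5x5 "digit" with a single bright pixel at the centre.
digit = np.zeros((5, 5))
digit[2, 2] = 1.0

shifted = shift(digit, 1, 0)  # the digit moves one row down
```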
The main idea was to make the model more sensitive to the outlines of the digits, so masking methods were tried using OpenCV. Surprisingly, this decreased the accuracy to 78-79%.
The next step was to experiment with different activation functions: ReLU, leaky ReLU, ELU, and tanh were all tried, with softmax used in the final layer of the network.
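Each of these activations, and the final-layer softmax, can be written in a few lines of NumPy:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, a=0.01):
    return np.where(x > 0, x, a * x)  # small slope instead of a hard zero

def elu(x, a=1.0):
    return np.where(x > 0, x, a * (np.exp(x) - 1.0))  # smooth negative tail

# tanh is available directly as np.tanh

def softmax(x):
    e = np.exp(x - np.max(x))  # subtract the max for numerical stability
    return e / e.sum()

logits = np.array([2.0, 1.0, -1.0])
probs = softmax(logits)  # class probabilities; sums to 1 up to rounding
```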
In combination with the activation functions, every optimizer was evaluated: SGD, Adam, Adagrad, and Adadelta. The next step was to tune their hyperparameters, for example by applying a learning rate scheduler and other built-in methods from Keras. Callbacks such as early stopping were used to avoid overfitting, and ModelCheckpoint was used to save the best weights.
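The specific schedule and patience values are not given in the write-up; below is a pure-Python sketch of the two ideas, mirroring the behavior of Keras's LearningRateScheduler and EarlyStopping callbacks (all constants are illustrative):

```python
def step_decay(epoch, initial_lr=1e-3, drop=0.5, every=5):
    """Halve the learning rate every `every` epochs (step-decay schedule)."""
    return initial_lr * (drop ** (epoch // every))

def early_stopping(val_losses, patience=3):
    """Return True once the validation loss hasn't improved for `patience` epochs."""
    best_epoch = val_losses.index(min(val_losses))
    return len(val_losses) - 1 - best_epoch >= patience

lrs = [step_decay(e) for e in range(15)]
print(lrs[0], lrs[5], lrs[10])  # 0.001 0.0005 0.00025
```

In Keras, `step_decay` would be passed to `keras.callbacks.LearningRateScheduler`, while `EarlyStopping(patience=...)` and `ModelCheckpoint(save_best_only=True)` cover the other two roles directly.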
After testing different combinations of optimizers, activation functions, and loss functions, the final model reached an accuracy of 99.05%.
Reaching an accuracy above 99% with fewer than 10,000 parameters demonstrates the importance of activation functions, optimizers, loss functions, and their hyperparameters in the process of training a neural network.
To see the whole code, please refer to the link below to the Google Colab notebook.