# Several Tips for Improving Neural Networks

This post covers several ways to improve the performance of a neural network: the ReLU activation function, weight initialization, Dropout, and Batch Normalization.

```
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
plt.rcParams['figure.figsize'] = (16, 10)
plt.rcParams['text.usetex'] = True
plt.rc('font', size=15)
```

## ReLU Activation Function

### Problem of Sigmoid

Previously, we walked through what happens inside a neural network. Passing the input through the network to produce an output is called **forward propagation**. From the output, we can measure the error between the prediction and the actual target. Of course, we want to train the network to minimize this error, so we differentiate the error with respect to the weights and update them accordingly. This is called **backpropagation**.

$$g(z) = \frac{1}{1 + e^{-z}} $$

This is the **sigmoid** function. We used it to produce the probability in binary classification, and its range is from 0 to 1. When a layer uses the sigmoid as its activation, the derivative of the sigmoid appears in backpropagation. Near the middle of the curve (around $z = 0$) this causes no trouble, since the gradient is at its largest there. The problem appears when the input $z$ goes toward $\infty$ or $-\infty$: the curve flattens out, so the gradient of the sigmoid goes to 0 in both directions. As we saw when we covered the chain rule in a previous post, the gradient of each later step is multiplied into the overall gradient. So if some nodes are saturated, the overall gradient is pushed toward 0, and the effect gets worse as more layers are stacked. This problem is called the **vanishing gradient**. The gradient no longer carries useful information, so it is hard to update the weights.
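To make the saturation concrete, here is a minimal sketch in plain Python. The helper names `sigmoid` and `sigmoid_grad` and the choice of 10 layers are assumptions for illustration only. It uses the identity $g'(z) = g(z)(1 - g(z))$ and multiplies the best-case gradient across several layers, the way the chain rule would.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    # Derivative of the sigmoid: g'(z) = g(z) * (1 - g(z))
    s = sigmoid(z)
    return s * (1.0 - s)

# The gradient peaks at z = 0 and shrinks quickly as |z| grows
for z in [-10.0, -2.0, 0.0, 2.0, 10.0]:
    print(f"z = {z:6.1f}  gradient = {sigmoid_grad(z):.6f}")

# Chain rule through 10 stacked sigmoids, taking the best case at every layer
best_case_product = sigmoid_grad(0.0) ** 10
print(f"best-case product over 10 layers: {best_case_product:.2e}")
```

Even in the best case the per-layer factor is only 0.25, so the product shrinks geometrically with depth.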

### ReLU

Here, we introduce a new activation function, the **Rectified Linear Unit** (ReLU for short). A simple linear unit looks like this,

$$ f(x) = x $$

ReLU keeps only the range above 0 and ignores values less than 0. We can express it like this,

$$ f(x) = \max(0, x) $$

In other words, when the input is less than 0 the output is 0, and when the input is larger than 0 the output is the input itself.

So how does its gradient behave? If $x$ is larger than 0, the gradient is 1. Unlike the sigmoid, no matter how many layers are stacked, the gradient is preserved and passed on to the next step of the chain rule as long as the input is positive. There is still a small problem when the input is less than 0: in this range the gradient is 0, so no gradient flows through that unit. This resembles the saturated sigmoid case, but at least the gradient is kept intact whenever the input is positive.
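The gradient behavior described above can be sketched with NumPy. The function names `relu` and `relu_grad` are illustrative, and treating the gradient at exactly $x = 0$ as 0 is an assumed convention.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    # 1 where x > 0, 0 elsewhere (the value at exactly x = 0 is a convention)
    return (x > 0).astype(np.float64)

x = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(relu(x))
print(relu_grad(x))

# With positive inputs, the product of gradients through 10 layers stays 1
chain_product = np.prod(relu_grad(np.full(10, 2.0)))
print(chain_product)
```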

There are other variations that address the vanishing gradient problem, such as the Exponential Linear Unit (ELU), the Scaled Exponential Linear Unit (SELU), Leaky ReLU, and so on.
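As a rough sketch of two of these variants, the NumPy functions below follow the usual textbook definitions of Leaky ReLU and ELU. The `alpha` values are assumptions chosen for illustration and are not necessarily the defaults of any library.

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # Small slope alpha for negative inputs instead of a flat 0
    return np.where(x > 0, x, alpha * x)

def elu(x, alpha=1.0):
    # Smooth exponential curve for negative inputs
    # (np.minimum keeps exp from overflowing on large positive x)
    return np.where(x > 0, x, alpha * (np.exp(np.minimum(x, 0.0)) - 1.0))

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(leaky_relu(x))
print(elu(x))
```

Both keep a nonzero gradient for negative inputs, which is the part where plain ReLU passes nothing back.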

In this example, we will use the MNIST dataset to compare the performance of each activation function.

```
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.datasets import mnist
# Load dataset
(X_train, y_train), (X_test, y_test) = mnist.load_data()
print(X_train.shape, X_test.shape)
# Add a channel axis: [N, 28, 28] -> [N, 28, 28, 1]
X_train = tf.expand_dims(X_train, axis=-1)
X_test = tf.expand_dims(X_test, axis=-1)
print(X_train.shape, X_test.shape)
```

The dimension expansion may look confusing. TensorFlow expects image inputs in the shape `[batch_size, height, width, channel]`, but the MNIST dataset included in Keras has no channel axis. So we add a dimension at the end of the data to represent the channel (MNIST images are grayscale, so the number of channels is 1).

Since the images are grayscale, each pixel value ranges from 0 to 255. Training usually works better when the data is normalized, so we scale the values to the range from 0 to 1.

```
X_train = tf.cast(X_train, tf.float32) / 255.0
X_test = tf.cast(X_test, tf.float32) / 255.0
```

The labels range from 0 to 9 and are categorical, so we need to convert them with one-hot encoding. Keras offers the `to_categorical` API for this. (There are many other ways to do one-hot encoding, so feel free to try your own.)

```
y_train = to_categorical(y_train, num_classes=10)
y_test = to_categorical(y_test, num_classes=10)
```
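As one of those other approaches, here is a small NumPy sketch that indexes into an identity matrix. The `labels` array is made-up sample data, not taken from MNIST.

```python
import numpy as np

labels = np.array([3, 0, 9])  # hypothetical sample labels
# Row i of the 10x10 identity matrix is the one-hot vector for class i
one_hot = np.eye(10, dtype=np.float32)[labels]
print(one_hot.shape)
print(one_hot[0])

# argmax recovers the original integer labels
recovered = np.argmax(one_hot, axis=-1)
print(recovered)
```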

Finally, we are going to implement the network. In this case, we will build it as a class. Note that to implement a model as a class, it needs to inherit from `tf.keras.Model` as its parent class.

**Note:** We add the `training` argument when implementing the `call` function. Its purpose is to separate the behavior between training and test (or inference). It will be used in the Dropout section, later in the post.

```
class Model(tf.keras.Model):
    def __init__(self, label_dim):
        super(Model, self).__init__()
        # Weight initialization (Normal Initializer)
        weight_init = tf.keras.initializers.RandomNormal()
        # Sequential Model
        self.model = tf.keras.Sequential()
        self.model.add(tf.keras.layers.Flatten())  # [N, 28, 28, 1] -> [N, 784]
        for _ in range(2):
            # [N, 784] -> [N, 256] -> [N, 256]
            self.model.add(tf.keras.layers.Dense(256, use_bias=True, kernel_initializer=weight_init))
            self.model.add(tf.keras.layers.Activation(tf.keras.activations.relu))
        # Output layer: [N, 256] -> [N, label_dim] (raw logits, no activation)
        self.model.add(tf.keras.layers.Dense(label_dim, use_bias=True, kernel_initializer=weight_init))

    def call(self, x, training=None, mask=None):
        # Forward `training` so that layers such as Dropout switch modes correctly
        x = self.model(x, training=training)
        return x
```

Next, we need to define the loss function. Here, we will use the softmax cross-entropy loss since our task is multi-class classification. TensorFlow offers a simple API to calculate it: compute the logits (the raw output generated by the model) and pass them in together with the labels.

```
def loss_fn(model, images, labels):
    logits = model(images, training=True)
    loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=labels))
    return loss


# Accuracy function for inference
def accuracy_fn(model, images, labels):
    logits = model(images, training=False)
    predict = tf.equal(tf.argmax(logits, -1), tf.argmax(labels, -1))
    accuracy = tf.reduce_mean(tf.cast(predict, tf.float32))
    return accuracy


# Gradient function
def grad(model, images, labels):
    with tf.GradientTape() as tape:
        loss = loss_fn(model, images, labels)
    return tape.gradient(loss, model.variables)
```
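To see what the softmax cross-entropy loss computes under the hood, here is a minimal NumPy sketch. It is not the TensorFlow implementation; the function name `softmax_cross_entropy` and the sample `logits` and `labels` are assumptions for illustration.

```python
import numpy as np

def softmax_cross_entropy(logits, labels):
    # Subtract the row-wise max for numerical stability
    shifted = logits - logits.max(axis=-1, keepdims=True)
    # log(softmax(logits)) computed via the log-sum-exp trick
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Cross entropy against one-hot labels, one value per example
    return -(labels * log_probs).sum(axis=-1)

logits = np.array([[2.0, 1.0, 0.1],   # fairly confident in class 0
                   [0.0, 0.0, 0.0]])  # no preference between classes
labels = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0]])
per_example = softmax_cross_entropy(logits, labels)
print(per_example, per_example.mean())
```

The second example has uniform logits, so its loss equals $\log 3$, the loss of a random guess over three classes.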

Then, we can set model hyperparameters such as learning rate, epochs, batch sizes and so on.

```
learning_rate = 0.001
batch_size = 128
training_epochs = 1
training_iter = len(X_train) // batch_size
label_dim=10
optimizer = tf.keras.optimizers.Adam(learning_rate=learning_rate)
```

We can build the input pipeline from the original dataset, as we already saw in previous examples. Since feeding the whole dataset at once uses a lot of memory, we slice each dataset into batches of `batch_size`.

```
# Shuffle, split into batches, then prefetch so the next batch is prepared in advance
train_ds = (
    tf.data.Dataset.from_tensor_slices((X_train, y_train))
    .shuffle(buffer_size=100000)
    .batch(batch_size)
    .prefetch(buffer_size=1)
)
# The test set is evaluated as one single batch
test_ds = (
    tf.data.Dataset.from_tensor_slices((X_test, y_test))
    .batch(len(X_test))
    .prefetch(buffer_size=1)
)
```
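A quick sanity check on the batch arithmetic, as a plain-Python sketch. The variable names here are illustrative; the numbers assume the 60,000-image MNIST training split and the batch size of 128 set earlier.

```python
import math

num_examples = 60000  # size of the MNIST training split
batch_size = 128

# Number of full batches (this is what `training_iter` counts)
full_batches = num_examples // batch_size
# Number of batches the dataset yields when the last partial batch is kept
total_batches = math.ceil(num_examples / batch_size)
last_batch_size = num_examples - full_batches * batch_size
print(full_batches, total_batches, last_batch_size)
```

Because `batch` keeps the final partial batch unless `drop_remainder=True` is passed, one epoch runs one step more than `training_iter` in this setup.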

In the training step, we instantiate the model and set up a checkpoint. A checkpoint saves the model during training, so if training fails because of an unexpected external problem, we can reload the model and resume from the last saved point instead of starting over.

```
import os
from time import time


def load(model, checkpoint_dir):
    print(" [*] Reading checkpoints...")
    ckpt = tf.train.get_checkpoint_state(checkpoint_dir)
    if ckpt:
        ckpt_name = os.path.basename(ckpt.model_checkpoint_path)
        checkpoint = tf.train.Checkpoint(dnn=model)
        checkpoint.restore(save_path=os.path.join(checkpoint_dir, ckpt_name))
        # The step counter is the second '-'-separated field of the file name
        counter = int(ckpt_name.split('-')[1])
        print(" [*] Success to read {}".format(ckpt_name))
        return True, counter
    else:
        print(" [*] Failed to find a checkpoint")
        return False, 0


def check_folder(dir):
    if not os.path.exists(dir):
        os.makedirs(dir)
    return dir


# Directories for checkpoints and logs
checkpoint_dir = 'checkpoints'
logs_dir = 'logs'
model_dir = 'nn_softmax'
checkpoint_dir = os.path.join(checkpoint_dir, model_dir)
check_folder(checkpoint_dir)
checkpoint_prefix = os.path.join(checkpoint_dir, model_dir)
logs_dir = os.path.join(logs_dir, model_dir)
```
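The `load` function recovers the step counter from the checkpoint file name. The sketch below isolates that parsing step. The file name `nn_softmax-469-1` is a hypothetical example: it assumes the prefix carries the counter and that `tf.train.Checkpoint.save` appends its own save number after it. The helper `parse_counter` is also an illustrative name.

```python
def parse_counter(ckpt_name):
    # 'nn_softmax-469-1' -> ['nn_softmax', '469', '1'] -> 469
    return int(ckpt_name.split('-')[1])

ckpt_name = 'nn_softmax-469-1'  # hypothetical name: prefix, step counter, save number
training_iter = 60000 // 128    # full batches per epoch, as in the hyperparameters

counter = parse_counter(ckpt_name)
start_epoch = counter // training_iter
print(counter, start_epoch)
```

Note that this parsing relies on the model directory name containing no `-` character; a name such as `nn-softmax` would shift the fields.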

```
model = Model(label_dim)
start_time = time()
# Set checkpoint
checkpoint = tf.train.Checkpoint(dnn=model)
# Restore checkpoint if it exists
could_load, checkpoint_counter = load(model, checkpoint_dir)
if could_load:
    start_epoch = checkpoint_counter // training_iter
    counter = checkpoint_counter
    print(" [*] Load SUCCESS")
else:
    start_epoch = 0
    start_iteration = 0
    counter = 0
    print(" [!] Load failed...")
# train phase
for epoch in range(start_epoch, training_epochs):
    for idx, (train_input, train_label) in enumerate(train_ds):
        grads = grad(model, train_input, train_label)
        optimizer.apply_gradients(grads_and_vars=zip(grads, model.variables))
        train_loss = loss_fn(model, train_input, train_label)
        train_accuracy = accuracy_fn(model, train_input, train_label)
        # test_ds holds a single batch with the whole test set
        for test_input, test_label in test_ds:
            test_accuracy = accuracy_fn(model, test_input, test_label)
        print(
            "Epoch: [%2d] [%5d/%5d] time: %4.4f, train_loss: %.8f, train_accuracy: %.4f, test_Accuracy: %.4f"
            % (epoch, idx, training_iter, time() - start_time, train_loss, train_accuracy,
               test_accuracy))
        counter += 1
    # Save a checkpoint at the end of each epoch
    checkpoint.save(file_prefix=checkpoint_prefix + '-{}'.format(counter))
```

After training, we get a model with a training accuracy of 98.9% and a test accuracy of 97.1%. A checkpoint was also generated, so we don't need to train from the beginning again; we can just load the model.

```
could_load, checkpoint_counter = load(model, checkpoint_dir)
if could_load:
    start_epoch = checkpoint_counter // training_iter
    counter = checkpoint_counter
    print(" [*] Load SUCCESS")
else:
    start_epoch = 0
    start_iteration = 0
    counter = 0
    print(" [!] Load failed...")
# train phase (skipped when the restored counter already covers all epochs)
for epoch in range(start_epoch, training_epochs):
    for idx, (train_input, train_label) in enumerate(train_ds):
        grads = grad(model, train_input, train_label)
        optimizer.apply_gradients(grads_and_vars=zip(grads, model.variables))
        train_loss = loss_fn(model, train_input, train_label)
        train_accuracy = accuracy_fn(model, train_input, train_label)
        for test_input, test_label in test_ds:
            test_accuracy = accuracy_fn(model, test_input, test_label)
        print(
            "Epoch: [%2d] [%5d/%5d] time: %4.4f, train_loss: %.8f, train_accuracy: %.4f, test_Accuracy: %.4f"
            % (epoch, idx, training_iter, time() - start_time, train_loss, train_accuracy,
               test_accuracy))
        counter += 1
    # Save a checkpoint at the end of each epoch
    checkpoint.save(file_prefix=checkpoint_prefix + '-{}'.format(counter))
```

## Weight Initialization

The purpose of gradient descent is to find the point that minimizes the loss.

On a simple, bowl-shaped loss surface, it does not matter where we start: when we apply gradient descent, we can find the minimum point. But what if the loss surface has several hills and valleys? How can we find the minimum point with gradient descent then?

Previously, we initialized the weights by sampling randomly from a normal distribution. But if the weights happen to be initialized at an unlucky point (call it $A$), we cannot reach the global minimum, only a local minimum. Or we may get stuck at a saddle point.
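The effect of the starting point can be sketched on a one-dimensional toy loss. The function $x^4 - 4x^2 + x$, the learning rate, and the two starting points are all assumptions picked for illustration; they are not part of the MNIST example.

```python
def loss(x):
    # Toy loss with one global minimum (x < 0) and one local minimum (x > 0)
    return x**4 - 4 * x**2 + x

def loss_grad(x):
    return 4 * x**3 - 8 * x + 1

def gradient_descent(x0, learning_rate=0.01, steps=1000):
    x = x0
    for _ in range(steps):
        x -= learning_rate * loss_grad(x)
    return x

x_left = gradient_descent(-2.0)   # starts on the left side
x_right = gradient_descent(2.0)   # starts on the right side
print(f"start -2.0 -> x = {x_left:.3f}, loss = {loss(x_left):.3f}")
print(f"start  2.0 -> x = {x_right:.3f}, loss = {loss(x_right):.3f}")
```

Both runs converge to a point where the gradient is zero, but only the run started on the left reaches the lower of the two minima.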

There are many approaches to avoid getting stuck in a local minimum or at a saddle point. One of them is to initialize the weights according to some rule. **Xavier initialization** is one such rule. Instead of sampling from a fixed normal distribution, Xavier initialization samples the weights from a distribution with variance,

$$ Var_{Xe}(W) = \frac{2}{\text{Channel}_{\text{in}} + \text{Channel}_{\text{out}}} $$

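As you can see, the numbers of input and output channels determine how the weights are sampled. This keeps the scale of the signals roughly the same from layer to layer, which makes it more likely that training reaches a good minimum. For the details, please check the original paper (Glorot and Bengio, 2010).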

**Note:** TensorFlow layer APIs have a weight initialization argument (`kernel_initializer`), and its default value is `glorot_uniform`. Xavier initialization is also called Glorot initialization, after Xavier Glorot, the first author of the paper that introduced it.

**He Initialization** is another way to initialize weights, designed with the ReLU activation function in mind. Similar to Xavier initialization, He initialization samples the weights from a distribution with variance,

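$$ Var_{He}(W) = \frac{4}{\text{Channel}_{\text{in}} + \text{Channel}_{\text{out}}} $$

Note that the He initializer in Keras (`he_uniform`) uses the variance $2 / \text{Channel}_{\text{in}}$, which matches the formula above when the numbers of input and output channels are equal.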

### Code

In the previous example, we initialized the weights from a normal distribution. If we want to change this to Xavier or He, we can define `weight_init` like this,

```
# Xavier Initializer
weight_init = tf.keras.initializers.glorot_uniform()
# He Initializer (use this line instead of the one above)
weight_init = tf.keras.initializers.he_uniform()
```
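To check the variances numerically, the NumPy sketch below draws weights the way the Keras documentation describes for `glorot_uniform` and `he_uniform`: uniformly on $[-\text{limit}, \text{limit}]$ with limit $\sqrt{6 / (\text{in} + \text{out})}$ and $\sqrt{6 / \text{in}}$ respectively. This is a re-implementation for illustration, not the Keras code; the layer sizes 784 and 256 are taken from the first hidden layer of the model above.

```python
import numpy as np

rng = np.random.default_rng(0)
fan_in, fan_out = 784, 256

# A uniform distribution on [-limit, limit] has variance limit**2 / 3
glorot_limit = np.sqrt(6.0 / (fan_in + fan_out))
he_limit = np.sqrt(6.0 / fan_in)

w_glorot = rng.uniform(-glorot_limit, glorot_limit, size=(fan_in, fan_out))
w_he = rng.uniform(-he_limit, he_limit, size=(fan_in, fan_out))

print("glorot:", w_glorot.var(), "expected:", 2.0 / (fan_in + fan_out))
print("he:    ", w_he.var(), "expected:", 2.0 / fan_in)
```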

## Dropout

Suppose a trained model falls into one of the following three cases: under-fitting, a good fit, or over-fitting.

**Under-fitting** means that the trained model doesn't predict well even on the training dataset. Of course, it doesn't work well on the test dataset either, which is unseen during training. This is clearly a problem we need to care about. But **over-fitting** is a problem too. Over-fitting is the situation where the trained model works well on the training dataset but not on the test dataset, because the model has not learned to generalize. Several approaches can handle over-fitting, such as training the model on a larger dataset. Here we introduce another one, Dropout.

Previously, we used every node of each layer when we built the model. Instead of using all the nodes in a layer, we can disable some of them with a given probability during training. For example, with a drop rate of 50%, each node is dropped with probability 0.5, so on average only half of the nodes in the layer are used at each training step.

Thanks to Dropout, we can improve model performance in terms of generalization.

### Code

TensorFlow provides a Dropout layer in its API. If you want to use it, you can add it after each hidden layer like this,

```
# Inside Model.__init__
for _ in range(2):
    # [N, 784] -> [N, 256] -> [N, 256]
    self.model.add(tf.keras.layers.Dense(256, use_bias=True, kernel_initializer=weight_init))
    self.model.add(tf.keras.layers.Activation(tf.keras.activations.relu))
    self.model.add(tf.keras.layers.Dropout(rate=0.5))
```
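What the Dropout layer does can be sketched in NumPy as "inverted dropout": during training, units are zeroed at random and the survivors are scaled by $1 / (1 - \text{rate})$, while at inference the input passes through unchanged. The function below is an illustrative re-implementation under that assumption, not the Keras source; it also shows why the `training` argument matters.

```python
import numpy as np

def dropout(x, rate, training, rng):
    if not training:
        return x  # inference: every unit is used, nothing is rescaled
    # Keep each unit with probability (1 - rate)
    keep_mask = rng.random(x.shape) >= rate
    # Scale the kept units so the expected activation stays the same
    return x * keep_mask / (1.0 - rate)

rng = np.random.default_rng(0)
x = np.ones((4, 8))
train_out = dropout(x, rate=0.5, training=True, rng=rng)
test_out = dropout(x, rate=0.5, training=False, rng=rng)
print(train_out)
print(test_out)
```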

## Batch Normalization

This section is about how the distribution of information changes inside the network. If the inputs and outputs of each layer are normally distributed, the trained model may work well. But what if the distribution breaks down while the information passes through the hidden layers?

Even if the data at the input layer is normally distributed, its mean and variance may shift and change as it moves through the layers. This is called **Internal Covariate Shift**. What can we do to avoid this?

If we recall statistics, there is a way to convert a distribution to one with zero mean and unit variance. Yes, it is **standardization**. We can apply it to each mini-batch and then reshape the distribution like this,

$$ \bar{x} = \frac{x - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} \qquad \hat{x} = \gamma \bar{x} + \beta $$

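Here $\mu_B$ and $\sigma_B^2$ are the mean and variance of the mini-batch, and $\epsilon$ is a small constant added for numerical stability, so $\bar{x}$ has approximately 0 mean and 1 variance. The learnable parameters $\gamma$ and $\beta$ then scale and shift it into the distribution that works best for the network.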

### Code

TensorFlow also provides a BatchNormalization layer in its API. If you want to use it, you can add it to each hidden layer, between the Dense layer and the activation, like this,

```
# Inside Model.__init__
for _ in range(2):
    # [N, 784] -> [N, 256] -> [N, 256]
    self.model.add(tf.keras.layers.Dense(256, use_bias=True, kernel_initializer=weight_init))
    self.model.add(tf.keras.layers.BatchNormalization())
    self.model.add(tf.keras.layers.Activation(tf.keras.activations.relu))
```
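The standardization formula from this section can be sketched in NumPy for the training-time case. The function name `batch_norm`, the sample data, and `eps=1e-3` are assumptions for illustration. The real layer additionally learns $\gamma$ and $\beta$ and keeps moving averages of the mean and variance for use at inference, which this sketch leaves out.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-3):
    # Per-feature statistics over the mini-batch (axis 0)
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_bar = (x - mu) / np.sqrt(var + eps)  # roughly zero mean, unit variance
    return gamma * x_bar + beta            # scale and shift

rng = np.random.default_rng(0)
# A mini-batch of 128 examples with 4 features, far from zero mean / unit variance
x = rng.normal(loc=5.0, scale=3.0, size=(128, 4))

out = batch_norm(x, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0), out.std(axis=0))

# gamma and beta move the result to any desired scale and offset
shifted = batch_norm(x, gamma=2.0 * np.ones(4), beta=np.ones(4))
print(shifted.mean(axis=0), shifted.std(axis=0))
```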