import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt

plt.rcParams['figure.figsize'] = (10, 8)
plt.style.use('seaborn')


## Hypothesis and cost function

From the previous post, we covered the definition of hypothesis and cost function(also known as Loss function)

$$\text{Hypothesis: } \quad H(x) = Wx + b \\ \text{Cost: } \quad cost(W, b) = \frac{1}{m} \sum_{i=1}^m(H(x_i) - y_i)^2$$

Actually, when we try to explain the hypothesis, we usually embed the bias term into the weight vector. So it just simplifies the form like this,

$$\text{Hypothesis: } \quad H(x) = Wx \\ \text{Cost:} \quad cost(W) = \frac{1}{m} \sum_{i=1}^m(W x_i - y_i)^2$$

Through the learning process, we want to find the optimal weight vector($W$) for minimizing cost function. As I explained earlier post, the common approach to find the minimum value is to calculate the gradient of each point,then find the pattern of decreasing(descent). This kind approach is called gradient descent, and it is used lots of minimization problems.

## Gradient Descent Algorithm

The detailed process of gradient descent is like this:

• Start with initial guess,

• Usually select 0,0 (but we can select other values at the beginning)
• Keeping changing $W$ and $b$ a little bit with learning rate to try and reduce cost
• Each time, we can change its parameter, then we can calculate the gradient which reduces cost function the most possible

• Repeat
• Do so until it converge to a minimum

But actually, we can not sure that the minimum value that we found from Gradient Descent is global optimum. In some cases, the minimum value may be the local minimum.

In the figure, it is easy to find the global minimum through Gradient descent. When we start with initial weights, through the direction of decreasing gradient, it updates its weights to find the point of minimum cost. Wherever we change the inital points, it may be found the global minimum point.

So, how can we update the weight while finding the minimum point? We use "gradient" for each data point, it requires derivates of cost function. The formal definition of weight update is mentioned below.

\begin{aligned} W &:= W - \alpha * \frac{\partial}{\partial W} \frac{1}{2m} \sum_{i=1}^m (W(x_i) - y_i)^2 \\ &:= W - \alpha * \frac{1}{2m} \sum_{i=1}^{m} 2 (W (x_i) - y_i) x_i \\ & := W - \alpha \frac{1}{m} \sum_{i=1}^m (W(x_i) - y_i) x_i \end{aligned}

Here, $\alpha$ is learning rate. And there is a assumption in formal definition that coefficient of cost function is changed from $\frac{1}{m}$ to $\frac{1}{2m}$. That's because, when the derivate of 2nd order term generates 2, so it is easy to simplify the constant term when we have $2m$, and it is not affect too much in whole calculation.

As a result, weight vector is changed with respect to $\alpha$ while updating.

## Cost function in pure Python

Let's build the model with python. This may be the duplicate of previous post content.

X = np.array([1, 2, 3])
Y = np.array([1, 2, 3])

# Cost function
def cost_func(W, X, Y):
c = 0
for x, y in zip(X, Y):
c += (W*x - y) ** 2
return c / len(X)

cost_hist = []

for feed_W in np.linspace(-3, 5, num=15):
curr_cost = cost_func(feed_W, X, Y)
cost_hist.append(curr_cost)
print("{:6.3f} | {:10.5f}".format(feed_W, curr_cost))

-3.000 |   74.66667
-2.429 |   54.85714
-1.857 |   38.09524
-1.286 |   24.38095
-0.714 |   13.71429
-0.143 |    6.09524
0.429 |    1.52381
1.000 |    0.00000
1.571 |    1.52381
2.143 |    6.09524
2.714 |   13.71429
3.286 |   24.38095
3.857 |   38.09524
4.429 |   54.85714
5.000 |   74.66667


Through cost function, we calculated the MSE, and measured cost for each weight values. As you can see, cost is minimized when $W=1$. We can also confirm from the visualization of the cost value.

plt.figure(figsize=(10, 8));
plt.plot(np.linspace(-3, 5, num=15), cost_hist);
plt.title('Cost function');
plt.xlabel('W');
plt.ylabel('cost');


## Cost function in Tensorflow

We can derive the result using Tensorflow. In this time, we can calculate the MSE with tf.reduce_mean

def cost_func_tf(W, X, Y):
h = X * W
return tf.reduce_mean(tf.square(h - Y))

W_values = np.linspace(-3, 5, num=15)
cost_hist = []

for feed_W in W_values:
curr_cost = cost_func_tf(feed_W, X, Y)
cost_hist.append(curr_cost)
print("{:6.3f} | {:10.5f}".format(feed_W, curr_cost))

-3.000 |   74.66667
-2.429 |   54.85714
-1.857 |   38.09524
-1.286 |   24.38095
-0.714 |   13.71429
-0.143 |    6.09524
0.429 |    1.52381
1.000 |    0.00000
1.571 |    1.52381
2.143 |    6.09524
2.714 |   13.71429
3.286 |   24.38095
3.857 |   38.09524
4.429 |   54.85714
5.000 |   74.66667


We can get same result here as described before.

plt.figure(figsize=(10, 8));
plt.plot(np.linspace(-3, 5, num=15), cost_hist);
plt.title('Cost function with tf');
plt.xlabel('W');
plt.ylabel('cost');


## Gradient Descent in Tensorflow

As we explained it earlier, we can express the gradient descent algorithm in Tensorflow. Remember that the definition of gradient descent is that,

$$W := W - \alpha \frac{1}{m} \sum_{i=1}^m (W(x_i) -y_i) x_i$$

Note: For the test, we set the learning rate($\alpha$) to 0.01, and epochs to 300
X = [1., 2., 3., 4.]
Y = [1., 3., 5., 7.]

W = tf.Variable(tf.random.normal([1], -100., 100.))
alpha = 0.01
cost_hist = []
W_hist = []

for e in range(300):
h = W * X
cost = tf.reduce_mean(tf.square(h - Y))

# Gradient Descent
gradient = tf.reduce_mean((W * X - Y) * X)
descent = W - alpha * gradient
# update weight
W.assign(descent)

if e % 10 == 0:
cost_hist.append(cost.numpy())
W_hist.append(W.numpy()[0])
print('{:5} | {:10.4f} | {:10.6f}'.format(e, cost.numpy(), W.numpy()[0]))

    0 |  3147.8733 |  20.616623
10 |   662.1220 |  10.356780
20 |   139.3744 |   5.651799
30 |    29.4417 |   3.494178
40 |     6.3231 |   2.504731
50 |     1.4614 |   2.050988
60 |     0.4389 |   1.842910
70 |     0.2239 |   1.747489
80 |     0.1787 |   1.703730
90 |     0.1692 |   1.683663
100 |     0.1672 |   1.674461
110 |     0.1668 |   1.670241
120 |     0.1667 |   1.668306
130 |     0.1667 |   1.667418
140 |     0.1667 |   1.667011
150 |     0.1667 |   1.666825
160 |     0.1667 |   1.666739
170 |     0.1667 |   1.666700
180 |     0.1667 |   1.666682
190 |     0.1667 |   1.666674
200 |     0.1667 |   1.666670
210 |     0.1667 |   1.666668
220 |     0.1667 |   1.666667
230 |     0.1667 |   1.666667
240 |     0.1667 |   1.666667
250 |     0.1667 |   1.666667
260 |     0.1667 |   1.666667
270 |     0.1667 |   1.666667
280 |     0.1667 |   1.666667
290 |     0.1667 |   1.666667

plt.figure(figsize=(10, 8));
plt.scatter(W_hist, cost_hist);
plt.plot(W_hist, cost_hist, color='red');
plt.axvline(x=min(W_hist), linestyle='--');


As you can see from the figure, Gradient Descent tends to find the $W$ for minimizing cost.

## Summary

In this post, we tried to explain what the hypothesis and cost function (again!!), and we want to find the best Weight vector (and also bias) for minimizing error between hypothesis and actual data. To do this, Gradient Descent Algorithm is introduced, and test it with real code (both python and tensorflow)