import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt

plt.rcParams['figure.figsize'] = (10, 8)
plt.style.use('seaborn')

Hypothesis

In simple linear regression, we expressed the hypothesis like this:

$$ H(x) = W x + b $$

But most real-world problems involve several variables. For the case with 3 variables, the hypothesis can be expanded like this:

$$ H(x_1, x_2, x_3) = w_1 x_1 + w_2 x_2 + w_3 x_3 +b $$

The cost function for this hypothesis also changes slightly from the one we saw in the previous post:

$$ \text{cost}(W, b) = \frac{1}{m} \sum_{i=1}^{m} \left( H(x_{i1}, x_{i2}, x_{i3}) - y_i \right)^2 $$
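
To make the formula concrete, here is a minimal sketch of this cost in code, using the tf import from the top of the post (the prediction and target numbers are made up for illustration):

# made-up predictions and targets, just to illustrate the cost formula
h_example = tf.constant([150., 187., 178.])
y_example = tf.constant([152., 185., 180.])

# mean of squared errors, i.e. the cost above
cost_example = tf.reduce_mean(tf.square(h_example - y_example))
print(cost_example.numpy())  # 4.0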

Multi-variables

In general, if we have $n$ variables for regression, the hypothesis will be,

$$ H(x_1, x_2, x_3, \dots, x_n) = w_1 x_1 + w_2 x_2 + w_3 x_3 + \dots + w_n x_n + b $$

Written out term by term, this expression becomes hard to fit on one line as $n$ grows. Instead, we can express it in matrix multiplication form. Suppose $X$ is the vector of the $x$'s and $W$ is the vector of the $w$'s; then the hypothesis in matrix form looks like this:

$$ H(X) = W \cdot X = \begin{bmatrix} w_1 & w_2 & \dots & w_n \end{bmatrix} \cdot \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} = w_1 x_1 + w_2 x_2 + \dots + w_n x_n $$

Or we can reverse the order of $W$ and $X$ (the result is the same):

$$ H(X) = X \cdot W = \begin{bmatrix} x_1 & x_2 & \dots & x_n \end{bmatrix} \cdot \begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_n \end{bmatrix} = w_1 x_1 + w_2 x_2 + \dots + w_n x_n $$
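
As a quick sanity check (the numbers and names here are made up for illustration), the matrix product gives exactly the same value as writing the weighted sum out by hand, using the np import from the top of the post:

# one sample with n = 3 features and a made-up weight vector
X_row = np.array([[1., 2., 3.]])         # shape (1, 3)
W_col = np.array([[0.5], [1.0], [2.0]])  # shape (3, 1)

manual = 0.5 * 1. + 1.0 * 2. + 2.0 * 3.  # w1*x1 + w2*x2 + w3*x3
matmul = (X_row @ W_col)[0, 0]           # X . W
print(manual, matmul)  # 8.5 8.5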

Note: we omit the bias term ($b$) for simplicity. If you want to include the bias, extend each vector by one entry: append a constant 1 to $X$ and stack $b$ under $W$,
$$ H(X) = X \cdot W = \begin{bmatrix} x_1 & x_2 & \dots & x_n & 1 \end{bmatrix} \cdot \begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_n \\ b \end{bmatrix} = w_1 x_1 + w_2 x_2 + \dots + w_n x_n + b $$
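
For example, a minimal sketch reusing the made-up numbers above (with an assumed bias of 0.7): appending a constant 1 to $X$ and stacking $b$ under $W$ reproduces $X \cdot W + b$.

X_aug = np.array([[1., 2., 3., 1.]])            # x with an extra constant 1
W_aug = np.array([[0.5], [1.0], [2.0], [0.7]])  # W with the bias b stacked at the bottom

print(X_row @ W_col + 0.7)  # [[9.2]] -> X.W + b
print(X_aug @ W_aug)        # [[9.2]] -> same value from the augmented form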

The advantage of matrix multiplication is parallelization. In the previous example, we expressed the formula with just one row of $X$. What if $X$ has many rows? The formula expands naturally:

$$ H(X) = X \cdot W = \begin{bmatrix} x_{11} & x_{12} & \dots & x_{1n} \\ x_{21} & x_{22} & \dots & x_{2n} \end{bmatrix} \cdot \begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_n \end{bmatrix} = \begin{bmatrix} w_1 x_{11} + w_2 x_{12} + \dots + w_n x_{1n} \\ w_1 x_{21} + w_2 x_{22} + \dots + w_n x_{2n} \end{bmatrix}$$

We can also expand the dimension of the weight term, meaning that $W$ becomes a matrix with more than one column (another layer of weight vectors). With matrix multiplication, we don't need to write out each term manually, and a GPU (Graphics Processing Unit) is especially fast at computing matrix multiplications thanks to its architecture.
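
As a small sketch (again with made-up numbers), a single tf.matmul over a matrix of inputs computes the prediction for every row at once, which is exactly the work a GPU parallelizes:

X_batch = tf.constant([[1., 2., 3.],
                       [4., 5., 6.]])        # two samples, shape (2, 3)
W_col2 = tf.constant([[0.5], [1.0], [2.0]])  # shape (3, 1)

# one matmul produces both predictions at once, shape (2, 1)
print(tf.matmul(X_batch, W_col2).numpy())
# [[ 8.5]
#  [19. ]]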

Multi-variable Linear Regression in TensorFlow

Suppose we have three independent variables. Then the hypothesis will be

$$ H(x_1, x_2, x_3) = w_1 x_1 + w_2 x_2 + w_3 x_3 + b $$

We prepare the dataset, initialize the weights, and define the hypothesis in TensorFlow.

x1 = [73., 93., 89., 96., 73.]
x2 = [80., 88., 91., 98., 66.]
x3 = [75., 93., 90., 100., 70.]
y = [152., 185., 180., 196., 142.]

# random weights
w1 = tf.Variable(tf.random.normal([1]))
w2 = tf.Variable(tf.random.normal([1]))
w3 = tf.Variable(tf.random.normal([1]))
b = tf.Variable(tf.random.normal([1]))

# Hypothesis
h = w1 * x1 + w2 * x2 + w3 * x3 + b

We can build the training process with Gradient Descent.

learning_rate = 0.00001

for e in range(1000):
    # Record the gradient history of the cost function
    with tf.GradientTape() as tape:
        h = w1 * x1 + w2 * x2 + w3 * x3 + b
        cost = tf.reduce_mean(tf.square(h - y))
    
    # Calculate the gradient of each weight
    w1_grad, w2_grad, w3_grad, b_grad = tape.gradient(cost, [w1, w2, w3, b])
    
    # update the weight
    w1.assign_sub(learning_rate * w1_grad)
    w2.assign_sub(learning_rate * w2_grad)
    w3.assign_sub(learning_rate * w3_grad)
    b.assign_sub(learning_rate * b_grad)
    
    if e % 100 == 0:
        print("epoch: {:5} | cost: {:12.4f}".format(e, cost.numpy()))
epoch:     0 | cost:    7312.8384
epoch:   100 | cost:      35.9158
epoch:   200 | cost:      34.0776
epoch:   300 | cost:      32.3360
epoch:   400 | cost:      30.6861
epoch:   500 | cost:      29.1231
epoch:   600 | cost:      27.6422
epoch:   700 | cost:      26.2393
epoch:   800 | cost:      24.9103
epoch:   900 | cost:      23.6510

We can also express it in matrix multiplication form. To do this, the dataset is merged into a single NumPy array.

data = np.array([
    [73., 80., 75., 152.],
    [93., 88., 93., 185.],
    [89., 91., 90., 180.],
    [96., 98., 100., 196.],
    [73., 66., 70., 142. ]
], dtype=np.float32)

X = data[:, :-1]
y = data[:, [-1]]

W = tf.Variable(tf.random.normal(shape=[X.shape[1], 1]))
b = tf.Variable(tf.random.normal(shape=[1]))

# Replace hypothesis with predict function
def predict(X):
    return tf.matmul(X, W) + b

Note: generating y by slicing with [-1] inside a list may look confusing. We do it that way so that y keeps its matrix (column-vector) shape:
data[:, -1].shape
(5,)
data[:, [-1]].shape
(5, 1)
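
Equivalently, you could reshape the flat slice into a column; this is just an alternative shown for illustration and is not used in the rest of the post:

data[:, -1].reshape(-1, 1).shape
(5, 1)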

The same learning process with gradient descent is applied:

learning_rate = 0.00001

for e in range(2000):
    # Record the gradient history of the cost function
    with tf.GradientTape() as tape:
        cost = tf.reduce_mean(tf.square(predict(X) - y))
    
    # Calculate the gradient of each weight
    W_grad, b_grad = tape.gradient(cost, [W, b])
    
    # update the weight
    W.assign_sub(learning_rate * W_grad)
    b.assign_sub(learning_rate * b_grad)
    
    if e % 100 == 0:
        print("epoch: {:5} | cost: {:12.4f}".format(e, cost.numpy()))
epoch:     0 | cost:  103242.9219
epoch:   100 | cost:       1.8692
epoch:   200 | cost:       1.8591
epoch:   300 | cost:       1.8491
epoch:   400 | cost:       1.8392
epoch:   500 | cost:       1.8294
epoch:   600 | cost:       1.8198
epoch:   700 | cost:       1.8103
epoch:   800 | cost:       1.8009
epoch:   900 | cost:       1.7917
epoch:  1000 | cost:       1.7825
epoch:  1100 | cost:       1.7735
epoch:  1200 | cost:       1.7645
epoch:  1300 | cost:       1.7557
epoch:  1400 | cost:       1.7469
epoch:  1500 | cost:       1.7382
epoch:  1600 | cost:       1.7296
epoch:  1700 | cost:       1.7211
epoch:  1800 | cost:       1.7127
epoch:  1900 | cost:       1.7044

As you can see from the result, the cost decreases dramatically within the first 100 epochs. You can also see the advantage of matrix multiplication: we don't need to define each weight variable manually. ($w_1, w_2, w_3 \to W$)
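
As a final check (assuming the training loop above has just finished), we can compare the model's predictions against the targets:

# predictions from the trained W and b versus the original targets
print(predict(X).numpy().flatten())
print(y.flatten())  # [152. 185. 180. 196. 142.]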

Summary

In this post, we expanded linear regression from a single variable to multiple variables. Using matrix multiplication notation also makes it easier to run gradient descent efficiently.