# Multi-variable Linear Regression

This post covers the basic concept of Multi-variable Linear Regression. Unlike Simple Linear Regression, Multi-variable Linear Regression has several independent variables, so its hypothesis differs from the one we saw in previous posts.

```
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (10, 8)
# Note: Matplotlib 3.6+ renamed this style to 'seaborn-v0_8'
plt.style.use('seaborn')
```

## Hypothesis

In Simple Linear Regression, we expressed the hypothesis like this:

$$ H(x) = W x + b $$

But most real-world problems depend on several variables. For the case with 3 variables, the hypothesis expands to:

$$ H(x_1, x_2, x_3) = w_1 x_1 + w_2 x_2 + w_3 x_3 +b $$

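The cost function for this hypothesis also differs from the one we saw in the previous post. Writing $x_{ij}$ for the $j$-th variable of the $i$-th sample, the hypothesis is evaluated once per sample and compared with that sample's $y_i$:

$$ \text{cost}(W, b) = \frac{1}{m} \sum_{i=1}^{m} (H(x_{i1}, x_{i2}, x_{i3}) - y_i)^2 $$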
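
As a rough sketch, the cost above can be evaluated with plain NumPy. The scores are the ones used later in this post; the weights and bias are hypothetical values picked by hand for illustration, not trained ones.

```python
import numpy as np

# Scores from the dataset used later in this post
x1 = np.array([73., 93., 89., 96., 73.])
x2 = np.array([80., 88., 91., 98., 66.])
x3 = np.array([75., 93., 90., 100., 70.])
y = np.array([152., 185., 180., 196., 142.])

# Hypothetical weights and bias, chosen by hand for illustration only
w1, w2, w3, b = 0.5, 0.5, 1.0, 0.0


def hypothesis(x1, x2, x3):
    """H(x1, x2, x3) = w1*x1 + w2*x2 + w3*x3 + b, evaluated per sample."""
    return w1 * x1 + w2 * x2 + w3 * x3 + b


def cost(x1, x2, x3, y):
    """Mean of the squared differences between hypothesis and target."""
    errors = hypothesis(x1, x2, x3) - y
    return np.mean(errors ** 2)


print(cost(x1, x2, x3, y))
```

The hypothesis is computed for all five samples at once, so the sum over $i$ in the formula becomes a single `np.mean` call.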

## Multi-variables

In general, if we have $n$ variables for regression, the hypothesis will be,

$$ H(x_1, x_2, x_3, \dots, x_n) = w_1 x_1 + w_2 x_2 + w_3 x_3 + \dots + w_n x_n + b $$

Written out term by term, this is hard to fit on one line. Instead, we can express it as a matrix multiplication. Suppose $X$ is the vector of the $x$ values and $W$ is the vector of the $w$ values. The hypothesis in matrix form is then:

$$ H(X) = W \cdot X = \begin{bmatrix} w_1 & w_2 & \dots & w_n \end{bmatrix} \cdot \begin{bmatrix} x_1 \\ x_2 \\ \dots \\ x_n \end{bmatrix} = w_1 x_1 + w_2 x_2 + \dots + w_n x_n $$

Or we can reverse the order of $W$ and $X$, swapping which one is the row vector. The result is the same:

$$ H(X) = X \cdot W = \begin{bmatrix} x_1 & x_2 & \dots & x_n \end{bmatrix} \cdot \begin{bmatrix} w_1 \\ w_2 \\ \dots \\ w_n \end{bmatrix} = w_1 x_1 + w_2 x_2 + \dots + w_n x_n $$
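
A small NumPy check of the two orderings, as a sketch. The vectors here are hypothetical values for $n = 3$, chosen only for illustration.

```python
import numpy as np

# Hypothetical values for n = 3, for illustration only
w = np.array([1.0, 2.0, 3.0])
x = np.array([4.0, 5.0, 6.0])

# W as a row vector (1 x n) times X as a column vector (n x 1)
wx = w.reshape(1, -1) @ x.reshape(-1, 1)

# X as a row vector (1 x n) times W as a column vector (n x 1)
xw = x.reshape(1, -1) @ w.reshape(-1, 1)

# Both products are 1 x 1 matrices holding w_1 x_1 + w_2 x_2 + w_3 x_3
print(wx.shape, xw.shape)
print(wx[0, 0], xw[0, 0])
```

Either way, the row vector must come first and the column vector second so that the inner dimensions match.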

**Note:** We omit the bias term ($b$) for simplicity. If you want the bias inside the matrix form, extend both vectors by one element: append a constant $1$ to $X$ and the bias $b$ to $W$.
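
One way to write the bias inside the product, assuming the $X \cdot W$ ordering from the previous formula, is to let a constant $1$ multiply $b$:

$$ H(X) = X \cdot W = \begin{bmatrix} x_1 & x_2 & \dots & x_n & 1 \end{bmatrix} \cdot \begin{bmatrix} w_1 \\ w_2 \\ \dots \\ w_n \\ b \end{bmatrix} = w_1 x_1 + w_2 x_2 + \dots + w_n x_n + b $$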

The advantage of matrix multiplication is parallelization: one product handles many samples at once. In the previous example, the formula used just one row of $X$. What if $X$ has many rows? The same formula extends directly:

$$ H(X) = X \cdot W = \begin{bmatrix} x_{11} & x_{12} & \dots & x_{1n} \\ x_{21} & x_{22} & \dots & x_{2n} \end{bmatrix} \cdot \begin{bmatrix} w_1 \\ w_2 \\ \dots \\ w_n \end{bmatrix} = \begin{bmatrix} w_1 x_{11} + w_2 x_{12} + \dots + w_n x_{1n} \\ w_1 x_{21} + w_2 x_{22} + \dots + w_n x_{2n} \end{bmatrix}$$
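
The two-row formula above can be checked with a short NumPy sketch. The numbers are hypothetical and only serve to compare the matrix product against a row-by-row computation.

```python
import numpy as np

# Hypothetical 2 x 3 input (two samples, three variables) and 3 x 1 weights
X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
W = np.array([[0.5], [1.0], [2.0]])

# One matrix product evaluates the hypothesis for every row at once
H = X @ W

# The same values computed one row at a time, term by term
H_loop = np.array([[sum(w * x for w, x in zip(W[:, 0], row))] for row in X])

print(H.shape)
print(np.allclose(H, H_loop))
```

Adding more rows to `X` changes only the number of rows in `H`; the code for the product stays the same.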

Also, we can expand the dimension of the weight term, meaning that $W$ holds more than one weight vector. With matrix multiplication, we don't need to write out each term manually, and a GPU (Graphics Processing Unit) is well suited to computing matrix multiplications thanks to its parallel architecture.
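
One reading of expanding the weight term is giving $W$ several columns, one weight vector per output. The sketch below assumes that reading; the sizes and random values are hypothetical.

```python
import numpy as np

# Hypothetical sizes: m = 4 samples, n = 3 variables, k = 2 outputs
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))
W = rng.normal(size=(3, 2))  # one weight column per output

# A single product gives shape (m, k): output column j uses weight column j
H = X @ W

# Each output column equals the single-column product with that weight column
for j in range(W.shape[1]):
    assert np.allclose(H[:, [j]], X @ W[:, [j]])

print(H.shape)
```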

```
x1 = [73., 93., 89., 96., 73.]
x2 = [80., 88., 91., 98., 66.]
x3 = [75., 93., 90., 100., 70.]
y = [152., 185., 180., 196., 142.]
# random weights
w1 = tf.Variable(tf.random.normal([1]))
w2 = tf.Variable(tf.random.normal([1]))
w3 = tf.Variable(tf.random.normal([1]))
b = tf.Variable(tf.random.normal([1]))
# Hypothesis
h = w1 * x1 + w2 * x2 + w3 * x3 + b
```

We can build the training process with Gradient Descent.

```
learning_rate = 0.00001
for e in range(1000):
    # Record the gradient history of the cost function
    with tf.GradientTape() as tape:
        h = w1 * x1 + w2 * x2 + w3 * x3 + b
        cost = tf.reduce_mean(tf.square(h - y))
    # Calculate the gradient of each weight
    w1_grad, w2_grad, w3_grad, b_grad = tape.gradient(cost, [w1, w2, w3, b])
    # Update the weights
    w1.assign_sub(learning_rate * w1_grad)
    w2.assign_sub(learning_rate * w2_grad)
    w3.assign_sub(learning_rate * w3_grad)
    b.assign_sub(learning_rate * b_grad)
    if e % 100 == 0:
        print("epoch: {:5} | cost: {:12.4f}".format(e, cost.numpy()))
```

We can express the same model in matrix multiplication form. To do this, the dataset is merged into one NumPy array.

```
data = np.array([
    [73., 80., 75., 152.],
    [93., 88., 93., 185.],
    [89., 91., 90., 180.],
    [96., 98., 100., 196.],
    [73., 66., 70., 142.]
], dtype=np.float32)
X = data[:, :-1]
y = data[:, [-1]]
W = tf.Variable(tf.random.normal(shape=[X.shape[1], 1]))
b = tf.Variable(tf.random.normal(shape=[1]))
# Replace hypothesis with predict function
def predict(X):
    return tf.matmul(X, W) + b
```

**Note:** It may be confusing that `y` is made by slicing with `[-1]` instead of `-1`. The list index keeps the column dimension, so `y` stays in matrix form, as the two shapes show.

```
data[:, -1].shape    # (5,)
```

```
data[:, [-1]].shape  # (5, 1)
```
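
To see why the matrix form matters, here is a small NumPy sketch of what happens when a column of predictions is subtracted from each kind of target. The arrays are hypothetical stand-ins with the same shapes as the prediction and the two slices; TensorFlow follows NumPy-style broadcasting for this kind of operation.

```python
import numpy as np

pred = np.zeros((5, 1))                  # same shape as a prediction for 5 rows
y_flat = np.arange(5, dtype=np.float64)  # shape (5,), like data[:, -1]
y_col = y_flat.reshape(-1, 1)            # shape (5, 1), like data[:, [-1]]

# Matching shapes: one error per row
print((pred - y_col).shape)

# Mismatched shapes: broadcasting pairs every prediction with every target
print((pred - y_flat).shape)
```

The first difference has shape `(5, 1)`, while the second silently becomes `(5, 5)`, so a cost averaged over it would mix every prediction with every target.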

The same learning process with gradient descent is applied:

```
learning_rate = 0.00001
for e in range(2000):
    # Record the gradient history of the cost function
    with tf.GradientTape() as tape:
        cost = tf.reduce_mean(tf.square(predict(X) - y))
    # Calculate the gradient of each weight
    W_grad, b_grad = tape.gradient(cost, [W, b])
    # Update the weights
    W.assign_sub(learning_rate * W_grad)
    b.assign_sub(learning_rate * b_grad)
    if e % 100 == 0:
        print("epoch: {:5} | cost: {:12.4f}".format(e, cost.numpy()))
```

As you can see from the result, the cost decreases significantly within the first 100 epochs. You can also see an advantage of matrix multiplication: we don't need to define each weight manually ($w_1, w_2, w_3 \to W$).
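
For readers curious about what the gradient step computes, here is a NumPy-only sketch of the same update with the gradients written out by hand: $\frac{2}{m} X^T (XW + b - y)$ for $W$ and $\frac{2}{m} \sum_i (XW + b - y)_i$ for $b$. It is not the code used in this post. The zero initialization is an assumption made so the run is repeatable, whereas the post starts from random normal values.

```python
import numpy as np

data = np.array([
    [73., 80., 75., 152.],
    [93., 88., 93., 185.],
    [89., 91., 90., 180.],
    [96., 98., 100., 196.],
    [73., 66., 70., 142.],
])
X = data[:, :-1]
y = data[:, [-1]]
m = X.shape[0]

# Assumption: zero initialization instead of the post's random normal values
W = np.zeros((X.shape[1], 1))
b = np.zeros(1)
learning_rate = 0.00001

costs = []
for e in range(2000):
    error = X @ W + b - y                        # shape (m, 1)
    costs.append(float(np.mean(error ** 2)))
    # Gradients of the mean squared error, derived by hand
    W_grad = (2.0 / m) * (X.T @ error)           # shape (n, 1)
    b_grad = (2.0 / m) * np.sum(error, axis=0)   # shape (1,)
    # Same update rule as assign_sub in the TensorFlow version
    W -= learning_rate * W_grad
    b -= learning_rate * b_grad

print("first cost: {:12.4f} | last cost: {:12.4f}".format(costs[0], costs[-1]))
```

Note that `W_grad` is produced by a single matrix product, regardless of how many variables the model has.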