Softmax Regression
In this post, we will cover the basic concepts of softmax regression, also known as multinomial classification. We will explain its hypothesis and cost function, and how to minimize the cost with gradient descent as we did previously. We will also implement it with TensorFlow 2.x.
- Logistic Regression
- Multinomial Classification
- Softmax function
- Cost function of Multinomial classification
- Implement with Tensorflow
- Softmax Regression for animal classification
- Summary
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
plt.rcParams['figure.figsize'] = (16, 10)
plt.rcParams['text.usetex'] = True
plt.rc('font', size=15)
Logistic Regression
Previously, we covered logistic regression, which handles classification tasks, especially binary classification. The basic concept of logistic regression is the same as that of linear regression. For simplicity, we omit the bias term.
$$ H_{\theta}(X) = \theta^TX $$
But we need to classify the data, not predict a value. So we predicted the probability of whether a sample is True or False, and decided the label based on a decision boundary. For that, we introduced a new type of hypothesis, the sigmoid (or logistic) function.
$$ g(z) = \frac{1}{1 + e^{-z}} $$
Its output ranges from 0 to 1, so it is a reasonable choice for calculating a probability. All we need to do is calculate the original hypothesis ($H_{\theta}(X)$) and pass it to the sigmoid function as an argument ($g(H_{\theta}(X))$).
z = np.arange(-10, 10, 0.1)
y = 1 / (1 + np.exp(-z))
plt.figure(figsize=(10, 8))
plt.plot(z, y, color='red');
plt.xlabel('$z$')
plt.ylabel('$y$')
plt.title('Sigmoid function $g(z)$');
plt.grid()
plt.show()
That is how we handle binary classification. Then how can we apply it to multinomial classification, which has more than two labels to classify?
Multinomial Classification
Actually, multinomial classification is an extended version of binary classification. Suppose we have three labels, $A, B, C$, and we don't know the trick of multinomial classification. How can we classify them?
The simplest method is divide and conquer. We can divide the big problem into three small problems:
- Whether it is $A$ or not.
- Whether it is $B$ or not.
- Whether it is $C$ or not.
The hypothesis of binary classification predicts a probability, so we can combine the three binary-classification hypotheses into a single matrix multiplication.
$$ \begin{aligned} H_{\theta}(X) = WX &= \begin{bmatrix} w_{11} & w_{12} & w_{13} \\ w_{21} & w_{22} & w_{23} \\ w_{31} & w_{32} & w_{33} \end{bmatrix} \cdot \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} \\ &= \begin{bmatrix} w_{11} x_1 + w_{12} x_2 + w_{13} x_3 \\ w_{21} x_1 + w_{22} x_2 + w_{23} x_3 \\ w_{31} x_1 + w_{32} x_2 + w_{33} x_3 \end{bmatrix} = \begin{bmatrix} \bar{y_{A}} \\ \bar{y_{B}} \\ \bar{y_{C}} \end{bmatrix} \end{aligned} $$
The result is a $(3, 1)$ matrix, and the output after applying the sigmoid function is also a $(3, 1)$ matrix. What does each row mean? If we apply the sigmoid function to each row, it becomes the probability of whether the sample is that label or not. So the first element is the probability of whether it is $A$ or not, and so on. All we have to do is apply the sigmoid function to each row, right?
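To make this concrete, here is a minimal numpy sketch of applying the sigmoid to each row of $WX$; the weights and the input below are made up purely for illustration.
W_toy = np.array([[0.2, -0.5, 1.0],
                  [1.5, 0.3, -0.7],
                  [-0.4, 0.8, 0.1]])    # hypothetical 3x3 weight matrix
x_toy = np.array([1.0, 2.0, 0.5])       # hypothetical input (x1, x2, x3)
scores = W_toy @ x_toy                  # three scores, one per label
probs = 1 / (1 + np.exp(-scores))       # sigmoid applied to each row independently
print(probs)                            # three separate "is it this label?" probabilities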
Softmax function
It would work, eventually. But there is a more effective way to calculate the probabilities. Suppose we apply the sigmoid to each value and get a probability of 0.55 for $A$, 0.66 for $B$, and 0.44 for $C$. How can we classify it? $B$ has the highest probability, but we cannot ignore the probabilities of $A$ and $C$. This kind of result is hard to interpret as a single label.
There is a way to calculate probabilities for all labels that sum up to 1. Here we introduce an additional function, the softmax function.
$$ \sigma(y_i) = \frac{e^{y_i}}{\sum_{j=1}^{K} e^{y_j}} $$
The role of the softmax function is to convert the scores (the output of the matrix multiplication) into probabilities, and the sum of all probabilities is 1. All we need to do is find the maximum probability in each row and assign its label. Usually, this is calculated with the argmax function, which finds the argument that maximizes a value.
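Here is a small numpy sketch of softmax followed by argmax, again with made-up scores:
scores = np.array([2.0, 1.0, 0.1])               # hypothetical scores for labels A, B, C
probs = np.exp(scores) / np.exp(scores).sum()    # softmax: values in (0, 1) that sum to 1
print(probs)             # roughly [0.66, 0.24, 0.10]
print(probs.sum())       # 1.0
print(np.argmax(probs))  # 0, i.e. the predicted label is A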
Cost function of Multinomial classification
We can bring the cost function of binary classification here, the cross-entropy.
$$ \text{C.E.} = -y\log(p) - (1-y)\log(1-p) $$
In multinomial classification, we can modify the cross entropy function.
$$ \text{C.E.} = -\sum_{i} y_i \log (\bar{y_i}) $$
The role of the cost function is to measure how well the model classifies. So if the model classifies a label incorrectly, the cost function must return a high cost, and it must return a low cost if the label is classified correctly. The same principle applies in the case of multinomial classification.
As a result, we have defined the cost function, and we can apply gradient descent to find the weight vector that minimizes the cost.
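As a quick sanity check of this behaviour, here is a small numeric example with a made-up one-hot label and two hypothetical predictions:
y_true = np.array([0, 1, 0])            # the true label is B (one-hot)
good_pred = np.array([0.1, 0.8, 0.1])   # confident and correct
bad_pred = np.array([0.7, 0.2, 0.1])    # confident but wrong
cross_entropy = lambda y, p: -np.sum(y * np.log(p))
print(cross_entropy(y_true, good_pred))  # about 0.22 (low cost)
print(cross_entropy(y_true, bad_pred))   # about 1.61 (high cost)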
Implement with Tensorflow
We will try to implement softmax regression with a simple dataset. First, define the dataset.
x_data = np.array(
[[1, 2, 1, 1],
[2, 1, 3, 2],
[3, 1, 3, 4],
[4, 1, 5, 5],
[1, 7, 5, 5],
[1, 2, 5, 6],
[1, 6, 6, 6],
[1, 7, 7, 7]], dtype=np.float32)
y_data = np.array(
[[0, 0, 1],
[0, 0, 1],
[0, 0, 1],
[0, 1, 0],
[0, 1, 0],
[0, 1, 0],
[1, 0, 0],
[1, 0, 0]], dtype=np.float32
)
We have three classes, and for simplicity, y_data is already one-hot encoded.
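For reference, the same labels can be produced from integer class indices with tf.one_hot:
labels = [2, 2, 2, 1, 1, 1, 0, 0]        # integer class index per sample
one_hot = tf.one_hot(labels, depth=3)    # shape (8, 3), one row per sample
print(one_hot.numpy())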
And we need to initialize the weight matrix $W$ and bias $b$. Usually, they are initialized with random values drawn from a normal distribution.
W = tf.Variable(tf.random.normal((x_data.shape[1], y_data.shape[1])), name='weight')
b = tf.Variable(tf.random.normal((y_data.shape[1], )), name='bias')
print(W, b)
Then, we can define the softmax function with TensorFlow. Of course, you can implement it manually with numpy, but TensorFlow also offers a softmax function as an API.
# Hypothesis: softmax applied to the linear scores XW + b
def softmax(X):
    return tf.nn.softmax(tf.matmul(X, W) + b)
Test it with sample data,
print(softmax([x_data[0]]))
print(softmax([x_data[0]]).numpy().sum())
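We can also check that every row of a whole batch sums to 1:
print(softmax(x_data).numpy().sum(axis=1))  # each entry should be (numerically) 1.0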
We can also define the cost function and gradient function.
def loss_fn(X, Y):
    # softmax output: a probability distribution over the three labels per sample
    probs = softmax(X)
    # cross-entropy per sample, averaged over the batch
    cost = -tf.reduce_sum(Y * tf.math.log(probs), axis=1)
    return tf.reduce_mean(cost)
def gradient(X, Y):
    # record the forward pass so gradients w.r.t. W and b can be computed
    with tf.GradientTape() as tape:
        loss = loss_fn(X, Y)
    grads = tape.gradient(loss, [W, b])
    return grads
Finally, we implement the training step.
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)
for e in range(5000):
grads = gradient(x_data, y_data)
optimizer.apply_gradients(grads_and_vars=zip(grads, [W, b]))
if e % 500 == 0:
print('Epoch: {}, Loss: {:.4f}'.format(e, loss_fn(x_data, y_data).numpy()))
To check the model performance, we compare the predicted labels with the actual ones.
pred = softmax(x_data)
print(tf.argmax(pred, 1))
print(tf.argmax(y_data, 1))
Comparing the predicted results with the actual data, we find that one sample is misclassified, but most of the data is classified correctly.
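We can also quantify this with an accuracy score; a small sketch reusing the pred tensor from above:
correct = tf.equal(tf.argmax(pred, 1), tf.argmax(y_data, 1))
accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))
print(accuracy.numpy())  # 0.875 if exactly one of the eight samples is misclassified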
Softmax Regression for animal classification
In this section, we apply softmax regression to animal classification. The dataset is the Zoo dataset from the UCI ML Repository.
df = pd.read_csv('./dataset/zoo.data', header=None)
df.head()
But in our case, we don't need the name of each animal (the first column). We want to classify the animal's type regardless of its name, so we drop that column.
df = df.drop(0, axis=1)
df.head()
And when we look at the label column (column 17), we find that its range is from 1 to 7, so we shift it down by 1 to get labels from 0 to 6.
df.info()
df.describe()
df[17] = df[17] - 1
df.describe()
And as you can see from the contents, we need to convert the numeric labels with one-hot encoding. TensorFlow offers APIs for one-hot encoding.
X = df.iloc[:, :-1].to_numpy(dtype=np.float32)
y = df.iloc[:, [-1]].to_numpy()
# Make y data with one-hot encoding: integer class (0-6) -> one-hot vector of length 7
Y_one_hot = tf.one_hot(y.flatten(), depth=7)
print(Y_one_hot[:3].numpy())
print(y[:3])
All the tasks we have done so far are part of data preprocessing. Now it is time to build the model and implement the learning process.
W = tf.Variable(tf.random.normal([X.shape[1], 7]), name='weight')
b = tf.Variable(tf.random.normal([7, ]), name='bias')
variables = [W, b]
Previously, we built the softmax function and the cost function (cross-entropy) manually, but TensorFlow also provides an API that does both at once: tf.nn.softmax_cross_entropy_with_logits. We only need to define the logit function ($H_{\theta}(X)$) and the labels, and pass them to this API.
def logit_fn(X):
return tf.matmul(X, W) + b
# Softmax function
def softmax(X):
return tf.nn.softmax(logit_fn(X))
# Loss function for cross entropy
def loss_fn(X, y):
logits = logit_fn(X)
cost_i = tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=y)
return tf.reduce_mean(cost_i)
# Calculate gradient
def grad_fn(X, y):
with tf.GradientTape() as tape:
loss = loss_fn(X, y)
grads = tape.gradient(loss, variables)
return grads
# Prediction accuracy for validation
def prediction(X, y):
pred = tf.argmax(softmax(X), 1)
correct = tf.equal(pred, tf.argmax(y, 1))
accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))
return accuracy
optimizer = tf.keras.optimizers.SGD(learning_rate=0.1)
for e in range(2000):
grads = grad_fn(X, Y_one_hot)
optimizer.apply_gradients(grads_and_vars=zip(grads, variables))
if e % 100 == 0:
print('Epoch: {}, Loss: {:.4f}, Acc: {:.4f}'.format(e,
loss_fn(X, Y_one_hot).numpy(),
prediction(X, Y_one_hot).numpy()))
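After training, we can also spot-check a few individual samples; a small sketch reusing the functions defined above (the choice of the first five rows is arbitrary):
pred = tf.argmax(softmax(X[:5]), axis=1)   # predicted class indices
true = tf.argmax(Y_one_hot[:5], axis=1)    # actual class indices
print(pred.numpy())
print(true.numpy())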
Summary
In this post, we covered the softmax function for multinomial classification. Multinomial classification is an extended version of binary classification, so we can apply almost the same approach as logistic regression. But to interpret the result more easily, we replace the sigmoid function with softmax and obtain a probability for each label. We also implemented it with TensorFlow 2.x.