Likelihood Estimation
In this post, we will review the process of likelihood estimation. Specifically, we will fit a Poisson regression model by optimizing its likelihood with the gradient descent algorithm and the Newton-Raphson method.
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(1)
We can derive the log-likelihood function like this:
$$ l(\beta_0, \beta_1) = \frac{1}{n} \sum_{i=1}^n \big( y_i (x_i^T \beta) - \exp(x_i^T \beta) - \log(y_i !) \big) \\ \beta = (\beta_0, \beta_1)^T, \quad x_i = (1, x_{i1})^T $$
A likelihood function is always written under a specific distributional assumption (most commonly the normal distribution). In this example, we use the Poisson distribution.
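For concreteness, here is a minimal sketch of this log-likelihood as a plain NumPy function (the name log_likelihood and the use of scipy.special.gammaln for $\log(y_i!)$ are my own choices; it can be evaluated on the x and y generated below):
from scipy.special import gammaln  # gammaln(y + 1) == log(y!)

def log_likelihood(beta, x, y):
    # mean Poisson log-likelihood: (1/n) * sum_i ( y_i * x_i^T beta - exp(x_i^T beta) - log(y_i!) )
    eta = x @ beta                 # linear predictor, shape (n, 1)
    return np.mean(y * eta - np.exp(eta) - gammaln(y + 1))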
p = 2
n = 1000
true_beta = np.array([[1], [0.5]])
Here, we generate 1000 samples from a normal distribution with mean 0 and standard deviation 0.2. We then prepend a column of 1s to each sample because the regression needs an intercept term.
x = np.random.normal(loc=0, scale=0.2, size=(n, 1))
x = np.hstack((np.ones((n, 1)), x))
x[:5, :]
We build the Poisson model with the exponential function; it serves as the (inverse) link function that maps the linear predictor $x_i^T \beta$ to the Poisson mean.
$$ \lambda(x_i) = \mathbb{E}(Y \vert X = x_i) = \exp(x_i^T \beta) $$
param = np.exp(x @ true_beta)
param[:5, :]
y = np.random.poisson(param)
y[:5, :]
So, how can we recover the true beta from the data samples? Most of us would reach for some kind of optimization method. Here, we will use gradient descent. The descent direction can be expressed as
$$ -\nabla l(\beta_0^{(t)}, \beta_1^{(t)}) $$
Note the negative sign: maximizing the log-likelihood is equivalent to minimizing its negative, so gradient descent on the negative log-likelihood drives us toward the maximum-likelihood estimate. Plugging in the Poisson model from above, the negative gradient of the log-likelihood is
$$ -\nabla l(\beta_0, \beta_1) = -\frac{1}{n} \sum_{i=1}^n \big( y_i \cdot x_i - \exp(x_i^T \beta) \cdot x_i \big) \in \mathbb{R}^2 $$
beta = np.array([.5, 0.5]).reshape((p, 1))
param = np.exp(x @ beta)
grad = -np.mean(y * x - param * x, axis=0).reshape((p, 1))
grad
Based on these values, $\beta_0$ must move in the direction opposite to the gradient, which here is the positive $(+)$ direction. To do that, we repeatedly update beta by taking a step against the gradient, scaled by a learning rate $\eta$.
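The update performed by the loop below can be written as
$$ \beta^{(t+1)} = \beta^{(t)} - \eta \cdot \big( -\nabla l(\beta_0^{(t)}, \beta_1^{(t)}) \big) $$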
learning_rate = 0.7
# initial beta
beta = np.zeros((p, 1))
beta_0 = []
beta_1 = []
for i in range(500):
    param = np.exp(x @ beta)
    grad = -np.mean(y * x - param * x, axis=0).reshape((p, 1))
    # gradient descent update
    beta_new = beta - learning_rate * grad
    # stop when the update is (numerically) negligible
    converged = np.sum(np.abs(beta_new - beta)) < 1e-8
    beta = beta_new
    beta_0.append(beta[0])
    beta_1.append(beta[1])
    print('Iteration {} beta: '.format(i))
    print(beta_new, '\n')
    if converged:
        break
We find that beta converges to the true beta after 213 iterations. Here is how each coefficient changes over the iterations.
fig, ax = plt.subplots(2, 1, figsize=(16, 10))
ax[0].set_ylim(0, 1.3)
ax[0].plot(beta_0, label='beta')
ax[0].axhline(true_beta[0], color='red', linestyle='--', label='true')
ax[0].legend(loc='best')
ax[1].set_ylim(0, 1.3)
ax[1].plot(beta_1, label='beta')
ax[1].axhline(true_beta[1], color='red', linestyle='--', label='true')
ax[1].legend(loc='best')
plt.show()
grad
Of course, we can tune the learning rate to trade off convergence speed and accuracy, as sketched below.
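As a rough illustration (the helper fit_poisson_gd below is my own wrapper around the loop above, reusing the x, y, and p defined earlier), we can rerun gradient descent with several learning rates and compare how many iterations each needs to converge:
def fit_poisson_gd(x, y, learning_rate, max_iter=5000, tol=1e-8):
    # gradient descent on the negative Poisson log-likelihood
    beta = np.zeros((p, 1))
    for i in range(max_iter):
        param = np.exp(x @ beta)
        grad = -np.mean(y * x - param * x, axis=0).reshape((p, 1))
        beta_new = beta - learning_rate * grad
        if np.sum(np.abs(beta_new - beta)) < tol:
            return beta_new, i
        beta = beta_new
    return beta, max_iter

for lr in [0.1, 0.3, 0.5, 0.7]:
    beta_hat, n_iter = fit_poisson_gd(x, y, lr)
    print('lr={}: converged in {} iterations, beta={}'.format(lr, n_iter, beta_hat.ravel()))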
Newton-Raphson Method
Unlike gradient descent, the Newton-Raphson method uses second-order derivatives. So in order to apply it, we need to calculate the Hessian matrix of the negative log-likelihood,
$$ \begin{aligned} H &= - \nabla^2 l (\beta_0^{(t)}, \beta_1^{(t)}) \\ &= \frac{1}{n} \sum_{i=1}^n \exp(x_i^T \beta) \cdot x_i x_i^T \end{aligned} $$
In matrix form this is $H = \frac{1}{n} X^T D X$, where $D$ is the diagonal matrix with entries $\exp(x_i^T \beta)$.
beta = np.array([.5, .5]).reshape((p, 1))
param = np.exp(x @ beta)
D = np.diag(np.squeeze(param))
D
H = x.T @ D @ x / n
H
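With the gradient and Hessian in hand, the Newton-Raphson update used in the loop below replaces the fixed learning rate with the inverse Hessian:
$$ \beta^{(t+1)} = \beta^{(t)} - H^{-1} \big( -\nabla l(\beta_0^{(t)}, \beta_1^{(t)}) \big) $$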
beta = np.zeros((p, 1))
beta_0 = []
beta_1 = []
for i in range(500):
    param = np.exp(x @ beta)
    grad = -np.mean(y * x - param * x, axis=0).reshape((p, 1))
    D = np.diag(np.squeeze(param))
    H = x.T @ D @ x / n
    # Newton-Raphson update
    beta_new = beta - np.linalg.inv(H) @ grad
    # stop when the update is (numerically) negligible
    converged = np.sum(np.abs(beta_new - beta)) < 1e-8
    beta = beta_new
    beta_0.append(beta[0])
    beta_1.append(beta[1])
    print('Iteration {} beta: '.format(i))
    print(beta_new, '\n')
    if converged:
        break
fig, ax = plt.subplots(2, 1, figsize=(16, 10))
ax[0].set_ylim(0, 2)
ax[0].plot(beta_0, label='beta')
ax[0].axhline(true_beta[0], color='red', linestyle='--', label='true')
ax[0].legend(loc='best')
ax[1].set_ylim(0, 2)
ax[1].plot(beta_1, label='beta')
ax[1].axhline(true_beta[1], color='red', linestyle='--', label='true')
ax[1].legend(loc='best')
plt.show()
Summary
We found the maximum-likelihood estimate by minimizing the negative log-likelihood with optimization methods, namely gradient descent and the Newton-Raphson method. The key point is that differentiation is required for the optimization. In this post we calculated the gradient (and Hessian matrix) by hand, but in practice automatic-differentiation libraries such as JAX or autograd can compute them for us.
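As an illustration of that last point, here is a minimal sketch using JAX (not part of the original workflow; it reuses the x, y, and p defined above and lets jax.grad / jax.hessian handle the differentiation):
import jax
import jax.numpy as jnp
from jax.scipy.special import gammaln

def neg_log_likelihood(beta, X, y):
    # negative mean Poisson log-likelihood; beta has shape (p,)
    eta = X @ beta
    return -jnp.mean(y * eta - jnp.exp(eta) - gammaln(y + 1.0))

grad_fn = jax.grad(neg_log_likelihood)      # gradient w.r.t. beta
hess_fn = jax.hessian(neg_log_likelihood)   # Hessian w.r.t. beta

X_j = jnp.asarray(x)
y_j = jnp.asarray(y.ravel(), dtype=jnp.float32)
beta_hat = jnp.zeros(p)
for _ in range(20):                         # a few Newton-Raphson steps
    g = grad_fn(beta_hat, X_j, y_j)
    H = hess_fn(beta_hat, X_j, y_j)
    beta_hat = beta_hat - jnp.linalg.solve(H, g)
print(beta_hat)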