Independent Distribution
In this post, we will explore the meaning of the Independent distribution, which serves as a bridge between univariate and multivariate distributions. This is a summary of the lecture "Probabilistic Deep Learning with TensorFlow 2" from Imperial College London.
import tensorflow as tf
import tensorflow_probability as tfp
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
tfd = tfp.distributions
plt.rcParams['figure.figsize'] = (10, 6)
print("Tensorflow Version: ", tf.__version__)
print("Tensorflow Probability Version: ", tfp.__version__)
Actually, Independent is not the formal name of a specific distribution. In TensorFlow Probability, the Independent distribution converts a batch of univariate distributions into a single multivariate distribution. In the previous notebook, you may have noticed that a batch of univariate distributions has the same overall shape as a multivariate distribution; if we specify which batch dimensions should be reinterpreted as event dimensions, we can convert one into the other.
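As a minimal illustration of this shape bookkeeping (the loc and scale values here are arbitrary):
# A batch of two univariate normals: batch_shape=[2], event_shape=[]
batched = tfd.Normal(loc=[0., 1.], scale=[1., 2.])
print(batched.batch_shape, batched.event_shape)
# Reinterpreting the batch dimension as an event dimension gives a single
# two-dimensional distribution: batch_shape=[], event_shape=[2]
joint = tfd.Independent(batched, reinterpreted_batch_ndims=1)
print(joint.batch_shape, joint.event_shape)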
# Combine them into a bivariate Gaussian with independent components
locs = [-1., 1]
scales = [0.5, 1.]
batched_normal = tfd.Normal(loc=locs, scale=scales)
t = np.linspace(-4, 4, 10000)
# each column is a vector of densities for one distribution
densities = batched_normal.prob(np.repeat(t[:, np.newaxis], 2, axis=1))
sns.lineplot(x=t, y=densities[:, 0], label='loc={}, scale={}'.format(locs[0], scales[0]))
sns.lineplot(x=t, y=densities[:, 1], label='loc={}, scale={}'.format(locs[1], scales[1]))
plt.xlabel('Value')
plt.ylabel('Probability density')
plt.legend(loc='best')
plt.show()
batched_normal
As you can see, this distribution has a batch_shape of 2. So how can we convert it into a multivariate distribution of the same shape?
bivariate_normal_from_Independent = tfd.Independent(batched_normal, reinterpreted_batch_ndims=1)
bivariate_normal_from_Independent
The batch dimensions of the underlying distribution are now regarded as event dimensions in the new distribution. So you can see that the output distribution has an event_shape of 2, rather than a batch_shape of 2. To visualize it, we can use a joint plot.
samples = bivariate_normal_from_Independent.sample(10000)
sns.jointplot(x=samples[:, 0], y=samples[:, 1], kind='kde', space=0, color='b', xlim=[-4, 4], ylim=[-4, 4], fill=True)
plt.show()
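It is also worth seeing how the conversion changes log_prob: the batched distribution returns one log density per component, whereas the Independent distribution sums them into a single joint log density. (The evaluation point below is just an illustrative value.)
# One log density per component, shape (2,)
print(batched_normal.log_prob([-0.2, 1.8]))
# A single joint log density: the sum of the two component log densities
print(bivariate_normal_from_Independent.log_prob([-0.2, 1.8]))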
So is it the same as the multivariate one? Let's check.
# Note that diagonal covariance matrix => no correlation => independence (for the multivariate normal distribution)
bivariate_normal_from_Multivariate = tfd.MultivariateNormalDiag(loc=locs, scale_diag=scales)
bivariate_normal_from_Multivariate
samples = bivariate_normal_from_Multivariate.sample(10000)
sns.jointplot(x=samples[:, 0], y=samples[:, 1], kind='kde', color='r', xlim=[-4, 4], ylim=[-4, 4], fill=True)
plt.show()
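Beyond the visual comparison, we can check numerically that both constructions assign the same joint log density (again using an illustrative evaluation point):
print(bivariate_normal_from_Independent.log_prob([-0.2, 1.8]))
print(bivariate_normal_from_Multivariate.log_prob([-0.2, 1.8]))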
# By default, all batch dims except the first are transferred to event dims
loc_grid = [[-100., -100.],
[100., 100.],
[0., 0.]]
scale_grid = [[1., 10.],
[1., 10.],
[1., 1.]]
normals_batch_3by2_event_1 = tfd.Normal(loc=loc_grid, scale=scale_grid)
normals_batch_3by2_event_1
np.array(loc_grid).shape
Applying Independent with its default setting gives us a batch of 3 bivariate normal distributions, each parameterized by a row of our original parameter grid.
normals_batch_3_event_2 = tfd.Independent(normals_batch_3by2_event_1)
normals_batch_3_event_2
normals_batch_3_event_2.log_prob(loc_grid)
And we can also reinterpret all batch dimensions as event dimensions.
normals_batch_1_event_3by2 = tfd.Independent(normals_batch_3by2_event_1, reinterpreted_batch_ndims=2)
normals_batch_1_event_3by2
normals_batch_1_event_3by2.log_prob(loc_grid)
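To summarize how reinterpreted_batch_ndims moves dimensions from batch_shape to event_shape, we can compare the shapes of the log_prob outputs for the three distributions defined above:
# batch_shape [3, 2], event_shape []   -> log_prob of a [3, 2] input has shape [3, 2]
print(normals_batch_3by2_event_1.log_prob(loc_grid).shape)
# batch_shape [3], event_shape [2]     -> log_prob has shape [3] (summed over the last dim)
print(normals_batch_3_event_2.log_prob(loc_grid).shape)
# batch_shape [], event_shape [3, 2]   -> log_prob is a scalar (summed over both dims)
print(normals_batch_1_event_3by2.log_prob(loc_grid).shape)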
Using Independent to build a Naive Bayes classifier on the newsgroup dataset
In this tutorial, we will:
- Load the dataset, fetch the train/test splits, and choose a subset of the data.
- Construct the class-conditional feature distribution (with Independent, using the Naive Bayes assumption) and sample from it.
- Use simple maximum likelihood estimates for the parameters; in later tutorials we will learn them.
# Usenet was a forerunner to modern internet forums
# Users could post and read articles
# Each newsgroup corresponded to a topic
# Example topics in this data set: IBM computer hardware, baseball
# Our objective is to use an article's contents to predict its newsgroup,
# a 20-class classification problem.
# The dataset contains around 18000 newsgroup posts on 20 topics
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
newsgroup_data = fetch_20newsgroups(data_home='./dataset/20_Newsgroup_Data/', subset='train')
print(newsgroup_data['DESCR'])
print(newsgroup_data['data'][0])
label = newsgroup_data['target'][0]
label
newsgroup_data['target_names'][label]
This means the post is about cars (the autos newsgroup).
Before building the model, we need to preprocess the text.
n_documents = len(newsgroup_data['data'])
# Binary bag of words: ignore words that appear in more than 25% of documents,
# or in fewer than two documents
cv = CountVectorizer(input='content', binary=True, max_df=0.25, min_df=1.01 / n_documents)
binary_bag_of_words = cv.fit_transform(newsgroup_data['data'])
binary_bag_of_words.shape
We can check the output by using the inverse transform of the CountVectorizer.
cv.inverse_transform(binary_bag_of_words[0, :])
It will also be convenient to have a dictionary that maps feature indices back to words.
inv_voc = {v:k for k, v in cv.vocabulary_.items()}
Each feature vector $x$ is a list of indicators for whether a word appears in the article: $x_i$ is 1 if the $i$th word appears, and 0 otherwise. inv_voc maps word indices $i$ to words.
Each label $y$ is a value in $0, 1, \ldots, 19$.
The parts of a naive Bayes classifier for this problem can be summarised as:
- A probability distribution for the feature vector for each class, $p(x|y = j)$ for each $j = 0, 1, \ldots, 19$. These probability distributions are assumed to have independent components: we can factorize the joint probability as a product of marginal probabilities \begin{equation} p(x|y = j) = \prod_{i=1}^d p(x_i|y = j) \end{equation} These marginal probability distributions are Bernoulli distributions, each of which has a single parameter $\theta_{ji} := p(x_i = 1|y = j)$. This parameter is the probability of observing word $i$ in an article of class $j$.
- Laplace-smoothed maximum likelihood estimates of these parameters. Laplace smoothing adds a small count to every feature for each class; otherwise, if a feature never appeared in the training articles of a class but was then observed in the test data, its log probability would be undefined.
- A collection of class prior probabilities $p(y = j)$. These will be set by computing the class base rates in the training set.
- A function for computing the probability of class membership via Bayes' theorem: \begin{equation} p(y = j|x) = \frac{p(x|y = j)\, p(y = j)}{\sum_{k=0}^{19} p(x|y = k)\, p(y = k)} \end{equation}
Keep in mind that we need to account for words that appear in the test set but not in the training set.
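Concretely, writing $N_j$ for the number of training articles in class $j$ and $\alpha$ for a small smoothing constant, the smoothed estimate computed in the code below is \begin{equation} \hat{\theta}_{ji} = \frac{\sum_{n : y_n = j} x_{ni} + \alpha}{N_j + 2\alpha} \end{equation}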
n_classes = newsgroup_data['target'].max() + 1
y = newsgroup_data['target']
n_words = binary_bag_of_words.shape[1]
alpha = 1e-6
# Stores parameter values - prob. word given class
theta = np.zeros([n_classes, n_words])
for c_k in range(n_classes):
class_mask = (y == c_k)
# The number of articles in class
N = class_mask.sum()
theta[c_k, :] = (binary_bag_of_words[class_mask, :].sum(axis=0) + alpha) / (N + alpha * 2)
# Most probable word for each class
most_probable_word_i = theta.argmax(axis=1)
for j, i in enumerate(most_probable_word_i):
print("Most probable word in class {} is \"{}\".".format(newsgroup_data['target_names'][j], inv_voc[i]))
Now it's time to model each word with a distribution. We will assume that each word indicator follows a Bernoulli distribution.
batch_of_bernoullis = tfd.Bernoulli(probs=theta)
p_x_given_y = tfd.Independent(batch_of_bernoullis, reinterpreted_batch_ndims=1)
p_x_given_y
samples = p_x_given_y.sample(10)
samples
chosen_class = 10
newsgroup_data['target_names'][chosen_class]
class_sample = samples[:, chosen_class, :]
class_sample
cv.inverse_transform(class_sample)[0]
Based on a quick Google search, the first sample contains keywords related to ice hockey. For example, ccohen likely refers to Colby Cohen, an NHL player, and a goaltender is the player responsible for stopping the puck.
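To close the loop on the Bayes' theorem step outlined earlier, here is a minimal sketch of a classification function. It assumes the variables defined above (p_x_given_y, y, n_classes, cv, newsgroup_data) are in scope; the function name and the choice of test document are purely illustrative.
# Class priors p(y = j) from the training base rates
log_prior = np.log(np.bincount(y, minlength=n_classes) / len(y))

def predict_class(text):
    # Binarize the document with the fitted CountVectorizer (words outside the
    # training vocabulary are simply dropped)
    x = cv.transform([text]).toarray()[0].astype('float64')
    # log p(x | y = j) for every class j, shape (n_classes,)
    log_likelihood = p_x_given_y.log_prob(x).numpy()
    # argmax_j [ log p(x | y = j) + log p(y = j) ]  -- Bayes' theorem, unnormalized
    return np.argmax(log_likelihood + log_prior)

# Illustrative usage on the first training document
predicted = predict_class(newsgroup_data['data'][0])
print(newsgroup_data['target_names'][predicted])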