Packages

import tensorflow as tf
import tensorflow_probability as tfp

from sklearn.metrics import accuracy_score
from sklearn import datasets, model_selection

import numpy as np
import matplotlib.pyplot as plt

tfd = tfp.distributions

plt.rcParams['figure.figsize'] = (10, 6)

print("Tensorflow Version: ", tf.__version__)
print("Tensorflow Probability Version: ", tfp.__version__)

Tensorflow Version:  2.5.0
Tensorflow Probability Version:  0.13.0

The Iris Dataset

You will use the Iris dataset. It consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals, in centimeters. For a reference, see the following papers:

R. A. Fisher. "The use of multiple measurements in taxonomic problems". Annals of Eugenics. 7 (2): 179–188, 1936.

Your goal is to construct a Naive Bayes classifier model that predicts the correct class from the sepal length and sepal width features. Under certain assumptions about this classifier model, you will explore the relation to logistic regression.

Load and prepare the data

We will first read in the Iris dataset, and split the dataset into training and test sets.

iris = datasets.load_iris()

data = iris.data[:, :2]
targets = iris.target

x_train, x_test, y_train, y_test = model_selection.train_test_split(data, targets, test_size=0.2)

labels = {0: 'Iris-Setosa', 1: 'Iris-Versicolour', 2: 'Iris-Virginica'}
label_colours = ['blue', 'orange', 'green']

def plot_data(x, y, labels, colours):
    for c in np.unique(y):
        inx = np.where(y == c)
        plt.scatter(x[inx, 0], x[inx, 1], label=labels[c], c=colours[c])
    plt.title("Training set")
    plt.xlabel("Sepal length (cm)")
    plt.ylabel("Sepal width (cm)")
    plt.legend()
    
plt.figure(figsize=(8, 5))
plot_data(x_train, y_train, labels, label_colours)
plt.show()

Naive Bayes classifier

We will briefly review the Naive Bayes classifier model. The fundamental equation for this classifier is Bayes' rule:

$$ P(Y=y_k | X_1,\ldots,X_d) = \frac{P(X_1,\ldots,X_d | Y=y_k)P(Y=y_k)}{\sum_{k=1}^K P(X_1,\ldots,X_d | Y=y_k)P(Y=y_k)} $$

In the above, $d$ is the number of features or dimensions in the inputs $X$ (in our case $d=2$), and $K$ is the number of classes (in our case $K=3$). The distribution $P(Y)$ is the class prior distribution, which is a discrete distribution over $K$ classes. The distribution $P(X | Y)$ is the class-conditional distribution over inputs.

The Naive Bayes classifier makes the assumption that the data features $X_i$ are conditionally independent give the class $Y$ (the 'naive' assumption). In this case, the class-conditional distribution decomposes as

$$ \begin{aligned} P(X | Y=y_k) &= P(X_1,\ldots,X_d | Y=y_k)\\ &= \prod_{i=1}^d P(X_i | Y=y_k) \end{aligned} $$

This simplifying assumption means that we typically need to estimate far fewer parameters for each of the distributions $P(X_i | Y=y_k)$ instead of the full joint distribution $P(X | Y=y_k)$.

Once the class prior distribution and class-conditional densities are estimated, the Naive Bayes classifier model can then make a class prediction $\hat{Y}$ for a new data input $\tilde{X} := (\tilde{X}_1,\ldots,\tilde{X}_d)$ according to

$$ \begin{aligned} \hat{Y} &= \text{argmax}_{y_k} P(Y=y_k | \tilde{X}_1,\ldots,\tilde{X}_d) \\ &= \text{argmax}_{y_k}\frac{P(\tilde{X}_1,\ldots,\tilde{X}_d | Y=y_k)P(Y=y_k)}{\sum_{k=1}^K P(\tilde{X}_1,\ldots,\tilde{X}_d | Y=y_k)P(Y=y_k)}\\ &= \text{argmax}_{y_k} P(\tilde{X}_1,\ldots,\tilde{X}_d | Y=y_k)P(Y=y_k) \end{aligned} $$

Define the class prior distribution

We will begin by defining the class prior distribution. To do this we will simply take the maximum likelihood estimate, given by

$$ P(Y=y_k) = \frac{\sum_{n=1}^N \delta(Y^{(n)}=y_k)}{N}, $$

where the superscript $(n)$ indicates the $n$-th dataset example, $\delta(Y^{(n)}=y_k) = 1$ if $Y^{(n)}=y_k$ and 0 otherwise, and $N$ is the total number of examples in the dataset. The above is simply the proportion of data examples belonging to class $k$.

You should now write a function that builds the prior distribution from the training data, and returns it as a Categorical Distribution object.

The input to your function y will be a numpy array of shape (num_samples,)
The entries in y will be integer labels $k=0, 1,\ldots, K-1$
Your function should build and return the prior distribution as a Categorical distribution object
- The probabilities for this distribution will be a length-$K$ vector, with entries corresponding to $P(Y = y_k)$ for $k=0,1,\ldots,K-1$
- Your function should work for any value of $K\ge 1$
- This Distribution will have an empty batch shape and empty event shape

def get_prior(y):
    """
    This function takes training labels as a numpy array y of shape (num_samples,) as an input.
    This function should build a Categorical Distribution object with empty batch shape 
    and event shape, with the probability of each class given as above. 
    Your function should return the Distribution object.
    """
    probs = np.unique(y, return_counts=True)[1] / len(y)
    distribution = tfd.Categorical(probs=probs)
    return distribution

prior = get_prior(y_train)

labels = ['Iris-Setosa', 'Iris-Versicolour', 'Iris-Virginica']
plt.bar([0, 1, 2], prior.probs.numpy(), color=label_colours)
plt.xlabel("Class")
plt.ylabel("Prior probability")
plt.title("Class prior distribution")
plt.xticks([0, 1, 2], labels)
plt.show()

Define the class-conditional densities

We now turn to the definition of the class-conditional distributions $P(X_i | Y=y_k)$ for $i=0, 1$ and $k=0, 1, 2$. In our model, we will assume these distributions to be univariate Gaussian:

$$ \begin{aligned} P(X_i | Y=y_k) &= N(X_i | \mu_{ik}, \sigma_{ik})\\ &= \frac{1}{\sqrt{2\pi\sigma_{ik}^2}} \exp\left\{-\frac{1}{2} \left(\frac{x - \mu_{ik}}{\sigma_{ik}}\right)^2\right\} \end{aligned} $$

with mean parameters $\mu_{ik}$ and standard deviation parameters $\sigma_{ik}$, twelve parameters in all. We will again estimate these parameters using maximum likelihood. In this case, the estimates are given by

$$ \begin{aligned} \hat{\mu}_{ik} &= \frac{\sum_n X_i^{(n)} \delta(Y^{(n)}=y_k)}{\sum_n \delta(Y^{(n)}=y_k)} \\ \hat{\sigma}^2_{ik} &= \frac{\sum_n (X_i^{(n)} - \hat{\mu}_{ik})^2 \delta(Y^{(n)}=y_k)}{\sum_n \delta(Y^{(n)}=y_k)} \end{aligned} $$

Note that the above are just the means and variances of the sample data points for each class.

You should now write a function the computes the class-conditional Gaussian densities, using the maximum likelihood parameter estimates given above, and returns them in a single, batched MultivariateNormalDiag Distribution object.

The inputs to the function are
- a numpy array x of shape (num_samples, num_features) for the data inputs
- a numpy array y of shape (num_samples,) for the target labels
Your function should work for any number of classes $K\ge 1$ and any number of features $d\ge 1$

def get_class_conditionals(x, y):
    """
    This function takes training data samples x and labels y as inputs.
    This function should build the class-conditional Gaussian distributions above. 
    It should construct a batch of distributions for each feature and each class, using the 
    parameter estimates above for the means and standard deviations.
    The batch shape of this distribution should be rank 2, where the first dimension corresponds
    to the number of classes and the second corresponds to the number of features.
    Your function should then return the Distribution object.
    """
    counts = np.zeros(3)
    loc = np.zeros((3, 2))
    scale_diag = np.zeros((3, 2))
    for i in range(2):
        for c_k in range(3):
            counts[c_k] = np.sum(np.where(y==c_k))
            loc[c_k, i] = np.mean(x[np.where(y==c_k), i])
            scale_diag[c_k, i] = np.std(x[np.where(y==c_k), i])
    distribution = tfd.MultivariateNormalDiag(loc=loc, scale_diag=scale_diag)
    
    return distribution

class_conditionals = get_class_conditionals(x_train, y_train)
class_conditionals

<tfp.distributions.MultivariateNormalDiag 'MultivariateNormalDiag' batch_shape=[3] event_shape=[2] dtype=float64>

We can visualise the class-conditional densities with contour plots by running the cell below. Notice how the contours of each distribution correspond to a Gaussian distribution with diagonal covariance matrix, since the model assumes that each feature is independent given the class.

def get_meshgrid(x0_range, x1_range, num_points=100):
    x0 = np.linspace(x0_range[0], x0_range[1], num_points)
    x1 = np.linspace(x1_range[0], x1_range[1], num_points)
    return np.meshgrid(x0, x1)

def contour_plot(x0_range, x1_range, prob_fn, batch_shape, colours, levels=None, num_points=100):
    X0, X1 = get_meshgrid(x0_range, x1_range, num_points=num_points)
    Z = prob_fn(np.expand_dims(np.array([X0.ravel(), X1.ravel()]).T, 1))
    Z = np.array(Z).T.reshape(batch_shape, *X0.shape)
    for batch in np.arange(batch_shape):
        if levels:
            plt.contourf(X0, X1, Z[batch], alpha=0.2, colors=colours, levels=levels)
        else:
            plt.contour(X0, X1, Z[batch], colors=colours[batch], alpha=0.3)

plt.figure(figsize=(10, 6))
plot_data(x_train, y_train, labels, label_colours)
x0_min, x0_max = x_train[:, 0].min(), x_train[:, 0].max()
x1_min, x1_max = x_train[:, 1].min(), x_train[:, 1].max()
contour_plot((x0_min, x0_max), (x1_min, x1_max), class_conditionals.prob, 3, label_colours)
plt.title("Training set with class-conditional density contours")
plt.show()

Make predictions from the model

Now the prior and class-conditional distributions are defined, you can use them to compute the model's class probability predictions for an unknown test input $\tilde{X} = (\tilde{X}_1,\ldots,\tilde{X}_d)$, according to

$$ P(Y=y_k | \tilde{X}_1,\ldots,\tilde{X}_d) = \frac{P(\tilde{X}_1,\ldots,\tilde{X}_d | Y=y_k)P(Y=y_k)}{\sum_{k=1}^K P(\tilde{X}_1,\ldots,\tilde{X}_d | Y=y_k)P(Y=y_k)} $$

The class prediction can then be taken as the class with the maximum probability:

$$ \hat{Y} = \text{argmax}_{y_k} P(Y=y_k | \tilde{X}_1,\ldots,\tilde{X}_d) $$

You should now write a function to return the model's class probabilities for a given batch of test inputs of shape (batch_shape, 2), where the batch_shape has rank at least one.

The inputs to the function are the prior and class_conditionals distributions, and the inputs x
Your function should use these distributions to compute the probabilities for each class $k$ as above
- As before, your function should work for any number of classes $K\ge 1$
It should then compute the prediction by taking the class with the highest probability
The predictions should be returned in a numpy array of shape (batch_shape)

def predict_class(prior, class_conditionals, x):
    """
    This function takes the prior distribution, class-conditional distribution, and 
    a batch of inputs in a numpy array of shape (batch_shape, 2).
    This function should compute the class probabilities for each input in the batch, using
    the prior and class-conditional distributions, according to the above equation.
    Note that the batch_shape of x could have rank higher than one!
    Your function should then return the class predictions by taking the class with the 
    maximum probability in a numpy array of shape (batch_shape,).
    """
    class_probs = class_conditionals.log_prob(x[:, None])
    joint_likelihood = tf.add(tf.cast(class_probs, dtype=tf.float64), tf.math.log(prior.probs)[tf.newaxis, ...])
    norm_factor = tf.math.reduce_logsumexp(joint_likelihood, axis=-1, keepdims=True)
    log_prob = joint_likelihood - norm_factor
    y = np.argmax(np.exp(log_prob), axis=-1)
    return y

predictions = predict_class(prior, class_conditionals, x_test)

accuracy = accuracy_score(y_test, predictions)
print("Test accuracy: {:.4f}".format(accuracy))

Test accuracy: 0.7667

plt.figure(figsize=(10, 6))
plot_data(x_train, y_train, labels, label_colours)
x0_min, x0_max = x_train[:, 0].min(), x_train[:, 0].max()
x1_min, x1_max = x_train[:, 1].min(), x_train[:, 1].max()
contour_plot((x0_min, x0_max), (x1_min, x1_max), 
             lambda x: predict_class(prior, class_conditionals, x), 
             1, label_colours, levels=[-0.5, 0.5, 1.5, 2.5],
             num_points=500)
plt.title("Training set with decision regions")
plt.show()

Binary classifier

We will now draw a connection between the Naive Bayes classifier and logistic regression.

First, we will update our model to be a binary classifier. In particular, the model will output the probability that a given input data sample belongs to the 'Iris-Setosa' class: $P(Y=y_0 | \tilde{X}_1,\ldots,\tilde{X}_d)$. The remaining two classes will be pooled together with the label $y_1$.

y_train_binary = np.array(y_train)
y_train_binary[np.where(y_train_binary == 2)] = 1

y_test_binary = np.array(y_test)
y_test_binary[np.where(y_test_binary == 2)] = 1

labels_binary = {0: 'Iris-Setosa', 1: 'Iris-Versicolour / Iris-Virginica'}
label_colours_binary = ['blue', 'red']

plt.figure(figsize=(8, 5))
plot_data(x_train, y_train_binary, labels_binary, label_colours_binary)
plt.show()

We will also make an extra modelling assumption that for each class $k$, the class-conditional distribution $P(X_i | Y=y_k)$ for each feature $i=0, 1$, has standard deviation $\sigma_i$, which is the same for each class $k$.

This means there are now six parameters in total: four for the means $\mu_{ik}$ and two for the standard deviations $\sigma_i$ ($i, k=0, 1$).

We will again use maximum likelihood to estimate these parameters. The prior distribution will be as before, with the class prior probabilities given by

$$ P(Y=y_k) = \frac{\sum_{n=1}^N \delta(Y^{(n)}=y_k)}{N}, $$

We will use your previous function get_prior to redefine the prior distribution.

prior_binary = get_prior(y_train_binary)

plt.bar([0, 1], prior_binary.probs.numpy(), color=label_colours_binary)
plt.xlabel("Class")
plt.ylabel("Prior probability")
plt.title("Class prior distribution")
plt.xticks([0, 1], labels_binary)
plt.show()

For the class-conditional densities, the maximum likelihood estimate for the means are again given by

$$ \hat{\mu}_{ik} = \frac{\sum_n X_i^{(n)} \delta(Y^{(n)}=y_k)}{\sum_n \delta(Y^{(n)}=y_k)} \\ $$

However, the estimate for the standard deviations $\sigma_i$ is updated. There is also a closed-form solution for the shared standard deviations, but we will instead learn these from the data.

You should now write a function that takes the training inputs and target labels as input, as well as an optimizer object, number of epochs and a TensorFlow Variable. This function should be written according to the following spec:

The inputs to the function are:
- a numpy array x of shape (num_samples, num_features) for the data inputs
- a numpy array y of shape (num_samples,) for the target labels
- a tf.Variable object scales of length 2 for the standard deviations $\sigma_i$
- optimiser: an optimiser object
- epochs: the number of epochs to run the training for
The function should first compute the means $\mu_{ik}$ of the class-conditional Gaussians according to the above equation
Then create a batched multivariate Gaussian distribution object using MultivariateNormalDiag with the means set to $\mu_{ik}$ and the scales set to scales
Run a custom training loop for epochs number of epochs, in which:
- the average per-example negative log likelihood for the whole dataset is computed as the loss
- the gradient of the loss with respect to the scales variables is computed
- the scales variables are updated by the optimiser object
At each iteration, save the values of the scales variable and the loss
The function should return a tuple of three objects:
- a numpy array of shape (epochs,) of loss values
- a numpy array of shape (epochs, 2) of values for the scales variable at each iteration
- the final learned batched MultivariateNormalDiag distribution object

NB: ideally, we would like to constrain the scales variable to have positive values. We are not doing that here, but in later weeks of the course you will learn how this can be implemented.

def learn_stdevs(x, y, scales, optimiser, epochs):
    """
    This function takes the data inputs, targets, scales variable, optimiser and number of
    epochs as inputs.
    This function should set up and run a custom training loop according to the above 
    specifications, by setting up the class conditional distributions as a MultivariateNormalDiag
    object, and updating the trainable variables (the scales) in a custom training loop.
    Your function should then return the a tuple of three elements: a numpy array of loss values
    during training, a numpy array of scales variables during training, and the final learned
    MultivariateNormalDiag distribution object.
    """
    n_classes = len(np.unique(y))
    n_features = x.shape[-1]
    loc = np.zeros((n_classes, n_features), dtype=np.float32)
    
    for f in range(n_features):
        for c in range(n_classes):
            samples = x[y==c][:, f]
            loc[c, f] = np.mean(samples)
    
    distribution = tfd.MultivariateNormalDiag(loc=loc, scale_diag=scales)
    x_reshape = np.expand_dims(x.astype(np.float32), 1)
    
    def nll(x, y, distribution):
        predictions = - distribution.log_prob(x)
        probs = []
        for c_k in range(n_classes):
            probs.append(tf.reduce_sum(predictions[y == c_k][:, c_k]))
        return tf.reduce_sum(probs)
    
    @tf.function
    def get_loss_and_grads(x, distribution):
        with tf.GradientTape() as tape:
            tape.watch(distribution.trainable_variables)
            loss = nll(x, y, distribution)
            
        grads = tape.gradient(loss, distribution.trainable_variables)
        return loss, grads
    
    train_loss_results = []
    train_scale_results = []
    
    for epoch in range(epochs):
        loss, grads = get_loss_and_grads(x_reshape, distribution)
        optimiser.apply_gradients(zip(grads, distribution.trainable_variables))
        train_loss_results.append(loss)
        train_scale_results.append(distribution.parameters['scale_diag'].numpy())
        
        if epoch % 100 == 0:
            print(f'epoch: {epoch}, Loss: {loss}')
    
    return np.array(train_loss_results), np.array(train_scale_results), distribution

scales = tf.Variable([1., 1.])
opt = tf.keras.optimizers.Adam(learning_rate=0.001)
epochs = 1000

nlls, scales_arr, class_conditionals_binary = learn_stdevs(x_train, y_train_binary, scales, opt, epochs)

epoch: 0, Loss: 248.52313232421875
epoch: 100, Loss: 229.68798828125
epoch: 200, Loss: 210.25640869140625
epoch: 300, Loss: 191.16299438476562
epoch: 400, Loss: 173.79574584960938
epoch: 500, Loss: 159.8612518310547
epoch: 600, Loss: 152.19459533691406
epoch: 700, Loss: 150.69192504882812
epoch: 800, Loss: 150.63607788085938
epoch: 900, Loss: 150.63558959960938

print("Class conditional means:")
print(class_conditionals_binary.loc.numpy())
print("\nClass conditional standard deviations:")
print(class_conditionals_binary.stddev().numpy())

Class conditional means:
[[5.0214286 3.4095237]
 [6.25      2.853846 ]]

Class conditional standard deviations:
[[0.5859885  0.35059664]
 [0.5859885  0.35059664]]

fig, ax = plt.subplots(1, 2, figsize=(14, 5))
ax[0].plot(nlls)
ax[0].set_title("Loss vs epoch")
ax[0].set_xlabel("Epoch")
ax[0].set_ylabel("Average negative log-likelihood")
for k in [0, 1]:
    ax[1].plot(scales_arr[:, k], color=label_colours_binary[k], label=labels_binary[k])
ax[1].set_title("Standard deviation ML estimates vs epoch")
ax[1].set_xlabel("Epoch")
ax[1].set_ylabel("Standard deviation")
plt.legend()
plt.show()

We can also plot the contours of the class-conditional Gaussian distributions as before, this time with just binary labelled data. Notice the contours are the same for each class, just with a different centre location.

plt.figure(figsize=(10, 6))
plot_data(x_train, y_train_binary, labels_binary, label_colours_binary)
x0_min, x0_max = x_train[:, 0].min(), x_train[:, 0].max()
x1_min, x1_max = x_train[:, 1].min(), x_train[:, 1].max()
contour_plot((x0_min, x0_max), (x1_min, x1_max), class_conditionals_binary.prob, 2, label_colours_binary)
plt.title("Training set with class-conditional density contours")
plt.show()

We can also plot the decision regions for this binary classifier model, notice that the decision boundary is now linear.

plt.figure(figsize=(10, 6))
plot_data(x_train, y_train_binary, labels_binary, label_colours_binary)
x0_min, x0_max = x_train[:, 0].min(), x_train[:, 0].max()
x1_min, x1_max = x_train[:, 1].min(), x_train[:, 1].max()
contour_plot((x0_min, x0_max), (x1_min, x1_max), 
             lambda x: predict_class(prior_binary, class_conditionals_binary, x), 
             1, label_colours_binary, levels=[-0.5, 0.5, 1.5],
             num_points=500)
plt.title("Training set with decision regions")
plt.show()

Link to logistic regression

In fact, we can see that our predictive distribution $P(Y=y_0 | X)$ can be written as follows:

$$ \begin{aligned} P(Y=y_0 | X) =& ~\frac{P(X | Y=y_0)P(Y=y_0)}{P(X | Y=y_0)P(Y=y_0) + P(X | Y=y_1)P(Y=y_1)}\\ =& ~\frac{1}{1 + \frac{P(X | Y=y_1)P(Y=y_1)}{P(X | Y=y_0)P(Y=y_0)}}\\ =& ~\sigma(a) \end{aligned} $$

where $\sigma(a) = \frac{1}{1 + e^{-a}}$ is the sigmoid function, and $a = \log\frac{P(X | Y=y_0)P(Y=y_0)}{P(X | Y=y_1)P(Y=y_1)}$ is the log-odds.

With our additional modelling assumption of a shared covariance matrix $\Sigma$, it can be shown (using the Gaussian pdf) that $a$ is in fact a linear function of $X$:

$$ a = w^T X + w_0 $$

where

$$ \begin{aligned} w =& ~\Sigma^{-1} (\mu_0 - \mu_1)\\ w_0 =& -\frac{1}{2}\mu_0^T \Sigma^{-1}\mu_0 + \frac{1}{2}\mu_1^T\Sigma^{-1}\mu_1 + \log\frac{P(Y=y_0)}{P(Y=y_1)} \end{aligned} $$

The model therefore takes the form $P(Y=y_0 | X) = \sigma(w^T X + w_0)$, with weights $w\in\mathbb{R}^2$ and bias $w_0\in\mathbb{R}$. This is the form used by logistic regression, and explains why the decision boundary above is linear.

In the above we have outlined the derivation of the generative logistic regression model. The parameters are typically estimated with maximum likelihood, as we have done.

Finally, we will use the above equations to directly parameterise the output Bernoulli distribution of the generative logistic regression model.

You should now write the following function, according to the following specification:

The inputs to the function are:
- the prior distribution prior over the two classes
- the (batched) class-conditional distribution class_conditionals
The function should use the parameters of the above distributions to compute the weights and bias terms $w$ and $w_0$ as above
The function should then return a tuple of two numpy arrays for $w$ and $w_0$

def get_logistic_regression_params(prior, class_conditionals):
    """
    This function takes the prior distribution and class-conditional distribution as inputs.
    This function should compute the weights and bias terms of the generative logistic
    regression model as above, and return them in a 2-tuple of numpy arrays of shapes
    (2,) and () respectively.
    """
    mu0 = class_conditionals.parameters['loc'][0]
    mu1 = class_conditionals.parameters['loc'][1]
    
    cov = np.linalg.inv(class_conditionals.covariance())
    
    # TODO: Why this covariance matrix has shape of (2, 2, 2), not (2, 2)
    # In tfp.__version__ == 0.9.0, it has (2, 2)
    # But tfp.__version__ == 0.13.0, (2, 2, 2)
    print(cov.shape)
    w = np.matmul(cov, (mu0 - mu1))
    w0 = - 0.5 * (np.matmul(np.transpose(mu0), np.matmul(cov, mu0)))\
         + 0.5 * (np.matmul(np.transpose(mu1), np.matmul(cov, mu1)))\
         + np.log(prior.parameters['probs'][0] / prior.parameters['probs'][1])
    return w, w0

w, w0 = get_logistic_regression_params(prior_binary, class_conditionals_binary)

(2, 2, 2)

We can now use these parameters to make a contour plot to display the predictive distribution of our logistic regression model.

fig, ax = plt.subplots(1, 1, figsize=(10, 6))
plot_data(x_train, y_train_binary, labels_binary, label_colours_binary)
x0_min, x0_max = x_train[:, 0].min(), x_train[:, 0].max()
x1_min, x1_max = x_train[:, 1].min(), x_train[:, 1].max()
X0, X1 = get_meshgrid((x0_min, x0_max), (x1_min, x1_max))

logits = np.dot(np.array([X0.ravel(), X1.ravel()]).T, w) + w0
Z = tf.math.sigmoid(logits)
lr_contour = ax.contour(X0, X1, np.array(Z).T.reshape(*X0.shape), levels=10)
ax.clabel(lr_contour, inline=True, fontsize=10)
contour_plot((x0_min, x0_max), (x1_min, x1_max), 
             lambda x: predict_class(prior_binary, class_conditionals_binary, x), 
             1, label_colours_binary, levels=[-0.5, 0.5, 1.5],
             num_points=300)
plt.title("Training set with prediction contours")
plt.show()

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-133-72304ad00156> in <module>
      9 logits = np.dot(np.array([X0.ravel(), X1.ravel()]).T, w) + w0
     10 Z = tf.math.sigmoid(logits)
---> 11 lr_contour = ax.contour(X0, X1, np.array(Z).T.reshape(*X0.shape), levels=10)
     12 ax.clabel(lr_contour, inline=True, fontsize=10)
     13 contour_plot((x0_min, x0_max), (x1_min, x1_max), 

ValueError: cannot reshape array of size 20000 into shape (100,100)