Broadcasting Rules in TensorFlow Probability
This post introduces numpy's broadcasting rules and shows how you can use broadcasting when specifying batches of distributions in TensorFlow, as well as with the `prob` and `log_prob` methods. This is a summary of the lecture "Probabilistic Deep Learning with TensorFlow 2" from Imperial College London.
- Packages
- Operations on arrays of different sizes in numpy
- Numpy's broadcasting rule
- Broadcasting for univariate TensorFlow Distributions
- Broadcasting for multivariate TensorFlow distributions
import tensorflow as tf
import tensorflow_probability as tfp
import numpy as np
tfd = tfp.distributions
print("Tensorflow Version: ", tf.__version__)
print("Tensorflow Probability Version: ", tfp.__version__)
a = np.array([[1.],
[2.],
[3.],
[4.]]) # shape (4, 1)
b = np.array([0., 1., 2.]) # shape (3,)
a.shape, b.shape
a + b
(a + b).shape
This is the addition:
[ [1.],   +   [0., 1., 2.]
  [2.],
  [3.],
  [4.] ]
To execute it, numpy:
- Aligned the shapes of `a` and `b` on the last axis and prepended 1s to the shape with fewer axes:
  a: 4 x 1 ---> a: 4 x 1
  b:     3 ---> b: 1 x 3
- Checked that the sizes of the axes matched or were equal to 1:
  a: 4 x 1
  b: 1 x 3
  `a` and `b` satisfied this criterion.
- Stretched both arrays on their 1-valued axes so that their shapes matched, then added them together. `a` was replicated 3 times in the second axis, while `b` was replicated 4 times in the first axis.
This meant that the addition in the final step was
[ [1., 1., 1.],   +   [ [0., 1., 2.],
  [2., 2., 2.],         [0., 1., 2.],
  [3., 3., 3.],         [0., 1., 2.],
  [4., 4., 4.] ]        [0., 1., 2.] ]
Addition was then carried out element-by-element, as you can verify by referring back to the output of the code cell above.
This resulted in an output with shape 4 x 3.
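To make the stretching concrete, here is a quick check (a small sketch, not part of the original lecture) that materialises the stretched operands with `np.broadcast_to` and compares their sum with the broadcast addition above:
a_stretched = np.broadcast_to(a, (4, 3))  # a replicated 3 times along the second axis
b_stretched = np.broadcast_to(b, (4, 3))  # b replicated 4 times along the first axis
np.array_equal(a + b, a_stretched + b_stretched)  # True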
Numpy's broadcasting rule
Broadcasting rules describe how values should be replicated when the shapes of the inputs to an operation do not match.
In numpy, the broadcasting rule is very simple:
Prepend 1s to the smaller shape,
check that the axes of both arrays have sizes that are equal or 1,
then stretch the arrays in their size-1 axes.
A crucial aspect of this rule is that it does not require the input arrays to have the same number of axes. Another consequence is that, in each axis, the output of a broadcast operation takes the largest of its inputs' sizes.
Take the following multiplication as an example:
a:     3 x 7 x 1
b:         1 x 5
a * b: 3 x 7 x 5
You can see that the output shape is the maximum of the sizes in each axis.
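If you just want the broadcast output shape without building any arrays, recent versions of numpy (1.20 and later) provide `np.broadcast_shapes`; a small sketch:
np.broadcast_shapes((3, 7, 1), (1, 5))  # (3, 7, 5)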
Numpy's broadcasting rule also does not require one of the arrays to be bigger in all axes. This is seen in the following example, where `a` is smaller than `b` in its third axis but bigger in its second axis.
a = np.array([[[0.01], [0.1]],
[[1.00], [10.]]]) # shape (2, 2, 1)
b = np.array([[[2., 2.]],
[[3., 3.]]]) # shape (2, 1, 2)
a.shape, b.shape
a * b # shape (2, 2, 2)
print((a * b).shape)
Broadcasting behaviour also points to an efficient way to compute an outer product in numpy:
a = np.array([-1., 0., 1.])
b = np.array([0., 1., 2., 3.])
a.shape, b.shape
a[:, np.newaxis] * b # outer product ab^T, where a and b are column vectors
(a[:, np.newaxis] * b).shape
a[:, np.newaxis].shape
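As a quick sanity check (a sketch, not part of the original lecture), this broadcasted product agrees with numpy's dedicated outer-product routine:
np.allclose(a[:, np.newaxis] * b, np.outer(a, b))  # True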
The idea of numpy stretching the arrays in their size-1 axes is useful and functionally correct, but it is not what numpy literally does behind the scenes, since that would be an inefficient use of memory. Instead, numpy carries out the operation by looping over the singleton (size-1) dimensions without copying any data.
To give you some practice with broadcasting, try predicting the output shapes for the following operations:
a = [[1.], [2.], [3.]]
b = np.zeros(shape=[10, 1, 1])
c = np.ones(shape=[4])
b.shape, c.shape
Note that `a` is a nested Python list, not a numpy array, but numpy operations automatically convert lists to arrays.
(a + b).shape
(a * c).shape
(a * b + c).shape
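If you want to verify your predictions explicitly, here is one way to work through them (a sketch; `np.asarray` just makes the list-to-array conversion explicit):
assert (np.asarray(a) + b).shape == (10, 3, 1)      # (3, 1) with (10, 1, 1) -> (10, 3, 1)
assert (np.asarray(a) * c).shape == (3, 4)          # (3, 1) with (4,)       -> (3, 4)
assert (np.asarray(a) * b + c).shape == (10, 3, 4)  # (10, 3, 1) with (4,)   -> (10, 3, 4)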
Broadcasting for univariate TensorFlow Distributions
The broadcasting rule for TensorFlow is the same as that for numpy. For example, TensorFlow also allows you to specify the parameters of Distribution objects using broadcasting.
What is meant by this can be understood through an example with the univariate normal distribution. Say that we wish to specify a parameter grid for six Gaussians. The parameter combinations to be used, `(loc, scale)`, are:
(0, 1)
(0, 10)
(0, 100)
(1, 1)
(1, 10)
(1, 100)
A laborious way of doing this is to explicitly pass each parameter to `tfd.Normal`:
batch_of_normals = tfd.Normal(loc=[0., 0., 0., 1., 1., 1.,], scale=[1., 10., 100., 1., 10., 100.])
batch_of_normals
batch_of_normals.loc
batch_of_normals.scale
A more succinct way to create a batch of distributions for this parameter grid is to use broadcasting.
Consider what would happen if we were to broadcast these arrays according to the rule discussed earlier:
loc = [ [0.],
[1.] ]
scale = [1., 10., 100.]
The shapes would be stretched according to
loc: 2 x 1 ---> 2 x 3
scale: 1 x 3 ---> 2 x 3
resulting in
loc = [ [0., 0., 0.],
[1., 1., 1.] ]
scale = [ [1., 10., 100.],
[1., 10., 100.] ]
which are compatible with the `loc` and `scale` arguments of `tfd.Normal`.
Sure enough, this is precisely what TensorFlow does:
loc = [[0.], [1.]] # (2, 1)
scale = [1., 10., 100.] # (3, )
another_batch_of_normals = tfd.Normal(loc=loc, scale=scale)
another_batch_of_normals
another_batch_of_normals.loc
another_batch_of_normals.scale
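As a quick check (a sketch using the `batch_shape` property that TFP distributions expose), the (2, 1) `loc` and (3,) `scale` produce a 2 x 3 batch:
another_batch_of_normals.batch_shape  # a 2 x 3 batch of Normal distributions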
In summary, TensorFlow broadcasts parameter arrays: it stretches them according to the broadcasting rule, then creates a distribution on an element-by-element basis.
When using `prob` and `log_prob` with broadcasting, we follow the same principles as before. Let's make a new batch of normals as before, but with means centered at different locations to help distinguish the results we get.
loc = [[0.], [10.]]
scale = [1., 1., 1.]
another_batch_of_normals = tfd.Normal(loc=loc, scale=scale)
another_batch_of_normals
We can feed in samples of any shape, as long as that shape can be broadcast against our batch shape. For example:
sample = tf.random.uniform((2, 1))
sample
another_batch_of_normals.prob(sample)
Or broadcasting along the first axis instead:
sample = tf.random.uniform((1, 3))
sample
another_batch_of_normals.prob(sample)
Or even both axes:
sample = tf.random.uniform((1, 1))
sample
another_batch_of_normals.prob(sample)
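In each of these cases the sample is broadcast against the distribution's (2, 3) batch shape, so `prob` returns a tensor of that shape; a minimal check (a sketch, not part of the original lecture):
another_batch_of_normals.prob(tf.random.uniform((1, 1))).shape  # result has the full broadcast shape (2, 3)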
`log_prob` works in exactly the same way with broadcasting. We can replace `prob` with `log_prob` in any of the previous examples:
sample = tf.random.uniform((1, 3))
another_batch_of_normals.log_prob(sample)
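Since these are densities, `log_prob` is just the natural logarithm of `prob`; a quick sketch of that equivalence:
np.allclose(another_batch_of_normals.log_prob(sample),
            np.log(another_batch_of_normals.prob(sample)))  # True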
Broadcasting for multivariate TensorFlow distributions
Broadcasting behaviour for multivariate distributions is only a little more sophisticated than it is for univariate distributions.
Recall that `MultivariateNormalDiag` has two parameter arguments: `loc` and `scale_diag`. When specifying a single distribution, these arguments are vectors of the same length:
single_mvt_normal = tfd.MultivariateNormalDiag(loc=[0., 0.], scale_diag=[1., 0.5])
single_mvt_normal
single_mvt_normal.loc
single_mvt_normal.covariance()
The covariance matrix is the diagonal matrix with the squares of `scale_diag` on its diagonal.
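We can verify this directly (a small sketch comparing against the squared scales placed on a diagonal):
np.allclose(single_mvt_normal.covariance(), np.diag(np.square([1., 0.5])))  # True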
The size of the final axis of the inputs determines the event shape for each distribution in the batch. This means that if we pass
loc = [ [0., 0.],
[1., 1.] ]
scale_diag = [1., 0.5]
such that
loc:        2 x 2
scale_diag: 1 x 2
                ^ final dimension is interpreted as event dimension
            ^ other dimensions are interpreted as batch dimensions
then a batch of two bivariate normal distributions will be created.
loc = [[0., 0.],
[1., 1.]]
scale_diag = [1., 0.5]
batch_of_mvt_normals = tfd.MultivariateNormalDiag(loc=loc, scale_diag=scale_diag)
batch_of_mvt_normals
# There is a batch of two distributions with different means and same covariance
batch_of_mvt_normals.parameters
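A quick look at the shape properties (a sketch using the standard `batch_shape` and `event_shape` attributes) confirms a batch of two bivariate distributions:
batch_of_mvt_normals.batch_shape, batch_of_mvt_normals.event_shape  # batch_shape [2], event_shape [2]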
Knowing that, for multivariate distributions, TensorFlow
- interprets the final axis of an array of parameters as the event shape,
- and broadcasts over the remaining axes,
can you predict what the batch and event shapes will be if we pass the arguments
loc = [ [ 1., 1., 1.],
[-1., -1., -1.] ] # shape (2, 3)
scale_diag = [ [[0.1, 0.1, 0.1]],
[[10., 10., 10.]] ] # shape (2, 1, 3)
to `MultivariateNormalDiag`?
Solution:
Align the parameter array shapes on their last axis, prepending 1s where necessary:
loc:        1 x 2 x 3
scale_diag: 2 x 1 x 3
The final axis has size 3, so `event_shape = (3,)`. The remaining axes are broadcast over to yield
loc:        2 x 2 x 3
scale_diag: 2 x 2 x 3
so `batch_shape = (2, 2)`.
Let's see if this is correct!
loc = [ [ 1., 1., 1.],
[-1., -1., -1.] ] # shape (2, 3)
scale_diag = [ [[0.1, 0.1, 0.1]],
[[10., 10., 10.]] ] # shape (2, 1, 3)
another_batch_of_mvt_normals = tfd.MultivariateNormalDiag(loc=loc, scale_diag=scale_diag)
another_batch_of_mvt_normals
another_batch_of_mvt_normals.parameters
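The shape properties confirm the prediction (a quick sketch):
another_batch_of_mvt_normals.batch_shape, another_batch_of_mvt_normals.event_shape  # batch_shape [2, 2], event_shape [3]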
As we did before, let's also look at broadcasting when we have batches of multivariate distributions.
loc = [[0.],
[1.],
[0.]]
scale = [1., 10., 100., 1., 10., 100.]
another_batch_of_normals = tfd.Normal(loc=loc, scale=scale)
another_batch_of_normals
And to refresh our memory of `Independent`, we'll use it below to roll the rightmost batch axis into the event shape.
another_batch_of_mvt_normals = tfd.Independent(another_batch_of_normals)
another_batch_of_mvt_normals
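Here `Independent` turns the (3, 6) batch of Normals into a batch of three 6-dimensional distributions; a quick sketch of the resulting shapes (using the same shape properties as above):
another_batch_of_mvt_normals.batch_shape, another_batch_of_mvt_normals.event_shape  # batch_shape [3], event_shape [6]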
Now, onto the broadcasting:
# Batch_size shaped input (broadcast over event)
sample = tf.random.uniform((3, 1))
another_batch_of_mvt_normals.prob(sample)
# Event_shape shaped input (broadcast over batch)
sample = tf.random.uniform((1, 6))
another_batch_of_mvt_normals.prob(sample)
# [Samples,Batch_size,Events] shaped input (broadcast over samples)
sample = tf.random.uniform((2, 3, 6))
another_batch_of_mvt_normals.prob(sample)
sample = tf.random.uniform((2, 1, 6))
another_batch_of_mvt_normals.prob(sample)
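In each of these calls, the sample broadcasts against the (3,) batch shape and (6,) event shape, the event axis is reduced over, and the remaining axes form the output shape; for instance (a small sketch):
another_batch_of_mvt_normals.prob(tf.random.uniform((2, 1, 6))).shape  # result shape (2, 3)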
As a final example, with `log_prob` instead of `prob`:
# [S,b,e] shaped input where [b,e] can be broadcast against [B,E]
sample = tf.random.uniform((2, 3, 1))
another_batch_of_mvt_normals.log_prob(sample)
You should now feel confident specifying batches of distributions using broadcasting. As you may have already guessed, broadcasting is especially useful when specifying grids of hyperparameters.
If you don't feel entirely comfortable with broadcasting quite yet, don't worry: re-read this notebook, go through the further reading provided below, and experiment with broadcasting in both numpy and TensorFlow, and you'll be broadcasting in no time.
Further reading and resources
- Numpy documentation on broadcasting: https://numpy.org/devdocs/user/theory.broadcasting.html
- https://www.tensorflow.org/xla/broadcasting