# Broadcasting Rules in TensorFlow Probability

This post introduces numpy's broadcasting rules and shows how you can use broadcasting when specifying batches of distributions in TensorFlow, as well as with the `prob` and `log_prob` methods. It is a summary of the lecture "Probabilistic Deep Learning with TensorFlow 2" from Imperial College London.

- Packages
- Operations on arrays of different sizes in numpy
- Numpy's broadcasting rule
- Broadcasting for univariate TensorFlow Distributions
- Broadcasting for multivariate TensorFlow distributions

```
import tensorflow as tf
import tensorflow_probability as tfp
import numpy as np
tfd = tfp.distributions
```

```
print("Tensorflow Version: ", tf.__version__)
print("Tensorflow Probability Version: ", tfp.__version__)
```

```
a = np.array([[1.],
              [2.],
              [3.],
              [4.]])  # shape (4, 1)
b = np.array([0., 1., 2.]) # shape (3,)
a.shape, b.shape
```

```
a + b
```

```
(a + b).shape
```

This is the addition

```
[ [1.],    +    [0., 1., 2.]
  [2.],
  [3.],
  [4.] ]
```

To execute it, numpy:

- Aligned the shapes of `a` and `b` on the last axis and prepended 1s to the shape with fewer axes: `a: 4 x 1 ---> a: 4 x 1` and `b: 3 ---> b: 1 x 3`.
- Checked that the sizes of the axes matched or were equal to 1: with `a: 4 x 1` and `b: 1 x 3`, `a` and `b` satisfied this criterion.
- Stretched both arrays on their 1-valued axes so that their shapes matched, then added them together: `a` was replicated 3 times in the second axis, while `b` was replicated 4 times in the first axis.

This meant that the addition in the final step was

```
[ [1., 1., 1.],    +    [ [0., 1., 2.],
  [2., 2., 2.],           [0., 1., 2.],
  [3., 3., 3.],           [0., 1., 2.],
  [4., 4., 4.] ]          [0., 1., 2.] ]
```

Addition was then carried out element-by-element, as you can verify by referring back to the output of the code cell above.

This resulted in an output with shape 4 x 3.
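The three steps can be reproduced by hand in numpy. The sketch below uses `np.broadcast_to` to stretch both arrays explicitly and checks that the result agrees with `a + b`. The names `b_2d`, `a_stretched`, `b_stretched` and `manual_sum` are illustrative and do not appear in the original notebook.

```python
import numpy as np

a = np.array([[1.], [2.], [3.], [4.]])  # shape (4, 1)
b = np.array([0., 1., 2.])              # shape (3,)

# Step 1: prepend a 1 to the shorter shape so both arrays have two axes.
b_2d = b.reshape(1, 3)

# Steps 2 and 3: every pair of axis sizes is equal or contains a 1,
# so both arrays can be stretched to the common shape (4, 3).
a_stretched = np.broadcast_to(a, (4, 3))
b_stretched = np.broadcast_to(b_2d, (4, 3))

# Element-by-element addition of the stretched arrays.
manual_sum = a_stretched + b_stretched

print(manual_sum.shape)
print(np.array_equal(manual_sum, a + b))
```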

## Numpy's broadcasting rule

Broadcasting rules describe how values should be transmitted when the inputs to an operation do not match.

In numpy, the broadcasting rule is very simple:

Prepend 1s to the smaller shape,

check that the axes of both arrays have sizes that are equal or 1,

then stretch the arrays in their size-1 axes.

A crucial aspect of this rule is that it does not require the input arrays have the same number of axes.

Another consequence of it is that a broadcasting output will have the largest size of its inputs in each axis.

Take the following multiplication as an example: if `a` has shape 3 x 7 x 1 and `b` has shape 1 x 5, then `a * b` has shape 3 x 7 x 5.

You can see that the output shape is the maximum of the sizes in each axis.
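To make the rule concrete, here is a minimal pure-Python sketch that applies it to shapes alone. The function name `broadcast_shape` is hypothetical (it is not a numpy function); numpy performs this check internally.

```python
def broadcast_shape(shape_a, shape_b):
    """Return the broadcast shape of two shapes, or raise ValueError."""
    ndim = max(len(shape_a), len(shape_b))
    # Prepend 1s to the shorter shape.
    padded_a = (1,) * (ndim - len(shape_a)) + tuple(shape_a)
    padded_b = (1,) * (ndim - len(shape_b)) + tuple(shape_b)
    result = []
    for size_a, size_b in zip(padded_a, padded_b):
        # Sizes must be equal, or one of them must be 1.
        if size_a != size_b and size_a != 1 and size_b != 1:
            raise ValueError(f"incompatible shapes {shape_a} and {shape_b}")
        # A size-1 axis is stretched to the size of the other axis.
        result.append(size_b if size_a == 1 else size_a)
    return tuple(result)


print(broadcast_shape((3, 7, 1), (1, 5)))
print(broadcast_shape((4, 1), (3,)))
```

Shapes that violate the rule, such as `(3,)` and `(4,)`, raise a `ValueError` in this sketch.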

Numpy's broadcasting rule also does not require one of the arrays to be bigger in all axes.

This is seen in the following example, where `a` is smaller than `b` in its third axis but is bigger in its second axis.

```
a = np.array([[[0.01], [0.1]],
              [[1.00], [10.]]])  # shape (2, 2, 1)
b = np.array([[[2., 2.]],
              [[3., 3.]]])       # shape (2, 1, 2)
a.shape, b.shape
```

```
a * b # shape (2, 2, 2)
```

```
print((a * b).shape)
```

Broadcasting behaviour also points to an efficient way to compute an outer product in numpy:

```
a = np.array([-1., 0., 1.])
b = np.array([0., 1., 2., 3.])
a.shape, b.shape
```

```
a[:, np.newaxis] * b # outer product ab^T, where a and b are column vectors
```

```
(a[:, np.newaxis] * b).shape
```

```
a[:, np.newaxis].shape
```
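As a sanity check, the broadcast product can be compared with numpy's dedicated `np.outer` function. This is only a sketch; the variable names `outer_by_broadcast` and `outer_reference` are illustrative.

```python
import numpy as np

a = np.array([-1., 0., 1.])
b = np.array([0., 1., 2., 3.])

# (3, 1) * (4,) broadcasts to (3, 4).
outer_by_broadcast = a[:, np.newaxis] * b
# Reference result from numpy's outer-product function.
outer_reference = np.outer(a, b)

print(outer_by_broadcast.shape)
print(np.array_equal(outer_by_broadcast, outer_reference))
```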

The idea of numpy stretching the arrays in their size-1 axes is useful and is functionally correct. But this is not what numpy literally does behind the scenes, since that would be an inefficient use of memory. Instead, numpy carries out the operation by looping over singleton (size-1) dimensions.
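One way to observe that no copy is made is to inspect a broadcast view directly. The sketch below assumes `np.broadcast_to` is a fair stand-in for what happens inside an arithmetic operation: it returns a read-only view whose stride along the stretched axis is 0, so the same memory is reused for every row.

```python
import numpy as np

b = np.array([0., 1., 2.])           # shape (3,)
b_view = np.broadcast_to(b, (4, 3))  # read-only view, no data copied

print(b_view.shape)
# A stride of 0 on the first axis means every "row" points at the same data.
print(b_view.strides)
print(np.shares_memory(b, b_view))
```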

To give you some practice with broadcasting, try predicting the output shapes for the following operations:

```
a = [[1.], [2.], [3.]]
b = np.zeros(shape=[10, 1, 1])
c = np.ones(shape=[4])
```

```
b.shape, c.shape
```

Note that `a` is a 2D list, not a numpy array. Numpy automatically converts the list to an array when it appears in an operation with an array.

```
(a + b).shape
```

```
(a * c).shape
```

```
(a * b + c).shape
```
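Once you have made your predictions, they can be checked without computing any values. The sketch below converts `a` explicitly with `np.asarray` and uses `np.broadcast_shapes`, which is available in numpy 1.20 and later; the `shape_*` names are illustrative.

```python
import numpy as np

a = np.asarray([[1.], [2.], [3.]])  # explicit conversion, shape (3, 1)
b = np.zeros(shape=[10, 1, 1])
c = np.ones(shape=[4])

# np.broadcast_shapes applies the broadcasting rule to shapes alone.
shape_ab = np.broadcast_shapes(a.shape, b.shape)
shape_ac = np.broadcast_shapes(a.shape, c.shape)
shape_abc = np.broadcast_shapes(a.shape, b.shape, c.shape)

# The predicted shapes agree with the shapes of the actual results.
print(shape_ab == (a + b).shape)
print(shape_ac == (a * c).shape)
print(shape_abc == (a * b + c).shape)
```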

The broadcasting rule for TensorFlow is the same as that for numpy. For example, TensorFlow also allows you to specify the parameters of Distribution objects using broadcasting.

What is meant by this can be understood through an example with the univariate normal distribution. Say that we wish to specify a parameter grid for six Gaussians. The parameter combinations to be used, `(loc, scale)`, are:

```
(0, 1)
(0, 10)
(0, 100)
(1, 1)
(1, 10)
(1, 100)
```

A laborious way of doing this is to explicitly pass each parameter to `tfd.Normal`:

```
batch_of_normals = tfd.Normal(loc=[0., 0., 0., 1., 1., 1.,], scale=[1., 10., 100., 1., 10., 100.])
batch_of_normals
```

```
batch_of_normals.loc
```

```
batch_of_normals.scale
```

A more succinct way to create a batch of distributions for this parameter grid is to use broadcasting.

Consider what would happen if we were to broadcast these arrays according to the rule discussed earlier:

```
loc   = [ [0.],
          [1.] ]
scale = [1., 10., 100.]
```

The shapes would be stretched according to

```
loc: 2 x 1 ---> 2 x 3
scale: 1 x 3 ---> 2 x 3
```

resulting in

```
loc   = [ [0., 0., 0.],
          [1., 1., 1.] ]
scale = [ [1., 10., 100.],
          [1., 10., 100.] ]
```

which are compatible with the `loc` and `scale` arguments of `tfd.Normal`.

Sure enough, this is precisely what TensorFlow does:

```
loc = [[0.], [1.]] # (2, 1)
scale = [1., 10., 100.] # (3, )
another_batch_of_normals = tfd.Normal(loc=loc, scale=scale)
another_batch_of_normals
```

```
another_batch_of_normals.loc
```

```
another_batch_of_normals.scale
```

In summary, TensorFlow broadcasts parameter arrays: it stretches them according to the broadcasting rule, then creates a distribution on an element-by-element basis.
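Because the rule is the same as numpy's, the stretched parameter grid can be previewed without TensorFlow. The following numpy-only sketch uses `np.broadcast_arrays` to stretch `loc` and `scale` and lists the resulting `(loc, scale)` pairs. It mirrors the shape logic only and does not construct any distribution objects; `loc_grid`, `scale_grid` and `pairs` are illustrative names.

```python
import numpy as np

loc = np.array([[0.], [1.]])       # shape (2, 1)
scale = np.array([1., 10., 100.])  # shape (3,)

# Stretch both parameter arrays to their common shape (2, 3).
loc_grid, scale_grid = np.broadcast_arrays(loc, scale)

# Pair the parameters element by element, in row-major order.
pairs = list(zip(loc_grid.ravel().tolist(), scale_grid.ravel().tolist()))

print(loc_grid.shape, scale_grid.shape)
print(pairs)
```

The pairs come out in the same order as the six-Gaussian parameter grid listed earlier.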

When using `prob` and `log_prob` with broadcasting, we follow the same principles as before. Let's make a new batch of normals as before, but with means centered at different locations to help distinguish the results we get.

```
loc = [[0.], [10.]]
scale = [1., 1., 1.]
another_batch_of_normals = tfd.Normal(loc=loc, scale=scale)
another_batch_of_normals
```

We can feed in samples of any shape, as long as that shape can be broadcast against our batch shape, which is (2, 3) in this example.

```
sample = tf.random.uniform((2, 1))
sample
```

```
another_batch_of_normals.prob(sample)
```

Or broadcasting along the first axis instead:

```
sample = tf.random.uniform((1, 3))
sample
```

```
another_batch_of_normals.prob(sample)
```

Or even both axes:

```
sample = tf.random.uniform((1, 1))
sample
```

```
another_batch_of_normals.prob(sample)
```

`log_prob` works in exactly the same way with broadcasting. We can replace `prob` with `log_prob` in any of the previous examples:

```
sample = tf.random.uniform((1, 3))
another_batch_of_normals.log_prob(sample)
```
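The shape behaviour of these calls can be imitated in plain numpy. The sketch below writes out the normal log-density by hand in a hypothetical helper `normal_log_prob` (a stand-in for illustration, not the TensorFlow Probability implementation) and shows that each sample shape used above broadcasts against the (2, 3) batch shape.

```python
import numpy as np


def normal_log_prob(x, loc, scale):
    """Log-density of N(loc, scale^2) at x; all arguments broadcast."""
    x, loc, scale = np.asarray(x), np.asarray(loc), np.asarray(scale)
    z = (x - loc) / scale
    return -0.5 * z**2 - np.log(scale) - 0.5 * np.log(2.0 * np.pi)


loc = [[0.], [10.]]   # shape (2, 1)
scale = [1., 1., 1.]  # shape (3,), so the batch shape is (2, 3)

rng = np.random.default_rng(0)
result_shapes = {}
for sample_shape in [(2, 1), (1, 3), (1, 1)]:
    sample = rng.uniform(size=sample_shape)
    result_shapes[sample_shape] = normal_log_prob(sample, loc, scale).shape

print(result_shapes)
```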

Broadcasting behaviour for multivariate distributions is only a little more sophisticated than it is for univariate distributions.

Recall that `MultivariateNormalDiag` has two parameter arguments: `loc` and `scale_diag`. When specifying a single distribution, these arguments are vectors of the same length:

```
single_mvt_normal = tfd.MultivariateNormalDiag(loc=[0., 0.], scale_diag=[1., 0.5])
single_mvt_normal
```

```
single_mvt_normal.loc
```

```
single_mvt_normal.covariance()
```

The covariance matrix is the diagonal matrix whose diagonal entries are the squares of `scale_diag`.
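As a quick numerical check of this statement, the numpy sketch below builds the covariance implied by `scale_diag=[1., 0.5]`. It only mirrors the arithmetic and does not call TensorFlow Probability.

```python
import numpy as np

scale_diag = np.array([1., 0.5])

# Diagonal covariance: variances are the squared scales, off-diagonals are 0.
covariance = np.diag(scale_diag ** 2)

print(covariance)
```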

The size of the final axis of the inputs determines the event shape for each distribution in the batch. This means that if we pass

```
loc = [ [0., 0.],
[1., 1.] ]
scale_diag = [1., 0.5]
```

such that

```
loc:        2 x 2
scale_diag: 1 x 2
                ^ final dimension is interpreted as event dimension
            ^ other dimensions are interpreted as batch dimensions
```

then a batch of two bivariate normal distributions will be created.

```
loc = [[0., 0.],
[1., 1.]]
scale_diag = [1., 0.5]
```

```
batch_of_mvt_normals = tfd.MultivariateNormalDiag(loc=loc, scale_diag=scale_diag)
batch_of_mvt_normals
```

```
# There is a batch of two distributions with different means and same covariance
batch_of_mvt_normals.parameters
```

Knowing that, for multivariate distributions, TensorFlow

- interprets the final axis of an array of parameters as the event shape,
- and broadcasts over the remaining axes,

can you predict what the batch and event shapes will be if we pass the arguments

```
loc = [ [ 1., 1., 1.],
[-1., -1., -1.] ] # shape (2, 3)
scale_diag = [ [[0.1, 0.1, 0.1]],
[[10., 10., 10.]] ] # shape (2, 1, 3)
```

to `MultivariateNormalDiag`?


Solution:

Align the parameter array shapes on their last axis, prepending 1s where necessary:

```
loc: 1 x 2 x 3
scale_diag: 2 x 1 x 3
```

The final axis has size 3, so `event_shape = (3,)`. The remaining axes are broadcast over to yield

```
loc: 2 x 2 x 3
scale_diag: 2 x 2 x 3
```

so `batch_shape = (2, 2)`.

Let's see if this is correct!

```
loc = [ [ 1., 1., 1.],
[-1., -1., -1.] ] # shape (2, 3)
scale_diag = [ [[0.1, 0.1, 0.1]],
[[10., 10., 10.]] ] # shape (2, 1, 3)
```

```
another_batch_of_mvt_normals = tfd.MultivariateNormalDiag(loc=loc, scale_diag=scale_diag)
another_batch_of_mvt_normals
```

```
another_batch_of_mvt_normals.parameters
```
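The reasoning in the solution can be captured in a small helper that works on shapes alone. This is a sketch under the rule stated above (last axis is the event, remaining axes are broadcast as the batch); `mvn_diag_shapes` is a hypothetical name, not a TensorFlow Probability function, and `np.broadcast_shapes` requires numpy 1.20 or later.

```python
import numpy as np


def mvn_diag_shapes(loc_shape, scale_diag_shape):
    """Predict (batch_shape, event_shape) from the parameter shapes."""
    # Broadcast the full parameter shapes against each other.
    full_shape = np.broadcast_shapes(loc_shape, scale_diag_shape)
    # The last axis is the event; everything before it is the batch.
    return full_shape[:-1], full_shape[-1:]


batch_shape, event_shape = mvn_diag_shapes((2, 3), (2, 1, 3))
print(batch_shape, event_shape)
```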

As we did before, let's also look at broadcasting with `prob` and `log_prob` when we have batches of multivariate distributions.

```
loc = [[0.],
[1.],
[0.]]
scale = [1., 10., 100., 1., 10., 100.]
```

```
another_batch_of_normals = tfd.Normal(loc=loc, scale=scale)
another_batch_of_normals
```

And to refresh our memory of `Independent`, we'll use it below to roll the rightmost batch shape into the event shape.

```
another_batch_of_mvt_normals = tfd.Independent(another_batch_of_normals)
another_batch_of_mvt_normals
```

Now, onto the broadcasting:

```
# Batch_size shaped input (broadcast over event)
sample = tf.random.uniform((3, 1))
another_batch_of_mvt_normals.prob(sample)
```

```
# Event_shape shaped input (broadcast over batch)
sample = tf.random.uniform((1, 6))
another_batch_of_mvt_normals.prob(sample)
```

```
# [Samples,Batch_size,Events] shaped input (broadcast over samples)
sample = tf.random.uniform((2, 3, 6))
another_batch_of_mvt_normals.prob(sample)
```

```
sample = tf.random.uniform((2, 1, 6))
another_batch_of_mvt_normals.prob(sample)
```

As a final example, we use `log_prob` instead of `prob`:

```
# [S,b,e] shaped input where [b,e] can be broadcast against [B,E]
sample = tf.random.uniform((2, 3, 1))
another_batch_of_mvt_normals.log_prob(sample)
```
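The output shapes of all the examples in this section can be predicted with a numpy-only sketch. It assumes that the log-density of an `Independent` distribution is the sum of the underlying univariate log-densities over the last (event) axis. The helpers `normal_log_prob` and `independent_normal_log_prob` are hypothetical stand-ins, not TensorFlow Probability functions.

```python
import numpy as np


def normal_log_prob(x, loc, scale):
    """Log-density of N(loc, scale^2) at x; all arguments broadcast."""
    x, loc, scale = np.asarray(x), np.asarray(loc), np.asarray(scale)
    z = (x - loc) / scale
    return -0.5 * z**2 - np.log(scale) - 0.5 * np.log(2.0 * np.pi)


def independent_normal_log_prob(x, loc, scale):
    """Sum the univariate log-densities over the last (event) axis."""
    return normal_log_prob(x, loc, scale).sum(axis=-1)


loc = [[0.], [1.], [0.]]                # shape (3, 1)
scale = [1., 10., 100., 1., 10., 100.]  # shape (6,): batch (3,), event (6,)

rng = np.random.default_rng(0)
output_shapes = {}
for sample_shape in [(3, 1), (1, 6), (2, 3, 6), (2, 1, 6), (2, 3, 1)]:
    sample = rng.uniform(size=sample_shape)
    log_density = independent_normal_log_prob(sample, loc, scale)
    output_shapes[sample_shape] = log_density.shape

print(output_shapes)
```

In each case the sample is first broadcast against the full (3, 6) parameter shape, and the event axis is then summed away, leaving the sample and batch axes.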

You should now feel confident specifying batches of distributions using broadcasting. As you may have already guessed, broadcasting is especially useful when specifying grids of hyperparameters.

If you don't feel entirely comfortable with broadcasting quite yet, don't worry: re-read this notebook, go through the further reading provided below, and experiment with broadcasting in both numpy and TensorFlow, and you'll be broadcasting in no time.

## Further reading and resources

- Numpy documentation on broadcasting: https://numpy.org/devdocs/user/theory.broadcasting.html
- TensorFlow XLA documentation on broadcasting: https://www.tensorflow.org/xla/broadcasting