Packages

import tensorflow as tf
import tensorflow_probability as tfp

import numpy as np

tfd = tfp.distributions
print("Tensorflow Version: ", tf.__version__)
print("Tensorflow Probability Version: ", tfp.__version__)
Tensorflow Version:  2.5.0
Tensorflow Probability Version:  0.13.0

Operations on arrays of different sizes in numpy

Numpy operations can be applied to arrays that are not of the same shape, but only if the shapes satisfy certain conditions.

As a demonstration of this, let us add together two arrays of different shapes:

a = np.array([[1.],
              [2.],
              [3.],
              [4.]])  # shape (4, 1)

b = np.array([0., 1., 2.])  # shape (3,) 

a.shape, b.shape
((4, 1), (3,))
a + b
array([[1., 2., 3.],
       [2., 3., 4.],
       [3., 4., 5.],
       [4., 5., 6.]])
(a + b).shape
(4, 3)

This is the addition

[ [1.],    +  [0., 1., 2.]  
  [2.],  
  [3.],  
  [4.] ]

To execute it, numpy:

  1. Aligned the shapes of a and b on the last axis and prepended 1s to the shape with fewer axes:
     a: 4 x 1     --->    a: 4 x 1
     b:     3     --->    b: 1 x 3
  2. Checked that the sizes of the axes matched or were equal to 1:
     a: 4 x 1
     b: 1 x 3

     a and b satisfied this criterion.
  3. Stretched both arrays on their 1-valued axes so that their shapes matched, then added them together.
     a was replicated 3 times in the second axis, while b was replicated 4 times in the first axis.

This meant that the addition in the final step was

[ [1., 1., 1.],    +  [ [0., 1., 2.],  
  [2., 2., 2.],         [0., 1., 2.],  
  [3., 3., 3.],         [0., 1., 2.],  
  [4., 4., 4.] ]        [0., 1., 2.] ]

Addition was then carried out element-by-element, as you can verify by referring back to the output of the code cell above.
This resulted in an output with shape 4 x 3.
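If you want to make the stretching explicit yourself, np.broadcast_to reproduces steps 2 and 3 above. Here is a small sketch (the names a_stretched and b_stretched are just illustrative):

a_stretched = np.broadcast_to(a, (4, 3))  # a replicated 3 times along the second axis
b_stretched = np.broadcast_to(b, (4, 3))  # b replicated 4 times along the first axis

# Element-wise addition of the stretched arrays matches the broadcast result.
np.array_equal(a_stretched + b_stretched, a + b)  # True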

Numpy's broadcasting rule

Broadcasting rules describe how arrays with different shapes are combined when they are passed to the same operation.
In numpy, the broadcasting rule is very simple:

Prepend 1s to the smaller shape,
check that the axes of both arrays have sizes that are equal or 1,
then stretch the arrays in their size-1 axes.

A crucial aspect of this rule is that it does not require that the input arrays have the same number of axes.
Another consequence is that the output of a broadcast operation has, in each axis, the largest of the sizes of its inputs.
Take the following multiplication as an example:

a:     3 x 7 x 1
b:         1 x 5
a * b: 3 x 7 x 5

You can see that the output shape is the maximum of the sizes in each axis.
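You can also check shape compatibility without building any arrays: np.broadcast_shapes (available in numpy 1.20 and later) applies exactly this rule.

np.broadcast_shapes((3, 7, 1), (1, 5))  # (3, 7, 5)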

Numpy's broadcasting rule also does not require that one of the arrays has to be bigger in all axes.
This is seen in the following example, where a is smaller than b in its third axis but is bigger in its second axis.

a = np.array([[[0.01], [0.1]],
              [[1.00], [10.]]])  # shape (2, 2, 1)
b = np.array([[[2., 2.]],
              [[3., 3.]]])       # shape (2, 1, 2)

a.shape, b.shape
((2, 2, 1), (2, 1, 2))
a * b # shape (2, 2, 2)
array([[[2.e-02, 2.e-02],
        [2.e-01, 2.e-01]],

       [[3.e+00, 3.e+00],
        [3.e+01, 3.e+01]]])
print((a * b).shape)
(2, 2, 2)

Broadcasting behaviour also points to an efficient way to compute an outer product in numpy:

a = np.array([-1., 0., 1.])
b = np.array([0., 1., 2., 3.])

a.shape, b.shape
((3,), (4,))
a[:, np.newaxis] * b  # outer product ab^T, where a and b are column vectors
array([[-0., -1., -2., -3.],
       [ 0.,  0.,  0.,  0.],
       [ 0.,  1.,  2.,  3.]])
(a[:, np.newaxis] * b).shape
(3, 4)
a[:, np.newaxis].shape
(3, 1)
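As a quick cross-check, numpy's built-in np.outer gives the same result for these two vectors:

np.allclose(a[:, np.newaxis] * b, np.outer(a, b))  # True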

The idea of numpy stretching the arrays in their size-1 axes is useful and is functionally correct. But this is not what numpy literally does behind the scenes, since that would be an inefficient use of memory. Instead, numpy gives the size-1 axes a stride of zero, so the same values are reused without ever being copied.
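You can see this zero-copy behaviour directly by inspecting strides. This sketch uses np.broadcast_to on the (4, 1) array from earlier:

a = np.array([[1.], [2.], [3.], [4.]])  # shape (4, 1), float64
stretched = np.broadcast_to(a, (4, 3))  # a read-only view, not a copy

a.strides          # (8, 8)
stretched.strides  # (8, 0): the stretched axis has stride 0, so no values are duplicated in memory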

To give you some practice with broadcasting, try predicting the output shapes for the following operations:

a = [[1.], [2.], [3.]]
b = np.zeros(shape=[10, 1, 1])
c = np.ones(shape=[4])
b.shape, c.shape
((10, 1, 1), (4,))

Note that a is a nested Python list rather than a numpy array; numpy operations convert it to an array automatically.

(a + b).shape
(10, 3, 1)
(a * c).shape
(3, 4)
(a * b + c).shape
(10, 3, 4)
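If you'd like to check your predictions without evaluating the operations, np.broadcast_shapes (numpy 1.20+) works directly on the shapes; note the use of np.shape(a), since a is a plain list:

np.broadcast_shapes(np.shape(a), b.shape)           # (10, 3, 1)
np.broadcast_shapes(np.shape(a), c.shape)           # (3, 4)
np.broadcast_shapes(np.shape(a), b.shape, c.shape)  # (10, 3, 4)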

Broadcasting for univariate TensorFlow Distributions

The broadcasting rule for TensorFlow is the same as the one for numpy. In particular, TensorFlow allows you to specify the parameters of Distribution objects using broadcasting.

What is meant by this can be understood through an example with the univariate normal distribution. Say that we wish to specify a parameter grid for six Gaussians. The parameter combinations to be used, (loc, scale), are:

(0, 1)  
(0, 10)  
(0, 100)  
(1, 1)  
(1, 10)  
(1, 100)

A laborious way of doing this is to explicitly pass each parameter to tfd.Normal:

batch_of_normals = tfd.Normal(loc=[0., 0., 0., 1., 1., 1.,], scale=[1., 10., 100., 1., 10., 100.])
batch_of_normals
<tfp.distributions.Normal 'Normal' batch_shape=[6] event_shape=[] dtype=float32>
batch_of_normals.loc
<tf.Tensor: shape=(6,), dtype=float32, numpy=array([0., 0., 0., 1., 1., 1.], dtype=float32)>
batch_of_normals.scale
<tf.Tensor: shape=(6,), dtype=float32, numpy=array([  1.,  10., 100.,   1.,  10., 100.], dtype=float32)>

A more succinct way to create a batch of distributions for this parameter grid is to use broadcasting.
Consider what would happen if we were to broadcast these arrays according to the rule discussed earlier:

loc = [ [0.],
        [1.] ]
scale = [1., 10., 100.]

The shapes would be stretched according to

loc:   2 x 1 ---> 2 x 3
scale: 1 x 3 ---> 2 x 3

resulting in

loc = [ [0., 0., 0.],
        [1., 1., 1.] ]
scale = [ [1., 10., 100.],
          [1., 10., 100.] ]

which are compatible with the loc and scale arguments of tfd.Normal.
Sure enough, this is precisely what TensorFlow does:

loc = [[0.], [1.]] # (2, 1)
scale = [1., 10., 100.] # (3, )

another_batch_of_normals = tfd.Normal(loc=loc, scale=scale)
another_batch_of_normals
<tfp.distributions.Normal 'Normal' batch_shape=[2, 3] event_shape=[] dtype=float32>
another_batch_of_normals.loc
<tf.Tensor: shape=(2, 1), dtype=float32, numpy=
array([[0.],
       [1.]], dtype=float32)>
another_batch_of_normals.scale
<tf.Tensor: shape=(3,), dtype=float32, numpy=array([  1.,  10., 100.], dtype=float32)>

In summary, TensorFlow broadcasts parameter arrays: it stretches them according to the broadcasting rule, then creates a distribution on an element-by-element basis.
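One way to convince yourself that a full 2 x 3 grid of distributions was created is to draw a single sample from the batch, which returns one value per distribution:

another_batch_of_normals.sample().shape  # TensorShape([2, 3]): one draw per (loc, scale) combination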

Broadcasting with prob and log_prob methods

When using prob and log_prob with broadcasting, we follow the same principles as before. Let's make a new batch of normals as before, but with means centered at different locations so that the results are easier to tell apart.

loc = [[0.], [10.]]
scale = [1., 1., 1.]

another_batch_of_normals = tfd.Normal(loc=loc, scale=scale)
another_batch_of_normals
<tfp.distributions.Normal 'Normal' batch_shape=[2, 3] event_shape=[] dtype=float32>

We can feed in a sample of any shape, as long as it can be broadcast against the batch shape in this example.

sample = tf.random.uniform((2, 1))
sample
<tf.Tensor: shape=(2, 1), dtype=float32, numpy=
array([[0.30983818],
       [0.7367121 ]], dtype=float32)>
another_batch_of_normals.prob(sample)
<tf.Tensor: shape=(2, 3), dtype=float32, numpy=
array([[3.8024542e-01, 3.8024542e-01, 3.8024542e-01],
       [9.2860834e-20, 9.2860834e-20, 9.2860834e-20]], dtype=float32)>

Or broadcasting along the first axis instead:

sample = tf.random.uniform((1, 3))
sample
<tf.Tensor: shape=(1, 3), dtype=float32, numpy=array([[0.48120987, 0.05952108, 0.13428748]], dtype=float32)>
another_batch_of_normals.prob(sample)
<tf.Tensor: shape=(2, 3), dtype=float32, numpy=
array([[3.5532588e-01, 3.9823624e-01, 3.9536136e-01],
       [8.4288889e-21, 1.3928772e-22, 2.9206223e-22]], dtype=float32)>

Or even both axes:

sample = tf.random.uniform((1, 1))
sample
<tf.Tensor: shape=(1, 1), dtype=float32, numpy=array([[0.61676073]], dtype=float32)>
another_batch_of_normals.prob(sample)
<tf.Tensor: shape=(2, 3), dtype=float32, numpy=
array([[3.2984403e-01, 3.2984403e-01, 3.2984403e-01],
       [3.0348733e-20, 3.0348733e-20, 3.0348733e-20]], dtype=float32)>

log_prob works in exactly the same way with broadcasting. We can replace prob with log_prob in any of the previous examples:

sample = tf.random.uniform((1, 3))
another_batch_of_normals.log_prob(sample)
<tf.Tensor: shape=(2, 3), dtype=float32, numpy=
array([[ -1.2130027 ,  -0.96663547,  -1.0695213 ],
       [-43.54405   , -47.878044  , -45.58167   ]], dtype=float32)>
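As a sanity check, exponentiating the log_prob values recovers the corresponding prob values for the same sample:

log_p = another_batch_of_normals.log_prob(sample)
p = another_batch_of_normals.prob(sample)
tf.reduce_max(tf.abs(tf.exp(log_p) - p))  # close to zero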

Broadcasting for multivariate TensorFlow distributions

Broadcasting behaviour for multivariate distributions is only a little more sophisticated than it is for univariate distributions.

Recall that MultivariateNormalDiag has two parameter arguments: loc and scale_diag. When specifying a single distribution, these arguments are vectors of the same length:

single_mvt_normal = tfd.MultivariateNormalDiag(loc=[0., 0.], scale_diag=[1., 0.5])
single_mvt_normal
<tfp.distributions.MultivariateNormalDiag 'MultivariateNormalDiag' batch_shape=[] event_shape=[2] dtype=float32>
single_mvt_normal.loc
<tf.Tensor: shape=(2,), dtype=float32, numpy=array([0., 0.], dtype=float32)>
single_mvt_normal.covariance()
<tf.Tensor: shape=(2, 2), dtype=float32, numpy=
array([[1.  , 0.  ],
       [0.  , 0.25]], dtype=float32)>

The covariance matrix is diagonal, with the squares of the scale_diag entries on the diagonal.
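Equivalently, that covariance can be built directly from scale_diag:

tf.linalg.diag(tf.square(tf.constant([1., 0.5])))  # matches single_mvt_normal.covariance()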

The size of the final axis of the inputs determines the event shape for each distribution in the batch. This means that if we pass

loc = [ [0., 0.],
        [1., 1.] ]
scale_diag = [1., 0.5]

such that

loc:        2 x 2
scale_diag: 1 x 2
                ^ final dimension is interpreted as event dimension
            ^ other dimensions are interpreted as batch dimensions  

then a batch of two bivariate normal distributions will be created.

loc = [[0., 0.],
       [1., 1.]]
scale_diag = [1., 0.5]
batch_of_mvt_normals = tfd.MultivariateNormalDiag(loc=loc, scale_diag=scale_diag)
batch_of_mvt_normals
<tfp.distributions.MultivariateNormalDiag 'MultivariateNormalDiag' batch_shape=[2] event_shape=[2] dtype=float32>
# There is a batch of two distributions with different means and same covariance

batch_of_mvt_normals.parameters
{'loc': ListWrapper([ListWrapper([0.0, 0.0]), ListWrapper([1.0, 1.0])]),
 'scale_diag': ListWrapper([1.0, 0.5]),
 'scale_identity_multiplier': None,
 'validate_args': False,
 'allow_nan_stats': True,
 'experimental_use_kahan_sum': False,
 'name': 'MultivariateNormalDiag'}

Knowing that, for multivariate distributions, TensorFlow

  • interprets the final axis of an array of parameters as the event shape,
  • and broadcasts over the remaining axes,

can you predict what the batch and event shapes will be if we pass the arguments

loc = [ [ 1.,  1.,  1.],
        [-1., -1., -1.] ] # shape (2, 3)
scale_diag = [ [[0.1, 0.1, 0.1]],
               [[10., 10., 10.]] ] # shape (2, 1, 3)

to MultivariateNormalDiag?


Solution:

Align the parameter array shapes on their last axis, prepending 1s where necessary:

       loc: 1 x 2 x 3  
scale_diag: 2 x 1 x 3  

The final axis has size 3, so event_shape = (3,). The remaining axes are broadcast over to yield

       loc: 2 x 2 x 3  
scale_diag: 2 x 2 x 3  

so batch_shape = (2, 2).

Let's see if this is correct!

loc = [ [ 1.,  1.,  1.],
        [-1., -1., -1.] ]  # shape (2, 3)
scale_diag = [ [[0.1, 0.1, 0.1]],
               [[10., 10., 10.]] ]  # shape (2, 1, 3)
another_batch_of_mvt_normals = tfd.MultivariateNormalDiag(loc=loc, scale_diag=scale_diag)
another_batch_of_mvt_normals
<tfp.distributions.MultivariateNormalDiag 'MultivariateNormalDiag' batch_shape=[2, 2] event_shape=[3] dtype=float32>
another_batch_of_mvt_normals.parameters
{'loc': ListWrapper([ListWrapper([1.0, 1.0, 1.0]), ListWrapper([-1.0, -1.0, -1.0])]),
 'scale_diag': ListWrapper([ListWrapper([ListWrapper([0.1, 0.1, 0.1])]), ListWrapper([ListWrapper([10.0, 10.0, 10.0])])]),
 'scale_identity_multiplier': None,
 'validate_args': False,
 'allow_nan_stats': True,
 'experimental_use_kahan_sum': False,
 'name': 'MultivariateNormalDiag'}

As we did before, let's also look at broadcasting when we have batches of multivariate distributions.

loc = [[0.],
       [1.],
       [0.]]
scale = [1., 10., 100., 1., 10, 100.]
another_batch_of_normals = tfd.Normal(loc=loc, scale=scale)
another_batch_of_normals
<tfp.distributions.Normal 'Normal' batch_shape=[3, 6] event_shape=[] dtype=float32>

And to refresh our memory of Independent, we'll use it below to roll the rightmost batch dimension into the event shape.

another_batch_of_mvt_normals = tfd.Independent(another_batch_of_normals)
another_batch_of_mvt_normals
<tfp.distributions.Independent 'IndependentNormal' batch_shape=[3] event_shape=[6] dtype=float32>
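The call above relies on Independent's default behaviour, which transfers all but the first batch dimension into the event shape; here that is just the rightmost one, so passing reinterpreted_batch_ndims=1 explicitly gives the same result:

tfd.Independent(another_batch_of_normals, reinterpreted_batch_ndims=1)  # batch_shape=[3], event_shape=[6]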

Now, onto the broadcasting:

# Batch_size shaped input (broadcast over event)

sample = tf.random.uniform((3, 1))
another_batch_of_mvt_normals.prob(sample)
<tf.Tensor: shape=(3,), dtype=float32, numpy=array([3.6341732e-09, 3.8412056e-09, 2.6659681e-09], dtype=float32)>
# Event_shape shaped input (broadcast over batch)

sample = tf.random.uniform((1, 6))
another_batch_of_mvt_normals.prob(sample)
<tf.Tensor: shape=(3,), dtype=float32, numpy=array([2.4391147e-09, 2.7616289e-09, 2.4391147e-09], dtype=float32)>
# [Samples,Batch_size,Events] shaped input (broadcast over samples)

sample = tf.random.uniform((2, 3, 6))
another_batch_of_mvt_normals.prob(sample)
<tf.Tensor: shape=(2, 3), dtype=float32, numpy=
array([[2.5010654e-09, 3.3668932e-09, 3.8643830e-09],
       [3.0422742e-09, 2.6408655e-09, 3.9074304e-09]], dtype=float32)>
sample = tf.random.uniform((2, 1, 6))
another_batch_of_mvt_normals.prob(sample)
<tf.Tensor: shape=(2, 3), dtype=float32, numpy=
array([[3.2072505e-09, 2.8448774e-09, 3.2072505e-09],
       [3.9072363e-09, 1.9765971e-09, 3.9072363e-09]], dtype=float32)>

As a final example, the sample's batch and event parts can both be broadcast at once (we could equally well use log_prob here):

# [S,b,e] shaped input where [b,e] can be broadcast against [B,E]

sample = tf.random.uniform((2, 3, 1))
another_batch_of_mvt_normals.prob(sample)
<tf.Tensor: shape=(2, 3), dtype=float32, numpy=
array([[3.5190979e-09, 2.3076914e-09, 3.0966967e-09],
       [1.6677665e-09, 2.7270144e-09, 3.7685051e-09]], dtype=float32)>

You should now feel confident specifying batches of distributions using broadcasting. As you may have already guessed, broadcasting is especially useful when specifying grids of hyperparameters.
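For instance, here is a sketch of a hypothetical grid of Gaussian priors over 4 candidate means and 3 candidate scales, built with broadcasting (the names and values are illustrative):

candidate_locs = np.linspace(-1., 1., 4).reshape(4, 1).astype(np.float32)  # shape (4, 1), hypothetical means
candidate_scales = np.array([0.1, 1., 10.], dtype=np.float32)              # shape (3,), hypothetical scales

prior_grid = tfd.Normal(loc=candidate_locs, scale=candidate_scales)        # batch_shape=[4, 3]: one prior per combination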

If you don't feel entirely comfortable with broadcasting quite yet, don't worry: re-read this notebook, go through the further reading provided below, and experiment with broadcasting in both numpy and TensorFlow; you'll be broadcasting in no time.

Further reading and resources