Advanced Guide#
This guide is intended to give the user an idea of the various options and possibilities available in pocoMC. We will start by explaining in detail how to define a Bayesian inference problem. Then, we will demonstrate how to use pocoMC to solve that problem as effectively and robustly as possible.
Likelihood function#
We will begin by defining our problem. To this end, we need to define the log-likelihood function \(\log\mathcal{L}(\theta)=\log p(d\vert\theta,\mathcal{M})\) and the log-prior probability density function \(\log \pi(\theta) = \log p(\theta\vert \mathcal{M})\). We will start with the former, the likelihood. If you are not familiar with these terms, we encourage you to visit the Background section for some theory.
Suppose that we want our likelihood function to be a Gaussian density with 10 parameters (i.e. in 10-D). We would do something like:
import numpy as np

# Define the dimensionality of our problem.
n_dim = 10

# Define our 10-D correlated multivariate normal log-likelihood.
C = np.identity(n_dim)
C[C == 0] = 0.95
Cinv = np.linalg.inv(C)
lnorm = -0.5 * (np.log(2 * np.pi) * n_dim + np.log(np.linalg.det(C)))

def log_like(x):
    return -0.5 * np.dot(x, np.dot(Cinv, x)) + lnorm
The inclusion of the normalisation factor lnorm is not strictly necessary, as it does not depend on x and thus does not vary between likelihood evaluations.
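As we will see in the Sampler options section below, pocoMC can also accept a vectorised log-likelihood that evaluates a whole batch of points at once. As a minimal sketch (assuming the same C, Cinv, and lnorm as above; the name log_like_vec is ours), the same Gaussian density can be vectorised with einsum:

def log_like_vec(x):
    # x has shape (n_points, n_dim); return one log-likelihood per row.
    return -0.5 * np.einsum('ij,jk,ik->i', x, Cinv, x) + lnorm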
Prior probability distribution#
Standard priors#
The next step is to define the prior probability distribution. This encodes our knowledge about the parameters of the model before we have seen any data.
pocoMC offers two ways to define a prior. The first is to utilise ready-made priors from the scipy.stats package. For instance, if we want our prior to be a uniform distribution on the interval \([-10,10]\) for all 10 of the parameters, we would do:
import pocomc as pc
from scipy.stats import uniform

prior = pc.Prior(n_dim * [uniform(loc=-10.0, scale=20.0)]) # Uniform prior on [-10,10] for all 10 parameters.
Suppose now that we want a different prior for each parameter. For instance, we want the first five parameters to have a flat/uniform prior \(x_{i}\sim\mathcal{U}(-10,10)\) for \(i=0,1,\dots,4\) and the last five to have a Gaussian/normal prior with mean \(\mu=0\) and standard deviation \(\sigma=3\), i.e. \(x_{i}\sim\mathcal{N}(0,3^{2})\) for \(i=5,6,\dots,9\). We would do:
from scipy.stats import uniform, norm

prior = pc.Prior([uniform(loc=-10.0, scale=20.0), # Uniform prior on [-10,10] for the first parameter.
                  uniform(loc=-10.0, scale=20.0), # Uniform prior on [-10,10] for the second parameter.
                  uniform(loc=-10.0, scale=20.0), # Uniform prior on [-10,10] for the third parameter.
                  uniform(loc=-10.0, scale=20.0), # Uniform prior on [-10,10] for the fourth parameter.
                  uniform(loc=-10.0, scale=20.0), # Uniform prior on [-10,10] for the fifth parameter.
                  norm(loc=0.0, scale=3.0), # Normal prior with mean=0 and std=3 for the sixth parameter.
                  norm(loc=0.0, scale=3.0), # Normal prior with mean=0 and std=3 for the seventh parameter.
                  norm(loc=0.0, scale=3.0), # Normal prior with mean=0 and std=3 for the eighth parameter.
                  norm(loc=0.0, scale=3.0), # Normal prior with mean=0 and std=3 for the ninth parameter.
                  norm(loc=0.0, scale=3.0), # Normal prior with mean=0 and std=3 for the tenth parameter.
                  ])
or simply:
from scipy.stats import uniform, norm
prior = pc.Prior([uniform(loc=-10.0, scale=20.0)] * 5 + [norm(loc=0.0, scale=3.0)] * 5)
One is free to use any of the priors available in the scipy.stats package. For a full list see the scipy.stats documentation.
Custom priors#
The second way to define a prior is to define a class that includes the logpdf and rvs methods and the dim and bounds attributes. This can be useful when the prior has some conditional/hierarchical structure.
As an example, let us assume we have a three-parameter model where the prior for the third parameter depends on the values of the first two. This might be the case in, e.g., a hierarchical model where the prior over c is a normal distribution whose mean m and standard deviation s are determined by a corresponding “hyper-prior”. We can easily set up a prior for this model by just going through the variables in order. This would look like:
import numpy as np
from scipy.stats import norm, uniform

class CustomPrior:
    def __init__(self):
        self.dim = 3
        self.bounds = np.array([[-np.inf, np.inf],
                                [0.0, 10.0],
                                [-np.inf, np.inf]])
        self.hyper_mean = 0.0
        self.hyper_std = 3.0

    def logpdf(self, x):
        m, s, c = x
        # Hyper-prior terms for m and s, plus the conditional prior for c,
        # so that logpdf describes the same joint distribution that rvs samples.
        return (norm.logpdf(m, loc=self.hyper_mean, scale=self.hyper_std)
                + uniform.logpdf(s, loc=0.0, scale=10.0)
                + norm.logpdf(c, loc=m, scale=s))

    def rvs(self, size=1):
        m = np.random.normal(loc=self.hyper_mean, scale=self.hyper_std, size=size)
        s = np.random.uniform(low=0.0, high=10.0, size=size)
        c = np.random.normal(loc=m, scale=s, size=size)
        return np.array([m, s, c]).T

prior = CustomPrior()
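As a quick sanity check of the class above, we can draw a few prior samples and evaluate the log-density of one of them (the shapes follow directly from the rvs definition above):

samples = prior.rvs(size=5)       # array of shape (5, 3); columns are m, s, c
print(prior.logpdf(samples[0]))   # log prior density of a single point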
Sampler options#
Having defined the Bayesian components of the problem (e.g. likelihood, prior, etc.) we can now turn our attention to configuring pocoMC to solve this inference problem.
The next step is to import pocoMC and initialise the Sampler class:
import pocomc as pc

sampler = pc.Sampler(prior=prior,
                     likelihood=log_like,
                     )
The sampler also accepts a number of other arguments; for a full list see the API Reference. Those include:

- An optional n_dim argument that specifies the dimensionality of the problem. If not provided, the sampler will try to infer it from the prior.
- An optional n_ess argument that specifies the number of effectively independent particles to use. If not provided, 1000 is used by default. This determines the speed and robustness of the sampling procedure. Generally, higher values of n_ess are better, but they also require more computational resources. Higher-dimensional problems or problems with complex posteriors may require higher values of n_ess.
- An optional n_active argument that specifies the number of active particles to use. If not provided, 250 is used by default. Generally, n_active should be less than n_ess // 2. The active particles are the ones (out of the total n_ess particles) that are updated in each iteration. Therefore, one can think of n_active as the batch size used for the sampling procedure.
- Additional arguments passed to the log-likelihood using likelihood_args and likelihood_kwargs. For instance, if the log-likelihood function accepts an extra argument data we would do something like:

  sampler = pc.Sampler(prior=prior,
                       likelihood=log_like,
                       likelihood_args=[data],
                       )

- The vectorize argument accepts a boolean value (i.e., True or False) and allows the user to use a vectorised log-likelihood function. This can be useful when the vectorised log-likelihood function is cheaper to evaluate or a parallel version of the log-likelihood function is available. For instance, if we have a vectorised log-likelihood function log_like_vec we would do something like:

  sampler = pc.Sampler(prior=prior,
                       likelihood=log_like_vec,
                       vectorize=True,
                       )

- The pool argument allows the user to specify a pool object to use for parallelisation. For more details see the Parallelisation section below.
- The flow argument allows the user to specify the normalising flow to use. For more details see the Normalizing Flow section below. Generally, the default normalising flow is a good choice for most problems.
- The train_config argument allows the user to specify the training configuration of the normalising flow. For more details see the Normalizing Flow section below. Generally, the default training configuration is a good choice for most problems.
- The precondition argument allows the user to specify whether to use the preconditioning of the normalising flow. For more details see the Normalizing Flow section below. Generally, using normalising-flow preconditioning is a good choice for most problems, as the latent space is usually less correlated than the parameter space.
- The n_prior argument allows the user to specify the number of prior samples to draw at the beginning of the run. By default n_prior=2*(n_ess//n_active)*n_active is selected. This is useful for the initialisation of the sampler and we do not recommend changing it.
- The sample argument determines the MCMC sampling method to use. By default sample='pcn' is selected and the preconditioned Crank-Nicolson (PCN) method is used. The alternative is to use sample='rwm', the standard Random-walk Metropolis method.
- The max_steps argument determines the maximum number of MCMC steps to use per iteration. By default max_steps=5*n_dim is selected. This is the maximum number of steps that the MCMC sampler will take in order to update the active particles; the actual number of steps is determined adaptively by the sampler. The default value of max_steps is usually a good choice for most problems.
- The patience argument determines the maximum number of MCMC steps after which the sampler will stop updating the active particles if the average logP has not increased. By default this parameter is determined automatically by the sampler and depends on the dimensionality of the problem, the acceptance rate, and the proposal scale. The default value of patience is usually a good choice for most problems and we do not recommend changing it.
- The ess_threshold argument defines the minimum effective sample size (ESS) an iteration must have for the sampler to consider using particles from that iteration to update the active particles. By default ess_threshold=4*n_dim is selected. This means that if the ESS of an iteration is less than 4*n_dim then the sampler will not use any particles from that iteration during resampling. The default value of ess_threshold is usually a good choice for most problems and we do not recommend changing it.
- The output_dir argument allows the user to specify the directory in which the output files will be saved. By default output_dir=None is selected and the output files are saved in the states directory.
- The output_label argument is the label used in state files. Default is None, which corresponds to "pmc". The saved states are named as "{output_dir}/{output_label}_{i}.state" where i is the iteration index. Output files can be used to resume a run.
- The random_state argument allows the user to specify the random state of the sampler (i.e., any integer value). By default random_state=None is selected and the random state is not fixed. This can be useful for debugging purposes.
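To make these options concrete, here is a minimal sketch that combines several of them, using the 10-D Gaussian likelihood and prior defined earlier. The specific values are illustrative assumptions, not recommendations:

sampler = pc.Sampler(prior=prior,
                     likelihood=log_like,
                     n_ess=2000,              # more effective particles for extra robustness
                     n_active=500,            # kept below n_ess // 2, as advised above
                     sample='pcn',            # preconditioned Crank-Nicolson MCMC (the default)
                     output_dir="my_run",     # hypothetical output directory
                     output_label="gaussian", # hypothetical state-file label
                     random_state=42,         # fix the random state for reproducibility
                     )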
Running the sampler#
Running the actual sampling procedure, which will produce, among other things, a collection of samples from the posterior as well as an unbiased estimate of the model evidence, can be done using the run method of the sampler:
sampler.run()
Running the above also produces a progress bar similar to the one shown below:
Iter: 64it [03:50, 3.59s/it, calls=77500, beta=1, logZ=-21.5, ESS=5e+3, acc=0.704, steps=3, logP=-25.2, eff=1]
The calls field shows the total number of log-likelihood calls. The beta field shows the current value of the inverse temperature. The logZ field shows the current estimate of the logarithm of the model evidence. The ESS field shows the current estimate of the effective sample size. The acc field shows the current acceptance rate. The steps field shows the current number of MCMC steps per iteration. The logP field shows the current average log-posterior. The eff field shows the current efficiency of the sampling procedure. Generally, as long as the acceptance rate remains above 0.15 and the efficiency remains above 0.2, the sampling procedure is working well. Low values of the efficiency can indicate that the normalizing flow is struggling to approximate the posterior. In such cases, one can try to use a more powerful normalizing flow (e.g. more layers or more hidden units per layer) or a different training configuration (e.g. more epochs or a smaller learning rate).
We can also use the run method to specify the desired number of total effective samples and the number of evidence samples to use for the estimate of the model evidence. For instance, if we want to use 5000 total effective samples and 5000 evidence samples we would do something like:
sampler.run(n_total=5000,
            n_evidence=5000,
            progress=True,
            )
The progress argument allows the user to specify whether to show a progress bar or not. By default progress=True is selected and a progress bar is shown. If progress=False is selected then no progress bar is shown. This can be useful when running the sampler on a computing cluster.
Results#
Once the run is complete we can look at the results. This can be done in two ways. The first is to use the posterior and evidence methods of the sampler. For instance, if we want to get the samples from the posterior we would do:
samples, weights, logl, logp = sampler.posterior()
Here samples is an array with the samples from the posterior, weights is an array with the weights of those samples, logl is an array with the values of the log-likelihood for the posterior samples, and logp is an array with the values of the log-prior for the posterior samples.
If we want to get samples from the posterior without the weights we would do:
samples, logl, logp = sampler.posterior(resample=True)
This resamples the particles and is useful when we want to use the samples for parameter inference and we do not want to deal with the weights.
The samples from the posterior can be used for parameter inference. For instance, we can get the mean and standard deviation of the posterior for each parameter by doing:
mean = np.mean(samples, axis=0)
std = np.std(samples, axis=0)
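If we instead keep the weighted samples returned by sampler.posterior() (a minimal sketch, with samples and weights as defined above and numpy already imported as np), the same summaries can be computed with numpy's weighted average:

mean = np.average(samples, weights=weights, axis=0)
var = np.average((samples - mean) ** 2, weights=weights, axis=0)
std = np.sqrt(var)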
Or, we can utilize a third-party package such as corner (available here) to plot the posterior samples:
import corner
fig = corner.corner(samples, labels=[f"$x_{i}$" for i in range(n_dim)]); # If we do not want to use the weights.
# fig = corner.corner(samples, weights=weights, labels=[f"$x_{i}$" for i in range(n_dim)]); # If we want to use the weights.
Similarly, we can also get the estimate of the model evidence / marginal likelihood by doing:
logZ, logZerr = sampler.evidence()
Here logZ is the estimate of the logarithm of the model evidence and logZerr is the error on that estimate. The error is estimated using the bootstrap method.
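Evidence estimates are mainly useful for model comparison. As a hedged sketch (assuming a hypothetical second sampler, sampler2, was run for an alternative model, and numpy is imported as np), one could compute a log-Bayes factor as:

logZ_1, logZ_1_err = sampler.evidence()
logZ_2, logZ_2_err = sampler2.evidence()  # hypothetical run for an alternative model

# Log-Bayes factor of model 1 over model 2, with errors added in quadrature.
log_B12 = logZ_1 - logZ_2
log_B12_err = np.sqrt(logZ_1_err**2 + logZ_2_err**2)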
An alternative, and more advanced, way to look at the results is to use the results dictionary of the sampler, as follows:
results = sampler.results
This is a dictionary that includes the following keys: u, x, logdetj, logl, logp, logw, iter, logz, calls, steps, efficiency, ess, accept, beta.
The u key is an array with the samples in the latent space. The x key is an array with the samples in the parameter space. The logdetj key is an array with the values of the log-determinant of the Jacobian of the normalizing flow for each sample. The logl key is an array with the values of the log-likelihood for each sample. The logp key is an array with the values of the log-prior for each sample. The logw key is an array with the values of the log-importance weights for each sample. The iter key is an array with the iteration index for each sample. The logz key is an array with the values of the logarithm of the model evidence for each iteration. The calls key is an array with the total number of log-likelihood calls for each iteration. The steps key is an array with the number of MCMC steps per iteration. The efficiency key is an array with the efficiency of the sampling procedure for each iteration. The ess key is an array with the effective sample size for each iteration. The accept key is an array with the acceptance rate for each iteration. The beta key is an array with the value of the inverse temperature for each iteration.
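As a minimal sketch of how these arrays might be used (assuming matplotlib is installed; the plotting choices are ours), one can, for instance, inspect how the evidence estimate accumulates as the inverse temperature increases:

import matplotlib.pyplot as plt

results = sampler.results

# Evidence estimate as a function of inverse temperature, one point per iteration.
plt.plot(results["beta"], results["logz"], marker="o")
plt.xlabel("beta")
plt.ylabel("logZ")
plt.show()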
Normalizing Flow#
The default normalizing flow used by pocoMC is a Masked Autoregressive Flow (MAF) with 6 blocks of 3 layers each. Each layer has 64 hidden units with a residual connection and uses a relu activation function. Both the normalizing flow and the training configuration can be changed by the user. For instance, if we want to use a MAF with 12 blocks of 3 layers each and 128 hidden units per layer we would do:
import zuko

flow = zuko.flows.MAF(n_dim,
                      transforms=12,
                      hidden_features=[128] * 3,
                      residual=True,
                      )

sampler = pc.Sampler(prior=prior,
                     likelihood=log_like,
                     flow=flow,
                     )
Any normalizing flow provided by the zuko package can be used. For a full list see the zuko documentation.
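For example, a Neural Spline Flow (NSF) from the same package could be used instead of a MAF. This is a sketch under the assumption that the zuko.flows.NSF defaults are acceptable for the problem at hand:

import zuko

flow = zuko.flows.NSF(n_dim,
                      transforms=6,
                      hidden_features=[64] * 3,
                      )

sampler = pc.Sampler(prior=prior,
                     likelihood=log_like,
                     flow=flow,
                     )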
Additionally, the training configuration of the normalizing flow can also be changed as follows:
sampler = pc.Sampler(prior=prior,
                     likelihood=log_like,
                     flow=flow,
                     train_config=dict(validation_split=0.5,
                                       epochs=2000,
                                       batch_size=512,
                                       patience=50,
                                       learning_rate=1e-3,
                                       annealing=False,
                                       gaussian_scale=None,
                                       laplace_scale=None,
                                       noise=None,
                                       shuffle=True,
                                       clip_grad_norm=1.0,
                                       verbose=0,
                                       )
                     )
The above is the default training configuration and is a good choice for most problems. The validation_split argument determines the fraction of the training data to use as validation data. The epochs argument determines the maximum number of epochs to use for training. The batch_size argument determines the batch size to use for training. The patience argument determines the number of epochs to wait before early stopping if the validation loss does not improve. The learning_rate argument determines the learning rate to use for training. The annealing argument determines whether to use learning rate annealing or not. The gaussian_scale argument determines the scale of the Gaussian prior applied to the weights of the normalizing flow. The laplace_scale argument determines the scale of the Laplace prior applied to the weights of the normalizing flow. The noise argument determines the standard deviation of the Gaussian noise to add to the input of the normalizing flow. The shuffle argument determines whether to shuffle the training data or not. The clip_grad_norm argument determines the maximum norm of the gradients to use for training. The verbose argument determines whether to print the training progress or not.
Parallelisation#
If you want to run computations in parallel, pocoMC can use a user-defined pool to execute a variety of expensive operations in parallel rather than in serial. This can be done by passing the pool object to the sampler upon initialisation:
sampler = pc.Sampler(prior=prior,
                     likelihood=log_like,
                     pool=pool,
                     )

sampler.run()
By default pocoMC will use the pool to execute the calculation of the log_likelihood in parallel for the particles.
Commonly used pools are offered by standard Python in the multiprocessing package and by the multiprocess package. The benefit of the latter is that it uses dill to perform the serialisation, so it can work with a greater variety of log-likelihood functions. The disadvantage is that it needs to be installed manually. An example of how to use such a pool is the following:
from multiprocessing import Pool

n_cpus = 4

with Pool(n_cpus) as pool:
    sampler = pc.Sampler(prior=prior,
                         likelihood=log_like,
                         pool=pool,
                         )
    sampler.run()
where n_cpus is the number of available CPUs on our machine. Since numpy and torch perform some internal parallelisation of their own, it is a good idea to specify how many CPUs they should use by setting:
import os
os.environ["OMP_NUM_THREADS"] = "1"
at the beginning of the script. This can affect the speed of the normalising flow training.
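Along the same lines, torch's own intra-op thread count can be limited directly (a brief aside; torch.set_num_threads is part of the standard PyTorch API):

import torch

# Restrict PyTorch to a single intra-op thread.
torch.set_num_threads(1)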
Finally, other pools can also be used. In particular, if you plan to use pocoMC on a supercomputing cluster, you may want to use an mpi4py pool so that you can utilise multiple nodes.
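As a sketch of what that might look like (assuming mpi4py is installed and the script is launched with mpiexec; whether its futures-style pool suits your setup depends on the cluster):

from mpi4py.futures import MPIPoolExecutor

with MPIPoolExecutor() as pool:
    sampler = pc.Sampler(prior=prior,
                         likelihood=log_like,
                         pool=pool,
                         )
    sampler.run()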
The speed-up offered by parallelisation in pocoMC is expected to be linear in the number of particles.
Saving and resuming runs#
A useful option, especially for long runs, is to be able to store the state of pocoMC in a file and then use that file to continue the same run at a later time. This can help avoid disastrous situations in which a run is interrupted or terminated prematurely (e.g. due to time limits on computing clusters or possible crashes). Fortunately, pocoMC offers options both to save and to load a previous state of the sampler.
Saving the state of the sampler#
In order to save the state of the sampler during the run, one has to specify how often to save the state to a file. This is done using the save_every argument of the run method. The default is save_every=None, which means that no state is saved during the run. If instead we want to store the state of pocoMC every e.g. 3 iterations, we would do something like:
sampler.run(
    save_every=3,
)
The default directory in which the state files are saved is a folder named states in the current directory. One can change this using the output_dir argument when initialising the sampler (e.g. output_dir="new_run"). By default, the state files follow the naming convention pmc_{i}.state where i is the iteration index. For instance, if save_every=3 was specified then the output_dir directory will include the files pmc_3.state, pmc_6.state, etc. One can also change the label from pmc to anything else by using the output_label argument when initialising the sampler (e.g. output_label="grav_waves").
Loading the state of the sampler#
Loading a previous state of the sampler and resuming the run from that point requires providing the path of the specific state file to the run method using the resume_state_path argument. For instance, if we want to continue the run from pmc_3.state, which is in the states directory, we would do:
sampler.run(
    resume_state_path="states/pmc_3.state",
)