Archive for neural density estimator

statistical accuracy of neural posterior and likelihood estimation

Posted in pictures, Running, Statistics, Travel, University life on March 17, 2025 by xi'an

I have been meaning to mention this news for quite a while: David Frazier, Ryan Kelly, Christopher Drovandi, and David Warne arXived last November a paper that parallels our paper (with David and Gael) on ABC consistency, as well as some earlier papers of theirs on synthetic likelihood, now in the case of neural posterior approximations. The conditions are similar (see, e.g., Assumptions 1 and 2), with a potentially reduced computational cost in some situations.

“NLE requires additional MCMC steps to produce a posterior approximation, whereas NPE produces a posterior approximation directly and does not require any additional sampling”

Convergence is achieved when the neural learning size grows fast enough with the sample size, and when the tolerance decreases fast enough with respect to the convergence rate of the summary statistic. Two options are possible: either approximating the likelihood and then exploiting this approximation in an MCMC algorithm, or directly approximating the posterior distribution as a function of the summary statistic Sn (rather than for the observed S⁰n), with arguments favouring the second option.

“if the intractable posterior Π(· | Sn) is asymptotically Gaussian and calibrated, then so long as νnγN = o(1), the NPE is also asymptotically Gaussian and calibrated”

where γN denotes the rate at which the neural approximation of the posterior converges to the ideal posterior (for the Kullback-Leibler divergence) in N, the size of the learning sample, and νn is the rate of convergence of the statistic Sn to its asymptotic mean. The convergence result does not make explicit assumptions on the class of neural posteriors, but it requires that the observed statistic fit within the range of the simulated values (a possibility illustrated in the paper with an MA(2) model that was already used in several of our papers, as I noticed when giving an ABC masterclass in Warwick this very week).

“While neural methods and normalizing flows are common choices for the approximating class Q, the diversity of such methods, along with their complicated tuning and training regimes, makes establishing theoretical results on the rate of convergence, γN difficult”

Under stronger and hard-to-check assumptions, namely on the minimaxity of the posterior density estimator within the class of locally β-Hölder functions, they recover a closed-form γN, which reveals how N should be chosen (with a surprising addition of the dimensions of the parameter θ and of the summary Sn), with a resulting explosion in the theoretical minimal value of N one should use. (And decent performances of the method with smaller values of N!) Concerning minimaxity, I have no intuition as to how this impacts the sparseness (or lack thereof) of the neural networks that can be used.
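To get a feel for the explosion in N, here is a minimal sketch under assumptions that are not taken from the paper: the textbook minimax rate γN = N^(-β/(2β+d)) with d the sum of the dimensions of θ and of Sn (log factors ignored), and νn = √n. Under these assumptions, νnγN = o(1) requires N to grow faster than n^((2β+d)/(2β)). The function name and the values in the usage loop are purely illustrative.

```python
import math


def training_size_threshold(n, beta, d_theta, d_s):
    """Order of magnitude that N must exceed so that nu_n * gamma_N -> 0.

    Assumptions (NOT quoted from the paper): nu_n = sqrt(n) and
    gamma_N = N ** (-beta / (2 * beta + d)), with d = d_theta + d_s.
    Solving sqrt(n) * N ** (-beta / (2 * beta + d)) = 1 for N gives
    N = n ** ((2 * beta + d) / (2 * beta)).
    """
    d = d_theta + d_s
    return n ** ((2.0 * beta + d) / (2.0 * beta))


if __name__ == "__main__":
    # Illustrative values only: smoothness beta = 2, sample size n = 100.
    for d_theta, d_s in [(1, 1), (2, 4), (5, 10)]:
        threshold = training_size_threshold(100, 2.0, d_theta, d_s)
        print(f"d_theta={d_theta}, d_s={d_s}: N must exceed about {threshold:.3g}")
```

The exponent grows linearly in the total dimension, which is one way to see why every superfluous summary statistic is costly under this rate.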

I am wondering about strategies to remove superfluous statistics, since their dimension matters so much, and about detecting or evaluating the misspecification (or its complement, the compatibility, as discussed on page 31). But all in all this paper represents a massive addition to the consistency results for approximate Bayesian inference methods!

learning optimal summary statistics

Posted in Books, pictures, Statistics on July 27, 2022 by xi'an

“Despite the pursuit of the holy grail of sufficient statistics, most applications will have to settle for the weakest concept of optimal statistics.”

Quiz #1: How does Bayes sufficiency [which preserves the posterior density] differ from sufficiency [which preserves the likelihood function]?

Quiz #2: How does Fisher-information sufficiency [which preserves the information matrix] differ from standard sufficiency [which preserves the likelihood function]?

Read a recent arXival by Till Hoffmann and Jukka-Pekka Onnela that I frankly found most puzzling… Maybe due to the Norman train where I was traveling being particularly noisy.

The argument in the paper is to find a summary statistic that minimises the [empirical] expected posterior entropy, which equivalently means minimising the expected Kullback-Leibler distance to the full posterior.  And maximizing the mutual information between parameters θ and summaries t(.). And maximizing the expected surprise. Which obviously requires breaking the sample into iid components and hence considering the gain brought by a specific transform of a single observation. The paper also contains a long comparison with other criteria for choosing summaries.
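The equivalences listed above follow from two standard identities, written here in notation that is an assumption of this note rather than the paper's: H{·} for the differential entropy, I(·;·) for the mutual information, and expectations taken under the prior predictive distribution of x.

```latex
\begin{align*}
\mathbb{E}_{x}\big[\mathcal{H}\{p(\theta\mid t(x))\}\big]
  &= \mathcal{H}\{p(\theta)\} - I\big(\theta; t(x)\big), \\
\mathbb{E}_{x}\big[\mathrm{KL}\{p(\theta\mid x)\,\|\,p(\theta\mid t(x))\}\big]
  &= \mathbb{E}_{x}\big[\mathcal{H}\{p(\theta\mid t(x))\}\big]
   - \mathbb{E}_{x}\big[\mathcal{H}\{p(\theta\mid x)\}\big].
\end{align*}
```

Since neither the prior entropy nor the expected entropy of the full posterior depends on t, minimising the expected posterior entropy over t is the same as maximising the mutual information and as minimising the expected Kullback-Leibler divergence.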

“Minimizing the posterior entropy would discard the sufficient statistic t such that the posterior is equal to the prior–we have not learned anything from the data.”

Furthermore, the expected aspect of the criterion takes us away from a proper Bayes analysis (and exhibits artifacts such as the one above), which somehow makes me question the relevance of comparing entropies under different distributions. It took me a long while to realise that the collection of summaries was set by the user and quite limited. Like a neural network representation of the posterior mean. And the intractable posterior is further approximated by a closed-form function of the parameter θ and of the summary t(.), using there a neural density estimator. Or a mixture density network.

GANs as density estimators

Posted in Books, Statistics on October 15, 2021 by xi'an

I recently read an arXival entitled Conditional Sampling With Monotone GAN by Kovachki et al., who construct a mapping T that transforms or pushes forward a reference measure þ() like a multivariate Normal distribution to a target conditional distribution ð(dθ|x). Which makes the proposal a type of normalising flow, except it does not require a Jacobian derivation… The mapping T is monotone and block triangular in order to be invertible. It is learned from data by minimising a functional divergence between Tþ(dθ) and ð(dθ|x), for instance GAN least square or GAN Wasserstein penalties, and by representing T as a neural network, where monotonicity is imposed by a Lagrangian. The authors “note that global minimizers of [their GAN criterion] can also be used for conditional density estimation” but I fail to understand the distinction, in that once T is constructed, the estimated conditional density is automatically available. However, my main source of puzzlement is the worth of this construction, since it does not provide an exact generative process for the conditional distribution, while requiring many generations from the joint distribution. Rather than a comparison with MCMC, which is not applicable in intractable generative models, a comparison with less expensive ABC solutions would have been appropriate, I think. And the paper is missing any quantification of the quality or asymptotics of the density estimate provided by this involved approximation, as is most of the recent literature on normalising flows and friends. (A point acknowledged by the authors in the supplementary material section.)
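To illustrate what a monotone block-triangular push-forward looks like, here is a minimal sketch on a toy case where the map is known in closed form, so nothing is learned and no GAN is involved. It assumes a standard bivariate Normal pair (x, θ) with a hypothetical correlation RHO, for which θ given x is Normal with mean RHO·x and variance 1−RHO². All names are illustrative and not from the paper.

```python
import math
import random
import statistics

RHO = 0.8  # hypothetical correlation of the toy joint distribution


def block_triangular_map(x, z):
    """Toy map T(x, z) = (x, RHO * x + sqrt(1 - RHO**2) * z).

    The first block is the identity in x; the second block is strictly
    increasing in the reference draw z, hence invertible for each x.
    """
    return x, RHO * x + math.sqrt(1.0 - RHO ** 2) * z


def sample_conditional(x, n, seed=0):
    """Push n standard Normal reference draws through T for a fixed x."""
    rng = random.Random(seed)
    return [block_triangular_map(x, rng.gauss(0.0, 1.0))[1] for _ in range(n)]


if __name__ == "__main__":
    draws = sample_conditional(1.0, 20000)
    print("empirical mean:", statistics.fmean(draws))
    print("empirical variance:", statistics.pvariance(draws))
```

In the paper the second block is a neural network fitted by a GAN criterion from joint simulations; the closed-form map above only shows why fixing x and pushing reference draws through T produces conditional samples.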

“In this regard, the MGANs approach introduced in the article belongs to the category of sampling techniques such as MCMC, whose goal is to generate independent samples from the law of y|x, as opposed to assuming some structural form of the probability measure directly.”

I am unsure I understand the above remark as MCMC methods are intrinsically linked with the exact probability distribution, exploiting either some conditional representations as in Gibbs or at the very least the ability to compute the joint density…

 

sequential neural likelihood estimation as ABC substitute

Posted in Books, Kids, Statistics, University life on May 14, 2020 by xi'an

A JMLR paper by Papamakarios, Sterratt, and Murray (Edinburgh), first presented at the AISTATS 2019 meeting, on a new form of likelihood-free inference, away from non-zero tolerance and from the distance-based versions of ABC, following earlier papers by Iain Murray and co-authors in the same spirit. Which I got pointed to during the ABC workshop in Vancouver. At the time I had no idea as to what autoregressive flows meant. We were supposed to hold a reading group in Paris-Dauphine on this paper last week, unfortunately cancelled as a coronaviral precaution… Here are some notes I had prepared for the meeting that did not take place.

“A simulator model is a computer program, which takes a vector of parameters θ, makes internal calls to a random number generator, and outputs a data vector x.”

Just the usual generative model then.

“A conditional neural density estimator is a parametric model q(.|φ) (such as a neural network) controlled by a set of parameters φ, which takes a pair of datapoints (u,v) and outputs a conditional probability density q(u|v,φ).”

Less usual, in that the outcome is guaranteed to be a probability density.

“For its neural density estimator, SNPE uses a Mixture Density Network, which is a feed-forward neural network that takes x as input and outputs the parameters of a Gaussian mixture over θ.”

In which theoretical sense would it improve upon classical or Bayesian density estimators? Where are the error evaluation, the optimal rates, the sensitivity to the dimension of the data? of the parameter?
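For concreteness, the mixture density output described in the quote can be sketched as follows. This is a minimal illustration, not the SNPE implementation: `toy_network` is a hypothetical stand-in for a trained feed-forward network, with made-up formulas for the mixture parameters, and θ is taken to be scalar.

```python
import math


def softmax(logits):
    """Turn unnormalised scores into mixture weights summing to one."""
    m = max(logits)
    w = [math.exp(v - m) for v in logits]
    total = sum(w)
    return [wi / total for wi in w]


def toy_network(x):
    """Hypothetical stand-in for a trained network: x -> mixture parameters.

    Returns (logits, means, log standard deviations) for two components.
    """
    logits = [0.0, 0.5 * x]
    means = [x, -x]
    log_sds = [0.0, math.log(0.5)]
    return logits, means, log_sds


def mdn_density(theta, x):
    """Evaluate the Gaussian mixture density q(theta | x)."""
    logits, means, log_sds = toy_network(x)
    density = 0.0
    for w, m, log_sd in zip(softmax(logits), means, log_sds):
        sd = math.exp(log_sd)
        density += w * math.exp(-0.5 * ((theta - m) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))
    return density


if __name__ == "__main__":
    print(mdn_density(0.3, 1.0))
```

Whatever the network returns, the softmax weights and exponentiated scales guarantee a proper density in θ, which is the sense in which the outcome is guaranteed to be a probability density.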

“Our new method, Sequential Neural Likelihood (SNL), avoids the bias introduced by the proposal, by opting to learn a model of the likelihood instead of the posterior.”

I do not get the argument in that the final outcome (of using the approximation within an MCMC scheme) remains biased since the likelihood is not the exact likelihood. Where is the error evaluation? Note that in the associated Algorithm 1, the learning set is enlarged on each round, as in AMIS, rather than set back to the empty set ∅ on each round.
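The growing learning set can be made explicit with a minimal sketch of the sequential loop. It is written under strong simplifying assumptions that are not the paper's: a toy simulator x ~ N(θ, 1), a linear-Gaussian surrogate q(x|θ) = N(a + bθ, s²) fitted by least squares in place of the masked autoregressive flow, and a random-walk Metropolis sampler on the surrogate posterior. All function names are hypothetical.

```python
import math
import random


def simulator(theta, rng):
    """Toy simulator (assumption): x ~ N(theta, 1)."""
    return theta + rng.gauss(0.0, 1.0)


def fit_surrogate(pairs):
    """Fit q(x | theta) = N(a + b * theta, s2) by maximum likelihood."""
    n = len(pairs)
    mean_t = sum(t for t, _ in pairs) / n
    mean_x = sum(x for _, x in pairs) / n
    s_tt = sum((t - mean_t) ** 2 for t, _ in pairs)
    s_tx = sum((t - mean_t) * (x - mean_x) for t, x in pairs)
    b = s_tx / s_tt
    a = mean_x - b * mean_t
    s2 = sum((x - a - b * t) ** 2 for t, x in pairs) / n
    return a, b, s2


def log_target(theta, x_obs, params, prior_sd):
    """Log surrogate likelihood at x_obs plus log N(0, prior_sd^2) prior."""
    a, b, s2 = params
    log_lik = -0.5 * math.log(2.0 * math.pi * s2) - 0.5 * (x_obs - a - b * theta) ** 2 / s2
    return log_lik - 0.5 * (theta / prior_sd) ** 2


def metropolis(x_obs, params, prior_sd, n_samples, rng, step=1.0):
    """Random-walk Metropolis on the surrogate posterior."""
    theta = 0.0
    current = log_target(theta, x_obs, params, prior_sd)
    chain = []
    for _ in range(n_samples):
        proposal = theta + rng.gauss(0.0, step)
        candidate = log_target(proposal, x_obs, params, prior_sd)
        if rng.random() < math.exp(min(0.0, candidate - current)):
            theta, current = proposal, candidate
        chain.append(theta)
    return chain


def snl_sketch(x_obs, rounds=5, n_per_round=200, prior_sd=3.0, seed=0):
    """Sequential loop: the learning set is enlarged, never reset."""
    rng = random.Random(seed)
    data = []
    proposals = [rng.gauss(0.0, prior_sd) for _ in range(n_per_round)]  # round 1: prior
    params = None
    for _ in range(rounds):
        data.extend((t, simulator(t, rng)) for t in proposals)  # enlarge, as in AMIS
        params = fit_surrogate(data)  # refit on ALL past simulations
        proposals = metropolis(x_obs, params, prior_sd, n_per_round, rng)
    return params, proposals, len(data)


if __name__ == "__main__":
    params, proposals, n_sims = snl_sketch(x_obs=1.5)
    print("surrogate (a, b, s2):", params, "simulations used:", n_sims)
```

Even in this toy case the MCMC output targets the surrogate posterior rather than the exact one, so any error in (a, b, s²) carries over to the final sample, which is the point of the objection above.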

“…given enough simulations, a sufficiently flexible conditional neural density estimator will eventually approximate the likelihood in the support of the proposal, regardless of the shape of the proposal. In other words, as long as we do not exclude parts of the parameter space, the way we propose parameters does not bias learning the likelihood asymptotically. Unlike when learning the posterior, no adjustment is necessary to account for our proposing strategy.”

This is a rather vague statement, with the only support being that the Monte Carlo approximation to the Kullback-Leibler divergence does converge to its actual value, i.e. a direct application of the Law of Large Numbers! But an interesting point, which I informally made a (long) while ago, is that all that matters is the estimate of the density at x⁰, or at the value of the statistic at x⁰. The masked auto-encoder density estimator is based on a sequence of bijections with a lower-triangular Jacobian matrix, meaning the conditional density estimate is available in closed form. Which makes it sound like a form of neurotic variational Bayes solution.
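Why a lower-triangular Jacobian delivers a closed-form density can be seen in a minimal sketch of a single affine autoregressive layer. The `conditioner` below is a hypothetical hand-written stand-in for the masked network, chosen only so that the example runs; the change-of-variables computation is the generic one for such a layer.

```python
import math


def conditioner(x_prev):
    """Hypothetical stand-in for the masked network.

    Returns (mu, alpha) for the current coordinate from the preceding ones.
    """
    if not x_prev:
        return 0.0, 0.0
    return x_prev[-1], 0.5 * math.tanh(x_prev[-1])


def log_density(x):
    """Log density of x under one affine autoregressive layer.

    The inverse map is u_i = (x_i - mu_i) * exp(-alpha_i), with (mu_i, alpha_i)
    depending on x_1, ..., x_{i-1} only, so the Jacobian du/dx is lower
    triangular with diagonal exp(-alpha_i) and log|det| = -sum(alpha_i).
    """
    log_p = 0.0
    for i, x_i in enumerate(x):
        mu, alpha = conditioner(x[:i])
        u = (x_i - mu) * math.exp(-alpha)
        log_p += -0.5 * u * u - 0.5 * math.log(2.0 * math.pi)  # standard Normal base
        log_p += -alpha  # log-determinant contribution of this coordinate
    return log_p


if __name__ == "__main__":
    print(log_density([0.2, -0.1, 1.3]))
```

No matrix determinant is ever computed: the triangular structure reduces it to a sum over coordinates, whatever the dimension.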

The paper also links with ABC (too costly?), other parametric approximations to the posterior (like Gaussian copulas and variational likelihood-free inference), synthetic likelihood, Gaussian processes, noise contrastive estimation… With experiments involving some of the above. But the experiments involve rather smooth models with relatively few parameters.

“A general question is whether it is preferable to learn the posterior or the likelihood (…) Learning the likelihood can often be easier than learning the posterior, and it does not depend on the choice of proposal, which makes learning easier and more robust (…) On the other hand, methods such as SNPE return a parametric model of the posterior directly, whereas a further inference step (e.g. variational inference or MCMC) is needed on top of SNL to obtain a posterior estimate”

A fair point in the conclusion. Which also mentions the curse of dimensionality (both for parameters and observations) and the possibility to work directly with summaries.

Getting back to the earlier and connected Masked autoregressive flow for density estimation paper, by Papamakarios, Pavlakou and Murray:

“Viewing an autoregressive model as a normalizing flow opens the possibility of increasing its flexibility by stacking multiple models of the same type, by having each model provide the source of randomness for the next model in the stack. The resulting stack of models is a normalizing flow that is more flexible than the original model, and that remains tractable.”
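In symbols, and with indexing conventions that are an assumption of this note (layers k = 1, …, K, base density π for the first layer's noise, final output x = x^{(K)}), the stacking described in the quote amounts to:

```latex
\begin{align*}
x_i^{(k)} &= u_i^{(k)} \exp\{\alpha_i^{(k)}\} + \mu_i^{(k)},
  \qquad (\mu_i^{(k)}, \alpha_i^{(k)}) = f^{(k)}\big(x_{1:i-1}^{(k)}\big), \\
u^{(k)} &= x^{(k-1)}, \qquad k = 2, \dots, K, \\
\log p\big(x^{(K)}\big) &= \log \pi\big(u^{(1)}\big) - \sum_{k=1}^{K} \sum_{i} \alpha_i^{(k)}.
\end{align*}
```

Each layer contributes a triangular Jacobian, so the log-determinants simply add up and the stacked density remains available in closed form.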

Which makes it sound like a sort of a neural network in the density space. Optimised by Kullback-Leibler minimisation to get asymptotically close to the likelihood. But a form of Bayesian indirect inference in the end, namely an MLE on a pseudo-model, using the estimated model as a proxy in Bayesian inference…