Archive for post-processing

multimodal challenges

Posted in Books, Statistics, University life on April 30, 2026 by xi'an

At the last mostly Monte Carlo seminar, Pierre Monmarché presented a recent work on post-sampling for multimodal targets: while I consider that the main problem in sampling from generic multimodal targets lies in finding the modes, rather than in exploring local aspects or estimating the relative weights of said modes, this made me ponder whether or not the exploration could be accelerated by removing chunks of the already explored modes to induce moves elsewhere, which is a form of radical, brute-force tempering, or of Wang-Landau. As for the relative weights, a multiple-move proposal can be considered, including our folding idea. Or Geyer's inverse logistic trick. Or the similar mixture trick we used in our Biometrika paper on nested sampling. Pierre's approach was closer to adaptive importance sampling, with a self-imposed constraint of fixed sample sizes from (approximate) distributions around each of the modes.
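
To make the "fixed sample sizes around each mode" idea more concrete, here is a minimal sketch of mode-based importance sampling on a toy bimodal target, assuming the modes (and local Gaussian approximations around them) have already been located, which is the hard part in my opinion. The target, scales, and sample sizes are purely illustrative and not Pierre's actual construction.

```python
import numpy as np
from scipy import stats

# Illustrative unnormalised bimodal target: 0.3 N(-5, 1) + 0.7 N(5, 0.5).
def log_target(x):
    return np.logaddexp(np.log(0.3) + stats.norm.logpdf(x, -5, 1),
                        np.log(0.7) + stats.norm.logpdf(x, 5, 0.5))

# Assume the modes have already been located, together with local Gaussian
# (e.g. Laplace) approximations around each of them.
modes = np.array([-5.0, 5.0])
scales = np.array([1.0, 0.5])
n_per_mode = 10_000   # self-imposed fixed sample size from each local approximation

rng = np.random.default_rng(0)
mode_masses = []
for m, s in zip(modes, scales):
    x = rng.normal(loc=m, scale=s, size=n_per_mode)
    # importance weights of the local proposal against the target
    log_w = log_target(x) - stats.norm.logpdf(x, loc=m, scale=s)
    mode_masses.append(np.exp(log_w).mean())   # estimates the mass of that mode

weights = np.array(mode_masses) / np.sum(mode_masses)
print(weights)   # relative weights of the modes, close to (0.3, 0.7) here
```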

easily computed marginal likelihoods for multivariate mixture models using the THAMES estimator

Posted in Books, Statistics, University life on May 25, 2025 by xi'an

Martin Metodiev and his coauthors have produced another paper on the THAMES Monte Carlo method, specifically targeting marginal likelihoods for mixture models. Since this problem has long been a central interest of mine, and since the method is closely connected with the harmonic mean solution we developed with Darren Wraith in 2009 (and also included in our 2009 survey with Jean-Michel Marin of evidence approximations, published in Frontiers of Statistical Decision Making and Bayesian Analysis for Jim Berger's 60th birthday), I quickly went into the paper. The core purpose of this paper is to adapt THAMES to a multimodal setting, since using an ellipsoidal region as the support of the Uniform reciprocal importance sampling distribution does not make sense for a multimodal target. After reading it a few times, and while some computational aspects remain obscure to me, I am not convinced this brings an adequate answer to the challenge. Indeed, while the approach borrows directly from Berkhof et al. (2003), which inspired the resolution Jeong (Kate) Lee and myself proposed, the issues I have with the current proposal are that

1. the evacuation of earlier methods as not simple or not universal enough is rather disingenuous. For instance, software that does not return (latent) allocation vectors can easily be post-processed. And the current method uses allocation probabilities just the same (in Section 3.3). Similarly, the random shuffling answer to label (lack of) switching proposed by Sylvia Frühwirth-Schnatter, which again can be achieved by post-processing, cannot be rejected on the sole basis that the component means (based on the MCMC sample) are all similar. It is furthermore debatable that the current proposal is simple, when it involves relabelling à la Stephens, averaging over permutations, selecting over said permutations by constructing a graph over components (Section 3.2.1), running a quadratic discriminant analysis (Section 3.2.2) on the posterior sample, based on an arbitrary Normal representation of the distributions of the clusters, and finally defining a new ordering constraint (Section 3.2.3). The computing efforts required by the respective methods do not appear in the main text.

2. the handling of the label switching issue (the reason why Larry Wasserman placed mixtures at the same level of evil as tequila!) is problematic for several reasons. My position (since at least 2000!) on the matter is that the proper posterior sample must exhibit label switching and come close to symmetry among the "components". The label switching problem (Section 3.1) rather occurs when the MCMC sample does not switch labels. The relabelling approach (e.g., à la Stephens) allows for a differentiation between components, to some extent, which helps with computing basic posterior moments for point estimation or with calibrating the support of the Uniform reciprocal importance sampling distribution, but any relabelling procedure tampers with the original MCMC sample and is thus bound to impact the distribution of the resulting relabelled sample. Furthermore, relabelling depends on the value of G, whereas the actual number of (significant) modes in the posterior is also connected with the (partial) fit of the data to the model, meaning the creation of further modes than those linked with relabelling, especially when the model is misspecified. Incidentally, the symmetrised version of THAMES (5) does not require relabelling. Neither does the Bayes factor. In addition, the experiment section (4.1.2) mentions that bridge sampling is biased by a factor of G!, which comes as a surprise to me since I associated this factor with the call to Sid Chib's formula in the absence of label switching, i.e., when the MCMC sample was stuck on a single mode, as exposed by Radford Neal in 1999. Is it because bridge sampling is applied to the relabelled sample? It is also surprising that the gap appears in the simulated datasets (Fig. 3) and not in the real ones (Fig. 5).

3. the (legitimate) purpose of using marginal likelihoods for selecting the number G of components is weakened by the intrusion of alternate proposals to assess G from the data, like the overlap criterion (Section 3.2.1), which instead aims at the number of clusters, with an elimination of "empty components" that should either remain a possibility (within a regular mixture model) or be evacuated with a different modelling (à la Diebolt & Robert, or à la Wasserman). This overlap criterion is further used in the discriminant analysis that only applies to "non-overlapping components" of the mixture (Section 3.2.3), at which point I got lost in the reordering and simplification of the computation of THAMES (but got reminded of the results of Agostino Nobile in the 2000s, with whom I used to discuss a lot during my yearly visits to the University of Glasgow).

4. several mentions are made of the other estimators being biased, which is indeed the case for bridge sampling (if not necessarily for importance sampling), but this is not necessarily a central issue, while the original generalised harmonic mean proposal by Gelfand and Dey (1994), and thus THAMES, produces an unbiased estimator of the inverse of the evidence (thus neither of the evidence nor of the log-evidence). However, in the paper, the volume of the support of the Uniform reciprocal importance sampling distribution is estimated by a basic Monte Carlo coverage probability in (3), which induces the same type of bias as the other methods.
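
For reference, here is a minimal sketch of a Gelfand-Dey/THAMES-type reciprocal importance sampling estimator on a toy conjugate Normal model with a known evidence, using a Uniform distribution on an ellipsoid built from the posterior sample. The radius choice and the exact-posterior draws standing in for MCMC output are my own simplifications, not the recipe of the paper under discussion.

```python
import numpy as np
from scipy import stats
from scipy.special import gammaln

rng = np.random.default_rng(1)

# Toy conjugate model with a known evidence: y | theta ~ N(theta, I), theta ~ N(0, I).
d = 2
y = np.array([0.5, -1.0])
true_evidence = stats.multivariate_normal(mean=np.zeros(d), cov=2 * np.eye(d)).pdf(y)

# Exact posterior draws, standing in for MCMC output: theta | y ~ N(y/2, I/2).
T = 50_000
theta = rng.multivariate_normal(mean=y / 2, cov=np.eye(d) / 2, size=T)

# Ellipsoidal support A for the Uniform instrumental density: centre and shape from the
# posterior sample, squared radius chosen (illustratively) to keep a high-density region.
m, S = theta.mean(0), np.cov(theta.T)
c2 = stats.chi2.ppf(0.5, df=d)
maha = np.einsum('ij,jk,ik->i', theta - m, np.linalg.inv(S), theta - m)
inside = maha <= c2
# Closed-form log-volume of the ellipsoid; estimating it instead by a Monte Carlo
# coverage probability (as in the paper's (3)) would reintroduce a ratio bias.
log_volA = 0.5 * d * np.log(np.pi) - gammaln(d / 2 + 1) \
           + 0.5 * d * np.log(c2) + 0.5 * np.linalg.slogdet(S)[1]

# Reciprocal importance sampling: unbiased for 1/Z, hence not unbiased for Z itself.
log_post_unnorm = (stats.multivariate_normal(np.zeros(d), np.eye(d)).logpdf(theta)
                   + stats.multivariate_normal(y, np.eye(d)).logpdf(theta))
inv_Z_hat = np.mean(inside * np.exp(-log_volA - log_post_unnorm))
print(1 / inv_Z_hat, true_evidence)
```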

adaptive copulas for ABC

Posted in Statistics on March 20, 2019 by xi'an

A paper on ABC I read on my way back from Cambodia: Yanzhi Chen and Michael Gutmann arXived an ABC [in Edinburgh] paper on learning the target via Gaussian copulas, to be presented at AISTATS this year (in Okinawa!), linking post-processing (regression) ABC and sequential ABC. The drawback in the regression approach is that the correction often relies on a homogeneity assumption on the distribution of the noise or residual, since this approach only applies a drift to the original simulated sample. Their method is based on two stages: a coarse-grained one, where the posterior is approximated by ordinary linear regression ABC, and a fine-grained one, which uses the above coarse Gaussian version as a proposal and returns a Gaussian copula estimate of the posterior. This proposal is somewhat similar to the neural network approach of Papamakarios and Murray (2016), and to the Gaussian copula version of Li et al. (2017), the major difference being the presence of two stages. The new method is compared with other ABC proposals at a fixed simulation cost, which does not account for the construction costs, although these should be relatively negligible. To compare these ABC avatars, the authors use a symmetrised Kullback-Leibler divergence I had not met previously, requiring a massive numerical integration (although this is not an issue for the practical implementation of the method, which only calls for the construction of the neural network(s)). Note also that sequential ABC is only run for two iterations, and that none of the importance sampling ABC versions of Fearnhead and Prangle (2012) and of Li and Fearnhead (2018) are considered, all versions relying on the same vector of summary statistics, with a dimension much larger than the dimension of the parameter. Except in our MA(2) example, where regression does as well. I wonder about the impact of the dimension of the summary statistic on the performance of the neural network, i.e., whether or not it is able to manage the curse of dimensionality by ignoring all but the essential statistics in the optimisation.
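
As a point of comparison, here is a minimal sketch of fitting a Gaussian copula with kernel-smoothed marginals to a (fake) ABC posterior sample, in the spirit of the fine-grained stage; the rank-based normal scores and plug-in correlation matrix are standard simplifications of mine, not the authors' exact estimator.

```python
import numpy as np
from scipy import stats

def fit_gaussian_copula(sample):
    """Fit kernel-smoothed marginals plus a Gaussian copula to a (T, p) posterior sample."""
    T, p = sample.shape
    kdes = [stats.gaussian_kde(sample[:, j]) for j in range(p)]
    # Normal scores from the (rank-based) empirical marginal CDFs.
    ranks = np.argsort(np.argsort(sample, axis=0), axis=0) + 1
    z = stats.norm.ppf(ranks / (T + 1))
    R = np.corrcoef(z.T)
    return kdes, R

def log_copula_density(x, kdes, R):
    """Log density of the Gaussian-copula approximation at a single point x."""
    p = len(kdes)
    log_marg = np.array([np.log(k.pdf(x[j]))[0] for j, k in enumerate(kdes)])
    u = np.array([k.integrate_box_1d(-np.inf, x[j]) for j, k in enumerate(kdes)])
    z = stats.norm.ppf(np.clip(u, 1e-10, 1 - 1e-10))
    quad = z @ (np.linalg.inv(R) - np.eye(p)) @ z
    _, logdetR = np.linalg.slogdet(R)
    return log_marg.sum() - 0.5 * (logdetR + quad)

# Illustrative use on a correlated fake "ABC posterior" sample.
rng = np.random.default_rng(0)
sample = rng.multivariate_normal([0, 1], [[1, .6], [.6, 1]], size=5000)
kdes, R = fit_gaussian_copula(sample)
print(log_copula_density(np.array([0.2, 0.8]), kdes, R))
```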

postprocessing for ABC

Posted in Books, Statistics on June 1, 2017 by xi'an

Two weeks ago, G.S. Rodrigues, Dennis Prangle and Scott Sisson arXived a paper on recalibrating ABC output to make it correctly calibrated (in the frequentist sense). As in earlier papers, it takes advantage of the fact that the tail posterior probability should be uniformly distributed at the true value of the [simulated] parameter behind the [simulated] data. And as in Prangle et al. (2014), it relies on a copula representation. The main notion is that the marginal posteriors can be reasonably approximated by non-parametric kernel estimators, which means that an F⁰∘F⁻¹ transform can be applied to an ABC reference table in a fully non-parametric extension of Beaumont et al. (2002). Besides the issue that F is an approximation, I wonder about the computing cost of this approach, given that computing the post-processing transforms comes at a cost of O(pT²), where p is the dimension of the parameter and T the size of the ABC learning set… One question that came to me while discussing the paper with Jean-Michel Marin is why one would use F⁻¹(θ¹|s) rather than directly a Uniform U(0,1) draw, since in theory this quantity should be distributed as a Uniform U(0,1).
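
To see where the O(pT²) cost comes from, here is a minimal sketch of the per-marginal kernel CDF transform applied to a fake ABC reference table; the bandwidth rule, the kernel choice, and the variable names are illustrative assumptions of mine, not the authors' implementation.

```python
import numpy as np
from scipy import stats

def kernel_cdf_transform(table, bandwidths):
    """Transform each column of an ABC reference table to (approximately) Uniform(0,1)
    scores using a Gaussian-kernel estimate of its marginal CDF.
    The sum over all T rows, evaluated at all T rows, makes the O(p T^2) cost visible."""
    T, p = table.shape
    u = np.empty_like(table)
    for j in range(p):                    # one marginal at a time
        h = bandwidths[j]
        # F_hat_j(x) = (1/T) * sum_t Phi((x - theta_tj) / h), evaluated at every row
        diffs = (table[:, j][:, None] - table[:, j][None, :]) / h   # (T, T) array
        u[:, j] = stats.norm.cdf(diffs).mean(axis=1)
    return u

# Fake reference table: T simulated parameter values, p = 2 components.
rng = np.random.default_rng(2)
table = np.column_stack([rng.gamma(2.0, size=2000), rng.normal(size=2000)])
bw = 1.06 * table.std(axis=0) * len(table) ** (-1 / 5)   # rule-of-thumb bandwidths
u = kernel_cdf_transform(table, bw)
# u[:, j] would then be pushed through the inverse of an estimated target marginal CDF,
# which is where the composed-CDF recalibration of the ABC output comes in.
print(u.min(), u.max())
```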