Martin Metodiev and his coauthors have produced another paper on the THAMES Monte Carlo method, this time specifically targeting marginal likelihoods for mixture models. Since this problem has long been a central interest of mine and since the method is closely connected with the harmonic mean solution we developed with Darren Wraith in 2009 (also included in our 2009 survey of evidence approximations with Jean-Michel Marin, published in Frontiers of Statistical Decision Making and Bayesian Analysis for Jim Berger’s 60th birthday), I quickly delved into the paper. The core purpose of this paper is to adapt THAMES to a multimodal setting, since using an ellipsoidal region as the support of the Uniform reciprocal importance sampling distribution does not make sense for a multimodal target. After reading it a few times, and while some computational aspects remain obscure to me, I am not convinced it brings an adequate answer to the challenge. Indeed, while the approach borrows directly from Berkhof et al. (2003), which inspired the resolution Jeong (Kate) Lee and I proposed, the issues I have with the current proposal are that
1. the evacuation of earlier methods as not simple or not universal enough is rather disingenuous. For instance, software that does not return (latent) allocation vectors can easily be post-processed. And the current method uses allocation probabilities just the same (in Section 3.3). Similarly, the random shuffling answer to label (lack of) switching proposed by Sylvia Frühwirth-Schnatter—which again can be achieved by post-processing—cannot be rejected on the sole basis that the component means (based on the MCMC sample) are all similar. It is furthermore debatable that the current proposal is simple, when it involves relabelling à la Stephens, averaging over permutations, selecting among said permutations by constructing a graph over components (Section 3.2.1), running a quadratic discriminant analysis (Section 3.2.2) on the posterior sample, based on an arbitrary Normal representation of the distributions of the clusters, and finally defining a new ordering constraint (Section 3.2.3). The computing efforts required by the respective methods are not reported in the main text.
2. the handling of the label switching issue—the reason why Larry Wasserman ranked mixtures at the same magnitude of evil as tequila!—is problematic for several reasons. My position (since at least 2000!) on the matter is that a proper posterior sample must exhibit label switching and come close to symmetry among the “components”. The label switching problem (Section 3.1) arises rather when the MCMC sample does not “switch its labels”. A relabelling approach (e.g., à la Stephens) allows for a differentiation between components, to some extent, which helps with computing basic posterior moments for point estimation or with calibrating the support of the Uniform reciprocal importance sampling distribution, but any relabelling procedure tampers with the original MCMC sample and is thus bound to impact the distribution of the resulting relabelled sample. Furthermore, relabelling depends on the value of G, whereas the actual number of (significant) modes in the posterior is also connected with the (partial) fit of the data to the model, meaning that modes beyond those linked with relabelling may appear, especially when the model is misspecified. Incidentally, the symmetrised version of THAMES (5) does not require relabelling. Neither does the Bayes factor. In addition, the experiment section (4.1.2) mentions that bridge sampling is biased by a factor of G!, which comes as a surprise to me since I associated this factor with the call to Sid Chib’s formula in the absence of label switching, i.e. when the MCMC sample is stuck on a mode, as exposed by Radford Neal in 1999. Is it because bridge sampling is applied to the relabelled sample? It is also surprising that the gap appears in the simulated datasets (Fig. 3) but not in the real ones (Fig. 5).
3. the (legitimate) purpose of using marginal likelihoods to select the number G of components is weakened by the intrusion of alternative proposals to assess G from the data, like the criterion of overlap (Section 3.2.1), which instead aims at the number of clusters, with an elimination of “empty components” that should either remain a possibility (within a regular mixture model) or be evacuated via a different modelling (à la Diebolt & Robert, or à la Wasserman). This overlap criterion is further used in the discriminant analysis that only applies to “non-overlapping components” of the mixture (Section 3.2.3)—at which point I got lost in the reordering and simplification of the computation of THAMES (but was reminded of the results of Agostino Nobile in the 2000s, with whom I used to discuss these matters a lot during my yearly visits to the University of Glasgow).
4. several mentions are made of the other estimators being biased, which is indeed the case for bridge sampling (if not necessarily for importance sampling), but this is not necessarily a central issue, since the original generalised harmonic mean proposal of Gelfand and Dey (1994), and thus THAMES, produces an unbiased estimator of the inverse of the evidence (hence a biased estimator of both the evidence and the log-evidence). However, in the paper, the volume of the support of the Uniform reciprocal importance sampling distribution is estimated by a basic Monte Carlo coverage probability in (3), which induces the same type of bias as the other methods.
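For concreteness, here is a toy sketch of the truncated harmonic mean idea behind THAMES as I understand it—entirely my own illustration, not the authors’ code: the reciprocal evidence is estimated by averaging the reciprocal of the unnormalised posterior against a Uniform distribution supported on a region A of high posterior mass, here a plain interval for a conjugate Gaussian model whose true evidence is available in closed form.

```python
# Toy sketch (my own, not from the paper): truncated harmonic mean /
# Gelfand-Dey estimate of the evidence, with a Uniform importance
# distribution on an interval A, for a conjugate Gaussian model.
import math
import random


def norm_pdf(x, m, s):
    return math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2 * math.pi))


random.seed(42)
y = 1.3                                    # single observation, y | mu ~ N(mu, 1)
# prior mu ~ N(0, 1)  =>  posterior mu | y ~ N(y/2, 1/2), evidence Z = N(y; 0, 2)
post_m, post_s = y / 2, math.sqrt(0.5)
true_Z = norm_pdf(y, 0.0, math.sqrt(2.0))

n = 200_000
draws = [random.gauss(post_m, post_s) for _ in range(n)]
lo, hi = post_m - 2 * post_s, post_m + 2 * post_s   # support A of the Uniform
vol = hi - lo
# unbiased estimate of 1/Z: average of 1_A(mu) / (vol * prior(mu) * lik(mu))
inv_Z = sum(1.0 / (norm_pdf(m, 0.0, 1.0) * norm_pdf(y, m, 1.0))
            for m in draws if lo <= m <= hi) / (n * vol)
Z_hat = 1.0 / inv_Z
print(Z_hat, true_Z)
```

With the support A kept inside the high-posterior region, the estimator has finite variance, which is the whole point of truncating the harmonic mean.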
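Regarding point 2., the fact that a symmetrised estimator dispenses with relabelling can be illustrated by averaging any label-dependent function over the G! permutations of the component labels—again a sketch of mine, not taken from the paper, with hypothetical component centres:

```python
# Sketch (my own illustration, not the authors' code): averaging a
# label-dependent function over all G! label permutations yields a
# function invariant to label switching, with no relabelling of the sample.
import itertools
import math


def norm_pdf(x, m, s=1.0):
    return math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2 * math.pi))


centers = (-1.0, 0.0, 2.0)        # hypothetical per-component centres, G = 3


def q(theta):                     # label-dependent importance function
    return math.prod(norm_pdf(t, c) for t, c in zip(theta, centers))


def q_sym(theta):                 # symmetrised over the G! = 6 permutations
    perms = list(itertools.permutations(theta))
    return sum(q(p) for p in perms) / len(perms)


theta = (0.3, -0.9, 1.8)
swapped = (-0.9, 0.3, 1.8)        # first two component labels exchanged
print(q(theta), q(swapped))       # differ: q depends on the labelling
print(q_sym(theta), q_sym(swapped))  # equal: invariant under switching
```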
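On that last point, the bias induced by inverting an unbiased estimator of 1/Z is a direct consequence of Jensen’s inequality and is easy to reproduce numerically—a minimal sketch of my own, using a toy unbiased estimator of 1/Z rather than THAMES itself:

```python
# Illustration (mine, not from the paper): inverting an unbiased estimator
# of 1/Z produces an upward-biased estimator of Z, by Jensen's inequality.
import random

random.seed(1)
Z = 2.0
reps, n = 20_000, 5
est_Z = []
for _ in range(reps):
    # W is an unbiased n-sample Monte Carlo estimate of 1/Z = 0.5
    # (Exponential draws with rate Z, hence mean 1/Z)
    W = sum(random.expovariate(Z) for _ in range(n)) / n
    est_Z.append(1.0 / W)
mean_Z_hat = sum(est_Z) / reps
print(mean_Z_hat, Z)   # the average of 1/W exceeds the true Z
```

The same convexity argument applies to the log, so neither Z nor log Z is estimated without bias even when 1/Z is.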
The paper, on Bayesian inference for mixtures, was posted on arXiv last week (13 Jan 2025).
