Archive for clustering

easily computed marginal likelihoods for multivariate mixture models using the THAMES estimator

Posted in Books, Statistics, University life on May 25, 2025 by xi'an

Martin Metodiev and his coauthors have produced another paper on the THAMES Monte Carlo method, this time specifically targeting marginal likelihoods for mixture models. Since this problem has long been a central interest of mine and since the method is closely connected with the harmonic mean solution we developed with Darren Wraith in 2009 (and also included in our 2009 survey of evidence approximations with Jean-Michel Marin, published in Frontiers of Statistical Decision Making and Bayesian Analysis for Jim Berger's 60th birthday), I quickly went through the paper. Its core purpose is to adapt THAMES to a multimodal setting, since using an ellipsoidal region as the support of the Uniform reciprocal importance sampling distribution does not make sense for a multimodal target. After reading it a few times, and while some computational aspects remain obscure to me, I am not convinced this brings an adequate answer to the challenge. Indeed, while the approach borrows directly from Berkhof et al. (2003), which inspired the resolution Jeong (Kate) Lee and I proposed, the issues I have with the current proposal are that

1. the evacuation of earlier methods as not simple or not universal enough is rather disingenuous. For instance, software that does not return (latent) allocation vectors can easily be post-processed. And the current method uses allocation probabilities just the same (in Section 3.3). Similarly, the random shuffling answer to label (lack of) switching proposed by Sylvia Frühwirth-Schnatter—which again can be achieved by post-processing—cannot be rejected on the sole basis that the component means (based on the MCMC sample) are all similar. It is furthermore debatable that the current proposal is simple, when it involves relabelling à la Stephens, averaging over permutations, selecting over said permutations by constructing a graph over components (section 3.2.1), running a quadratic discriminant analysis (section 3.2.2) on the posterior sample, based on an arbitrary Normal representation of the distributions of the clusters, and finally defining a new ordering constraint (section 3.2.3). The computing efforts required by the respective methods do not appear in the main text.

2. the handling of the label switching issue—the reason why Larry Wasserman saw mixtures at the same magnitude of evil as tequila!—is problematic for several reasons. My position (since at least 2000!) on the matter is that the proper posterior sample must exhibit label switching and come close to symmetry among the “components”. The label switching problem (section 3.1) rather arises when the MCMC sample does not switch labels. A relabelling approach (e.g., à la Stephens) allows for a differentiation between components, to some extent, which helps with computing basic posterior moments for point estimation or with calibrating the support of the Uniform reciprocal importance sampling distribution, but any relabelling procedure tampers with the original MCMC sample and is thus bound to impact the distribution of the resulting relabelled sample. Furthermore, relabelling depends on the value of G, whereas the actual number of (significant) modes in the posterior is also connected with the (partial) fit of the data to the model, meaning the creation of further modes than those linked with relabelling, especially when the model is misspecified. Incidentally, the symmetrised version of THAMES (5) does not require relabelling. Neither does the Bayes factor. In addition, the experiment section (4.1.2) mentions that bridge sampling is biased by a factor of G!, which comes as a surprise to me since I associated this factor with the call to Sid Chib’s formula in the absence of label switching, i.e. when the MCMC sample was stuck on a mode, as exposed by Radford Neal in 1999. Is it because bridge sampling is applied to the relabelled sample? It is also surprising that the gap appears in the simulated datasets (Fig. 3) and not in the real ones (Fig. 5).

3. the (legitimate) purpose of using marginal likelihoods for selecting the number G of components is weakened by the intrusion of alternate proposals to assess G from the data, like the overlap criterion (section 3.2.1), which instead aims at the number of clusters, with an elimination of “empty components” that should either remain a possibility (within a regular mixture model) or be evacuated through a different modelling (à la Diebolt & Robert, or à la Wasserman). This overlap criterion is further used in the discriminant analysis that only applies to “non-overlapping components” of the mixture (section 3.2.3)—at which point I got lost in the reordering and simplification of the computation of THAMES (but was reminded of the results of Agostino Nobile in the 2000s, with whom I used to have long discussions during my yearly visits to the University of Glasgow).

4. several mentions are made of the other estimators being biased, which is indeed the case for bridge sampling (if not necessarily for importance sampling), but not necessarily a central issue, while the original generalised harmonic mean proposal by Gelfand and Dey (1994), and thus THAMES, produces an unbiased estimator of the inverse of the evidence (thus neither of the evidence nor of the log-evidence), as recalled in the sketch below. However, in the paper, the volume of the support of the Uniform reciprocal importance sampling distribution is estimated by a basic Monte Carlo coverage probability in (3), which induces the same type of bias as the other methods.
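To make the object of the discussion concrete, here is a minimal sketch of a truncated (generalised) harmonic mean estimator in the spirit of THAMES, with a Uniform instrumental density supported by a single ellipsoid fitted to the posterior sample. The function names, the HPD level alpha, and the interface log_post_unnorm (returning log prior plus log likelihood at a draw) are mine, not the authors'; the multimodal version discussed above replaces the single ellipsoid with a union of per-mode ellipsoids, whose volume is then estimated by Monte Carlo coverage as in (3).

```python
import numpy as np
from scipy.stats import chi2
from scipy.special import gammaln, logsumexp

def ellipsoid_from_sample(theta, alpha=0.5):
    """Fit an ellipsoid {x : (x-m)' S^{-1} (x-m) <= r2} covering roughly a
    fraction alpha of the posterior draws (a crude HPD-like region)."""
    m = theta.mean(axis=0)
    S = np.cov(theta, rowvar=False)
    r2 = chi2.ppf(alpha, df=theta.shape[1])
    return m, S, r2

def log_volume_ellipsoid(S, r2):
    """log volume of {x : (x-m)' S^{-1} (x-m) <= r2} in dimension d."""
    d = S.shape[0]
    _, logdet = np.linalg.slogdet(S)
    return 0.5 * d * np.log(np.pi * r2) - gammaln(0.5 * d + 1) + 0.5 * logdet

def log_evidence_truncated_hm(theta, log_post_unnorm, alpha=0.5):
    """Reciprocal importance sampling with a Uniform density on the ellipsoid A:
    1/Z_hat = (1/T) * sum_{theta_t in A} 1 / ( V(A) * prior(theta_t) * like(theta_t) )."""
    m, S, r2 = ellipsoid_from_sample(theta, alpha)
    diff = theta - m
    maha = np.einsum('ti,ij,tj->t', diff, np.linalg.inv(S), diff)
    inside = maha <= r2
    if not inside.any():
        raise ValueError("no posterior draw falls inside the ellipsoid")
    logV = log_volume_ellipsoid(S, r2)
    lp = np.array([log_post_unnorm(t) for t in theta[inside]])
    log_inv_Z = -logV + logsumexp(-lp) - np.log(len(theta))
    return -log_inv_Z  # estimate of the log evidence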

miXtures on arXiv

Posted in Books, Statistics, University life on February 5, 2025 by xi'an

A paper about Bayesian inference on mixtures was posted on arXiv last week (dated 13 Jan 2025). Fast sampling and model selection for Bayesian mixture models, by M. E. J. Newman, is based on the notion that (genuine) parameters of a mixture model can be marginalized out when using conjugate priors. This is something that we pointed out quite a while ago, in a 1999 paper with George and Marty, which was devised during a long ride from Baltimore to Cornell after JSM 1999, and again in the 2002 Series B perfect sampling paper with George, Kerrie and Mike. (Also written in 1999.) The marginal likelihood can furthermore be approximated along the way, as discussed in the more recent papers Bayesian Inference on Mixtures of Distributions with Kate, Kerrie & Jean-Michel, as well as Approximating the marginal likelihood in mixture models with Jean-Michel.
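For illustration, here is a minimal sketch of the conjugacy argument in the simplest setting of a Normal mixture with known variance: both the Dirichlet weights and the Normal component means integrate out in closed form, leaving a collapsed joint density of the data and the allocation vector alone. The function names and hyperparameters are illustrative, not taken from the paper; they only show that conjugacy makes the marginalisation explicit, so that MCMC can run on the allocations z alone.

```python
import numpy as np
from scipy.special import gammaln

def log_marginal_component(y, sigma2=1.0, mu0=0.0, tau2=10.0):
    """log ∫ Π_i N(y_i | mu, sigma2) N(mu | mu0, tau2) dmu, in closed form."""
    m = len(y)
    if m == 0:
        return 0.0
    ybar = np.mean(y)
    S = np.sum((y - ybar) ** 2)
    return (-0.5 * m * np.log(2 * np.pi * sigma2)
            - S / (2 * sigma2)
            + 0.5 * np.log(sigma2 / (sigma2 + m * tau2))
            - m * (ybar - mu0) ** 2 / (2 * (sigma2 + m * tau2)))

def log_collapsed_joint(x, z, K, a=1.0, **kwargs):
    """log p(x, z) for a K-component Normal mixture, with the symmetric
    Dirichlet(a) weights and the Normal component means integrated out."""
    n = len(x)
    counts = np.bincount(z, minlength=K)
    # Dirichlet-multinomial prior on the allocation vector z
    log_pz = (gammaln(K * a) - gammaln(n + K * a)
              + np.sum(gammaln(counts + a) - gammaln(a)))
    # product of per-component conjugate marginal likelihoods
    log_px_given_z = sum(log_marginal_component(x[z == k], **kwargs) for k in range(K))
    return log_pz + log_px_given_z
```

A Gibbs or Metropolis move on a single allocation z_i only requires the ratio of two such collapsed terms, which is what makes parameter-free (collapsed) samplers for mixtures possible.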

“Standard mixture models, as commonly formulated, also suffer from a technical, but important, difficulty: the existence of empty components. In many models (…) the number of observations in a component can be zero. Arguably this is acceptable for a model with a fixed number of components, but when the number of components is a free random variable it causes ambiguity, because a given division of observations into components can be represented in more than one way in the model. For instance, we could divide observations into two components, or we could divide them into three components, one of which is empty. This in turn creates difficulties when estimating the number of components—do we have two components or three?”

A very puzzling perspective, imho, since potentially empty components are inherent to (both finite and infinite) mixture models, with connected issues of prohibiting some improper priors (if not all) and of non-identifiability, including non-identifiability of the number of empty components (which remains random conditional on the data!), while different numbers of components lead to different models whose comparison is handled straightforwardly by a Bayesian analysis.

The author then proceeds to “prohibit empty components” [as a prior choice?] as we did in the original (!) Gibbs sampler for mixtures in 1990 (published in 1994 in Series B!), seeking posterior properness, a trick later validated by Larry Wasserman (in, again, 1999, the year of mixtures!), who called the construct a combination of a fixed prior and of a pseudo-likelihood rather than a prior choice, correctly imho (as the data-dependent part is not properly normalised by a function of the parameters). (The very one who stated that “mixtures, like tequila, are evil and should be avoided”.)

From there, the modelling is rather standard, with an arbitrary prior on k, the number of components, and a random partition model that prohibits empty components, even though the constraint could be made more stringent depending on the number of parameters of a given component and the degree of impropriety of the prior, as in our 1990 Series B paper. (Impropriety is not discussed in the paper.) Bayesian inference on k is based on the simulated (pseudo-)posterior. The clustering estimate chosen therein is the most frequent partition (consensus clustering), connected to our proposal of (again!) 1999 with Merrilee and Gilles, while the estimated mixture is not made explicit. The approach is assessed as running at an O(k) cost, with no parallel statement in terms of the data size n, even though the examples include a dataset of 59,946 observations. One notable algorithmic trick when moving k is to first select a component at random rather than an observation index.
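As an aside, here is a minimal sketch of what a most-frequent-partition (consensus clustering) summary may look like, with hypothetical function names: each MCMC draw of the allocation vector is reduced to the partition it induces, so that label permutations of the same clustering are counted together. This merely illustrates the idea, not Newman's implementation.

```python
from collections import Counter
import numpy as np

def canonical_partition(z):
    """Relabel components by order of first appearance, so that two allocation
    vectors differing only by a label permutation map to the same tuple."""
    relabel = {}
    out = []
    for label in z:
        if label not in relabel:
            relabel[label] = len(relabel)
        out.append(relabel[label])
    return tuple(out)

def consensus_partition(z_draws):
    """Return the most frequent induced partition among the MCMC draws,
    together with its empirical posterior frequency."""
    counts = Counter(canonical_partition(z) for z in z_draws)
    best, freq = counts.most_common(1)[0]
    return np.array(best), freq / len(z_draws)

# e.g. draws = [np.array([0,0,1,1]), np.array([1,1,0,0]), np.array([0,1,1,0])]
# consensus_partition(draws) -> (array([0,0,1,1]), 2/3): the first two draws
# induce the same partition despite their swapped labels.
```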

Some minor issues: detailed balance is indicated as required for convergence (p14), label switching is called component switching (p5), and a higher acceptance rate is presented as meaning improved performance (p7).

All About that Bayes stroll

Posted in pictures, Statistics, University life on February 9, 2024 by xi'an

For all Bayesians and sympathisers in the Paris area, an upcoming All about that Bayes seminar¹ by Elisabeth Gassiat (Institut de Mathématiques d’Orsay) on 13 February, 16h00, on Campus Pierre & Marie Curie, SCAI:

A stroll through hidden Markov models

Hidden Markov models are latent variable models producing dependent sequences. I will survey recent results providing guarantees for their use in various fields such as clustering, multiple testing, nonlinear ICA or variational autoencoders.


¹Incidentally, I came across an unrelated All about that Bayes YouTube video, a talk given by Kristin Lennox (Lawrence Livermore National Laboratory), and then discovered a myriad of talks or courses using that pun.

Model-Based Clustering, Classification, and Density Estimation Using mclust in R [not a book review]

Posted in Statistics on May 29, 2023 by xi'an

Bayesian inference: challenges, perspectives, and prospects

Posted in Books, Statistics, University life on March 29, 2023 by xi'an

Over the past year, Judith, Michael and I edited a special issue of Philosophical Transactions of the Royal Society on Bayesian inference: challenges, perspectives, and prospects, in celebration of the current President of the Royal Society, Adrian Smith, and of his contributions to Bayesian analysis that have impacted the field to this day. The issue is now out! The following is the beginning of our introduction to the issue.

When contemplating his past achievements, it is striking to see how the emergence of massive advances in these fields aligns with some of his papers or books. For instance, Lindley and Smith's ‘Bayes Estimates for the Linear Model’ (1971), a Read Paper at the Royal Statistical Society, makes the case for the Bayesian analysis of this most standard statistical model, emphasizes the notion of exchangeability that is foundational in Bayesian statistics, and paves the way for the emergence of hierarchical Bayesian modelling. It thus links the early days of Bruno de Finetti, whose work Adrian Smith translated into English, with the current research in non-parametric and robust statistics. Bernardo and Smith's masterpiece, Bayesian Theory (1994), sets statistical inference within decision- and information-theoretic frameworks in a most elegant and universal manner that could be deemed a Bourbaki volume for Bayesian statistics, had this classification endeavour reached further than pure mathematics. It also emphasizes the central role of hierarchical modelling in the construction of priors, as exemplified in Carlin et al.'s ‘Hierarchical Bayesian analysis of change point problems’ (1992).

The series of papers published in 1990 by Alan Gelfand & Adrian Smith, esp. ‘Sampling-Based Approaches to Calculating Marginal Densities’ (1990), is overwhelmingly perceived as the birth date of modern Markov chain Monte Carlo (MCMC) methods, as it brought to the whole statistics community (and quickly to wider communities) the realization that MCMC simulation was the sesame to unlock complex modelling issues. The consequences for the adoption of Bayesian modelling by non-specialists are enormous and long-lasting. Similarly, Gordon et al.'s ‘Novel approach to nonlinear/non-Gaussian Bayesian state estimation’ (1992) is considered as the birthplace of sequential Monte Carlo, aka particle filtering, with considerable consequences in tracking, robotics, econometrics and many other fields. Titterington, Smith & Makov's reference book, ‘Statistical Analysis of Finite Mixtures’ (1984), is a precursor in the formalization of heterogeneous data structures, paving the way for the incoming MCMC resolutions like Tanner & Wong (1987), Gelman & King (1990) and Diebolt & Robert (1990). Denison et al.'s book, ‘Bayesian methods for nonlinear classification and regression’ (2002), is another testimony to the influence of Adrian Smith on the field, stressing the emergence of robust and general classification and nonlinear regression methods to analyse complex data, prefiguring in a way the later emergence of machine-learning methods, with the additional Bayesian assessment of uncertainty. It also brings forward the capacity for Bayesian non-parametric modelling that is now broadly accepted, following a series of papers by Denison et al. in the late 1990s like CART and MARS.

We are quite grateful to the authors contributing to this volume, namely Joshua J. Bon, Adam Bretherton, Katie Buchhorn, Susanna Cramb, Christopher Drovandi, Conor Hassan, Adrianne L. Jenner, Helen J. Mayfield, James M. McGree, Kerrie Mengersen, Aiden Price, Robert Salomone, Edgar Santos-Fernandez, Julie Vercelloni and Xiaoyu Wang, Afonso S. Bandeira, Antoine Maillard, Richard Nickl and Sven Wang, Fan Li, Peng Ding and Fabrizia Mealli, Matthew Stephens, Peter D. Grünwald, Sumio Watanabe, Peter Müller, Noirrit K. Chandra and Abhra Sarkar, Kori Khan and Alicia Carriquiry, Arnaud Doucet, Eric Moulines and Achille Thin, Beatrice Franzolini, Andrea Cremaschi, Willem van den Boom and Maria De Iorio, Sandra Fortini and Sonia Petrone, Sylvia Frühwirth-Schnatter, Sara Wade, Chris C. Holmes and Stephen G. Walker, Lizhen Nie and Veronika Ročková. Some of the papers are open-access, if not all, hence enjoy them!