Archive for Statistical Science

mixture models [book review]

Posted in Books, Statistics, University life on August 14, 2024 by xi'an

Strangely enough, I became aware of this new book on mixtures through one of these annoying emails “Your work has been cited n times this week”… Mixture Models (Parametric, Semiparametric, and New Directions) by Weixin Yao and Sijia Wang got published by CRC Press earlier this year, within the Monographs on Statistics and Applied Probability green series (#175), and covers across 380 pages most aspects of mixture (and hidden Markov) estimation, if with a strong emphasis on maximum likelihood estimation, while the new directions are unsurprisingly those pursued by the authors, namely robust and semi-parametric estimation, as well as model selection by testing.

An early warning about this book review is that I co-edited a Handbook of Mixture Analysis with my friends Sylvia Frühwirth-Schnatter and Gilles Celeux a few years ago. I am therefore biased in what I would have included in a new book on the topic, all the more because I find the available literature already plentiful, even though the early (1985) book of Titterington et al. that was my entry to the field may have become an historical reference. For instance, Finite Mixture Models by McLachlan and Peel (2000) remains relevant, with a similar emphasis on maximum likelihood and the EM algorithm, while Sylvia's Finite Mixture and Markov Switching Models is still a reference to this day.

And an additional warning on me not being a massive fan of semi- and non-parametric estimation in this setting…

Preliminaries that may explain my limited enthusiasm about the book and its limited originality. Not that I found significant errors there (even though “improper priors [do not always] yield improper posteriors” [p.145], as we demonstrated in several papers); rather, I had trouble with the uneven pace adopted by the authors, who often skim over some topics of importance while spending an inordinate amount of space on less relevant ones. Some items get many bibliographical references, while others do not. For instance, EM receives the lion's share (see, e.g., Sections 6.6 and 6.7). Or the 12 pages of proof in Chapter 10. The declination of sections into mixtures, mixtures of regressions, multivariate mixtures, hidden Markov models, and so on feels somewhat repetitive. This is particularly the case for the “mixture regression models” chapter.

The book also contains Bayesian entries, with a first introduction (p.105) in the discrete data chapter that precedes the short Bayesian chapter #4 (p.145), the same issue arising for related algorithms like Gibbs sampling (p.107), said to “estimate properties of the joint posterior”, and MCMC (p.112). Which sort of erases the specificity of a Bayesian approach by reducing it to one item in the toolbox (with the wrong stress on MAP estimates). In this Bayesian chapter, MCMC validation is only established for discrete state spaces, while the algorithms are applied in general spaces. The focus is mostly on relabelling, in preparation for the following label-switching chapter, although a large collection of methods is mentioned, if not compared.

Handling an unknown number of components by hypothesis testing is supported in the next short chapter, although very little is said about reversible jump MCMC. And there is no general discussion of the consistency of these tests, in particular when based on the bootstrap, or at least of the regularity conditions they require. A puzzling paradox (p.191) is the existence of an unbounded Fisher information for the exponential mixture

\pi\,\mathcal{E}xp(1)+(1-\pi)\,\mathcal{E}xp(2)

when the weight π is the parameter (and close to 1).
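
A quick numerical check of this claim, as a sketch only: it assumes Exp(k) stands for the exponential distribution with mean k, so that the second component has the heavier tail, which is what makes π close to 1 the problematic direction.

```python
# Sketch: the Fisher information about the weight pi blows up as pi -> 1,
# assuming Exp(k) denotes the exponential distribution with mean k.
import numpy as np
from scipy.integrate import quad

def fisher_info(pi):
    """Fisher information about pi in the mixture pi*Exp(1) + (1-pi)*Exp(2)."""
    f1 = lambda x: np.exp(-x)               # Exp(1) density, mean 1
    f2 = lambda x: 0.5 * np.exp(-x / 2.0)   # Exp(2) density, mean 2
    def integrand(x):
        f = pi * f1(x) + (1.0 - pi) * f2(x)
        return (f1(x) - f2(x)) ** 2 / f     # (df/dpi)^2 / f
    val, _ = quad(integrand, 0.0, 80.0)     # the tail beyond 80 is negligible
    return val

for pi in (0.5, 0.9, 0.99, 0.999, 0.9999):
    print(f"pi = {pi:6.4f}   I(pi) = {fisher_info(pi):8.3f}")
# I(pi) keeps increasing (logarithmically) as pi -> 1, and is infinite at pi = 1.
```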

High-dimensional mixtures in Chapter 8 are mostly handled by linear projections onto smaller subspaces, which is natural given that such projections preserve the mixture structure, but this opens a Pandora's box of proposed methods, again with little comparison available. Except in the final R section, which contrasts several R functions on the same dataset (if inconclusively).
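
As a minimal sketch of the projection idea, with PCA and a Gaussian mixture fit standing in for the many specific methods reviewed in the chapter (none of which is reproduced here):

```python
# Toy sketch: a high-dimensional mixture remains a mixture after a linear
# projection, so it can be estimated in a low-dimensional subspace.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
n, d = 500, 50
# two well-separated Gaussian components in 50 dimensions
labels = rng.integers(0, 2, size=n)
means = np.vstack([np.zeros(d), np.r_[3.0, np.zeros(d - 1)]])
X = means[labels] + rng.normal(size=(n, d))

# project onto 2 directions, then fit the mixture in the subspace
Z = PCA(n_components=2).fit_transform(X)
gm = GaussianMixture(n_components=2, n_init=5, random_state=0).fit(Z)
print("estimated weights:", gm.weights_.round(2))
```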

The semi-parametric chapters mention Dirichlet process priors, albeit briefly, but fail to relate to recent works on using these to infer the number of components (or on their failure to do so). A connection with machine learning is also pointed out, but little can be gathered from the three-page presentation (pp.308-310). These chapters also have significant overlap with the review paper of Xiang et al. (2019) in Statistical Science.

Most chapters end with an R section, which usually reads as a quick demo of a related R package, like BayesLCA or our own mixtools. Hence not massively helpful beyond pointers to these packages. The numerical illustrations are also unevenly distributed between chapters, from nothing at all to four pages of small-font tables on an MSE comparison between more or less robust approaches undertaken by Yu et al. (2020).

The above thus explains why I am not particularly excited about this bibliographical addition to the analysis of mixtures. It does offer a reference for researchers in the field by adding recent references and approaches to the existing books mentioned above, but I could not recommend it as a textbook (as suggested on p.xiii).

[Disclaimer about potential self-plagiarism: this post or an edited version may eventually appear in my Books Review section in CHANCE.]

on the edge and online!

Posted in Books, Statistics, Travel, University life on March 11, 2024 by xi'an

safe Bayes & e-values & least favourable priors

Posted in Books, pictures, Statistics, University life on March 2, 2024 by xi'an

The paper by Peter Grünwald, Rianne de Heide and Wouter Koolen on safe testing was read before the Royal Statistical Society at a meeting organized by the Research Section on Wednesday, 24th January, 2024, after many years in the making, to the point that several papers based on this initial one have appeared in the meantime, including some submissions to Biometrika. Like this one in the current issue of Statistical Science dedicated to reproducibility and replicability. Joshua Bon and I wrote a discussion that synthesised the following (and sometimes rambling) remarks.

Overall, this is a mind-challenging paper with definitely original style and contents for which the authors are to be congratulated!

“…p-values are interpreted as indicating amounts of evidence against the null, and their definition does not need to refer to any specific alternative H¹. Exactly the same holds for e-values: the basic interpretation ‘a large e-value provides evidence against H’ holds no matter how the e-variable is defined, as long as it satisfies (1). If they are defined relative to H¹ that is close to the actual process generating the data they will grow fast and provide a lot of evidence, but the basic interpretation holds regardless.”

About the entry section, one may ask why a Bayesian would want to test the veracity of a null hypothesis. The debate has been raging since the early days, although Jeffreys spent two chapters of his book on the topic of testing. (While Jeffreys appears in Example 5, p.14, for his point estimation prior.) From an opposite viewpoint, the construction of e-values and the like in the paper is highly model dependent, but all models are wrong! And, more to the point, both hypotheses may turn out to be wrong in misspecified cases. The notion thus seems, on the contrary, to be very much M-closed, with no idea of what happens under misspecified models or why rejecting H⁰ is the ultimate argument.

When introducing e-values, (1) is not a definition per se, since otherwise E≡1 would be an e-value. This is unfortunate as the topic is already confusing enough. E[E] must be larger than 1 under H¹, otherwise the product of e-values would always degenerate to zero (?)
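
For readers meeting e-values for the first time, here is a minimal sketch of the canonical likelihood-ratio example; it assumes the standard requirement that E ≥ 0 with expectation at most 1 under H⁰, since the paper's (1) is not reproduced here.

```python
# Sketch: with x ~ N(0,1) under H0 and x ~ N(1,1) under H1, the likelihood
# ratio E = p1(x)/p0(x) is an e-variable, with expectation exactly 1 under H0.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def e_value(x, mu1=1.0):
    return norm.pdf(x, loc=mu1) / norm.pdf(x, loc=0.0)

# Running products over a sequence of observations: under H0 the product is a
# nonnegative martingale, so P(it ever exceeds 1/alpha) <= alpha (Ville's
# inequality); under H1 it grows exponentially, accumulating evidence.
n = 200
x_h0 = rng.normal(0.0, 1.0, size=n)
x_h1 = rng.normal(1.0, 1.0, size=n)
print("log-product under H0:", np.sum(np.log(e_value(x_h0))))
print("log-product under H1:", np.sum(np.log(e_value(x_h1))))
```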

The points

  1. behaviour under optional continuation [by a martingale reasoning]
  2. interpretation as ‘evidence against the null’ as gambling [unethical!]
  3. in all cases preserving frequentist Type I error guarantees
  4. e-variables turn out to be Bayes factors based on the right Haar prior [rather than sometimes with highly unusual (e.g. degenerate) priors? p.4]
  5. e-variables need more extreme data than p-values in order to reject the null

are rather worthwhile, even though 2. is vague and 3. is firmly frequentist. Any theory involving Haar priors (and even better amenability) cannot be all wrong, though, even considering that Haar priors are improper. The optional continuation in 1. is a nice argument from a Bayesian viewpoint since it has also been used to defend the Bayesian approach. Point 4. brings a formal way to define least favourable priors in the testing sense. One may then wonder at the connection with the solution of Bayarri and Garcia-Donato (Biometrika, 2007). The perspective adopted therein is somehow the inverse of the more common stance where the prior on H⁰ is the starting point [and obviously known]. So, is there any dual version of e-values where this would happen, i.e., leading to deriving the optimal prior on H¹ for a given prior on H⁰? (Which would further offer a maximin interpretation.) Theorem 1 indeed sounds like the minimax=maximin result for test settings. (In Corollary 2, why is (10) necessarily a Bayes factor, given the two models?)

While I first thought that the approach leads to finding a proper prior, the “Almost Bayesian Case” [p.17] (ABC!!) comes to justify the use of a “common” improper prior over nuisance parameters under both hypotheses, which, while more justifiable than in the original objective Bayes literature, remains unsatisfactory to me. But I like the notion in 2.2 [p.10] that a prior chosen on H¹ forces one to adopt a particular corresponding prior on H⁰, as it defines a form of automated projection that we also considered in Goutis [RIP] and Robert (Biometrika, 1998). Corollary 2 is most interesting as well. However, taking the toy example of H⁰ being a normal mean restricted to (-a,a) seems to lead to the optimal prior on H⁰ being a point mass at ±a for any marginal m(y) centred at zero. Which is a disappointing outcome when compared with the point mass situation. It is another disappointment that the Bayes factor cannot be an e-value since (6) fails to hold, but (1) is not (6) and one could argue that the BF is an e-value when integrating under the marginals!

As a marginalia, the paper made me learn about the term (and theme) tragedy of the commons, a concept developed by [the neo-Malthusian and eugenicist] Garrett Hardin.

In conclusion, we congratulate the authors on this endeavour but it remains unclear to us (as Bayesians) (i) how to construct the least favourable prior on H⁰ on a general basis, especially from a computational viewpoint, and, more importantly, (ii) whether it is at all of inferential interest [i.e., whether it degenerates into a point mass]. With respect to the sequential directions of the paper, we also wonder at the potential connections with sequential Monte Carlo, for instance towards conducting sequential model choice by efficiently constructing an amalgamated evidence value when the product of Bayes factors is not a Bayes factor (see Buchholz et al., 2023).

futuristic statistical science [editorial]

Posted in Books, Kids, Statistics, University life on January 13, 2024 by xi'an

This special issue of Statistical Science is devoted to the future of Bayesian computational statistics, from several perspectives. It involves a large group of researchers who contributed to collective articles, bringing their own perspectives and research interests into these surveys. Somewhat paradoxically, it starts with the past, and a conference on a Gold Coast beach. Martin, Frazier, and Robert first submitted a survey on the history of Bayesian computation, written after Gael Martin delivered a plenary lecture at Bayes on the Beach, a conference held in November 2017 in Surfers Paradise, Gold Coast, Queensland, and organised by the Bayesian Research and Applications Group (BRAG), the Bayesian research group headed by Kerrie Mengersen at the Queensland University of Technology (QUT). Following a first round of reviews, this paper was split into two separate articles, Computing Bayes: From Then ‘Til Now, retracing some of the history of Bayesian computation, and Approximating Bayes in the 21st Century, which is both a survey of and a prospective look at the directions and trends of approximate Bayesian approaches (and not solely ABC). At this point, Sonia Petrone, editor of Statistical Science, suggested that we prepare a special issue on trends of interest and promise for Bayesian computational statistics. Joining forces, after some delays and failures to convince others to engage, or to produce multilevel papers with distinct vignettes, we eventually put together an additional four papers, where lead authors gathered further authors to produce this diverse picture of some incoming advances in the field. We have deliberately avoided topics for which excellent recent reviews exist, such as Stein's method, sequential Monte Carlo, and piecewise deterministic Markov processes, and topics which are still in their infancy, such as the relationship of Bayesian approaches to large language models (LLMs) and foundation models.

Within this issue, Past, Present, and Future of Software for Bayesian Inference by Erik Štrumbelj & al. covers the state of the art in the most popular Bayesian software, reminding us of the massive impact BUGS has had on the adoption of Bayesian tools since its introduction in the early 1990s (which I remember discovering at the Fourth Valencia meeting on Bayesian statistics in April 1991). With an interesting distinction between first and second generations, and a light foray into a potential third generation, maybe missing the role of LLMs in coding, which are already impacting the approach to computing, and the less immediate revolution promised by quantum computing. Winter & al.'s The Future of Bayesian Computation [TITLE TO CHANGE] makes a link with machine learning techniques, without looking at the scariest issue of how Bayesian inference can survive in a machine learning world! While it offers a further foray into the blurry division between proper sampling (à la MCMC) and approximations, complementing the historical Martin et al. (2024), it articulates these aspects within a (deep) machine learning perspective, emphasizing the role of summaries produced by generative models exploiting the power of neural network computation/optimization. And the pivotal reliance on variational Bayes, which is the most active common denominator with machine learning. With further entries on major issues like distributed computing, opening onto the important aspects of data protection and guaranteed privacy. We particularly like the clinical presentation of this paper, with its attention to automation and limitations. Normalizing flows actually link this paper with Heng, De Bortoli and Doucet's coverage of the Schrödinger bridge, a more focussed account of recent advances on what is possibly the next generation of posterior samplers. The final paper, Bayesian experimental design by Rainforth & al., provides a most convincing application of the methods exposed in the earlier papers, in that the field of Bayesian design has hugely benefited from the advent of such tools to become a prevalent way of designing statistical experiments in real settings.

We feel the future of Bayesian computing is bright! The Monte Carlo revolution of the 1990s continues to be a huge influence on today’s work, and now is complemented by an exciting range of new directions informed by modern machine learning.

Dennis Prangle and Christian P Robert

transport, diffusions, and sampling

Posted in pictures, Statistics, Travel, University life on November 19, 2022 by xi'an

At the Sampling, Transport, and Diffusions workshop at the Flatiron Institute, on Day #2, Marilou Gabrié (École Polytechnique) gave the second introductory lecture, on merging MCMC sampling and normalising flows trained towards the target distribution by a divergence criterion like KL, which only requires the shape of the target density. I first wondered about ergodicity guarantees in simultaneous MCMC and map training, due to the adaptation of the flow, but the update of the map only depends on the current particle cloud in (8). From an MCMC perspective, it sounds somewhat paradoxical to see the independence sampler making such an unexpected come-back when considering that no insider information is available about the (complex) posterior to drive the [what-you-get-is-what-you-see] construction of the transport map. However, the proposed approach superposes local (random-walk-like) and global (transport) proposals in Algorithm 1.
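
A toy sketch of this local/global alternation, with a fitted Gaussian standing in for the trained flow, so only in the spirit of Algorithm 1 rather than the paper's actual scheme:

```python
# Sketch: alternate a local random-walk Metropolis move with a global
# independence proposal fitted to the current particle cloud (a Gaussian
# stands in here for the normalising flow).
import numpy as np

rng = np.random.default_rng(0)
log_target = lambda x: -0.5 * ((x - 3.0) ** 2)   # toy target, shape of N(3,1)

def rw_step(x, scale=0.5):
    y = x + scale * rng.normal()
    return y if np.log(rng.uniform()) < log_target(y) - log_target(x) else x

def indep_step(x, mu, sd):
    # independence MH with the fitted "global" proposal q = N(mu, sd^2)
    log_q = lambda z: -0.5 * ((z - mu) / sd) ** 2 - np.log(sd)
    y = mu + sd * rng.normal()
    log_ratio = log_target(y) - log_target(x) + log_q(x) - log_q(y)
    return y if np.log(rng.uniform()) < log_ratio else x

cloud = rng.normal(0.0, 1.0, size=256)                     # particle cloud
for it in range(200):
    cloud = np.array([rw_step(x) for x in cloud])          # local moves
    mu, sd = cloud.mean(), cloud.std() + 1e-6              # "train" the map
    cloud = np.array([indep_step(x, mu, sd) for x in cloud])  # global moves
print("cloud mean ~", cloud.mean(), " sd ~", cloud.std())
```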

Qiang Liu followed on learning transport maps, with the interesting notion of causalizing a graph by removing intersections (which are impossible for an ODE, as discussed in Eric Vanden-Eijnden's talk yesterday) through coupling. Which underlies his notion of rectified flows. Possibly connecting with the next lightning talk by Jonathan Weare on spurious modes created by a variational Monte Carlo sampler and the use of stochastic gradients, corrected by (case-dependent?) regularisation.

Then came a whole series of MCMC talks!

Sam Livingstone spoke on Barker's proposal (an incoming Biometrika paper!) as part of a general class of transforms g of the MH ratio, using jump processes based on a nasty normalising constant related to g (tractable for the original Barker algorithm). I then realised I had missed his StatSci paper on how to speak to statistical physics researchers!

Charles Margossian spoke about using a massive number of short parallel runs (the many-short-chain regime) from a recent paper written with Aki, Andrew, and Lionel Riou-Durand (Warwick) among others. Which brings us back to the challenge of producing convergence diagnostics, precisely the Gelman-Rubin R statistic or its recent nested-R avatar (with its linear limitations and dependence on parameterisation, as opposed to fuller distributional criteria). The core of the approach is in using blocks of GPUs to improve and speed up the estimation of the between-chain variance. (D for R².) I still wonder at the waste of simulations / computing power resulting from stopping the runs almost immediately after warm-up is over, since reaching the stationary regime or an approximation thereof should be exploited more efficiently. (Starting from a minimal-discrepancy sample would also improve efficiency.)
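
For reference, a minimal sketch of the classic Gelman-Rubin R computed across many short chains; this is not the nested variant discussed in the talk, only an illustration of the between/within-chain variance trade-off it builds on.

```python
# Sketch: classic R-hat for one scalar parameter from many short chains.
import numpy as np

def r_hat(chains):
    """chains: array of shape (n_chains, n_draws)."""
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    between = n * chain_means.var(ddof=1)           # B
    within = chains.var(axis=1, ddof=1).mean()      # W
    var_plus = (n - 1) / n * within + between / n   # pooled variance estimate
    return np.sqrt(var_plus / within)

rng = np.random.default_rng(1)
# 1024 short chains of 50 draws each, e.g. run in parallel on a GPU
chains = rng.normal(loc=rng.normal(0, 0.1, size=(1024, 1)), scale=1.0,
                    size=(1024, 50))
print("R-hat:", r_hat(chains))
```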

Lu Zhang also talked on the issue of cutting down warmup, presenting a paper co-authored with Bob, Andrew, and Aki, recommending Laplace / variational approximations for reaching high-posterior-density regions faster, using an algorithm called Pathfinder that relies on ELBO checks to counter the poor performance of Laplace approximations. In the spirit of the workshop, it could be profitable to further transform / push-forward the outcome by a transport map.

Yuling Yao (of stacking and Pareto smoothing fame!) gave an original and challenging (in a positive sense) talk on the many ways of bridging densities [linked with the remark he shared with me the day before] and their statistical significance. Questioning our usual reliance on arithmetic or geometric mixtures. Ignoring computational issues, selecting a bridging pattern sounds no different from choosing a parameterised family of embedding distributions. This new typology of models can then be endowed with properties that are more or less appealing. (Occurrences of the Hyvärinen score and of our mixtestin perspective in the talk!)
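
(For reference, the two standard bridging schemes in question are the geometric path p_t(x) \propto p_0(x)^{1-t}\, p_1(x)^{t} and the arithmetic path p_t(x) = (1-t)\, p_0(x) + t\, p_1(x), for t \in [0,1]; these are the textbook definitions rather than anything specific to the talk.)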

Miranda Holmes-Cerfon talked about MCMC on stratifications (illustrated by this beautiful picture of nanoparticle random walks), which means sampling under varying constraints and dimensions, with associated densities under the respective Hausdorff measures. This sounds like a perfect setting for reversible jump and in a sense it is, as mentioned in the talk. Except that the moves between manifolds are driven by the proximity to said manifolds, helping with a higher acceptance rate, and making the proposals easier to construct since projections (or their reverses) have a physical meaning. (But I could not tell from the talk why the approach seemingly escapes the symmetry constraint set by Peter Green's RJMCMC on the reciprocal moves between two given manifolds.)