Archive for IMS Monographs

scalable Monte Carlo for Bayesian learning [book review]

Posted in Books, Statistics, University life on September 26, 2025 by xi'an

This book by Paul Fearnhead, Christopher Nemeth, Chris Oates, and Chris Sherlock is part of the IMS Monograph series, published by Cambridge University Press. It covers the most recent developments in MCMC methods, namely stochastic gradient MCMC (Chap. 3), non-reversible MCMC (Chap. 4), continuous-time MCMC (Chap. 5), and assessing and improving MCMC (Chap. 6). I find the book remarkable in its attention to rigour and clarity, without falling into overly technical derivations. It is perfectly suited for a graduate course for students with a solid mathematical background. In short, had I considered a new edition of our Monte Carlo Statistical Methods book to incorporate these advances, I could not have done such a good job!

The first chapter provides a quick refresher of the background, from Monte Carlo principles, to Markov chains, SDEs, and the kernel “trick” (which requires a dozen pages of exposition). Nonetheless, it contains side remarks of true interest, including some suggestions I had not previously seen, for instance an unusual introduction of the HMC algorithm as an underdamped Langevin diffusion. Chapter 2 continues this recap by covering reversible MCMC algorithms and the attached optimal scalings. This is done in a particularly friendly presentation that I intend to use in my own course. The HMC section is probably the best coverage I have seen on the topic, including most naturally the leapfrog steps.
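For readers who have not yet met them, the leapfrog steps at the heart of HMC are easy to sketch; the following is my own generic illustration (not taken from the book), for a target known up to its log-posterior and gradient:

```python
import numpy as np

def leapfrog(theta, p, grad_log_post, eps, n_steps):
    """Leapfrog integration of Hamiltonian dynamics:
    half momentum step, alternating full position/momentum steps,
    final half momentum step."""
    theta, p = theta.copy(), p.copy()
    p += 0.5 * eps * grad_log_post(theta)        # initial half step
    for _ in range(n_steps - 1):
        theta += eps * p                          # full position step
        p += eps * grad_log_post(theta)           # full momentum step
    theta += eps * p
    p += 0.5 * eps * grad_log_post(theta)         # final half step
    return theta, -p                              # negate p: proposal is an involution

def hmc_step(theta, log_post, grad_log_post, eps=0.1, n_steps=20, rng=None):
    """One HMC transition with a Metropolis correction for the
    discretisation error of the leapfrog integrator."""
    rng = rng or np.random.default_rng()
    p0 = rng.standard_normal(theta.shape)
    theta_new, p_new = leapfrog(theta, p0, grad_log_post, eps, n_steps)
    log_accept = (log_post(theta_new) - 0.5 * p_new @ p_new
                  - log_post(theta) + 0.5 * p0 @ p0)
    if np.log(rng.uniform()) < log_accept:
        return theta_new
    return theta
```

The step size eps and number of steps n_steps are the tuning parameters whose optimal scaling the chapter discusses.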

Chapter 3 gets into stochastic gradient MCMC as an approximate MCMC, with nice arguments and formal convergence bounds. Again quite efficiently, if focussing almost solely on Gaussian settings (but including a neural network example). Similarly, Chapter 4 provides intuitive (if informal) arguments on the worth of non-reversible algorithms that are well-suited to a textbook of this level. This chapter also introduces PDMP samplers such as the discrete bouncy particle sampler.
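The simplest instance of stochastic gradient MCMC is stochastic gradient Langevin dynamics (Welling and Teh's update), which can be sketched as follows; this is my own minimal illustration, not code from the book, with all function names assumed for the example:

```python
import numpy as np

def sgld(grad_log_prior, grad_log_lik, data, theta0, eps, n_iters,
         batch_size, rng=None):
    """Stochastic gradient Langevin dynamics: the full-data gradient is
    replaced by an unbiased mini-batch estimate, and Gaussian noise of
    variance eps is injected so the iterates approximately target the
    posterior (no Metropolis correction)."""
    rng = rng or np.random.default_rng()
    n = len(data)
    theta = np.asarray(theta0, dtype=float).copy()
    samples = np.empty((n_iters,) + theta.shape)
    for t in range(n_iters):
        idx = rng.choice(n, size=batch_size, replace=False)
        # unbiased estimate of the log-posterior gradient
        grad = grad_log_prior(theta) + (n / batch_size) * sum(
            grad_log_lik(theta, data[i]) for i in idx)
        theta = theta + 0.5 * eps * grad \
            + np.sqrt(eps) * rng.standard_normal(theta.shape)
        samples[t] = theta
    return samples
```

The chapter's convergence bounds quantify the bias induced by skipping the accept-reject step and by the mini-batch gradient noise.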

Chapter 5 is a (nicely) monstrous coverage of continuous time MCMC samplers that reaches very recent advances on PDMPs. The focus is on expressing them as limits, in order to derive mixing rates without extreme mathematical steps. (The chapter even includes a mention of the coordinate sampler that my PhD student Wu Changye derived in 2018!) Again a chapter I plan to use when teaching MCMC methods, if possibly skipping some of the 66 pages.
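To give a flavour of these continuous-time samplers, here is my own one-dimensional zig-zag sketch for a standard Gaussian target (an illustrative special case, not the book's code): the velocity is ±1, flips occur at events of a Poisson process with rate max(0, v·x(t)), and for the Gaussian potential the event times can be drawn exactly by inverting the integrated rate.

```python
import numpy as np

def zigzag_gaussian(n_samples, dt=0.1, rng=None):
    """1-d zig-zag sampler for N(0,1): deterministic linear motion
    between velocity flips, path recorded on a grid of spacing dt."""
    rng = rng or np.random.default_rng()
    x, v = 0.0, 1.0
    samples = np.empty(n_samples)
    t, t_next_record, recorded = 0.0, dt, 0
    while recorded < n_samples:
        # rate along the path is max(0, a + s) with a = v*x, so the
        # event time solves its integral equal to an Exp(1) draw
        a = v * x
        e = rng.exponential()
        tau = -a + np.sqrt(max(a, 0.0) ** 2 + 2.0 * e)
        # record grid points crossed before the event (linear motion)
        while t + tau >= t_next_record and recorded < n_samples:
            samples[recorded] = x + v * (t_next_record - t)
            recorded += 1
            t_next_record += dt
        x += v * tau
        t += tau
        v = -v            # deterministic velocity flip at the event
    return samples
```

In higher dimensions, or for non-Gaussian targets, the event times are no longer available in closed form and thinning (Poisson superposition) is used instead, which is where much of the chapter's technical work lies.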

Chapter 6 completes the monograph with a presentation of convergence assessment tools and diagnostics, exploiting the kernel trick, as well as convergence bounds that reflect very recent research in that domain. The conclusive section on optimal weights and optimal thinning will presumably be new to most readers. (Making me wonder if a link can be found with our importance Markov chain construct.)

[Disclaimer about potential self-plagiarism as usual: this post or an edited version will eventually appear in my Books Review section in CHANCE.]

Scalable Monte Carlo for Bayesian Learning [not yet a book review]

Posted in Books, Statistics, University life on May 11, 2025 by xi'an

That the likelihood principle does not hold…

Posted in Statistics, University life on October 6, 2011 by xi'an

Coming to Section III in Chapter Seven of Error and Inference, written by Deborah Mayo, I discovered that she considers that the likelihood principle does not hold (at least as a logical consequence of the combination of the sufficiency and conditionality principles), thus that Allan Birnbaum was wrong… as well as the dozens of people working on the likelihood principle after him! Including Jim Berger and Robert Wolpert [whose book sells for $214 on amazon!, I hope the authors get a hefty chunk of that ripper!!! Esp. when it is available for free on Project Euclid…] I had not heard of (nor seen) this argument previously, even though it has apparently created a bit of a stir around the likelihood principle page on Wikipedia. The result does not seem to be published anywhere but in the book, and I doubt it would get past a review process in a statistics journal. [Judging from a serious conversation in Zürich this morning, I may however be wrong!]

The core of Birnbaum’s proof is relatively simple: given two experiments E¹ and E² about the same parameter θ with different sampling distributions f¹ and f², such that there exists a pair of outcomes (y¹,y²) from those experiments with proportional likelihoods, i.e. as a function of θ

f^1(y^1|θ) = c f^2(y^2|θ),

one considers the mixture experiment E⁰ where E¹ and E² are each chosen with probability ½. Then it is possible to build a sufficient statistic T that is equal to the data (j,x), except when j=2 and x=y², in which case T(j,x)=(1,y¹). This statistic is sufficient since the distribution of (j,x) given T(j,x) is either a Dirac mass or a distribution on {(1,y¹),(2,y²)} that only depends on c, and thus does not depend on the parameter θ. According to the weak conditionality principle, statistical evidence, meaning the whole range of inferences possible on θ and denoted by Ev(E,z), should satisfy
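The construction is easy to check numerically. Taking the classic binomial versus negative-binomial pair as the two experiments (my own illustrative choice, not Birnbaum's): 3 successes in 12 binomial trials, versus the 3rd success occurring at trial 12, give proportional likelihoods, and the conditional probability of (j,x)=(1,y¹) given T(j,x)=(1,y¹) only involves the constant c:

```python
from math import comb

def f1(y, theta, n=12):
    """Binomial experiment: probability of y successes in n trials."""
    return comb(n, y) * theta**y * (1 - theta)**(n - y)

def f2(t, theta, r=3):
    """Negative binomial experiment: probability that the r-th success
    occurs at trial t."""
    return comb(t - 1, r - 1) * theta**r * (1 - theta)**(t - r)

y1, y2 = 3, 12   # outcomes with proportional likelihoods: f1 = c f2
for theta in (0.1, 0.3, 0.7):
    c = f1(y1, theta) / f2(y2, theta)
    # P((j,x) = (1,y1) | T(j,x) = (1,y1)) in the mixture experiment
    p = 0.5 * f1(y1, theta) / (0.5 * f1(y1, theta) + 0.5 * f2(y2, theta))
    # c = 4 and p = c/(c+1) = 0.8 for every theta: free of the parameter
    print(f"theta={theta}: c={c:.3f}, P((1,y1)|T)={p:.4f}")
```

Here c = C(12,3)/C(11,2) = 4 whatever θ, so the conditional distribution given T is indeed parameter-free, which is exactly the sufficiency of T.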

Ev(E^0, (j,x)) = Ev(E^j, x)

Because the sufficiency principle states that

Ev(E^0, (j,x)) = Ev(E^0,T(j,x))

this leads to the likelihood principle

Ev(E^1, y^1) = Ev(E^0, (j,y^j)) = Ev(E^2, y^2)

(See, e.g., The Bayesian Choice, pp. 18-29.) Now, Mayo argues this is wrong because

“The inference from the outcome (Ej,yj) computed using the sampling distribution of [the mixed experiment] E⁰ is appropriately identified with an inference from outcome yj based on the sampling distribution of Ej, which is clearly false.” (p.310)

This sounds to me like a direct rejection of the conditionality principle, so I do not understand the point. (A formal rendering in Section 5 using the logic formalism of A’s and Not-A’s reinforces my feeling that the conditionality principle is the one being criticised and misunderstood.) If Mayo’s frequentist stance leads her to take the sampling distribution into account at all times, this is fine within her framework. But I do not see how this argument contributes to invalidating Birnbaum’s proof. The following and last sentence of the argument may shed some light on why Mayo considers it does:

“The sampling distribution to arrive at Ev(E⁰,(j,yj)) would be the convex combination averaged over the two ways that yj could have occurred. This differs from the sampling distributions of both Ev(E1,y1) and Ev(E2,y2).” (p.310)

Indeed, and rather obviously, the sampling distribution of the evidence Ev(E*,z*) will differ depending on the experiment. But this is not what is stated by the likelihood principle, which is that the inference itself should be the same for (E¹,y¹) and (E²,y²), not the distribution of this inference. This confusion between the inference and its assessment is reproduced in the “Explicit Counterexample” section, where p-values are computed and found to differ for various conditional versions of a mixed experiment. Again, not a reason for invalidating the likelihood principle. So, in the end, I remain fully unconvinced by this demonstration that Birnbaum was wrong. (If in a bystander’s agreement with the fact that frequentist inference can be built conditional on ancillary statistics.)