Archive for NUTS

comments from Bob

Posted in Books, pictures, Statistics, University life on April 10, 2026 by xi'an

Bob replied to my short post with further items of information that I find worth sharing:

Thanks for the kind post, Christian. It’s amusing to be the subject of one of these posts given how many of them I’ve read about other people. I really appreciate your summaries. And thanks to everyone in the audience for all the great feedback during and after the talk. Here’s a link to my slides.

One of your students or postdocs mentioned an approach that does continuous adaptation on some kind of polynomial schedule that is provably correct, but I didn’t manage to write down the author/reference or the name of the person who recommended it. If you happen to know what that is, I’d be grateful for the reference.

I would also like to follow up on the Robert & Andrieu paper you mention, but I could not find the exact reference on your Google Scholar page. The closest match I can find is:

Controlled MCMC for optimal sampling. 2001. C Andrieu, CP Robert. INSEE.

Section 1.3 is titled “Criteria for local adaptation.” The section cites two things. The first is Haario et al.’s (1999) sliding-window approach, for which HMC moves too fast to be useful locally. The second is the multiple-try approach of Liu et al. (2000) and the delayed rejection approach of Tierney and Mira (1999). We applied delayed rejection to HMC step size adaptation in a couple of papers before developing GIST (Modi, Barnett and Carpenter in Bayesian Analysis; Turok, Modi, and Carpenter in AISTATS); these mirror our second and third GIST papers in doing the step size adaptation for a whole trajectory and at each leapfrog step, respectively. The GIST approach is easier to understand, easier to describe mathematically, easier to implement, and more efficient.
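To make the recipe concrete: a GIST iteration refreshes the momentum, Gibbs-samples the tuning parameter from a conditional given the current state, runs the leapfrog proposal with it, and accepts with the usual Hamiltonian ratio multiplied by the ratio of that conditional at the proposed and current states. Here is a minimal Python sketch for step-size tuning; the log-normal conditional centred on a gradient-norm heuristic is just a placeholder, not the tuning distribution used in any of the GIST papers.

    import numpy as np

    def leapfrog(theta, rho, eps, n_steps, grad_U):
        # standard leapfrog integrator for H(theta, rho) = U(theta) + 0.5 * rho @ rho
        rho = rho - 0.5 * eps * grad_U(theta)
        for _ in range(n_steps - 1):
            theta = theta + eps * rho
            rho = rho - eps * grad_U(theta)
        theta = theta + eps * rho
        rho = rho - 0.5 * eps * grad_U(theta)
        return theta, rho

    def log_cond_eps(eps, theta, grad_U, scale=0.5):
        # placeholder conditional p(eps | state): log-normal centred on a crude
        # gradient-norm heuristic; terms constant in theta cancel in the ratio
        mu = -0.5 * np.log1p(np.sum(grad_U(theta) ** 2))
        return -0.5 * ((np.log(eps) - mu) / scale) ** 2

    def gist_step(theta, U, grad_U, n_steps=20, scale=0.5):
        rho = np.random.standard_normal(theta.shape)
        # Gibbs step: draw the step size from its conditional given the current state
        mu = -0.5 * np.log1p(np.sum(grad_U(theta) ** 2))
        eps = np.exp(mu + scale * np.random.standard_normal())
        theta_p, rho_p = leapfrog(theta, rho, eps, n_steps, grad_U)
        rho_p = -rho_p  # momentum flip makes the proposal an involution for fixed eps
        # GIST acceptance: Hamiltonian ratio times the step-size conditional at each end
        log_a = (-U(theta_p) - 0.5 * rho_p @ rho_p + log_cond_eps(eps, theta_p, grad_U, scale)) \
              - (-U(theta) - 0.5 * rho @ rho + log_cond_eps(eps, theta, grad_U, scale))
        return theta_p if np.log(np.random.uniform()) < log_a else theta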

The nice part about GIST compared to Riemannian HMC is that we do not need to do any volume adjustments (which are cubic and must be autodiffed through), and we do not need an implicit integrator, which is incredibly fussy to tune. The tradeoff is that we require reversibility of the adaptation, which I think is going to be tricky with varying curvature. Of course, we can’t afford to compute Hessian matrices in high dimensions, but we could manage Hessian-vector products if we could figure out how to use just those, and we could also manage low-rank-plus-diagonal approximations or sketches, as described in the Nutpie paper.
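For what it’s worth, a Hessian-vector product never requires forming the Hessian: a central finite difference of the gradient along the vector gives one at the cost of two gradient evaluations (forward-over-reverse autodiff does it exactly at similar cost). A small numpy sketch, with grad_U standing in for whatever gradient the sampler already computes:

    import numpy as np

    def hessian_vector_product(grad_U, theta, v, h=1e-5):
        # Hv ~ (grad_U(theta + h v) - grad_U(theta - h v)) / (2 h):
        # two gradient calls, never a d-by-d Hessian
        return (grad_U(theta + h * v) - grad_U(theta - h * v)) / (2.0 * h)

    # example: quadratic potential U(theta) = 0.5 * theta^T A theta, whose Hessian is A
    A = np.array([[3.0, 1.0], [1.0, 2.0]])
    grad_U = lambda theta: A @ theta
    print(hessian_vector_product(grad_U, np.zeros(2), np.array([1.0, 0.0])))  # ~ A[:, 0]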

We’ve arXived the Nutpie paper since the talk:

Preconditioning HMC by minimizing Fisher divergence. arXiv. 2026. Seyboldt, Carlsen, and Carpenter.

The WALNUTS paper has been accepted by JMLR, but currently only the arXiv version is available:

The within-orbit adaptive leapfrog no-U-turn sampler. Nawaf Bou-Rabee, Bob Carpenter, Tore Selland Kleppe, Sifan Liu. arXiv 2025; to appear in JMLR, 2026.

Working with Nawaf and Tore has made all the difference in the world on this—it’s not something I could have done by myself. Sifan’s the one who came up with the nice characterization of NUTS and Nawaf’s done a number of additional things like providing mixing time bounds for NUTS (with Milo Marsden, who’s sadly no longer with us—he’s gone into finance).

Furthermore, you can adjust the U-turn criterion from 180 degrees to whatever you want in order to control how much of a full orbit you get. Those settings tend to be even more wasteful of iterations, though; this is what the plot of the expected integration time of NUTS was supposed to show, but it was confusing in the talk.
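For concreteness, one way (among others, and not the exact criterion used in the papers) to expose that knob is to compare the angle between the end momenta and the end-to-end displacement against a threshold, instead of only checking the sign of the dot product as standard NUTS does:

    import numpy as np

    def angle_between(u, v):
        c = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
        return np.degrees(np.arccos(np.clip(c, -1.0, 1.0)))

    def generalized_u_turn(theta_minus, theta_plus, rho_minus, rho_plus, max_angle=90.0):
        # standard NUTS stops when either end's momentum makes more than 90 degrees
        # with the end-to-end displacement (dot product turns negative), which on a
        # Gaussian target corresponds to sweeping roughly half (180 degrees) of an
        # orbit; lowering max_angle stops earlier, raising it sweeps a larger fraction
        d = theta_plus - theta_minus
        return (angle_between(rho_plus, d) > max_angle or
                angle_between(rho_minus, d) > max_angle)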

The approach you took with Changye Wu of randomizing the number of leapfrog steps made a deep impression on me. It is also wasteful in leapfrog steps, because any number of steps greater than about 1/4 of an orbit wastes computation, while any number less than that leads to more diffusive sampling. You can see that it is roughly as gradient-efficient as NUTS in a 1000-dimensional standard normal. Interestingly, it’s worse than NUTS for parameter estimates and better for squared parameter estimates, which is overall a win. Nawaf has also published on randomized HMC. I think we could turn the NUTS U-turn criterion down below 180 degrees to get something similar with NUTS, but I haven’t tried it.

One important property of your randomized approach is that it is much, much easier to code efficiently for GPUs than NUTS, because the conditionals in NUTS are hard to execute in SIMD fashion. There’s a very nice introduction to this problem by Sountsov, Carroll, and Hoffman in their paper “Running Markov Chain Monte Carlo on Modern Hardware and Software,” which is out on arXiv and also going into the next edition of the Handbook of MCMC. The thing to read about how to code NUTS on GPU is Dance, Glaser, Orbanz, and Adams’s paper, “Efficiently Vectorized MCMC on Modern Accelerators,” which is on arXiv and appeared at ICML 2025.

You can also randomize the step size to vary the integration time and avoid harmonics, e.g.,

Randomized Hamiltonian Monte Carlo. 2017. Bou-Rabee and Sanz-Serna. Annals of Applied Probability.
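As a toy illustration of the randomization idea, here is a sketch that jitters the step size and draws the total integration time from an exponential distribution, roughly in the spirit of randomized HMC; the particular constants and the uniform jitter are arbitrary placeholders, not the scheme of the paper above.

    import numpy as np

    def randomized_schedule(eps0, mean_time, rng=None):
        # jitter the step size to break harmonics, and draw the total integration
        # time from an exponential so that eps * n_steps varies per iteration
        rng = np.random.default_rng() if rng is None else rng
        eps = eps0 * rng.uniform(0.9, 1.1)
        T = rng.exponential(mean_time)
        n_steps = max(1, int(round(T / eps)))
        return eps, n_steps

    eps, n_steps = randomized_schedule(eps0=0.1, mean_time=1.5)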

Bob’s talk at PariSanté

Posted in Books, Statistics, University life on March 25, 2026 by xi'an

We had a wonderful time (and an unusually large audience) at the mostly Monte Carlo seminar last week, as Pierre del Moral and Bob Carpenter both presented exciting recent developments of theirs! Pierre talked about Kantorovich contraction of Markov semigroups, which sounds rather daunting (!) but actually covers fairly general and generic convergence results, using tools like potentials and Lyapunov contractions, reminding me of the early days of MCMC and the papers of Gareth Roberts (University of Warwick), Jeff Rosenthal, Richard Tweedie and others.

Bob then spoke about the latest version of NUTS, the within-orbit adaptive NUTS (WALNUTS) sampler, which adapts the step size at every leapfrog step in order to conserve the Hamiltonian and keep the path stable enough. The adaptation is facilitated by incorporating this step size as an extra parameter with an attached conditional distribution, an approach the authors call Gibbs self-tuning (GIST): tuning parameters are coupled with the chain and conditionally Gibbs-sampled at each iteration of Hamiltonian Monte Carlo. This has been done in the past, including in some of my papers (e.g., Andrieu & Robert, 2004), but I could not cite a particular reference during the seminar.

Further light reflections that came to mind during Bob’s talk:

  • with NUTS, if cycling is feasible in a finite time, we could wait for a second passage at the starting point and then get back halfway (with the difficulty of detecting this second passage)
  • changing the kinetic matrix at each leapfrog jump is actually Riemannian HMC (and with cubic cost!)
  • the doubling mechanism in both the original NUTS and in biased progressive NUTS is simulation wasting
  • but so is (surprise, surprise!) finding adaptive mass matrices for WALNUTS at reasonable costs

mostly Monte Carlo [13/03]

Posted in Statistics, Travel, University life on March 10, 2026 by xi'an

A new episode of our mostly Monte Carlo seminar, very soon coming near you (if in Paris):

On Friday 13/03/26, from 3-5pm at PariSanté Campus

15h00: Pierre Del Moral (INRIA, Bordeaux)

On the Kantorovich contraction of Markov semigroup

We present a novel operator theoretic framework to study the contraction properties of Markov semigroups with respect to a general class of Kantorovich semi-distances, which notably includes Wasserstein distances. This rather simple contraction cost framework combines standard Lyapunov techniques with local contraction conditions. Our results can be applied to both discrete time and continuous time Markov semigroups, and we illustrate their wide applicability in the context of (i) Markov transitions on models with boundary states, including bounded domains with entrance boundaries, (ii) operator products of a Markov kernel and its adjoint, including two-block-type Gibbs samplers, (iii) iterated random functions and (iv) diffusion models, including overdamped Langevin diffusions with convex-at-infinity potentials.
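In rough schematic terms (my notation, not the abstract's), the two ingredients being combined are a Lyapunov drift condition and a local contraction in a Kantorovich semi-distance, which together yield geometric contraction of the semigroup:

    % Lyapunov drift of the kernel P towards the sub-level sets of V
    P V(x) \le \gamma\, V(x) + K, \qquad 0 < \gamma < 1, \; K < \infty,
    % local contraction in a Kantorovich semi-distance W on a sub-level set C of V
    \mathcal{W}(\delta_x P, \delta_y P) \le (1 - \lambda)\, \mathcal{W}(\delta_x, \delta_y), \qquad x, y \in C,
    % together: geometric contraction of the semigroup in a V-weighted semi-distance
    \mathcal{W}_V(\mu P^n, \nu P^n) \le C_0\, \rho^n\, \mathcal{W}_V(\mu, \nu), \qquad \rho < 1.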

16h00: Bob Carpenter (Flatiron Institute, New York)

GIST, WALNUTS, and Continuous Nutpie: mass-matrix and step-size adaptation for Hamiltonian Monte Carlo

I will introduce Gibbs self tuning (GIST), our new technique for coupling tuning parameters and conditionally Gibbs-sampling them per iteration in Hamiltonian Monte Carlo. Then I will turn to the within-orbit adaptive NUTS (WALNUTS) sampler, which adapts the step size at every leapfrog step in order to conserve the Hamiltonian. Empirical evaluations on varying multi-scale target distributions, including Neal’s funnel and the Stock-Watson stochastic volatility time-series model, demonstrate that WALNUTS achieves substantial improvements in sampling efficiency and robustness. I will review the Nutpie mass-matrix adaptation scheme, which is designed to minimize Fisher divergence by estimating the mass matrix as the geometric midpoint (aka barycenter) between the inverse covariance of the draws and the covariance of the scores of the draws. Then I will describe a continuously adapting version that adapts per iteration by continuously discounting the past rather than updating in fixed blocks. I will also show how the Adam optimizer outperforms dual averaging for step-size adaptation. I will conclude by considering a lock-free multi-threading implementation that automatically monitors adaptation and sampling for convergence and stops automatically.
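In the diagonal case, the geometric midpoint mentioned in the abstract has a very concrete form: each mass-matrix entry is the geometric mean of the precision of the draws and the variance of the scores in that coordinate. A minimal numpy sketch of that midpoint (not of the full Nutpie adaptation machinery):

    import numpy as np

    def diag_mass_matrix(draws, scores, jitter=1e-10):
        # draws:  (n, d) array of posterior draws
        # scores: (n, d) array of gradients of the log density at those draws
        var_draws = np.var(draws, axis=0) + jitter    # diagonal estimate of Sigma
        var_scores = np.var(scores, axis=0) + jitter  # diagonal score covariance
        # geometric midpoint between the inverse covariance of the draws
        # (1 / var_draws) and the covariance of the scores (var_scores)
        return np.sqrt(var_scores / var_draws)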

venISBA⁴⁻

Posted in Books, pictures, Statistics, Travel, University life on July 8, 2024 by xi'an

As I was released all of a sudden from the Ospedale Civile di Venezia around noon, I managed to attend the last session of ISBA 2024 (after stopping by my airbnb next to the hospital for an emergency coffee, a shower, a change of clothes, and something more substantial to eat than the contents of IV bags).

My first of these last talks was on coresets, by Trevor Campbell, for reducing sample sizes while keeping the likelihood roughly the same (and making me wonder whether one could get some privacy on the side?). The original algorithm is almost completely blind to the data, but a new subsample-then-optimize version (minimizing the KL distance to the posterior) brings huge improvements (although I missed the practical details of how the algorithm reaches this minimum), namely a KL distance of order O(1), i.e., not growing with the sample size. Then, in the same session, a talk by Akihiko Nakamura on mixing and PDMPs, resulting in a novel bouncy Hamiltonian dynamics, which proves time-reversible and volume-preserving, with no U-turns and with the time spent within a given Hamiltonian level being itself generated w/o rejection. (I am quite sorry to have missed other PDMP talks during the conference, e.g., Paul Fearnhead’s, as well as the last poster session…) And I finally jumped rooms to listen to Sam Power on hybrid slice sampling, with an MCMC extension to avoid simulating from the uniform conditional, reminding me of nested sampling, which also faces this difficulty of sampling from a possibly complex set. This was the end of a wonderful (if shortened by my personal issue) meeting. Next round, see you in Nagoya, Japan (on the Tōkaidō road!).


As a final word about this ISBA 2024 conference at Ca’ Foscari, I want to most warmly thank, on many levels, my friend Roberto Casarin for his investment and dedication in making the event run so efficiently, in an ideal environment for a meeting of this (800+) size that kept to the Aristotelian unities, especially by keeping everyone together on a single site without feeling crowded (and with very few attendees falling into a Venice canal). And many thanks as well to the local organisers (discounting my nominal inclusion in that group!), the Ca’ Foscari staff, and all the students involved in the event!

on control variates

Posted in Books, Kids, Statistics, University life on May 27, 2023 by xi'an

A few months ago, I had to write a thesis evaluation of Rémi Leluc’s PhD, which contained several novel Monte Carlo proposals on control variates and importance sampling techniques. For instance, Leluc et al. (Statistics and Computing, 2021) revisit the concept of control variates from the perspective of control variate selection using the LASSO. This prior selection is relevant since control variates are not necessarily informative about the objective function being integrated, and my experience is that the more variates, the less reliable the improvement. The remarkable feature of the results is in obtaining explicit and non-asymptotic bounds.
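To make the idea concrete, here is a minimal sketch of LASSO-selected control variates on a toy Gaussian integral, with zero-mean Hermite polynomials as candidate controls; the choice of controls and of the penalty is purely illustrative, not that of the thesis.

    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(0)
    x = rng.standard_normal(5000)              # draws from the reference N(0, 1)
    f = np.exp(0.5 * x)                        # integrand whose expectation we want

    # candidate control variates: Hermite polynomials, zero mean under N(0, 1)
    H = np.column_stack([x, x**2 - 1, x**3 - 3*x, x**4 - 6*x**2 + 3])

    lasso = Lasso(alpha=0.01).fit(H, f - f.mean())
    beta = lasso.coef_                         # LASSO zeroes out uninformative controls
    cv_estimate = np.mean(f - H @ beta)        # controls have zero mean, so no bias term

    print(np.mean(f), cv_estimate, np.exp(0.125))  # plain MC, CV estimate, exact value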

The author obtains a concentration inequality on the error resulting from the use of control variates, under strict assumptions on those variates. The associated numerical experiment illustrates the difficulties of practically implementing these principles, due to the number of parameters to calibrate. I found the example of a capture-recapture experiment on birds (the European dipper) particularly interesting, not only because we had used it in our book but also because it highlights the dependence of the estimates on the dominating measure.

Based on a NeurIPS 2022 poster presentation, Chapter 3 is devoted to the use of control variates in sequential Monte Carlo, where a sequence of importance functions is constructed based on previous iterations to improve the approximation of the target distribution. Under relatively strong assumptions, namely that the importance functions dominate the target distribution (which could generally be achieved by using an increasing fraction of the data in a partial posterior distribution) and that the residual of an intractable distribution has sub-Gaussian tails, a concentration inequality is established for the adaptive control variate estimator.

This chapter uses a different family of control variates, based on a Stein operator introduced in Mira et al. (2016). In the case where the target is a mixture in ℝ^d, one of our benchmarks in Cappé et al. (2008), remarkable gains are obtained for relatively high dimensions. While the computational demands of these improvements are not mentioned, the comparison with an MCMC approach (NUTS) based on the same number of particles demonstrates a clear improvement in Bayesian estimation.
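The zero-mean property underlying these Stein-based control variates is easy to check numerically: for a target with score s(x) = ∇ log π(x) and any smooth vector field φ, the function ∇·φ(x) + φ(x)·s(x) has zero expectation under π. A toy numpy check with φ(x) = x and a standard normal target (not the parametrised families used in the thesis):

    import numpy as np

    rng = np.random.default_rng(1)
    d = 3
    x = rng.standard_normal((100_000, d))      # draws from the target, here N(0, I)

    score = -x                                  # s(x) = grad log pi(x) for N(0, I)
    # Stein control variate with phi(x) = x:  div(phi) + phi . s(x) = d - ||x||^2
    stein_cv = d - np.sum(x**2, axis=1)

    print(np.mean(stein_cv))                    # close to 0: a valid zero-mean control variate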

Chapter 4 corresponds to a very recent arXival and presents a very original approach to control variate correction, reproducing the distribution of interest through a nearest-neighbour (leave-one-out) approximation. It requires neither a control function nor, necessarily, additional simulations, except for the evaluation of the integral, which is rather remarkable, forming a kind of parallel with the bootstrap. (Any other approximation of the distribution would also be acceptable if available at the same computational cost.) The thesis aims to establish the convergence of the method when integration is performed by a Voronoi tessellation, which leads to an optimal rate of order n^{-1-2/d} for the quadratic error (under regularity conditions on the integrand). In the alternative where the integral must be evaluated by Monte Carlo, this optimality disappears, unless a massive number of simulations is used. Numerical illustrations cover SDEs and a Bayesian hierarchical model already used in Oates et al. (2017), with massive gains in both cases.