Archive for basic statistics

probably overthinking it [book review]

Posted in Books, Statistics, University life with tags , , , , , , , , , , , , , , , , , , , , on December 13, 2023 by xi'an

Probably overthinking it, written by Allen B. Downey (who wrote a series of books starting with Think, like Think Python, Think Bayes, Think Stats), belongs to this numerous collection of introductory books that aim at making statistics more palatable and enticing to the general public by making the fundamental concepts more intuitive and building upon real life examples. I would thus stop short of calling it “essential guide” as in the first flap of the dust jacket, since there exist many published books with a similar goal, some of which were actually reviews here. Now, there are ideas and examples therein I could borrow for my introductory stats course, except that I will cease teaching it next year! For instance, there are lots of examples related to COVID, which is great to engage (enrage?) the readers.

The book is quite pleasant to read, does not shy from mathematical formulae, and covers notions such as probability distributions, the Simpson, the Preston, the inspection, the Berkson paradoxes, and even some words on causality, sometimes at excessive lengths. (I have always been an adept of the concise church when it comes to textbook examples and fear that the multiplication of illustrations of a given concept may prove counterproductive.) The early chapters are heavily focussed on the Gaussian (or Normal) distribution. Making it appear as essential for conducting statistical analysis. When it does not, as in the ELO example, the explanations of a correction are less convincing.

I appreciated the book approach to model fit via the comparison of empirical cdfs with hypothetical ones. Also of primary interest is the systematic recourse to simulation, aka generative models, albeit without a systematic proper description. In the chapter (Chap 5) about durations, I think there are missed opportunities like the distributions of extremes (p 82) or the forgetfulness property of the Exponential distribution. Instead the focus is slightly diverging towards non-statistical issues on demography by the end of the chapter, with a potential for confusion between the Gomperz law and the Gomperz distribution. The Berkson paradox (Chap 6) is well-explained in terms of non-random populations (and reminded me when, years ago, when we tried to predict the first year success probability of undergrad applicants from their high school maths grade, the regression coefficient estimate ended up negative). Distributions of extremes do appear in Chap 8, if again seeking an ideal generic distribution seems to me rather misguided and misguiding. I would also argue that the author is missing the point of Taleb’s black swans by arguing in favour of a better modelling, when the later argues against the very predictability of extreme events in a non-stationary financial world… The chapter on fairness and fallacy (Chap 9) is actually about false positive/negative rates in different populations hence the ensuing unfairness (or the base fallacy). In that chapter there is no mention of Bayes (reserved for Think Bayes?!), but it is hitting hard enough at anti-vaxers (who will most likely not read the book). And does it again in the Simpson paradox chapter (Chap 10), whose proliferation is further stressed the following chapter on people becoming less racist or sexist or homophobic when they age, despite the proportion of racist/sexist/homophobic responses to a specific survey (GSS/Pew) increasing with age. This is prolonged into the rather minor final chapter.

Now that I have read the book, during a balmy afternoon in St Kilda (after an early start in the train to De Gaulle airport in freezing temperatures), I am a bit uncertain at what to make of it in terms of impact on the general public. For sure, the stories that accumulate chapter after chapter are nice and well argued, while introducing useful statistical concepts, but I do not see readers equipped enough to handle daily statistics with more than an healthy dose of scepticism, which obviously is a first step in the right direction!

Some nitpicking : the book is missing the historical connection to Quetelet’s “average man” when referring to the notion. And a potential explanation for the (approximate) log-Gaussianity of weights of individuals in a population through the fact that it is a volume, hence a third power of a sort.  Although birth weights are roughly Normal which kill my argument. I remain puzzled by the title, possibly missing a cultural reference (as there are tee-shirts sold with this sentence). It is the same as the name of a blog run by the author since 2011 and a fodder for the book. And the cover is terrible, breaking the words to fit the width making no sense, if I am not overthinking it! As often the book is rather US centric, although making no mention of US having much higher infant death rates than countries with similar GDPs when this data is discussed.

[Disclaimer about potential self-plagiarism: this post or an edited version will eventually appear in my Books Review section in CHANCE.]

a common confusion between sample and population moments

Posted in Books, Kids, R, Statistics with tags , , , , , , , on April 29, 2021 by xi'an

p-value graffiti in the lift [jatp]

Posted in Statistics with tags , , , , , , , , on January 3, 2019 by xi'an

hitting a wall

Posted in Books, Kids, R, Statistics, University life with tags , , , , , on July 5, 2018 by xi'an

Once in a while, or a wee bit more frequently (!), it proves impossible to communicate with a contributor of a question on X validated. A recent instance was about simulating from a multivariate kernel density estimate where the kernel terms at x¹,x²,… are Gaussian kernels applied to the inverses of the norms |x-x¹|, |x-x²|,… rather than to the norms as in the usual formulation. The reason for using this type of kernel is unclear, as it certainly does not converge to an estimate of the density of the sample x¹,x²,…  as the sample size grows, since it excludes a neighbourhood of each point in the sample. Since the kernel term tends to a non-zero constant at infinity, the support of the density estimate is restricted to the hypercube [0,1]x…x[0,1], again with unclear motivations. No mention being made of the bandwidth adopted for this kernel. If one takes this exotic density as a given, the question is rather straightforward as the support is compact, the density bounded and a vanilla accept-reject can be implemented. As illustrated by the massive number of comments on that entry, it did not work as the contributor adopted a fairly bellicose attitude about suggestions from moderators on that site and could not see the point in our requests for clarification, despite plotting a version of the kernel that had its maximum [and not its minimum] at x¹… After a few attempts, including writing a complete answer, from which the above graph is taken (based on an initial understanding of the support being for (x-x¹), …), I gave up and deleted all my entries.On that question.

A whistle-stop tour of statistics

Posted in Books, pictures, Statistics with tags , , , , on March 9, 2012 by xi'an

In a consequent package I got from CRC Press last month for CHANCE, there was this short book by Brian S. Everitt, A whistle-stop tour of statistics… Nice cover! The book is like an introductory undergraduate statistics course, except that it is much much terser and shorter, using 200 pages in A5 format with plenty of pictures. (It could also have been called a primer or a guidebook.) The table of contents is as follows

Some basics and describing data
Probability
Estimation
Inference
Analysis of variance models
Linear regression models
Logistic regression and the generalized linear model
Survival analysis
Longitudinal data and their analysis
Multivariate data and their analysis

There is nothing fundamentally wrong with the book, except that… I cannot fathom its purpose! Nor its readership. Again, it is way too short and terse to be used in an undergraduate course or for self-study. And it does not bring a new light on those standard topics when compared with most of introductory statistics books, being mostly traditional (even though it briefly mentions Bayesian inference on pp. 88-89). While the book is itself a summary of statistical methodology, it still finds room for a summary of the current notions at the end of each chapter. So I thus remain completely puzzled by the point in publishing A whistle-stop tour of statistics..!