Showing posts with label conjugate. Show all posts
Showing posts with label conjugate. Show all posts

Kalman filter example visualised with R

At the last Cologne R user meeting Holger Zien gave a great introduction to dynamic linear models (dlm). One special case of a dlm is the Kalman filter, which I will discuss in this post in more detail. I kind of used it earlier when I measured the temperature with my Arduino at home.

Over the last week I came across the wonderful quantitative economic modelling site quant-econ.net, designed and written by Thomas J. Sargent and John Stachurski. The site not only provides access to their lecture notes, including the Kalman fitler, but also code in Python and Julia.

I will take their example of the Kalman filter and go through it with R. I particularly liked their visuals of the various steps of the Kalman filter.

Motivation

Suppose I have a little robot that moves autonomously over my desk. How would the robot know where it is? Well, suppose the robot can calculate from its wheels' movements how far it went and an additional sensor (e.g. GPS-like) provides also information about its location. Of course both pieces of information will be imprecise. How can the various signals be combined?

This is a classic scenario for the Kalman filter. Its key assumptions are that the errors/noise are Gaussian and that the state space evolution \(x_t\) from one time step to the next is linear, so is the mapping to the sensor signals \(y_t\). For the example I will use it reads:
\[\begin{align}
x_{t+1} & = A x_t + w,\quad w \sim N(0,Q)\\
y_t &=G x_t + \nu, \quad \nu \sim N(0,R)\\
x_1 & \sim N(\hat{x}_0, \Sigma_0)
\end{align}
\]With \(A, Q, G, R, \Sigma\) matrices of appropriate dimensions. The Kalman filter provides recursive estimators for \(x_t\):
\[\begin{align}
K_{t} &= A \Sigma_t G' (G \Sigma_t G' + R)^{-1}\\
\hat{x}_{t+1} &= A \hat{x_t} + K_{t} (y_t - G \hat{x}_t) \\
\Sigma_{t+1} &= A \Sigma_t A' - K_{t} G \Sigma_t A' + Q
\end{align}
\]

Start with a prior

Let's say at time \(t_0\) the robot has the expected position \(\hat{x} = (0.2, -0.2)\). That means, the robot doesn't know its location on the table with certainty. Further I shall assume that the probability density function of the position follows a Normal distribution with covariance matrix \(\Sigma = \left[\begin{array}{cc}
0.4 & 0.3\\
0.3 & 0.45
\end{array}
\right]
\)
The prior distribution of the robot's position can be visualised in R with a contour plot.


library(mnormt)
xhat <- c(0.2, -0.2)
Sigma <- matrix(c(0.4, 0.3, 
                  0.3, 0.45), ncol=2)
x1 <- seq(-2, 4,length=151)
x2 <- seq(-4, 2,length=151)
f <- function(x1,x2, mean=xhat, varcov=Sigma) 
  dmnorm(cbind(x1,x2), mean,varcov)
z <- outer(x1,x2, f)
mycols <- topo.colors(100,0.5)
image(x1,x2,z, col=mycols, main="Prior density",
      xlab=expression('x'[1]), ylab=expression('x'[2]))
contour(x1,x2,z, add=TRUE)
points(0.2, -0.2, pch=19)
text(0.1, -0.2, labels = expression(hat(x)), adj = 1)

Sensor information

The 'GPS' sensor provides also data about the location of the robot, again the sensor signal can be regarded as the expected position with some noise. Suppose the reading of the sensor is \(y=(2.3, -1.9)\). I will assume that the sensor signal has an error following a Normal distribution, with a covariance matrix \(R=0.5 \Sigma\).

Conceptually I think about it like this, given that the robot is at position \(x\), what would be the likelihood that the sensor reading is \(y\,|\,x\,\)? The Kalman filter assumes a linear mapping from \(x\) to \(y\):
\[y = G x + \nu \mbox{ with } \nu \sim N(0,R)\]In my case \(G\) will be the identity matrix. Let's add the sensor information to the plot.


R <- 0.5 * Sigma
z2 <- outer(x1,x2, f, mean=c(2.3, -1.9), varcov=R)
image(x1, x2, z2, col=mycols, main="Sensor density")
contour(x1, x2, z2, add=TRUE)
points(2.3, -1.9, pch=19)
text(2.2, -1.9, labels = "y", adj = 1)
contour(x1, x2,z, add=TRUE)
points(0.2, -0.2, pch=19)
text(0.1, -0.2, labels = expression(hat(x)), adj = 1)

Filtering

I have two pieces of information about the location of the robot, both with a Normal distribution, my prior is \( x \sim N(\hat{x},\Sigma)\) and for the distribution of my sensor I have \(y\,|\,x \sim N(Gx, R)\).

What I would like to know is the location of the robot given the sensor reading:
\[p(x \,|\, y) = \frac{p(y \,|\, x) \, p(x)} {p(y)}\]Now, because \(x, y\) are Normal and \(G\) is the identity matrix, I know that the posterior distribution of \(x \,|\, y\) is Normal again, with parameters (see conjugate prior):
\[
\hat{x}_f = (\Sigma^{-1} + R^{-1})^{-1} (\Sigma^{-1} \hat{x} + R^{-1} y) \\
\Sigma_f = (\Sigma^{-1} + R^{-1})^{-1}\]Using the matrix inversion identity \[(A^{-1} + B^{-1})^{-1} = A - A (A + B)^{-1}A = A (A + B)^{-1} B \] I can write the above as:
\[
\begin{align}
\hat{x}_f & = (\Sigma - \Sigma (\Sigma + R)^{-1}\Sigma)(\Sigma^{-1} \hat{x} + R^{-1}y)\\
& =\hat{x} - \Sigma (\Sigma + R)^{-1} \hat{x} + \Sigma R^{-1} y -\Sigma (\Sigma + R)^{-1}\Sigma R^{-1}y\\
& = \hat{x} + \Sigma (\Sigma + R)^{-1} (y - \hat{x})\\
& = (1.667, -1.333)\\
\Sigma_f &= \Sigma - \Sigma (\Sigma + R)^{-1} \Sigma\\
&=\left[\begin{array}{lll}
0.133 & 0.10\\
0.100 & 0.15
\end{array}
\right]
\end{align}
\]In the more general case when \(G\) is not the identity matrix I have
\[
\begin{align}
\hat{x}_f & = \hat{x} + \Sigma G' (G \Sigma G' + R)^{-1} (y - G \hat{x})\\
\Sigma_f &= \Sigma - \Sigma G' (G \Sigma G' + R)^{-1} G \Sigma
\end{align}
\]The updated posterior probability density function \(p(x\,|\,y)=N(\hat{x}_f, \Sigma_f)\) is visualised in the chart below.
I can see how the prior and likelihood distributions have influenced the posterior distribution of the robot's location. The sensor information is regarded more credible than the prior information and hence the posterior is closer to the sensor data.

G = diag(2)
y <- c(2.4, -1.9)
xhatf <- xhat + Sigma %*% t(G) %*% solve(G %*% Sigma %*% t(G) + R) %*% (y - G %*% xhat)
Sigmaf <- Sigma - Sigma %*% t(G) %*% solve(G %*% Sigma %*% t(G) + R) %*% G %*% Sigma
z3 <- outer(x1, x2, f, mean=c(xhatf), varcov=Sigmaf)
image(x1, x2, z3, col=mycols,
      xlab=expression('x'[1]), ylab=expression('x'[2]),
      main="Filtered density")
contour(x1, x2, z3, add=TRUE)
points(xhatf[1], xhatf[2], pch=19)
text(xhatf[1]-0.1, xhatf[2],
     labels = expression(hat(x)[f]), adj = 1)
lb <- adjustcolor("black", alpha=0.5)
contour(x1, x2, z, add=TRUE, col=lb)
points(0.2, -0.2, pch=19, col=lb)
text(0.1, -0.2, labels = expression(hat(x)), adj = 1, col=lb)
contour(x1, x2, z2, add=TRUE, col=lb)
points(2.3, -1.9, pch=19, col=lb)
text(2.2, -1.9,labels = "y", adj = 1, col=lb)

Forecasting

Ok, I have a better understanding of the robot's position at time \(t_0\), but the robot is moving and what I'd like to know is where it will be after one time step.

Suppose I have a linear model that explains how the state evolves over time, e.g. based on wheel spin. Again, I will assume a Gaussian process. In this toy example I follow the assumptions of Sargent and Stachurski, although this doesn't make much sense for the robot:
\[
x_{t+1} = A x_t + w_{t+1} \mbox{, where } w_t \sim N(0, Q)
\]with\[
\begin{split}A
= \left(
\begin{array}{cc}
1.2 & 0.0 \\
0.0 & -0.2
\end{array}
\right),
\qquad
Q = 0.3 \Sigma\end{split}
\]As all the processes are linear and Normal the calculations are straightforward:
\[
\begin{align}
\mathbb{E} [A x_f + w] &= A \mathbb{E} [x_f] + \mathbb{E} [w]\\
&= A \hat{x}_f\\
&= A \hat{x} + A \Sigma G' (G \Sigma G' + R)^{-1}(y - G \hat x)
\end{align}\]\[
\begin{align}
\operatorname{Var} [A x_f + w] &= A \operatorname{Var}[x_f] A' + Q\\
&= A \Sigma_f A' + Q\\
&= A \Sigma A' - A \Sigma G' (G \Sigma G' + R)^{-1} G \Sigma A' + Q
\end{align}
\]The matrix \(A \Sigma G' (G \Sigma G' + R)^{-1}\) is often written as \(K_\Sigma\) and called the Kalman gain. Using this notation, I can summarise the results as follows:\[
\begin{split}\begin{align}
\hat{x}_{new} &:= A \hat{x} + K_{\Sigma} (y - G \hat{x}) \\
\Sigma_{new} &:= A \Sigma A' - K_{\Sigma} G \Sigma A' + Q \nonumber
\end{align}\end{split}
\]Adding the prediction to my chart gives me:

A <- matrix(c(1.2, 0,
              0, -0.2), ncol=2)
Q <- 0.3 * Sigma
K <- A %*% Sigma %*% t(G) %*% solve(G%*% Sigma %*% t(G) + R)
xhatnew <- A %*% xhat + K %*% (y - G %*% xhat)
Sigmanew <- A %*% Sigma %*% t(A) - K %*% G %*% Sigma %*% t(A) + Q
z4 <- outer(x1,x2, f, mean=c(xhatnew), varcov=Sigmanew)
image(x1, x2, z4, col=mycols,
      xlab=expression('x'[1]), ylab=expression('x'[2]),
      main="Predictive density")
contour(x1, x2, z4, add=TRUE)
points(xhatnew[1], xhatnew[2], pch=19)
text(xhatnew[1]-0.1, xhatnew[2],
     labels = expression(hat(x)[new]), adj = 1)
contour(x1, x2, z3, add=TRUE, col=lb)
points(xhatf[1], xhatf[2], pch=19, col=lb)
text(xhatf[1]-0.1, xhatf[2], col=lb, 
     labels = expression(hat(x)[f]), adj = 1)
contour(x1, x2, z, add=TRUE, col=lb)
points(0.2, -0.2, pch=19, col=lb)
text(0.1, -0.2, labels = expression(hat(x)), adj = 1, col=lb)
contour(x1, x2, z2, add=TRUE, col=lb)
points(2.3, -1.9, pch=19, col=lb)
text(2.2, -1.9,labels = "y", adj = 1, col=lb)
That's it, having gone through the updating process for one time step gives me the recursive Kalman filter iteration mentioned above.

Another way to visualise the four steps can be achieved with the lattice package:
library(lattice)
grid <- expand.grid(x=x1,y=x2)
grid$Prior <- as.vector(z)
grid$Likelihood <- as.vector(z2)
grid$Posterior <- as.vector(z3)
grid$Predictive <- as.vector(z4)
contourplot(Prior + Likelihood + Posterior + Predictive ~ x*y, 
            data=grid, col.regions=mycols, region=TRUE,
            as.table=TRUE, 
            xlab=expression(x[1]),
            ylab=expression(x[2]),
            main="Kalman Filter",
            panel=function(x,y,...){
              panel.grid(h=-1, v=-1)
              panel.contourplot(x,y,...)
            })

You can find the code of this post on Github.

Session Info

R version 3.1.2 (2014-10-31)
Platform: x86_64-apple-darwin13.4.0 (64-bit)

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] stats graphics grDevices utils datasets methods base     

other attached packages:
[1] lattice_0.20-29 mnormt_1.5-1   

loaded via a namespace (and not attached):
[1] grid_3.1.2  tools_3.1.2

How cold is it? A Bayesian attempt to measure temperature

It is getting colder in London, yet it is still quite mild considering that it is late November. Well, indoors it still feels like 20°C (68°F) to me, but I have been told last week that I should switch on the heating.

Luckily I found an old thermometer to check. The thermometer showed 18°C. Is it really below 20°C?

The thermometer is quite old and I'm not sure that is works properly anymore. So, what shall I do now? Perhaps I should consider that both measurements are uncertain and try to combine them.

I believe that I can sense the temperature within ±3°C, while I think that the thermometer still works within ±2°C. Assuming that both measurements follow a Gaussian (Normal) distribution, with the uncertainties given as standard deviations, I can use Bayes' theorem to combine my hypothesis with the data. The posterior distribution will be Gaussian again with conjugated hyper-parameters:
\[
\mu=\left.\left(\frac{\mu_0}{\sigma_0^2} + \frac{\sum_{i=1}^n x_i}{s^2}\right)\right/\left(\frac{1}{\sigma_0^2} + \frac{n}{s^2}\right) \\
\sigma^2=\left(\frac{1}{\sigma_0^2} + \frac{n}{s^2}\right)^{-1}
\]With \(K := \frac{n\sigma_0^2}{s^2+n\sigma_0^2} \) this simplifies to:
\[
\mu = K\, \bar{x} + (1 - K)\, \mu_0 \mbox{, with } \bar{x}=\frac{1}{n}\sum_{i=1}^n x_i\\
\sigma = s \,\sqrt{K/n}
\]In my case I have: \(n=1,\; x_1=18^{\circ}C,\; s=2^{\circ}C,\; \mu_0=20^{\circ}C,\; \sigma_0=3^{\circ}C\).

Hence, the posterior distribution has parameters \(\mu=18.6^{\circ}C\) and \(\sigma=1.7^{\circ}C\). Thus, my best guess would be that is actually a little colder than I thought. One could argue that the probability that is below 20° is 80%.

Over the last five days my perception of the temperature didn't change, neither did the weather forecast, but the measurements showed: 18°C, 19°C, 17.5°C, 18°C, 18.5°C.

With that information the parameters update to \(\mu=18.3^{\circ}C\) and \(\sigma=0.9^{\circ}C\). I can't deny it any longer it has got colder. The probability that is below 20°C is now at 97% and the heating is on.


Without any prior knowledge I may have used a t-test to check the measurements. But here I believe that I have information about the thermometer and my own temperature sensing abilities which I don't want to ignore.

R code


Session Info

R version 3.1.2 (2014-10-31)
Platform: x86_64-apple-darwin13.4.0 (64-bit)

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods  
[7] base     

other attached packages:
[1] BayesianFirstAid_0.1 rjags_3-14           coda_0.16-1         
[4] lattice_0.20-29     

loaded via a namespace (and not attached):
[1] grid_3.1.2    MASS_7.3-35   mnormt_1.5-1  stringr_0.6.2

Binomial testing with buttered toast

Rasmus' post of last week on binomial testing made me think about p-values and testing again. In my head I was tossing coins, thinking about gender diversity and toast. The toast and tossing a buttered toast in particular was the most helpful thought experiment, as I didn't have a fixed opinion on the probabilities for a toast to land on either side. I have yet to carry out some real experiments.

Suppose I tossed 6 buttered toasts and observed that all but one toast landed on their buttered side.

Now I have two questions:
  • Would it be reasonable to assume that there is a 50/50 chance for a toast to land on either side?
  • Which probability should I assume?
If I believe the toast is fair, then the probability for landing on the buttered (B) or unbuttered (U) side is 50%.

The probability of observing one ore more B (right tail event) is:
(gt <- 1-1/2^6)
# [1] 0.984375
and the probability of observing one or fewer B (left tail event) is:
(lt <- 1/2^6*choose(6,1) + 1/2^6)
# [1] 0.109375
while the probability of either extreme event, one or fewer B (or U), is:
2*min(c(gt, lt))
# [1] 0.21875
In summary, if the toast has an equally probability to fall on either side, then there is 22% chance to observe one or fewer B (or U) in 6 tosses. That's not that unlikely and hence I would not dismiss the hypothesis that the toast is fair, despite the fact that the sample frequency is only 1/6.

Actually, the probabilities I calculated above are exactly the p-values I get from the classical binomal tests:
## Right tail event
binom.test(1, 6, alternative="greater")
## Left tail event
binom.test(1, 6, alternative="less")
## Double tail event
binom.test(1, 6, alternative="two.sided")
Additionally I can read from the tests that my assumption of a 50% probability is on the higher end of the 95 percent confidence interval. Thus, wouldn't it make sense to update my belief about the toast following my observations? In particular, I am not convinced that a 50/50 probability is a good assumption to start with. Arguably the toast is biased by the butter.

Here the concept of a conjugate prior becomes handy again. The idea is to assume that the parameter \(\theta\) of the binomial distribution is a random variable itself. Suppose I have no prior knowledge about the true probability of the toast falling on either side, then a flat prior, such as the Uniform distribution would be reasonable. However, the beta distribution with parameter \(\alpha=1\) and \(\beta=1\) has the same property and is a conjugate to the binomial distribution with parameter \(\theta\). That means there is an analytical solution, in this case the posterior distribution is beta-binomial with hyperparaemters:
\[\alpha':=\alpha + \sum_{i=1}^n x_i,\; \beta':=\beta + n - \sum_{i=1}^n x_i,\]
and the posterior predictor for one trial is given as
\[\frac{\alpha'}{\alpha' + \beta'}\]
so in my case:
alpha <- 1; beta <- 1; n <- 6; success <- 1
alpha1 <- alpha + success
beta1 <- beta + n - success
(theta <- alpha1 / ( alpha1 + beta1))
# [1] 0.25
My updated believe about the toast landing on the unbuttered side is a probability of 25%. That's lower than my prior of 50% but still higher than the sample frequency of 1/6. If I would have more toasts I could run more experiments and update my posterior predictor.

I get the same answer from Rasmus' bayes.binom.test function:
> library(BayesianFirstAid)
> bayes.binom.test(1, 6)

 Bayesian first aid binomial test

data: 1 and 6
number of successes = 1, number of trials = 6
Estimated relative frequency of success:
  0.25 
95% credible interval:
  0.014 0.527 
The relative frequency of success is more than 0.5 by a probability of 0.061 
and less than 0.5 by a probability of 0.939 
Of course I could change my view on the prior and come to a different conclusion. I could follow the Wikipedia article on buttered toast and believe that the chance of the toast landing on the buttered side is 62%. I further have to express my uncertainty, say a standard deviation of 10%, that is a variance of 1%. With that information I can update my belief of the toast landing on the unbuttered side following my observations (and transforming the variables):
x <- 0.38
v <- 0.01
alpha <- x*(x*(1-x)/v-1)
beta <- (1-x)*(x*(1-x)/v-1)
alpha1 <- alpha + success
beta1 <- beta + n - success
(theta <- alpha1 / ( alpha1 + beta1))
# [1] 0.3351821
I would conclude that for my toasts / tossing technique the portability is 34% to land on the unbuttered side.

In summary, although their is no sufficient evidence to reject the hypothesis that the 'true' probability is not 50% (at the typical 5% level), I would work with 34% until I have more data. Toast and butter please!

Session Info

R version 3.0.1 (2013-05-16)
Platform: x86_64-apple-darwin10.8.0 (64-bit)

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods  
[7] base     

other attached packages:
[1] BayesianFirstAid_0.1 rjags_3-12           coda_0.16-1         
[4] lattice_0.20-24     

loaded via a namespace (and not attached):
[1] grid_3.0.1  tools_3.0.1

Not only verbs but also believes can be conjugated

Following on from last week, where I presented a simple example of a Bayesian network with discrete probabilities to predict the number of claims for a motor insurance customer, I will look at continuous probability distributions today. Here I follow example 16.17 in Loss Models: From Data to Decisions [1].

Suppose there is a class of risks that incurs random losses following an exponential distribution (density \(f(x) = \Theta {e}^{- \Theta x}\)) with mean \(1/\Theta\). Further, I believe that \(\Theta\) varies according to a gamma distribution (density \(f(x)= \frac{\beta^\alpha}{\Gamma(\alpha)} x^{\alpha \,-\, 1} e^{- \beta x } \)) with shape \(\alpha=4\) and rate \(\beta=1000\).

In the same way as I had good and bad driver in my previous post, here I have clients with different characteristics, reflected by the gamma distribution.

The textbook tells me that the unconditional mixed distribution of an exponential distribution with parameter \(\Theta\), whereby \(\Theta\) has a gamma distribution, is a Pareto II distribution (density \(f(x) = \frac{\alpha \beta^\alpha}{(x+\beta)^{\alpha+1}}\)) with parameters \(\alpha,\, \beta\). Its k-th moment is given in the general case by
\[
E[X^k] = \frac{\beta^k\Gamma(k+1)\Gamma(\alpha - k)}{\Gamma(\alpha)},\; -1 < k < \alpha. \] Thus, I can calculate the prior expected loss (\(k=1\)) as \(\frac{\beta}{\alpha-1}=\,\)333.33.
Now suppose I have three independent observations, namely losses of $100, $950 and $450 over the last 3 years. The mean loss is $500, which is higher than the $333.33 of my model.

Question: How should I update my belief about the client's risk profile to predict the expected loss cost for year 4 given those 3 observations?

Visually I can regard this scenario as a graph, with evidence set for years 1 to 3 that I want to propagate through to year 4.