Estimating yearly trends with an EM algorithm in R

This informal post is primarily aimed at people working at the interface between the geosciences and statistics.

If you are studying real geophysical applications you are probably aware of the value of data summaries for making the problem computationally feasible. As an example, if you are faced with 1e9 satellite data points at a sub-1km scale, you might opt to average the data onto a 1km grid to bring them down to a manageable size of 1e6 points. Hopefully this does not come at a cost in inferential accuracy. However, data summaries can sometimes be carried out wrongly, leading to erroneous inferences.
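
As a toy illustration of such a gridded average (the data here are simulated and the data frame and column names are mine, not from any real pipeline), one minimal sketch in R is:

<pre>
# Bin observations onto a 1 km grid and average within each cell.
# 'obs' is assumed to have columns x, y (in metres) and a measurement z.
set.seed(1)
obs <- data.frame(x = runif(1e5, 0, 1e5),
                  y = runif(1e5, 0, 1e5),
                  z = rnorm(1e5))

obs$cell_x <- floor(obs$x / 1000)   # 1 km cell index in x
obs$cell_y <- floor(obs$y / 1000)   # 1 km cell index in y

# One averaged value per occupied grid cell
gridded <- aggregate(z ~ cell_x + cell_y, data = obs, FUN = mean)
</pre>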

In the problem we are studying we summarise the data as “linear trends in time”. Analysing the temporal behaviour of “temporal trends” can be difficult, not because there is anything inherently complicated about the concept of a “trend”, but because a trend is itself suggestive of some underlying model. Those familiar with hierarchical modelling will be comfortable with ways to accommodate this; unfortunately, if you are summarising your data in this way, you are probably doing so to avoid a “deep hierarchy” which would be computationally prohibitive to use (e.g. because the low-level spatio-temporal model is very high-dimensional). In this case a trend becomes a “data point”, and we need to make sure that this “data” is representative of what we are modelling. If you are a statistician you are probably provided with these trends directly, but we should still be careful to make sure that these values actually reflect the quantity we want to model.

The Problem

In an application I’m currently working on, we are modelling yearly trends in the height of the Antarctic ice sheet. If, say, a location x witnessed a height increase of 2m in 2007, we say that the trend there for that year was 2m/yr. If it witnessed a 10m decrease over 5 years, the trend there was -2m/yr, and so on.

Let’s say we have data spanning 2006-2009. When working with a spatial model, we can fit a trend using OLS to the data at each spatial point and then use standard spatial statistics to obtain posterior means and credibility intervals at every point in space, using the standard deviation of the trend estimate as the “trend error”. If we are only concerned with linear trends then this seems a reasonable thing to do. For example, the trend at one spatial point may look like this:

[Figure: three-year trend fitted to the data at a single spatial point]
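
As a rough sketch of how such a per-point trend and its error can be obtained (the time series is simulated here and all names are purely illustrative):

<pre>
# Fit a straight line to one location's time series and pull out the
# slope and its standard error to use as the "trend error".
set.seed(1)
ts_one_point <- data.frame(t = seq(2006, 2009, by = 1/12))
ts_one_point$h <- 0.003*(ts_one_point$t - 2006) +
                  rnorm(nrow(ts_one_point), sd = 0.01)

fit <- lm(h ~ t, data = ts_one_point)
trend     <- coef(fit)["t"]                           # slope (per year)
trend_err <- summary(fit)$coefficients["t", "Std. Error"]
</pre>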

Given the trends at some spatial points, one can use spatial statistics to infer the trends at every point in space (observed and unobserved); see some pictures in my recent paper (open access).

Extending to the spatio-temporal case is, however, not straightforward. One should not just split the data into individual years and find the trend in each year (especially without de-seasoning, as shown below). Doing this for the above data set would give the following:

[Figure: independent single-year trends fitted to the same data]

The biggest problem with the above trends is that they are suggestive of a “global trend” which is much larger than the truth. In this particular example, the 3-year trend is 2.8 \pm 0.2 mm/yr (errors at 1\sigma), but the linear trends calculated yearly are

Year   Trend (mm/yr)   Trend error (mm/yr)
2006   4.6             1.0
2007   7.2             1.1
2008   8.9             1.0

Any model you fit to these three data points will result in inferred trends which are much larger than they should be.
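
For reference, this naive per-year fit is easy to reproduce. A minimal sketch with simulated data (all names illustrative):

<pre>
library(plyr)

# Toy series in decimal years: a trend plus a seasonal cycle plus noise
set.seed(1)
X <- data.frame(x = seq(2006, 2008.99, by = 1/12))
X$y <- 2.5*(X$x - 2006) + sin(2*pi*X$x) + rnorm(nrow(X), sd = 0.3)

# Naive per-year trends: cut the series into calendar years and fit an
# OLS line inside each one (the approach being cautioned against here)
X$year <- floor(X$x)
yearly <- ddply(X, "year", function(df) {
  fit <- lm(y ~ x, data = df)
  data.frame(trend     = coef(fit)["x"],
             trend_err = summary(fit)$coefficients["x", "Std. Error"])
})
print(yearly)
</pre>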

De-seasoning

A big player here is the seasonal signal, and this should be removed. There are many ways to do this; my personal favourite is stl, available in most statistical software packages, although this is a non-parametric approach. Modelling the seasonal component explicitly is another alternative. Here we do what is most commonly done (and criticised in the previous link), which is to subtract the monthly means from the data prior to finding the trend. This works pretty well in practice. Our global 3-year trend changes slightly to 2.2 \pm 0.2 mm/yr, whilst the yearly trends become

Year   Trend (mm/yr)   Trend error (mm/yr)
2006   -1.6            0.9
2007    1.1            0.9
2008    2.7            0.8

and are shown in the figure below.

[Figure: single-year trends after removing the seasonal signal]

These estimates are more reasonable (note that much more sophisticated approaches can be used). However, one problem remains: we still have discontinuities at the edges. For example, there is a jump at 2008 (because we have chosen to place a temporal demarcation there) which diminishes the effect of the positive trend, i.e. there was an increase which was not accounted for.
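
(As a quick aside, the monthly-mean subtraction itself is only a few lines; here is a minimal sketch with simulated data and illustrative column names.)

<pre>
# Subtract the mean of each calendar month (computed across all years)
# from the series before fitting any trends.
set.seed(1)
X <- data.frame(x = seq(2006, 2008.99, by = 1/12))          # decimal years
X$y <- 2.5*(X$x - 2006) + sin(2*pi*X$x) + rnorm(nrow(X), sd = 0.3)

X$month <- round((X$x - floor(X$x))*12) + 1                 # month index 1-12
X$y_deseasoned <- X$y - ave(X$y, X$month, FUN = mean)       # remove monthly means
</pre>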

To minimise boundary effects we need to tie up the ends, such that the beginning of 2009 links with the end of 2008 and so on. This can be done parametrically by fitting (after de-seasoning) a piecewise-linear function to the data. We do this using “hat” basis functions, as in the standard finite element method, and then fit the hat functions to the data using standard techniques. This results in the following:

[Figure: piecewise-linear (hat-function) fit giving the yearly trends]

with the following estimates:

Year   Trend (mm/yr)   Trend error (mm/yr)
2006   -2.2            0.6
2007    3.0            0.5
2008    5.0            0.7

This, to me, seems the most straightforward way to estimate trends on a yearly scale, which we can then treat as data points in our spatio-temporal framework (we will have 1e6 of these mini time-series to model). I’ll be happy to hear other alternatives.

For those who would like some technical details, in the following I describe the expectation maximisation (EM) algorithm for this problem.

EM

For an introduction to the expectation maximisation (EM) algorithm I highly recommend Bishop’s book.

We assume the following parametric model for the underlying field:

y = Cx + v \\ x \sim MVN(0,Q^{-1})

where x denotes the weights of our “hat” basis functions \phi(t) over time t. These hat functions are set up such that each has its maximum at a yearly point and reaches zero at the yearly points on either side (and thus has a support of two years). The prior precision matrix Q can be set as diagonal with small elements if we wish the prior to be largely uninformative. The matrix C maps the weights to the observations, and thus each row of C is given by \phi_j(t_i), where t_i is the observation timestamp. Since each observation lies in the support of only two hat functions, this matrix is very sparse, and we exploit this to speed up computations.
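
As a sketch of what this looks like in code, here is one way to evaluate a hat function and assemble C for toy yearly knots (the object names and the monthly time grid are purely illustrative):

<pre>
library(Matrix)

# A single "hat" (piecewise-linear) basis function centred at knot kj,
# reaching zero at kj - ds and kj + ds
hat <- function(t, kj, ds) pmax(1 - abs(t - kj)/ds, 0)

# Yearly knots and monthly observation times
knots <- 2006:2009
t_obs <- seq(2006, 2008.99, by = 1/12)

# Each row of C holds the hat functions evaluated at one observation time;
# at most two entries per row are non-zero, so a sparse matrix pays off
C_dense <- sapply(knots, function(k) hat(t_obs, k, ds = 1))
C <- Matrix(C_dense, sparse = TRUE)
</pre>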

We assume homogeneous noise v \sim N(0,\sigma^2 I); however, we need to estimate \sigma^2 (or rather the precision \sigma^{-2}). For EM we need the mean and variance of x at each iteration, which in this case are simply given by

\Sigma = (Q + \widehat{\sigma^{-2}}C^TC)^{-1}

\mu = \widehat{\sigma^{-2}}\Sigma C^T y

Taking expectations over x, the log-likelihood is then given by

\langle \ln p(y | x,\sigma^{-2}) \rangle = \langle \frac{1}{2}\ln |\sigma^{-2}I| - \frac{\sigma^{-2}}{2}(y-Cx)^T(y-Cx)\rangle + \mathrm{const.}

which is maximised with \widehat{\sigma^{-2}} = n/\Gamma where

\Gamma = y^Ty - 2\mu^TC^Ty + tr(C^TC[\mu\mu^T + \Sigma])

and n is the number of data points. The algorithm is thus very straightforward, and probably the hardest part is finding the knots and actually setting up the basis! I provide the code for the function below; do let me know if you find it useful.
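
Before the full function, here is a minimal dense-matrix sketch of one EM iteration implied by the equations above (everything here, C, y, Q and the sizes, is simulated and purely illustrative; the sparse implementation I actually use follows):

<pre>
set.seed(1)

# Toy ingredients: n observations, k basis-function weights
n <- 36; k <- 4
C <- matrix(runif(n*k), n, k)   # stands in for the hat-function matrix
y <- rnorm(n)
Q <- diag(1e-6, k)              # vague prior precision on the weights
prec_hat <- 1/var(y)            # initial estimate of sigma^{-2}

for (m in 1:100) {
  # E-step: posterior covariance and mean of the weights
  Sigma <- solve(Q + prec_hat*crossprod(C))
  mu    <- prec_hat*Sigma %*% crossprod(C, y)

  # M-step: expected residual sum of squares and precision update
  Gamma <- sum(y^2) - 2*sum(mu*crossprod(C, y)) +
           sum(diag(crossprod(C) %*% (Sigma + tcrossprod(mu))))
  prec_new <- n/Gamma

  if (abs(prec_new - prec_hat) < 1e-8) break
  prec_hat <- prec_new
}
</pre>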

<pre>
# Piecewise-linear interpolation of a time series using "hat" (tent) basis
# functions, with the observation precision estimated by an EM algorithm.
# X     : a data frame with fields 'x' (time) and 'y' (value)
# ds    : the spacing of the knots (1 by default)
# first : the location of the first knot (e.g. 2006); if NULL the knots
#         are placed automatically so as to cover the data

Piecewise_interp <- function(X, ds = 1, first = NULL) {
  stopifnot(is(X, "data.frame"))
  stopifnot("x" %in% names(X))
  stopifnot("y" %in% names(X))
  stopifnot(is(ds, "numeric"))
  stopifnot(is.null(first) | is(first, "numeric"))

  require(plyr)
  require(Matrix)

  if (is.null(first)) {                # Automatically place knots
    first_val <- ceiling(min(X$x)/ds)*ds - ds
    last_val  <- floor(max(X$x)/ds)*ds + ds
    knots <- seq(first_val, last_val, by = ds)    # Form knots
  } else {                             # Starting point pre-set
    first_val <- first
    knots <- first_val
    while (tail(knots, n = 1) < max(X$x)) {
      knots <- c(knots, tail(knots, n = 1) + ds)
    }
    last_val <- tail(knots, n = 1)
    X <- subset(X, x > first_val)
  }

  # For each point, find the knot interval (and hence the two overlapping
  # basis functions) it belongs to
  X$section <- cut(X$x, knots)
  X$sectionnum <- as.integer(X$section)    # interval index into 'knots'

  # For each section calculate the values of the two hat functions phi
  X <- ddply(X, "section", function(df) {
    lo_x <- knots[df$sectionnum]
    hi_x <- knots[df$sectionnum + 1]
    c1 <- 1 - (-1/ds)*lo_x
    c2 <- 1 - (1/ds)*hi_x

    df$phi1     <- (-1/ds)*df$x + c1
    df$phi1_ind <- df$sectionnum
    df$phi2     <- (1/ds)*df$x + c2
    df$phi2_ind <- df$sectionnum + 1
    return(df)
  })

  # Build the (sparse) 'observation matrix': two non-zeros per row
  C <- sparseMatrix(i = c(1:nrow(X), 1:nrow(X)),
                    j = c(X$phi1_ind, X$phi2_ind),
                    x = c(X$phi1, X$phi2))

  # Maximum-likelihood estimate of the weights; this does not depend on the
  # observation precision, so it is computed once outside the EM loop
  ymat <- matrix(X$y)
  xML  <- data.frame(x = knots,
                     y = as.vector(solve(t(C) %*% C, t(C) %*% ymat)))

  # EM algorithm for the observation precision
  maxiter <- 100
  gamma_prec <- numeric(maxiter)
  gamma_prec[1] <- 1/var(X$y)          # initialise the precision

  for (m in 2:maxiter) {
    # E-step: variance of the weights at the current precision
    Qobs <- sparseMatrix(i = 1:nrow(X), j = 1:nrow(X), x = gamma_prec[m - 1])
    Qtot <- t(C) %*% Qobs %*% C
    Varx <- solve(Qtot)

    # M-step: update the precision from the expected residual sum of squares
    Gamma <- as.vector(t(ymat) %*% ymat - 2*t(xML$y) %*% t(C) %*% ymat +
                       sum(diag(t(C) %*% C %*% (Varx + xML$y %*% t(xML$y)))))
    gamma_prec[m] <- nrow(X)/Gamma
    if (abs(gamma_prec[m] - gamma_prec[m - 1]) < 1e-5) break
  }
  cat(paste("EM converged after", m, "iterations", sep = " "), sep = "\n")

  # Variance of each yearly trend (w_{i+1} - w_i)/ds from the weight covariances
  Vartrends <- numeric(nrow(Varx) - 1)
  for (i in 1:(nrow(Varx) - 1)) {
    Vartrends[i] <- (sum(diag(Varx)[i:(i + 1)]) - 2*Varx[i, i + 1])/ds^2
  }

  trends <- data.frame(year  = head(knots, n = (length(knots) - 1)),
                       trend = diff(xML$y)/ds,
                       std   = sqrt(Vartrends))

  # Return the maximum-likelihood estimate of the weights and the yearly trends
  return(list(x = xML, trends = trends))
}
</pre>

To test this function you can use the following code


<pre>
library(ggplot2)
library(plyr)

deseason <- TRUE

# Simulate three years of data: a seasonal cycle plus a different linear
# trend in each year, with added noise
x  <- seq(2006.01, 2009, by = 0.01)
x2 <- x - 2006
y  <- sin(x2*2*pi) + 4*x2*(x2 < 1) -
      4*x2*(x2 >= 1 & x2 < 2) + 8*(x2 >= 1 & x2 < 2) +
      4*x2*(x2 >= 2 & x2 <= 3) - 8*(x2 >= 2 & x2 <= 3) +
      2*rnorm(length(x))
X <- data.frame(x = x, y = y)

# Crude de-seasoning: subtract the mean of the three values observed at the
# same point of the seasonal cycle in each of the three years
if (deseason) {
  int <- ((0:2)*100) + 1
  av  <- rep(0, 100)
  for (i in 0:99) {
    av[i + 1] <- mean(y[int + i])
  }
  X$y <- X$y - kronecker(rep(1, 3), av)
}

Interp <- Piecewise_interp(X, ds = 1, first = 2006)

g3 <- ggplot() +
  geom_point(data = X, aes(x = x, y = y)) +
  geom_line(data = Interp$x, aes(x = x, y = y), colour = "red", size = 1)
print(g3)
</pre>

Thanks to Botond Cseke and Nana Schoen for helpful discussions on this topic.