Notes from Spatial Statistics 2015 (conference)

This year I attended the Spatial Statistics Conference in Avignon, France. The conference was attended by about 300 people, mostly spatial statisticians, so I definitely felt at home there. I found it well organised, although it was unfortunate that the organisers decided to have approximately zero minutes for question time. While I agree that a 20-minute talk is a good length, I would have opted for 25 minutes per talk with a specific 5 minutes for questions and discussion. The usual “we have time for one quick question” was getting tiring by the end; it definitely put off people like me who think twice before asking a question, and it defeats one of the biggest purposes of a conference, that of interaction among the delegates.

I thought that most presentations were excellent. Mentioning everything that I thought was of note would be a futile exercise; however, a few comments will not go amiss.

Wednesday

The conference kicked off with an excellent talk by Michael Stein on computer model emulation, where the deterministic model outputs depend only on the initial conditions and the actual CO2 trajectory, and the output is temperature. Michael focused heavily on spectral representations of emulators. Essentially, if f0(omega) and f1(omega) are the initial- and final-time spectral representations of the emulator, while g0(omega) and g1(omega) are those of the deterministic model, then one can try to find a statistical model (emulator) f1 such that f1/f0 is approximately g1/g0. One can then focus on which spectral characteristics to capture in the model; capturing all spectral components will obviously lead to a more complicated (statistical) model.

The presentations I attended in the morning of the first day were a nice mix of application and theory. J. B. Lecomte (AgroParisTech) talked about an application of a zero-inflated count model to biomass data, while M. Nguyen (Imperial College) talked about tempo-spatial OU processes, constructed by taking a product of a temporal OU and a spatial OU process. This tempo-spatial process can be written down as an integral with respect to a Lévy basis. Only 1D examples were presented and inference was based on moment matching. More details here. M. Bourette (INRA) talked about pairwise likelihood for multivariate models with non-separable covariance functions (which I think was the Apanasovich and Genton model).

The day proceeded with a talk by J. Wallin on parameter estimation of non-Gaussian random fields. This talk particularly interested me, as Wallin (and Bolin) use (conditional Gaussian) Markov random fields to approximate SPDEs that in turn have generalised asymmetric Laplace, or normal inverse Gaussian, fields as their solution. The problem with such models is inference, and much of Wallin’s talk consisted of saying “and we cannot use this”, “and we cannot use that”, with very valid reasons in each case (cannot sample, cannot compute the log-likelihood, and so on). Wallin had successfully used Monte Carlo EM for estimating the parameters; the new method he presented is claimed to be more efficient, but I do not recall the details, which will be published somewhere soon.

Another interesting talk I attended today was by Alastair Rushworth (Strathclyde) on spatio-temporal smoothing for estimating the health impact of air pollution. I found this talk interesting for two reasons. The first is that this is one of the few applications in the environmental sciences where we see cascaded (statistical) spatio-temporal models, where the output of one spatio-temporal model is used to provide samples (in an MCMC chain) for the next. In this case, Alastair was taking the air pollution model of Sujit Sahu and using it to capture the variability of air pollution in the “public health” model he was interested in. Such cascading is prolific in the deterministic world but presents a problem in a Bayesian world, where one “should” assume that we can learn something about air pollution from the nation’s health. Arguably (and depending on the model), little can be learnt about air pollution levels from the nation’s health IF air pollution data are available, and the implicit assumption being made is that Y1 conditioned on Z1 is approximately independent of Y2, which is a valid assumption in many cases. I used the same reasoning in the modelling of the sea-level rise contribution of Antarctica. Note that I have used Y1 and Y2 in my explanation here; in fact this problem can be nicely seen as a bivariate model built using the conditional approach, as I pointed out to Sujit and Alastair after the talk.

The second thing that I liked about this talk was the structural estimation. Alastair attempted to estimate the neighbourhood matrix (subject to a positive-definite constraint) by placing a normal prior on a logistic transform of its elements (so that they lie between zero and one). Some very interesting boundaries, corresponding to elements close to zero, cropped up in the map of the UK following this inference, and they did not seem at all random (one set of boundaries was clearly outlining Yorkshire, in my opinion!).

Two other talks were of interest to me today. The first was on basis reduction of a spatial log-intensity in an LGCP. What intrigued me here (and whose reasoning I do not yet fully understand) was the use of a dual reduction of a spatial field. In this work, S. Montagna (Warwick) first projects the continuous log-intensity onto a set of spline basis functions, and the target of inference becomes the weights of these basis functions (in the usual fashion). However, the number of basis functions used (say, p) could still be prohibitive; therefore a latent factor model is used on these coefficients, where the number of factors k is much less than p. The other talk was on Gaussian excursion sets by A. Astrakova (Bergen). In excursion set estimation, one is intent on finding a set X defined as the set of points x such that Y(x) > C, where C is some threshold defining, for example, the 95% level (and is generally unknown). The author assumed that the marginal variance of the underlying GP is known, but that the scale factor and the threshold are unknown. Gibbs sampling seems to work particularly well in this context, since the scale parameter and C are conditionally independent given the process. This work seems very related to that of Zhang, Craigmile and Cressie, which uses Baddeley’s loss function to predict the exceedance region.

Thursday

Today opened with a talk on estimation with massive data by Marc Genton. Marc works at the King Abdullah University of Science and Technology (KAUST), and I was glad that he took some time to talk about the institute. KAUST, in Saudi Arabia, is investing a lot in climate science. Of note is the new Cray XC40 system that KAUST has purchased, which contains on the order of 200,000 cores. If I’m not mistaken, this is nearly double that of the UK national supercomputer (~120,000 cores)! Granted, the UK’s Archer XC30 is nearly 2–3 years old now, but that is definitely a big machine by any standard. Also, with nearly 1PB of shared memory, the possibilities are endless (although this size is way beyond what current methods in statistics can take advantage of, in my opinion). Also of interest was the virtual environment built and designed (with the help of Stefano Castruccio) to visualise global data in 3D. Immersive environments can be very useful, and being able to rotate a globe and zoom in and out just by waving one’s hands (while seeing everything through a virtual reality device such as the Oculus Rift) is very intriguing. I hope to visit this place sometime soon!

I found Marc’s talk interesting for reasons other than KAUST. One of the biggest challenges in spatial statistics is parameter estimation of spatial max-stable processes (e.g., the Brown–Resnick or Reich–Shaby processes). While standard GP operations scale as n^3, computations and memory storage with max-stable processes scale as n^n. Thus, many naturally use composite likelihood to split up the estimation into small groups (pairwise and triplewise being the most common). I found it disturbing that tests with the full likelihood could only be done with up to nine data points, and even with only nine data points it took two weeks on a big cluster. A quick look at the form of the likelihood shows that this problem is (very) embarrassingly parallel; is this one place where massive banks of GPUs could play a big role? Even with pairwise likelihood on a cluster, things start to get a bit problematic with 200K+ data points. Marc also talked about emulation of temperature in 3D, which could only be done by distributing matrix computations across the cluster. The work (much of which, I think, Stefano Castruccio did, given his later talk) consisted of 1 billion data points produced by the NCAR CCSM4: 6 realisations, 251 years, a lon/lat grid of size 288 x 192, and a vertical resolution of 17 levels. I remember reading the paper which dealt with this problem; the approach was step-wise, first inferring individual time series, then grouping on latitude bands, then on longitude strips, and then vertically.

The day continued with my talk on bivariate models in atmospheric physics, and an interesting talk by F. Lavancier (Nantes) on how one can combine estimators in order to improve an estimator. The content of this talk seemed remarkably fundamental, and most of the audience seemed incredulous that one can always improve on an estimator by combining it with others, with the only condition being that at least one of them is consistent. I cannot comprehend this personally; surely my consistent estimator cannot be improved by combining it with a really simple (possibly naive and just plain wrong) estimator? Another talk of interest was by Stefano Castruccio, which delved into the concept of emulation as a means of “compressing” climate data. I think this is of very practical use: climate data are too large to store, so why not train a statistical model to summarise them? Stefano considered the massive example Marc Genton talked about and showed some emulator outputs together with some climate model outputs, and one could not visually tell the difference. But the natural question to ask in this context (which I didn’t, because there was no time!) is, “how do I summarise the data such that any future analysis is still valid?”. Obviously one cannot do this, and implicitly, when emulating, one is always compromising information. For example, one can decide to focus on the trend, or on the variability, but not on seasonal components or maxima/minima. This is a big problem: what may be of interest today may not be of interest tomorrow and vice versa. If I compress the data with wrong judgements on what needs to be preserved, the cost of “wasting” simulator runs could be enormous, possibly much more than buying another bank of 1TB disks (and we are talking of 1TB/day of data). Stefano’s data set consisted of about 30M points and his model used an “evolutionary spectrum” to cater for non-stationarity. See http://www.mas.ncl.ac.uk/~nsc141/CMSJM.JClim.14.pdf for more details.

Thursday concluded with an exquisite dinner, just off the Palais des Papes, in what was meant to be an old water cistern; a very impressive setting!

Friday

By Friday we were all a bit tired, but one cannot overlook David Bolin’s talk, which eventually won the Best Talk Prize of the conference. The premise of David’s work, which is now on arXiv, is that one can carry out covariance tapering with a spatially-varying taper. The obvious question is validity, but in the introduction to Section 2 of the paper one can see that this is accommodated by assuming a convolution form for the tapered covariance function. David (and Wallin) then show how one can use hyperspherical tapers in order to make this convolution analytically tractable.

So why would one want a spatially-varying tapering function? At the beginning of his presentation, David showed that standard tapering is actually quite bad where the data are sparse, and has nearly no adverse effect where the data are dense. The reason is that where data are dense, even though tapering introduces a lot of zeroes into the covariance matrix, there is still plenty of borrowing of strength between nearby data points. Where data are sparse, a row in the tapered covariance matrix could be left with just one non-zero element! Therefore, the obvious approach is to vary the tapering function spatially so that each row in the covariance matrix has roughly the same number of non-zeros; the resulting taper has a large range where data are sparse, and a small range where data are dense. The improvements over the standard (spatially-invariant) tapering function are, as expected, quite dramatic. A rough sketch of the row-balancing idea is given below.
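As a rough sketch of that idea only (my own illustration, not Bolin and Wallin's actual construction; the locations, the choice of k, and the k-nearest-neighbour rule are all assumptions for the example), one can assign every location a taper range equal to the distance to its k-th nearest neighbour, which automatically gives large ranges where data are sparse and small ranges where data are dense:

```python
import numpy as np
from scipy.spatial import cKDTree

# Rough sketch of the row-balancing idea (not the construction in the paper):
# give every location a taper range equal to the distance to its k-th nearest
# neighbour, so that each row of the tapered covariance matrix ends up with
# roughly the same number of non-zeros.

rng = np.random.default_rng(1)
s = rng.uniform(size=(500, 2))        # irregularly spaced observation locations
k = 20                                # target number of neighbours per row

tree = cKDTree(s)
dists, _ = tree.query(s, k=k + 1)     # k+1 because each point is its own nearest neighbour
taper_range = dists[:, -1]            # large where data are sparse, small where dense

# Turning these spatially-varying ranges into a *valid* taper is exactly what the
# convolution construction with hyperspherical tapers in the paper takes care of.
```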

This was, overall, a great conference, and I look forward to seeing updates from the above researchers next year at ISBA or METMA.

Partial matrix inverse for efficient computation of posterior predictive variances

It is becoming increasingly common in large-scale environmental applications exhibiting various scales of interaction to let the precision matrix Q of the latent field be large (n > 10000) but sparse (Lindgren, 2011). While the large dimension allows for considerable flexibility in representing spatio-temporal fields, the sparsity permits the use of numerical methods with favourable computational and memory complexities. One of these methods is based on the so-called Takahashi equations (Campbell, 1995), which recursively compute a partial matrix inverse (primarily to obtain the marginal variances on the diagonal) using only the sparse Cholesky factor of the precision matrix. This recursive update scheme is an integral component of several inference schemes, including INLA (Rue, 2009).

Recently, I came across one area in my work that could make extensive use of a property of the incomplete inverse obtained from the Takahashi equations to speed up computations. Unfortunately, the application for which this was intended did not materialise in practice (due to other problems emerging in the bigger picture), but the realisation deserves attention, hence this post.

The item of interest is the computation of the quantity \textrm{diag}(C\Sigma^*C^T), where \Sigma^* = (C^TQ_oC + Q)^{-1}, C is the incidence matrix (from the observation model z = Cx + e) and Q_o is the observation precision matrix, which is assumed to be diagonal. This quantity would be needed, for example, if we need to compute (i) second-order statistics concerning parameters appearing in a fine-scale component (here absorbed in e), such as in the stochastic volatility model of Katzfuss and Cressie (2012), or (ii) simply the posterior predictive variance at each observation location.

A useful identity allows us to re-write \textrm{diag}(C\Sigma^*C^T) as 

\textrm{diag}(C\Sigma^* C^T) \equiv [(C\Sigma^*)\circ C]\mathbf{1}~~~~ (1)

where \circ is the Hadamard product and \mathbf{1} is a vector of ones. Although this might at first seem a trivial quantity to compute, in the large systems we are concerned with, where C \in \mathbb{R}^{m \times n}, m can be in the millions and n in the tens (or hundreds) of thousands. A standard forward-backward solve for C\Sigma^* is thus intractable, even with the use of fill-in-reducing permutations, and even though both C and Q^* = \Sigma^{*-1} are sparse and the Cholesky factor of Q^* is easily computed.
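As a quick sanity check of (1), here is a toy dense example (the sizes and random matrices are arbitrary; in the real problem C is sparse and \Sigma^* is never formed in full):

```python
import numpy as np

# Numerical check of identity (1): diag(C Sigma* C^T) = [(C Sigma*) o C] 1.
rng = np.random.default_rng(0)
m, n = 200, 50

C = rng.standard_normal((m, n))
A = rng.standard_normal((n, n))
Sigma_star = A @ A.T + n * np.eye(n)         # symmetric positive-definite stand-in for Sigma*

lhs = np.diag(C @ Sigma_star @ C.T)          # direct computation of diag(C Sigma* C^T)
rhs = ((C @ Sigma_star) * C).sum(axis=1)     # [(C Sigma*) o C] 1
print(np.allclose(lhs, rhs))                 # True
```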

We deal with this computational bottleneck by first noting that, since C is usually very sparse, in effect only a few elements of \Sigma^* are needed on the right-hand side of (1). Denote the covariance matrix with all unnecessary elements zeroed out as \Sigma^{**}. We can relate \Sigma^{**} to \Sigma^* using the zeroes function, which we define as

zeroes(\Sigma)_{ij} = \left\{\begin{array}{l} 0 ~~~~ \Sigma_{ij} = 0\\ 1 ~~~~ \Sigma_{ij} \ne 0 \end{array} \right.

Then zeroes(\Sigma^{**}) \le zeroes(\Sigma^*) means that each element which is non-zero in \Sigma^{**} is also non-zero in \Sigma^* (but not vice versa). For this problem it can then be shown from (1) that zeroes(\Sigma^{**}) is given by

zeroes(\Sigma^{**}) \equiv zeroes(C^T C)~~~~~(2)

The right-hand side of (2) implies that the only elements of the full posterior covariance matrix \Sigma^* that are needed for computing (1) are those which are non-zero in (C^T C). This property is not of much help on its own, since we do not know \Sigma^* in the first place. However, we can exploit the fact that the elements of \Sigma^* that we can easily compute, using the Takahashi equations, are a superset of those in \Sigma^{**}.

To see this, note that since Q^{*} = C^TQ_oC + Q we can assert the following

zeroes(C^T C) \le zeroes(Q^*)~~~~~(3)

Further, by Corollary 2.2 in Rue (2005),

zeroes(Q^*_L) \le zeroes(L^*)~~~~~~(4)

where the subscript L denotes the lower-triangular part and L^{*} is the lower Cholesky factor of Q^*. This is significant, as the Takahashi equations compute all values of \Sigma^* for which the Cholesky factor is non-zero when evaluating the marginal variances (Rue, 2009). Denote this partial matrix inverse as \Sigma^{***}. Then we can assert that

zeroes(L^*) \le zeroes(\Sigma_L^{***})~~~~~~~(5)

The key result follows by combining deductions (2)–(5) and noting that \Sigma^{***} is symmetric, to obtain the inequality

zeroes(\Sigma^{**}) \le zeroes(\Sigma^{***})

Hence all the elements needed for the multiplication in (1) are already available in \Sigma^{***}, which is relatively easy to compute using fast linear algebraic routines. This means that we can substitute \Sigma^* with \Sigma^{***} or, even better, with (\Sigma^{***} \circ zeroes(C^T C)), and effectively replace a huge forward-backward solve with a sparse matrix multiplication. A notable advantage is that this operation can be parallelised with minimal effort, resulting in significant computational savings (in our implementation we made use of the OpenBLAS library, see previous post). As an indication of achievable speeds, for the problem in the current study, using the partial matrix inverse for the evaluation of (1) required only 0.3s on a high-end desktop with 32GB of memory, despite the Cholesky factor being on the order of 0.5GB. Not enough memory was available to compute (1) directly.
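The chain of deductions can be checked numerically on a toy problem. The sketch below is my own illustration (the tridiagonal Q, the two-non-zeros-per-row C and the diagonal Q_o are invented for the example), and it uses a brute-force dense inverse of Q^* as a stand-in for the Takahashi output \Sigma^{***}; a real application would only ever compute the elements in the Cholesky pattern.

```python
import numpy as np
import scipy.sparse as sp

# Toy check of the key result: only the elements of Sigma* lying in the non-zero
# pattern of C'C are needed to evaluate (1), and by (2)-(5) these are all contained
# in the partial inverse Sigma*** returned by the Takahashi equations.

rng = np.random.default_rng(0)
n, m = 300, 1000

# Sparse AR(1)-type prior precision matrix Q (tridiagonal, diagonally dominant)
Q = sp.diags([-0.9 * np.ones(n - 1), 1.81 * np.ones(n), -0.9 * np.ones(n - 1)],
             offsets=[-1, 0, 1], format="csc")

# Sparse incidence matrix C: each observation touches two neighbouring states
rows = np.repeat(np.arange(m), 2)
left = rng.integers(0, n - 1, size=m)
cols = np.column_stack([left, left + 1]).ravel()
vals = rng.uniform(0.5, 1.5, size=2 * m)
C = sp.csc_matrix((vals, (rows, cols)), shape=(m, n))

Qo = 2.0 * sp.eye(m, format="csc")                  # diagonal observation precision
Sigma_star = np.linalg.inv((C.T @ Qo @ C + Q).toarray())   # dense stand-in for Sigma***

# Zero out every element of Sigma* outside the non-zero pattern of C'C (this is Sigma**)
pattern = (C.T @ C).toarray() != 0
Sigma_2star = Sigma_star * pattern

Cd = C.toarray()
full    = ((Cd @ Sigma_star) * Cd).sum(axis=1)      # [(C Sigma*) o C] 1
reduced = ((Cd @ Sigma_2star) * Cd).sum(axis=1)     # same quantity, using only pattern(C'C)
print(np.allclose(full, reduced))                   # True
```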

Thanks go to Jonathan Rougier for considerable help in organising the ideas for this post.

Multivariate spatio-temporal models with separable covariance functions: the auto-regressive representation

It is somewhat convenient to model a spatio-temporal Gaussian field with a separable covariance function. All one needs to do is fit functions to the empirical spatial and temporal variograms and the job is done; if we obtain valid spatial and temporal covariance functions, then their product is guaranteed to also be a valid covariance function. Despite their simplicity, these models have significant drawbacks, most notably that they are unable to capture spatio-temporal interactions, by which I mean that the cross-covariance between the time series at two locations has the same shape irrespective of the distance between these locations (it is merely scaled by the spatial covariance); see Cressie and Huang (1999) and Gneiting et al. (2005), and references therein, for more information on the limitations of this class of models.

A very convenient property of separability is its representation in dynamic, autoregressive form. By Theorem 3 of Storvik et al. (2001) one can show that if X(s,t) \sim \mathcal{GP}(0,k_1(\|s-r\|)k_2(|t-t'|)) and if k_2(|t-t'|) = \rho^{|t-t'|} then

X(s,t+1) = \rho X(s,t) + e(s)

where, independently at each time point,

e(s) \sim \mathcal{GP}(0,(1-\rho^2)k_1(\cdot))

There might be computational reasons why you would want to use one form or the other, but I will not dwell on this today; let’s just say it’s useful to have both representations at hand when carrying out parameter estimation.
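As a quick simulation check of the equivalence (my own sketch; the exponential k_1, its range and the sizes below are arbitrary choices for illustration), one can run the AR(1) recursion forward from a stationary start and verify that the empirical covariances match k_1(\|s-r\|)\rho^{|t-t'|}:

```python
import numpy as np

rng = np.random.default_rng(3)
p, T, N = 10, 5, 100_000                     # spatial sites, time steps, replicates
rho = 0.7

s = rng.uniform(size=(p, 2))
D = np.linalg.norm(s[:, None] - s[None, :], axis=-1)
K1 = np.exp(-D / 0.3)                        # k_1: exponential spatial covariance
L = np.linalg.cholesky(K1 + 1e-10 * np.eye(p))

X = L @ rng.standard_normal((p, N))          # X(., 0) ~ GP(0, k_1): stationary start
X_prev = X
for t in range(T - 1):
    e = np.sqrt(1 - rho**2) * (L @ rng.standard_normal((p, N)))  # e ~ GP(0, (1-rho^2) k_1)
    X_prev, X = X, rho * X + e               # X(., t+1) = rho X(., t) + e

# Compare empirical covariances (across the N replicates) with the separable form
print(np.cov(X[0], X[1])[0, 1],      K1[0, 1])          # lag 0: k_1(||s_0 - s_1||)
print(np.cov(X[0], X_prev[1])[0, 1], rho * K1[0, 1])    # lag 1: rho * k_1(||s_0 - s_1||)
print(np.cov(X[0], X_prev[0])[0, 1], rho * K1[0, 0])    # same site, lag 1: rho * k_1(0)
```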

Spatio-temporal co-regionalisation

In my work I’m using (very) large datasets and have to carry out inference over large areas. Separability is a help here; however, I have multiple interacting spatio-temporal fields, and the interaction cannot be ignored. Finding the joint spatial covariance structure in a multivariate problem is not hard; see, for example, Zammit-Mangion et al. (2013) and Finley et al. (2007), which employ a co-regionalisation assumption. Under this assumption, interactions between the fields are assumed to be local, i.e., at a given point in space the interaction between the two fields can be fully defined by a 2×2 numeric covariance matrix. This does not imply a lack of spatial correlation in the interaction terms; rather, these off-diagonal terms are fully defined by the covariance functions of the individual fields, which makes things somewhat easier.

Co-regionalisation has been applied in the spatio-temporal setting in a model termed the spatiotemporal linear coregionalization model (STLM) by Choi et al. (2009). The same rules apply in the spatio-temporal setting as in the spatial setting. Thus, under co-regionalisation, and the assumption that the spatio-temporal covariance functions of both interacting fields are, marginally, the same up to a numerical constant, the joint covariance function for the separable case described above is simply {\bf R} \otimes (k_1(\|s-r\|)k_2(|t-t'|)), where \bf R is the 2×2 (numerical) covariance matrix describing the local covariances.
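For concreteness, here is a minimal sketch (my own, not from Choi et al.) of assembling this joint covariance matrix on a small space-time grid, with an exponential k_1, k_2(|t-t'|) = \rho^{|t-t'|}, and an arbitrary valid 2×2 matrix {\bf R}:

```python
import numpy as np

rng = np.random.default_rng(2)
s = rng.uniform(size=(15, 2))                                    # spatial locations
t = np.arange(10)                                                # time points
rho = 0.8

D = np.linalg.norm(s[:, None] - s[None, :], axis=-1)
Ks = np.exp(-D / 0.3)                                            # k_1 (spatial)
Kt = rho ** np.abs(t[:, None] - t[None, :])                      # k_2 (temporal)
R = np.array([[1.0, 0.7], [0.7, 1.2]])                           # local 2x2 covariance matrix

# Stacked vector ordered with space varying fastest, then time, then variable
Sigma = np.kron(R, np.kron(Kt, Ks))
np.linalg.cholesky(Sigma + 1e-10 * np.eye(Sigma.shape[0]))       # succeeds: valid joint covariance
```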

What I am interested in here is the multivariate auto-regressive (AR) interpretation of bivariate spatio-temporal fields having this covariance function.

The AR representation

Let us assume the presence of two fields \tilde X_1(s,t) and \tilde X_2(s,t) which are independent and identically distributed according to

\tilde X_1, \tilde X_2 \sim \mathcal{GP}(0,k(\cdot))

where k(\cdot) = k_1(\cdot)k_2(\cdot) is the product of the spatial and temporal covariance functions, respectively. Then, without loss of generality, we can express the bivariate model as

X_1(s,t) = \tilde X_1(s,t)

X_2(s,t) = a\tilde X_1(s,t) + b\tilde X_2(s,t)

where

{\bf R} = \begin{bmatrix} 1 & a \\ a & a^2 + b^2 \end{bmatrix}

Then

(X_1(s,t), X_2(s,t)) \sim \mathcal{GP}(0, {\bf R} \otimes k(\cdot))

What I wish to show is that this joint space-time covariance function describes an auto-regressive process of the form

\begin{bmatrix} X_1(s,t+1) \\ X_2(s,t+1) \end{bmatrix} = \begin{bmatrix} \rho & \\ & \rho \end{bmatrix}\begin{bmatrix} X_1(s,t) \\ X_2(s,t) \end{bmatrix} +\begin{bmatrix} e_1(s) \\ e_2(s) \end{bmatrix}

where

(e_1(s), e_2(s)) \sim \mathcal{GP}\left(0,(1-\rho^2)[{\bf R} \otimes k_1(\cdot)]\right)

I do not have a formal proof of this; however, we can check that it really is the case by comparing moments from the Gaussian field representation and the auto-regressive representation. There are six moments we need to compare, namely

  1. \langle X_{1}(s,t), X_{1}(s,t+1)\rangle
  2. \langle X_{1}(s,t), X_{1}(s,t)\rangle
  3. \langle X_{2}(s,t), X_{2}(s,t+1)\rangle
  4. \langle X_{2}(s,t), X_{2}(s,t)\rangle
  5. \langle X_{2}(s,t+1), X_{1}(s,t)\rangle
  6. \langle X_{1}(s,t), X_{2}(s,t)\rangle

Using the spatio-temporal covariance structure we obtain the following

  1. \langle X_{1}(s,t), X_{1}(s,t+1)\rangle = \rho k_1(\cdot)
  2. \langle X_{1}(s,t), X_{1}(s,t)\rangle = k_1(\cdot)
  3. \langle X_{2}(s,t), X_{2}(s,t+1)\rangle = \rho (a^2 + b^2) k_1(\cdot)
  4. \langle X_{2}(s,t), X_{2}(s,t)\rangle = (a^2 + b^2) k_1(\cdot)
  5. \langle X_{2}(s,t+1), X_{1}(s,t)\rangle = a\rho k_1(\cdot)
  6. \langle X_{1}(s,t), X_{2}(s,t)\rangle = ak_1(\cdot)

From the auto-regressive form we get exactly the same covariances. Note that to extract them from the AR process we need to evaluate the moments of the stationary distribution, which can be found by noting the following:

var(X_{1}(s,t)) = \rho^2 var(X_1(s,t)) + (1-\rho^2)k_1(\cdot)

var(X_{2}(s,t)) = \rho^2 var(X_2(s,t)) + (1-\rho^2)(a^2 + b^2)k_1(\cdot)

cov(X_1(s,t),X_{2}(s,t)) = \rho^2 cov(X_1(s,t),X_2(s,t)) + cov(e_1(s),e_2(s))

Solving the first two gives var(X_1(s,t)) = k_1(\cdot) and var(X_2(s,t)) = (a^2 + b^2)k_1(\cdot), while the third, with cov(e_1(s),e_2(s)) = (1-\rho^2)ak_1(\cdot) from the innovation covariance above, gives cov(X_1(s,t),X_2(s,t)) = ak_1(\cdot); the lag-one moments then follow by multiplying through by \rho, which gives the required result. Therefore we have found the auto-regressive representation for a bivariate spatio-temporal process which, as expected, is somewhat simplistic, since the propagation matrix does not include off-diagonal terms. It would be great to see whether we can extend this to more complex multivariate spatio-temporal models. There is probably an avenue for this through the integro-difference equation and its Fourier representation, as seen in Storvik’s paper.
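Finally, a Monte Carlo check of the bivariate representation (again my own sketch; the exponential k_1 and the values of \rho, a and b are arbitrary choices for illustration): simulate one step of the bivariate AR recursion from its stationary distribution and compare the six empirical covariances with the expressions listed above.

```python
import numpy as np

rng = np.random.default_rng(4)
p, N = 8, 200_000                             # spatial sites, independent replicates
rho, a, b = 0.7, 0.6, 0.8

s = rng.uniform(size=(p, 2))
D = np.linalg.norm(s[:, None] - s[None, :], axis=-1)
K1 = np.exp(-D / 0.3)                         # k_1: exponential spatial covariance
R = np.array([[1.0, a], [a, a**2 + b**2]])

S0 = np.kron(R, K1)                           # stationary covariance R (x) k_1
L0 = np.linalg.cholesky(S0 + 1e-10 * np.eye(2 * p))
Le = np.sqrt(1 - rho**2) * L0                 # factor of the innovation covariance (1-rho^2)[R (x) k_1]

X = L0 @ rng.standard_normal((2 * p, N))      # stationary draw of (X_1(., t), X_2(., t))
Xnew = rho * X + Le @ rng.standard_normal((2 * p, N))   # one AR step to time t+1

i1, i2 = 0, p                                 # X_1 and X_2 at site s_0
j1, j2 = 1, p + 1                             # X_1 and X_2 at site s_1

checks = {
    "<X1(s0,t), X1(s1,t+1)>": (np.cov(X[i1], Xnew[j1])[0, 1], rho * K1[0, 1]),
    "<X1(s0,t), X1(s1,t)>":   (np.cov(X[i1], X[j1])[0, 1],    K1[0, 1]),
    "<X2(s0,t), X2(s1,t+1)>": (np.cov(X[i2], Xnew[j2])[0, 1], rho * (a**2 + b**2) * K1[0, 1]),
    "<X2(s0,t), X2(s1,t)>":   (np.cov(X[i2], X[j2])[0, 1],    (a**2 + b**2) * K1[0, 1]),
    "<X2(s0,t+1), X1(s1,t)>": (np.cov(Xnew[i2], X[j1])[0, 1], a * rho * K1[0, 1]),
    "<X1(s0,t), X2(s1,t)>":   (np.cov(X[i1], X[j2])[0, 1],    a * K1[0, 1]),
}
for name, (emp, exact) in checks.items():
    print(f"{name}: empirical {emp:+.3f}, exact {exact:+.3f}")   # the pairs should be close
```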