# Multi-Scale Process Modelling and Distributed Computation for Spatial Data

I recently published a paper, joint with Jonathan Rougier, on modelling, and computing with, multi-scale spatial processes (click here for the arXiv version). A multi-scale spatial process is, simply put, a superposition of processes with differing spatial scales, where each component process has its own ‘spectral signature,’ or ‘frequency bandwidth.’ The first process captures coarse scales. The second process captures finer scales, and so on.

Computing with multi-scale spatial processes is not easy. A straightforward way to proceed would be to introduce some notion of independence at the finer scales, by first tiling up the process and then separately computing with variables and parameters associated with these tiles. While this approach brings with it an ease for modelling nonstationarity, it does not yield a realistic, global, model; indeed the process at the finer scale is deemed to be discontinuous at the boundaries of the tiles. The main aim of this work was to come up with a computational strategy to avoid the independence assumption, while allowing one to model nonstationarity at the finer scales.

The strategy we converged on is the following:

1. Model the process as a sum of independent Gaussian Markov random fields (GMRFs) to take advantage of sparse linear algebraic libraries when making inference.
2. Associate each GMRF with a spatial scale, and use informative prior distributions to ‘separate’ the scales associated with each GMRF.
3. Model the first (coarse) scale GMRF as stationary, and subsequent GMRFs as nonstationary, with increasing degrees of nonstationarity (faster fluctuating length scales, etc.) with decreasing scale.
4. Define each GMRF over a tessellation of the domain of interest, where finer tessellations are associated with GMRFs with finer scales.
5. Use the conditional independence and distribution properties of GMRFs (exploiting graph colouring and collapsed distributions) to generate a Markov chain associated with the process that mixes well and converges quickly to the target (posterior) distribution.

This paper took many years in the making (about 6 years), partially because it was always a project ‘on the side,’ but also because we were not happy with early iterations of the model. In the first model iteration we only assumed one scale, and simply took advantage of the computational benefits of GMRFs via graph colouring to make inference computationally efficient, but the novelty associated with this single-scale model was questionable: both graph colouring and nonstationary GMRFs are not new, although to the best of my knowledge they have not been treated together.

In the second model iteration we introduced multiple scales, but assumed spatial block-independence at these finer scales. This created posterior standard errors and predictions that were very unrealistic, with sharp discontinuities at the boundaries. I still have an early image of the meshing strategy we adopted, which I show below.

In the third model iteration we considered global models for the finer scales as well, and used a graph colouring scheme together with a small approximation to the conditional distribution (eq. 11 in the paper) to make inference on the finer scale component of the processes. This worked well, but GMRF variables close to the boundaries of the tiles, created for the purpose of graph colouring, did not mix well. Furthermore, there was no smoothness constraint on the spatial variances and length scales, which is problematic. I found an image of this early attempt, where I show serious issues with the prediction standard error at the boundaries of the tiles used for graph colouring, together with unrealistic heterogeneity..

In the fourth model iteration we used basis functions to model the variances and length scales of the GMRF at the finer scale, and introduced a series of rotating tiling/colouring schemes. The latter modification solved the mixing issues we had at the boundaries. This final model ‘felt right;’ it worked well, both from a representational point of view, and a computational point of view. The posterior prediction and prediction standard error maps using this fourth iteration are shown at the beginning of this blog post, where we effectively model (and not just predict) globally the sea-surface temperature at a 30 x 30 km resolution (we can predict at much higher resolutions if we want to).

All in all this project was very rewarding, and I believe it has provided a demonstrable advance in spatial modelling and computation. The results compellingly show an improvement over other state-of-the-art methods for spatial prediction. The only real downside of this approach is the effort required to implement it: it requires one to construct informative prior distributions, manually tessellate the domain, and run a Gibbs sampler for several days for a problem with approximately 1,000,000 data points! Options are currently being explored to automate and speed up the modelling and fitting procedures, but, at the time of writing, there does not seem to be an easy way forward.

Reproducible code for this project is available here.

# Partial matrix inverse for efficient computation of posterior predictive variances

It is becoming increasingly common in large-scale environmental applications exhibiting various scales of interactions to let $Q$ be large ($n > 10000$) but sparse (Lindgren, 2011). While the large dimension allows for considerable flexibility in representing spatio-temporal fields, the sparsity guarantees the use of numerical methods with favourable computational and memory complexities. One of these methods are the so-called Takahashi equations (Campbell, 1995), which compute recursively a partial inversion matrix (primarily to obtain the marginal variances on the diagonal) using only the sparse Cholesky factor of the precision matrix. This recursive update scheme is an integral component of several inference schemes, including INLA (Rue, 2009).

Recently, I came across one area in my work which could make excessive use of a property of the resulting incomplete inverse resulting from the Takahashi equations, to speed up computations. Unfortunately, the application for which this was intended did not materialise in practice (due to other problems emerging in the bigger picture), but this realisation deserves attention, hence this post.

The item of interest is the computation of the quantity $\textrm{diag}(C\Sigma^*C^T)$ where $\Sigma^* = (C^TQ_oC + Q)^{-1}$, $C$ is the incidence matrix (from the observation model $z = Cx + e$ and $Q_o$ is  the observation precision matrix which is assumed to be diagonal. This quantity would be needed, for example, if we need to compute (i) second-order statistics concerning parameters appearing in a fine-scale component (here absorbed in $e$), such as the stochastic volatility model of Katzfuss and Cressie (2012), or (ii) simply the posterior predictive variance at each observation location.

A useful identity allows us to re-write $\textrm{diag}(C\Sigma^*C^T)$ as

$\textrm{diag}(C\Sigma^* C^T) \equiv [C\Sigma^*\circ C]1~~~~ (1)$

where $\circ$ is the Hadamard product. Although this might at first seem a trivial quantity to compute, in the large systems we are concerned with, $C \in \mathbb{R}^{m \times n}$, $m$ can be in the millions and $n$ in the tens (or hundreds) of thousands. A standard backward-forward solve for $C\Sigma^*$ is thus intractable even with the use of fill-in-reduction permutations and although both $C$ and $Q^* = \Sigma^{*^{-1}}$ are sparse and the Cholesky factor of $Q^*$ is easily computed.

We deal with this computational bottleneck by first noting that since $C$ is usually very sparse, in effect only a few elements of $\Sigma^*$ are needed in the right hand side of (1). Denote the covariance matrix with all unnecessary elements zeroed out as $\Sigma^{**}$. We can relate $\Sigma^{**}$ to $\Sigma^*$ using the zeroes function which we define as

$zeroes(\Sigma)_{ij} = \left\{\begin{array}{l} 0 ~~~~ \Sigma_{ij} = 0\\ 1 ~~~~ \Sigma_{ij} \ne 0 \end{array} \right.$

Then $zeroes(\Sigma^{**}) < zeroes(\Sigma^*)$ implies that each element which is non-zero in $\Sigma^{**}$ is also non-zero in $\Sigma^*$ (but not vice-versa). For this problem it can then be shown using the left-hand side of (1) that $zeroes(\Sigma^{**})$ is given by

$zeroes(\Sigma^{**}) \equiv zeroes(C^T C)~~~~~(2)$

The right hand side of (2) implies that the only elements of the full posterior covariance matrix $\Sigma^*$ that are needed for computing (1) are those which are non-zero in $(C^T C)$. This property is not of much help if we do not know $\Sigma^*$ in the first place. However, we can exploit the fact that the elements of $\Sigma^*$ that we can easily know, using the Takahashi equations, are a superset of those in $\Sigma^{**}$.

To see this, note that since $Q^{*} = C^TQ_oC + Q$ we can assert the following

$zeroes(C^T C) \le zeroes(Q^*)~~~~~(3)$

Further by Corollary 2.2 in Rue (2005)

$zeroes(Q^*_L) \le zeroes(L^*)~~~~~~(4)$

where the subscript $L$ denotes the lower triangular matrix and $L^{*}$ is the lower Cholesky factor of $Q^*$. This is of significance as the Takahashi equations compute all values of $\Sigma^*$ for which the Cholesky factor is not zero when evaluating the marginal variances (Rue, 2009). Denote this partial matrix inverse as $\Sigma^{***}$. Then we can assert that

$zeroes(L^*) \le zeroes(\Sigma_L^{***})~~~~~~~(5)$

The key result follows by combining deductions (2)—(5) and noting that $\Sigma^{***}$ is symmetric to obtain the inequality

$zeroes(\Sigma^{**}) \le zeroes(\Sigma^{***})$

Hence all the elements needed for the multiplication in (1) are already available in $\Sigma^{***}$, which is relatively easy to compute using fast linear algebraic routines. This means that we can substitute $\Sigma^*$ with $\Sigma^{***}$ or, even better, with $(\Sigma^{***} \circ zeroes(C^T C))$, and effectively replace a huge forward-backward solve by a sparse matrix multiplication. A notable advantage is that this operation can be parallelised with minimal effort resulting in significant computational savings (in our implementation we made use of the OpenBLAS library, see previous post). As an indication of achievable speeds, for the problem, in the current study, using the partial matrix inverse for the evaluation of (1) required only 0.3s on a high-end desktop with 32GB of memory, despite the Cholesky factor being on the order of 0.5Gb. Not enough memory was available to compute (1) directly.

Thanks go to Jonathan Rougier for considerable help in organising the ideas for this post.