Multi-Scale Process Modelling and Distributed Computation for Spatial Data

I recently published a paper, joint with Jonathan Rougier, on modelling, and computing with, multi-scale spatial processes (click here for the arXiv version). A multi-scale spatial process is, simply put, a superposition of processes with differing spatial scales, where each component process has its own ‘spectral signature,’ or ‘frequency bandwidth.’ The first process captures coarse scales. The second process captures finer scales, and so on.
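In symbols (using illustrative notation rather than the paper's), the process Y(\cdot) is written as a sum of K independent components,

Y(s) = Y_1(s) + Y_2(s) + \dots + Y_K(s),

with Y_1 capturing the coarsest scales and each subsequent component capturing progressively finer ones.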

Global map of sea-surface temperature anomaly produced from approx. 1,000,000 data points, using a spatial process model with two scales. Left panel: Prediction (posterior mean). Right panel: Uncertainty (posterior standard error).

Computing with multi-scale spatial processes is not easy. A straightforward way to proceed would be to introduce some notion of independence at the finer scales, by first tiling up the process and then separately computing with variables and parameters associated with these tiles. While this approach brings with it an ease for modelling nonstationarity, it does not yield a realistic, global, model; indeed the process at the finer scale is deemed to be discontinuous at the boundaries of the tiles. The main aim of this work was to come up with a computational strategy to avoid the independence assumption, while allowing one to model nonstationarity at the finer scales.

The strategy we converged on is the following:

  1. Model the process as a sum of independent Gaussian Markov random fields (GMRFs) to take advantage of sparse linear algebraic libraries when making inference.
  2. Associate each GMRF with a spatial scale, and use informative prior distributions to ‘separate’ the scales associated with each GMRF.
  3. Model the first (coarse) scale GMRF as stationary, and subsequent GMRFs as nonstationary, with increasing degrees of nonstationarity (faster fluctuating length scales, etc.) with decreasing scale.
  4. Define each GMRF over a tessellation of the domain of interest, where finer tessellations are associated with GMRFs with finer scales.
  5. Use the conditional independence and distribution properties of GMRFs (exploiting graph colouring and collapsed distributions) to generate a Markov chain associated with the process that mixes well and converges quickly to the target (posterior) distribution. A minimal sketch of the colour-wise updates is given below.

A colouring of the spatial domain in the global sea-surface temperature application. Each colour contains process variables at the fine scale that are sampled concurrently within a Gibbs sampling framework.
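To make the colour-wise sampling in step 5 concrete, here is a minimal sketch (illustrative only, and not the code used in the paper; sample_block() is a hypothetical routine that draws a block of variables from its full conditional distribution):

# One Gibbs sweep over the colour classes. Variables sharing a colour are
# conditionally independent of one another given the rest of the field, so
# each class can be updated as a single block (and, in principle, in parallel).
gibbs_sweep <- function(x, colour, sample_block) {
  for (k in sort(unique(colour))) {
    idx <- which(colour == k)
    x[idx] <- sample_block(idx, x)  # draw from p(x[idx] | x[-idx], data)
  }
  x
}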

This paper was many years in the making (about six), partly because it was always a project ‘on the side,’ but also because we were not happy with early iterations of the model. In the first model iteration we assumed only one scale, and simply took advantage of the computational benefits of GMRFs via graph colouring to make inference efficient. The novelty of this single-scale model was questionable, however: neither graph colouring nor nonstationary GMRFs are new, although to the best of my knowledge they had not been treated together.

In the second model iteration we introduced multiple scales, but assumed spatial block-independence at these finer scales. This created posterior standard errors and predictions that were very unrealistic, with sharp discontinuities at the boundaries. I still have an early image of the meshing strategy we adopted, which I show below.

Our second model iteration of a two-scale process on the sphere used a coarse mesh (with vertices at the blue dots, just about visible) superimposed on separately-constructed fine meshes (with vertices at the yellow dots). Overlap of the fine meshes was introduced to reduce the effect of boundary conditions.

In the third model iteration we considered global models for the finer scales as well, and used a graph colouring scheme together with a small approximation to the conditional distribution (eq. 11 in the paper) to make inference on the finer-scale component of the process. This worked well, but GMRF variables close to the boundaries of the tiles (created for the purpose of graph colouring) did not mix well. Furthermore, there was no smoothness constraint on the spatial variances and length scales, which is problematic. I found an image of this early attempt, shown below, which exhibits serious issues with the prediction standard error at the boundaries of the tiles used for graph colouring, together with unrealistic heterogeneity.

Prediction standard errors using our third model iteration, which did not have any smoothness constraints on the spatially-varying variances and length scales, and did not allow for good mixing of the Markov chain at the boundaries of the tiles used for graph colouring. Note how the poor mixing manifests itself in poor estimates of the prediction standard error at the tile boundaries.

In the fourth model iteration we used basis functions to model the variances and length scales of the GMRF at the finer scale, and introduced a series of rotating tiling/colouring schemes. The latter modification solved the mixing issues we had at the boundaries. This final model ‘felt right’: it worked well from both a representational and a computational point of view. The posterior prediction and prediction standard error maps from this fourth iteration are shown at the beginning of this blog post, where we effectively model (and not just predict) sea-surface temperature globally at a 30 x 30 km resolution (we can predict at much higher resolutions if we want to).
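To give a flavour of this construction (again with illustrative notation rather than the paper's), a spatially varying parameter such as the log marginal variance of the fine-scale GMRF is expanded in a small number of smooth basis functions,

\log \sigma^2(s) = \sum_{k=1}^{r} \phi_k(s)\alpha_k,

so that the smoothness of the \phi_k carries over to the estimated variance and length-scale fields.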

Three tilings of our domain used in our Gibbs sampler to alleviate the problem of poor mixing at tile boundaries.

All in all this project was very rewarding, and I believe it has provided a demonstrable advance in spatial modelling and computation. The results compellingly show an improvement over other state-of-the-art methods for spatial prediction. The only real downside of this approach is the effort required to implement it: it requires one to construct informative prior distributions, manually tessellate the domain, and run a Gibbs sampler for several days for a problem with approximately 1,000,000 data points! Options are currently being explored to automate and speed up the modelling and fitting procedures, but, at the time of writing, there does not seem to be an easy way forward.

Reproducible code for this project is available here.

Notes from DSSV 2017 (Conference)

The inaugural conference on Data Science, Statistics and Visualisation (DSSV2017) was an interesting change from other conferences. The audience was mixed, with a good proportion (roughly 50%) coming from industry. Many sessions reminded me of the Meetups that occur in Sydney; the balance between talks containing math and talks remaining at a higher level was appealing.

In this conference I talked about the medal plot, a graphic for visualising uncertainty in space-time fields that Jonty Rougier and I developed during my stay in Bristol. The medal plot can be used to see the effect of considering spatial correlations in your model, and the effect of your prior judgement on your posterior uncertainties. More details are given in our paper here.

The conference opened with an interesting talk by Trevor Hastie, who gave a run-through of subset selection methods before introducing a new one: the relaxed LASSO. Some interesting points, several of which we were already aware of:

  • In Trevor’s experience, forward subset selection is in practice about as good as best subset selection, but of course orders of magnitude faster. Forward selection also works when n < p, unlike (of course) backward selection.
  • An exhaustive search for best subset selection can be easily carried out on up to 35 covariates using the leaps package in R.
  • Recently, Bertsimas, King, and Mazumder cracked the best-subset selection curse of dimensionality using mixed-integer programming and the GUROBI solver (which can be used from R). However, is best subset selection always best? No: it tends to overfit the data.
  • The LASSO can be seen as a “convex relaxation” of the best-subset selection problem (which has an L0 penalty term instead of an L1, and hence a non-convex cost function).
  • In GLMNET, the LASSO is implemented using pathwise coordinate descent, which computes the solution over a fine grid of the regularisation parameter lambda.
  • Trevor spent some time talking about the connections between Least-Angle Regression, the LASSO, and forward subset selection, which was interesting.
  • In the relaxed LASSO (which was new to me) one uses the LASSO to obtain the non-zero terms, and then computes a linear combination of the LASSO estimates and the OLS estimates of those non-zero terms (a small sketch is given just after this list). Note that this is different from the elastic net, in which the cost function contains both an L1 and an L2 penalty, but it is similar in spirit. Trevor claimed that this reduces the shrinkage bias of vanilla LASSO methods.
  • Trevor showed the results of an extensive simulation study containing experiments with different SNRs and covariate intra-correlations. Interestingly, the best subset and forward subset selection methods only outperformed the LASSO methods at high SNRs. The relaxed LASSO consistently outperformed LASSO.
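The sketch below gives my reading of the relaxed-LASSO recipe (not Trevor's code; the mixing weight gamma is illustrative): fit the LASSO, refit OLS on the selected variables, and blend the two sets of estimates.

library(glmnet)

# Relaxed lasso, roughly as I understood it from the talk: the lasso is used
# for selection, and the final estimate is a gamma-weighted blend of the lasso
# and OLS coefficients on the selected variables. gamma = 1 recovers the lasso.
relaxed_lasso <- function(x, y, lambda, gamma = 0.5) {
  fit <- glmnet(x, y, lambda = lambda)
  beta_lasso <- as.numeric(coef(fit, s = lambda))[-1]  # drop the intercept
  beta_relaxed <- beta_lasso
  active <- which(beta_lasso != 0)
  if (length(active) > 0) {
    ols <- lm(y ~ x[, active, drop = FALSE])
    beta_relaxed[active] <- gamma * beta_lasso[active] +
      (1 - gamma) * unname(coef(ols)[-1])              # blend with OLS
  }
  beta_relaxed
}

I believe more recent releases of glmnet expose this idea directly through a relax = TRUE argument to glmnet() and cv.glmnet().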

Another talk I found very interesting was that by Andre Martins on constructing sparse probability distributions by replacing the ubiquitous softmax function with a sparsemax function. More details are given in his paper here. The problem is formulated as an optimisation problem, in which the weights are projected onto the probability simplex. While I suspect this projection itself is not new, Andre constructed a loss function, inspired by that of the softmax, that yields simple optimisation algorithms.
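For the curious, here is a small sketch of the projection as I understood it (my own code, not from the paper): sort the scores, find a threshold, and clip.

# Sparsemax: Euclidean projection of a score vector z onto the probability
# simplex. Entries falling below the threshold tau are set exactly to zero,
# unlike softmax, which always returns strictly positive probabilities.
sparsemax <- function(z) {
  z_sort <- sort(z, decreasing = TRUE)
  k <- max(which(1 + seq_along(z_sort) * z_sort > cumsum(z_sort)))
  tau <- (sum(z_sort[1:k]) - 1) / k
  pmax(z - tau, 0)
}

softmax <- function(z) exp(z - max(z)) / sum(exp(z - max(z)))

sparsemax(c(2, 1.1, 0.3, -1))  # 0.95 0.05 0.00 0.00 -- a sparse distribution
softmax(c(2, 1.1, 0.3, -1))    # all entries strictly positive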

David Banks talked about emulation of agent-based models (ABMs), something I am personally sceptical about, since ABMs are useful precisely because they are chaotic in nature, a feature emulators would struggle to capture. Yet, when I enquired about this, David said he and his student had considerable success in emulation, and that the discrepancy term is able to capture the “unpredictable component” of ABMs fairly well, so I will keep an open mind on this one!

On the last day we had several interesting talks. Mario Figueiredo talked about the OSCAR loss function, which deals with correlated variables differently from the LASSO: while the LASSO will assign all the weight to one of a group of correlated variables, OSCAR will assign the correlated variables equal weight. There is an interesting connection between the “shrinking to exactly zero” of the LASSO and the “shrinking to exactly the same weight for correlated variables” of the OSCAR criterion. Mario extended this to an OWL criterion, which generalises the OSCAR criterion (but I’m not sure in what way!).
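For reference, my recollection of the definitions (which may be imperfect): OSCAR penalises \lambda_1 \sum_j |\beta_j| + \lambda_2 \sum_{j<k} \max\{|\beta_j|, |\beta_k|\}, while OWL (the ordered weighted \ell_1 norm) penalises \sum_j w_j |\beta|_{[j]}, where |\beta|_{[1]} \ge \dots \ge |\beta|_{[p]} are the sorted absolute coefficients and w_1 \ge \dots \ge w_p \ge 0 is a non-increasing weight sequence; OSCAR is then recovered by taking linearly decreasing weights.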

Peter Rousseeuw talked about detecting cell-wise outliers (as opposed to row-wise outliers) in data, and showed some impressive figures obtained using his package cellWise. The Technometrics paper describing the technique is here. I do have an issue with calling outliers and anomalies the same thing, something which every speaker at this conference seemed to do. In my opinion, an outlier is a property of the data, while an anomaly is a property of the process: an outlier is something we would like to filter out, while an anomaly is something we would like to keep, model, and even predict!

After Peter, Google (with three speakers) gave a talk on market attribution models that involved using non-stationary Markov models to simulate user behaviour. Finally, Daniel Keim gave a very impressive talk on visual analytics. He took the audience through a graphical journey involving Twitter feeds for disaster detection, IP monitoring for attack detection, and the analysis of soccer team strategies from millisecond data of players’ positions on the field. All this came after he had lambasted statisticians somewhat for solving “small” problems using only “selective” data (which, he conceded, we do “very well”).

I of course do not agree with Daniel; we do what we do not because we are scared of dealing with “very” big data, but because it is not always necessary to work with all the data one could possibly collect in order to solve big problems. Daniel focuses on some important fields which do indeed require a lot of data (and which are hence technologically very challenging), but I disagree that this is always “the real world”. Further, just as in visual analytics, we use process chains (from data collection, to data pre-processing, to feature selection, to modelling, to inference, to visualisation) of the kind needed in the truly big-data cases he described.

One point which concerned me was that there was no mention of uncertainty. This is not a problem when raw data are displayed, but when an inference is conveyed, decision-making on a point prediction without a notion of risk can be dangerous.

One of the graphics Daniel showed which I never tire of seeing was Anscombe’s quartet:


Source: https://en.wikipedia.org/wiki/Anscombe%27s_quartet

All four data sets shown above have the same marginal means, the same correlation coefficient, and the same best linear fit. Daniel used this to emphasise the importance of visualisation. An extreme case of this is shown in a lovely video by Justin Matejka; it is an excellent resource for STAT101!
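This is easy to verify in R with the built-in anscombe data frame (a quick check of my own, not from the talk):

# Summary statistics for the four Anscombe (x, y) pairs: the means, the
# correlation, and the fitted regression line are essentially identical.
sapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  fit <- lm(y ~ x)
  c(mean_x = mean(x), mean_y = mean(y), cor_xy = cor(x, y),
    intercept = unname(coef(fit)[1]), slope = unname(coef(fit)[2]))
})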

Installing Rpud on Ubuntu 14.04

`rpud` is an R package which allows for R operations on your NVIDIA GPU with minimal coding effort. It’s an excellent package and well worth looking into.

Unfortunately, its installation is not so straightforward on Ubuntu 14.04 (at least it wasn’t for me). The package comes with a help file, but this is of little use if you encounter an obscure error. Below, I outline what I did to overcome these difficulties.

1. Install from source

Unfortunately, one of the fixes I found requires you to edit the configuration file, so the source is needed. Download the source from CRAN and use ‘tar -zxvf’ to extract it into a folder of your choice.

2. Read the INSTALL file

The INSTALL file instructs the user to set the R_LIBS_USER environment variable. Put this in your ~/.bashrc file and, if you do not feel like restarting the terminal, set it for the current session by typing ‘export R_LIBS_USER=[…]’, where […] is your R library directory. In my case this was ~/R/x86_64-pc-linux-gnu-library/3.1/.
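As a quick sanity check (my own addition, not part of the INSTALL instructions), you can confirm from within R that the variable has been picked up:

Sys.getenv("R_LIBS_USER")  # should print the directory you exported
.libPaths()                # the directory should appear in this list (it must exist)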

3. You need to set configuration options during installation

If paths are not specified, the configuration file for rpud will fall back on defaults which are probably not right for your system. This is a problem, as the installation will then not be able to find “nvcc”, the NVIDIA compiler, which is obviously needed when compiling against the CUDA libraries! The install.packages command should therefore look something like the following:

install.packages("./rpud/", type = "source",
                 configure.args = c("--with-cuda-home=/usr --with-r-lib=/usr/lib/R/lib"),
                 repos = NULL, verbose = TRUE)

Note the --with-cuda-home and --with-r-lib flags; these need to be set according to your system. If ‘nvcc’ lives in ‘/usr/bin’, then the first flag should be just ‘/usr’ (rpud, by default, assumes something different).

4. Edit the configuration file

This is probably the trickiest part. There are two problems with the configuration file. First, rpud assumes that libcublas.so is in $CUDA_HOME/lib64. This is not always the case (on my system it was in $CUDA_HOME/lib/x86_64-linux-gnu), so do a quick search for ‘lib64’ in the configuration file and change it accordingly if your libcublas.so is in a different directory. I had to make this change in four or five places.

Second, and this is harder to debug, rpud assumes a (deprecated or different?) g++ version which accepts a "-Wl" option. My gcc version is 4.8.2 and does not know what -Wl means. When installing the package you will get repeated errors such as

g++: error: unrecognized command line option '-Wl'

In the configuration file the problem is with

RPATHFLAG="-Wl,-rpath,${CUDA_HOME}${CUDA_LIB_DIR}"

I did NOT find a proper workaround for this; however, simply emptying RPATHFLAG seemed to do the trick. So in the end I set

RPATHFLAG=

The installation then went through fine, and rpud works like a charm once installed.

Partial matrix inverse for efficient computation of posterior predictive variances

It is becoming increasingly common in large-scale environmental applications, which exhibit interactions at various scales, to let the precision matrix Q be large (n > 10,000) but sparse (Lindgren, 2011). While the large dimension allows for considerable flexibility in representing spatio-temporal fields, the sparsity permits numerical methods with favourable computational and memory complexities. One of these methods is the set of so-called Takahashi equations (Campbell, 1995), which recursively compute a partial matrix inverse (primarily to obtain the marginal variances on the diagonal) using only the sparse Cholesky factor of the precision matrix. This recursive update scheme is an integral component of several inference schemes, including INLA (Rue, 2009).
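For completeness, the recursion takes the form (as I recall it), with Q = LL^T and L the lower Cholesky factor,

\Sigma_{ij} = \frac{\delta_{ij}}{L_{ii}^2} - \frac{1}{L_{ii}}\sum_{k > i,\, L_{ki} \ne 0} L_{ki}\Sigma_{kj}, \qquad j \ge i,

evaluated from the bottom-right corner of \Sigma upwards and visiting only those (i, j) at which the Cholesky factor is non-zero.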

Recently, I came across an area of my work that could make extensive use of a property of the partial inverse produced by the Takahashi equations to speed up computations. Unfortunately, the application for which this was intended did not materialise (due to other problems emerging in the bigger picture), but the realisation deserves attention, hence this post.

The item of interest is the computation of the quantity \textrm{diag}(C\Sigma^*C^T), where \Sigma^* = (C^TQ_oC + Q)^{-1}, C is the incidence matrix (from the observation model z = Cx + e), and Q_o is the observation precision matrix, which is assumed to be diagonal. This quantity would be needed, for example, to compute (i) second-order statistics for parameters appearing in a fine-scale component (here absorbed into e), such as in the stochastic volatility model of Katzfuss and Cressie (2012), or (ii) simply the posterior predictive variance at each observation location.

A useful identity allows us to re-write \textrm{diag}(C\Sigma^*C^T) as 

\textrm{diag}(C\Sigma^* C^T) \equiv [(C\Sigma^*)\circ C]\mathbf{1}~~~~ (1)

where \circ is the Hadamard product. Although this might at first seem a trivial quantity to compute, in the large systems we are concerned with, where C \in \mathbb{R}^{m \times n}, m can be in the millions and n in the tens (or hundreds) of thousands. A standard backward-forward solve for C\Sigma^* is thus intractable, even with fill-in-reducing permutations, and even though both C and Q^* = (\Sigma^*)^{-1} are sparse and the Cholesky factor of Q^* is easily computed.

We deal with this computational bottleneck by first noting that, since C is usually very sparse, in effect only a few elements of \Sigma^* are needed on the right-hand side of (1). Denote the covariance matrix with all unnecessary elements zeroed out by \Sigma^{**}. We can relate \Sigma^{**} to \Sigma^* using the zeroes function, which we define as

zeroes(\Sigma)_{ij} = \left\{\begin{array}{l} 0 ~~~~ \Sigma_{ij} = 0\\ 1 ~~~~ \Sigma_{ij} \ne 0 \end{array} \right.

Here, zeroes(\Sigma^{**}) \le zeroes(\Sigma^*) means that each element which is non-zero in \Sigma^{**} is also non-zero in \Sigma^* (but not necessarily vice versa). For this problem it can then be shown, using the left-hand side of (1), that zeroes(\Sigma^{**}) is given by

zeroes(\Sigma^{**}) \equiv zeroes(C^T C)~~~~~(2)

The right-hand side of (2) implies that the only elements of the full posterior covariance matrix \Sigma^* needed to compute (1) are those which are non-zero in C^T C. This property is not of much help on its own, since we do not know \Sigma^* in the first place. However, we can exploit the fact that the elements of \Sigma^* that we can easily compute using the Takahashi equations are a superset of those in \Sigma^{**}.

To see this, note that since Q^{*} = C^TQ_oC + Q and Q_o is diagonal, we can assert (barring fortuitous cancellations) the following

zeroes(C^T C) \le zeroes(Q^*)~~~~~(3)

Further by Corollary 2.2 in Rue (2005)

zeroes(Q^*_L) \le zeroes(L^*)~~~~~~(4)

where the subscript L denotes the lower-triangular part and L^{*} is the lower Cholesky factor of Q^*. This is significant because, when evaluating the marginal variances, the Takahashi equations compute all the values of \Sigma^* at locations where the Cholesky factor is non-zero (Rue, 2009). Denote this partial matrix inverse by \Sigma^{***}. Then we can assert that

zeroes(L^*) \le zeroes(\Sigma_L^{***})~~~~~~~(5)

The key result follows by combining deductions (2)–(5) and noting that \Sigma^{***} is symmetric, to obtain the inequality

zeroes(\Sigma^{**}) \le zeroes(\Sigma^{***})

Hence all the elements needed for the multiplication in (1) are already available in \Sigma^{***}, which is relatively easy to compute using fast linear algebraic routines. This means that we can substitute \Sigma^* with \Sigma^{***} or, even better, with (\Sigma^{***} \circ zeroes(C^T C)), and effectively replace a huge forward-backward solve with a sparse matrix multiplication. A notable advantage is that this operation can be parallelised with minimal effort, resulting in significant computational savings (in our implementation we made use of the OpenBLAS library; see my previous post). As an indication of achievable speeds, in the current study evaluating (1) via the partial matrix inverse required only 0.3 s on a high-end desktop with 32 GB of memory, despite the Cholesky factor being on the order of 0.5 GB; there was not enough memory to compute (1) directly.
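As a sketch of how this looks in code (illustrative only; partial_inverse() stands in for any routine implementing the Takahashi equations, such as, if I recall correctly, inla.qinv() in the INLA package):

library(Matrix)

# Posterior predictive variances at the observation locations, i.e.
# diag(C Sigma* C'), computed via identity (1) using only the entries of
# Sigma* that are available in the partial inverse Sigma_ppp.
pred_var_at_obs <- function(C, Q, Qo_diag) {
  Qstar     <- crossprod(C, Diagonal(x = Qo_diag) %*% C) + Q  # C' Q_o C + Q
  Sigma_ppp <- partial_inverse(Qstar)  # entries on the Cholesky sparsity pattern
  # [(C Sigma) o C] 1, as in (1); entries of Sigma* absent from Sigma_ppp only
  # affect positions that the Hadamard product with C zeroes out.
  rowSums((C %*% Sigma_ppp) * C)
}

Zeroing out the entries of Sigma_ppp that fall outside the sparsity pattern of C^T C before the multiplication, as suggested above, reduces the cost further.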

Thanks go to Jonathan Rougier for considerable help in organising the ideas for this post.