Notes from Spatial Statistics 2015 (conference)

This year I attended the Spatial Statistics Conference in Avignon, France. The conference was attended by about 300 people, mostly spatial statisticians, so I definitely felt at home there. I found it well-organised although it was unfortunate that the organisers decided to have approximately zero minutes for question-time. While I agree that a 20 minute talk is a good length, I would have opted for 25 minutes per talk and a specific 5 minutes for questions and discussion. The usual “we have time for one quick question” was getting tiring by the end; it definitely put off people like me who think twice before asking a question and defeats on of the biggest purposes of the conference, that of interaction among the delegates.

I thought that most presentations were excellent. Mentioning everything that I thought was of note would be a futile exercise, however a few comments  will not go amiss.

Wednesday

The conference kicked off with an excellent talk by Michael Stein on computer model emulation, where deterministic model outputs depend only initial conditions and the actual CO2 trajectory, and the output is temperature. Michael focused heavily on spectral representations of emulators. Essentially, if f0(omega), f1(omega) are the initial and final time spectral representations of the emulator, while g0(omega), g1(omega) are those of the deterministic model, then one can try to find a statistical model (emulator) f1 such that f1/f0 is approximately g1/g0. One can then focus on which spectral characteristics to capture in the model, capturing all spectral components will obviously lead to a more complicated (statistical) model.

The presentations I attended in the morning of the first day were a nice mix of application ant theory. J. B. Lecomte (AgroParisTech) talked about an application of a zero-inflated count model to biomass data, while M. Nguyen (Imperial College) talked about tempo-spatial OU processes, constructed by taking a product of a temporal OU and a spatial OU process. This tempo-spatial process can be written down as an integral with respect to a Levy basis. Only 1D examples were presented and inference was based on moment-matching. More details here. M. Bourette (INRA) talked about pairwise likelihood for multivariate models with non-separable covariance functions (which I think was the Apanasovich and Genton model).

The day proceeded with a talk of J. Wallin on parameter estimation of non-Gaussian random fields. This talk particularly interested me, as Wallin (and Bolin) use (conditional Gaussian) Markov random fields to approximate SPDEs, that in turn have generalised asymmetric Laplace, or Normal Inverse Gaussian fields as their solution. The problem of such models is inference, and most of the talk by Wallin consisted of saying “and we cannot use this” “and we cannot use that” with very valid reasons in each case (cannot sample, cannot compute the log-likelihood and so on). Wallin had successfully used Monte Carlo EM for estimating the parameters; the new method he presented is claimed to be more efficient but I do not recall the details, which will be published somewhere soon.

Another interesting talk I attended today was by Alastair Rushworth (Strathclyde) on spatio-temporal smoothing for estimating the health impact of air pollution. I found this talk interesting for two reasons. The first is that this is one of the few applications in environmental sciences where we see cascaded (statistical) spatio-temporal models, where the output of one (statistical) spatio-temporal model is used to provide samples (in an MCMC chain) for the next (statistical) spatio-temporal model. In this case, Alastair was taking the air pollution model of Sujit Sahu and using this to capture the variability of air pollution in a “public health” model he was interested in. Such cascading is prolific in the deterministic world but presents a problem in a Bayesian world where one “should” assume that we can learn something about air pollution from the nation’s health. Arguably (and depending on the model) little can be learnt about air pollution levels from the nation’s health IF there is air pollution data available, and the implicit assumption they are taking is that Y1 conditioned on Z1 is approximately independent of Y2, which is a valid assumption in many cases. I used the same reasoning in the modelling of sea-level rise contribution in Antarctica. Note that I have used Y1 and Y2 in my explanation here — in fact this problem can be nicely seen as a bivariate model built using the conditional approach, as I pointed out to Sujit and Alastair after the talk. The second thing that I liked about this talk was structural estimation. Alastair attempted to estimate the neighbourhood matrix (subject to a positive-definite constraint) by placing a normal prior on a logistic transform of the elements (so that they are between zero and one). Some very interesting boundaries from elements close to zero cropped up in the map of the UK following this inference which didn’t seem at all random (one set of boundaries clearly was outlining Yorkshire in my opinion!).

Two other talks were of interest to me today. The first was on basis reduction of a spatial log-intensity in an LGCP. What intrigued me here (and which I do not yet fully understand the reasoning behind) was the use of a dual reduction a spatial field. In this work, S. Montagna (Warwick) first projects the continuous log-intensity using a set of spline basis, and the target inference becomes the weights of these basis (in the usual fashion). However, the number of basis used (say, p) could still be prohibitive, therefore a latent factor model is used on these coefficients, where the number of factors k is much less than p. The other talk was on Gaussian excursion sets by A. Astrakova (Bergen). In excursion set estimation, one is intent on find a set X defined as the set of points x such that Y(x) > C, where C is some threshold defining, for example, the 95% level (and this is generally unknown). The author assumed that the marginal variance of the underlying GP is known, but that the scale factor and the threshold are also unknown. Gibbs sampling seems to work particularly well in this context, since there is conditional independence of the scale parameter and C given the process. This work seems very related to that of Zhang, Craigmile and Cressie that uses Baddeley’s loss function to predict the exceedance region.

Thursday

Today opened up with a talk on estimation with massive data by Marc Genton. Marc Genton works at the King Abdullah University of Science and Technology (KAUST) and I was glad that Marc took the time to talk about this institute for some time. KAUST, in Saudi Arabia is investing a lot in climate science. Things of note are the new Cray system (XC40) that KAUST has purchased that contains on the order of 200,000 cores. If I’m not mistaken, this is nearly double that of the UK national supercomputer! (~ 120,000 cores). Granted, UK’s Archer XC30 is nearly 2–3 years old now but that is definitely a big machine by any standard. Also, with nearly 1PB of shared memory, possibilities are endless (although this size is way beyond what current methods in statistics can take advantage of, in my opinion). Also of interest was the virtual environment built and designed (with the help of Stefano Castruccio) to visualise global data in 3D. Immersive environments can be very useful, and being able to rotate a global and zoom in and zoom out just by waving of the hands (while seeing everything through a virtual reality device such as the Oculus Rift is very intriguing. I hope to visit this place sometime soon!

I found Marc’s talk interesting for reasons other than KAUST. One of the biggest challenges in spatial statistics is parameter estimation of spatial max-stable processes (e.g., the Brown Resnick or Reich-Shaby processes). While standard GP operations scale as n^3, computations and memory storage with max-stable processes scale as n^n. Thus, many naturally use composite likelihood to split up the estimation into small groups (pairwise and triplewise being the most common). I found it disturbing that tests with the full likelihood could only be done with up to nine data points, and even with only nine data points it took two weeks on a big cluster. A quick look at the form of the likelihood showed that this problem is (very) embarassingly parallel, is this one place where maybe massive banks of GPUs can play a big role? As even with pairwise likelihood on a cluster, things start to get a bit problematic with 200K+ data points. Marc also talked about emulation of temperature in 3D, which could only be done by distributing matrix computations in the cluster. The work (which I think Stefano Castruccio did a lot of given his later talk) consisted of 1 Billion data points produced by the NCAR CCSM4, 6 realisations, 251 years and a lon/lat grid of size 288 x 192 and a vertical resolution of 17 levels. I remember reading the paper which dealt with this problem, and it consisted of doing it in a step-wise manner, first inferring individual time-series, then grouping on latitude bands, then on longitude strips and then vertically.

The day continued with my talk on bivariate models in atmospheric physics, and an interesting talk by F. Lavancier (Nantes) on how one can combine estimators in order to improve an estimator. The content of this talk seems remarkably fundamental, and most of the audience seemed incredulous that one can always improve on an estimator by combining them with the only condition that at least one of these is consistent. I cannot comprehend this personally, as my consistent estimator surely cannot be improved by combining it by a really simple (possible naive and just plain wrong) estimator? Another talk of interest was by Stefano Castruccio, which delved into the concept of emulation as a means of “compressing” climate data. I think this is of very practical use — climate data is too large to store so why not train a statistical model to summarise it? Stefano considered the massive example Marc Genton talked about, and showed some emulator outputs together with some climate model outputs and one could not visually tell the difference. But the natural question to ask in this context (which I didn’t because there was no time to!) is, “how do I summarise the data such that any future analysis is still valid?”. Obviously one cannot do this, and implicitly, when emulating, one is always compromising information. For example, one can decide to focus on the trend, or on the variability, but not on seasonal components or maxima/minima. This is a big problem: What may be of interest today may not be if interest tomorrow and vice versa. If I compress the data with wrong judgments on what is needed to be preserved, the cost of “wasting” simulator runs could be enormous, possibly much more than buying another bank of 1TB disks (and we are talking of 1TB/day data). Stefano’s data set consisted of about 30M points and his model used an “evolutionary spectrum” to cater for non-stationarity. See http://www.mas.ncl.ac.uk/~nsc141/CMSJM.JClim.14.pdf for more details.

Thursday concluded with an exquisite dinner, just off Palaces du Papes in what was meant to be an old water cistern, very impressive setting!

Friday

By Friday we were all a bit tired but one cannot overlook David Bolin’s talk, which eventually won the Best Talk Prize of the conference. David Bolin’s premise, which is now on ArXiv, is that one can carry out covariance tapering with a spatially-varying taper. The obvious question is validity, but in the introduciton to Section 2 in the paper, one can see that this is accommodated by assuming a convolution form for the tapered covariance function. David (and Wallin) then show how one can use hyperspherical tapers in order to make this convolution analytically tractable.

So why would one want to have a variable tapering covariance function? In the beginning of his presentation, David showed that tapering is actually quite bad where the data is sparse, and has nearly no effect when the data is dense. The reason is that where data is dense, there are a lot of zeroes on the covariance matrix, and a lot of borrowing of strength between the data points. When data is sparse, a row in your covariance matrix could even have just one element! Therefore, the obvious approach would be to try and vary your tapering function spatially so that each row in your covariance matrix has the same number of non-zeros — the resulting taper has a large range where data is sparse, and a small range where data is dense. The improvements over the standard tapering (spatially-invariant) function are, as expected, quite dramatic.

This was, overall, a great conference, and I look forward to seeing updates from the above researchers next year at ISBA or METMA.