Geostatistics in the fast lane using R-TensorFlow and GPUs

On 28th May 2019 at 6pm I will be giving a talk at the Annual Joint SSA Canberra and R-Ladies Event in Canberra on the use of GPUs for solving simple problems in geostatistics. I’m going to do it ‘workshop-style’ and hopefully give the attendees some (brief) hands-on experience with TensorFlow. Here are the details:

Date: Tuesday 28 May 

5.20pm Refreshments in Foyer, Ground level, Building 145 (Where is the Hanna Neumann (MSI) building?).

6.00pm Presentation in Room 1.33, Building 145  

In my talk I will briefly discuss spatial modelling and Gaussian process regression, and then describe TensorFlow and RStudio’s API to it. At the end of the talk I will show how one can implement maximum likelihood estimation in a geostatistical context using TensorFlow, and illustrate the speed improvement when using a GPU. The vignette I will be going through can be found here. Attendees are invited to bring their own laptop to run some simple commands I mention in the talk and the code in the vignette.


Our book is now published and available online!

Our new book Spatio-Temporal Statistics with R has now been published with Chapman & Hall/CRC. It was a great few years writing this book with Chris Wikle and Noel Cressie, and I’m overjoyed to see the finished product. The book can be downloaded for free here. A hard-cover edition of the book can be purchased from Chapman & Hall/CRC here (Discount code: ASA18).

Summary

Book Cover

The world is becoming increasingly complex, with larger quantities of data available to be analyzed. It so happens that many of these “big data” are spatio-temporal in nature, meaning that they can be indexed by their spatial locations and time stamps.

Spatio-Temporal Statistics with R provides an accessible introduction to statistical analysis of spatio-temporal data, with hands-on applications of the statistical methods using R Labs found at the end of each chapter. The book:

  • Gives a step-by-step approach to analyzing spatio-temporal data, starting with visualization, then statistical modelling, with an emphasis on hierarchical statistical models and basis function expansions, and finishing with model evaluation
  • Provides a gradual entry to the methodological aspects of spatio-temporal statistics
  • Provides broad coverage of using R as well as “R Tips” throughout
  • Features detailed examples and applications in end-of-chapter Labs
  • Features “Technical Notes” throughout to provide additional technical detail where relevant
  • Supplemented by a website featuring the associated R package, data, reviews, errata, a discussion forum, and more

The book fills a void in the literature and available software, providing a bridge for students and researchers alike who wish to learn the basics of spatio-temporal statistics. It is written in an informal style and functions as a down-to-earth introduction to the subject. Any reader familiar with calculus-based probability and statistics, and who is comfortable with basic matrix-algebra representations of statistical models, would find this book easy to follow. The goal is to give as many people as possible the tools and confidence to analyze spatio-temporal data.


A “reproducible package” for replicating code

What’s the issue?

Sometimes it is easy to get caught up in all the intricacies of the publication process: writing up, submitting, revising, re-submitting and so on. Frequently all this effort falls on one or two authors, which is a big ask in itself. Understandably, the urge is often to close off a chapter as quickly as possible in order to start a new one. Code and data are shrugged off as ‘having served their use’ and end up being stored on a computer somewhere, only to get misplaced or lost in the process of re-filing some ten years down the line.

We have all heard of the importance of commenting and organising your code and data; however, we should also focus on the ‘permanence’ of code and results. I believe that in a global research environment, where we constantly utilise (many times unknowingly) contributions from others amongst us and from generations before us, the algorithms used in research papers should be made publicly available, even if the data cannot be, for confidentiality reasons. In the very few cases where even the code cannot be disseminated, at the very least snippets of code reproducing some of the results in a publication should be provided.

There are several advantages to making your code reproducible with little effort required on the part of an external researcher:

  • It increases transparency. It is not an assurance that your code is optimal, or even correct for that matter. It is an expression of belief in your research: you would rather have someone re-use your code and find any inconsistencies, if there are any, than have something safely tucked away that nobody can investigate.
  • It increases trust. Making sure your results can be reproduced with ease shows fellow researchers that you have nothing to hide and that there were no ‘tricks’ needed to get the required results (or if there were, you can be upfront about them).
  • It ensures permanence. It is easy to replicate results from a script file that you coded on your machine, while you have that same machine and when you remember where the file is. But what about 10 years, 20 years down the line? Permanence ensures that if someone questions your methods a decade from now you are still able to provide an answer.

To ensure reproducibility I have come up with a protocol that I try to adhere to whenever possible.

A Reproducibility protocol

The protocol I outline below is specific to the programming language used predominantly in our group, R. It is heavily based on the book R Packages: Organize, test, document and share your code by Hadley Wickham.

  • Organise your code into self-contained functions and put them into an R package. This is the first and most important step of the reproducibility protocol. If possible, write automatic test functions to make sure the interface to the functions is predictable and robust to future modifications. Automatic testing is provided by the package testthat. The importance of packaging functions cannot be over-emphasised; it ensures encapsulation, so that they do not implicitly depend on global variables or other script-specific options that may be set. (A short sketch of these first steps is given after this list.)
  • Once your functions are in the package, document them using Roxygen2. The documentation does not need to be as rigorous and organised as required, for example, for a CRAN package, but it needs to be understandable and self-contained. Using Roxygen2 ensures that if the interface to the function is changed, the documentation will too.
  • Keep the data somewhere permanent. If not confidential, put your data in the data folder of your package or, if it is very large, in a data repository such as datahub.io, and make sure this is the copy actually used by your code (or at least referenced in a commented-out line if you cache the data locally). In any case, the raw data should be uploaded and not some modified version of it.
  • Set the random seeds. Most computational statistics involves the use of random number generators, which can be rendered reproducible by setting the seed at the beginning of the program. Things get a bit tricky when using parallel threads (for which set.seed() does not solve the problem). When working with parallel threads use the excellent package doRNG, which ensures that all your threads get a unique seed, but one that is the same every time the code is run.
  • Place the reproducible script as a vignette in the R package. Having a vignette reproducing your results is helpful, as an output document can be produced showing additional results and figures. If something is wrong, this can usually be seen at a glance from the output document. If the code takes a long time to run, insert flags in the code that ensure that, when in development mode, time-consuming results are cached in the data folder. When not in development mode these results can be loaded. All one then needs to do is ensure that the script is run once in development mode before consolidation.
  • Use the figures generated by the vignette in your paper TeX file. Not everybody is comfortable uploading the source files of a paper online (and hence putting your entire paper as the vignette), but that does not mean the paper source files cannot use the vignette figures. Once the TeX file is linked up to the correct figure folder, you can rest assured that the figures appearing in your document are not from your deprecated code, but from your updated, publicly available one.
  • Use an open-source repository such as Github. This not only ensures permanence of the code, but also accountability. Each change in the code is recorded. When the paper using the code is submitted you can release a version of the reproducible package. When the paper comes back for corrections, you will probably change the package and the vignette and on re-submission you will re-version the package. This has several advantages. If you are writing a second paper on the same topic that uses the same function, you do not need to be scared of altering the results of your previous paper since you can always revert to the package version that was submitted with the first paper.
  • Test on several platforms. This is particularly straightforward to do if you have a reproducible package. There is no messing around with Makefiles and compiler-specific options. Even if you have source code, the R package will seamlessly install on most platforms with minimal effort. Check the output on each platform and note down any inconsistencies. Make sure that the platform, and the versions of the key software on the machine used to generate the results, are clearly listed on the development Github page. If you are using a high-performance computer (HPC), try at least two HPCs. If this is not possible, then clearly state this on the development page.
  • Keep track of package versions. One of the most annoying realities is that keeping your end of the bargain in reproducibility is usually not enough: you depend on others to keep theirs. In open-source software backward compatibility is not adhered to as much as one would wish, and therefore one must keep track of the versions of the packages being depended on. At the very minimum you can record the output of sessionInfo() and paste it on your Github page. Alternatively you can use packrat or checkpoint to keep track of package versions.
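
To make the first two steps concrete, here is a minimal sketch of what a packaged, documented, tested and reproducibly-seeded workflow might look like. The package layout and function names (mypaper, pairwise_msd) are hypothetical and purely illustrative.

# R/pairwise_msd.R (hypothetical package "mypaper"): a self-contained,
# Roxygen2-documented function
#' Mean squared pairwise difference
#'
#' @param z Numeric vector of observations.
#' @return The mean squared pairwise difference of \code{z}.
#' @export
pairwise_msd <- function(z) {
  d <- outer(z, z, "-")       # all pairwise differences
  mean(d[upper.tri(d)]^2)     # average the squared upper-triangular part
}

# tests/testthat/test-pairwise_msd.R: an automatic test with testthat
library(testthat)
test_that("pairwise_msd is invariant to shifts", {
  z <- rnorm(10)
  expect_equal(pairwise_msd(z), pairwise_msd(z + 5))
})

# Reproducible parallel runs with doRNG: the same seed gives the same result
library(doParallel)
library(doRNG)
registerDoParallel(2)
set.seed(1)
res1 <- foreach(i = 1:4) %dorng% rnorm(1)
set.seed(1)
res2 <- foreach(i = 1:4) %dorng% rnorm(1)
identical(res1, res2)         # TRUE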

Getting MAGMA to work with R

MAGMA is a fascinating project. It is short for “Matrix Algebra on GPU and Multicore Architectures” and aims to allow users to easily “hybridise” their code. If I have a super-GPU I can do my operations there, if tomorrow I decide to sell my GPU and get a nice plant for my office instead, no worries, with MAGMA I will not need to change anything in my code (maybe just a flag), it’s that portable.

I’m going to be a bit of a wet blanket here and immediately point out that unless you have a great GPU (or multiple GPUs), or else you’re going to be using highly specialised code, you probably don’t need to install MAGMA, as CPUs today are highly optimised to perform linear algebraic routines efficiently.

However, if you DO want to install MAGMA in R, you can. Unfortunately it’s not as simple as “install MAGMA, then install the R package magma” and you’re done, for one simple reason: MAGMA has moved on, and the R package hasn’t.

1. Installing MAGMA

For MAGMA you first need to have your BLAS libraries installed. I spent a good couple of days trying to install MAGMA with OpenBLAS; this did NOT work (although everything appeared to install fine). I was so happy with OpenBLAS integrated with R that I persevered far too long, but in the end I gave up. It turns out that installing ATLAS (another great BLAS package), however, is not that hard. In Ubuntu simply do ‘sudo apt-get install libatlas-base-dev’. This gives you a basic ATLAS (not optimised to your system), but for us that is good enough for now.

When you download MAGMA (1.6.1) you will see many make.inc.xxx files, where xxx is the BLAS package you have. Since we have ATLAS, simply copy make.inc.atlas to make.inc. Then open make.inc with an editor and uncomment the line FPIC = -fPIC. We need this to get a shared library which we can then integrate with R. Another thing to remove from this file is the “-lifcore” argument in the LIB variable. This links to the Intel Fortran compiler libraries, which are not freely available. Luckily, everything needed seems to be in the gfortran libraries and I had no problems leaving it out.

Once the make.inc is set up, make sure you set all your paths! I had to set LAPACKDIR=/usr/lib/lapack, ATLASDIR=/usr/lib/atlas-base/ and CUDADIR=/usr/local/cuda-6.5. Once these are set, type “make” and 30 minutes later you should have MAGMA installed. Make sure you go in the ‘testing’ folder and try out a few tests there, they should run with no problems. If they don’t, try and sort out the problem, and recompile.

2. Integrating with R

Unfortunately, the R package ‘magma’ and the current version of MAGMA are incompatible (I’ll be trying to change that with the maintainer at some point).

The first issue is that the R package magma requires a library magmablas.so, but this is no longer created by MAGMA: everything is now in magma.so. Therefore, download the R-magma tar.gz, extract it in a folder, then go into the ‘configure’ file and scroll down to Line 2468. Here, instead of LIBS="${LIBS} -lmagmablas -lmagma", use LIBS="${LIBS} -L/home/andrew/magma-1.6.1/lib -lmagma -llapack", or wherever your MAGMA library is (you probably do not need -llapack). You might also need to set your CUDA_HOME variable to point to your CUDA directory (see above).

The second issue is that the interface of some crucial MAGMA functions has changed. You have to go into the src/ directory and change some lines. For example,

char UPLO = (LOGICAL_VALUE(uprtri) ? 'U' : 'L'),
TRANSA = (LOGICAL_VALUE(transa) ? 'T' : 'N');

needs to become

int UPLO = (LOGICAL_VALUE(uprtri) ? MagmaUpper : MagmaLower),
TRANSA = (LOGICAL_VALUE(transa) ? MagmaTrans : MagmaNoTrans);

Essentially, some functions now require integers (defined through MagmaUpper etc.) instead of character variables. For example, in magSolvers.c with the function magTriSolve,

magma_dtrsm('L', UPLO, TRANSA, 'N', K, N, 1.0, dA, M, dB, M);

now becomes

magma_dtrsm(MagmaLeft, UPLO, TRANSA, MagmaNonUnit, K, N, 1.0, dA, M, dB, M);

There aren’t too many of these; what I did was compile, try out some functions in R and, when I got an error, change the source, recompile, and so on.

The last thing is that MAGMA now requires magma_init() to be called before the GPU is used. Open magUtils.c and scroll down to the function magLoad(). Put magma_init(); in the first line and you should be done.

Go back to the top folder, and do ./configure. This will make the appropriate makefiles. Then type R CMD INSTALL . and hopefully you will not get any errors and you’re ready to go!
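
Once installed, a quick sanity check is worthwhile. The sketch below assumes that, as its documentation suggests, the magma package provides a magma() constructor whose objects dispatch the usual matrix methods (chol, solve, %*%) to the GPU; treat it as a rough sketch rather than a definitive example of the interface.

library(magma)

n <- 2000
A <- crossprod(matrix(rnorm(n * n), n, n)) + n * diag(n)  # an SPD test matrix

system.time(L_cpu <- chol(A))           # base R Cholesky (CPU BLAS)
system.time(L_gpu <- chol(magma(A)))    # the same factorisation through MAGMA

max(abs(as.matrix(L_gpu) - L_cpu))      # agreement up to numerical error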

EFDR v.0.1.0 uploaded on CRAN! Here is the tutorial vignette

Enhanced False Discovery Rate (EFDR) tutorial

Andrew Zammit Mangion

Introduction

Enhanced False Discovery Rate (EFDR) is a nonparametric hypothesis testing procedure used for anomaly detection in noisy spatial signals. The method, published by Shen et al. (2002), extends standard multiple hypothesis testing approaches (for example those employing the Bonferroni correction, or the standard false discovery rate (FDR) approach) by reducing the number of tests carried out. A reduced number of tests results in an approach which has more power (i.e. a lower Type II error), and hence an increased ability to detect anomalies. EFDR has proven to be vastly superior to the Bonferroni correction and superior to the standard FDR: the interested reader is referred to Shen et al. (2002) for a detailed study.

The purpose of this tutorial is to outline the basics of EFDR and demonstrate the use of the R package, EFDR, which can implement EFDR together with other basic hypothesis testing methods commonly used with spatial signals.

Underlying Theory

The ‘enhanced’ in EFDR is used to emphasise the increased power of the method over other conventional methods, brought about by reducing the number of hypothesis tests carried out when analysing a spatial signal. The number of tests is reduced in two ways. First, the image is converted into the wavelet domain through a discrete wavelet transform (DWT); since the number of coefficients in wavelet space is generally chosen to be smaller than the number of pixels in the image, fewer tests are carried out. Second, tests are only carried out on those wavelet coefficients whose neighbours have relatively large wavelet coefficients.

The size of the wavelet neighbourhood b is determined by the user, while the definition of what constitutes a neighbour is determined by a metric which is fixed. This metric takes into consideration the (i) resolution, (ii) orientation, and (iii) spatial position of the wavelet. Wavelets at the same resolution are considered closer to each other than wavelets across resolutions. The same holds for orientation and spatial position. For more details on the distance metric employed, refer to Shen et al. (2002). In practice, the neighbours of wavelet i are found by calculating the distance to every other wavelet; the b closest wavelets are then assigned as neighbours of the i^{th} wavelet.

If there are L wavelets, then only L^* wavelets are tested, and usually L^* \ll L. A key difficulty is finding a suitable L^* such that the Type I error rate is controlled at some pre-specified level \alpha. One such method, which is proved to control the Type I error in the Appendix of Shen et al. (2002), is found by minimising a cost function which penalises for a large L^*. The penalisation term requires the evaluation of an integral using Monte Carlo methods.
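
As a small illustration of the first ingredient, the two-dimensional DWT, the sketch below uses the waveslim package (on which EFDR relies) to transform a 64 x 64 image; the names and sizes of the returned sub-bands show how the coefficients are organised by resolution and orientation.

library(waveslim)

Z   <- matrix(rnorm(64 * 64), 64, 64)
dwt <- dwt.2d(Z, wf = "la8", J = 3)   # DWT with the "la8" wavelet, 3 resolutions

sapply(dwt, length)                   # number of coefficients per sub-band
sum(sapply(dwt, length))              # total number of wavelet coefficients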

The EFDR R package: A toy study

As apparent from the previous section, EFDR is a multi-faceted approach requiring wavelet transforms, neighbourhood selection in a wavelet domain and the evaluation of an optimal L^*. The package EFDR is designed to facilitate the implementation of EFDR, and uses computationally-efficient parallel methods to render manageable what is naively a highly computationally-intensive approach to false-discovery-rate detection. The package is loaded in R as follows. We will also load the package fields for plotting purposes:

library(EFDR)
library(fields)

Next, the user has to specify an image on which to carry out the analysis. The image needs to be an n1 by n2 matrix, where both n1 and n2 are integer powers of 2 (we also provide functions to deal with data which do not lie on a regular grid through the function regrid; this will be discussed later on in the tutorial). To get up and running, the package provides a function test_image which generates images similar to those analysed in Shen et al. (2002). In the following example we generate a 64 by 64 image (a filled circle embedded in Gaussian noise) and add some noise to it before plotting it.

n = 64
Z <- test_image(n1=n)$z
Z_noisy <- Z + 0.2*rnorm(n^2)
fields::image.plot(Z,zlim=c(-0.5,1.5),asp=1)


fields::image.plot(Z_noisy,zlim=c(-0.5,1.5),asp=1)


Next, the user needs to provide some essential parameters. These are

  • alpha: significance level
  • wf: wavelet name (typically “la8”)
  • J: number of multi-level resolutions (for a 64 x 64, 3 is sufficient)
  • b: number of neighbours to consider with EFDR
  • n.hyp: numeric vector of test counts, from which the optimal number of tests, L^*, will be found. Setting this to a scalar fixes the number of tests a priori
  • iteration: number of iterations in the Monte Carlo integration described earlier
  • parallel: number of cores to use
alpha =     0.05        # 5% significant level
wf =        "la8"       # wavelet name
J =         3           # 3 resolutions
b =         11          # 11 neighbours for EFDR tests
n.hyp =     c(5,10,20,30,40,60,80,100,400,800,1200,2400,4096)  # find optimal number of tests in
                                                               # EFDR from this list
iteration = 100         # number of iterations in MC run when finding optimal n in n.hyp
idp =       0.5         # inverse distance weighting power
nmax =      15          # maximum number of points to consider when carrying out inverse
                        # distance interpolation
parallel =  parallel::detectCores()/2 # use half the number of available cores
set.seed(20)            # same random seed for reproducibility

That’s it! Now we can run EFDR. For comparison purposes, the package also provides functions for implementing the Bonferroni correction, the standard FDR, and the Largest Order Statistic (LOS) method (which tests the coefficient associated with the most extreme p-value, and deems it significant if that p-value is less than 1 - (1-\alpha)^{1/n}), all in wavelet space. The interface to each of these methods is similar.

m1 <- test.bonferroni(Z_noisy, wf=wf,J=J, alpha = alpha)
m2 <- test.fdr(Z_noisy, wf=wf,J=J, alpha = alpha)
m3 <- test.los(Z_noisy, wf=wf,J=J, alpha = alpha)
m4 <- test.efdr(Z_noisy, wf="la8",J=J, alpha = alpha, n.hyp = n.hyp,
                b=b,iteration=iteration,parallel = parallel)
## Starting EFDR ...
## ... Finding neighbours ...
## ... Estimating the EFDR optimal number of tests ...

The functions provide several results, mostly for diagnostic purposes, in a list (see documentation for details). The result of most interest is the image which contains only the ‘extreme’ wavelet coefficients, stored in the field Z. These can be plotted using standard plotting functions.

par(mfrow=c(2,2))
fields::image.plot(m1$Z,main = "Bonferroni",zlim=c(-1.5,2),asp=1)
fields::image.plot(m2$Z,main = "FDR",zlim=c(-1.5,2),asp=1)
fields::image.plot(m3$Z,main = "LOS",zlim=c(-1.5,2),asp=1)
fields::image.plot(m4$Z,main = "EFDR",zlim=c(-1.5,2),asp=1)


To compute the power of the test (which we can, since we know the truth), we can use the functions wav_th and fdrpower. The first determines which wavelet coefficients are ‘significant’ by seeing which, in the raw signal, exceed a given threshold. The second determines which of these wavelet coefficients were correctly rejected under the null hypothesis that the coefficient is zero.

sig <- wav_th(Z,wf=wf,J=J,th=1)
cat(paste0("Power of Bonferroni test: ",fdrpower(sig,m1$reject_coeff)),
    paste0("Power of FDR test: ",fdrpower(sig,m2$reject_coeff)),
    paste0("Power of LOS test: ",fdrpower(sig,m3$reject_coeff)),
    paste0("Power of EFDR test: ",fdrpower(sig,m4$reject_coeff)),sep="\n")
## Power of Bonferroni test: 0.85
## Power of FDR test: 0.95
## Power of LOS test: 0.05
## Power of EFDR test: 0.95

Another tool delivered with the package is a diagnostic table containing the number of false positives, false negatives, true negatives and true positives.

### Bonferroni diagnostic table:
diagnostic.table(sig,m1$reject_coeff,n = n^2)
##                     Real Negative Real Positive
## Diagnostic Negative          4073             3
## Diagnostic Positive             3            17
### FDR diagnostic table:
diagnostic.table(sig,m2$reject_coeff,n = n^2)
##                     Real Negative Real Positive
## Diagnostic Negative          4064            12
## Diagnostic Positive             1            19
### LOS diagnostic table:
diagnostic.table(sig,m3$reject_coeff,n = n^2)
##                     Real Negative Real Positive
## Diagnostic Negative          4076             0
## Diagnostic Positive            19             1
### EFDR diagnostic table:
diagnostic.table(sig,m4$reject_coeff,n = n^2)
##                     Real Negative Real Positive
## Diagnostic Negative          4062            14
## Diagnostic Positive             1            19

Several useful diagnostic measures can be computed from this table. In particular, define

  • A_{TN}: Number of true negatives
  • A_{FP}: Number of false positives
  • A_{FN}: Number of false negatives
  • A_{TP}: Number of true positives

Then, the false discovery rate (FDR), false nondiscovery rate (FNR), specificity (Sp) and sensitivity (Se) can be computed as

FDR = A_{FP}/(A_{FP} + A_{TP})

FNR = A_{FN}/(A_{TN} + A_{FN})

Sp = A_{TN}/(A_{TN} + A_{FP})

Se = A_{TP}/(A_{FN} + A_{TP})
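
These can be computed directly from the counts in a diagnostic table; a minimal helper (the function name and argument order are my own):

fdr_measures <- function(A_TN, A_FP, A_FN, A_TP) {
  c(FDR = A_FP / (A_FP + A_TP),   # false discovery rate
    FNR = A_FN / (A_TN + A_FN),   # false nondiscovery rate
    Sp  = A_TN / (A_TN + A_FP),   # specificity
    Se  = A_TP / (A_FN + A_TP))   # sensitivity
}

# For the EFDR diagnostic table above: TN = 4062, FP = 1, FN = 14, TP = 19
fdr_measures(A_TN = 4062, A_FP = 1, A_FN = 14, A_TP = 19)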

This example considered a relatively ‘clean’ image. The real power of EFDR becomes apparent when analysing very noisy images, where the signal is not even visible from a casual inspection and hence having a powerful testing procedure becomes important. In the following we take the same signal, but this time add on a considerable amount of noise before running the aforementioned tests.

set.seed(1) # for reproducibility
Z <- test_image(n1=n)$z
Z_noisy <- Z + 2*rnorm(n^2)
fields::image.plot(Z_noisy)


m1 <- test.bonferroni(Z_noisy, wf=wf,J=J, alpha = alpha)
m2 <- test.fdr(Z_noisy, wf=wf,J=J, alpha = alpha)
m3 <- test.los(Z_noisy, wf=wf,J=J, alpha = alpha)
m4 <- test.efdr(Z_noisy, wf="la8",J=J, alpha = alpha, n.hyp = n.hyp,
                b=b,iteration=iteration,parallel = parallel)
## Starting EFDR ...
## ... Finding neighbours ...
## ... Estimating the EFDR optimal number of tests ...
par(mfrow=c(2,2))
fields::image.plot(m1$Z,main = "Bonferroni", zlim=c(-1.5,2),asp=1)
fields::image.plot(m2$Z,main = "FDR", zlim=c(-1.5,2),asp=1)
fields::image.plot(m3$Z,main = "LOS", zlim=c(-1.5,2),asp=1)
fields::image.plot(m4$Z,main = "EFDR", zlim=c(-1.5,2))


sig <- wav_th(Z,wf=wf,J=J,th=1)
cat(paste0("Power of Bonferroni test: ",fdrpower(sig,m1$reject_coeff)),
    paste0("Power of FDR test: ",fdrpower(sig,m2$reject_coeff)),
    paste0("Power of LOS test: ",fdrpower(sig,m3$reject_coeff)),
    paste0("Power of EFDR test: ",fdrpower(sig,m4$reject_coeff)),sep="\n")
## Power of Bonferroni test: 0
## Power of FDR test: 0
## Power of LOS test: 0
## Power of EFDR test: 0.3

In fact, in this case no method except EFDR detects the anomaly. Note that some spurious signals are detected; this is a result of a larger Type I error rate than that of the other methods, but recall that EFDR’s Type I error rate is still constrained to be less than \alpha.

A real example

In this example we implement all the above tests on data from the Atmospheric Infrared Sounder (AIRS). Data are available from the NIASRA data portal.

As a simple case study, we take CO2 data from the first three days in this dataset (1–3 May 2003) and compare them to data from the second three days (4–6 May 2003) over the conterminous United States and Canada. The change in CO2 between these periods is an indication of regional fluxes, and we are interested in where the flux is significantly different from zero. These satellite data are irregular in space; they therefore first need to be regridded and interpolated onto an image of appropriate size.

First, load the dataset using RCurl:

library(RCurl)
airs.raw <- read.table(textConnection(getURL(
                "http://hpc.niasra.uow.edu.au/ckan/dataset/6ddf61bb-2f9a-424a-8775-e23ebaa55afb/resource/e1206e1c-9dca-42b6-8455-3cdd98c6a943/download/airs2003may116.csv"
                )),header=T,sep=",")

Now we will concentrate on North America, and change the variable names into the appropriate x-y-z notation required to regrid the data. For this we use some handy functions from the dplyr package.

library(dplyr)
airs.raw <- airs.raw %>%
  subset(lat > 14 & lat < 66 & lon > -145 & lon < -52) %>%
  mutate(x=lon,y=lat,z=co2avgret) %>%
  select(day,x,y,z)

For plotting we use the ggplot2 package which includes the world coastline. Below we plot all the data in the first three days of the dataset over North America.

library(ggplot2)
NorthAmerica <- map_data("world") %>%
                  subset(region %in% c("USA","Canada","Mexico"))
par(pin = c(5,1))
ggplot() +
    geom_point(data=subset(airs.raw,day%in% 1:3),aes(x=x,y=y,colour=z)) +
    geom_path(data=NorthAmerica,aes(x=long,y=lat,group=group)) +
    scale_colour_gradient(low="yellow",high="blue") +
    coord_fixed(xlim=c(-145,-52),ylim=c(14,66))


In order to put this data frame into an image or matrix format, we use EFDR::regrid. regrid takes the following parameters:

  • n1: the length (in pixels) of the image
  • n2: (optional) the height (in pixels) of the image
  • idp: the inverse distance power used in the inverse distance weighting interpolation function used to fill in the gaps (e.g. idp = 2)
  • nmax: the maximum number of neighbours to consider when carrying out the interpolation (for computational efficiency this should be low, around 8).

In the following example we use inverse distance weighting (with the power idp set below) and re-grid the AIRS CO2 data from 1–3 May 2003 into a 256 x 128 image.

idp = 0.5
n1 = 256
n2 = 128

airs.interp1 <- airs.raw %>%
  subset(day %in% 1:3) %>%
  select(-day) %>%
  regrid(n1=n1,n2=n2,idp= idp,nmax = nmax)
## [inverse distance weighted interpolation]
  # idp is the inverse distance power and
  # nmax is the number of neighbours to consider

ggplot() +
    geom_tile(data=airs.interp1,aes(x=x,y=y,fill=z)) +
    geom_path(data=NorthAmerica,aes(x=long,y=lat,group=group)) +
    scale_fill_gradient(low="yellow",high="blue") +
    coord_fixed(xlim=c(-145,-52),ylim=c(14,66))


We are interested in changes in CO2 levels. Below we produce an image of CO2 fluxes by repeating the regridding on the AIRS CO2 data from 4–6 May 2003 and then finding the difference between the two images.

airs.interp2 <- airs.raw %>%
  subset(day %in% 4:6) %>%
  select(-day) %>%
  regrid(n1=n1,n2=n2,idp= idp,nmax = nmax)
## [inverse distance weighted interpolation]
airs.interp <- airs.interp2 %>%
  mutate(z = airs.interp1$z - airs.interp2$z)

The differenced image looks as follows

ggplot() +
    geom_tile(data=airs.interp,aes(x=x,y=y,fill=z)) +
    geom_path(data=NorthAmerica,aes(x=long,y=lat,group=group)) +
    scale_fill_gradient2(low="blue",high="red", mid="#FFFFCC") +
    coord_fixed(xlim=c(-145,-52),ylim=c(14,66))


We now convert the data frame to an image matrix (using the package function df.to.mat) and then carry out the wavelet-domain tests, including EFDR.

Z <- df.to.mat(airs.interp)
m1 <- test.bonferroni(Z, wf=wf,J=J, alpha = alpha)
m2 <- test.fdr(Z, wf=wf,J=J, alpha = alpha)
m3 <- test.los(Z, wf=wf,J=J, alpha = alpha)
m4 <- test.efdr(Z, wf="la8",J=J, alpha = alpha, n.hyp = n.hyp,
                b=b,iteration=iteration,parallel = parallel)
## Starting EFDR ...
## ... Finding neighbours ...
## ... Estimating the EFDR optimal number of tests ...

For plotting we re-arrange the data into data frames and then repeat the above. For gridding ggplot objects we will use the gridExtra package.

library(gridExtra)

airs.interp <- airs.interp %>%
                  mutate(m1 = c(m1$Z),
                         m2 = c(m2$Z),
                         m3 = c(m3$Z),
                         m4 = c(m4$Z))

g1 <- ggplot() +
    geom_tile(data=airs.interp,aes(x=x,y=y,fill=m1)) +
    geom_path(data=NorthAmerica,aes(x=long,y=lat,group=group)) +
    scale_fill_gradient2(low="blue",high="red", mid="#FFFFCC") +
    coord_fixed(xlim=c(-145,-52),ylim=c(14,66)) + ggtitle("Bonferroni")

g2 <- ggplot() +
    geom_tile(data=airs.interp,aes(x=x,y=y,fill=m2)) +
    geom_path(data=NorthAmerica,aes(x=long,y=lat,group=group)) +
    scale_fill_gradient2(low="blue",high="red", mid="#FFFFCC") +
    coord_fixed(xlim=c(-145,-52),ylim=c(14,66)) + ggtitle("FDR")

g3 <- ggplot() +
    geom_tile(data=airs.interp,aes(x=x,y=y,fill=m3)) +
    geom_path(data=NorthAmerica,aes(x=long,y=lat,group=group)) +
    scale_fill_gradient2(low="blue",high="red", mid="#FFFFCC") +
    coord_fixed(xlim=c(-145,-52),ylim=c(14,66)) + ggtitle("LOS")

g4 <- ggplot() +
    geom_tile(data=airs.interp,aes(x=x,y=y,fill=m4)) +
    geom_path(data=NorthAmerica,aes(x=long,y=lat,group=group)) +
    scale_fill_gradient2(low="blue",high="red", mid="#FFFFCC") +
    coord_fixed(xlim=c(-145,-52),ylim=c(14,66)) + ggtitle("EFDR")

gridExtra::grid.arrange(g1, g2, g3, g4, ncol=2)


Once again, EFDR is seen to detect more anomalies than all the other methods. In this case both LOS and the standard Bonferroni correction do not detect any anomalies, due to their poor power. Another example demonstrating the increased power of EFDR, using a temperature-difference dataset, can be found in Shen et al. (2002).

Reference

Shen, Xiaotong, Hsin-Cheng Huang, and Noel Cressie. “Nonparametric hypothesis testing for a spatial signal.” Journal of the American Statistical Association 97.460 (2002): 1122-1140.

Installing Rpud on Ubuntu 14.04

`rpud` is an R package which allows for R operations on your NVIDIA GPU with minimal coding effort. It’s an excellent package and well worth looking into.

Unfortunately, its installation is not so straightforward on Ubuntu 14.04 (at least it wasn’t for me). The package comes with a help file but this is pretty useless if you encounter some obscure error. I will outline what I did to overcome these difficulties below.

1. Install from source

Unfortunately, one of the fixes I found requires you to edit the configuration file, therefore the source is needed. Download the source from CRAN and use ‘tar -zxvf’ to unzip and extract it in a folder of your choice.

2. Read the INSTALL file

The INSTALL file instructs the user to set the R_LIBS_USER environment variable. Put this in your ~/.bashrc file and, if you do not feel like restarting the terminal, set it by simply typing ‘export R_LIBS_USER=[…]’ where […] will be your R library directory. In my case this was ~/R/x86_64-pc-linux-gnu-library/3.1/

3. You need to set configuration options during installation

If not specified during installation, the configuration script for rpud will use default paths which are probably not right for your system. This is a problem, as the installation will not be able to find “nvcc”, the NVIDIA compiler, which is obviously a pretty basic thing needed when compiling against the CUDA libraries! The install.packages command should therefore look something like the following:

install.packages("./rpud/", type="source", configure.args = c("./configure --with-cuda-home=/usr --with-r-lib=/usr/lib/R/lib"), repos=NULL, verbose = T)

Note the --with-cuda-home and --with-r-lib flags. These need to be set accordingly. If ‘nvcc’ lies in ‘/usr/bin’ then the first flag needs to just be ‘/usr’. rpud, by default, assumes something different.

4. Edit the configuration file

This is probably the trickiest part. There are two problems with the configuration file. First, rpud will assume that libcublas.so is in $CUDA_HOME/lib64. This is not always the case (on my system it was in $CUDA_HOME/lib/x86_64-linux-gnu), so do a quick search for ‘lib64’ in the configure script and change accordingly if your libcublas.so is in a different directory. I had to make this change in four or five places.

Second, and this is harder to debug, rpud assumes a (deprecated or different?) g++ version which recognises a “-Wl” option on its own. My gcc version is 4.8.2 and does not know what -Wl means. When installing the package you will get repeated errors such as

g++: error: unrecognized command line option ‘-Wl’

In the configuration file the problem is with

RPATHFLAG="-Wl,-rpath,${CUDA_HOME}${CUDA_LIB_DIR}"

I did NOT find a proper workaround for this. However, simply emptying RPATHFLAG seemed to do the trick. So in the end I simply set

RPATHFLAG=

The installation then went through fine, and rpud works like a charm once installed.
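
As a quick check that the GPU is actually being used, you can compare rpud’s distance-matrix routine (rpuDist, which mirrors base dist()) against the CPU version; the timings below are only indicative and depend on your GPU.

library(rpud)

x <- matrix(rnorm(3000 * 100), 3000, 100)
system.time(d_cpu <- dist(x))        # CPU
system.time(d_gpu <- rpuDist(x))     # GPU via rpud
max(abs(as.matrix(d_cpu) - as.matrix(d_gpu)))   # should be tiny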

Namespaces and load_all in devtools

devtools is a great R package by Hadley Wickham which helps you create your own packages when developing a project from Day 1. I definitely recommend it, even for beginners. For an introduction to package construction visit Hadley Wickham’s wiki page and when more in-depth knowledge is needed refer to the R Extensions Manual.

What I want to focus on here is a problem you might encounter when using the function load_all to load a source package. load_all simulates the loading of a normal package by sourcing all the R scripts in the R/ folder, loading all the data in the data/ folder and compiling (and attaching the resulting .so/.dll) all the source code in the src/ folder. However, I had trouble (as have many) importing the namespaces of other packages for use with my custom-built one.

The problem

Consider the simple function

myrowsum <- function(Q) {
   return(rowSums(Q))
}

where Q is expected to be a sparse matrix of class dgCMatrix defined in the package Matrix. There are several suggested ways to make this happen:

1. Add Matrix to the Depends field in the DESCRIPTION file.
2. Add import(Matrix) to the NAMESPACE file, or simply
3. Add importFrom(Matrix,rowSums) to the NAMESPACE file (or use roxygen2 as described here).

However, none of these had the desired effect of loading the Matrix namespace. Adding print(environment(rowSums)) to the function above revealed that rowSums in fact belonged to the environment namespace:base; this code would not work for sparse matrices.

To load the namespace of the desired package, the only approach that worked for me was adding Matrix to the Imports field in the DESCRIPTION file. This had the desired effect of associating rowSums with the correct namespace; I used this approach for all the other packages in my project, which are now working fine.
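
For reference, a minimal sketch of what the working setup looked like for me (the roxygen2 tags are optional and there for documentation; the Imports: line in DESCRIPTION is what did the trick):

# DESCRIPTION (excerpt)
#   Imports: Matrix

# R/myrowsum.R
#' Row sums of a sparse matrix
#'
#' @param Q A sparse matrix of class \code{dgCMatrix}.
#' @return A numeric vector of row sums.
#' @importFrom Matrix rowSums
#' @export
myrowsum <- function(Q) {
  rowSums(Q)
}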

Estimating yearly trends with an EM algorithm in R

This informal post is primarily aimed at people working at the interface between the geosciences and statistics.

If you are studying real geophysical applications you are probably aware of the value of data summaries in making the problem computationally feasible. As an example, if you are faced with 1e9 satellite data points at a sub-1km scale, you might opt to average the data on a 1km grid to bring the data down to a manageable size of 1e6 points. Hopefully this would not come at a cost in inferential accuracy. However, data summaries can sometimes be carried out wrongly, leading to erroneous inferences.

In the problem we are studying we summarise the data as “linear trends in time”. Analysing the temporal behaviour of “temporal trends” can be difficult, not because there is anything inherently complicated about the concept of a “trend”, but because a trend is itself suggestive of some underlying model. Those familiar with hierarchical modelling will be comfortable with ways to accommodate this; unfortunately, if you are summarising your data in this way, you are probably doing so precisely to avoid a “deep hierarchy” which would be computationally prohibitive (e.g. because the low-level spatio-temporal model is very high-dimensional). In this case a trend becomes a “data point”, and we need to make sure that the “data” are representative of what we are modelling. If you are a statistician you are probably provided with these trends directly; even so, we should be careful to make sure that these values actually reflect the quantity we want to model.

The Problem

In an application I’m currently working on we are modelling yearly trends of the height change of the Antarctic ice sheet. If, say, a location x has witnessed a height increase of 2m in 2007 then we say that the trend for that year was 2m/yr. If it witnessed a 10m decrease over 5 years then we say that the trend there was -2m/yr and so on.

Let’s say we have data spanning 2006-2009. When working with a spatial model, we can fit trends using OLS to the data at each spatial point and then use standard spatial statistics to obtain posterior means and credibility intervals at every point in space, using the standard deviation of the trend as the “trend error”. If we are only concerned with linear trends then this seems a reasonable thing to do. For example, the trend at one spatial point may look like this:

[Figure: 3year_trend]

Taking the trend at some spatial points, one may use spatial statistics to infer the trends at every spatial (observed and unobserved) point; see some pictures in my recent paper (open access).

Extending to the spatio-temporal case is however not straightforward. One should not just split up the data into individual years and find the trend in each year (especially without deseasoning as shown below). Doing this for the above data set would give the following:

[Figure: single_year_trend]

The biggest problem with the above trends is that they are suggestive of a “global trend” which is much larger than the truth. In this particular example, the 3-year trend is 2.8 \pm 0.2 mm/yr (errors at 1\sigma), but the linear trends calculated year by year are given by

Year   Trend (mm/yr)   Trend error (mm/yr)
2006   4.6             1.0
2007   7.2             1.1
2008   8.9             1.0

Any model you are going to fit to these three data points is going to result in inferences of trends which are much larger than they should be.

De-seasoning

A big player here is the seasonal signal, and this should be removed. There are many ways to do this; my personal favourite is stl, available in most statistical software packages, although this is a non-parametric approach. Modelling the seasonality explicitly is another alternative. Here we will do what is most commonly done (and criticised in the previous link), which is to subtract the monthly means from the data prior to finding the trend (a short sketch of both approaches is given below). This works pretty well in practice. Our global 3-year trend changes slightly to 2.2 \pm 0.2 mm/yr, whilst the yearly trends become

Year   Trend (mm/yr)   Trend error (mm/yr)
2006   -1.6            0.9
2007   1.1             0.9
2008   2.7             0.8

shown below

[Figure: single_year_trend]

which are more reasonable (note that much more sophisticated approaches can be used). However, there is still one problem: we still have discontinuities at the edges. For example, there is a jump at 2008 (because we have chosen to place a temporal demarcation there) which diminishes the effect of the positive trend, i.e. there was an increase which is not accounted for.
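
As promised above, here is a minimal sketch of the two de-seasoning options applied to a synthetic monthly series: stl (subtracting the estimated seasonal component) and subtracting the monthly means.

# A synthetic monthly series: seasonality + trend + noise
y <- ts(sin(2 * pi * (1:36) / 12) + 0.1 * (1:36) + rnorm(36, sd = 0.3),
        start = c(2006, 1), frequency = 12)

# (i) stl: remove the estimated seasonal component
fit   <- stl(y, s.window = "periodic")
y_stl <- y - fit$time.series[, "seasonal"]

# (ii) subtract the monthly means (the approach used in this post)
y_mm <- y - ave(as.numeric(y), cycle(y))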

To minimise boundary effects we need to tie up the ends such that the beginning of 2009 links with the end of 2008 and so on. This can be done parametrically by fitting (after deseasoning) a piecewise linear function to the data. We do this by using “hat” basis functions, as is done in the standard finite element method and then fitting the hat functions to the data using standard techniques. This results in the following

[Figure: EM_single_year_trend]

with the following estimates:

Year   Trend (mm/yr)   Trend error (mm/yr)
2006   -2.2            0.6
2007   3.0             0.5
2008   5.0             0.7

This, to me, seems the most straightforward way to estimate trends on a yearly scale, which we can then treat as data points in our spatio-temporal framework (we will have 1e6 of these mini time-series to model). I’ll be happy to hear other alternatives.

For those who would like some technical details, in the following I describe the expectation maximisation (EM) algorithm for this problem.

EM

For an introduction to the expectation maximisation (EM) algorithm I highly recommend Bishop’s book.

We assume the following parametric model for the underlying field:

y = Cx + v, \qquad x \sim MVN(0,Q^{-1})

where x denotes the weights of our “hat” basis functions \phi(t) over time t. These hat functions are set up such that each has a maximum at a yearly point and reaches zero at the yearly points on either side (and thus has a support of two years). The prior precision matrix Q can be set as diagonal with small elements if we wish the prior to be largely uninformative. The matrix C maps the weights to the observations, and thus the ith row of C is given by (\phi_j(t_i))_j, where t_i is the timestamp of the ith observation. Since each observation is in the support of only two tent functions, this matrix is very sparse, and we exploit this to speed up computations.

We assume homogeneous noise v \sim N(0,\sigma^2 I); however, we need to estimate \sigma^2 (or rather the precision \sigma^{-2}). For the EM algorithm we need the mean and variance of x at each iteration, which in this case are simply given by

\Sigma = (Q + \widehat{\sigma^{-2}}C^TC)^{-1}

\mu = \widehat{\sigma^{-2}}\Sigma C^T y

The expected log-likelihood, with the expectation taken with respect to x, is then given (up to an additive constant) by

\langle \ln p(y \mid x,\sigma^{-2}) \rangle = \langle \frac{1}{2}\ln |\sigma^{-2}I| - \frac{\sigma^{-2}}{2}(y-Cx)^T(y-Cx)\rangle

which is maximised with \widehat{\sigma^{-2}} = n/\Gamma where

\Gamma = y^Ty - 2\mu^TC^Ty + tr(C^TC[\mu\mu^T + \Sigma])

and n is the number of data points. The algorithm is thus very straightforward, and probably the hardest part is to find the knots and actually set up the basis! I provide the code for the function below; do let me know if you find it useful.

# Piecewise interpolation of a function using tent functions.
# X is a data-frame with fields 'x' and 'y'
# ds denotes the spacing of the knots (1 by default)
# "first" denotes the location of the first knot (e.g. 2006)

Piecewise_interp <- function(X, ds=1, first=NULL) {
  stopifnot(is(X, "data.frame"))
  stopifnot("x" %in% names(X))
  stopifnot("y" %in% names(X))
  stopifnot(is(ds, "numeric"))
  stopifnot(is.null(first) | is(first, "numeric"))

  require(plyr)
  require(Matrix)

  if (is.null(first)) {                 # Automatically place knots
    first_val <- ceiling(min(X$x)/ds)*ds - ds
    last_val <- floor(max(X$x)/ds)*ds + ds
    knots <- seq(first_val, last_val, by=ds)   # Form knots
  } else {                              # Starting point pre-set
    first_val <- first
    knots <- first_val
    while (tail(knots, n=1) < max(X$x)) {
      knots <- c(knots, tail(knots, n=1) + ds)
    }
    last_val <- tail(knots, n=1)
    X <- subset(X, x > first_val)
  }

  # For each point, find the section (basis support) to which it belongs
  X$section <- cut(X$x, knots)
  count <- 1
  X <- ddply(X, "section", function(df) {
    df$sectionnum <- count
    count <<- count + 1
    return(df)
  })

  # For each section calculate the values of the two overlapping hat functions
  X <- ddply(X, "section", function(df) {
    lo_x <- knots[df$sectionnum]
    hi_x <- knots[df$sectionnum + 1]
    c1 <- 1 - (-1/ds)*lo_x
    c2 <- 1 - (1/ds)*hi_x

    df$phi1 <- (-1/ds)*df$x + c1
    df$phi1_ind <- df$sectionnum
    df$phi2 <- (1/ds)*df$x + c2
    df$phi2_ind <- df$sectionnum + 1

    return(df)
  })

  # Build the 'observation matrix'
  C <- sparseMatrix(i = c(1:nrow(X), 1:nrow(X)),
                    j = c(X$phi1_ind, X$phi2_ind),
                    x = c(X$phi1, X$phi2))

  # Inference with EM algorithm for the observation precision
  gamma_prec <- rep(1/var(X$y), 10)     # Initialise precision
  maxiter <- 100

  for (m in 2:maxiter) {
    # Estimate x
    ymat <- matrix(X$y)
    xML <- data.frame(x = knots,
                      y = as.vector(chol2inv(chol(t(C) %*% C)) %*% t(C) %*% ymat))
    Qobs <- sparseMatrix(i = 1:nrow(X), j = 1:nrow(X), x = gamma_prec[m-1])
    Qtot <- t(C) %*% Qobs %*% C
    Varx <- solve(Qtot)

    # Update gamma_prec
    Gamma <- as.vector(t(ymat) %*% ymat - 2*t(xML$y) %*% t(C) %*% ymat +
                         sum(diag(t(C) %*% C %*% (Varx + xML$y %*% t(xML$y)))))
    gamma_prec[m] <- nrow(X)/Gamma
    if (abs(gamma_prec[m] - gamma_prec[m-1]) < 1e-5) break
  }
  cat(paste("EM converged after", m, "iterations", sep=" "), sep="\n")

  # Variance of the yearly trends (differences of adjacent weights)
  Vartrends <- NULL
  for (i in 1:(nrow(Varx)-1)) {
    Vartrends[i] <- (sum(diag(Varx)[i:(i+1)]) - 2*(Varx[i, i+1]))/ds^2
  }
  trends <- data.frame(year = head(knots, n = (length(knots)-1)),
                       trend = diff(xML$y)/ds,
                       std = sqrt(Vartrends))

  # Return a list with the maximum likelihood estimate for the weights
  # and the yearly trend estimates
  return(list(x = xML, trends = trends))
}

To test this function you can use the following code


library(ggplot2)
library(plyr)

deseason <- 1

x <- seq(2006.01, 2009, by=0.01)
x2 <- x - 2006
y <- sin(x2*2*pi) + 4*x2*(x2 < 1) -
     4*x2*(x2 >= 1 & x2 < 2) + 8*(x2 >= 1 & x2 < 2) +
     4*x2*(x2 >= 2 & x2 <= 3) - 8*(x2 >= 2 & x2 <= 3) +
     2*rnorm(length(x))
X <- data.frame(x=x, y=y)

if (deseason) {
  # Subtract the mean seasonal profile (computed across the three years)
  int <- ((0:2)*100) + 1
  av <- rep(0, 100)
  for (i in 0:99) {
    av[i+1] <- mean(y[int + i])
  }
  X$y <- X$y - kronecker(rep(1, 3), av)
}

Interp <- Piecewise_interp(X, ds=1, first=2006)

g3 <- ggplot() +
  geom_point(data=X, aes(x=x, y=y)) +
  geom_line(data=Interp$x, aes(x=x, y=y), colour="red", size=1)
print(g3)

Thanks to Botond Cseke and Nana Schoen for helpful discussions on this topic.

Improving linear algebraic operations (including chol) in R, some options.

If you’re into some big-n environmental stats with huge sparse matrices (like me), then you probably want to spend some time dwelling on the linear algebraic routines you are using. Also, if you’re not into the inner workings of the software (like me), then it will probably come as a surprise that quite a bit can be done in R to improve certain basic operations such as matrix multiplication, Cholesky decomposition, eigendecomposition, etc. As I show below, in some cases these functions can be improved by simply bolting on a modern BLAS package which enables multi-threading. In other cases this is just not possible, and other software, in particular MATLAB, has super-human algebraic routines which are closed off to the outside world and which we would still like to make use of (provided we can grab a license running around somewhere).

The Problem

In environmental stats, the Cholesky decomposition is probably the most-used tool, since it is required both for random number generation and for posterior inference. The basic chol which comes with R seems to do the job. To compare its performance with MATLAB I took one of the matrices I am using, of dimension 70000 x 70000 with 0.04% non-zeros, and tried the Cholesky on both platforms. On my machine this required 14s in R and 1.3s in MATLAB. This was a relatively sparse matrix; usually I would be working with denser matrices (~0.1% non-zeros). A roughly ten-fold speed-up is thus something notable and worth investigating.

Permutations

The first thing to check is whether you are using the right ordering before carrying out the decomposition. To take advantage of a possibly better ordering than that provided by the R package Matrix, I created an R wrapper for the excellent C code by Timothy Davis which implements approximate minimum degree (amd) ordering, and which is the one I personally like to use most in MATLAB. It turns out, however, that the ordering provided by the package Matrix in R, at least for my matrices, actually does the job pretty well, and I got a negligible speed-up with amd. This therefore points to something else: the BLAS routines.
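
For reference, the fill-reducing ordering used by Matrix can be inspected directly from R; a small sketch with a toy sparse matrix (see ?Matrix::Cholesky and ?Matrix::expand for the exact factorisation identities):

library(Matrix)

n <- 1000
A <- rsparsematrix(n, n, density = 0.01)
Q <- forceSymmetric(crossprod(A) + Diagonal(n))   # sparse SPD test matrix

ch <- Cholesky(Q, perm = TRUE)   # sparse Cholesky with a fill-reducing ordering
ex <- expand(ch)                 # list containing the factor L and permutation P
ex$P                             # the permutation matrix chosen by Matrix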

OpenBLAS

If you’re going down the route of installing BLAS packages, be ready for at least a couple of hours of frustration, even more if you are a Windows user. The two most common open-source BLAS packages used today are ATLAS and OpenBLAS. The former is, in general, harder to install, and I have had numerous failed attempts to do this on a Win64 machine; it is much easier on a Linux box, although one needs to carefully set all the flags as documented here. OpenBLAS installation is extremely straightforward and works fine on a Win64 machine if you use the MINGW and MSYS distributions mentioned here. Note that you do not need to re-install R as indicated in that link. You can take the dll file produced and use it to replace the one in the R\bin\x64 folder (just rename the old one to be safe). If you have not compiled the BLAS package properly, R simply will not start, without indicating what is wrong.

Edit: Make sure you set the flag NO_SHARED=0 when compiling so that the .dll or .so file is generated.

OpenBLAS will give you a big performance boost. You can see some benchmark tests here and I can confirm that the ten-fold increase in computational speed is pretty much what you can expect. In fact, this matches what is reported when using ATLAS in Imperial’s HIPLARb package. I obtained similar speed-ups when running the tests shown on their homepage, so if you’re on a Windows box your best option is probably to stick with OpenBLAS.

Cholesky decomposition

All the above-mentioned BLAS benchmark tests are remarkable. I was therefore quite disappointed to note that the runtime of my Cholesky decomposition remained largely unaffected with OpenBLAS. It therefore seems that the performance increase attainable on dense matrix manipulations does not carry over to the sparse realm, and probably what one would need to do here is spend time incorporating Davis’ SuiteSparse library in R. However, if you’re not a devoted C programmer (which I’m not), you have another option, which for me is doing the job nicely: the MATLAB server.

The MATLAB Server

There is an excellent package on CRAN by Henrik Bengtsson called R.matlab which allows you to instantiate a MATLAB server. This then communicates with MATLAB over TCP/IP and transfers data using the file system. It is easily implemented. Let’s say you have a sparse matrix Q and an ordering P. Then you can simply run the following code in R after loading R.matlab:


Matlab$startServer()
matlab <- Matlab()
open(matlab)

Qp <- as(Q[P, P], "TsparseMatrix")   # triplet form, so the @i, @j and @x slots are available
i <- Qp@i + 1
j <- Qp@j + 1
x <- Qp@x
setVariable(matlab, i=i, j=j, x=x)
evaluate(matlab, paste("Q = sparse(i,j,x);",
                       "L = (chol(Q)');"))
L <- getVariable(matlab, "L")$L      # getVariable returns a named list

Yes, it’s as simple as that, and you can do a wrapper for this (say, chol_MATLAB) which you can call instead of the standard chol. You can also do the ordering using amd in MATLAB and, well, anything else really.
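
A sketch of such a wrapper (the name chol_MATLAB is mine; it re-uses the open matlab connection from the snippet above):

chol_MATLAB <- function(Q, P, matlab) {
  Qp <- as(Q[P, P], "TsparseMatrix")           # triplet form for @i, @j, @x
  setVariable(matlab, i = Qp@i + 1, j = Qp@j + 1, x = Qp@x)
  evaluate(matlab, "Q = sparse(i,j,x); L = chol(Q)';")
  getVariable(matlab, "L")$L
}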

So what is the speed-up like? The Cholesky factorisation takes only 1.3s in MATLAB; however, with communication overhead this goes up to 7.2s when using it as a server, which is still a big improvement over my original 14s. Also, one might expect the overhead to increase roughly linearly with the number of non-zeros in the Cholesky factor. The complexity of the process is thus dominated by computing the decomposition (since this is super-linear), and hence we can expect big dividends as the matrices grow larger. Trying the same operation with a 70000 x 70000 matrix but with 0.06% non-zeros requires 55s with OpenBLAS under an amd permutation in R, but only 21s using the MATLAB server. Even more gains are possible when the Cholesky factorisation has to be repeated several times (for example in an MCMC algorithm or gradient descent), since the symbolic pattern of the decomposition remains unchanged and one only needs to transfer the values. Just transferring the values reduced the time required to 14s in this example. In the table below, R-MATLAB2 denotes value-only transfers.

Non-Zeros   R (s)   R-MATLAB (s)   R-MATLAB2 (s)   MATLAB (s)
0.04%       13.8    7.2            4.7             1.3
0.06%       55.0    20.6           13.8            5.1

The above tests were carried out on a quad-core Intel Core i7-2600S @2.8GHz.

Conclusions

If you are an R user and need a lot of linear algebraic operations, you should probably install OpenBLAS (since it’s fairly straightforward); however, this is not likely to improve the performance of your large sparse Cholesky decompositions. For these you should consider either wrapping Timothy Davis’ SuiteSparse library (I have not done this) or using a MATLAB server. I found the latter to be particularly convenient.

Thanks to my dad for helpful discussions on this topic.