R coding tips for spatio-temporal analyses

If you are into spatio-temporal modelling and its applications, you are probably also spending a good deal of your time coding in R and looking for ways to enhance your code for better performance. As we start to work routinely with “big data,” i.e. data sets on the order of several GB or even TB in raw form, there is no room for complacency: we need well-structured, modular, well-tested and reusable code. When dealing with data on these scales it is very easy to get frustrated when testing out a prototype, or for mistakes to go unnoticed and “lost in transmission.” Here I provide some assorted tips which can help reduce development time and the number of logical errors in the code and package design of these sorts of programs.

1. Use MD5 for caching

MD5 is a hashing algorithm which produces a “unique” (i.e., unique for practical purposes) number based on the data being digested. This is particularly useful when you have a chunk of code which takes a long time to execute.

The naive way of bypassing this is to execute the code once, save the results (for example using save in R), comment out the code, and add a load after the commented-out block. This is bad practice for one very good reason: if the variables or data on which we are operating change, this will go unnoticed, and during development we can end up with a program which mixes old results with new ones. The other option, running the sequence of instructions every time we test the code, is equally prohibitive.

This is where MD5 comes in. We digest all the input variables and data before the sequence of code and produce a hash. We then check a designated folder (I use a /cache folder in the project directory) for a file whose name is that hash. If it is there we load it; otherwise we execute the commands and, at the end, save the results into the cache using the MD5 hash as the filename. The advantage is that this block of code will only be re-run if any of the input variables or data change. Working in an environment where people change the data or “apply new corrections” all the time? No problem, this MD5 wrapper around the code will do all the work for you.

In R this wrapper will look something like this:

library(digest)  # provides digest() for MD5 hashing

md5 <- digest(file = args$path, algo = "md5")
if (check_md5(md5)) {
    cat("Found existing MD5 for data. Loading...", sep = "\n")
    load(file = file.path("./cache", paste0(md5, ".rda")))
} else {
    ## ... long-running computations that produce .Object ...
    save(.Object, file = file.path("./cache", paste0(md5, ".rda")))
}

where check_md5 is a small function which checks the folder for the required file.
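A minimal sketch of such a helper, assuming the ./cache folder and .rda naming used above:

check_md5 <- function(md5, cache_dir = "./cache") {
    ## TRUE if a cached result already exists for this hash
    file.exists(file.path(cache_dir, paste0(md5, ".rda")))
}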

2. Do not code, design

It is very tempting in applications such as these to start coding without any thought for the big picture or for package development. Unfortunately, everything we do is pretty much useless if it cannot be ported into an entirely different setting with minimal user effort. If we want our work to be useful to a wider audience, we should code for re-usability. This will lengthen development time considerably, but the end result will be something useful which can help others.

I believe (and I should stress that this is just a personal opinion) that the key to producing reusable code which can be ported across applications is the use of an object-oriented paradigm. This is not because imperative programming cannot be used to produce re-usable code, rather that object-oriented programming is naturally designed to let you do this neatly.

The potential advantages of using OO code are numerous. For example, in spatio-temporal applications we usually make use of classes of basis functions which have different properties but which all serve the same purpose. In this case I can define a class of basis functions, which can include the finite element basis (used in Lindgren et al., 2011), the bisquare functions (Cressie et al.), EOFs (Wikle et al.), spherical harmonics and so on. All these classes share a common interface with methods such as “interpolate”, “plot” and “find_Gram_matrix”, all common operations in spatio-temporal modelling. I can then formulate spatial and spatio-temporal models in the same way, irrespective of the basis I’m using, as in the sketch below.
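Using R’s S4 system, such an interface might be set up as follows (the class and generic names here are illustrative, not taken from any particular package):

library(methods)

## A virtual parent class; concrete bases (bisquare, EOFs, ...) extend it
setClass("Basis", representation("VIRTUAL"))

## Generics that every basis class must implement
setGeneric("interpolate", function(basis, s_new) standardGeneric("interpolate"))
setGeneric("find_Gram_matrix", function(basis) standardGeneric("find_Gram_matrix"))

## One concrete subclass: bisquare functions with centres and apertures
setClass("BisquareBasis", contains = "Basis",
         representation(centres = "matrix", aperture = "numeric"))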

For the actual modelling we can then create another class called “gmrf” with the required properties (a precision matrix and a mean) and associate it with a set of basis functions. The vector autoregressive model would be a subclass of the “gmrf” class which builds a GMRF from the parameters it is constructed with. The spatial CAR model would be another subclass, and so on. In this way we can construct multivariate spatio-temporal modelling problems with different bases, a mix of spatial/temporal basis functions, etc. with little effort. The FRK package I have recently released has code built on this premise.
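Continuing the sketch above (again, the names are illustrative), the hierarchy might look like:

## A GMRF has a mean, a precision matrix and an associated set of basis functions
setClass("gmrf", representation(mu = "numeric", Q = "matrix", basis = "Basis"))

setClass("var_gmrf", contains = "gmrf")   # vector autoregressive model
setClass("car_gmrf", contains = "gmrf")   # spatial CAR model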

3. Use low-level languages sparingly but wisely

There is no greater satisfaction than writing something in C++, linking it through Rcpp or RcppArmadillo, and blinking at the speed of your new code. We encounter many operations that can benefit from low-level code in spatio-temporal analyses (e.g., computing Euclidean distances). Unfortunately, though, you still need to know how to code in C++. Speaking for myself, I feel I have lost touch with low-level programming languages, and I’m not as confident as I once was. Even with the goldmines of Rcpp and RcppArmadillo at my fingertips, I still find myself spending time recalling how to incorporate C++ in R, and how to code in C++. Nowadays what I do is make sure the whole program is written and tested in R, then find the bottlenecks and see whether they can be sped up using low-level code. I still keep the R functions; I then compare the C++ implementations against the R implementations in test functions (see next point).
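As an example, a minimal Rcpp sketch of the Euclidean-distance computation mentioned above might look like this (the function name dist_cpp is my own):

library(Rcpp)

## Compile a small C++ kernel directly from R
cppFunction('
NumericMatrix dist_cpp(NumericMatrix X) {
    int n = X.nrow(), p = X.ncol();
    NumericMatrix D(n, n);
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < i; ++j) {
            double d2 = 0.0;
            for (int k = 0; k < p; ++k) {
                double diff = X(i, k) - X(j, k);
                d2 += diff * diff;
            }
            D(i, j) = D(j, i) = std::sqrt(d2);  // fill both triangles
        }
    return D;
}')

## The result should agree with base R: as.matrix(dist(X))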

4. Test, test, test…

Automatic testing has come of age in R through the testthat package. Coding individual units and not proceeding until they pass a range of tests (in a strict bottom-up approach) is tedious, but you will reap the benefits later. In automatic testing, the tests can be re-run after every change you make, to ensure nothing has “broken.” You can also design the tests in such a way that not only small functions are tested, but even full-scale analyses, where you know roughly what the outcome should be. For example, if coding up simple kriging, you can have tests that verify that your matrix calculations for computing prediction variances are correct, or you can do kriging on simulated data with a fixed seed (so that the result is reproducible), and throw an error if the predictions are off in some sense. Both these types of tests can warn you of problems that arise from a change you make to the code. Testing is particularly useful with continuous-integration systems such as Travis CI, and test coverage reports can be generated to show you how much of your code is being tested.
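A sketch of the second kind of test, where my_krige is a placeholder for your own kriging function:

library(testthat)

test_that("kriging on simulated data stays close to the truth", {
    set.seed(1)                                 # fixed seed: reproducible result
    s <- matrix(runif(50), ncol = 2)            # 25 random spatial locations
    z <- sin(2 * pi * s[, 1]) + rnorm(25, sd = 0.1)
    pred <- my_krige(s = s, z = z, s_pred = s)  # my_krige is a placeholder
    expect_lt(mean((pred - z)^2), 0.1)          # fail if predictions are off
})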

5. Think carefully about function interfaces

While the code within functions can change without any user being the wiser, the function interface cannot. It is therefore time well spent thinking about what the function will be used for, now and in the future. For example, if you are fitting a model using the EM algorithm, but want to allow for the possibility of other estimation methods being implemented later, include a “method” argument which can be used to indicate the fitting method. Think also about the ordering of the arguments, and put the most important arguments, those most likely to stand the test of time, first in your function definition.
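A sketch of such an interface (fit_model and its arguments are invented for illustration):

## Most stable arguments first; "method" leaves room for future estimators
fit_model <- function(data, basis, method = c("EM", "ML"), ...) {
    method <- match.arg(method)
    if (method != "EM")
        stop("Only method = 'EM' is implemented at present.")
    ## ... EM iterations would go here ...
}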

6. Use useful variable names

In an application I’m currently working on, I’m having to deal with differing units: lengths in m and km, masses in kg, T, MT and GT, densities in kg per km2m, and so on. My first attempt at keeping track was to add a comment next to each variable stating what the variable or constant refers to. A better solution is to encode the units in the name itself: a variable name such as rho_kg_km2m or area_km2 immediately reflects the quantity we are talking about.
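A toy example (the values are invented) of how such names document themselves:

depth_m     <- 10     # depth in m
area_km2    <- 4.2    # area in km^2
rho_kg_km2m <- 1300   # density in kg per (km^2 m)
mass_kg     <- rho_kg_km2m * area_km2 * depth_m   # units visibly cancel to kg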

Judicious use of variable names goes beyond indicating the units. Frequently, the symbols used in the code do not match up with those published in an accompanying paper on which the code is based. This can be extremely confusing to reviewers and to end users. Although there is a big barrier to changing the symbol names used in code, doing so should be a high priority when there is an accompanying manual or publication.