Improving linear algebraic operations (including chol) in R: some options

If you’re into big-n environmental stats with huge sparse matrices (like me), then you probably want to spend some time dwelling on the linear algebraic routines you are using. And if you’re not into the inner workings of the software (like me), then it will probably come as a surprise that quite a bit can be done in R to improve basic operations such as matrix multiplication, Cholesky decomposition, eigen decomposition and so on. As I show below, in some cases these functions can be sped up simply by bolting on a modern BLAS package which enables multi-threading. In other cases this is just not possible, and other software, in particular MATLAB, has super-human algebraic routines which are closed off to the outside world and which we would still like to make use of (provided we can grab a license running around somewhere).

The Problem

In environmental stats, the Cholesky decomposition is probably the most-used tool, since it is required both for random number generation and for posterior inference. The basic chol which comes with R seems to do the job. To compare its performance with MATLAB I took one of the matrices I am using, of dimension 70000 x 70000 with 0.04% non-zeros, and tried the Cholesky on both platforms. On my machine this required 14s in R and 1.3s in MATLAB. This was a relatively sparse matrix; usually I would be working with matrices which are considerably denser (~0.1% non-zeros). A roughly ten-fold speed-up is thus something notable and worth investigating.
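
If you want to time this kind of operation on your own machine, a minimal sketch using the Matrix package looks something like the following (the matrix here is a small random stand-in, not the 70000 x 70000 matrix from the text):

library(Matrix)

n <- 5000                                        # stand-in size; scale up as needed
A <- rsparsematrix(n, n, density = 0.0005)       # random sparse matrix
Q <- forceSymmetric(crossprod(A) + Diagonal(n))  # sparse symmetric positive-definite matrix
system.time(L <- Cholesky(Q))                    # sparse Cholesky via the Matrix package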

Permutations

The first thing to check is whether you are using the right ordering before carrying out the decomposition. To take advantage of a possibly better ordering than that provided by the R package Matrix, I created an R wrapper for the excellent C code by Timothy Davis which implements approximate minimum degree (amd) ordering, the one I personally like to use most in MATLAB. It turns out, however, that the ordering provided by the Matrix package in R, at least for my matrices, actually does the job pretty well, and I got negligible speed-up with amd. This therefore points to something else: the BLAS routines.
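
To check the effect of the ordering on your own matrices, the Matrix package lets you toggle its fill-reducing permutation directly (a sketch, for a sparse symmetric matrix Q as above; note that the perm slot of the factor is stored 0-based):

L_perm <- Cholesky(Q, perm = TRUE)   # fill-reducing ordering chosen by CHOLMOD
L_nat  <- Cholesky(Q, perm = FALSE)  # natural ordering, usually much more fill-in
P <- L_perm@perm + 1                 # recover the permutation as 1-based indices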

OpenBLAS

If you’re going down the route of installing BLAS packages, be ready for at least a couple of hours of frustration, even more if you are a Windows user. The two most common open-source BLAS packages used today are ATLAS and OpenBLAS. The former is, in general, harder to install, and I have numerous failed attempts to do this on a Win64 machine behind me. It is much easier on a Linux box, although one needs to carefully set all the flags as documented here. OpenBLAS installation is extremely straightforward and works fine on a Win64 machine if you use the MinGW and MSYS distributions mentioned here. Note that you do not need to re-install R as indicated in that link: you can take the dll file produced and use it to replace the one in the R\bin\x64 folder (just rename the old one to be safe). If you have not compiled the BLAS package properly, R simply will not start, without indicating what is wrong.

Edit: Make sure you set the flag NO_SHARED=0 when compiling so that the .dll or .so file is generated.

OpenBLAS will give you a big performance boost. You can see some benchmark tests here, and I can confirm that the ten-fold increase in computational speed is pretty much what you can expect. In fact, this matches what is reported when using ATLAS in Imperial’s HIPLARb package. I obtained similar speed-ups when running the tests shown on their homepage, so if you’re on a Windows box, your best option is probably to stick with OpenBLAS.
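
A quick way to check that the new BLAS has actually been picked up is to time a dense operation before and after the swap (a minimal sketch):

n <- 2000
A <- matrix(rnorm(n * n), n, n)
system.time(A %*% A)              # should be several times faster under OpenBLAS
system.time(chol(crossprod(A)))   # the dense Cholesky also benefits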

Cholesky decomposition

All the above-mentioned BLAS benchmark tests are remarkable. I was therefore quite disappointed to note that the runtime of my Cholesky decomposition remained largely unaffected with OpenBLAS. It therefore seems that the performance increase attainable on dense matrix manipulations does not carry over to the sparse realm, and probably what one would need to do here is spend time incorporating Davis’ SuiteSparse package in R. However, if you’re not a devoted C programmer (which I’m not), you have another option which, for me, is doing the job nicely: the MATLAB server.

The MATLAB Server

There is an excellent package on CRAN by Henrik Bengtsson called R.matlab which allows you to instantiate a MATLAB server. This then communicates with MATLAB over TCP/IP and transfers data using the file system. It is easily implemented. Let’s say you have a sparse matrix Q and an ordering P. Then you can simply run the following code in R:


library(Matrix)     # sparse matrix classes
library(R.matlab)

Matlab$startServer()
matlab <- Matlab()
open(matlab)

Qp <- as(Q[P, P], "TsparseMatrix")  # triplet form, so that i, j, x are available
i <- Qp@i + 1                       # MATLAB indices are 1-based
j <- Qp@j + 1
x <- Qp@x
setVariable(matlab, i = i, j = j, x = x)
evaluate(matlab, paste("Q = sparse(i,j,x);",
                       "L = (chol(Q)');"))  # transpose: MATLAB's chol is upper-triangular
L <- getVariable(matlab, "L")$L

Yes, it’s as simple as that, and you can write a wrapper for this (say, chol_MATLAB) which you can call instead of the standard chol. You can also do the ordering using amd in MATLAB and, well, anything else really.
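
Such a wrapper might look something like this (a sketch; chol_MATLAB is a hypothetical name, and the function assumes an already-open connection matlab as above):

chol_MATLAB <- function(Q, P, matlab) {
  # Permute, convert to triplet form and ship to the MATLAB server
  Qp <- as(Q[P, P], "TsparseMatrix")
  setVariable(matlab, i = Qp@i + 1, j = Qp@j + 1, x = Qp@x)
  # Factorise on the server and fetch the lower-triangular factor
  evaluate(matlab, paste("Q = sparse(i,j,x);",
                         "L = (chol(Q)');"))
  getVariable(matlab, "L")$L
}

which you would then call as L <- chol_MATLAB(Q, P, matlab).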

So what is the speed-up like? The Cholesky factorisation takes only 1.3s in MATLAB; however, with communication overhead this goes up to 7.2s when using it as a server, which is still a big improvement over my original 14s. Moreover, one might expect the communication overhead to increase only roughly linearly with the number of non-zeros in the Cholesky factor. The complexity of the whole process is thus dominated by computing the decomposition (since this scales super-linearly), and hence we can expect big dividends as the matrices grow larger. Trying the same operation with a 70000 x 70000 matrix but with 0.06% non-zeros requires 55s with OpenBLAS under an amd permutation in R, but only 21s using the MATLAB server. Even bigger gains are possible when the Cholesky factorisation has to be repeated several times (for example in an MCMC algorithm or gradient descent), since the symbolic pattern of the decomposition remains unchanged and one only needs to transfer the values; doing so reduced the time required to about 14s in this example. In the table below, R-MATLAB2 denotes these value-only transfers.

Non-Zeros   R (s)   R-MATLAB (s)   R-MATLAB2 (s)   MATLAB (s)
0.04%       13.8    7.2            4.7             1.3
0.06%       55.0    20.6           13.8            5.1

The above tests were carried out on a quad-core Intel Core i7-2600S @ 2.8 GHz.
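
For completeness, the value-only transfer used for R-MATLAB2 might look as follows (a sketch; it reuses the i and j vectors already sitting on the server from the first call):

# Only the numerical values change between iterations; the sparsity
# pattern (i, j) is already on the server from the first factorisation
setVariable(matlab, x = Qp@x)
evaluate(matlab, paste("Q = sparse(i,j,x);",
                       "L = (chol(Q)');"))
L <- getVariable(matlab, "L")$L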

Conclusions

If you are an R user and need a lot of linear algebraic operations, you should probably install OpenBLAS (since it’s fairly straightforward); however, this is not likely to improve the performance of your large sparse Cholesky decompositions. For these you should consider either wrapping the SuiteSparse package of Timothy Davis (I have not done this) or using a MATLAB server. I found the latter to be particularly convenient.

Thanks to my dad for helpful discussions on this topic.
