GPUs… the jury is still out

GPUs are (were??) all the rage, and unless you’ve been hiding under some stone for the last few years, you’ve heard all the stories of how they made inhumane computations humane. I have been very intrigued by GPUs: you can have thousands of cores slowly ticking away at a big problem that can be broken down into small little bits. It reminds me of swarm control in a way; ants are pretty insignificant on their own, but as a colony they can do things you would never think possible.

I’ve been venturing down this path for a while. I’ve made sure that my laptops and desktops all have NVIDIA GPUs so I can play with CUDA, and I organised a couple of one-hour meetings at our local UQDG (Uncertainty Quants Discussion Group) where I talked, probably to the dismay of those who attended, about GPUs and how they can change the world. I installed the obligatory CUDA SDKs and the R packages which enable GPU computations (rpud), I’ve even used Rcpp to do some really low-level GPU computing to add matrices, and today (more on this later) I finally managed to install MAGMA (the pain required to integrate this with R will appear in another post). The story goes on. Throughout all of this, I’ve been trying to convince myself that GPUs are worth it. Are they?

Let’s start from the beginning. My laptop has quite a basic GPU, but I wouldn’t call it “bad” by any means. It’s an NVIDIA 750M with 350 cores @ 1GHz. My processor (a Core i7) has 4 cores (8 with hyperthreading) @ 1.7GHz. OK, I can already hear you saying that, yes, the i7 is the best home processor around, and I totally agree. But I have nearly 100 times more cores in my GPU, so surely I should see some performance increase on the very basic computations (e.g. matrix multiplication)?

To see the improvement I started with `rpud`, an R package with which you can do some fancy stuff, but also some simple stuff like computing distance matrices. The results were astonishing: using the example they have here, I got a speed-up of about 30x in computing a distance matrix for 4000 points in 120 dimensions. However, when I tried running it again with 10000 points in 2 dimensions, the GPU was now slower than the CPU. I think I have an explanation for this: (i) if you have too few data items per thread, most of the work is simply memory transfer and the GPU is going to be waiting most of the time, and (ii) the sqrt function (if treated as a transcendental function) takes many clock cycles to execute and is a limiting factor.
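For reference, the comparison I ran looked roughly like the sketch below. I’m assuming here that `rpud` is installed and finds the CUDA runtime, and that its `rpuDist` function (the GPU counterpart of `dist`) is available:

> library(rpud)
> m <- matrix(rnorm(4000 * 120), nrow = 4000)   # 4000 points in 120 dimensions
> system.time(d.cpu <- dist(m))                 # base R, on the CPU
> system.time(d.gpu <- rpuDist(m))              # rpud, on the GPU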

I was still quite impressed with this speed-up though, so I coded my own first GPU function for a project I was working on at the time. This function was like a distance function, except that it weighted everything using a kernel. Unfortunately I only work in two dimensions (spatial stats), and this kernel had exponentials and things like that in it, so the speed-up was practically zero (maybe 2% faster over several runs, definitely not worth the effort).
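To give a flavour of the kind of function I mean, here is a rough CPU-only sketch; the function name and the squared-exponential kernel are purely illustrative, not the exact kernel I used:

> # weight pairwise distances with a (Gaussian) kernel with length-scale l
> kernel.weighted.dist <- function(X, l = 1) {
>   D <- as.matrix(dist(X))        # pairwise Euclidean distances
>   exp(-D^2 / (2 * l^2))          # kernel weighting of each distance
> }
> W <- kernel.weighted.dist(matrix(rnorm(1000 * 2), ncol = 2))   # 1000 points in 2D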

There were many (mostly futile) attempts to replicate the 30-fold speed-up I saw earlier. In the end I decided to try to integrate MAGMA with R. MAGMA allows you to carry out standard operations, like LU and Cholesky decompositions, using your resident GPU, and to do so seamlessly. This is similar to the GPU accelerator technology available with MATLAB. After a lot of effort to integrate it (to be discussed in another post), I still did not observe the promised speed-ups on simple operations like multiplication and decompositions. I paste some microbenchmarks below:

### Problem setup

> library(microbenchmark)
> library(magma)
> A <- matrix(rnorm(500^2),500,500)
> A <- crossprod(A)
> Am <- magma(A)

### Multiplication (~ twice as slow)

> microbenchmark(A %*% A)
Unit: milliseconds
    expr      min      lq     mean   median       uq      max neval
 A %*% A 6.143687 6.19627 7.017437 6.279954 7.748259 15.64559   100

> microbenchmark(Am %*% Am)
Unit: milliseconds
      expr      min       lq     mean   median       uq      max neval
 Am %*% Am 11.96976 13.36487 19.81063 13.76507 14.27496 44.86355   100

### LU Decomposition (~ twice as slow)

> microbenchmark(Matrix::lu(A))
Unit: milliseconds
          expr      min       lq    mean   median       uq      max neval
 Matrix::lu(A) 4.697579 4.848882 10.3484 5.950809 6.376104 97.37289   100

> microbenchmark(lu(Am))
Unit: milliseconds
   expr     min       lq     mean   median       uq     max neval
 lu(Am) 10.1704 11.68315 17.81248 12.86835 14.24375 149.158   100

### Addition (10–30x slower; note the CPU timings are in microseconds)

> microbenchmark(A + A)
Unit: microseconds
  expr     min       lq     mean  median       uq     max neval
 A + A 530.407 599.5015 809.1391 613.892 641.0855 2554.23   100

> microbenchmark(Am + Am)
Unit: milliseconds
    expr      min       lq     mean   median       uq      max neval
 Am + Am 4.873672 6.547027 26.95747 7.789786 9.551198 92.05331   100
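I mentioned Cholesky decompositions above but have not pasted timings for those; for completeness, the analogous comparison would be the two calls below, assuming the magma package also dispatches chol() on magma objects:

> microbenchmark(chol(A))
> microbenchmark(chol(Am))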

The above results really only show one thing: one needs to put in some effort for GPUs to be of use. Unfortunately, from a statistician’s point of view, there is no immediate (and by that I mean free-lunch) benefit to using a GPU, since even simple operations like multiplication and decompositions (which are what we use the most) are probably going to be slower. I think I’d need a GPU which is 10 times faster for the improvement to be worth it (remember that a lot of the time quoted above is spent on data transfer), and that means a top-of-the-range GPU. Don’t forget, though, that we can have more than one GPU, and the GEMM libraries allow you to distribute these operations seamlessly across multiple GPUs. I think if I had two or three state-of-the-art GPUs in a box, then we could start talking business.