What’s the issue?
Sometimes it is easy to get caught up in the intricacies of the publication process: writing up, submitting, revising, re-submitting and so on. Frequently all this effort falls on one or two authors, which is a big ask in itself. Understandably, the urge is often to close off a chapter as quickly as possible in order to start a new one. Code and data are shrugged off as ‘having served their use’ and end up stored on a computer somewhere, only to be misplaced or lost in some re-filing exercise ten years down the line.
We have all heard of the importance of commenting and organising your code and data, but we should focus more on the ‘permanence’ of code and results. I believe that in a global research environment, where we constantly build (often unknowingly) on contributions from our peers and from generations before us, the algorithms used in research papers should be made publicly available, even if the data cannot be for confidentiality reasons. In the very few cases where even the code cannot be disseminated, at the very least snippets of code reproducing some of the results in the publication should be provided.
There are several advantages to making your code reproducible with little effort on the part of an external researcher:
- It increases transparency. This is not an assurance that your code is optimal, or even correct for that matter. It is an expression of belief in your research: that you would rather have someone re-use your code and find any inconsistencies, if there are any, than have something safely tucked away that nobody can investigate.
- It increases trust. Making sure your results can be reproduced with ease shows fellow researchers that you have nothing to hide and that no ‘tricks’ were needed to get the reported results (or if there were, you can be upfront about them).
- It ensures permanence. It is easy to replicate results from a script file that you coded on your machine, while you still have that machine and remember where the file is. But what about 10 or 20 years down the line? Permanence ensures that if someone questions your methods a decade from now, you are still able to provide an answer.
To ensure reproducibility I have come up with a protocol that I try to adhere to whenever possible.
A reproducibility protocol
The protocol below is specific to the programming language used predominantly in our group, R. It is heavily based on the book R Packages: Organize, Test, Document, and Share Your Code by Hadley Wickham.
- Organise your code into self-contained functions and put them into an R package. This is the first and most important step of the reproducibility protocol. If possible, write automatic test functions to make sure the interface to the functions is predictable and robust to future modifications; automatic testing is provided by the testthat package. The importance of packaging functions cannot be over-emphasised: packaging ensures encapsulation, so that functions do not implicitly depend on global variables or other script-specific options that may be set.
- Once your functions are in the package, document them using Roxygen2. The documentation does not need to be as rigorous and organised as that required, for example, for a CRAN package, but it needs to be understandable and self-contained. Because Roxygen2 keeps the documentation right next to the code, it is far more likely to be updated when the interface to a function changes.
- Keep the data somewhere permanent. If not confidential, put your data in the data folder of your R package or, if very large, in a data repository such as datahub.io, and reference that repository explicitly in your code (or at least leave the download command commented out if a local copy is cached). In any case, the raw data should be uploaded, not some modified version of it.
- Set the random seeds. Most computational statistics involves a random number generator, which can be rendered reproducible by setting the seed at the beginning of the program. Things get a bit tricky with parallel threads, for which set.seed() alone does not solve the problem. When working with parallel threads, use the excellent doRNG package, which ensures that every thread gets a unique seed, but one that is the same every time the code is run.
- Place the reproducible script file as a vignette in the R package. Having a vignette reproducing your results is helpful, as an output document can be produced showing additional results and figures; if something is wrong, this can usually be seen at a glance from the output document. If the code takes a long time to run, insert flags ensuring that, when in development mode, time-consuming results are cached in the data folder; when not in development mode, these cached results can simply be loaded. All that remains is to ensure the script is run once in development mode before consolidation.
- Use the figures generated by the vignette in your paper’s TeX file. Not everybody is comfortable uploading the source files of a paper online (and hence putting the entire paper in the vignette), but that does not mean the paper’s source files cannot use the vignette figures. Once the TeX file is linked to the correct figure folder, you can rest assured that the figures appearing in your document come not from stale, outdated code but from your updated, publicly available one.
- Use an open-source repository such as GitHub. This ensures not only permanence of the code but also accountability: every change to the code is recorded. When the paper using the code is submitted, you can release a version of the reproducible package. When the paper comes back for corrections, you will probably change the package and the vignette, and on re-submission you re-version the package. This has several advantages. If you are writing a second paper on the same topic that uses the same functions, you need not be afraid of altering the results of your previous paper, since you can always revert to the package version that was released with the first paper.
- Test on several platforms. This is particularly straightforward if you have a reproducible package: there is no messing around with Makefiles and compiler-specific options, and even if your package contains source code, it will install seamlessly on most platforms with minimal effort. Check the output on each platform and note down any inconsistencies. Make sure that the platform and the relevant software versions of the machine used to generate the results are clearly listed on the development GitHub page. If you are using a high-performance computer (HPC), try at least two HPCs; if this is not possible, state it clearly on the development page.
- Keep track of package versions. One of the most annoying realities of reproducibility is that keeping your end of the bargain is usually not enough: you depend on others to keep theirs. In open-source software, backward compatibility is not adhered to as much as one would wish, so you must keep track of the versions of the packages you depend on. At the very minimum, capture the output of sessionInfo() and paste it on your GitHub page. Alternatively, use packrat or checkpoint to pin package versions.
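As a sketch of the packaging and automatic-testing step, here is what a testthat test file might look like for a hypothetical package function ci_length(); the function, file path and checks are illustrative, not taken from any particular project:

```r
# A hypothetical package function: length of a t-based confidence interval
ci_length <- function(x, conf.level = 0.95) {
  diff(t.test(x, conf.level = conf.level)$conf.int)
}

# tests/testthat/test-ci_length.R -- run automatically by devtools::test()
library(testthat)

test_that("ci_length returns a single non-negative number", {
  set.seed(1)
  x <- rnorm(100)
  expect_length(ci_length(x), 1)
  expect_gte(ci_length(x), 0)
})

test_that("a wider confidence level gives a longer interval", {
  set.seed(1)
  x <- rnorm(100)
  expect_gt(ci_length(x, conf.level = 0.99), ci_length(x, conf.level = 0.90))
})
```

Tests like these document the intended interface as much as they guard it: if a later edit changes what ci_length() returns, the test suite fails before the paper's results silently drift.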
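The documentation step, for the same kind of hypothetical function, might look as follows; the Roxygen2 comments above the definition are converted into a standard help page by devtools::document():

```r
#' Length of a t-based confidence interval
#'
#' Computes the length of the two-sided confidence interval for the mean
#' of a numeric sample, using \code{t.test}.
#'
#' @param x A numeric vector of observations.
#' @param conf.level Confidence level of the interval; defaults to 0.95.
#' @return A single non-negative number: the length of the interval.
#' @examples
#' ci_length(rnorm(50))
#' @export
ci_length <- function(x, conf.level = 0.95) {
  diff(t.test(x, conf.level = conf.level)$conf.int)
}
```

Because the documentation sits immediately above the code it describes, a change to the function's arguments is hard to make without noticing that the @param entries need updating too.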
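For the data step, a minimal sketch of shipping raw data inside the package; the dataset name and file path are illustrative, and usethis::use_data() writes an .rda file into the package's data folder:

```r
# Read the *raw* data (not a cleaned-up version) from wherever it lives
trial_raw <- read.csv("inst/extdata/trial_raw.csv")  # illustrative path

# Store it in data/trial_raw.rda; users then access it with data(trial_raw)
usethis::use_data(trial_raw)

# For data too large to ship, download it explicitly in the script instead,
# or leave the download command commented out when a local copy is cached:
# trial_raw <- read.csv("https://datahub.io/<your-dataset-url>")
```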
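The seeding step for parallel threads can be sketched with a doParallel backend; with doRNG's %dorng% operator, each iteration draws from its own reproducible RNG stream derived from the seed, so the combined result is the same on every run regardless of how iterations are scheduled:

```r
library(doParallel)
library(doRNG)

cl <- makeCluster(2)
registerDoParallel(cl)

# Seed once; %dorng% derives an independent, reproducible stream per iteration
set.seed(123)
res <- foreach(i = 1:4, .combine = c) %dorng% mean(rnorm(1e4))

stopCluster(cl)
```

Running the same loop with %dopar% and only set.seed() would not be reproducible, since the workers' streams would depend on scheduling.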
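The development-mode caching flag described in the vignette step can be sketched as follows; run_long_simulation() and the cache path are hypothetical stand-ins for whatever computation dominates the run time:

```r
# Inside the vignette: re-run expensive computations only in development mode
dev_mode <- FALSE  # set to TRUE once, to (re)generate the cached results

if (dev_mode) {
  fit <- run_long_simulation()              # hypothetical, takes hours
  save(fit, file = "../data/fit_cache.rda") # cache in the package data folder
} else {
  load("../data/fit_cache.rda")             # fast path for readers
}

plot(fit)  # downstream figures and tables use `fit` either way
```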
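For the figure step, one way to link the paper to the vignette is to have the vignette write every plot to a stable folder, which the TeX file then points at (for example with \graphicspath{{vignettes/figures/}}); the folder and file names here are illustrative:

```r
# In the vignette: save each figure under a fixed name in a fixed folder
dir.create("figures", showWarnings = FALSE)

pdf(file.path("figures", "fig1-coverage.pdf"), width = 6, height = 4)
plot(1:10, type = "b", main = "Figure 1 (illustrative)")
dev.off()
```

Rebuilding the vignette then refreshes the paper's figures automatically, so the document can never silently show output from outdated code.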
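Finally, a sketch of recording package versions; the file name is illustrative, and the packrat calls are shown commented out since they modify the project directory:

```r
# Minimal option: snapshot the loaded packages and their versions to a file
# that can be pasted on the GitHub page alongside the release
writeLines(capture.output(sessionInfo()), "sessionInfo.txt")

# Heavier option: let packrat manage a private, versioned library
# packrat::init()      # run once in the project directory
# packrat::snapshot()  # records exact versions in packrat/packrat.lock
```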