9 December 2015

Tips for R beginners

I'm not an R expert, but these few tips may save you some time, especially when taking Coursera courses.


Don't install R or RStudio from your system package manager. It's a waste of time. Of course it will work and you'll be able to run hello world, but soon you will need some external libraries. Some of them will be outdated and others will have conflicting dependencies, so the installation will fail. At least that's the case with Ubuntu 14.04.

rJava problem

Some libraries require Java. If you have a problem with the 'rJava' library, it's possible that your R installation looks by default for a different (older) Java version than the one you actually have installed. In this case you may try:
sudo R CMD javareconf

as described here: http://stackoverflow.com/a/31316527/1100135
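
Before reconfiguring, it can help to check which Java the R process can see at all. This is only a minimal sketch using base R; the output depends entirely on your machine, and an empty string just means nothing was found:

```r
# Value of JAVA_HOME as seen by the R process ("" if it is not set)
java.home <- Sys.getenv("JAVA_HOME")

# Full path of the java executable found on the PATH ("" if there is none)
java.bin <- Sys.which("java")

print(java.home)
print(java.bin)
```

If these point at an older Java than the one you expect, that is a hint that reconfiguring may be needed.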

Changing locale

If, for any reason, you can't change the locale from inside R, you can start R (here, RStudio) with a different locale:
LC_ALL=C rstudio

You can read more about it using man setlocale. Still, this approach won't let you use several different locales at once.
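
For comparison, when changing the locale from inside R does work, it looks roughly like this. A minimal sketch using base R's Sys.getlocale and Sys.setlocale; it switches only the collation category to the "C" locale, where strings sort in plain byte order, and then restores the previous setting:

```r
# Remember the current collation locale so it can be restored later
old <- Sys.getlocale("LC_COLLATE")

# Switch string sorting to the plain "C" locale
Sys.setlocale("LC_COLLATE", "C")

# In the C locale, upper-case letters sort before lower-case ones
sorted <- sort(c("b", "A", "a", "B"))
print(sorted)  # "A" "B" "a" "b"

# Restore the previous collation locale
Sys.setlocale("LC_COLLATE", old)
```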

Building / transforming formulas

At some point you will want to use the power of lazy evaluation and build or transform formulas instead of writing them by hand. Two functions will be useful: substitute and as.formula. Let's say we want to build a function that takes all the predictors (or, more generally, some part of a formula) and adds the response variable t (the other part of the formula):

make.formula <- function(x) as.formula(substitute(t ~ x))
and now we can call it using:
new.formula <- make.formula(x+y*z)
to get:
Class 'formula' length 3 t ~ x + y * z
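
To check what was actually built, and to show an alternative when the predictors are available as strings rather than as an expression, here is a small sketch. It repeats make.formula from above and uses all.vars and reformulate from base R; the term labels passed to reformulate are made up for illustration:

```r
# Same helper as above: substitute() captures the unevaluated argument
make.formula <- function(x) as.formula(substitute(t ~ x))

f <- make.formula(x + y * z)
print(class(f))     # "formula"
print(all.vars(f))  # "t" "x" "y" "z"

# Alternative: build a formula from character term labels
g <- reformulate(c("x", "y:z"), response = "t")
print(g)
```

The substitute approach is handy when you want to pass an expression interactively; reformulate fits better when the term names come from data, e.g. column names.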

Tuning knitr rendering

Each code chunk {r } accepts optional parameters that let you control, for example, whether the code is executed, whether diagnostic messages are rendered, whether the computation is cached, and whether each command prints its output immediately or the whole output is displayed at the end. Sample:
```{r cache=TRUE, message=FALSE, results='hold'}
system.time(fit <- randomForest(classe ~ ., data=training))
```
This will suppress diagnostic messages (such as those printed when loading a library), cache the trained model and display the whole output at the end of the chunk. Run ?opts_chunk (after library(knitr)) to see the reference page listing the available options, with links to the online documentation.

For inline R do: `r 2 + 3 * x`
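
Chunk options can also be set once for the whole document instead of being repeated on every chunk. A minimal sketch, assuming the knitr package is installed; it would typically go in the first chunk of the document, and the option values here just mirror the sample chunk:

```r
# Only run if knitr is available (assumption: it is installed for rendering)
if (requireNamespace("knitr", quietly = TRUE)) {
  # Defaults applied to every later chunk unless a chunk overrides them
  knitr::opts_chunk$set(cache = TRUE, message = FALSE, results = "hold")

  # Inspect a current default
  print(knitr::opts_chunk$get("results"))
}
```

Options given on an individual chunk still take precedence over these defaults.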


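To measure how long a computation takes, wrap it in system.time (note the lower-case s, R is case-sensitive):
system.time(x <- expensive.function())
or, to compare multiple computations, use benchmark from the rbenchmark package:
benchmark(x <- expensive.function1(), y <- expensive.function2())
The code above does the actual measurement and also assigns the new variables in the current environment.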
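
A self-contained sketch of the same idea using only base R. The function slow.sum is a made-up stand-in for expensive.function; the point is that x is available after timing, so the work is not done twice:

```r
# Hypothetical "expensive" function: sums 1..n with an explicit loop
slow.sum <- function(n) {
  total <- 0
  for (i in seq_len(n)) total <- total + i
  total
}

# Time the call; the assignment to x happens in the current environment
timing <- system.time(x <- slow.sum(1e5))

print(timing[["elapsed"]])  # elapsed seconds, varies by machine
print(x)                    # 5000050000
```

Use <- rather than = for the assignment inside system.time, otherwise R treats x as an argument name.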

Training prediction models with Caret

caret's train delegates to another prediction method based on the method argument. Often it's much faster to call the underlying method directly. We may lose all of caret's meta-parameter tuning, but the resulting model is often good enough, and training can be orders of magnitude faster. E.g., compare:
train(y ~ x, data=training, method='rf')
randomForest(y ~ x, data=training)
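
To see the difference on your own machine, the two calls can be timed side by side. This is only a sketch: it assumes the caret and randomForest packages are installed and uses the built-in iris data set as a stand-in for training, so the actual timings will differ from those on a real data set:

```r
# Run the comparison only if both packages are available
if (requireNamespace("caret", quietly = TRUE) &&
    requireNamespace("randomForest", quietly = TRUE)) {
  set.seed(1)

  # Via caret: resampling and tuning around the random forest
  t.caret <- system.time(
    fit.caret <- caret::train(Species ~ ., data = iris, method = "rf")
  )

  # Directly: a single random forest with default parameters
  t.direct <- system.time(
    fit.direct <- randomForest::randomForest(Species ~ ., data = iris)
  )

  # Elapsed seconds for both approaches
  print(c(caret = t.caret[["elapsed"]], direct = t.direct[["elapsed"]]))
}
```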