9 December 2015

Tips for R beginners

I'm not an R expert, but these few tips may save you some time, especially when doing Coursera courses.

Installation

Don't install R or RStudio from your system package manager. It's a waste of time. Of course it will work and you'll be able to run hello world, but soon you will need some external libraries. Some of them will be outdated, others will have conflicting dependencies, so the installation will fail. At least that's the case with Ubuntu 14.04.

rJava problem

Some libraries require Java. If you have a problem with the 'rJava' library, it's possible that your R installation by default looks for a different (older) Java version than the one you actually have installed. In this case you may try:
sudo R CMD javareconf

as described here: http://stackoverflow.com/a/31316527/1100135

Changing locale

If, for any reason, you can't change the locale from inside R, you can run the whole of R with a different locale:
LC_ALL=C rstudio

You can read more about it using man setlocale. Still, it won't let you use a few different locales at once.

Building / transforming formulas

At some point you will want to use the power of lazy evaluation and build/transform formulas instead of providing them by hand. Two functions will be useful: substitute and as.formula. Let's say we want to build a function that takes all the predictors (or, more generally, some part of a formula) and adds the response variable t (the other part of the formula):

make.formula <- function(x) as.formula(substitute(t ~ x))
and now we can call it using:
new.formula <- make.formula(x+y*z)
str(new.formula)
to get:
Class 'formula' length 3 t ~ x + y * z

Tuning knitr rendering

Each code chunk {r} accepts optional parameters that let you, for example, control whether the code is executed, whether diagnostic messages are also rendered, whether the computation is cached, and whether each command prints its output immediately or the whole output is displayed at the end, etc. Sample:
```{r cache=T, message=F, results='hold'}
library(randomForest)
system.time(fit <- randomForest(classe ~ ., data=training))
fit
```
This will exclude the diagnostic messages from loading the library, cache the trained model and display the whole output at the end. Run ?opts_chunk (after library(knitr)) to see the reference page of available options and links to the online documentation.

For inline R do: `r 2 + 3 * x`

Benchmarking

system.time(x <- expensive.function())
or to compare multiple computations:
library(rbenchmark)
benchmark(x <- expensive.function1(), y <- expensive.function2())
The above code will do the actual measurement and will also assign the new variables in the current environment.

Training prediction models with Caret

train delegates to another prediction method based on its type. Often it's way faster to call the underlying method directly. We lose all of caret's meta-parameter tuning, but the model we get is often good enough, while the training is orders of magnitude faster. E.g.:
train(y ~ x, data=training, method='rf')
randomForest(y ~ x, data=training)

16 September 2015

Integer overflow: zero from non-zero multiplication

People sometimes ignore the overflow problem because their algorithm is still valid (e.g. checking if a counter has changed within some short period of time). But keep in mind that overflow makes some fundamental math laws no longer hold. One of them is: when we multiply a few integers, we get zero if and only if at least one of the factors is zero. Let's see that in action. Will this loop ever end?
int x = 1;
while(true) {
   x *= RandomUtils.nextInt(1, Integer.MAX_VALUE); // positive random number
   if (x == 0) break;
}
In practice it ends almost instantly. Even though we multiply only non-zero integers, we get zero as a result. And, of course, it has nothing to do with rounding precision. When we take a closer look, it becomes obvious that to get zero we just have to produce any number whose binary representation ends with 32 zeros (in the case of Java's int), i.e. a multiple of 2^32. And x accumulates trailing zeros with each multiplication by a factor that has 2 among its prime factors (every second loop iteration on average). So to get zero you can simply multiply two valid ints: (1 << 30) * 4. The same works with any combination of positive and negative numbers: in Java's two's-complement representation, negative powers of two also accumulate trailing zeros.
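You can watch the trailing zeros accumulate with Integer.numberOfTrailingZeros. A minimal, self-contained sketch (assuming commons-lang3 on the classpath, as in the loop above; the class name is made up):

import org.apache.commons.lang3.RandomUtils;

public class OverflowDemo {
  public static void main(String[] args) {
    int x = 1;
    int iterations = 0;
    while (x != 0) {
      x *= RandomUtils.nextInt(1, Integer.MAX_VALUE); // positive random number
      iterations++;
      // the count of trailing zero bits never decreases; once it reaches 32 the int is 0
      System.out.println(iterations + ": trailing zeros = " + Integer.numberOfTrailingZeros(x));
    }
    System.out.println("Got zero after " + iterations + " multiplications");
  }
}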

5 July 2015

IoC is not DI

I often see posts and hear people using the terms 'inversion of control' and 'dependency injection' interchangeably. But they are not the same thing.

Inversion of control is a design pattern for, let's say, decoupling 'what' from 'when'. It lets some generic code pass the flow of control to custom components. It increases modularity and extensibility. It's about applying the Hollywood Principle: "Don't call us, we'll call you".

Dependency injection is a design pattern that applies IoC to resolving dependencies. In this pattern a component X doesn't have control over the creation of its own dependencies anymore. Instead, the control is inverted and given to another component Y, which creates the dependencies and injects them into X.
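A minimal sketch of the difference (all the class names here are made up for illustration):

interface Database { void save(String record); }

class PostgresDatabase implements Database {
  public void save(String record) { /* ... */ }
}

// Without DI: the component controls the creation of its own dependency.
class ReportWriter {
  private final Database db = new PostgresDatabase(); // hard-coded choice
  void write(String report) { db.save(report); }
}

// With DI: the control is inverted; some other component creates the dependency
// and injects it, e.g. new InjectedReportWriter(new PostgresDatabase()).
class InjectedReportWriter {
  private final Database db;
  InjectedReportWriter(Database db) { this.db = db; }
  void write(String report) { db.save(report); }
}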

But DI is not the only realization of IoC. Some others are:

service locator,
aspects,
events / callbacks / event handlers,
strategy,
template method

and those patterns are often used to implement GUIs, schedulers, plugins, declarative transactions and many frameworks/libraries (like servlets, JUnit runners etc).
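For example, a plain callback passed to a scheduler already inverts control without any dependency injection. A minimal sketch (names made up for illustration):

import java.util.Arrays;
import java.util.List;

class Scheduler {
  // Generic framework code: it decides *when* to run the tasks,
  // the callers decide *what* runs -- "don't call us, we'll call you".
  void runAll(List<Runnable> tasks) {
    for (Runnable task : tasks) {
      task.run();
    }
  }

  public static void main(String[] args) {
    Runnable backup = () -> System.out.println("backup");
    Runnable cleanup = () -> System.out.println("cleanup");
    new Scheduler().runAll(Arrays.asList(backup, cleanup));
  }
}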

22 April 2015

Hirschberg's algorithm explanation

Some time ago I needed to implement Hirschberg's algorithm but I couldn't find any good explanation of it. So here is my attempt.

First of all: it's used to find the optimal sequence alignment. This alignment has to be measured somehow. As a metric we can use, for example, the Levenshtein distance, which means insertion, deletion and substitution all have the same cost. This can be directly translated into computing the list of all changes required to turn one word into another.

Let's say we have a word L of size n and a word R of size m.

Needleman–Wunsch algorithm

First we need to understand the Needleman–Wunsch algorithm. It's the easiest way to compute the alignment. The algorithm takes just a few lines of code. Its time complexity is O(n*m) and its space complexity is also O(n*m). This is simple dynamic programming: having the edge values, we fill the array from the top-left to the bottom-right corner, choosing the minimum (or maximum, depending on the metric; let's say we are looking for the minimum edit distance). For L = bcd and R = abcde:



After the whole array is filled we go backward (from bottom-right corner) and recreate the path.



Each horizontal arrow represents an insertion, each vertical one a deletion, and each diagonal one a match or a replacement (depending on whether the letters match or not). Therefore this array represents the alignment:
- b c d -
  | | |
a b c d e
It also means that the edit distance between bcd and abcde is 2, so we can change one into the other with only 2 changes. But that's not all. We can get more information from this table. The last row contains the edit distances (D) between L and all prefixes of R:

D(bcd, abcde) = 2
D(bcd, abcd) = 1
D(bcd, abc) = 2
D(bcd, ab) = 3
D(bcd, a) = 3
D(bcd, ) = 3
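A minimal Java sketch of this table fill (the method name is made up; it only computes the distances, not the backward path):

// Fills the full (n+1) x (m+1) edit-distance table for words l and r.
// The last row holds the distances between l and every prefix of r.
static int[][] editDistanceTable(String l, String r) {
  int n = l.length(), m = r.length();
  int[][] d = new int[n + 1][m + 1];
  for (int i = 0; i <= n; i++) d[i][0] = i; // deleting all characters of l
  for (int j = 0; j <= m; j++) d[0][j] = j; // inserting all characters of r
  for (int i = 1; i <= n; i++) {
    for (int j = 1; j <= m; j++) {
      int change = d[i - 1][j - 1] + (l.charAt(i - 1) == r.charAt(j - 1) ? 0 : 1);
      int deletion = d[i - 1][j] + 1;
      int insertion = d[i][j - 1] + 1;
      d[i][j] = Math.min(change, Math.min(deletion, insertion));
    }
  }
  return d;
}

// editDistanceTable("bcd", "abcde")[3] is {3, 3, 3, 2, 1, 2} -- exactly the distances listed above.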

If we are interested only in the edit distance but not in the whole alignment, then we can reduce the Needleman–Wunsch algorithm's space complexity. The key observation is that we fill the array row by row (or column by column), so at any time we need only 2 adjacent rows of the array:



Its space complexity is O(n) or O(m) (depending on the filling direction) and the time complexity remains O(n*m). Let's call this version NW'. If we choose the right direction then the last row still contains the edit distances (D) between L and all prefixes of R.
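A minimal sketch of NW' (the method name nwLastRow is made up), keeping only two rows and returning the last one; the 'backward' run used later is just this method applied to the reversed words:

// NW': returns the last row of the edit-distance table, i.e. the distances
// between l and every prefix of r, using only two rows of memory.
static int[] nwLastRow(String l, String r) {
  int[] previous = new int[r.length() + 1];
  int[] current = new int[r.length() + 1];
  for (int j = 0; j <= r.length(); j++) previous[j] = j; // row for the empty prefix of l
  for (int i = 1; i <= l.length(); i++) {
    current[0] = i;
    for (int j = 1; j <= r.length(); j++) {
      int change = previous[j - 1] + (l.charAt(i - 1) == r.charAt(j - 1) ? 0 : 1);
      current[j] = Math.min(change, Math.min(previous[j] + 1, current[j - 1] + 1));
    }
    int[] tmp = previous; previous = current; current = tmp; // reuse the two buffers
  }
  return previous; // after the last swap this is the row for the whole l
}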

Hirschberg's algorithm

Hirschberg's algorithm lets us compute the whole alignment using O(n*m) time and O(m+n) space. It uses the NW' algorithm and a divide-and-conquer approach (a code sketch follows at the end of this walkthrough):
  1. if the problem is trivial, compute it directly:
    1. if n or m is 0 then there are only m insertions or n deletions
    2. if n = 1 then there are m-1 insertions and 1 match or change
  2. if the problem is bigger, divide it into 2 smaller, independent problems:
    1. divide L in the middle into 2 sublists L1, L2
    2. find the optimal division R = R1 R2
    3. recursively find the alignment of L1 and R1
    4. recursively find the alignment of L2 and R2
    5. concatenate the results
Of course point 2 needs a better explanation. How can we divide a big problem into smaller ones? Let's imagine some optimal alignment between L = a1...a27 and R = b1...b32:



It means we can transform L into R in 15 operations (to be more precise: the cost of the transformation is 15). Obviously we can divide this alignment at any arbitrary position into 2 independent alignments, for example:


The above edit distances are of course just an example. The point is, they must add up to the overall edit distance (15).

So now we have L = (a1...a7) (a8...a27) = L1 L2 and R = (b1...b4) (b5...b32) = R1 R2. It means we can transform L1 into R1 in 4 operations and L2 into R2 in 11 operations. We can concatenate those transformations to get the overall result. It means that if we somehow know the right division, we'll be able to compute the alignment of L with R by computing the alignments of L1 with R1 and L2 with R2 and simply concatenating the results. So it will be possible to split the big problem recursively until we reach the trivial cases. For example (from Wikipedia), to align AGTACGCA and TATGC we could do:
(AGTACGCA,TATGC)
               /              \
        (AGTA,TA)            (CGCA,TGC)
         /     \              /      \
     (AG,)   (TA,TA)      (CG,TG)  (CA,C)
              /   \        /   \
           (T,T) (A,A)  (C,T) (G,G)

But if we have L divided into L1 and L2, how do we know where to divide R? Let's say we want to align (bcdce, abcde):



First, let's arbitrarily divide L in the middle. So we have two new lists L1 and L2.



So where is the correct division of R? Well, we just need to ensure that edit distance(L1, R1) + edit distance(L2, R2) will not be larger than the overall edit distance(L, R). That is, we will use the shortest possible transformation (or alignment). How do we ensure that? Well, at least one of the possible divisions is optimal. So just check all of them and pick the one with the smallest sum of edit distances.
OK, but how do we quickly compute the edit distances of L1 with all possible R1 and of L2 with all possible R2? Here the NW' algorithm can help us. It's straightforward to run it for the upper half:



After the NW' run is completed we have the edit distances between L1 and all possible R1:



But what about L2 and all possible R2? Again NW' can help; we just need to run it backwards. After all, the edit distance is the same whether we read both words forwards or backwards, right?



After the NW' run is completed we have the edit distances between L2 and all possible R2:



Now, we need to choose the partition of R. Let's say we choose to divide R = abcde into R1 = empty and R2 = abcde:



Now we would have to align L1 = bcd with R1 = empty and L2 = ce with R2 = abcde. What would be the edit distances?



It means we can transform L1 (bcd) into R1 (empty) in 3 operations and L2 (ce) into R2 (abcde) in 3 operations. So choosing this division we would transform L into R in 6 operations. In other words, the cost of the alignment of L and R with this division would be 6. Can we do better? What if we place the dividing line elsewhere?



Now we would align L1 (bcd) with R1 (a) with cost 3 and L2 (ce) with R2 (bcde) with cost 2. Overall alignment cost (edit distance) would be 3 + 2 = 5. And so on. The last possible position is:



That is aligning L1 (bcd) with R1 (abcde) with cost 2 and L2 (ce) with R2 (empty) with cost 2. The overall cost (distance) is 2 + 2 = 4. Obviously the smallest overall edit distance is:



So this is the division we were looking for! Now we have 2 independent subproblems: align L1 = bcd with R1 = abcd and L2 = ce with R2 = e. Let's look at the array: it's divided into 4 parts and now we are only interested in the top-left and bottom-right parts. We will never need to look at the top-right and bottom-left parts. So the overall problem has been reduced by half:



That's it. The rest is just basic but somewhat tedious arithmetic on subarray boundaries and debugging off-by-one errors.
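Here is the promised sketch of the whole recursion, reusing the nwLastRow helper from above (the operation encoding and method names are made up for illustration):

import java.util.ArrayList;
import java.util.List;

// Returns a list of edit operations that transform l into r, e.g. "del b", "keep c", "ins a".
static List<String> hirschberg(String l, String r) {
  List<String> ops = new ArrayList<>();
  if (l.isEmpty()) {                                   // trivial case: only insertions
    for (char c : r.toCharArray()) ops.add("ins " + c);
  } else if (r.isEmpty()) {                            // trivial case: only deletions
    for (char c : l.toCharArray()) ops.add("del " + c);
  } else if (l.length() == 1) {                        // trivial case: m-1 insertions and 1 match or change
    int match = r.indexOf(l.charAt(0));
    int pos = Math.max(match, 0);
    for (int j = 0; j < r.length(); j++) {
      if (j == pos) ops.add(match >= 0 ? "keep " + l.charAt(0) : "change " + l.charAt(0) + "->" + r.charAt(j));
      else ops.add("ins " + r.charAt(j));
    }
  } else {
    int mid = l.length() / 2;
    String l1 = l.substring(0, mid), l2 = l.substring(mid);
    int[] upper = nwLastRow(l1, r);                    // distances between L1 and every prefix of R
    int[] lower = nwLastRow(reverse(l2), reverse(r));  // backward run: L2 and every suffix of R
    int split = 0, best = Integer.MAX_VALUE;
    for (int j = 0; j <= r.length(); j++) {            // pick the division with the smallest sum
      int cost = upper[j] + lower[r.length() - j];
      if (cost < best) { best = cost; split = j; }
    }
    ops.addAll(hirschberg(l1, r.substring(0, split))); // top-left subproblem
    ops.addAll(hirschberg(l2, r.substring(split)));    // bottom-right subproblem
  }
  return ops;
}

static String reverse(String s) {
  return new StringBuilder(s).reverse().toString();
}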

Complexity

space: We need to store L and R. Each NW' run uses only 2 rows of size O(m) (or O(n), depending on the direction). Each time we divide L by 2, so the recursion depth is O(log n). Therefore the total space used is O(n+m).

time: At the first step we run NW' on the whole L x R array. Then 2 subproblems are created whose total size is half of the original array. So the number of operations is (1 + 1/2 + 1/4 + ...) * (n*m) = O(n*m).

5 March 2015

Java 8: Streaming a string

It's time to finally start learning the Java 8 API. Recently I needed to modify a string by replacing some characters with randomly generated values: one method for '#' -> random digit, one method for '?' -> random lower-case letter, and one for both replacements. Each character should be processed independently, therefore a simple replaceAll was not an option. However, mapping a stream sounds easy. Unfortunately Java doesn't provide a CharStream or a CharJoiner, so it's a bit uglier and less efficient than it should be.
import static org.apache.commons.lang3.RandomStringUtils.randomAlphabetic;
import static org.apache.commons.lang3.RandomStringUtils.randomNumeric;
import java.util.function.Supplier;
import java.util.stream.Collectors;
import com.google.common.base.Preconditions;

public class Replacer {

  private String replace(String toReplace, char pattern, Supplier<String> replacementSupplier) {
    Preconditions.checkNotNull(toReplace);
    return toReplace
             .chars()
             .mapToObj(c -> c != pattern ? String.valueOf((char)c) : replacementSupplier.get())
             .collect(Collectors.joining());
  }
 
  public String withLetters(String letterString) {
    return replace(letterString, '?', () -> randomAlphabetic(1).toLowerCase());
  }

  public String withNumbers(String numberString) {
    return replace(numberString, '#', () -> randomNumeric(1));
  }

  public String withBoth(String string) {
    return withNumbers(withLetters(string));
  }
}
I know this can be done more efficiently using direct operations on ints. But can this be done more efficiently without sacrificing readability? Any suggestions are welcome.
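For comparison, a minimal sketch of the more direct approach (a method that could sit next to replace in the class above; the name replaceDirect is made up): it walks the characters once and appends to a StringBuilder, avoiding the per-character String objects and the joining collector.

  private String replaceDirect(String toReplace, char pattern, Supplier<String> replacementSupplier) {
    Preconditions.checkNotNull(toReplace);
    StringBuilder sb = new StringBuilder(toReplace.length());
    for (int i = 0; i < toReplace.length(); i++) {
      char c = toReplace.charAt(i);
      // append either the original character or a freshly generated replacement
      if (c == pattern) {
        sb.append(replacementSupplier.get());
      } else {
        sb.append(c);
      }
    }
    return sb.toString();
  }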

19 January 2015

Make git never ask for your GitHub username again

A lot of answers on SO advise you to change your repo URL after you've cloned the project. Don't do it. Don't change your repo URL per project. That's a waste of time. Soon you will clone another project and you will have to do it again. A better way is to set the username globally so it applies to all projects from GitHub, current and future ones. Just type:
git config --global credential."https://github.com".username YOUR_USERNAME
This will add
[credential "https://github.com"]
username = YOUR_USERNAME
to your ~/.gitconfig