20 March 2014

The myth of random data in unit tests

I often see people generate random data for every irrelevant variable in a test:
String anyName = RandomStringUtils.random(8);

Customer customer = customerBuilder()
                               .withName(anyName)
                               .withAge(18)
                               .build();

assertThat(customer).isAdult();
First of all, it would probably be better if this test looked something like this:
Customer customer = newCustomerWithAge(18);

assertThat(customer).isAdult();
I know, I know: sometimes tests are a bit more complex and badly written and you just need the name as a constant. So why not simply:
private static final String ANY_NAME = "John";
...
customer = customerBuilder()
                      .withName(ANY_NAME)
                      .withAge(18)
                      .build();

assertThat(customer).isAdult();
Does the random generator make you feel safer? If the name is irrelevant, why bother generating it? It just makes your code less readable.

But some people go even further. Let's say we want to test StringUtils.contains from Apache Commons Lang. Some people want to generate even the significant parameters:
String random1 = randomString();
String random2 = randomString();
...
assertTrue(StringUtils.contains(random1 + random2 + random3, random2));
Easy, right? But how will we test if it returns false correctly? Now our random data needs to obey some specific constraints. So it's rather hard to generate the data without, in fact, implementing the functionality again in tests. Another problem is that when you have such tests you think everything is tested and you stop thinking about corner cases.

But is everything really tested? What about nulls? What about empty strings? What about combinations of them? And even if your generator can produce nulls and empty strings, still: is everything tested?
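A deterministic alternative is to list those corner cases by hand. The sketch below is only an illustration: the local `contains` method is a hypothetical stand-in that mirrors the documented null-safe behaviour of StringUtils.contains (null on either side gives false, an empty search string is always found), so that the example runs without the library. In a real test you would call the library method instead.

```java
class ContainsEdgeCases {

    // Hypothetical stand-in for StringUtils.contains(text, search):
    // null-safe, and an empty search string is contained in any non-null text.
    static boolean contains(String text, String search) {
        if (text == null || search == null) {
            return false;
        }
        return text.contains(search);
    }

    // Runs the hand-picked cases and returns how many of them failed.
    static int runCases() {
        Object[][] cases = {
            // text, search, expected result
            {"abc", "b", true},
            {"abc", "z", false},
            {"abc", "", true},
            {"", "", true},
            {"", "a", false},
            {null, "a", false},
            {"abc", null, false},
            {null, null, false},
        };
        int failures = 0;
        for (Object[] c : cases) {
            boolean actual = contains((String) c[0], (String) c[1]);
            if (actual != (Boolean) c[2]) {
                failures++;
                System.out.println("FAILED: contains(" + c[0] + ", " + c[1] + ") expected " + c[2]);
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        System.out.println("failures: " + runCases());
    }
}
```

Every case here is visible in the test itself, including the ones that must return false, which is exactly what the random version struggles to express.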

How often will your random test run before the tested code goes to production? If you do continuous delivery, the test will run a few times during your local development, once on your CI server and... that's it. If you're not lucky enough to do continuous delivery, let's assume your commit goes to production in 3 weeks. There will probably soon be a feature freeze and branch stabilization. How many times will this test run? 50 times on the CI server? Random tests are totally useless when they run only a few times. Of course, you may expect those tests to run many more times during local development by the rest of your team, but...

If it fails on someone else's machine, are you sure they will record the test result? Wait! There will be no result! There will only be the information that true was expected but false was returned. So you have to remember to add logging to all your random tests. And even if the inputs are logged, are you sure that the other developer (who has to deliver their own, completely different functionality) will care about an irrelevant, non-deterministic test failure? Because the other option is to simply re-run the tests, see the green light, commit and go home. No one will ever know.
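To show what "adding logs" means in practice, here is a minimal sketch of a random check whose failure message carries the seed and the generated inputs, so the run can be replayed. All names (`SeededRandomCheck`, `randomString`, `checkContains`) are hypothetical, and `String.contains` stands in for the method under test.

```java
import java.util.Random;

class SeededRandomCheck {

    // Builds a lowercase ASCII string of the given length from the given generator.
    static String randomString(Random random, int length) {
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            sb.append((char) ('a' + random.nextInt(26)));
        }
        return sb.toString();
    }

    // Runs one random check. The same seed always produces the same inputs,
    // and a failure reports the seed together with the inputs.
    static void checkContains(long seed) {
        Random random = new Random(seed);
        String prefix = randomString(random, 5);
        String middle = randomString(random, 5);
        String suffix = randomString(random, 5);
        String text = prefix + middle + suffix;
        if (!text.contains(middle)) { // stand-in for the method under test
            throw new AssertionError("seed=" + seed + " text=" + text + " search=" + middle);
        }
    }

    public static void main(String[] args) {
        long seed = System.nanoTime();
        System.out.println("seed=" + seed); // log the seed even when the check passes
        checkContains(seed);
    }
}
```

Even with this in place, someone still has to read the message and act on it, which is the real weak point.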

Let's face it, it can't work this way. If you are not sure whether your test data is good enough, then:
  • Simplify your code. Extract methods/classes, avoid ifs, avoid nulls, be more immutable and functional.
  • Try to analyze the edge cases and include them in your tests.
  • If needed, throw away that part of the code and start again using TDD. If you've never tried it, you will be surprised how different the design can be.
Seriously, those rules will almost always be enough. That's because the sad truth is that the vast majority of development is typical corporate maintenance. It's not rocket science and all the complexity is usually incidental. But the refactoring can be expensive. And if the rules above are not enough:
  • Generate a lot of random data sets, look at them and check whether any of them differ from what you had in mind when designing your code. And, of course, add the new cases to your tests.
  • Use mutation testing.
  • Whenever a bug is discovered during development, UAT or production, add new cases to your tests to avoid regression.
  • Do real random testing. Keep the testing server running 24/7. Every generated input that breaks the tests should be logged and added to your deterministic unit tests.
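
The last point can be sketched as a loop that keeps generating inputs and records every one that breaks the property. This is only an illustration under my own assumptions: a bounded iteration count stands in for a server running 24/7, the names (`RandomSoak`, `soak`) are hypothetical, and the deliberately broken implementation in `main` exists only to show what gets recorded.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.function.BiPredicate;

class RandomSoak {

    // Builds a lowercase ASCII string with a random length between 0 and maxLength.
    static String randomString(Random random, int maxLength) {
        int length = random.nextInt(maxLength + 1);
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            sb.append((char) ('a' + random.nextInt(26)));
        }
        return sb.toString();
    }

    // Checks the property "a text built around the search string contains it"
    // for many generated inputs and returns a description of every failing input.
    static List<String> soak(BiPredicate<String, String> contains, long seed, int iterations) {
        Random random = new Random(seed);
        List<String> failures = new ArrayList<>();
        for (int i = 0; i < iterations; i++) {
            String search = randomString(random, 3);
            String text = randomString(random, 3) + search + randomString(random, 3);
            if (!contains.test(text, search)) {
                failures.add("contains(\"" + text + "\", \"" + search + "\") returned false");
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        // Deliberately broken implementation: it mishandles the empty search string.
        BiPredicate<String, String> buggy =
                (text, search) -> !search.isEmpty() && text.contains(search);
        for (String failure : soak(buggy, 1L, 1000)) {
            System.out.println(failure); // each line is a candidate deterministic test case
        }
    }
}
```

Each recorded line then becomes a fixed case in the ordinary unit tests, so the random run is a way of discovering cases rather than a substitute for them.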
