How to Test Your Unit Tests: Mutation Testing

2020, Aug 24

Let’s be honest - the idea of doing Test-Driven Development only sounds nice on paper, or if you have an infinite amount of time to actually do it. In the real world though, project managers, product managers, customers, the CTO, CEO, and everyone else at the company just want to see the final product. They couldn’t care less about tests. Yes, eventually they will realize how important it is to have them, especially if during the retrospective meeting they’re told that more time for unit or integration tests would have saved us from the last production fire. But the cycle keeps repeating because the “firefighters” arrived on time, nobody died (hopefully), and there are still a gazillion features that need to be developed. So a lot of the time we have to balance the amount of code we write against the number of tests that cover it.

At least in our team, depending on what we work on and how important we think good code coverage is, we take one of four approaches:

  • Do full TDD from the beginning (rarely happens).
  • Skip tests altogether (more like it).
  • Write tests after the fact (if we have time).
  • Make a ticket to write tests later (let’s be realistic, this ticket will likely never get done).

I love unit testing. As a former QA, it was one of the first things I learned about when learning how to program. I gave presentations on it, helped other people write better tests when I joined their projects, and I’ve always been on the lookout for better tooling that will make writing and maintaining them easier. Moq and NBuilder have been two of my favorite tools, but I’ve even gone as far as using HttpMock for testing one of our legacy systems, until eventually we completely rewrote it and the need for this tool faded. And of course, it was really exciting to watch the number of tests grow from 0 to several hundred, the code coverage go from 0% to the upper 60’s or even higher, and, most importantly, not have any bug reports that could be proven to be caused by the code we wrote over the past several months. The problem arose later, when no one expected it: when we needed to add a new integration into our system, my friend and co-worker discovered that some of the tests were plainly wrong. He later wrote his own blog post on the subject to talk about the mistakes we made while writing tests: https://www.alijahgreen.com/blog/mistakes-ive-made-unit-testing.

What validates the validator?

It appears that while we do write, or at least think about writing, tests to validate whether the production code is correct, we don’t validate the validators. We could write tests for tests, but then we’d put ourselves into a never-ending loop. We would always have to write more tests to make sure that the tests that have already been written are valid and are testing the right things. So how do we fix this problem?

If you’re into TDD, the most obvious solution that comes to mind is to do the “Red-Green-Refactor” approach, where we make sure that each test has failed before we write code that will make the test pass and call it “good”. The problem we run into though is that turning from “red” to “green” isn’t always enough. Maybe it was failing for a completely different reason from what we originally thought. Or maybe the cycle is OK for this particular condition, but did we make sure to cover all of the possible errors? For example, did we test all the boundaries? There’s nothing that tests whether our tests cover those cases or if they’re even correct in the first place.
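Here is a small sketch of how a test can go from “red” to “green” and still miss a boundary. Everything in it (AgeRules, IsAdult, IsAdultOffByOne, the two test methods) is made up for illustration and is not taken from our project; the tests are plain methods returning bool rather than tests written with a real test framework.

```csharp
using System;

public static class AgeRules
{
    // Hypothetical rule: 18 and older counts as an adult.
    public static bool IsAdult(int age) { return age >= 18; }

    // The same rule with an off-by-one mistake at the boundary.
    public static bool IsAdultOffByOne(int age) { return age > 18; }
}

public static class BoundaryDemo
{
    // This test would be "red" before IsAdult exists and "green" afterwards,
    // but it only checks values far away from the boundary.
    public static bool TestWithoutBoundary(Func<int, bool> isAdult)
    {
        return isAdult(30) && !isAdult(5);
    }

    // Checking the boundary values themselves exposes the off-by-one version.
    public static bool TestWithBoundary(Func<int, bool> isAdult)
    {
        return isAdult(30) && !isAdult(5) && isAdult(18) && !isAdult(17);
    }

    public static void Main()
    {
        Console.WriteLine(TestWithoutBoundary(AgeRules.IsAdult));         // True
        Console.WriteLine(TestWithoutBoundary(AgeRules.IsAdultOffByOne)); // True
        Console.WriteLine(TestWithBoundary(AgeRules.IsAdult));            // True
        Console.WriteLine(TestWithBoundary(AgeRules.IsAdultOffByOne));    // False
    }
}
```

The first test is “green” for both versions of the rule, so it cannot tell us which one we actually shipped.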

Mutation Testing to the Rescue of Your Tests

As I was browsing Pluralsight last weekend to see if there was anything interesting I could listen to while cooking, I found this talk from the CodeMash conference: “Mutation Testing to the Rescue of Your Tests” by Nicolas Fränkel. Surprisingly, I had never heard of this kind of testing until I listened to this talk. It did intrigue me though, and I decided to learn more about it.

What exactly is mutation testing? In simple words, it’s a program or a tool that hooks up to your source code and replaces conditions, operators, string values, and other pieces of your code, one small change at a time. Each changed copy of the code is called a “mutant”. For example, if we had the method public int Sum(int x, int y) { return x+y; }, it would replace the “+” operator with the “-” operator and check whether the tests that cover this method still pass. If they do, that means that the “mutant” has survived - not really a good thing. We want those bastards dead and our tests “red”. That would mean that our tests passed the exam.
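To make the Sum example concrete, here is a minimal hand-rolled sketch of what a mutation tool does for you automatically. The names Calculator, SumMutant, WeakTest and StrongTest are hypothetical and exist only for this illustration; a real tool generates the mutant itself and runs your actual test suite against it.

```csharp
using System;

public static class Calculator
{
    // The production method from the example.
    public static int Sum(int x, int y) { return x + y; }

    // A hand-written "mutant": the same method with "+" replaced by "-".
    public static int SumMutant(int x, int y) { return x - y; }
}

public static class Demo
{
    // A weak test: with y == 0, x + y and x - y give the same result,
    // so this check cannot tell the original from the mutant.
    public static bool WeakTest(Func<int, int, int> sum)
    {
        return sum(5, 0) == 5;
    }

    // A stronger test: a non-zero y makes the two operators disagree.
    public static bool StrongTest(Func<int, int, int> sum)
    {
        return sum(2, 3) == 5;
    }

    public static void Main()
    {
        // Both tests pass against the original code.
        Console.WriteLine(WeakTest(Calculator.Sum));         // True
        Console.WriteLine(StrongTest(Calculator.Sum));       // True

        // The weak test still passes against the mutant: the mutant survived.
        Console.WriteLine(WeakTest(Calculator.SumMutant));   // True

        // The strong test fails against the mutant: the mutant was killed.
        Console.WriteLine(StrongTest(Calculator.SumMutant)); // False
    }
}
```

Both tests give Sum 100% line coverage, yet only the second one actually notices when the operator changes.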

Since the presenter was doing demos in Java, I decided to give it a try in C# and run it against the project where we discovered that some of the tests were incorrect. I found only a couple of “mutator” tools available for .NET. I went with the Stryker-mutator from https://github.com/stryker-mutator/stryker-net, which gets installed globally as a dotnet tool.

It did take a couple of minutes to figure out the errors I got when I first tried to install it. It appears that in addition to running the install command mentioned in the “Readme”, I also needed to specify the version of the package I wanted to install. So instead of running “dotnet tool install -g dotnet-stryker” I had to run “dotnet tool install -g dotnet-stryker --version 0.18.0” because apparently it’s still in beta. After that, I had to navigate via the command line to the tests project I wanted to run the mutator against, run “dotnet-stryker” and… it failed again. This time it was because my tests project contained references to more than one project and it wanted me to choose which one I wanted to mutate. Since the third time is the charm, after finally running “dotnet-stryker -project-file=myProjectName” it gave me a bunch of warnings about some mutants not being able to compile, but finally it actually started doing some work. It estimated that this process would take about five minutes. After seven minutes, the results were produced and from looking at the numbers, I already had a bad feeling. Except I didn’t know whether that was a feeling about our tests, our code, or the tool.

[Screenshot: Stryker-mutator cmd logs]

Out of 333 mutants, we managed to kill only 70. The other 263 mutants survived. And the final mutation score? A soul-crushing 21.02%. Is it time to call Bruce Willis to save us from the mutant-apocalypse or is it too late?

I thought to myself that it can’t be right! It is one of the projects we were most proud of for the amount of code coverage it got! There must be a report that it produced that I want to look at! OK, here it is (on the left there were file names but I cut those out since it wasn’t my personal code I ran it against).

[Screenshot: Stryker-mutator report]

OK, from a few mutations I’ve looked at, it did a string replacement on endpoints, logs, and other things for which we didn’t even write tests. So that makes me feel a little better. Let’s see if there’s a way to exclude such mutations though.

Looks like we have good and bad news. Yes, we can exclude certain mutations, but it won’t improve our score. Well, OK. Let’s run it again still with the “string” mutation excluded. And wait another 5-7 minutes.

It appears that it automatically discarded the mutants that no longer applied, so now it’s evaluating 220 mutants instead of 333. After about five minutes of running, we got the results again. This time 69 mutants were killed and 151 survived. And even though the documentation said our mutation score wouldn’t improve, it went up to 31.36% (still less than desirable). Looks like this time we may have some real conditions we didn’t test for. One cool feature of the produced report is that I can click on the condition that it replaced, and it will show me the exact line it replaced it with. So that will make it easier for us to write more valid tests.
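As a sanity check, both scores can be reproduced from the killed and survived counts in the two runs. The formula below, killed divided by killed plus survived, is an assumption inferred from those numbers, not something taken from the Stryker documentation, and MutationScoreDemo is a made-up helper for this post.

```csharp
using System;
using System.Globalization;

public static class MutationScoreDemo
{
    // Assumed formula: killed / (killed + survived), as a percentage.
    public static double Score(int killed, int survived)
    {
        return 100.0 * killed / (killed + survived);
    }

    // Formats the score with two decimals, independent of the machine's locale.
    public static string Format(double score)
    {
        return score.ToString("F2", CultureInfo.InvariantCulture) + "%";
    }

    public static void Main()
    {
        Console.WriteLine(Format(Score(70, 263))); // first run:  21.02%
        Console.WriteLine(Format(Score(69, 151))); // second run: 31.36%
    }
}
```

Notice that the number of killed mutants barely changed between the runs (70 versus 69). The score went up mostly because the denominator shrank once the string mutants were excluded.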

One bug I noticed this tool has is that it doesn’t care whether the “+” sign is used as a mathematical operation or as a way to join two strings together. One of the errors I got is a “CompileError” where it tried to do this weird thing and replace “+” with “-” in string concatenation:

[Screenshot: Stryker-net “plus” operator replacement bug]
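To show why such a mutant cannot compile, here is a minimal sketch. BuildUrl and its arguments are hypothetical and not from the project I ran the tool against; the mutated line is left as a comment so that the example itself still builds.

```csharp
using System;

public static class ConcatDemo
{
    // Hypothetical helper that joins two strings with "+".
    public static string BuildUrl(string baseUrl, string path)
    {
        // What an arithmetic mutation of the return statement would look like.
        // It does not compile, because C# defines no "-" operator for
        // two string operands:
        // return baseUrl - path;

        return baseUrl + path;
    }

    public static void Main()
    {
        Console.WriteLine(BuildUrl("https://example.com", "/api")); // https://example.com/api
    }
}
```

A mutant like this can never be killed or survive, since no test ever gets to run against it.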

Obviously, it may not work the way we expect out of the box. I didn’t really change any configuration except excluding the string literal mutations in the run above. Hopefully, with a bit more exploration and configuration, I can add this tool to my arsenal and never ship buggy code again (one can only wish).

Later I ran it with a configuration file against a .NET Framework project and it failed to compile. First, because “nuget.exe” wasn’t in my PATH, and it’s required for .NET Framework projects. Then it needed me to specify the project name. Then the solution name. Eventually, it tried to run… but it failed again. This time it could not compile one of the project dependencies. I was a bit disappointed, but it’s probably my fault for choosing a beta version of the tool. Nevertheless, it was a good experience and it can definitely help us start some more conversations about unit testing in general. I do see great potential for such a tool, especially after our fiasco, and I hope it matures enough that we’ll be able to use it on legacy projects. But at least I was able to run it without too much trouble in .NET Core.

Now, let’s talk about the potential advantages and disadvantages of using mutation testing.

Advantages:

  • You obtain a tool that automatically tests your tests
  • You can be more confident that your application doesn’t have bugs in it
  • Customers are happier since there are fewer bugs in the system

Disadvantages:

  • It won’t find all of your bugs. It only tells you something about the code that is covered by tests. So if you didn’t write tests in the first place, this tool will be useless.
  • It is very time-consuming. Because it has to go through the whole codebase to replace different conditions and values, on a large codebase you may be waiting for several minutes before the results are produced.
  • It won’t help you do black-box testing of the system since the tool makes changes in the source code itself.
  • What will test the tool to make sure it doesn’t produce false positives? This cycle will never end.

So will I start doing “Mutation Testing” from now on? The answer is, as with most things in software development, “it depends”. As I said earlier, if we don’t write any unit tests on any of our projects, it’s not going to magically write them for us. And depending on how often we’d want to use this method, it may take a while for the reports to be generated, so we won’t know immediately what we have to change. I am definitely going to explore this idea further though and see how much benefit it will provide to our team, especially once I figure out how to set up all the right configurations. But, of course, where possible, I’d still prefer TDD. Immediate feedback and 100% code coverage sound wonderful, especially when you like tracking such metrics in order to become a better developer. The only problem is, 100% isn’t always 100%, and I hope that Mutation Testing will help us make sure we’re not lying to ourselves.

Now it’s your turn. How do you test your code? Or do you?
