Should you write tests for your code? I hope so, or billions of dollars are going down the drain. But fortunately, that hope isn’t misplaced. The evidence, collected from LGTM.com’s analysis of over 50k open source projects [1], is clear:
- Writing tests reduces the number of alerts in your codebase.
- Writing more tests reduces the number of alerts even further.
However, any change to a codebase affects it in a multitude of ways. The same action could, for example, improve efficiency, but decrease readability. So it's too simplistic to say that writing tests and acting on them just improves code. More accurately, it transforms a codebase in different dimensions.
For the majority of dimensions, this transformation is desirable, but as we will see, not for all.
The large scale analysis of open source projects, made possible by LGTM, allows us, perhaps for the first time, to identify exactly how test code can actually harm the quality of tested code. And we'll discuss how to avoid the bad effects, and enjoy the benefits of test code without the drawbacks.
LGTM data and how it's processed
For each of these projects, LGTM runs a battery of sophisticated QL queries. These queries have been designed by coding experts in each language to detect potential issues with the codebase. They embody a huge amount of domain knowledge about what can affect your code's correctness, efficiency, information security, maintainability, and so on. These categories of problems are attached to the queries as tags. Each language provides around 20 different tags.
Each tag corresponds to a small army of experts hunting down problems in your code. I want to know how testing affects different aspects of code quality. Tags are a great way to capture different aspects of code quality.
Some tags are quite rare, so it's hard to draw firm conclusions about them. Below, I'll report only the tags with statistically significant results. Since I report only on a selection, my statistical tests are appropriately stricter to avoid reporting bias [2].
Some alerts point to straightforward and obvious defects of code (for example, immediate crashes) whereas others point to complex and subtle aspects. Chances are you'll catch the simple problems with your first few tests. But you'll need to write lots of tests to catch the more subtle problems.
So let's ask two different questions:
1. What's the impact of having any tests at all?
I'm going to compare how likely a project with tests is to trigger an alert with how likely a project without tests is to trigger one.
This depends on the size of the project, so I'll ask that question separately for the smallest [3] 10% of projects, and for the projects in the 10% - 20% size group, and so on. Past experience has shown that the first and last of those groups are very heterogeneous [4]. So when I compute the actual numbers, I'll ignore those two extremes.
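The size grouping described above could be sketched roughly as follows. This is only an illustration: the `loc` field and both function names are hypothetical, and the post does not say how LGTM actually implements the split.

```python
def size_groups(projects, n_groups=10):
    """Split projects into n_groups equal-count groups, ordered by size.

    Each project is assumed (hypothetically) to be a dict with a "loc" key
    holding its lines-of-code count.
    """
    ordered = sorted(projects, key=lambda p: p["loc"])
    n = len(ordered)
    # Group g holds the slice between the g-th and (g+1)-th boundary.
    return [
        ordered[g * n // n_groups:(g + 1) * n // n_groups]
        for g in range(n_groups)
    ]


def middle_groups(groups):
    """Drop the smallest and the largest size group, keeping the rest."""
    return groups[1:-1]
```

With ten groups, `middle_groups` leaves the eight groups between the 10% and 90% size marks.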
The main measurement I'll use is the following: Take two random projects with a similar amount of production code such that:
- One has tests, the other one doesn't.
- One has triggered an alert with a certain tag, the other one hasn't.
What’s the chance that the project with tests is free of this alert? [5] If it's 100%, that means that projects with tests manage to completely avoid alerts with that tag: testing is good. If it's 0%, that means that only projects with tests run the risk of such alerts: testing is bad. If it's 50%, tests don't appear to matter.
2. What's the impact of having more tests?
Here, I only consider projects that do have tests. Do more tests reduce alerts, or increase them?
Let's get rid of size as a confounding variable by using not the raw amount of tests but the test rank of a project, as defined here. 0% means very few tests compared to projects of a similar size, 100% means a lot [6].
The main measurement I use is the following: Take two random projects such that one has triggered an alert with a certain tag and the other one hasn’t. What’s the chance that the project with the better test rank is free of this alert? [7] If the chance is high then more tests mean fewer alerts with that tag. If it's low, it's the other way around. If it's 50%, tests don't appear to matter.
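A minimal sketch of this second measurement, assuming the test ranks of the alert-free projects and of the alerting projects are available as two lists of numbers. Counting a tie as half a win is my assumption; the post does not say how ties are handled. The resulting value equals the Mann-Whitney U statistic divided by the number of pairs.

```python
def better_rank_clean_chance(clean_ranks, alert_ranks):
    """Chance that, in a (clean, alerting) pair, the clean project ranks higher.

    clean_ranks: test ranks of projects without the alert.
    alert_ranks: test ranks of projects with the alert.
    """
    pairs = len(clean_ranks) * len(alert_ranks)
    if pairs == 0:
        return None  # nothing to compare
    wins = 0.0
    for clean in clean_ranks:
        for alert in alert_ranks:
            if clean > alert:
                wins += 1.0
            elif clean == alert:
                wins += 0.5  # assumption: a tie counts as half a win
    return wins / pairs
```

The double loop is quadratic in the number of projects, which is fine for illustration; a rank-sum formulation would scale better on large data sets.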
Bars to the right indicate alerts that are rarer for projects that use tests. So we can see that testing seems to benefit the code's reliability, portability and security.
Bars to the left indicate alerts that actually appear more often for projects that use tests. Testing appears to hurt the quality of the code's documentation and its efficiency.
What about more tests? In this chart we only compare projects with tests:
More test code seems to really help with security and reliability, whereas code efficiency and documentation seem to suffer from increased testing.
The two charts fit together nicely:
There is considerable overlap between the two. And there’s no tag where they disagree about whether tests are beneficial or detrimental.
Tests positively impact:
Tests negatively impact:
- Documentation seems to be the clear loser from JS tests.
- Tests also seem to negatively impact efficiency.
It makes sense that these two aspects of the code would be negatively impacted by testing:
- Documentation alerts check whether comments align with function signatures. Testing will never directly make that better, but it can easily make it worse. If you respond to failing tests by changing the parameters in your functions, but forget to update the comments, you'll generate documentation alerts. [8]
- Bodging it can be a quick way to fix the test. But it can easily introduce inefficient anti-patterns such as those detected by the efficiency alerts.
Do we find similar differential impact of test code in other languages?
Impact of Python tests
The presence of test code seems to mainly improve code. The exception is modularity.
Looking at the individual modularity alerts, it makes sense that many of these could occur when a developer tries to quickly fix some tests. For example, your test might show that an attribute value set by a superclass initializer isn’t right for a particular subclass. You preserve modularity by rewriting the superclass initializer to take a parameter for this value. But it’s the subclass that fails the test, so it’s tempting to just have the subclass initializer overwrite the attribute instead. This can violate the superclass’s invariants, hence it impacts modularity.
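To make the scenario concrete, here is a small made-up example; the class and attribute names are hypothetical and not taken from any analyzed project. The superclass derives a second attribute from the first, so overwriting the first one in the subclass leaves the two inconsistent.

```python
class Shape:
    """Superclass after the principled fix: the value is a parameter."""

    def __init__(self, sides=4):
        self.sides = sides
        # Invariant: angle_sum is always derived from the number of sides.
        self.angle_sum = (sides - 2) * 180


class QuickFixTriangle(Shape):
    """The tempting fix: overwrite the attribute after the superclass ran."""

    def __init__(self):
        super().__init__()
        # A test that checks `sides` now passes, but angle_sum is still
        # the value computed for four sides.
        self.sides = 3


class Triangle(Shape):
    """The modular fix: pass the value to the superclass initializer."""

    def __init__(self):
        super().__init__(sides=3)
```

A test that only checks `sides` passes for both subclasses, yet `QuickFixTriangle().angle_sum` is 360 while `Triangle().angle_sum` is 180.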
Do more tests make a difference?
No statistically significant negative impact of increasing the number of tests could be detected. On the other hand, several categories appear to improve.
Again, both charts fit together:
There is considerable overlap between the two tables. And there's no tag where they disagree about whether tests are beneficial or detrimental.
Impact of Java tests
In the chart above, only one result achieved a significance level of 0.1%. This is a consequence of most Java projects using tests, which makes it hard to discern the effect of a complete lack of testing. This doesn’t affect the comparison between little test code and a lot of test code. Four of the five results in the table above are also confirmed in the connection of alerts to the quantity of test code:
Naming alerts seem to increase with more testing. While there are a number of different naming alerts, almost two thirds of all instances are actually due to confusing method overloading. It's very plausible that those overloads are often afterthoughts introduced hastily to fix tests. A proper refactoring would include renaming the overloaded methods, but tests don't enforce that, so it often doesn't happen.
Strengths and weaknesses of tests
Let's put this all together, and try to draw some overall conclusions. The following comparison plot presents an “impact profile” for each language.
This chart can be used to quickly compare the three languages. For example:
- Modifying code to make a test pass seems to easily mess up a program's modularity.
From a more general perspective, there’s one underlying and, I believe, very relevant insight to take home:
The impact of testing is not uniform. And it’s not even a net benefit for each aspect of code quality.
The most obvious aspect of that is that test code normally checks the behavior of code, but doesn't check the structure of code. Good code doesn't just behave well, it also reads well -- so your future self, or other people, can maintain it. Test code completely ignores this aspect of code quality, and therefore it's not too surprising it can harm it.
But in fact, even for those categories of alerts where testing helps on average, it is likely that some problems will improve when developers rewrite their code to fix tests, while others get introduced. The net effect may be positive, but almost certainly not in every single instance. I think there are three main reasons why having tests also has drawbacks:
- Having to fix tests makes the developer rework their code. Everything you fiddle with has a certain chance of being dented or broken in the process. Tests themselves guard against some defects being introduced. But tests only check what they've been told to check, so some of the introduced defects stay.
- Having to fix tests places a constraint on the developer. Often, the principled way to obey that constraint would require organized refactoring, but the easy way to make the test pass is introducing a quick hack. Tests only check what they've been told to check, so the developer gets away with it.
- Having to fix tests can focus the developer too strictly on fixing those tests. They might pay less attention to aspects that are not codified in tests. Tests only check what they've been told to check, so the aspects the developer doesn't focus on are left to deteriorate.
That's not to say that tests aren't essential. They are. But the empirical data clearly indicates that, in certain respects, tests actively harm the quality of a codebase. Surely we can do better? We can -- on condition we understand and control their side effects:
How to have it all
In each of the languages mentioned earlier, testing makes your code better. That's the main reason why people do it. But it's a two-steps-forward-one-step-back situation:
- A test flags up a problem.
- A developer quickly plugs that hole.
- Often, they open up a new one while doing so.
For the majority of categories, especially those which are easily testable, the net effect is still positive. But I feel that accomplished developers should set their sights on turning that net positive into a pure positive.
I used LGTM alerts to measure quality, but you can use them to ensure it. LGTM offers automated code review as a free service. If you turn it on, each time someone bodges to make the tests pass, they will be told about the new alerts they were about to introduce. The alerts can then be fixed before the code is even merged into your codebase.
Paying attention to alerts helps you get the most from your tests. You reap the benefits while guarding against the side effects.
Note: Post originally published on LGTM.com on February 28, 2018

[1] This post was originally published on February 28, 2018. As of now (August 15, 2019) over 135,000 open source projects are analyzed daily on LGTM.com.
[2] If you run 20 significance tests at a significance level of 5%, on average one of them is expected to pass simply by chance. If you run 20 such tests and only report the one that passes, you're quite likely to find something, but it's completely unclear whether it's a real effect. See for example here for a fun illustration. But there's a solution: make your significance test appropriately stricter. If you run your 20 significance tests at a level of 5%/20 = 0.25%, you will not get more than 1 false positive per 20 runs on average. So if you only want to report the one that passed, report it as merely 5% significant, while individually it passed the stricter test at 0.25%. It's a bit more complicated if there's more than one test that passes though. I used the Holm-Bonferroni method to make sure that the chance of even one false positive per table is definitely less than the reported significance level.
[3] According to the LGTM computed "lines of code" metric. LGTM only counts files in the corresponding programming language, and within those files, it strips non-code like whitespace or comments.
[4] That's a problem because size acts as a confounding variable, being connected both to testing and to the chance for an alert. See this post for more information about how size is distributed.
[5] Strictly speaking, we're answering that question 10 times, once for each size range. Since each size range contains the same number of projects, I think it's reasonable to take the average as aggregate statistic. I need to run statistical significance tests to see whether the answer is trustworthy. For each single group I use a Chi-squared test to verify it (except when the expected cell size gets too small: then I switch to Fisher's exact test). This gives me a p-value for each of the size categories. I'll combine the p-values of all size categories with the Stouffer method.
[7] I use a Mann-Whitney-Wilcoxon statistical test to check whether I can trust the answer.
[8] Note that LGTM doesn't show the documentation alerts by default, since not everyone is interested in them. You need to turn them on by hand if you want to see them.
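The two p-value steps mentioned in the notes, combining per-size-group p-values with the Stouffer method and then correcting across tags with Holm-Bonferroni, could look roughly like the sketch below. It uses only the Python standard library and the unweighted form of Stouffer's method. Both choices are my assumptions, as are the function names; the post does not show the code actually used.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def stouffer_combine(p_values, eps=1e-12):
    """Combine a non-empty list of p-values with the unweighted Stouffer method."""
    # Clamp so that inv_cdf never receives exactly 0 or 1.
    clamped = [min(max(p, eps), 1 - eps) for p in p_values]
    z_scores = [_STD_NORMAL.inv_cdf(1 - p) for p in clamped]
    combined_z = sum(z_scores) / sqrt(len(z_scores))
    return 1 - _STD_NORMAL.cdf(combined_z)


def holm_bonferroni(p_values, alpha=0.05):
    """Flag which p-values stay significant under Holm's step-down correction."""
    m = len(p_values)
    # Visit the p-values from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    significant = [False] * m
    for step, index in enumerate(order):
        # The threshold loosens from alpha/m up to alpha as hypotheses are rejected.
        if p_values[index] > alpha / (m - step):
            break  # this and all larger p-values are not significant
        significant[index] = True
    return significant
```

With 20 tests and `alpha=0.05`, the smallest p-value has to clear 0.05/20 = 0.0025, which is the 0.25% level mentioned in note [2].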