Unit Testing: Does more test code lead to fewer code alerts? (LGTM study)

August 08, 2019

In a previous blog post, I showed that projects containing test code have fewer alerts than projects that don't. The evidence suggests that both LGTM's code analysis and writing test code help find a shared set of software defects. So if you care about writing test code, you should use LGTM to check your code for alerts.

But my previous post did not go into detail about the quantity of test code within projects. Some projects include one simple test case, while others have so many tests that the quantity of test code vastly outstrips production code.

Does the quantity of test code make a difference? Do projects with more test code have fewer alerts?

LGTM data

Like last time, I'll analyze the data we've collected on LGTM. This time, however, I'm only going to include projects that use at least some tests1. This gives me 3,521 Java projects, 6,362 Python projects, and 21,250 JavaScript projects. At the time of writing, we're testing our C and C++ analysis on a small set of projects. Projects written in these languages are not included in this analysis.8

LGTM carefully tracks lines of production and test code, so I can measure the relative quantity of test to production code in a project. It also identifies downloaded libraries and generated code; I've excluded these files from this analysis.
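
To make that measurement concrete, here's a minimal sketch of computing a project's test-to-production ratio from per-file line counts. The path heuristic, function names, and example files are my own invention for illustration; LGTM's actual classification of test files, libraries, and generated code is more sophisticated and isn't shown here.

    def looks_like_test(path):
        # Toy heuristic (not LGTM's logic): anything under a test/tests directory,
        # or whose file name starts with "test_", counts as test code.
        parts = path.lower().split("/")
        return any(p in ("test", "tests") for p in parts[:-1]) or parts[-1].startswith("test_")

    def test_to_production_ratio(loc_by_file):
        # `loc_by_file` maps a relative path to its lines of code, with vendored
        # libraries and generated code assumed to be excluded already.
        test_loc = sum(n for p, n in loc_by_file.items() if looks_like_test(p))
        prod_loc = sum(n for p, n in loc_by_file.items() if not looks_like_test(p))
        return test_loc / prod_loc if prod_loc else float("inf")

    # Hypothetical example: 1,000 lines of production code and 500 lines of tests.
    example = {"src/app.py": 800, "src/util.py": 200, "tests/test_app.py": 500}
    print(test_to_production_ratio(example))  # 0.5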

More tests means better code

The quantity of alerts and test code in projects increases significantly with project size, and the relationship is not linear. To avoid the confounding influence of size, I've assigned two ranks to each project, defined as follows (a rough sketch of the computation appears after the list):

  • The test rank of a project is the percentage of similarly-sized2 projects that have fewer tests. 0% means fewer tests than everyone else (oh dear!), 100% means more than everyone else (yay!).
  • The alert rank of a project is the percentage of similarly-sized projects that have more alerts. 0% means more alerts than everyone else (oh dear!), 100% means fewer than everyone else (yay!).
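
Here's a rough sketch, in Python, of how such ranks could be computed under my reading of footnote 2 (each project is compared to the 200 closest smaller and 200 closest bigger projects). The function and variable names are hypothetical; this is not the original analysis code.

    import numpy as np

    def window_ranks(size, tests, alerts, half_window=200):
        # Sort projects by size so that neighbouring indices are similarly sized.
        order = np.argsort(size)
        t = np.asarray(tests, dtype=float)[order]
        a = np.asarray(alerts, dtype=float)[order]
        n = len(order)
        test_rank, alert_rank = np.empty(n), np.empty(n)
        for i in range(n):
            lo, hi = max(0, i - half_window), min(n, i + half_window + 1)
            nb = np.r_[lo:i, i + 1:hi]  # indices of the similarly-sized neighbours
            test_rank[i] = 100.0 * np.mean(t[nb] < t[i])   # % of neighbours with fewer tests
            alert_rank[i] = 100.0 * np.mean(a[nb] > a[i])  # % of neighbours with more alerts
        # Restore the original project order.
        out_t, out_a = np.empty(n), np.empty(n)
        out_t[order], out_a[order] = test_rank, alert_rank
        return out_t, out_a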

So what does the data tell us? The high-level result is clear:

If you test more than your peers, you'll very likely have fewer alerts than them.

For each of the three languages under consideration, this passes a statistical significance test at level 0.1%. The effect size is strongest for Java (Spearman correlation coefficient ρ = 0.197, where 0 means no connection at all and 1 means perfect correspondence), followed by Python (ρ = 0.123) and then JavaScript (ρ = 0.078).
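
For reference, the statistic itself is easy to reproduce with SciPy. The rank values below are made-up placeholders, not LGTM data; the post's "significant at level 0.1%" corresponds to p < 0.001.

    from scipy import stats

    # Hypothetical per-project ranks in [0, 100]; the real values come from LGTM.
    test_rank = [12.0, 35.5, 50.0, 63.0, 71.5, 88.0, 93.5]
    alert_rank = [20.0, 30.0, 55.0, 48.0, 60.0, 80.0, 95.0]

    rho, p_value = stats.spearmanr(test_rank, alert_rank)
    print(f"Spearman rho = {rho:.3f}, p-value = {p_value:.4f}")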

This relationship extends over all of the data and is fairly linear. In the chart below, we see a clear positive correlation between more tests and fewer alerts for all languages:

[Figure: alert rank plotted against test rank, showing a positive correlation for Java, Python, and JavaScript]

Note that projects with a lot of JavaScript tests compared to production code are outliers in the above plot. Spot-checking these projects reveals what's going on. These are often projects in which tests are written in JavaScript, but the production code is predominantly written in a different language, for instance TypeScript or CoffeeScript. We are currently working on support for these languages; LGTM does not analyze these files at the moment3. As a result, the relative amount of test code (written in JavaScript) appears to be extremely large, because the production code is not analyzed by LGTM. Their high test rank is therefore misleading, and this is reflected in their more modest alert rank.

Overall, however, the evidence is striking: more tests means fewer alerts.

Importance of tests increases with project size

Large projects have more elements that all need to fit together. So even small differences between a component's behavior and its designer's expectations can be amplified to affect the whole system. Thus, we might expect that good testing is even more important for large projects. And so it is:

[Figure: local correlation between test rank and alert rank as a function of project size, per language]

In the graph above, the local correlation has been obtained over a small window of size 400, so it's rather noisy. But the blue interpolation line shows a clear trend: the larger a project, the stronger the correlation between test rank and alert rank4.

For small projects, there isn't a very big difference between the average project with meager testing and the average project with extensive testing. For larger projects the correlation is stronger. On average, large projects with lots of tests have a substantially better alert rank than large projects with few tests. Testing plays a bigger role for larger projects.
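
A minimal sketch of the local-correlation computation behind this plot, as I understand it: order projects by size and compute the Spearman correlation within each sliding window of 400 projects. The function below is my reconstruction, not the original analysis code.

    import numpy as np
    from scipy import stats

    def local_correlation(size, test_rank, alert_rank, window=400):
        order = np.argsort(size)
        s = np.asarray(size, dtype=float)[order]
        t = np.asarray(test_rank, dtype=float)[order]
        a = np.asarray(alert_rank, dtype=float)[order]
        centres, rhos = [], []
        for start in range(len(s) - window + 1):
            sl = slice(start, start + window)
            rho, _ = stats.spearmanr(t[sl], a[sl])
            centres.append(np.median(s[sl]))  # x-axis: typical project size in the window
            rhos.append(rho)                  # y-axis: local correlation (noisy for small windows)
        return centres, rhos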

Note that the average Java project is generally much bigger than the average JavaScript project. This partially explains why the correlation between tests and alerts is stronger for Java projects.

Where does the relationship between tests and alerts come from?

We've seen that projects with more tests have fewer alerts. But why? I think it's likely that the chain of cause and effect works as follows:

The great code hypothesis

  1. Tests identify defects that can cause runtime bugs.
  2. Many alerts also identify these defects.
  3. More test code means more defects are detected.
  4. Fixing those defects reduces the number of alerts.

However, you famously can't prove causation from correlation alone. For example, the following is also possible:

The great developer hypothesis

  1. Good developers instinctively write code that triggers fewer alerts.
  2. Unrelatedly, good developers also happen to write more test code.

There are a few minor variations on this alternative. Instead of good developers, it could also be developers who have more time to spend on their code, or who work in a team with high shared ambitions or strict coding standards, or who have a very thorough technical lead, etc. In terms of our analysis, they're all pretty much isomorphic. So I'll stick with the first one.


Both of these explanations could contribute to the overall effect. Can we use LGTM's data to distinguish between them?

Let's assume the great developer hypothesis is true and the great code hypothesis isn’t. Great developers write great code and also happen to write more tests. Those tests are part of their code, so they’re probably also great5. Therefore, we should find fewer alerts in their test code.

By default, LGTM only displays alerts for production code, but behind the scenes it of course also finds alerts in test code. So, if the great developer hypothesis is true, then we should see that more tests correlate to fewer alerts in test code just as strongly6 as to fewer alerts in production code. But they don't:

[Figure: cumulative mean alert rank by test rank, shown separately for alerts in production code and in test code]

For each test rank of x%, the above plot displays the average alert rank of all projects with a test rank of at most x% (the cumulative mean). It shows that, on average, more tests do imply slightly cleaner test code7. But the line corresponding to production code is much steeper.
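
The cumulative mean itself is straightforward to compute; here's a sketch, applied once with production-code alert ranks and once with test-code alert ranks to obtain the two lines. As before, the function is illustrative rather than the original code.

    import numpy as np

    def cumulative_mean_curve(test_rank, alert_rank):
        # At each test rank x (in sorted order), report the mean alert rank of
        # all projects whose test rank is at most x.
        order = np.argsort(test_rank)
        x = np.asarray(test_rank, dtype=float)[order]
        y = np.asarray(alert_rank, dtype=float)[order]
        cum_mean = np.cumsum(y) / np.arange(1, len(y) + 1)
        return x, cum_mean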

So the connection to production code is clearly stronger, contrary to what the great developer hypothesis would lead us to expect. But the great code hypothesis explains this easily: tests look for defects in production code, not test code. We would therefore expect a stronger relationship between more tests and fewer alerts in production code than in test code.

While this does not constitute formal proof, the finding strongly suggests that tests have an actual impact on the number of alerts.

Conclusions

Programmers don't write tests in order to fix alerts. They write tests to isolate and detect potential issues within their code base. Then they fix those issues, and fixing an issue often also fixes an alert -- or a group of alerts -- related to it.

Does your project have all the tests it could possibly want? The data indicates that, basically no matter how many tests you have, writing some more has a good chance of helping you resolve yet another alert. The reason the test fixed that alert is probably that the issue uncovered by the test was the one that prompted the alert in the first place. In other words, alerts identify defects in your code before any tests are written. In essence, by fixing alerts you achieve the same as by writing additional tests -- but at zero implementation cost.

Data collected from many thousands of open source projects demonstrates that if you care about writing test code then you should also care about LGTM's alerts. It's additional testing for free.

You can use alerts as inspiration for where and how your test coverage could be improved. But there's an even easier way to ensure no one (re)introduces alerts: LGTM offers free automated code review for GitHub and Bitbucket projects. Once activated, LGTM acts like a trusted code reviewer dedicated to identifying problems in your code. If it finds any, it raises an alert before any code is merged into master! Follow the link to try it out.

Footnotes

  1. In the case of Java, I also require that the project builds its tests. For compiled languages, LGTM only analyzes the code that's actually built, and I want to make use of the info LGTM computes about a project's tests, in particular the actual lines of code.

  2. "Similarly-sized" means the closest 200 smaller projects and the closest 200 bigger ones.

  3. Editor's note: we have since added TypeScript support and are analyzing it in all projects that use it.

  4. For each of the three languages, this statement is statistically significant at a significance level of 0.1%.

  5. I would not expect test code to be as clean as production code in absolute terms. But I would expect the relative correspondence to hold: someone who generally writes shoddier production code than others will normally also tend to write shoddier test code than others.

  6. In fact, you might expect the connection to test code to be even stronger than the connection to production code. The reason is that code bases are usually of mixed origin. If some developers test more, they contribute a larger proportion of the test code than of the production code. If these are also the developers that produce fewer alerts, the alert density in test code should profit more from them than the alert density in production code.

  7. For each of the three languages, this statement is statistically significant at a significance level of 0.1%.

  8. This post was originally published on LGTM.com on December 12, 2017. As of now (August 8, 2019), 8,500 C/C++ and 2,500 C# projects are analyzed daily on LGTM.com.
