Bugs are bad. You want to find and remove them as early as possible when working on a project. Typically, you'll write tests (unit tests, regression tests, smoke tests, and so on) that execute parts of the source code to check that it behaves as expected and to catch run-time bugs.
LGTM, in contrast, finds problems in source code ("alerts") before the program executes. These alerts range from simple coding errors to deep structural problems identified by sophisticated data flow analyses. In other words, testing is dynamic, while LGTM's source code analysis is static. So, at least at first glance, static analysis and testing do completely different things.
But both LGTM and tests aspire towards the same goal: making code better. With that in mind, I wanted to find out whether there’s any evidence that bridges the static and dynamic divide: Does fixing LGTM alerts result in fewer run-time bugs?
Slice by language
I analyzed each language separately, for two reasons:
- The queries LGTM uses are different for each language, and they detect different types of alerts. Many alerts only apply to a specific language, and are therefore not comparable between different languages.
- Testing cultures may well differ between languages. In fact, it will become clear below that the differences are substantial.
Slice by size
I split the projects by size into 10 groups. Project size is measured by lines of production code. LGTM automatically excludes resource files, generated code, empty lines and comments. Each group contains the same number of projects.
Furthermore, I split the projects into two categories: projects that contain test code (detected using our queries and a series of heuristics), and projects that do not. Both the number of alerts and the existence of tests correlate with the size of the project in a nonlinear way.
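To make the setup concrete, here is a minimal sketch of the grouping in Python. The real dataset isn't reproduced here: the frame below uses synthetic stand-in data, and the column names (loc, alerts, has_tests) and distribution parameters are my own illustrative choices, not LGTM's actual schema or numbers.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 5_000  # synthetic stand-in for the analyzed projects

# Roughly log-normal sizes; larger projects are more likely to have
# tests and to have alerts, mirroring the correlations described above.
loc = rng.lognormal(mean=9.0, sigma=1.5, size=n).astype(int)
has_tests = rng.random(n) < 1 / (1 + np.exp(-(np.log(loc) - 9.0)))
alerts = rng.poisson(loc / 2_000)

projects = pd.DataFrame({"loc": loc, "alerts": alerts, "has_tests": has_tests})

# Ten groups containing the same number of projects each (size deciles).
projects["size_group"] = pd.qcut(projects["loc"], q=10, labels=range(1, 11))
```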
The distribution of project sizes is roughly log-normal, so we see sigmoid shapes in the graph above. As a consequence, groups 2–9 are fairly homogeneous in size. Conversely, groups 1 and 10 each cover a wide range of sizes, so size may well remain a confounding factor for those two groups. We should interpret any results from them with caution.
Does testing reduce alerts?
We can answer this question by measuring the risk of having an alert for the two categories of projects: those with test code, and those without test code.
The following plot cycles through the risk of having at least one alert, at least two, and so on, for projects with and without tests. We can see immediately that, for almost all project sizes, the presence of test code implies fewer alerts: the green line (projects without tests) is consistently above the blue line (projects with tests).
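Continuing the hypothetical projects frame from the sketch above, this risk is simply the per-group mean of an indicator variable:

```python
# Empirical risk of having at least k alerts, split by size group and
# by whether the project has test code.
def risk_at_least(df, k):
    return (
        df.assign(at_least_k=df["alerts"] >= k)
          .groupby(["size_group", "has_tests"], observed=True)["at_least_k"]
          .mean()
    )

print(risk_at_least(projects, k=1))  # risk of having at least one alert
```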
Does testing make you better than others?
There's a different way to compare the number of alerts for tested and untested projects. We looked at absolute numbers before. But instead, we could also ask: "How many alerts does this project have compared to others of similar size?" To answer this question I took, for every project, the 400 other projects closest to it in size. I then ranked the projects from 0% ("every single one of my 400 peers is better than me") to 100% ("every single one of my 400 peers is worse than me"). Call this the project's alert rank: the percentage of peers with more alerts.
Now the investigation boils down to one simple question: is the average alert rank higher or lower for tested projects? Let's see how this depends on the size of the project.
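For illustration, here is a rough sketch of the whole computation, still on the hypothetical projects frame from above. Windowing over the size-sorted order is my approximation of "the 400 projects closest in size"; the real analysis may have selected peers differently.

```python
import numpy as np

def alert_rank(df, n_peers=400):
    """For each project: the percentage of its n_peers size-nearest
    peers that have strictly more alerts (0% = worst, 100% = best)."""
    df = df.sort_values("loc").reset_index(drop=True)
    alerts = df["alerts"].to_numpy()
    ranks = np.empty(len(df))
    for i in range(len(df)):
        # Window of n_peers neighbours around position i in the
        # size-sorted order, excluding the project itself.
        lo = max(0, min(i - n_peers // 2, len(df) - n_peers - 1))
        peers = np.r_[alerts[lo:i], alerts[i + 1 : lo + n_peers + 1]]
        ranks[i] = 100 * (peers > alerts[i]).mean()
    return df.assign(alert_rank=ranks)

# Average alert rank for tested vs. untested projects.
ranked = alert_rank(projects)
print(ranked.groupby("has_tests")["alert_rank"].mean())
```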
We can clearly see that, in general, projects that lack tests have more alerts: it's mainly blue above the dividing line.
Fewer alerts imply fewer bugs
LGTM's large-scale analysis of open-source projects tells us something crucial: projects with test code have fewer alerts.
The implication is clear: when developers write tests, run them, and fix the resulting bugs, they also tend to fix LGTM alerts along the way, without specifically setting out to do so. So run-time tests and LGTM alerts identify an overlapping set of software defects.
We began by asking, "Does fixing LGTM alerts result in fewer run-time bugs?" In general, we can't prove a direct causal connection. But it's unlikely that the correlations above are all just coincidence. So we would say: if you care enough about the quality of your code to write tests, then you should also care about LGTM alerts.
If your projects are stored on GitHub.com, you can set up LGTM to run checks automatically on each pull request that you make. It's fast, it's free, and LGTM will make your code better.
1 I've used a Mann-Whitney-Wilcoxon test with familywise error correction using Holm-Bonferroni. I chose the Mann-Whitney-Wilcoxon test because it makes no assumptions about the distributions, which here are far from normal. I needed familywise error correction because I'm running 30 individual significance tests in parallel and want to avoid false positives that appear simply because I ran so many tests.
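A sketch of that procedure, using SciPy and statsmodels on the hypothetical projects frame from earlier (here one test per size group, rather than the full 30 tests run in the original analysis):

```python
from scipy.stats import mannwhitneyu
from statsmodels.stats.multitest import multipletests

# One Mann-Whitney-Wilcoxon test per size group: do the alert counts of
# tested and untested projects come from the same distribution?
pvals = []
for _, group in projects.groupby("size_group", observed=True):
    tested = group.loc[group["has_tests"], "alerts"]
    untested = group.loc[~group["has_tests"], "alerts"]
    pvals.append(mannwhitneyu(tested, untested, alternative="two-sided").pvalue)

# Holm-Bonferroni familywise error correction across all the tests.
reject, corrected, _, _ = multipletests(pvals, alpha=0.05, method="holm")
print(list(zip(corrected.round(4), reject)))
```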
2 This means that the population used to calculate each window spans a large range of sizes, which increases the role of size as a confounding variable. Larger projects are more likely to have tests; at the same time, they are more likely to have alerts (there is simply more that can go wrong). This skews the results towards tests appearing less beneficial than they actually are.
(Image credit: binaryproject / 123RF Stock Photo)
Note: Post originally published on LGTM.com on November 15, 2017