Source code analysis: Preventing bugs through tests and alerts

August 01, 2019


Bugs are bad. You want to find and remove them as early as possible when working on a project. Typically, you'll write tests (for instance, unit tests, regression tests, smoke tests) to execute parts of the source code to ensure code behaves as expected and to identify run-time bugs.

LGTM, in contrast, finds problems in source code ("alerts") before the program executes. These alerts range from simple coding errors to deep structural problems identified by sophisticated data flow analyses. In other words, testing is dynamic, while LGTM's source code analysis is static. So, at least at first glance, LGTM's analysis and writing tests do something completely different.

But both LGTM and tests aspire towards the same goal: making code better. With that in mind, I wanted to find out whether there’s any evidence that bridges the static and dynamic divide: Does fixing LGTM alerts result in fewer run-time bugs?

LGTM data

I analyzed all projects on LGTM. At the time of writing, that was just over 50,000 in total: 4,786 Java projects, 11,213 Python projects, and 36,444 JavaScript projects. These are all open source projects. Projects are automatically added to LGTM if they fulfill objective criteria (for example, popularity on GitHub, measured by the number of stars). There are also projects that were added manually by developers who are committed to improving their code.

Slice by language

When I carried out this research, LGTM analyzed Java, Python, and JavaScript code. C/C++ support was in beta back then, and we now analyze C# and COBOL too. For various reasons, we need to consider each language separately:

  • The queries LGTM uses are different for each language, and they detect different types of alerts. Many alerts only apply to a specific language, and are therefore not comparable between different languages.
  • Testing cultures may well differ between languages. In fact, it will become clear below that the differences are massive.
  • The projects themselves are just very different for different languages. For example, the median JavaScript project has only 475 lines of code, compared to 3582 lines of code for the median Java project.

Although C/C++, C#, and COBOL hadn't been thoroughly tested at the time I did this research, it is now clear that typical projects in these languages are also very different from those written in JavaScript, Java, and Python.

Slice by size

I split the projects by size into 10 groups. Project size is measured by lines of production code. LGTM automatically excludes resource files, generated code, empty lines and comments. Each group contains the same number of projects.

Furthermore, I split the projects into two categories: projects that contain test code (detected using our queries and a series of heuristics), and projects that do not. Both the number of alerts and the existence of tests correlate with the size of the project in a nonlinear way.
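A minimal sketch of this slicing, assuming a pandas DataFrame with hypothetical columns `lines_of_code` and `has_tests` (the real data and test-detection heuristics aren't public, so the sample data here is simulated):

```python
import numpy as np
import pandas as pd

# Simulated stand-in data: one row per project (hypothetical columns).
rng = np.random.default_rng(0)
projects = pd.DataFrame({
    # Roughly log-normal project sizes, as observed in the article.
    "lines_of_code": np.exp(rng.normal(7.0, 2.0, size=1000)).astype(int) + 1,
    "has_tests": rng.random(1000) < 0.5,
})

# Split into 10 groups with equal numbers of projects, ordered by size.
projects["size_group"] = pd.qcut(
    projects["lines_of_code"], q=10, labels=range(1, 11)
)

# Fraction of projects with tests in each size group.
print(projects.groupby("size_group", observed=True)["has_tests"].mean())
```

`pd.qcut` places decile boundaries so that each group gets the same number of projects, which is exactly the equal-count grouping described above.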

[Figure: splitting the data]

The distribution of project sizes is roughly log-normal, which is why we see sigmoid shapes in the graph above. As a consequence, groups 2–9 are fairly homogeneous in size. In contrast, groups 1 and 10 each cover a great range of sizes, so size may well still play a role as a confounding factor there. We should interpret any results from those two groups with caution.

The graph above also shows some interesting general differences between the languages: Small Python projects rarely use tests, while large Python projects almost always do. For Java, even small projects often use tests, and bigger ones use them still more often. For JavaScript projects, the presence of tests is almost independent of size.

This last observation may be explained by the fact that for JavaScript, lines of code is an extremely bad indicator of program complexity, even more so than for other languages. If we assume that a programmer's decision to include tests is mainly driven by how complex they judge their project to be, then what we see is exactly what we would expect. The number of tests and alerts is, of course, still strongly tied to size, so it's essential to slice JavaScript projects by size as well.

Does testing reduce alerts?

We can answer this question by measuring the risk of having an alert for the two categories of projects: those with test code, and those without test code.
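In code, that "risk" is simply the fraction of projects whose alert count reaches a threshold. A sketch with simulated alert counts (the Poisson distributions here are illustrative stand-ins, not LGTM's actual data):

```python
import numpy as np

rng = np.random.default_rng(1)
# Simulated alert counts; the rates are made up for illustration only.
alerts_without_tests = rng.poisson(4, size=500)
alerts_with_tests = rng.poisson(2, size=500)

def risk_at_least(alert_counts, k):
    """Fraction of projects with at least k alerts."""
    return float(np.mean(np.asarray(alert_counts) >= k))

for k in (1, 2, 3):
    print(f">= {k} alerts: "
          f"without tests {risk_at_least(alerts_without_tests, k):.2f}, "
          f"with tests {risk_at_least(alerts_with_tests, k):.2f}")
```

Cycling `k` through 1, 2, 3, … produces exactly the family of curves described for the plot below.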

The following plot cycles through the risk for having at least one alert, at least two, and so on, for projects with and without tests. We can immediately see that, for almost all sizes of projects, the presence of test code implies fewer alerts: the green line (projects without tests) is consistently on top of the blue line (projects with tests).

The presence of tests implies fewer LGTM alerts. For JavaScript and Python, this seems to hold everywhere, except for some values at group 10. As we know, this group mashes together projects of very different sizes, which results in less reliable data. For Java, the effect does not seem to be quite so clear for projects with a small number of alerts. However, Java is the programming language with the smallest sample size, so it's not surprising that the trends are less clearly visible. The trends may in fact also be different: Java is compiled and statically typed, so some of the most basic checks are already performed by the compiler.

For JavaScript and Python, this result is not only visible in the plot, but also statistically significant. Very much so, in fact. For every single group (except 10), the claim that non-testing projects have a higher chance of alerts passes a statistical test[1] at a significance level of 0.1%. The results for Java and for group 10 do not pass this significance test.

Does testing make you better than others?

There's a different way to compare the number of alerts for tested and untested projects. We looked at absolute numbers before. Instead, we could also ask: "How many alerts does this project have compared to others of similar size?" To answer this question I took, for every project, the 400 other projects closest to it in size. I then ranked the projects from 0% ("every single one of my 400 peers is better than me") to 100% ("every single one of my 400 peers is worse than me"). Call this percentage of peers with more alerts the project's alert rank.
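A sketch of this alert-rank computation, assuming parallel arrays of project sizes and alert counts (the function and variable names are mine, not from the article; ties are counted as "not worse", which is one possible convention):

```python
import numpy as np

def alert_rank(sizes, alerts, i, n_peers=400):
    """Percentage of project i's size-nearest peers that have more alerts.

    0%   -> every peer has fewer alerts (project is worst among its peers)
    100% -> every peer has more alerts (project is best among its peers)
    """
    sizes = np.asarray(sizes)
    alerts = np.asarray(alerts)
    # Indices of all projects, sorted by distance in size from project i.
    order = np.argsort(np.abs(sizes - sizes[i]), kind="stable")
    peers = [j for j in order if j != i][:n_peers]
    return 100.0 * float(np.mean(alerts[peers] > alerts[i]))
```

For example, with sizes `[10, 11, 12, 13, 14]` and alert counts `[0, 1, 2, 3, 4]`, the middle project has two of its four peers with more alerts, so its alert rank is 50%.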

Now this investigation boils down to one simple question. Is the average alert rank higher or lower for tested projects? Let's see how this depends on the size of the project.

[Figure: alert ranks]

We can clearly see that, in general, projects that lack tests have more alerts: it's mainly blue above the dividing line.

The effect is fairly homogeneous for JavaScript and Python projects, possibly a bit stronger for medium-to-large projects than for small-to-medium ones. It's not clearly discernible at the very top, but in that range the data becomes very thin[2]. For Java, the effect is only visible for medium-to-large projects.

The claim that projects with tests have a better alert rank than projects without tests is statistically significant[1] at the 0.1% level for Python and JavaScript, and at the 1% level for Java. Testing is but one of many aspects of a project. It makes a big difference though:

[Figure: overview of the effect of testing]

Fewer alerts imply fewer bugs

LGTM's large-scale analysis of open-source projects tells us something crucial: projects with test code have fewer alerts.

The implication is clear: when developers write tests, run them, and fix bugs, then they also tend to fix LGTM alerts, without specifically setting out to do so. So both run-time tests and LGTM alerts identify some shared set of software defects.

We began by asking the question, "Will fixing alerts reduce run-time bugs?". In general, we can't prove a direct causal connection. But it's unlikely that the connections above are all just coincidental. So we would say that if you care sufficiently about the quality of your code to write tests, then you should also care about LGTM alerts.

You can set up LGTM to run checks automatically on each pull request you make. It's fast, it's free, and it will make your code better.


1 I used a Mann-Whitney-Wilcoxon test with familywise error correction via the Holm-Bonferroni method. I chose the Mann-Whitney-Wilcoxon test because it makes no assumptions about the underlying distributions, which here are far from normal. Familywise error correction is needed because I ran 30 individual significance tests in parallel and wanted to avoid false positives that appear simply because so many tests were run.
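A sketch of this testing procedure with simulated data (the real per-group alert counts aren't public; `scipy` provides the Mann-Whitney U test, and the Holm step-down correction is implemented by hand here to keep the block self-contained):

```python
import numpy as np
from scipy.stats import mannwhitneyu

def holm_reject(p_values, alpha=0.001):
    """Holm-Bonferroni step-down: returns which null hypotheses are
    rejected at familywise error rate alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for step, i in enumerate(order):
        if p_values[i] > alpha / (m - step):
            break  # once one test fails, all larger p-values fail too
        reject[i] = True
    return reject

rng = np.random.default_rng(2)
p_values = []
for group in range(10):  # one test per size group
    # Simulated alert counts; the Poisson rates are made up.
    without_tests = rng.poisson(4, size=200)
    with_tests = rng.poisson(2, size=200)
    # One-sided: do projects without tests tend to have more alerts?
    stat, p = mannwhitneyu(without_tests, with_tests, alternative="greater")
    p_values.append(p)

print(holm_reject(p_values))
```

The step-down design is why Holm is less conservative than plain Bonferroni: the smallest p-value is compared against alpha/m, the next against alpha/(m−1), and so on.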

2 This means that the population used to calculate each window comprises a large range of sizes, which increases the role of size as a confounding variable. Larger projects are more likely to have tests; at the same time, larger projects are more likely to have alerts (there is more that can go wrong). This skews the results towards tests appearing less beneficial than they actually are.

(Image credit: binaryproject / 123RF Stock Photo)

Note: Post originally published on November 15, 2017