Code review metrics: Do more reviews lead to better code?

December 04, 2018


I’ve recently started a 3-month internship as a data scientist at Semmle. Taking a break from my physics PhD, I wasn’t sure what to expect from my first proper taste of working in industry. I had some experience in data science and machine learning, but I had spent the previous three years doing academic research.

My first month at Semmle, however, has been overwhelmingly positive, and my worries quickly faded after being welcomed by the data science team. In addition to the nice working environment, I was given a huge range of potential projects to choose from. I chose to work with Albert Ziegler, investigating the organizational structure of software projects and how that impacts code quality. I wanted to investigate how commits are distributed among developers, and how those commits get reviewed by other members of the team.

This blog post reports some of my preliminary results on the impact of code review on the quality of source code.

Does more reviewing lead to better code?

Having someone review my code can sometimes make me feel a bit... uneasy. On one hand, I would like to become a better programmer, so I value the feedback. On the other hand, I’ve put a lot of effort into writing that code—I don’t want some expert tearing my beloved work to pieces!

But code reviews are worth it, right? Or at least we assume they are. When asked in a survey, most developers said that code review is the best thing a company can do to improve code quality, with unit testing in second place (see The State of Code Review 2018 by SmartBear). But here I want to move away from survey responses, and move closer to an empirical and robust evaluation of the impact of code review. I therefore explore the relationship between the quality of code within a project, and how much that code gets reviewed by other developers.

Pull request data and code quality scores

You make a pull request when you propose changes to a Git project from your own branch. Assuming your collaborators are happy with the proposed modifications, the pull request is merged. However, before deciding whether to merge or not, collaborators have the opportunity to review the pull request and comment on specific changes to the code. Each pull request can have multiple reviewers providing multiple reviews.

I analyzed the pull requests submitted over the last few years to roughly 1000 prominent Python projects on GitHub. I measured the extent of code review in a project by dividing the total number of review comments[1] by the number of pull requests. This ‘review comments per pull request’ metric is less dependent on the size of a project than simply counting the number of review comments, and is therefore a better measure of the amount of reviewing. For brevity I'll refer to ‘review comments per pull request’ as review abundance.
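As a concrete illustration, here's a minimal sketch of the review abundance calculation. The project names and counts below are invented for illustration; the real analysis also excludes automatically generated review comments.

```python
# Toy sketch of the 'review abundance' metric: review comments per pull
# request. The projects and counts below are invented for illustration.
projects = {
    "alpha": {"review_comments": 150, "pull_requests": 500},
    "beta":  {"review_comments": 40,  "pull_requests": 400},
}

def review_abundance(counts):
    """Review comments divided by pull requests."""
    return counts["review_comments"] / counts["pull_requests"]

for name, counts in projects.items():
    print(f"{name}: {review_abundance(counts):.2f}")  # alpha: 0.30, beta: 0.10
```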

We now have review abundance, but how do I quantify how ‘good’ the code of a project is? Judging the quality of a project is a difficult task, but thankfully LGTM provides a solution. As described here, LGTM automatically measures the quality of a project using the most recent state of the code base.

To construct the quality measure, LGTM executes hundreds of QL queries against the project’s source code. These high-precision QL queries, developed over many years by language experts, identify issues ranging from simple uninitialized variables to potential zero-day vulnerabilities. The more LGTM alerts generated by the QL queries, the lower the code quality.

However, large projects will tend to have more alerts simply because they have more source code. To negate the effect of project size, LGTM compares the number of alerts with that of other open source projects with similar amounts of code. For example, if projects X and Y have the same amount of code, but project Y generates more LGTM alerts, then project Y will have a lower quality score relative to project X. This community-based quality score takes values between 0 (worst) and 1 (best), and for each GitHub project I use this score as a measure of code quality.
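The X-versus-Y comparison above can be pictured with a toy score: rank a project's alert count against peers with a similar amount of code, and score it by the fraction of peers it beats. This is only an illustration of the idea, not LGTM's actual computation.

```python
# Toy illustration (NOT LGTM's actual formula) of a community-based score:
# a project is scored by the fraction of similar-size peer projects that
# have MORE alerts than it does. 1.0 is best, 0.0 is worst.
def community_score(alerts, peer_alerts):
    if not peer_alerts:
        return 0.5  # no peers of similar size to compare against
    worse = sum(1 for a in peer_alerts if a > alerts)
    return worse / len(peer_alerts)

# Projects X and Y have the same amount of code, but Y generates more
# alerts, so Y gets the lower score.
peers = [10, 25, 40, 80, 120]  # alert counts of similar-size projects
print(community_score(20, peers))   # X -> 0.8
print(community_score(100, peers))  # Y -> 0.2
```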

But a code quality metric should account for run-time bugs, I hear you say. It's true that LGTM analyzes code statically and doesn't, for example, tell us about the efficiency or usefulness of a piece of code. However, in addition to identifying correctness, maintainability, and other properties of the source code, LGTM finds problems that can cause run-time bugs. Identifying LGTM alerts therefore has a similar effect to adding test code to projects (see here and here for explanations of this effect). So LGTM’s code quality metric does in fact capture important aspects of what we collectively understand as the quality of code.

More review comments, better code

I found that the review abundance of projects varies enormously. Most projects have only a handful of review comments, while a few projects have hundreds. I've therefore used a log scale to plot the number of review comments per pull request, to better illustrate the variation between projects. I also only consider projects with at least 5 code review comments and 100 pull requests (to remove the effects of very small projects, which can make the data a bit noisy).
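The filtering step above takes only a few lines; the thresholds come from the text, while the data shape and project names are hypothetical.

```python
# Keep only projects with at least 5 review comments and 100 pull
# requests, as described above (example data is invented).
projects = [
    {"name": "alpha", "review_comments": 150, "pull_requests": 500},
    {"name": "tiny",  "review_comments": 3,   "pull_requests": 20},
    {"name": "beta",  "review_comments": 40,  "pull_requests": 400},
]

filtered = [
    p for p in projects
    if p["review_comments"] >= 5 and p["pull_requests"] >= 100
]
print([p["name"] for p in filtered])  # ['alpha', 'beta']
```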

[Figure 1: code quality score vs. review abundance]

The main result: projects with a higher review abundance have, on average, a better code quality score[2]. This makes sense, as more reviewing should identify more problems which can be fixed before code is merged. Correlation does not imply causation, but in this case I think it’s quite plausible that more reviews do indeed lead to better code.

Incidentally, while it’s a good sign to have many review comments on each pull request, they don’t have to be extensive. There’s no discernible connection between comment length and quality. For a reviewer, the important thing is that they find a problem at all. There’s no need to write a novel about it.

Impact of review on quality grades

LGTM summarizes the code quality score as an easily accessible grade (for example, check out the stellar performance of scikit-learn). The code quality grades range from A+ to E and are based on a project's quality score relative to what is realistically achievable compared to other projects of a similar size.

I used these grades to split projects into two groups: those with grades D to E (low-quality) and those with grades A+ to C (high-quality). How does the review abundance vary between these two groups? Let’s look at the histograms of review abundance for the low- and high-quality projects. The y-axis of a histogram tells us how frequently values in a given range appear. To compare the two histograms, I normalized each by the number of projects in its group.

[Figure 2: histograms of review abundance for low- and high-quality projects]

Both distributions can sensibly be approximated by a log-normal distribution (note the log scale of the x-axis), with a similar amount of variation within the low- and high-quality groups. However, the high-quality distribution is shifted to the right, implying a higher number of review comments per pull request on average (the dark green shows where the two distributions overlap). This is confirmed by the median of each distribution (the dashed lines): on average, high-quality projects (0.27) have nearly double the review abundance of low-quality projects (0.14).

How many more review comments does a project need on average to move up one quality grade? I can estimate this by linearly regressing the quality grade onto the logarithm of review abundance. I found that you need to increase the number of review comments (on average) by 68% to move up a single grade.
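To make the arithmetic behind this kind of estimate concrete, here's a sketch on synthetic data: fit grade = a + b·log(abundance) by ordinary least squares, then note that moving up one grade means increasing log(abundance) by 1/b, i.e. multiplying the review abundance by exp(1/b). The data below is simulated, so the resulting factor won't match the 68% found on the real data.

```python
# Sketch of the grade-vs-log(abundance) regression on synthetic data.
import math
import random

random.seed(0)

# Synthetic projects: numeric grade 0 (E) .. 6 (A+), with log(abundance)
# drifting upward with grade, plus noise.
data = []
for _ in range(500):
    grade = random.randint(0, 6)
    log_abundance = -2.0 + 0.5 * grade + random.gauss(0, 0.8)
    data.append((log_abundance, grade))

# Ordinary least squares by hand: grade = a + b * log(abundance).
n = len(data)
mx = sum(x for x, _ in data) / n
my = sum(y for _, y in data) / n
b = (sum((x - mx) * (y - my) for x, y in data)
     / sum((x - mx) ** 2 for x, _ in data))

# One grade step corresponds to increasing log(abundance) by 1/b,
# i.e. multiplying the review abundance by exp(1/b).
factor = math.exp(1 / b)
print(f"multiply review abundance by ~{factor:.2f} to gain one grade")
```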

Want more reviews? Then add more reviewers!

I've shown that projects with higher code quality tend to be those with more code review comments. But the number of reviews is hard to control directly: it depends on how much useful feedback your reviewers can come up with. So what qualities of a project are linked to more code reviews?

I've looked at different metrics, and the best was also the simplest: just count the number of people who review pull requests. We all know that a fresh pair of eyes can help find those hard-to-spot mistakes, so more reviewers should lead to more feedback. I count someone as a 'reviewer' if they provide at least one code review.

There is another piece of information I take into account, namely the number of users who opened a pull request, that is, how many ‘openers’[3]. I can look at the ratio of the number of reviewers to openers, which tells me how many people are reviewing code relative to the number of people suggesting code changes. How does review abundance vary with the reviewer-opener ratio?

[Figure 3: review abundance vs. reviewer-opener ratio]

There is a strong positive relationship (Spearman correlation r = 0.75[4]) between review abundance and the reviewer-opener ratio. This has a clear interpretation: more reviewers lead to more feedback on your code. The average number of reviewers (per 100 openers) is 9 for the low-quality group and 14 for the high-quality group. In other words: higher-quality projects have a higher proportion of people reviewing the code.
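For reference, Spearman's correlation is simply the Pearson correlation of the ranks of the data, which makes it robust to the skewed, log-normal-like distributions seen here. A self-contained sketch, with five invented data points:

```python
# Spearman correlation = Pearson correlation of the ranks.
# The five (reviewer-opener ratio, review abundance) pairs are invented.
def ranks(values):
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    for rank, i in enumerate(order):
        r[i] = float(rank)
    return r  # ignores ties, which is fine for this illustration

def spearman(xs, ys):
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

reviewer_opener_ratio = [0.05, 0.09, 0.12, 0.20, 0.30]
abundance = [0.08, 0.15, 0.12, 0.40, 0.90]
print(f"Spearman r = {spearman(reviewer_opener_ratio, abundance):.2f}")
# prints: Spearman r = 0.90
```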

Previously, I found that on average you need to increase the number of review comments by 68% to move up a quality grade. How many more developers do you need to generate these extra review comments? Using the linear trend above, I estimated that you would need to increase the number of developers who conduct code reviews (per opener) by 32%[5] on average to move up a quality grade.

From these results, we would expect the reviewer-opener ratio to be positively correlated with code quality as well. Does this hold? Indeed it does, with a slightly higher Spearman correlation (r = 0.18[6]) than for review abundance (r = 0.17). It seems that four eyes really do see more than two…[7]

[Figure 4: code quality score vs. reviewer-opener ratio]

What does this all mean?

We all want better code—the question is how do we get there? By analyzing pull requests of GitHub projects, I have shown that simply increasing the amount of code review might be the most straightforward way of increasing code quality. If you want to shift your project from an E-D quality grade to a C-A+ quality grade, you may need to double the amount of reviewing.

You want people within your project to review more code. One option is to ask existing reviewers to do more reviewing (that is, the same number of eyes but staring longer). We can roughly test whether this is linked with higher quality by considering the review abundance per reviewer. Alas, I find no relationship (r = -0.02, p = 0.6) between code quality and review abundance per reviewer. The data suggests—somewhat unexpectedly—that for a fixed number of reviewers, more time spent reviewing does not improve code quality!

We now have evidence that adding more reviewers improves code quality, but getting existing reviewers to review more does not. Say that every developer in your organization spends 10% of their time reviewing code. Then a good bang-for-your-buck strategy is for developers to continue spending 10% of their time on code review, but spread that time over more pull requests. Why? Because the data suggests it's better to get more eyes on a pull request, than fewer eyes staring longer.

But you can also get another pair of eyes for free! Why not instead boost your current review process by setting up LGTM's automated code review on your pull requests? LGTM’s automated code review functionality acts as an additional automated reviewer that finds and flags problems independently. It adds an extra layer of code review by providing a comprehensive analysis of pull requests before they are merged (without interfering with existing reviewing by developers). LGTM will also catch deeper mistakes or vulnerabilities that developers might miss.

Overall, my empirical analysis suggests that:

  • The code review process does indeed improve code quality.
  • Increasing the number of reviewers, and not the time individuals spend reviewing, is the best way to improve your project’s code review process.

Thank you to the data science team for their help thus far—and in particular for a lovely welcome curry! I look forward to two more exciting months of my internship.

  1. Automatically generated review comments are ignored in this analysis.

  2. This result is statistically significant at a level of 0.1% with a p-value of 0.0002.

  3. A GitHub user can be both a reviewer and an opener for a project, so the ‘reviewers’ and ‘openers’ of a project are not mutually exclusive.

  4. This result is statistically significant with a p-value of less than 10^-6.

  5. The underlying reason why you need to increase the number of reviewers by less than you need to increase the number of review comments is that some reviewers are much more active than others. So a few extra reviewers can greatly increase the number of issues found, provided there are a few workhorses among those extra reviewers.

  6. This result is statistically significant at a level of 0.1% with a p-value of 0.00024.

  7. The National Magazine, 1857, with the alternative from Venice: “pope and peasant know more than the pope alone”.