Reviewing code is essential for a clean code base. I’ve recently shown that review processes are better if they are less hierarchical. Such an everyone-reviews-everyone-else culture is associated with higher code quality, faster turnaround, and less wasted development effort.
This blog post continues my application of flow hierarchy to the code review process. I introduced the flow hierarchy measure in my last blog post. It indicates a kind of ‘pecking order’ among the contributors to a project, where it's always the higher-ranked people reviewing the lower ranks, and rarely the other way around. Since code review is so important, barriers to code review can be problematic. And indeed, when I compared pairs of similarly-sized projects, the majority of coding mistakes were made in the projects with the higher flow hierarchy.
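As a reminder of what the measure captures, here is a minimal sketch. It assumes the common graph-theoretic definition of flow hierarchy (the fraction of directed edges that lie on no cycle) applied to an unweighted graph with one edge per reviewer-author pair; the exact variant used in this series may differ, and the function name and contributor names are made up for illustration.

```python
from collections import defaultdict


def flow_hierarchy(review_edges):
    """Fraction of distinct reviewer -> author edges that lie on no directed cycle.

    Assumed definition: 1.0 means a strict pecking order, 0.0 means every
    review relationship is part of some loop of mutual reviewing.
    """
    edges = set(review_edges)
    if not edges:
        raise ValueError("need at least one review edge")

    successors = defaultdict(set)
    for reviewer, author in edges:
        successors[reviewer].add(author)

    def reaches(start, target):
        # Iterative depth-first search over the review graph.
        stack, seen = [start], {start}
        while stack:
            node = stack.pop()
            if node == target:
                return True
            for nxt in successors.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return False

    # An edge reviewer -> author is on a cycle exactly when the author
    # can reach the reviewer again by following review edges.
    acyclic = sum(1 for reviewer, author in edges if not reaches(author, reviewer))
    return acyclic / len(edges)


# Hypothetical projects: a strict pecking order versus everyone-reviews-everyone.
strict = [("alice", "bob"), ("alice", "carol"), ("bob", "carol")]
mutual = [("alice", "bob"), ("bob", "alice"), ("bob", "carol"), ("carol", "alice")]
print(flow_hierarchy(strict), flow_hierarchy(mutual))
```

The reachability check is run once per edge, which is fine for a small illustration but would be slow on a large review graph; a strongly-connected-components pass would give the same answer more efficiently.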
Java leads the way
So all the expected effects of hierarchy hold within each language, and they also appear to hold between the languages, with the least hierarchical language having the fastest pull requests and the most reviewers.
Commercial projects have the edge
Why is Java different? A potential explanation emerges when distinguishing between commercial and non-commercial projects. Full-time developers, as opposed to those working on software in their spare time, produce software in completely different institutional contexts. This context matters, and is particularly relevant to Java.
I need a way to classify whether a project is commercial or not. This is tricky, as simply going to the GitHub page of a project doesn’t immediately tell you anything (unless you already know the organization that owns the project). But on average, commercial projects commit predominantly during the working week and relatively less on the weekends. Can we use this information somehow?
Yes! But we don’t have labeled data, so this is an unsupervised learning problem. For each project, I can calculate the proportion of commits made on each day of the week. I then apply K-means clustering to these daily commit proportions to see if any meaningful separation occurs.
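The feature vector for one project could be computed roughly like this. It is a sketch only: the function name and the sample timestamps are hypothetical, and it assumes each commit is available as a Python `datetime` whose time zone you are happy to take the weekday from.

```python
from collections import Counter
from datetime import datetime


def daily_commit_proportions(commit_times):
    """Return a 7-element list: the share of commits made on Monday..Sunday."""
    if not commit_times:
        raise ValueError("project has no commits")
    counts = Counter(t.weekday() for t in commit_times)  # Monday == 0, Sunday == 6
    total = len(commit_times)
    return [counts.get(day, 0) / total for day in range(7)]


# Hypothetical commit timestamps for a single project.
commits = [
    datetime(2018, 7, 2, 10, 15),  # Monday
    datetime(2018, 7, 3, 14, 0),   # Tuesday
    datetime(2018, 7, 3, 16, 30),  # Tuesday
    datetime(2018, 7, 7, 11, 45),  # Saturday
]
print(daily_commit_proportions(commits))
# [0.25, 0.5, 0.0, 0.0, 0.0, 0.25, 0.0]
```

Using proportions rather than raw counts keeps large and small projects comparable, since every vector sums to 1.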
To train my K-means clustering model (with K=2), I started from data on nearly 4 million commits from a little over 40,000 GitHub projects, and then restricted it to projects with at least 100 commits. The daily commit proportions of the resulting centroids are shown above. The centroids automatically, with no prior information, distinguish between projects that commit mainly on weekdays and those that commit mainly on the weekend! For projects where I have both pull request and commit data, I apply the K-means classifier to tell me which centroid the project belongs to. Treating the weekday-heavy cluster as commercial, I can now compare the flow hierarchy, pull request duration, and reviewer-opener ratio between commercial and non-commercial projects.
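To make the clustering and assignment steps concrete, here is a small self-contained sketch. It is not the code behind the results in this post: it uses a plain from-scratch Lloyd's algorithm rather than any particular library, the project names and proportion vectors are invented, and labelling the cluster with the smaller weekend share as "weekday-heavy" is my assumed way of telling the two centroids apart.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centroids):
    """Index of the centroid closest to point."""
    return min(range(len(centroids)), key=lambda i: squared_distance(point, centroids[i]))


def kmeans(points, k=2, iterations=100, seed=0):
    """Plain Lloyd's algorithm; returns the list of k centroids."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]  # k distinct data points
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        # Move each centroid to the mean of its cluster (keep it if the cluster is empty).
        updated = [
            [sum(col) / len(cluster) for col in zip(*cluster)] if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if updated == centroids:
            break
        centroids = updated
    return centroids


def weekend_share(proportions):
    return proportions[5] + proportions[6]  # Saturday + Sunday


# Hypothetical daily commit proportions (Monday..Sunday) for four projects.
projects = {
    "proj_a": [0.20, 0.21, 0.19, 0.20, 0.18, 0.01, 0.01],
    "proj_b": [0.18, 0.20, 0.22, 0.19, 0.17, 0.02, 0.02],
    "proj_c": [0.10, 0.09, 0.11, 0.10, 0.12, 0.25, 0.23],
    "proj_d": [0.08, 0.10, 0.09, 0.11, 0.10, 0.27, 0.25],
}

centroids = kmeans(list(projects.values()), k=2)
weekday_cluster = min(range(2), key=lambda i: weekend_share(centroids[i]))
labels = {
    name: "weekday-heavy" if nearest(p, centroids) == weekday_cluster else "weekend-heavy"
    for name, p in projects.items()
}
print(labels)
```

K-means itself only returns two unlabeled groups; reading the weekday-heavy group as "commercial" is an interpretation layered on top, which is why the classifier is imperfect.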
I find that commercial projects have lower hierarchy, faster pull requests, and more reviewers, similar to the results for Java. The differences in duration are not statistically significant, but the differences in hierarchy and reviewers are robust (p<1e-6). It seems commercial developers typically have a more effective code review process than non-commercial ones. It's not completely obvious why that is. My personal speculation would be that it's because commercial projects are, on average, under stronger pressure towards efficiency. They simply can't afford taboos like "the newbie can never review the master." While many non-commercial projects are very professional, not all of them have to be.
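A group comparison of this kind could look like the sketch below. It assumes a two-sided Mann-Whitney test, which is the test named in the footnotes for the language comparisons; this paragraph does not say which test produced p<1e-6. The implementation uses the large-sample normal approximation with a tie correction and no continuity correction, and the flow hierarchy values are invented.

```python
import math


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    tie_sizes = []
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of the 1-based positions start+1..end+1
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        tie_sizes.append(end - start + 1)
        start = end + 1
    return ranks, tie_sizes


def mann_whitney_u(sample_a, sample_b):
    """Return (U statistic for sample_a, two-sided p-value via normal approximation)."""
    n_a, n_b = len(sample_a), len(sample_b)
    if n_a == 0 or n_b == 0:
        raise ValueError("both samples must be non-empty")
    ranks, tie_sizes = average_ranks(list(sample_a) + list(sample_b))
    u_a = sum(ranks[:n_a]) - n_a * (n_a + 1) / 2
    n = n_a + n_b
    tie_term = sum(t ** 3 - t for t in tie_sizes) / (n * (n - 1))
    variance = n_a * n_b / 12 * ((n + 1) - tie_term)
    if variance == 0:  # every value identical: no evidence of a difference
        return u_a, 1.0
    z = (u_a - n_a * n_b / 2) / math.sqrt(variance)
    return u_a, math.erfc(abs(z) / math.sqrt(2))


# Hypothetical flow hierarchy values for the two groups of projects.
commercial = [0.20, 0.30, 0.25, 0.35, 0.30]
non_commercial = [0.60, 0.70, 0.65, 0.80, 0.75]
u, p = mann_whitney_u(commercial, non_commercial)
print(u, p)
```

The normal approximation is only trustworthy for reasonably large groups; with a handful of projects per group, as in this toy input, an exact test would be the safer choice.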
Java has the highest proportion of commercial projects, suggesting that ‘commercial-ness’ partially explains the observed hierarchy differences between languages. This is based on an imperfect classifier, so a better method of detecting commercial projects could reveal an even stronger result.
Room to grow
On average, commercial projects appear to be better organized. But better does not mean perfect. And in fact, the benefits of everyone-reviews-everyone-else show no sign of saturation [3] for the flow hierarchies found either in commercial or in non-commercial projects: so no matter where you're at, there's every reason to keep on striving.
This is the last in a series of three blog posts I have produced over the course of my internship at Semmle (part 1, part 2). I would like to thank the Data Science team at Semmle for helping me produce this piece of work, and in general making my internship an all-round great experience!
[1] The difference from the language with the second lowest flow hierarchy was significant at the 0.1% level using a Mann-Whitney test.
[2] The difference from the language with the second most reviewers per opener was significant at the 0.001% level using a Mann-Whitney test.
[3] Often, factors that affect quality work like a threshold: as long as you haven't exceeded some maximum acceptable value, the exact value doesn't matter. On a plot, this would correspond to the curve flattening off. However, if you look at the curves for flow hierarchy versus number of reviewers, pull request merge ratio, or pull request duration, you see that this is not the case here: however low your flow hierarchy is, going even lower would still, on average, be better.