Many large software companies recognize the benefits of having their coders contribute to open source projects. Sometimes, these contributions are like charitable donations to a community effort. Sometimes, it's a company's own project that's worked on in a transparent way that encourages outside contributions. Often, it's something in between.
But there's one thing that open source contributions always are: open. If someone is recognizable as an employee of a certain company (for example, by virtue of a company email address), they automatically represent that company. The quality of their contributions can benefit or hurt the company's reputation, since they are judged by prospective customers, collaborators, and employees. And also by LGTM.
In this blog, I compare companies (Google or Microsoft?), organizations (Apache or Mozilla?) and email providers (Gmail or Hotmail?). My goal is to find out which email domains offer the best contributions to the open source communities on GitHub and Bitbucket. Who writes the best code?
Judging and attributing contributions
LGTM checks each commit to see whether it introduces or fixes alerts in the code base. These alerts are related to run-time bugs (see my previous blog posts about the significance of LGTM alerts here and here). If two commits change the size of the code base in a similar way (for example, they both increase it by 10 lines of code) you can compare their quality: If one commit fixes many more LGTM alerts than it causes, and the second one fixes only a few, the first one is better. Based on this, any commit can be assigned an "alert rank" between 0% (worse than all other similar commits) and 100% (better than all other similar commits) as described in another blog post.
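The linked post describes the actual ranking procedure; as a rough, illustrative sketch (the function and parameter names here are my own, not LGTM's), a percentile rank among similar-sized commits could be computed like this:

```python
def alert_rank(net_alerts_fixed, similar_commits):
    """Rank a commit's net alert balance (alerts fixed minus alerts
    introduced) against other commits of similar size.

    `similar_commits` is a list of net alert balances for commits that
    change the code base by a comparable number of lines. Returns a
    percentage: 0% = worse than all of them, 100% = better than all.
    """
    worse = sum(1 for other in similar_commits if other < net_alerts_fixed)
    ties = sum(1 for other in similar_commits if other == net_alerts_fixed)
    # Split ties evenly, so a commit identical to all others scores 50%.
    return 100.0 * (worse + 0.5 * ties) / len(similar_commits)

# A commit that fixes 3 more alerts than it introduces, compared with
# five similar-sized commits:
print(alert_rank(3, [-1, 0, 0, 2, 5]))  # 80.0
```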
Each commit has an author, and each author has an email address. I use the email address's domain name to find out which organization an author belongs to. I compute the mean alert rank of each organization to determine which one is best. Of course, if an organization is only responsible for a single commit, which happens to be excellent, that organization might easily lead a naively compiled table. Most likely, that's just chance and does not actually represent a phenomenally good company. I take the following two actions to avoid such false positives:
- I only consider the 20 most common email domains. Using more would incur a larger risk of false positives.
- I compute confidence intervals[2], taking care that all medalists are statistically significantly better than the average commit. It doesn't guarantee that the order is completely correct, but at least the prizes don't go to anyone who doesn't deserve them.
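Put together, the attribution and the safeguard above might look roughly like this sketch (a simplification under my own assumptions; the real analysis is more involved):

```python
import random
from collections import defaultdict

def domain_of(email):
    """Attribute an author to an organization via the email domain."""
    return email.rsplit("@", 1)[-1].lower()

def mean_rank_by_domain(commits):
    """commits: iterable of (author_email, alert_rank) pairs.
    Returns the mean alert rank per email domain."""
    ranks = defaultdict(list)
    for email, rank in commits:
        ranks[domain_of(email)].append(rank)
    return {d: sum(r) / len(r) for d, r in ranks.items()}

def bootstrap_lower_bound(ranks, simulations=100_000, confidence=0.95):
    """One-sided lower confidence bound on the mean alert rank,
    estimated by resampling the observed ranks with replacement."""
    means = sorted(
        sum(random.choices(ranks, k=len(ranks))) / len(ranks)
        for _ in range(simulations)
    )
    return means[int((1 - confidence) * simulations)]
```

A domain would then only earn a medal if its lower bound stays above the overall mean, so a single lucky commit can't carry an organization to the top.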
What is it that sets the organizations that produce the best code apart from the rest? Of course, any attempt at an exhaustive treatment of this multifaceted question would exceed the scope of this blog post. But I can provide at least a flavor of the background by also comparing whether a typical commit is more focused on expanding or pruning the code base, and whether it is concentrated in a few files or spread out over many[3].
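As a rough sketch of these two auxiliary measures (the formulas below are my own simplification, not necessarily the exact ones LGTM uses):

```python
def net_growth(lines_added, lines_deleted):
    """Positive when a commit expands the code base, negative when it prunes."""
    return lines_added - lines_deleted

def focus(lines_changed, files_touched):
    """Lines changed per touched file, as a size-adjusted proxy for focus:
    a 300-line commit over 3 files (100 lines/file) counts as more focused
    than a 4-line commit over 2 files (2 lines/file)."""
    return lines_changed / files_touched

print(net_growth(120, 30))           # 90 -> an expanding commit
print(focus(300, 3) > focus(4, 2))   # True
```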
I only report results if the commits from an email domain are statistically significantly different from other commits[4].
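For flavor, here is a stdlib-only sketch of the kind of check involved, using the normal approximation of the Mann-Whitney-Wilcoxon rank test mentioned in the footnotes (this is my own illustrative simplification, not LGTM's actual pipeline):

```python
from statistics import NormalDist

def mann_whitney_p(xs, ys):
    """Two-sided Mann-Whitney-Wilcoxon test via the normal approximation
    (adequate for large samples). Returns the p-value for the hypothesis
    that xs and ys are drawn from the same distribution."""
    n1, n2 = len(xs), len(ys)
    # U statistic: number of (x, y) pairs with x > y; ties count half.
    u = sum((x > y) + 0.5 * (x == y) for x in xs for y in ys)
    mean = n1 * n2 / 2
    sd = (n1 * n2 * (n1 + n2 + 1) / 12) ** 0.5
    z = (u - mean) / sd
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Clearly separated samples give a tiny p-value; identical ones do not:
print(mann_whitney_p(list(range(100, 130)), list(range(30))) < 0.05)  # True
```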
Who writes the best Java code?
In the following table, the calculated confidence interval[5] is blue, with the actual measurement marked in green.
Java coders from Microsoft have the best average quality. (There aren't very many of them working in open source projects, which is why the confidence interval is rather broad.) Runner-up Pivotal is mainly a Java company, best known for the Spring framework. LGTM's security team recently uncovered some serious vulnerabilities in Spring (as described in previous blog posts: 1, 2, and 3), which Pivotal has since corrected. These efforts to improve quality allowed Pivotal to narrowly push Red Hat into third place.
Who writes the best Python code?
Where are the big differences?
What about the free email providers?
Eight of the common email domains are not limited to a particular company or organization, but are open to everyone. In order of number of commits, these are gmail.com, hotmail.com, qq.com, yahoo.com, 163.com, me.com, outlook.com, and googlemail.com. I expect their users to be a more diverse group than the employees of any particular company. However, there may still be some general trends. And indeed there are.
There seem to be two kinds of providers:
- Users with email addresses from gmail.com, googlemail.com, me.com, and yahoo.com are pretty average. 5 of their 11 scores are very slightly better than average, 6 are very slightly worse, but the differences are not statistically significant.
- Email addresses from hotmail.com, qq.com, and 163.com are bad news. They all perform substantially below average, mostly[6] statistically significantly so.
The following plot shows that the two categories are nicely separated for all three languages:
The shaded areas in the above plot represent confidence regions[7], green for good and blue for bad.
This demonstrates that using a free email address doesn't have to be a bad sign; it depends on the exact domain.
Why do hotmail, 163, and qq perform so badly? I know very little about the Chinese internet scene, and about what using an email address from the big portal sites 163.com or qq.com signifies. But I do know that hotmail has a reputation for including many less technically skilled users, to the point that many high-profile recruiters are willing to go on record about the negative impact of hotmail addresses on job applications.
Overall, coders with the email domains hotmail, qq, and 163 contribute poorer-quality code. Coders with the email domains google, yahoo, and outlook supply code of a quality similar to that of average or below-average coders with commercial email addresses. Coders who work for Microsoft, SAP, or Google contribute the highest-quality code.
A handy trick to boost your score
Everyone makes mistakes, but not everyone has to suffer the consequences. If you enable automated code review for pull requests, LGTM warns you whenever someone is about to introduce new problems into your code base. That way, you never have to worry about subpar scores again. Oh, and you'll also have fewer long-term headaches and run-time bugs, if you care about that sort of thing.
Conflict of interest statement
Some of the companies that I investigated (in particular Microsoft and Google) are clients of the company Semmle where I am employed. I have not treated these companies differently, nor have I received any suggestion that I do so.
Title image: Hitesh Choudhary
Emojis: Noto project
[1] LGTM also analyzes projects of the C family, but I'd prefer to wait until I have more data before drawing any conclusions.
[2] I computed the confidence intervals using a bootstrapped test based on 100,000 simulations. The intervals have a confidence level of 95% (one-sided: since I'm only interested in winners, this is one of the rare instances where one-sided tests are actually appropriate).
[3] I also take the commit size into account here: a large commit touching 3 files counts as more focused than a tiny commit touching 2 files.
[4] For commit size, this is tested with a Mann-Whitney-Wilcoxon test. The expanding/pruning distinction and the focused/spread distinction are tested with bootstrapped tests. In each case, I use a significance level of 5% (two-sided).
[5] The confidence interval is a 90% interval. This makes the certainty that the true value is at least as large as the lower end of the interval 95%, which is the usual criterion for statistical significance.
[6] Only the Python contributions from qq.com and 163.com fail a significance test at 5%. However, there are very few contributions from these domains to Python projects. In a sense, there just isn't enough data to condemn them.