The other day, I had to overpay horribly on the bus when the driver refused to give me change for a tenner. When I arrived at my computer I was still fuming. Appropriately, the first routine I wrote completely went up in flames when I tested it. It made me wonder: was it just coincidence, or would I have succeeded on my first attempt if I hadn't been so angry? And what if I had been happy instead, or sad?
Basically, I want to know: Is there a connection between a developer's emotional state and their quality of work?
Let's find out.
Measuring the quality of a piece of work
I’m going to look at developers’ commits and compare the code quality of each commit with the emotional state the developer was likely in when they made it. To gauge that state, I infer sentiments from the commit messages.
On LGTM.com, several million commits have been analyzed and “alerts” identified. These alerts can indicate a wide range of problems with the code, so the number of alerts is an indicator of code quality (as described in a previous blog post). It's not just a simple matter of counting alerts, though: commits that mostly add code typically also introduce more alerts, and commits that mostly delete code typically eliminate more alerts.
A straightforward way of measuring the quality of a commit is by comparing it to commits with a similar number of net lines of code1. For the purposes of this blog post, I compare each commit to its 500 nearest neighbors2 in terms of net lines of code. If its net effect on the total number of alerts is better than all of them, the commit scores an alert rank of 100%. If it's worse than all of them, it scores 0%. If there are as many commits with a better effect as there are commits with a worse effect, the commit's alert rank is 50%.
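To make the ranking concrete, here is a minimal sketch of the alert-rank computation. The data layout and function name are hypothetical stand-ins; the real analysis runs on LGTM's commit database:

```python
import numpy as np

def alert_rank(commits, idx, k=500):
    """Alert rank of commit `idx`: the fraction of its k nearest
    neighbors (by net lines of code) whose net effect on the number
    of alerts is worse. `commits` is a list of
    (net_lines, net_alerts) pairs."""
    net_loc = np.array([c[0] for c in commits], dtype=float)
    net_alerts = np.array([c[1] for c in commits], dtype=float)
    # distance in net lines of code to every other commit
    dist = np.abs(net_loc - net_loc[idx])
    dist[idx] = np.inf  # never compare a commit to itself
    neighbors = np.argsort(dist, kind="stable")[:k]
    worse = np.sum(net_alerts[neighbors] > net_alerts[idx])
    ties = np.sum(net_alerts[neighbors] == net_alerts[idx])
    # ties count half, so a commit in the middle of its peers scores 50%
    return (worse + ties / 2) / len(neighbors)
```

A commit whose neighbors all added more alerts than it did scores 1.0 (100%); one whose neighbors all did better scores 0.0.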
The alert rank is a very rough measure: commits are small, and alerts are rare. Most commits (83%, to be exact) don't change the number of alerts at all. Those have an alert rank pretty close to 50%. The other commits often just add or remove a single alert. I compensate for this low granularity and the high noise level by relying on the large size of the data set. If a group of commits was created under favorable circumstances, those commits should on average have a slightly better (that is, higher) alert rank.
Each commit comes with a commit message. I'm running a sentiment analysis on these messages to classify the emotions expressed in them as, for example, angry or sad. I don't distinguish between different degrees of sadness, but I do allow a message to be classified as both angry and sad.
Sentiment analysis is a challenging problem, and many tools have already been developed to tackle it. Commit messages, however, are particularly hard to analyze: sophisticated attempts to penetrate grammar and meaning are thwarted by incomplete sentences, copious abbreviations and a large amount of jargon.
Instead, I base my analysis pipeline on a tool called sentimentr3, which uses quite robust heuristics. It doesn’t get everything right, but it isn’t systematically thrown by telegraphese and weird lexemes. It looks for emotionally charged words and checks whether their surroundings are likely to amplify or reverse the thrust of the word. For example:
- The statement "looks good to me" contains the positively charged word "good" in a straightforward way, so it's classified as positive.
- The statement "looks really good to me" bolsters the word "good" with "really", so it's classified as very positive.
- The statement "looks good to me, but technobabbletechnobabble" undermines the word "good" with "but", so it's only classified as slightly positive.
- The statement "doesn't look good to me" inverts the word "good" through negation, so it's actually classified as negative.
This approach requires two lexicons. One is for the modifiers like “really”, “but” and “doesn’t” in the example above. I rely on sentimentr’s default here, which appears to work well. The other one is for words carrying an emotional value like “good” in that example. Sentimentr allows you to customize this, but its default is the Syuzhet lexicon4 which describes positive and negative words. I see no reason to deviate from that default when sorting commit messages into positive, neutral or negative.
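The examples above can be sketched with a toy scorer in the spirit of sentimentr's valence-shifter heuristic. The word lists here are tiny illustrative stand-ins for the real sentimentr and Syuzhet lexicons, and this sketch only handles amplifiers and negators, not adversative conjunctions like "but":

```python
# Toy valence-shifter scorer. The lexicons are illustrative stand-ins.
POLARITY = {"good": 1.0, "bad": -1.0, "broken": -1.0}
AMPLIFIERS = {"really", "very"}
NEGATORS = {"not", "doesn't", "never"}

def score(message):
    """Sum the polarity of charged words, adjusted by nearby modifiers."""
    words = message.lower().replace(",", "").split()
    total = 0.0
    for i, w in enumerate(words):
        if w not in POLARITY:
            continue
        value = POLARITY[w]
        # inspect a small window of words before the polarized word
        for ctx in words[max(0, i - 3):i]:
            if ctx in AMPLIFIERS:
                value *= 1.8      # "really good" beats plain "good"
            elif ctx in NEGATORS:
                value *= -1.0     # "doesn't look good" flips the sign
        total += value
    return total
```

So "looks good to me" scores 1.0, "looks really good to me" scores higher, and "doesn't look good to me" comes out negative, mirroring the classifications above.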
But I also want to go beyond this one-dimensional classification to the specific emotional flavors like angry or sad. Here I use the NRC lexicon5, which classifies English words evocative of one or more of the following 8 basic emotions6: anger, fear, joy, sadness, trust, disgust, surprise, anticipation.
The following example word clouds show lexicon words commonly appearing in commit messages:
For every message, sentimentr computes a score between +1 (strongly expressed) and -1 (strongly negated or denied) for each emotion. Inversion through negation or denial appears to be more complicated with specific emotions than it is with general positive/negative sentiments. Rather than fall victim to complex biases I know little about, I interpret negative scores for an emotion as "unclassifiable" and not as categories in their own right. So for each emotion (for example anger) I compare the emotionally charged messages (anger > 0) with the uncharged ones (anger = 0), ignoring the negatively charged ones (anger < 0). I don't distinguish between degrees of anger: either you're angry (anger > 0) or not (anger = 0).
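As a sketch, this grouping step might look like the following. The keys `scores` and `alert_rank` are a hypothetical data layout; the real pipeline uses sentimentr in R:

```python
from collections import defaultdict
from statistics import mean

def compare_emotion(commits, emotion):
    """Split commits into charged (score > 0) and uncharged (score == 0)
    groups for one emotion, dropping negatively charged ones, and
    compare the average alert rank of the two groups."""
    groups = defaultdict(list)
    for c in commits:
        s = c["scores"].get(emotion, 0.0)
        if s > 0:
            groups["charged"].append(c["alert_rank"])
        elif s == 0:
            groups["uncharged"].append(c["alert_rank"])
        # s < 0: unclassifiable, ignored entirely
    return {g: mean(ranks) for g, ranks in groups.items()}
```

If the charged group's average alert rank is lower than the uncharged group's, that emotion is associated with worse commits.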
The NRC lexicon was obtained by crowdsourcing. However, there are some words whose technical meaning quite obviously diverges from the meaning the crowd had in mind. For example, "argument" is listed as connected to anger, and that makes sense if you think of two people yelling at each other. In the programming world, though, arguments are what you feed to a function, so the word should be quite neutral. I'm very reluctant to manually remove handpicked words, but I don't see any sensible alternative here. For each emotion, I've checked the very top of the list of most common words in the corpus and removed those that are obviously fishy. These are:
- From disgust: bug, default, tree
- From surprise: tree7, variable
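The cleanup itself is just a set difference per emotion. A sketch, with the removal lists mirroring the ones above (the lexicon's actual data structure differs):

```python
# Words whose technical meaning diverges from the crowdsourced one.
# These lists mirror the removals described in the text.
DOMAIN_NEUTRAL = {
    "disgust": {"bug", "default", "tree"},
    "surprise": {"tree", "variable"},
}

def clean_lexicon(lexicon):
    """Drop domain-neutral words from each emotion's word set.
    `lexicon` maps emotion -> set of words (a stand-in for the
    NRC lexicon's structure)."""
    return {emotion: words - DOMAIN_NEUTRAL.get(emotion, set())
            for emotion, words in lexicon.items()}
```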
Just to give you a flavor of this sentiment analysis, here are some sample commits analyzed by LGTM that have been classified as positive, joyful, trusting, anticipatory, surprised, negative, sad, disgusted, fearful or angry.
On average, out of every 4 messages, about 1 is labeled positive, 1 negative, and 2 neutral.
Sad means small, happy means huge
Commits have lots of interesting properties before I even get to their alerts. One of the most basic is size: How many lines of code does the commit consist of in total (either added or deleted)? It turns out that the emotions detected from the commit message predict this surprisingly well.
In the following plot, larger areas correspond to larger average churn ranks:
Also, it's not just that positive messages are typical of commits that change a lot of lines. They're also typical of commits that mostly add new lines instead of deleting old ones9:
These are pretty clear effects, and they're not very mysterious. In fact, there are three plausible mechanisms that could explain them:
- When you're feeling happy, you're less critical, so you're quickly adding new code. You're not so much focused on pruning away inferior code.
- When you get a lot of work done, especially if it consists mainly of new additions, you're happy and express that through your commit message. When you get only a little work done, and what you do is mostly focused on pruning inferior code away, that makes you a lot less happy.
- Even a completely objective message about adding many new features likely evokes more positive emotions than an objective message about tweaking a couple of old routines.
All three effects may well work together: When the coder is happy, the code flows. When the code flows, the coder is happy. When reporting flowing code, the description sounds happy.
Quantity or quality?
So, I know that happy people write more code in one commit, but do they write better code? The answer is no, as I can see by comparing average alert ranks:
- Negative sentiments, in particular sadness, are related to cleaner code. This fits in with a wider pattern that has been observed across many domains: Sadness or dysphoria is often observed to coincide with critical thinking and improved judgement.
- The one negative emotion that leads to more mistakes is anger, which is proverbially known for having a blinding effect.
- Fear is also often seen as blinding, but commits with messages classified as fearful are of above average quality. This might be because the emotion detection does not properly distinguish between fear and caution. It makes sense that cautious commits would be better commits: Prudence prevents problems.
- Surprise and anticipation are associated with bad commits. However, surprise is expressed quite rarely, and when it is, it’s often combined with anticipation. If surprise stands on its own, it’s not indicative of a bad commit. So the real culprit of the two is anticipation, possibly due to developers getting ahead of themselves rather than focusing on the task at hand. After all, as Camus said:
Real generosity towards the future lies in giving all to the present.
Statistical significance of observed effects
The absolute values of the average differences are quite small. As noted above, this is expected due to the granularity of the measurement process. But it also means that I need to take special care to ascertain whether these are real effects or whether I'm just overinterpreting random fluctuations. For this, I use p-values11 to check whether the effects are significantly distinct from random noise.
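As a sketch of the resampling idea, here is a simple permutation-style test. The actual analysis used a bootstrapped p-test with 50,000 simulations per p-value (see the footnotes), but the principle is the same: compare the observed difference in means against differences produced by chance alone:

```python
import random

def permutation_p_value(charged, uncharged, n_sim=10_000, seed=0):
    """Two-sided resampling p-value: how often does a random
    relabeling of the commits produce a difference in mean alert
    rank at least as large as the observed one?"""
    rng = random.Random(seed)
    observed = (sum(charged) / len(charged)
                - sum(uncharged) / len(uncharged))
    pooled = list(charged) + list(uncharged)
    extreme = 0
    for _ in range(n_sim):
        rng.shuffle(pooled)
        a, b = pooled[:len(charged)], pooled[len(charged):]
        diff = sum(a) / len(a) - sum(b) / len(b)
        if abs(diff) >= abs(observed):
            extreme += 1
    return extreme / n_sim
```

A small p-value means random relabelings almost never reproduce a gap as large as the observed one, so the effect is unlikely to be noise. Unlike a t-test, this makes no assumption of normality.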
Programmers aren't robots, and their emotions do matter. As Master Yoda (more or less) put it:
Fear leads to anger. Anger leads to bugs. Bugs... lead to suffering.
Don’t let your project team suffer. You can’t eliminate their anger (and you shouldn’t eliminate their joy). But you can get rid of the mistakes those emotions bring with them, and many others besides.
LGTM helps you with that. It automatically analyzes GitHub and Bitbucket projects so you see which issues in your codebase need fixing. This way, LGTM acts as a trusted code reviewer. It flags up potential problems before they are merged into your codebase.
This lets you enjoy the freely flowing code without fear that your happiness risks mistakes that will cause toil and anger down the line. Master Yoda’s vicious circle is avoided, and you become a true coding Jedi.
Title image credit: Daniel Huntley
Note: Post originally published on LGTM.com on 08/17/2018
It’s possible to be even stricter and compare a commit only to other commits with both a similar net number of lines and a similar churn. This is more sensitive to the amount of data. Nevertheless, I used it to double-check, and the effects are very similar to the ones where the alert rank is based only on the net number of lines. In particular, all the results mentioned in the text hold for both variants.↩
In the case of ties, more than 500 neighbors may be used. For example, there are 1592 Python commits which add exactly 50 more lines of code than they remove.↩
Rinker, T. W. (2017). sentimentr: Calculate Text Polarity Sentiment version 1.0.1. University at Buffalo. Buffalo, New York. http://github.com/trinker/sentimentr↩
Jockers, M. L. (2015). Extract Sentiment and Plot Arcs from Text. Nebraska Literary Lab. https://github.com/mjockers/syuzhet↩
Mohammad, S. and Turney, P. (2013). NRC Word-Emotion Association Lexicon. National Research Council Canada. http://saifmohammad.com/WebPages/NRC-Emotion-Lexicon.htm↩
These are the primary emotions according to the research of Plutchik. Irrespective of the merits of this claim to primacy, I think they make a decent set of standard emotions for analysis.↩
I have to admit that I found the range of emotions associated with trees to be quite... surprising. I had a quick google, and now I know about Christmas tree surprise and tree as a metaphor for anger in English poetry. I still don't get the disgust part though. Email albert at semmle.com if you can explain it to me.↩
In fact, many commits don't change any lines of code, but modify one or more non-coding files. They're not interesting for this investigation and so I've excluded them from the data set.↩
To see how much a commit focuses on adding to the codebase rather than pruning it, I use a similar method to the one for quality. Commit quality is the rank of net alerts (where a commit is only compared to other commits with similar net lines of code). Focus on addition is the rank of net lines of code (where a commit is only compared to other commits with similar total churn).↩
The distribution of alert ranks is so different from a normal distribution that the assumptions of a standard t-test do not hold. Nor can I use a Wilcoxon test, because while the alert rank has been crafted in such a way that net lines changed has been removed as a confounding factor for the mean, it still confounds the median. Thus, I use a bootstrapped p-test based on 50,000 simulations for each p-value. This is computationally more expensive, but it doesn't require any further assumptions.↩