Wouldn’t it be great to automatically measure the quality of source code? Developers could make better-informed decisions regarding what to focus on. Project maintainers could stay on top of problems. Users could assess which solution best meets their needs. The world would generally be a better-informed place!
Well, we can’t capture everything about quality but, perhaps for the first time, LGTM now makes it possible to get a good handle on it. In this blog post I want to talk about how we do this. If you're looking for a practical, high-level description of LGTM's code quality features, please have a look at the help topic on scoring and grading.
Let’s start by taking a step back and considering what it is that we’re trying to measure and why. Or in other words...
What do we mean by "code quality"?
A simple definition of what constitutes the quality of source code is hard to come by. We might think that the point of writing code is to build software that will perform some desired task and so good quality code, from a product perspective, corresponds to software that performs as expected, reliably and efficiently. But the input-output behavior of a program is just one aspect of quality. For instance, a program that meets its requirements and is free of run-time bugs may nonetheless be a hacky mess, with high technical debt, and therefore costly to maintain and build upon. From an engineering perspective, this code is poor quality.
Another difficulty is that, whatever collection of measurements relating to code quality is taken, it needs to be put in context. Otherwise we lack the necessary perspective to form an opinion. Indeed, if you look up “quality” in the Oxford English Dictionary, you’ll find the following definition:
quality (noun): The standard of something as measured against other things of a similar kind; the degree of excellence of something.
I like this definition since it makes it clear that a meaningful implementation of code quality should deal with the contextualizing, or standardizing, issue as well as the measuring one. Code quality on LGTM addresses both.
Of course, it would be foolish to pretend we have the one and only true answer to the question of how to assess the quality of source code (and I would strongly encourage you to be wary of anyone making such a claim). But we believe our approach provides a tool that takes into account the many facets of code quality and summarizes them in an easily interpretable way. This will benefit anyone dealing with source code and wanting to:
Judge. Say Jack thinks a particular codebase is of high quality, since that's the word on the street, but he isn't sure, especially as he isn't super knowledgeable in the programming language and doesn't have the time to investigate further or ask experts. Jack really doesn't want to rely solely on his gut, so ideally he'd like a quick way of confirming or disconfirming his opinion based on large quantities of empirical data.
Compare. While working on her project Jill realizes she needs to rely on an external library to complete some task. She has identified two candidates. Both seem to do what she needs but they are implemented in radically different ways and Jill is unsure about their relative merits. As part of her decision-making process, she’d like to know more about the relative code quality of her two options.
Monitor. Jill is moving forward with her project. She’s continuously implementing new features and her codebase is becoming larger and more complex. Other programmers have even started to contribute. She’d like to stay on top of things and ensure that the code quality of the project stays on track as it grows.
Improve. Having assessed the quality of the codebase he is interested in, Jack would now like to know what to do to improve it. Similarly, should the monitoring of her project reveal a drop in quality, it is very important for Jill to know exactly what went wrong and where in the codebase. They both need a measure of source code quality that is easily actionable.
LGTM code quality is designed to answer just these kinds of questions.
LGTM alerts as quality probes
When analyzing a given codebase, LGTM executes hundreds of QL queries, each of which identifies a potential problem in the source code. These range from simple mistakes, such as uninitialized variables, to complex errors, such as those found by taint analysis of the routes from data sources to sinks [1]. The magic of QL enables such in-depth investigations with high precision; in fact, QL is so powerful that it continually allows us to uncover new security vulnerabilities. The queries have been developed over many years by language experts and encode domain knowledge about code quality. Consequently, the LGTM alerts generated by these queries are indicative of a wide range of quality issues, including problems associated with technical debt and run-time bugs, as shown previously by my colleague Albert here, here and there.
Intuitively, the more LGTM alerts are found in a codebase, the lower the quality of the source code. However, as mentioned above, we need a way to standardize the results before we can form an opinion, i.e. control for the size of the codebase and account for the fact that not all alerts are equal.
Building a community-based standard
The idea is simple: for each language, we build models that help us predict the distribution of a type of alert as a function of project size. In practice we do this in two steps: first we predict a set of quantile values and then we use these to reconstruct the Cumulative Distribution Function (CDF).
Let’s go through an example to make it clearer. Say we’re interested in the diversity of warnings triggered by a Python project, i.e. the number of QL queries, classified as warnings, that generate alerts [3]. Let’s call this value Y. The raw data might look something like this (note the log scale):
This plot shows that larger projects tend to trigger a wider range of warnings. But it also shows that there is a large amount of variation between similarly sized projects, and that the extent of this variation changes significantly with LOC. Approaches based on point estimates of E[Y|LOC] and an error term wouldn’t help us capture the complexity of such heteroscedastic data [4]. But we can capture it by computing a set of rolling quantile values [5], say the 5th, 25th, 50th, 75th and 95th percentiles, and modeling each as a function of LOC.
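To make the rolling-quantile step concrete, here is a minimal sketch in plain Python. It is not LGTM's implementation: the function names, the `factor` parameter and the synthetic data are assumptions made for illustration. The window around each LOC value `c` spans `[c / factor, c * factor]`, so its length is proportional to `c`.

```python
import math
import random


def quantile(sorted_values, p):
    """Linearly interpolated p-quantile (0 <= p <= 1) of a sorted, non-empty list."""
    pos = p * (len(sorted_values) - 1)
    lo, hi = math.floor(pos), math.ceil(pos)
    frac = pos - lo
    return sorted_values[lo] * (1 - frac) + sorted_values[hi] * frac


def rolling_quantiles(projects, centers,
                      probs=(0.05, 0.25, 0.5, 0.75, 0.95), factor=2.0):
    """Compute quantiles of Y in a window around each LOC center.

    projects: iterable of (loc, y) pairs.
    factor:   hypothetical tuning knob; the window [c / factor, c * factor]
              has a length proportional to the LOC value c it is centered on.
    Returns a list of (center, [quantile values]) for non-empty windows.
    """
    rows = []
    for c in centers:
        window = sorted(y for loc, y in projects if c / factor <= loc <= c * factor)
        if window:
            rows.append((c, [quantile(window, p) for p in probs]))
    return rows


if __name__ == "__main__":
    # Synthetic, heteroscedastic data: the spread of Y grows with project size.
    random.seed(0)
    projects = []
    for _ in range(2000):
        loc = 10 ** random.uniform(2, 6)
        size = math.log10(loc)
        projects.append((loc, max(0, round(random.gauss(2 * size, 0.5 * size)))))
    for center, values in rolling_quantiles(projects, [1e3, 1e4, 1e5]):
        print(int(center), [round(v, 1) for v in values])
```

Each returned row is one point on each of the five quantile curves; a separate model per quantile can then be fitted to those points as a function of LOC.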
The predictions of these models for a given LOC value can then be fed into a piecewise linear model with an exponential asymptote in order to obtain an estimate of the CDF of Y for this particular LOC.
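One possible reading of "piecewise linear with an exponential asymptote" is sketched below. The details are assumptions, not taken from LGTM: the CDF is anchored at 0 for `y <= lower`, interpolated linearly between the predicted quantiles, and continued above the top quantile by an exponential that approaches 1 and matches the slope of the last linear segment.

```python
import math
from bisect import bisect_right


def make_cdf(probs, values, lower=0.0):
    """Build F(y) from quantile predictions, where values[i] is the probs[i]-quantile.

    Assumptions of this sketch: probs lie strictly between 0 and 1, values are
    strictly increasing and greater than `lower`, and F(lower) = 0.
    """
    if not all(0.0 < p < 1.0 for p in probs):
        raise ValueError("probabilities must lie strictly between 0 and 1")
    if values[0] <= lower or any(b <= a for a, b in zip(values, values[1:])):
        raise ValueError("quantile values must be strictly increasing and above `lower`")

    xs = [lower] + list(values)
    ps = [0.0] + list(probs)
    # Rate chosen so the exponential tail starts with the slope of the last segment.
    last_slope = (ps[-1] - ps[-2]) / (xs[-1] - xs[-2])
    rate = last_slope / (1.0 - ps[-1])

    def cdf(y):
        if y <= xs[0]:
            return 0.0
        if y >= xs[-1]:
            # Exponential asymptote: tends to 1 as y grows.
            return 1.0 - (1.0 - ps[-1]) * math.exp(-rate * (y - xs[-1]))
        i = bisect_right(xs, y) - 1  # segment with xs[i] <= y < xs[i + 1]
        frac = (y - xs[i]) / (xs[i + 1] - xs[i])
        return ps[i] + frac * (ps[i + 1] - ps[i])

    return cdf


if __name__ == "__main__":
    # Hypothetical quantile predictions for one LOC value.
    cdf = make_cdf((0.05, 0.25, 0.5, 0.75, 0.95), (1, 2, 4, 8, 16))
    for y in (0, 3, 4, 16, 20):
        print(y, round(cdf(y), 4))
```

The strict-increase check is where crossing quantile curves would surface: if two predicted quantiles were out of order, no valid CDF could be built from them.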
As you can see, the idea is pretty straightforward. There are, however, a few difficulties to address when implementing it. In particular, we must ensure that the modeled quantile curves do not cross; otherwise the predicted quantile values do not correspond to a valid distribution and the CDF is undefined. Furthermore, we constrain all quantile models to be monotonic with respect to LOC. Indeed, the data show that, unsurprisingly, the extent to which things can go wrong increases with project size. Enforcing monotonicity simply ensures this natural property of source code is built into our models and never contradicted for the sake of a marginally better local fit. Finally, we also impose a diverging condition on the gradients of the quantile curves at the edge of the training LOC support. This makes it easier to extrapolate the models and to provide sensible CDF estimates in high-LOC regions, for which data is very sparse.
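The two ordering constraints can be illustrated with a simple post-hoc repair on a grid of predictions. This is a swapped-in technique for illustration only: the post describes constraints built into the models themselves, whereas this sketch (with a hypothetical `enforce_constraints` helper) just fixes up predictions after the fact using running maxima, and it does not address the gradient condition at the edge of the LOC range.

```python
def enforce_constraints(preds):
    """Repair a grid of quantile predictions.

    preds[i][j] is the predicted j-th quantile at the i-th LOC grid point,
    with LOC values and quantile levels both sorted in ascending order.
    Returns a new grid that is non-decreasing along both axes.
    """
    fixed = [list(row) for row in preds]

    # Monotonic in LOC: each quantile curve never decreases as LOC grows.
    for j in range(len(fixed[0])):
        for i in range(1, len(fixed)):
            fixed[i][j] = max(fixed[i][j], fixed[i - 1][j])

    # Non-crossing: at each LOC value, higher quantile levels never fall
    # below lower ones. The maximum of non-decreasing curves is itself
    # non-decreasing, so this step preserves the monotonicity in LOC.
    for row in fixed:
        for j in range(1, len(row)):
            row[j] = max(row[j], row[j - 1])

    return fixed


if __name__ == "__main__":
    # Hypothetical raw predictions with one dip in LOC and one crossing.
    raw = [[1, 3, 5],
           [4, 2, 6]]
    print(enforce_constraints(raw))
```

Note that running maxima can make curves touch rather than stay strictly apart, so a repaired grid may still need tie-breaking before a CDF is reconstructed from it.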
By following this approach we effectively build quality standards for each type of alert. Given a project’s language and size, the models allow us to measure “how good” a value Y is by mapping it to a value S_Y = 1 - q_Y in the [0,1] interval (where q_Y is the quantile estimate of Y). For projects of impressively high quality w.r.t. the type of alerts considered, S_Y will be close to 1. Conversely, S_Y values close to 0 indicate projects that are significantly under-performing.
Giving more weight to more serious issues
How can we put these standards together to derive an overall estimate of quality? Intuitively what we’re looking for is the average performance of a project across the different types of alerts. To acknowledge that some types of alerts impact quality more than others, we can simply give them a stronger weight when calculating the average [6]. But this raises the question: how do we determine those weights?
This is a tricky one: different people tend to care about different things and there is no external “quality oracle” that could readily be used for calibration. Nevertheless I think it’s fair to say that some general truths must hold. For instance, alerts labeled as errors are indicative of worse quality issues than those marked as recommendations. We’ve therefore leveraged this knowledge to define some simple heuristics to calculate severity-based and precision-based weights from the data. These are used to attribute a weight to the different types of alerts—whereby each type is defined by aggregating queries with similar severity and precision. In consequence, more severe quality issues that LGTM is confident about have greater impact on the overall estimate of code quality.
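The weighted average itself is simple to write down. In the sketch below, the alert-type labels, the per-type scores and the weights are all made-up placeholders; the post does not publish the heuristics or the values LGTM actually uses.

```python
def quality_score(type_scores, weights):
    """Weighted average of per-type scores, each in [0, 1].

    type_scores: maps an alert type to the project's score for that type.
    weights:     maps an alert type to a positive weight.
    """
    total_weight = sum(weights[t] for t in type_scores)
    return sum(weights[t] * s for t, s in type_scores.items()) / total_weight


if __name__ == "__main__":
    # Hypothetical alert types, keyed by (severity, precision).
    scores = {
        ("error", "high"): 0.4,
        ("warning", "high"): 0.6,
        ("recommendation", "medium"): 0.8,
    }
    # Hypothetical weights: more severe, higher-confidence types count more.
    weights = {
        ("error", "high"): 3.0,
        ("warning", "high"): 2.0,
        ("recommendation", "medium"): 1.0,
    }
    print(round(quality_score(scores, weights), 3))
```

Because the inputs are already normalized scores rather than raw alert counts, a weight only says how much a deviation from expectations matters for that type, not how many alerts of one type equal one alert of another.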
LGTM code quality metrics
Now that we’ve discussed how to derive open-source-based standards for each language, let’s turn to how best to convey information about the quality of a particular project.
Code quality score
As we’ve seen, our models allow us to compute the average performance of a project w.r.t. different types of alerts as a number between 0 and 1. We refer to this value as the code quality score of the project. You can think of it as a way to answer the question “How well is this project performing compared to what can be expected?”. The code quality score controls for project size and its interpretation is straightforward: values close to 0.5 indicate overall quality is on par with expectations, while values close to 1 (resp. 0) indicate a project performing significantly above (resp. below) standards. As such, it enables comparison between projects and also provides a consistent measure of the quality performance of a project as it evolves and grows.
Code quality grade
But we can also use our models and data to answer the slightly different question of “How much better could this project be?”. Such information may seem similar to that provided by the code quality score but it’s not. So let me give you an example to illustrate the difference.
Consider a very small “Hello World” program (Project SMALL): our quality models wouldn’t expect such a trivial project to contain any alerts and, assuming that it is indeed the case, it would get a code quality score of about 0.5. Now let’s consider a very large and complex program (Project BIG) where the alerts roughly correspond to what the quality models would expect for a project of that size. Similarly to Project SMALL, the score of Project BIG would also be about 0.5. The quality score of both projects is similar because they are in line with expectations. However their room for improvement is obviously quite different! There are several quality issues to address in Project BIG and so it could be improved significantly. In contrast, Project SMALL is already the best it can be. Of course this is a rather simplistic example, but it demonstrates how additional information about the “improvability” of a project can give us a more comprehensive picture of the quality of its codebase.
So we compute a code quality grade for each project. We first estimate the quantile that the code quality score of a project corresponds to by comparing it to reference projects of similar size, and we then map this value to a letter between A+ and E using the table below:
Let’s go back to our simple example. The grade of hypothetical Project SMALL would be A+ since it is not possible for similarly trivial projects to get a better score. On the other hand, Project BIG would get a grade of B due to the existence of similarly large codebases with fewer and/or less serious quality issues and thus better scores.
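The grading step can be sketched as follows. The cut-offs in `HYPOTHETICAL_CUTOFFS` are placeholders invented for this example and are not LGTM's actual table; the intermediate letters between A+ and E are also assumed. `reference_scores` is assumed to already contain only the scores of reference projects of similar size.

```python
# Placeholder cut-offs, NOT LGTM's real table: (minimum quantile, letter).
HYPOTHETICAL_CUTOFFS = [
    (0.95, "A+"),
    (0.80, "A"),
    (0.60, "B"),
    (0.40, "C"),
    (0.20, "D"),
    (0.00, "E"),
]


def score_quantile(score, reference_scores):
    """Fraction of similarly sized reference projects scoring no higher than `score`."""
    return sum(1 for s in reference_scores if s <= score) / len(reference_scores)


def grade(score, reference_scores, cutoffs=HYPOTHETICAL_CUTOFFS):
    """Map a code quality score to a letter via its quantile among reference projects."""
    q = score_quantile(score, reference_scores)
    for minimum, letter in cutoffs:
        if q >= minimum:
            return letter
    return cutoffs[-1][1]


if __name__ == "__main__":
    # Project SMALL: every trivial reference project also scores about 0.5,
    # so none does better and the quantile is 1.
    print(grade(0.5, [0.5] * 10))
    # Project BIG: same score, but some similarly large projects score higher.
    print(grade(0.5, [0.2, 0.3, 0.4, 0.45, 0.5, 0.5, 0.55, 0.6, 0.7, 0.8]))
```

With these placeholder cut-offs, the same score of 0.5 yields different grades for the two projects, because the grade depends on how the reference projects of comparable size score rather than on the score alone.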
So although score and grade are related—the closer the score is to 1, the closer the grade is to A+—they measure two different things. The code quality score is essentially an absolute “wow factor” that conveys how impressive the quality of a codebase is. In contrast, the code quality grade is a relative judgement on performance, using a large open source reference, that indicates the extent to which the quality of a codebase could be better. Together, these two metrics provide a comprehensive handle on the quality of source code that can help answer the type of questions faced by developers, maintainers and users alike. And because both LGTM code quality grade and code quality score are tied to LGTM alerts, there’s a straightforward way to improve them. Just head to a project’s page to explore issues in the code that should be addressed. And you can ensure that your project's quality doesn’t drop as it continues to be developed by enabling automatic code review for pull requests.
LGTM code quality metrics are based on a rigorous, data-driven approach. This gives us confidence they’re a sound and meaningful measure of the quality of source code. But is there a way to validate our results? Well, in the absence of an external oracle for code quality there is no straightforward way to do so—and indeed, if such an oracle existed the whole endeavour described here would be pointless! However, some external project metrics can arguably be considered as weak proxies for code quality. And if we can show that these correlate with code quality as measured by LGTM it’ll give us even more confidence that our metrics are sensible.
GitHub stars indicate the popularity of a project. And it's reasonable to think that, on average, more popular projects will be of higher quality. We’d therefore expect some correlation between the code quality score of a project and its number of stars. The figure below shows this is the case as we find that the average project score does tend to increase with stars.
In case you’re wondering “Why not just use GitHub stars as a measure of quality then?”, let me stress again that the above is simply a sanity check: stars are not expected to be more than a weak proxy for quality. And the correlation only tells us something about the average quality of more popular projects. In contrast, LGTM code quality score and code quality grade provide you with the necessary information to answer quality-related questions about a particular codebase.
So what next? Judge and improve!
LGTM now makes it easy to judge, compare and monitor the code quality of different projects. The grade of each project is displayed under its name so you can get an idea of quality at a glance.
An interactive visualization allows you to investigate a project’s performance further, based on its score, and to conveniently explore how it relates to that of other projects. With this information, you can identify the best performers in your domain as well as which projects would benefit the most from focusing on quality.
But more importantly, LGTM doesn’t just judge, it also tells you how to get better! Alerts contain information on the issues to address so that anyone can contribute improvements. And project maintainers can enable automatic code review for their project's pull requests to prevent code quality from deteriorating. Badges are also available for easy display and monitoring of quality metrics.
So what are you waiting for? Now that we have a good handle on source code quality, let’s make code better!
Note: Post originally published on LGTM.com on August 10, 2018
[3] The metadata of each QL query contains information in a categorical field indicating the likely severity of the issue identified by the query. The possible values are
[4] This fancy term essentially means that the variance cannot be considered to be constant. Unfortunately, it makes modeling the data a bit more complicated; head over to Wikipedia if you want to find out why.
[5] To ensure the meaning of the window statistics remains the same, the length of the window is proportional to the LOC value it is centered on.
[6] These weights express how much importance to give to deviations from expectations for different types of alerts. They do not express some belief that “1 type A alert is worth N type B alerts”. Indeed, not only are the numbers already normalized by the standards, but this normalization is also LOC-dependent. It would thus be meaningless to interpret the weights as a general ratio between the number of alerts in different types.