Introduction to variant analysis with QL and LGTM (part 2)

May 16, 2019


This is part two of the intro to variant analysis blog series. Part one can be found here.

Variant analysis is the process of taking a known problem, or seed vulnerability, and finding other instances (or "variants") of that problem in a codebase. In this post, I'll show how you can use a seed vulnerability to write and refine a QL query that does just that. The class of vulnerability we'll be trying to find is potentially dangerous uses of snprintf, which have been the source of a number of CVEs in popular projects, including rsyslog (CVE-2018-1000140) and Icecast (CVE-2018-18820). Before we get into writing our query, let's meet the technology that makes this all possible.

Introducing QL

The QL language is a high-level, object-oriented logic language that underpins all of Semmle’s libraries and analyses. (You can learn lots more about QL by visiting Introduction to the QL language and About QL.) With QL, you can quickly perform variant analysis to find previously unknown security vulnerabilities.

Semmle QL ships with extensive libraries to perform control and data flow analysis, taint tracking, and explore known threat models without having to worry about low-level language concepts and compiler specifics. With QL, you can run out-of-the-box or custom queries on multiple codebases to get accurate and relevant security analyses, allowing you to focus on the most critical issues.

QL treats code as data, allowing you to write custom queries to explore your code and identify even the most complex semantic patterns. Each QL query represents a piece of security knowledge — codified, readable, and executable — ready to be applied to any number of projects. You can write and execute QL queries locally using QL plugins for your favorite IDE. You can also use the LGTM query console to write QL directly in your web browser and query your entire portfolio for security vulnerabilities.

How Semmle QL works

[Figure: analysis overview]

QL works by creating (or "extracting") a queryable database of your source code, then allowing you to run queries to explore your code, or find variants of known issues. For compiled languages, Semmle’s tools observe an ordinary build of the source code. Each time a compiler is invoked, the compiler call is "intercepted," and the extractor is invoked with the same arguments. This allows the extractor to see precisely the same source code that is compiled to build the program. The extractor gathers all relevant information about the source code (the file name, a representation of the AST, type information, information on the operation of the preprocessor, etc.) and stores it in a relational database. For interpreted languages, which have no compilers to intercept, the extractor gathers similar information by running directly on the source code.

Once the extraction finishes, all of the relevant information about the project is contained in a single snapshot database, which is then ready to query, possibly on a different machine. A copy of the source files, made at the time the database was created, is also included in the snapshot so that analysis results can be displayed at the correct location in the code.

Queries are written in the QL language and usually depend on one or more of the standard QL libraries (and of course you can write your own custom libraries). They are compiled into an efficiently executable format by the QL compiler and then run on a snapshot database by the QL evaluator, either on a remote worker machine or locally on a developer’s machine.

Query results can be interpreted and presented in a variety of ways, including displaying them in an IDE plugin such as QL for Eclipse, or in a web dashboard as on LGTM.

The seed vulnerability

It's well known that sprintf is unsafe, since it provides no protection against buffer overflow. It's not unusual to see documentation that points users to snprintf as a safer version, since it truncates the output if the buffer is too small. However, snprintf has an unintuitive interface: it always returns the number of bytes it would have written to the buffer (not counting the terminating null byte) if the buffer's size were unlimited. A common error is for programmers to assume that snprintf always returns the number of bytes actually written to the buffer. In Icecast, the open source streaming media server, this assumption led to a vulnerability that allowed attackers to craft HTTP headers that overwrote the server's stack contents, and allowed remote code execution. We'll develop a query that finds these unsafe uses of snprintf.
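To make the return-value behaviour concrete, here is a minimal C sketch. The names fmt_result and format_into are hypothetical helpers invented for this illustration; they are not taken from Icecast or any other project mentioned here. The helper records what snprintf returned alongside the number of characters that can actually have been stored.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical result type for this illustration. */
struct fmt_result {
    int returned;   /* value returned by snprintf */
    size_t stored;  /* characters actually stored, excluding the terminator */
};

/* Hypothetical helper: formats `text` into `buf` (capacity `cap` bytes)
   and reports both the snprintf return value and what really fits. */
struct fmt_result format_into(char *buf, size_t cap, const char *text)
{
    struct fmt_result r;
    r.returned = snprintf(buf, cap, "%s", text);

    if (r.returned < 0 || cap == 0)
        r.stored = 0;                   /* error, or nowhere to write */
    else if ((size_t)r.returned >= cap)
        r.stored = cap - 1;             /* output was truncated */
    else
        r.stored = (size_t)r.returned;  /* everything fit */
    return r;
}
```

Calling format_into with an 8-byte buffer and the 16-character string "0123456789ABCDEF" gives returned == 16 and stored == 7: the buffer holds "0123456" plus the terminator, yet the return value still reports the full length.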

Below is a slightly simplified version of the vulnerable code in Icecast CVE-2018-18820. This code is used in a loop to copy each of the HTTP headers (cur_header) from a user request to a new buffer (post), where it's constructing the body of a POST request to send to another server. post_offset is the variable that tracks where we need to continue writing from for each iteration of the loop.

post_offset += snprintf(post + post_offset,
                        sizeof(post) - post_offset,

Because the value of post_offset is not bounds-checked, and because snprintf returns the length of the data it would have written, an attacker can send one long HTTP header that gets truncated but whose length moves post_offset to a position of their choosing, beyond the end of the buffer and elsewhere on the stack. A second HTTP header is then written to that location.
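The offset arithmetic behind this can be sketched in isolation. The helpers below are hypothetical and are not Icecast's code; they assume the offset is held in a size_t and that a single "%s" is formatted. The first computes the offset that an unchecked += would produce, and the second computes the size that would be passed to the next snprintf call.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical helper: returns the offset an unchecked `offset += snprintf(...)`
   would produce. Only call it with offset <= cap. */
size_t next_offset_unchecked(char *buf, size_t cap, size_t offset,
                             const char *header)
{
    int n = snprintf(buf + offset, cap - offset, "%s", header);
    if (n < 0)
        return offset;           /* encoding error: leave the offset alone */
    return offset + (size_t)n;   /* may exceed cap if the output was truncated */
}

/* Hypothetical helper: the size argument the next call would compute.
   With unsigned arithmetic this wraps to a huge value once offset > cap. */
size_t remaining_unchecked(size_t cap, size_t offset)
{
    return cap - offset;
}
```

With a 16-byte buffer and a 40-character header, next_offset_unchecked returns 40. At that point remaining_unchecked(16, 40) wraps around to a very large number, so a following snprintf call would be given both an out-of-bounds destination and an effectively unlimited size.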

This case will act as our “patient zero” in this variant analysis exercise. We will use this known seed vulnerability to write a simple QL query to catch other variants in another codebase. The query can be run in the query console on LGTM, or in your IDE.

The target codebase

Now that we have a seed vulnerability, we need to choose a codebase to run our variant analysis investigation on. It's quite common to start with the same codebase that the seed vulnerability was discovered in, but for the purposes of this blog post we'll run our queries on rsyslog instead, and in particular rsyslog/librelp.

We now know that rsyslog/librelp had a variant of this vulnerability, which was fixed in commit 2cfe657. It will therefore be useful to run our queries on snapshots before and after the fix, so that we can confirm that we correctly catch the variant and account for the fix.

So we'll run our queries on:

  • The latest version of rsyslog
  • The latest version of rsyslog/librelp
  • Version 5b81b1f of rsyslog/librelp (before the fix for CVE-2018-1000140)
  • Version 2cfe657 of rsyslog/librelp (after the fix for CVE-2018-1000140)

A simple query

We'll start out by writing a simple query to find all calls to snprintf. A QL query consists of a select clause that indicates what results should be returned. Typically it also provides a from clause to declare some variables, and a where clause to state conditions over those variables. For more information on the structure of query files (including links to useful topics in the QL language handbook), see Introduction to query files.

import cpp

from FunctionCall call
where call.getTarget().getName() = "snprintf"
select call, "potentially dangerous call to snprintf."

The first line of the query imports the C/C++ standard QL library, which defines concepts like FunctionCall. The variables declared after from represent the set of values in the database, according to the type of each of the variables. For example, call has the type FunctionCall, which means it represents the set of all function calls in the program.

We use the where clause to specify that we are only interested in rows where the name of the call's target function is equal to snprintf (in QL, = tests equality; it is not an assignment). The getTarget().getName() operation is available for any FunctionCall.

Finally, we select call, returning every FunctionCall where the name of the target is snprintf and displaying a message to explain what the problem is. One way to interpret this is that our query is performing a filtering operation: examine every FunctionCall and only keep those for which some logical condition holds.

You can see the results of our simple query run on our chosen four projects in the LGTM query console.

Iterative query refinement

QL makes it very easy to experiment with analysis ideas. A common workflow is to start with a simple query (like our query to find calls to snprintf), examine a few results, refine the query based on any patterns that emerge, and repeat.

Our first query found 173 results. Checking the seriousness of each of these results manually would be a time-intensive and error-prone task. Instead, we can refine our query based on the observation that only calls to snprintf with %s in the format string are likely to be vulnerable. This is because other format specifiers, like %d, can only change the length of the output string by a few characters, but %s can change it a lot. A %s specifier is also much more likely to allow an attacker to overwrite the stack or heap with arbitrary code.

import cpp

from FunctionCall call
where call.getTarget().getName() = "snprintf"
  and call.getArgument(2).getValue().regexpMatch("(?s).*%s.*")
select call, "potentially dangerous snprintf."

This refined query only finds calls to snprintf that contain %s in their format strings. Each time we refine our query, we remove potential false positives. Our revised query now only has 103 results. We are making progress, but we can do even better.

Next we'll use taint-tracking (a form of data-flow analysis) to look for calls to snprintf whose return values flow back into their size arguments. This should narrow down the results significantly.

import cpp
import semmle.code.cpp.dataflow.TaintTracking

from FunctionCall call
where call.getTarget().getName() = "snprintf"
  and call.getArgument(2).getValue().regexpMatch("(?s).*%s.*")
  and TaintTracking::localTaint(DataFlow::exprNode(call), DataFlow::exprNode(call.getArgument(1)))
select call, "potentially dangerous call to snprintf."

TaintTracking::localTaint(source, sink) is true when there is a data-flow path from the source node to the sink node. In our query above, we are using DataFlow::exprNode(call) as the source, which returns the node in the data-flow graph corresponding to the call to snprintf. For the sink, we are using the call's argument at index 1 (arguments are numbered from 0), which corresponds to the size parameter of snprintf.

If we explore the results generated by this query, we can see we're down to just one result in rsyslog, and one result in the vulnerable version of librelp. Manual review of the rsyslog result reveals that it's actually a false positive, as rsyslog has implemented a guard check in the lines of code above our result:

if (offset + len + 1 >= option_str_len) {
int bytes = snprintf((char*)option_str + offset,
                (option_str_len - offset), "%s&", token);
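For comparison, here is a minimal sketch of one way to guard this kind of append. It is not rsyslog's implementation: append_token_checked is a hypothetical helper, and it places the check on the return value after the call rather than before it. Whether the refined query recognises this exact shape has not been checked here; the sketch only shows the kind of comparison such a guard involves.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical helper: appends `token` followed by '&' at `*offset` in a
   buffer of `cap` bytes. Returns 0 on success, or -1 if the output did not
   fit, in which case the buffer content and offset are left as they were. */
int append_token_checked(char *buf, size_t cap, size_t *offset,
                         const char *token)
{
    if (*offset >= cap)
        return -1;                     /* no room, not even for the terminator */

    size_t room = cap - *offset;
    int n = snprintf(buf + *offset, room, "%s&", token);

    if (n < 0 || (size_t)n >= room) {  /* encoding error or truncated output */
        buf[*offset] = '\0';           /* discard the partial write */
        return -1;
    }

    *offset += (size_t)n;              /* safe: n < room, so offset stays < cap */
    return 0;
}
```

Because the return value is compared against the remaining room before it is added to the offset, the offset can never move past the end of the buffer, and the subtraction cap - *offset cannot wrap around.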

We can further refine our query to exclude cases where a check like this is already in place.

import cpp
import semmle.code.cpp.dataflow.TaintTracking
import semmle.code.cpp.controlflow.Guards

from FunctionCall call
where call.getTarget().getName() = "snprintf"
  and call.getArgument(2).getValue().regexpMatch("(?s).*%s.*")
  and TaintTracking::localTaint(DataFlow::exprNode(call), DataFlow::exprNode(call.getArgument(1)))
  // Exclude cases where it seems there is a check in place
  and not exists(GuardCondition guard, Expr operand |
      // Whether or not call is executed is controlled by this guard
      guard.controls(call.getBasicBlock(), _) and
      // operand is one of the values compared in the guard
      guard.(ComparisonOperation).getAnOperand() = operand and
      // the operand is derived from the return value of the call to snprintf
      TaintTracking::localTaint(DataFlow::exprNode(call), DataFlow::exprNode(operand))
  )
select call

This refined query produces exactly one result, the vulnerability CVE-2018-1000140. After this vulnerability was discovered by the Semmle team, the lead developer of rsyslog fixed the bug, removing dangerous calls to snprintf from their codebase. You can read more about the discovery, disclosure, and fix in the blog post about CVE-2018-1000140.

If you want to learn more about QL and how to begin writing your own queries, visit the community site or the forum.