This is part two of the intro to variant analysis blog series. Part one can be found here.
Variant analysis is the process of taking a known problem, or seed vulnerability,
and finding other instances (or "variants") of that problem in a codebase.
In this post, I'll show how you can use a seed vulnerability
to write and refine a QL query,
to do just that.
The class of vulnerability we'll be trying to
find is the potentially dangerous use of snprintf,
something that has been the source of a number of CVEs in popular projects,
including rsyslog (CVE-2018-1000140)
and icecast (CVE-2018-18820).
Before we get into writing our query,
let's meet the technology that makes this all possible.
Introducing QL
The QL language is a high-level, object-oriented logic language that underpins all of Semmle’s libraries and analyses. (You can learn lots more about QL by visiting Introduction to the QL language and About QL.) With QL, you can quickly perform variant analysis to find previously unknown security vulnerabilities.
Semmle QL ships with extensive libraries to perform control and data flow analysis, taint tracking, and explore known threat models without having to worry about low-level language concepts and compiler specifics. With QL, you can run out-of-the-box or custom queries on multiple codebases to get accurate and relevant security analyses, allowing you to focus on the most critical issues.
QL treats code as data, allowing you to write custom queries to explore your code and identify even the most complex semantic patterns. Each QL query represents a piece of security knowledge — codified, readable, and executable — ready to be applied to any number of projects. You can write and execute QL queries locally using QL plugins for your favorite IDE. You can also use the LGTM query console to write QL directly in your web browser and query your entire portfolio for security vulnerabilities.
How Semmle QL works
QL works by creating (or "extracting") a queryable database of your source code, then allowing you to run queries to explore your code, or find variants of known issues. For compiled languages, Semmle’s tools observe an ordinary build of the source code. Each time a compiler is invoked, the compiler call is "intercepted," and the extractor is invoked with the same arguments. This allows the extractor to see precisely the same source code that is compiled to build the program. The extractor gathers all relevant information about the source code (the file name, a representation of the AST, type information, information on the operation of the preprocessor, etc.) and stores it in a relational database. For interpreted languages, which have no compilers to intercept, the extractor gathers similar information by running directly on the source code.
Once the extraction finishes, all of the relevant information about the project is contained in a single snapshot database, which is then ready to query, possibly on a different machine. A copy of the source files, made at the time the database was created, is also included in the snapshot so that analysis results can be displayed at the correct location in the code.
Queries are written in the QL language and usually depend on one or more of the standard QL libraries (and of course you can write your own custom libraries). They are compiled into an efficiently executable format by the QL compiler and then run on a snapshot database by the QL evaluator, either on a remote worker machine or locally on a developer’s machine.
Query results can be interpreted and presented in a variety of ways, including displaying them in an IDE plugin such as QL for Eclipse, or in a web dashboard as on LGTM.
The seed vulnerability
It's well known that sprintf is unsafe,
since it provides no protection against buffer overflow.
It's not unusual
to see documentation that points users to snprintf as a safer version,
since it truncates the output if the buffer is too small.
However, snprintf has an unintuitive interface:
it always returns the number of bytes it would have written
to the buffer if the buffer's size were unlimited (not counting the terminating null byte).
A common error is for programmers to assume that snprintf always returns
the number of bytes written to the buffer.
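To make the difference concrete, here is a minimal C sketch. The helper name format_into_small_buffer and the 8-byte buffer are made up for illustration and do not come from any of the projects discussed here. It formats a string into a buffer that is too small and reports both what snprintf returned and what actually ended up in the buffer.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper for illustration only: formats `text` into a
 * deliberately small 8-byte buffer. Returns snprintf's return value and
 * stores the number of characters actually held by the buffer in *stored. */
static int format_into_small_buffer(const char *text, size_t *stored) {
    char buf[8];
    assert(text != NULL && stored != NULL);

    /* snprintf never writes more than sizeof(buf) bytes, terminator included. */
    int would_write = snprintf(buf, sizeof(buf), "%s", text);

    *stored = strlen(buf);  /* at most sizeof(buf) - 1 */
    return would_write;     /* length of the full, untruncated output */
}
```

Calling it with a 16-character string yields a return value of 16, while only 7 characters (plus the terminator) are stored. Code that adds the return value to an offset is therefore counting bytes that were never written.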
In Icecast,
the open source streaming media server,
this assumption led to a vulnerability that allowed attackers
to craft HTTP headers that overwrote the server's stack contents,
and allowed remote code execution.
We'll develop a query that finds these unsafe uses of snprintf.
Below is a slightly simplified version of the vulnerable code in Icecast
CVE-2018-18820.
This code is used in a loop to copy each of the HTTP headers (cur_header) from a
user request to a new buffer (post),
where it's constructing the body of a POST request
to send to another server.
post_offset is the variable that tracks where we need to continue writing from
for each iteration of the loop.
post_offset += snprintf(post + post_offset,
sizeof(post) - post_offset,
"%s",
cur_header);

Because the value of post_offset is not bounds-checked,
and snprintf returns the length of the data it would have written,
an attacker can send one long HTTP header that gets truncated
but whose length moves post_offset to almost any location on the stack they choose.
A second HTTP header then has its contents written to that location.
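The bookkeeping failure can be reproduced safely in a few lines of C. This is only a sketch: POST_SIZE, append_unchecked and next_size_argument are hypothetical names, the buffer is far smaller than the real one, and the second (out-of-bounds) write is never performed. Only the arguments it would receive are computed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

enum { POST_SIZE = 16 };  /* hypothetical size, chosen small for the demo */

/* One unchecked append in the style of the snippet above; returns the new
 * offset. Only call this while offset < POST_SIZE, so the write stays in
 * bounds for the purposes of this demonstration. */
static size_t append_unchecked(char *post, size_t offset, const char *header) {
    assert(post != NULL && header != NULL && offset < POST_SIZE);
    int n = snprintf(post + offset, POST_SIZE - offset, "%s", header);
    return offset + (size_t)n;  /* n is the would-be length, not bytes stored */
}

/* The size argument the NEXT call would receive. The subtraction is done in
 * size_t, so once offset exceeds POST_SIZE it wraps to a huge value instead
 * of going negative. */
static size_t next_size_argument(size_t offset) {
    return (size_t)POST_SIZE - offset;
}
```

After a single 39-character header, the offset is 39 even though the buffer holds 16 bytes, and the next size argument wraps around to a value close to SIZE_MAX. From that point on snprintf no longer truncates anything, and the destination pointer already lies outside the buffer.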
This case will act as our “patient zero” in this variant analysis exercise. We will use this known seed vulnerability to write a simple QL query to catch other variants in another codebase. The query can be run in the query console on LGTM, or in your IDE.
The target codebase
Now that we have a seed vulnerability,
we need to choose a codebase to run our variant analysis investigation on.
It's quite common to start with the same codebase
that the seed vulnerability was discovered in,
but for the purposes of this blog post
we'll run our queries on rsyslog instead,
and in particular rsyslog/librelp.
We now know that rsyslog/librelp had a variant
of this vulnerability that was fixed in commit
2cfe657;
so it will be useful to run our queries on snapshots before and after the fix,
so that we can confirm that we correctly catch the variant
and account for the fix.
So we'll run our queries on:
- The latest version of rsyslog
- The latest version of rsyslog/librelp
- Version 5b81b1f of rsyslog/librelp (before the fix for CVE-2018-1000140)
- Version 2cfe657 of rsyslog/librelp (after the fix for CVE-2018-1000140)
A simple query
We'll start out by writing a simple query to find all calls to snprintf.
A QL query consists of a select clause
that indicates what results should be returned.
Typically it also provides a from clause to declare some variables,
and a where clause to state conditions over those variables.
For more information on the structure of query files
(including links to useful topics in the
QL language handbook),
see Introduction to query files.
import cpp
from FunctionCall call
where call.getTarget().getName() = "snprintf"
select call, "potentially dangerous call to snprintf."

The first line of the query imports
the C/C++ standard QL library,
which defines concepts like FunctionCall.
The variables declared after from represent the set of values in the database,
according to the type of each of the variables.
For example, call has the type FunctionCall,
which means it represents the set of all function calls in the program.
We use the where clause to specify the condition
that we are only interested in rows where the name of the
call's target function is equal (not assigned!) to snprintf.
The getTarget().getName() operation is available for any FunctionCall.
Finally, we select call, returning every FunctionCall
where the name of the target is snprintf
and display a message to explain what the problem is.
One way to interpret this is that our query is performing a filtering operation:
examine every FunctionCall and only keep those for which some logical condition holds.
You can see the results of our simple query run on our chosen four projects in the LGTM query console.
Iterative query refinement
QL makes it very easy to experiment with analysis ideas.
A common workflow is to start with a simple query
(like our query to find calls to snprintf),
examine a few results,
refine the query based on any patterns that emerge, and repeat.
Our first query found 173 results.
Checking the seriousness of each of these results manually
would be a time-intensive and error-prone task.
Instead, we can refine our query based on the observation
that only calls to snprintf with %s in the format string are likely to be vulnerable.
This is because other format specifiers, like %d, can only change
the length of the output string by a few characters,
but %s can change it a lot.
A %s specifier is also much more likely to allow an attacker
to overwrite the stack or heap with arbitrary code.
import cpp
from FunctionCall call
where call.getTarget().getName() = "snprintf"
and call.getArgument(2).getValue().regexpMatch("(?s).*%s.*")
select call, "potentially dangerous snprintf."

This refined query only finds calls to snprintf that contain %s in their format strings.
Each time we refine our query, we remove potential false positives.
Our revised query now only has 103 results.
We are making progress, but we can do even better.
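The reasoning behind the %s filter can be checked directly in C. The sketch below uses two hypothetical helpers, formatted_len_int and formatted_len_str, which rely on the standard behaviour that snprintf with a NULL buffer and a size of 0 writes nothing and only reports the length the output would have.

```c
#include <assert.h>
#include <limits.h>
#include <stdio.h>

/* Length that "%d" would produce for `value`. With a NULL buffer and size 0,
 * snprintf writes nothing and just returns the would-be length. */
static int formatted_len_int(int value) {
    return snprintf(NULL, 0, "%d", value);
}

/* Length that "%s" would produce: simply the length of the input string,
 * which is whatever the caller (or an attacker) supplies. */
static int formatted_len_str(const char *value) {
    assert(value != NULL);
    return snprintf(NULL, 0, "%s", value);
}
```

On a platform with 32-bit int, formatted_len_int never returns more than 11 (the length of "-2147483648"), so the return value cannot push an offset far past the end of a sensibly sized buffer. formatted_len_str has no such bound, because its result grows with the input.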
Next we'll use taint-tracking (a form of
data-flow analysis) to look
for calls to snprintf whose return values flow back into their size arguments.
This should narrow down the results significantly.
import cpp
import semmle.code.cpp.dataflow.TaintTracking
from FunctionCall call
where call.getTarget().getName() = "snprintf"
and call.getArgument(2).getValue().regexpMatch("(?s).*%s.*")
and TaintTracking::localTaint(DataFlow::exprNode(call), DataFlow::exprNode(call.getArgument(1)))
select call, "potentially dangerous call to snprintf."

TaintTracking::localTaint(source, sink) holds when
there is a local taint-flow path (one within a single function) from the source node to the sink node.
In our query above, we are using DataFlow::exprNode(call) as the source,
which returns the node in the data-flow graph
corresponding to the call to snprintf.
For the sink, we are using the call's second argument (index 1, since argument indices start at 0),
which corresponds to the size parameter of snprintf.
If we explore
the results generated by this query,
we can see we're down to just one result in rsyslog,
and one result in the vulnerable version of librelp.
Manual review of
the rsyslog result
reveals that it's actually a false positive,
as rsyslog has implemented a guard check in the lines of code above our result:
if (offset + len + 1 >= option_str_len) {
break;
}
int bytes = snprintf((char*)option_str + offset,
(option_str_len - offset), "%s&", token);

We can further refine our query to exclude cases where a check like this is already in place.
import cpp
import semmle.code.cpp.dataflow.TaintTracking
import semmle.code.cpp.controlflow.Guards
from FunctionCall call
where call.getTarget().getName() = "snprintf"
and call.getArgument(2).getValue().regexpMatch("(?s).*%s.*")
and TaintTracking::localTaint(DataFlow::exprNode(call), DataFlow::exprNode(call.getArgument(1)))
// Exclude cases where it seems there is a check in place
and not exists(GuardCondition guard, Expr operand |
// Whether or not call is called is controlled by this guard
guard.controls(call.getBasicBlock(), _) and
// operand is one of the values compared in the guard
guard.(ComparisonOperation).getAnOperand() = operand and
// the operand is derived from the return value of the call to snprintf
TaintTracking::localTaint(DataFlow::exprNode(call), DataFlow::exprNode(operand))
)
select call

This refined query produces exactly one result,
the vulnerability
CVE-2018-1000140.
After this vulnerability was discovered by the Semmle team,
the lead developer of rsyslog fixed the bug,
removing dangerous calls to snprintf from their codebase.
You can read more about the discovery,
disclosure, and fix in the blog post about CVE-2018-1000140.
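For code that does need to keep appending with snprintf, the return value has to be checked before it is added to the offset. The helper below is a minimal sketch of that idea under our own assumptions. append_checked is a hypothetical name, and this is not the fix that was applied in rsyslog/librelp or Icecast.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper for illustration: appends `text` to `buf` (capacity
 * `size`) at *offset. Returns 0 on success and -1 on a formatting error or
 * if the output would not fit. *offset is advanced only on success. */
static int append_checked(char *buf, size_t size, size_t *offset, const char *text) {
    assert(buf != NULL && offset != NULL && text != NULL);

    if (*offset >= size) {
        return -1;  /* no room left, and size - *offset would wrap */
    }

    size_t remaining = size - *offset;
    int n = snprintf(buf + *offset, remaining, "%s", text);

    if (n < 0 || (size_t)n >= remaining) {
        buf[*offset] = '\0';  /* discard the partial, truncated write */
        return -1;
    }

    *offset += (size_t)n;  /* n bytes really were written */
    return 0;
}
```

Because the comparison in the guard involves a value derived from the snprintf call, code written this way has the shape that the guard-condition clause of the final query is intended to exclude.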
If you want to learn more about QL and how to begin writing your own queries, visit help.semmle.com, the community site, or the forum.



