Code as Data: Advanced Techniques for Code Analysis

June 07, 2019


Twenty years ago I ventured outside the dreaming spires of Oxford and joined a project at Microsoft for an eight-month sabbatical. The project had a substantial codebase of over two million lines of C++, and for the first six weeks or so I was lost in that labyrinth. How can you make sense of so much complexity? Of course, there was nothing unusual about my situation: software systems are large and complex, and it can be tremendously hard to find things if you're not already intimately familiar with the codebase. The obvious solution is to treat code as data, and write queries to find what you want in the code.

In his famous article "An Axiomatic Basis for Computer Programming," Tony Hoare wrote:

Computer programming is an exact science in that all properties of the program and all consequences of executing it in any given environment can, in principle, be found out from the text of the program itself by purely deductive reasoning.

Tony was right: all the information you might want is in the source code—it just needs a little digging to get it out. As an example, let's say you want to find potential cross-site scripting attacks. Untrusted data comes in as an HTTP query parameter and is used, without being properly sanitized, to display a page of results. To create the analysis, you'd need to specify what sanitizing means, how the data flows, and where the untrusted data is used. It's possible, but it wouldn't be easy.
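To make that concrete, here is a sketch of what such a query could look like in QL. Note that `taintedSource`, `flowsTo`, and `sanitized` are hypothetical predicates standing in for a real dataflow library, shown only to illustrate the shape of the question:

```ql
import java

// Sketch only: taintedSource, flowsTo, and sanitized are hypothetical
// predicates, not part of any real library.
from Parameter p, Expr sink
where
  taintedSource(p) and   // p receives an HTTP query parameter
  flowsTo(p, sink) and   // the data reaches the sink...
  not sanitized(sink)    // ...without passing through a sanitizer
select sink, "Potential cross-site scripting from an unsanitized request parameter."
```

The point is that the query reads like the informal description above: a source, a flow, and the absence of sanitization.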

Of course there are many program analysis tools that solve particular instances of this problem, and they do it very well. However, the number of questions we want to ask of our code is unbounded. Ideally, we'd like the full flexibility of asking questions about code the way we interrogate any other database.

The idea of storing source code in a database and then analyzing it by writing queries is not new. Mark Linton pioneered the idea in the early '80s, and like all good ideas, it has resurfaced many times in academic research over the years. Unfortunately, the general conclusion was that it doesn't work in full generality. Some good results were obtained for specific applications that required only limited data, but storing all source code data (without prior knowledge of the type of queries) failed. Even today, the reaction of distinguished computer scientists, upon hearing what Semmle (the company that created LGTM) does, is:

It has been tried and it does not work.

Why did people believe it couldn't work?

The first reason is efficiency. The query I outlined above, for finding cross-site scripting vulnerabilities, requires complex recursion to compute the dataflow dependencies. Traditional databases are not good at executing recursive queries.
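Recursion, by contrast, is a first-class citizen in QL: a reachability relation of the kind dataflow analysis needs can be written directly as a recursive predicate, which the engine evaluates as a fixpoint. In this sketch, `step` is a hypothetical relation meaning "data moves from `a` to `b` in one step":

```ql
// Sketch: reaches is the transitive closure of the hypothetical
// one-step relation step, computed by the QL engine as a fixpoint.
predicate reaches(Expr a, Expr b) {
  step(a, b)
  or
  exists(Expr mid | step(a, mid) and reaches(mid, b))
}
```

In a traditional SQL database, expressing and efficiently evaluating this kind of unbounded recursion is exactly where things break down.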

The second reason is complexity: representing a Java program accurately in a relational database takes something like 77 tables: one for methods, one for expressions, one for classes, and so forth. That is a tough database to query, and you'd want many views to simplify the queries. But that number of views would kill performance. So there you have it: efficiency and complexity, and their interplay, prevented the vision of code as data from becoming reality.

How did we overcome these difficulties?

Semmle overcame these difficulties by applying cutting-edge programming language research to relational databases. We created an elegant query language named QL, using familiar object-oriented abstraction mechanisms, and designed an optimizer so that you can use those abstractions without paying a performance penalty. Fortunately, at the time, we didn't know the opinion of Jeff Ullman, one of the founding fathers of database research, who declared in 1991:

It is not possible for a query language to be seriously logical and seriously object-oriented at the same time.

Often it's better not to know that something is impossible before doing it!

Here are a few of the key ideas that make QL tick:

  • QL classes are logical properties, and inheritance is logical implication. So, for instance, you can define "pure" expressions (that definitely do not have side-effects) as a subclass of ordinary expressions.
  • Method dispatch is governed by that logical reading of classes, always choosing a most-specific implementation. Logically, that's just a big disjunction where each disjunct has a type test in it.
  • Type-specialization is an optimization that knocks out all the unnecessary disjuncts in a particular calling context—without that optimization, method dispatch in QL would be prohibitively expensive. This is a great example of an idea we found in programming language research, and applied in the database setting.
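As a sketch of the first idea, here is roughly what a class of "pure" expressions could look like. The characteristic predicate is the logical property that defines membership; `hasSideEffect` is a hypothetical predicate, not part of any real library:

```ql
// A QL class is a logical property: an expression belongs to
// PureExpr exactly when the characteristic predicate holds of it.
class PureExpr extends Expr {
  PureExpr() {
    // hypothetical predicate: holds if evaluating this
    // expression could have a side effect
    not hasSideEffect(this)
  }
}

// The class can then be used anywhere an Expr is expected,
// but restricted to the pure ones:
from PureExpr e
select e
```

Because `PureExpr extends Expr` is logical implication, every `PureExpr` is automatically an `Expr`, and any method overridden in `PureExpr` is dispatched to whenever the property holds.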

It took many years of R&D (back among the dreaming spires!) to make all this work. We worked on this technology with the help of a small number of strategic customers who helped us refine it and harden it for unsupervised use by tens of thousands of developers. LGTM is the culmination of that journey. Here, for the first time, truly code is data.

Examples of the kinds of things you can do with our query technology are:

  • Find a remote code execution vulnerability in Swagger caused by unsafe use of SnakeYaml.
  • Find JavaBeans that were declared "stateless" but are not.
  • Check the correct use of an API.
  • Save a space mission.

The possibilities are endless. Enjoy!

Note: This post was originally published on August 20, 2017.