How LGTM automatically builds your C/C++ projects

May 03, 2018

Imagine a tool that can take the URL of any GitHub repository and figure out the commands and system dependencies that are needed to build it. Such a tool would be useful for analyzing, running, or packaging code from a repository. For LGTM.com, we need it to analyze current and historical revisions of thousands of open-source projects. This blog post describes how I made the autobuild system that LGTM.com uses to build C/C++ projects on Ubuntu Linux.

Guessing the build command is the easy part of the problem. A shell script looks at which files exist in the source directory and tries to invoke common build tools that are known to work with those files. In many cases, that works out as autoreconf -i && ./configure && make, although CMake seems to be growing fast as a replacement for GNU Autoconf. Many of the projects that do not use one of the popular build systems have a script named ./build or ./build.sh, which we execute if other build systems cannot be detected.
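As a rough illustration, the detection logic could look like the Python sketch below. It is not LGTM's actual shell script: the function name, the order of precedence, and the CMake command line are assumptions made for the example. Only the Autoconf command and the ./build and ./build.sh fallback come from the description above.

```python
def guess_build_command(files):
    """Return a shell command for a source tree, or None if nothing matches.

    `files` is the collection of file names in the top-level source directory.
    The precedence order and the CMake invocation are illustrative assumptions.
    """
    files = set(files)
    if "CMakeLists.txt" in files:
        # Assumed out-of-tree CMake build; the post does not give this command.
        return "mkdir -p _build && cd _build && cmake .. && make"
    if "configure.ac" in files or "configure.in" in files:
        # The common Autoconf case named in the post.
        return "autoreconf -i && ./configure && make"
    if "configure" in files:
        return "./configure && make"
    if files & {"Makefile", "makefile", "GNUmakefile"}:
        return "make"
    # Fallback described in the post: a project-specific build script.
    for script in ("build", "build.sh"):
        if script in files:
            return "./" + script
    return None
```

For a tree containing configure.ac this returns the Autoconf command, and for a tree with only build.sh it returns ./build.sh.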

The more difficult part of the problem is installing the right packages on the system for the build to succeed. Unlike in many other languages, dependencies for C/C++ projects are rarely specified in any machine-readable format; instead, human-readable documentation may name certain packages that are needed on certain Linux distributions. This documentation is often out of date, or lists only a minimal set of packages rather than the full set that enables all the optional features of the project.

The approach I took to discovering dependencies is to run the build in an instrumented environment where packages are installed on the fly as the files they contain are needed.

https://xkcd.com/1367/

I've given the system that auto-installs dependencies the name deptrace. A central component of this is the deptrace server, which receives requests for file names on a Unix socket and attempts to install packages that provide those file names. It looks up each request in Contents-amd64.gz, a database published by Ubuntu and available for download, which maps file names to the names of installable packages.

For example, the deptrace server might get a request to provide the file /usr/include/event2/event.h. It responds by attempting to install the package libevent-dev, which contains that file according to the database, and then tells its client to retry accessing that file.
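The lookup can be sketched in Python as follows. This is a minimal sketch, not the deptrace server's code. It assumes the usual layout of a Contents index: one line per file, with a path that has no leading slash, then whitespace, then a comma-separated list of section/package entries. The function names are hypothetical.

```python
import gzip
from collections import defaultdict


def parse_contents(lines):
    """Build a mapping from absolute path to the packages that ship it."""
    index = defaultdict(list)
    for line in lines:
        # Split from the right, because file names may contain spaces.
        parts = line.rstrip("\n").rsplit(None, 1)
        if len(parts) != 2:
            continue  # blank or malformed line
        path, locations = parts
        for location in locations.split(","):
            # "libdevel/libevent-dev" -> "libevent-dev"
            package = location.rsplit("/", 1)[-1]
            index["/" + path.strip()].append(package)
    return index


def load_contents(gz_path):
    """Parse a downloaded Contents-amd64.gz file (path supplied by the caller)."""
    with gzip.open(gz_path, "rt", encoding="utf-8", errors="replace") as f:
        return parse_contents(f)
```

Given a line such as `usr/include/event2/event.h    libdevel/libevent-dev`, looking up /usr/include/event2/event.h in the resulting index yields libevent-dev, which the server would then pass to apt-get.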

The deptrace client is a shared library that LGTM injects into every process in the build using LD_PRELOAD. It intercepts calls to C library functions that access files, such as open, fopen, stat, and access. The intercepted version of each call first tries to perform the call as usual through the C library. If that fails with a "file not found" error, it connects to the deptrace server and asks it to provide the file. The client waits until the server responds, and then it retries the call.
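The real client is a C shared library, but the try, ask, retry pattern it follows can be shown in a few lines of Python. In this sketch, `request_file` is a hypothetical callable that stands in for the round trip to the deptrace server and returns only once the server has responded.

```python
def open_with_deptrace(path, request_file, mode="r"):
    """Open `path`, asking the server for it once if it does not exist."""
    try:
        # First attempt: behave exactly like the ordinary call.
        return open(path, mode)
    except FileNotFoundError:
        # Ask the server to provide the file; this blocks until it answers.
        request_file(path)
        # Retry once. If the file is still missing, the error reaches the
        # caller just as it would have without interception.
        return open(path, mode)
```

Calls that succeed on the first attempt never contact the server, so only missing files pay for the extra round trip.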

The diagram below shows the programs involved. It also shows that certain intercepted calls, namely compiler invocations, trigger the invocation of our extractor, which is our compiler-like program that turns source code into a database.

extractor <---- Deptrace client <----> Deptrace server ----> apt-get

All these processes run inside a Docker container that's destroyed after the build has finished. The file system in the container ignores the setuid bit, which should mitigate most risks of privilege escalation that come with installing arbitrary Ubuntu packages.

It turns out that a typical build system makes a staggering number of requests for files that do not exist in any package. One reason is that when GCC processes an #include directive, it attempts to open the named file in every directory on the include path until it finds one that is successful. For a medium-size project like curl, with 176 source files, the deptrace client makes 502,926 requests for 3,913 unique non-existent absolute paths. This adds an overhead: the build takes 75% longer when it runs under deptrace.
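Dividing the two figures gives an average of roughly 129 requests per unique missing path (502,926 / 3,913 ≈ 128.5), which shows how often the same probes are repeated. The Python sketch below imitates the include-path search to show where the failed probes come from. The function name and the example directories are assumptions for illustration, not GCC's actual search order.

```python
def failed_probes(header, include_dirs, existing_files):
    """Return the paths a compiler-like search tries before finding `header`.

    Each returned path is one failed open, and therefore one request that
    the deptrace client would forward to the server.
    """
    misses = []
    for directory in include_dirs:
        candidate = directory.rstrip("/") + "/" + header
        if candidate in existing_files:
            break  # found: the search stops here
        misses.append(candidate)
    return misses


# Illustrative include path; a real one is typically longer.
dirs = ["./include", "/usr/local/include", "/usr/include"]
print(failed_probes("stdio.h", dirs, {"/usr/include/stdio.h"}))
```

Even a header that exists costs two failed probes in this example, and the same probes recur for every source file that includes it.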

The initial implementation of deptrace was very simple: it just intercepted open calls to header files made by the compiler. That was enough to install dependencies for most programs that use Autoconf and GCC. Unfortunately, not every program works like GCC and just attempts to open the files it wants. For example, an important build tool like pkg-config instead lists all the files in its search directories and does an in-memory check of whether a file with the name of that dependency exists. This check fails because deptrace does not override the opendir call to present a fake directory listing. If it did that, then a command like ls -l /usr/lib/x86_64-linux-gnu/pkgconfig would trigger the installation of every package supported by pkg-config, and that would take too much time and disk space. Instead, I wrote wrapper scripts around the popular build tools like pkg-config, CMake, and qmake. Each works in a different way, but they all try to guess which files are wanted by the build tool and try to open those files before the build tool does.
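To make the wrapper idea concrete, here is a hedged Python sketch of what a pkg-config wrapper might do before handing over to the real tool. It is not the actual wrapper: the search directories, the function names, and the simplified argument parsing are all assumptions. In particular, it ignores options that take a separate value.

```python
# Assumed search directories; a real wrapper would ask pkg-config for them.
PC_DIRS = (
    "/usr/lib/x86_64-linux-gnu/pkgconfig",
    "/usr/share/pkgconfig",
    "/usr/lib/pkgconfig",
)
OPERATORS = {"=", "<", ">", "<=", ">=", "!="}


def wanted_pc_files(argv, pc_dirs=PC_DIRS):
    """Guess which .pc files a pkg-config invocation will look for."""
    modules = []
    skip_next = False
    for arg in argv:
        # "glib-2.0 >= 2.40" may arrive as one argument or as three.
        for token in arg.replace(",", " ").split():
            if skip_next:
                skip_next = False  # this token is a version number
            elif token.startswith("-"):
                pass  # option such as --cflags
            elif token in OPERATORS:
                skip_next = True
            else:
                modules.append(token)
    return [f"{d}/{m}.pc" for m in modules for d in pc_dirs]


def touch_candidates(paths, opener=open):
    """Try to open each candidate so that a preloaded client sees the miss."""
    for path in paths:
        try:
            opener(path).close()
        except OSError:
            pass  # still missing, or unreadable; the real tool will report it
```

A wrapper built this way would call `touch_candidates(wanted_pc_files(sys.argv[1:]))` and then execute the real pkg-config with the same arguments.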

Even with those wrappers in place, the autobuild system was many small steps away from being able to work on arbitrary projects. Here are some of the other adjustments that I've made:

  • Some files are provided by multiple packages, so I updated the heuristics used when preprocessing the package database to decide which package should be installed when such a file is requested.
  • I had to blacklist some packages that contain overly generic file names such as /usr/include/util.h. Otherwise, when a program used #include <util.h> to include a file named util.h from its own repository, the Ubuntu file would be installed and included instead.
  • Paths have to be normalized, and symlinks have to be resolved. It's hard to resolve symlinks because we can't tell from the package database which files are symlinks. Some symlinks are created only by post-install scripts, and I've had to hard-code those in the important cases.
  • When a process checks for the existence of some directory that's used by multiple packages, none of which are installed, the deptrace client will return fake data for that directory so it appears to exist as long as the process does not enumerate its contents.
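Several of the adjustments in this list can be sketched together in Python. Everything here is illustrative: the function names, the `preferred` parameter, the alphabetical tie-break, and the shape of the index (a dict from absolute path to a list of package names) are assumptions. Only /usr/include/util.h is taken from the text above.

```python
import posixpath

# Overly generic paths that should never trigger an installation.
BLACKLISTED_PATHS = {"/usr/include/util.h"}


def normalize(path, known_symlinks=None):
    """Collapse "..", "." and "//", then apply hard-coded symlink targets."""
    path = posixpath.normpath(path)
    for link, target in (known_symlinks or {}).items():
        if path == link or path.startswith(link + "/"):
            path = target + path[len(link):]
    return path


def package_for(path, index, preferred=()):
    """Pick one package to install for `path`, or None to install nothing."""
    path = normalize(path)
    if path in BLACKLISTED_PATHS:
        return None
    candidates = index.get(path, [])
    if not candidates:
        return None
    for package in preferred:
        if package in candidates:
            return package
    # Assumed tie-break: alphabetical, so that the choice is deterministic.
    return sorted(candidates)[0]


def is_packaged_directory(path, index):
    """True if some package ships a file below `path`.

    A client could use this to pretend that the directory exists.
    """
    prefix = normalize(path).rstrip("/") + "/"
    return any(p.startswith(prefix) for p in index)
```

With an index that contains /usr/include/event2/event.h, `is_packaged_directory("/usr/include/event2", index)` is true even when no package is installed, while a request for /usr/include/util.h resolves to no package at all.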

In some cases I just ran out of tricks. For example, a package like docbook-xsl has post-install scripts that modify configuration files in /etc, and some DocBook tools will only work properly if those modifications are present. The solution there is just to pre-install packages like docbook-xsl in the Docker image used for building. A related problem exists when packages put Autoconf macro files in /usr/share/aclocal. These files are not explicitly imported anywhere, so the build never requests them by name, but they may contain macros that are needed for Autoconf to work on certain projects.

My goal is that all these tweaks will eventually add up to an autobuild system that can build any project that is compatible with the latest Ubuntu version and uses standard build commands. Projects with exotic custom build commands will have to configure those manually.

Discussion of alternatives

Why not use ptrace instead of LD_PRELOAD, like strace does?

Using ptrace would let us hook in at the Linux system call level, which should be a smaller and more stable interface than the glibc library call level. Unfortunately, Docker disables ptrace inside containers by default because it increases the attack surface for malicious programs trying to break out of the container. Another reason I chose LD_PRELOAD is that we already use it for launching our extractor as mentioned above. This allowed me to reuse a lot of existing code.

Why incur the overhead of a separate server process and Unix socket communication?

A library injected with LD_PRELOAD should do an absolute minimum amount of work. It could be called from a signal handler, which means it may only call the small subset of library functions that are guaranteed to be async-signal-safe. That subset contains no functions for memory allocation.

Keeping the database and state in a separate process was the easiest way to make the LD_PRELOAD library very non-intrusive. It does not allocate memory, and it only keeps a file descriptor open for the socket for the duration of a single request to the server. I think there is room for further optimization, but I would ideally like the code to remain simple and instead introduce a cache to reuse a successful build environment across multiple builds.
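The exchange between client and server can be illustrated with a small Python example. The wire protocol shown here (one path per line, answered with "retry" or "missing") is invented for the sketch, as are the function names. The real client is written in C and, as described above, avoids memory allocation, which Python code cannot do.

```python
import os
import socket
import tempfile
import threading


def recv_line(sock):
    """Read bytes until a newline or end of stream, return the decoded line."""
    data = b""
    while not data.endswith(b"\n"):
        chunk = sock.recv(4096)
        if not chunk:
            break
        data += chunk
    return data.decode().rstrip("\n")


def serve_one_request(server_sock, provide):
    """Server side: accept one connection and answer one file request."""
    conn, _ = server_sock.accept()
    with conn:
        ok = provide(recv_line(conn))
        conn.sendall(b"retry\n" if ok else b"missing\n")


def request_file(socket_path, path):
    """Client side: the socket stays open only for this single request."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as s:
        s.connect(socket_path)
        s.sendall(path.encode() + b"\n")
        return recv_line(s) == "retry"


def demo():
    """Run one request against a server thread; `provide` fakes apt-get."""
    installed = []

    def provide(path):
        installed.append(path)
        return True

    with tempfile.TemporaryDirectory() as d:
        socket_path = os.path.join(d, "deptrace.sock")
        with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as server:
            server.bind(socket_path)
            server.listen(1)
            t = threading.Thread(target=serve_one_request, args=(server, provide))
            t.start()
            ok = request_file(socket_path, "/usr/include/event2/event.h")
            t.join()
    return ok, installed
```

All state, including the package database, lives on the server side, and the client holds nothing between requests.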

Another benefit of Unix sockets is that the server process can have higher privileges than the client, which can provide an extra layer of defence against processes breaking out of the Docker container.

Why not make a Docker image with every development package installed?

The main problem is that this would be very costly in terms of time and space. This high cost would not only be an issue in production, but it would also slow down development because changes to such a Docker image would take a long time to build and test. It would also require us to solve all file conflicts statically. With the chosen approach of running apt-get on demand, we have the option to install two packages that provide intersecting sets of files, as long as the first file requested is outside that intersection.

Why not make a virtual file system where files are downloaded on demand?

This might be an elegant solution. It would still require a database of which files are provided by which packages, and this has the same problems with symlinks and post-install scripts as the current solution.

Another problem is that making such a virtual file system could be very difficult, especially when it's the root file system. There's probably a reason why Microsoft GVFS still doesn't support Linux even a year after they announced they were working on it.

Future work

As we add more projects to LGTM.com and see some of them failing, I will continue to improve deptrace. I'm considering whether I can build a better database of package contents by trying to install every package in Ubuntu as part of the preprocessing and recording which new files each one provides, including symlinks. Performance might also become an issue; that could be improved by caching the list of packages installed during a successful build, so that subsequent builds can run without deptrace.
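The caching idea could be as simple as turning the recorded package list into a single installation command that runs before the build starts. The Python sketch below shows this under the assumption that the cache is just a list of package names; the function name is hypothetical, and nothing like it exists in deptrace yet.

```python
import shlex


def install_command(cached_packages):
    """Return one apt-get command for all cached packages, or None if empty."""
    packages = sorted(set(cached_packages))  # de-duplicate, stable order
    if not packages:
        return None
    return "apt-get install -y " + " ".join(shlex.quote(p) for p in packages)


# Example package names, as a previous build might have recorded them.
print(install_command(["zlib1g-dev", "libevent-dev", "zlib1g-dev"]))
```

Installing everything up front in one apt-get run would let the build proceed without the per-file round trips to the server.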

I would welcome further ideas for improvements or comments about this article on https://discuss.lgtm.com.

Image credits

Cartoon: Randall Munroe

Note: Post originally published on LGTM.com on 05/03/2018
