Machine learning problems: How to interpret noisy classifier predictions

October 25, 2018


Semmle is a company that's known for its philosophy of treating code as data, so I was excited when I joined the data science team for an internship. My first day began with a short trip through the office, where I met my colleagues. Then the team presented several different data science project proposals from which I could pick one. Choosing just one was hard, as Semmle is full of interesting data science ideas, ranging from sophisticated machine learning to finding new insights about the open source community. I decided to work with Albert Ziegler on a problem arising from interpreting noisy predictions of classifiers.

Motivation

Most classical machine learning problems are easy. Not necessarily easy to solve, but easy to evaluate. Say you're training a classifier to distinguish cats from dogs. If you show it a cat picture, it had better say "cat", because "dog" is just wrong. If you show it a dog picture, "dog" is right and "cat" is wrong.

Semmle produces machine learning models which are at the cutting edge of software analytics. At the very edge, things are not so black and white, and you're often left to do your best with just the vaguest of signals. If your classifier is asked to judge a single blue eye, the best answer might be something along the lines of "30% dog, 70% cat". That's because blue eyes occur in both species, but more often in cats. The model has low confidence in its classification here, because eye color alone is such a weak indicator.

But such a probabilistic answer is rarely satisfying. Especially if the features you have access to are very coarse, it will often be depressingly close to "50:50". And if you consider a large population (for example, "How do <reddit.com>, <facebook.com> and <cuteoverload.com> differ in their cat-to-dog ratio?"), the single predictions will even out, so that the average for one site will be even closer to "50:50".

Nevertheless, you may discover that, for example, reddit gets "51:49" much more often than "49:51", while cuteoverload shows the opposite pattern. A plausible explanation is that the real ratios are actually further apart, but the classifier isn't confident enough to say so for any single picture.

Albert and I wanted to make classifiers much more certain of their predictions by placing them in a subpopulation context. We tried to find off-the-shelf approaches for this problem, but we came up short. There's a related problem that is very well researched: combining different experts' opinions about the same picture (see 1, 2, 3). If 10 experts look at a picture, 5 say "cat!", 2 say "dog!" and 3 go "¯\_(ツ)_/¯", there are a host of different methods that tell us that it's probably a cat. But instead, we want to combine one expert's opinions about different pictures. Unbeknown to the expert, these pictures are not independent (because they're from the same website). But the nature of that dependence is something we need to learn at the same time as performing the combination. For example, if all pictures from a website get the same prediction of "80% cat", then it's likely that the dependence is complete: all pictures are in fact cats, and 80% is just the maximum certainty the classifier can get from one single picture.

Formulation of the problem

Let's reformulate the problem using three random variables:

  • $X$: refers to the hidden state (for example: $X_1$, it's a cat; $X_2$, it's a dog)
  • $T$: refers to the test results (for example: $T_1$, the eye is blue; $T_2$, the eye is brown)
  • $S$: refers to the different populations (for example: $S_1$, reddit.com; $S_2$, facebook.com, ...)

A classifier provides a confusion matrix $P(X_k \cap T_i)$, and what we can observe are the test results. We will count occurrences of results $T_i$ and refer to them as $C(T_i|S_j)$. In the limit of a large number of tests, $C(T_i|S_j)$ should be proportional to $P(T_i|S_j)$.
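To make the notation concrete, here is a small hypothetical setup in Python/NumPy. All numbers are invented for illustration, and I work directly with the conditional form $P(T_i|X_k)$, which can be obtained from the joint confusion matrix by normalizing over test results for each hidden state:

```python
import numpy as np

# Hypothetical P(T_i | X_k): rows are test results (T_1 = blue eye, T_2 = brown eye),
# columns are hidden states (X_1 = cat, X_2 = dog). A weak signal: blue eyes
# are only somewhat more common among cats than among dogs.
P_T_given_X = np.array([
    [0.7, 0.3],   # P(blue eye | cat),  P(blue eye | dog)
    [0.3, 0.7],   # P(brown eye | cat), P(brown eye | dog)
])

# Hypothetical observed counts C(T_i | S_j): rows are test results,
# columns are populations (S_1 = reddit.com, S_2 = cuteoverload.com).
C_T_given_S = np.array([
    [510, 590],
    [490, 410],
])

# With many tests, the normalized counts approximate P(T_i | S_j).
P_T_given_S_approx = C_T_given_S / C_T_given_S.sum(axis=0, keepdims=True)
print(P_T_given_S_approx)
```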

In the example above, we consider dichotomous distributions (either dog or cat), but nothing prevents us from increasing the number of $X$ labels. A good example may be a machine learning model that takes an English sentence and returns its sentiment: a single label classifying it as negative, neutral or positive. Then $X$ is the true sentiment, $T$ the predicted sentiment, and the model performance can be evaluated as the matrix $P(X_k \cap T_i)$. By analyzing different authors $S$, we can discover the style of each author, that is, find the distributions $P(X|S_j)$. This style can be used in many ways: it can be a rough indicator of the authorship of texts of unknown origin, or it can be used to improve the classifier predictions for a given author. Knowing their style, we can interpret the predicted sentiment for a new sentence in the context of their overall style.

Assuming that $P(T_i \cap S_j | X_k) = P(T_i | X_k) \cdot P(S_j | X_k)$, we can come up with a relation:

$$P(T_i|S_j) = \sum_k P(T_i|X_k)\, P(X_k|S_j).$$
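This relation follows from the law of total probability together with the independence assumption. Spelling out the intermediate steps (my own addition, not in the original):

$$P(T_i|S_j) = \sum_k P(T_i \cap X_k|S_j) = \sum_k P(T_i|X_k \cap S_j)\, P(X_k|S_j) = \sum_k P(T_i|X_k)\, P(X_k|S_j),$$

where the last equality holds because the assumption gives $P(T_i|X_k \cap S_j) = P(T_i \cap S_j|X_k) / P(S_j|X_k) = P(T_i|X_k)$.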

Interpreting suitable terms as vectors and matrices, it can be written as:

$$P(T|S_j) = P(T|X)\, P(X|S_j).$$
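In code, under the same hypothetical numbers as before (repeated so the snippet stands alone), this is a single matrix-vector product:

```python
import numpy as np

# Hypothetical P(T | X): rows are test results, columns are hidden states.
P_T_given_X = np.array([[0.7, 0.3],
                        [0.3, 0.7]])

# A candidate "style" P(X | S_j) for one population, e.g. 80% cats on this site.
P_X_given_Sj = np.array([0.8, 0.2])

# P(T | S_j) = P(T | X) P(X | S_j)
P_T_given_Sj = P_T_given_X @ P_X_given_Sj
print(P_T_given_Sj)   # [0.62 0.38]
```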

There are a few subtleties that need to be resolved:

  • The equation may have a non-unique solution or no solution at all!
  • If the classifier is weak, the condition number of $P(T|X)$ is large (see the sketch after this list).
  • We don't know the vector $P(T|S_j)$, but only its rough approximation $C(T|S_j)$.
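The second point can be made concrete with NumPy's condition number; the two matrices below are made-up examples of a weak and a strong classifier:

```python
import numpy as np

# For a nearly uninformative classifier the columns of P(T | X) are almost
# identical, so recovering P(X | S_j) from P(T | S_j) amplifies any noise.
weak = np.array([[0.52, 0.48],
                 [0.48, 0.52]])
strong = np.array([[0.95, 0.05],
                   [0.05, 0.95]])

print(np.linalg.cond(weak))    # 25.0 -> small errors in P(T|S) blow up in P(X|S)
print(np.linalg.cond(strong))  # ~1.1 -> the relation is easy to invert
```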

Proposed solution

To address these issues, we formulated the above as an optimization problem: we would like to find $P(X|S_j)$ with two properties:

  • The $C(T|S_j)$ we observe should agree with $P(T|S_j) = P(T|X)\, P(X|S_j)$. A good measure of whether a given observation agrees with the candidate probability distribution may be the negative log-likelihood.
  • An average $P(X|S_j)$ should look like $P(X)$. A good measure of this may be the Kullback-Leibler divergence (KL divergence). Both loss terms are sketched in code after this list.
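A minimal NumPy sketch of the two loss terms; the function names and the small epsilon added for numerical stability are my own choices:

```python
import numpy as np

def neg_log_likelihood(counts, p_T_given_S, eps=1e-12):
    """Multinomial negative log-likelihood of the observed counts C(T|S_j)
    under the candidate distribution P(T|S_j) (constant terms dropped)."""
    return -np.sum(counts * np.log(p_T_given_S + eps))

def kl_divergence(p, q, eps=1e-12):
    """Kullback-Leibler divergence KL(p || q) between discrete distributions."""
    return np.sum(p * np.log((p + eps) / (q + eps)))
```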

The whole optimization procedure can be summarized as follows:

  1. Start with any matrix $P(X|S_j)$.
  2. Calculate $P(T|S_j)$ as the product $P(T|X)\, P(X|S_j)$.
  3. Calculate the negative log-likelihood for the event that the observed counts $C(T|S_j)$ were drawn from the candidate distribution $P(T|S_j)$.
  4. Add the regularization loss, that is, the KL divergence between $P(X)$ and $P(X|S_j)$. Note that the regularization weight may depend on the confusion matrix and the overall condition number.
  5. Try to minimize the total loss: do a gradient descent step for $P(X|S_j)$ and go back to 2 (a sketch of the full loop follows this list).
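Here is a minimal NumPy sketch of this loop for a single population $S_j$. It is not the production implementation: the softmax parametrization, the fixed learning rate and step count, the regularization weight, and all names are my own choices.

```python
import numpy as np

def estimate_style(P_T_given_X, counts, prior_X, reg_weight=1.0,
                   lr=0.05, n_steps=2000, eps=1e-12):
    """Estimate P(X|S_j) for a single population from its observed counts C(T|S_j).

    P_T_given_X : (n_tests, n_labels) matrix P(T|X), columns sum to 1
    counts      : (n_tests,) observed counts C(T|S_j)
    prior_X     : (n_labels,) overall distribution P(X), used for regularization
    """
    n_labels = P_T_given_X.shape[1]
    # 1. Start from an arbitrary candidate; parametrizing P(X|S_j) by logits z
    #    keeps it a valid probability distribution throughout.
    z = np.zeros(n_labels)

    for _ in range(n_steps):
        p_X = np.exp(z - z.max())
        p_X /= p_X.sum()                         # softmax -> candidate P(X|S_j)

        # 2. Candidate P(T|S_j) = P(T|X) P(X|S_j)
        p_T = P_T_given_X @ p_X

        # 3. + 4. Gradient (w.r.t. P(X|S_j)) of the total loss:
        #    negative log-likelihood  -sum_i C_i log p_T_i
        #    plus reg_weight * KL(P(X) || P(X|S_j))
        grad_p = -(P_T_given_X.T @ (counts / (p_T + eps)))   # d NLL / d p_X
        grad_p -= reg_weight * prior_X / (p_X + eps)         # d KL  / d p_X

        # 5. Chain rule through the softmax, then one gradient descent step.
        grad_z = p_X * (grad_p - grad_p @ p_X)
        z -= lr * grad_z

    p_X = np.exp(z - z.max())
    return p_X / p_X.sum()

# Toy usage with a hypothetical weak classifier: the naive read-off of the counts
# would be 54:46, but the corrected estimate lands much further from 50:50.
P_T_given_X = np.array([[0.52, 0.48],
                        [0.48, 0.52]])
counts = np.array([540.0, 460.0])   # observed test results for one population
prior = np.array([0.5, 0.5])        # overall P(X)
print(estimate_style(P_T_given_X, counts, prior))
```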

In the end, the probability distribution $P(X|S_j)$ is an estimate for the "style". It can be treated as a new feature for each population $S_j$ or used to improve individual classifier predictions. Let's test how well this works in practice.
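One natural way to use the style to sharpen an individual prediction is standard class-prior correction; whether this matches the production feature exactly is my assumption. The sketch below (with names of my own choosing) reweights a prediction made under the overall prior $P(X)$ by the population prior $P(X|S_j)$:

```python
import numpy as np

def reweight_prediction(p_X_given_item, overall_prior, style):
    """Adjust one prediction P(X | item), made under the overall prior P(X),
    to the population-specific prior P(X | S_j), i.e. the "style".
    Class-prior correction: P(X | item, S_j) is proportional to
    P(X | item) * style / overall_prior.
    """
    w = p_X_given_item * style / overall_prior
    return w / w.sum()

# Toy usage: the classifier says "70% cat" under a 50:50 overall prior,
# but the recovered style for this site is 90% cat.
print(reweight_prediction(np.array([0.7, 0.3]),
                          np.array([0.5, 0.5]),
                          np.array([0.9, 0.1])))   # -> about [0.95, 0.05]
```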

Validation

We created an artificial data set using a preferential attachment process, together with an example confusion matrix resembling that of a weak classifier. The first component $P(X_0|S_k)$ is shown in the plot:

[Plot: naive and corrected predictions of $P(X_0|S)$ plotted against the ground truth]

We try to retrieve different distributions over two labels: 0 and 1. The X axis represents the true $P(X_0|S)$, while the Y axis represents the prediction. Green points represent the "naive" approximation $P(X_0|S) = P(T_0|S)$, which interprets the test result as the truth. Purple points represent the "corrected" prediction as described above. The black dashed line is added for reference: ideally, each point should land on it. The other dashed lines are the regression lines for the purple and green points. For a small number of test samples, both methods can be very inaccurate if the classifier is weak. For a larger number of predictions, there is some correlation between the test predictions and the ground truth (the green points form a line), but the signal is very weak. With enough samples we can apply the algorithm to infer the context and accurately reconstruct the ground truth.

Summary

The method developed may improve the performance of existing machine learning models for which the following statements are true:

  • The confusion matrix is known.
  • They give predictions about a known population (for example, different websites with cat/dog pictures or different authors with commit snippets).
  • The number of observed samples is large enough to retrieve information about the $P(X|S)$ distribution.

I like the idea that the algorithm will become a part of an upcoming LGTM.com feature. I'm looking forward to seeing this algorithm running in the near future! Not only was the internship a great opportunity to work on a real data science problem, but it was also a very pleasant personal experience. During the whole internship I felt very welcome: I was invited to daily stand-ups, weekly data science discussions, and company dinners. It's worth mentioning that, although I always got help when I needed it, I was given a lot of trust and freedom. I was able to plan my own priorities and tasks, use any tools and methods I wanted, or temporarily move to minor projects. We celebrated the end of the internship with a delicious sushi dinner funded by Semmle.