This post is the eleventh in the series, "12 Days of HaXmas."

by Suchin Gururangan, Bob Rudis and the Rapid7 Data Science Team

Anomaly detection (i.e. identifying “badness”) and remediation is a hard and expensive process, fraught with false alarms and rabbit holes. The security community is keenly interested in developing and using data-driven tools to filter out noise and automatically detect malicious activity in large networks. While machine learning offers more flexibility than static, rule-based techniques, it is not a silver bullet. In this post, we will cover obstacles in applying machine learning to security and some ways to avoid them.

It's All About the Data

One core concept in machine learning is that the algorithms being used are only as good as the datasets they are trained on. What does this mean when applying machine learning techniques to cybersecurity?

This is a bit of an oversimplification, but we generally do one of two things with machine learning:

  1. Put a bunch of things together into unlabeled groups (unsupervised learning)
  2. Identify new things as being part of already known/labeled groups (classification)

Both actions are based on the features associated with each data element.

In security, we really want to be able to identify (or classify) a “thing” as good or bad. To do that, the first thing we need is labeled data.

At its core, this classification process is two-fold: first, we train a model on known data, and then we test it on unknown samples. In particular, adaptable models require a continuous flow of labeled data to train with. Unfortunately, the creation of such labeled data is the most expensive and time-consuming part of the data science process. The data we have is usually messy, incomplete, and inconsistent. While there are many tools to experiment with different algorithms and their parameters, there are few tools to help one develop clean, comprehensive datasets. Often this means asking practitioners with deep domain expertise to help label existing data elements, which is a very expensive process. You can also try to purchase “good” data, but this can be hard to come by in the context of security (and may go stale very quickly). Finally, you can try to use a combination of unsupervised and supervised learning called, unsurprisingly, semi-supervised learning.
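The train-then-test loop described above can be sketched in a few lines. This is a toy illustration only, not a recommended detector: the data is synthetic, the two features (URL length and digit ratio) are hypothetical, and the nearest-centroid classifier is chosen purely because it fits in a short example.

```python
import random
from statistics import mean


def train_centroids(samples):
    """Compute one centroid (mean feature vector) per label from (features, label) pairs."""
    by_label = {}
    for features, label in samples:
        by_label.setdefault(label, []).append(features)
    return {
        label: tuple(mean(column) for column in zip(*rows))
        for label, rows in by_label.items()
    }


def classify(centroids, features):
    """Return the label whose centroid is closest in squared Euclidean distance."""
    def dist2(centroid):
        return sum((a - b) ** 2 for a, b in zip(centroid, features))
    return min(centroids, key=lambda label: dist2(centroids[label]))


def make_toy_data(n_per_class, seed=0):
    """Generate synthetic labeled samples; the feature distributions are invented."""
    rng = random.Random(seed)
    data = []
    for _ in range(n_per_class):
        # Hypothetical features: (url_length, digit_ratio)
        data.append(((rng.gauss(30, 5), rng.gauss(0.05, 0.02)), "benign"))
        data.append(((rng.gauss(80, 10), rng.gauss(0.30, 0.05)), "malicious"))
    rng.shuffle(data)
    return data


data = make_toy_data(200)
split = int(0.8 * len(data))          # 80% for training, 20% held out
train, test = data[:split], data[split:]

centroids = train_centroids(train)    # step 1: train on known (labeled) data
correct = sum(classify(centroids, f) == label for f, label in test)
accuracy = correct / len(test)        # step 2: test on samples the model has not seen
print(f"held-out accuracy: {accuracy:.2f}")
```

Note that both steps depend on labels: the training set needs them to build the model, and the held-out set needs them to score it, which is why labeled data is the bottleneck.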

Regardless of your approach, it's likely you'll spend a great deal of time, effort, and/or money in your quest for labeled data.

The Need for Unbiased Data

Bias in training data can hamper a model's ability to discern between output classes. In the security context, data bias can be interpreted in two ways. First, attack methodologies are becoming more dynamic than ever before. If a predictive model is trained on known patterns and vulnerabilities (e.g. using features from malware that is file-system resident), it may not detect an unprecedented attack that does not conform to those trends (e.g. it misses features from malware that is only memory resident).

Bias can sneak up on you, as well. You may think you can use the Alexa listings to, say, obtain a list of benign domains, but that assumption may turn out to be a bad idea since there is no guarantee that those sites are clean. Getting good ground truth in security is hard.

Data bias also comes in the form of class representation. To understand class representation bias, one can look to a core foundation of statistics: Bayes' theorem.

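Bayes' theorem describes the probability of event A given event B:

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$

Expanding the probability P(B) for the set of two mutually exclusive outcomes, A and ¬A, we arrive at the following equation:

$$P(B) = P(B \mid A)\,P(A) + P(B \mid \neg A)\,P(\neg A)$$

Combining the above equations, we arrive at the following alternative statement of Bayes' theorem:

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B \mid A)\,P(A) + P(B \mid \neg A)\,P(\neg A)}$$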

What does this have to do with security? Let's apply this theorem to a concrete problem to show the emergent issues of training predictive models on biased data.

Suppose company X has 1,000 employees, and a security vendor has deployed an intrusion detection system (IDS) that alerts company X when it detects a malicious URL sent to an employee's inbox. Suppose there are 10 malicious URLs sent to employees of company X per day. Finally, suppose the IDS analyzes 10,000 incoming URLs to company X per day.

We'll use:

  • I to denote an incident (i.e. an incoming malicious URL)
  • ¬I to denote a non-incident (i.e. an incoming benign URL)
  • A to denote an alarm (i.e. the IDS classifies an incoming URL as malicious), and
  • ¬A to denote a non-alarm (i.e. the IDS classifies an incoming URL as benign).

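That means the four conditional probabilities that describe the IDS are:

  • P(A | I): the hit rate (an alarm is raised for a malicious URL)
  • P(A | ¬I): the false alarm rate (an alarm is raised for a benign URL)
  • P(¬A | I): the miss rate (no alarm is raised for a malicious URL)
  • P(¬A | ¬I): the true negative rate (no alarm is raised for a benign URL)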

What's the probability that an alarm is associated with a real incident? Or, how much can we trust the IDS under these conditions?

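Using Bayes' theorem from above, we know:

$$P(I \mid A) = \frac{P(A \mid I)\,P(I)}{P(A \mid I)\,P(I) + P(A \mid \neg I)\,P(\neg I)}$$

We don't have to use the shorthand version, though:

$$P(\text{incident} \mid \text{alarm}) = \frac{P(\text{alarm} \mid \text{incident})\,P(\text{incident})}{P(\text{alarm} \mid \text{incident})\,P(\text{incident}) + P(\text{alarm} \mid \text{non-incident})\,P(\text{non-incident})}$$

Now let's calculate the probability of an incident occurring (and not occurring), P(incident) and P(non-incident), given the parameters of the IDS problem we defined above:

$$P(\text{incident}) = \frac{10}{10{,}000} = 0.001, \qquad P(\text{non-incident}) = 1 - 0.001 = 0.999$$

These probabilities emphasize the bias present in the distribution of analyzed URLs. The IDS has little sense of what makes up an incident, as it is trained on very few examples of it. Plugging the probabilities into the equation above, we find that:

$$P(\text{incident} \mid \text{alarm}) = \frac{0.001 \cdot P(\text{alarm} \mid \text{incident})}{0.001 \cdot P(\text{alarm} \mid \text{incident}) + 0.999 \cdot P(\text{alarm} \mid \text{non-incident})}$$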

To have reasonable confidence in an IDS under these biased conditions, we need not only an unrealistically high hit rate but also an unrealistically low false positive rate. For 80 percent of alarms to correspond to real incidents, that is, P(incident | alarm) = 0.8, even with a best-case hit rate of 100 percent, the IDS's false alarm rate P(alarm | non-incident) can be no higher than about 2.5 × 10^-4 (0.001 × 0.2 / (0.8 × 0.999)). In other words, only about 2 to 3 of every 10,000 benign URLs may raise an alarm to achieve this level of trust.
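The arithmetic behind this kind of estimate is easy to check with a short script. The sketch below assumes the scenario's base rate of 10 malicious URLs out of 10,000 and a perfect hit rate. The function names are illustrative and not taken from any library.

```python
def bayesian_detection_rate(base_rate, hit_rate, false_alarm_rate):
    """P(incident | alarm), computed with the expanded form of Bayes' theorem."""
    true_alarms = hit_rate * base_rate                    # P(A | I) * P(I)
    false_alarms = false_alarm_rate * (1.0 - base_rate)   # P(A | not I) * P(not I)
    return true_alarms / (true_alarms + false_alarms)


def max_false_alarm_rate(base_rate, hit_rate, target):
    """Largest P(alarm | non-incident) for which P(incident | alarm) still reaches target.

    Obtained by solving the Bayes expression above for the false alarm rate.
    """
    return hit_rate * base_rate * (1.0 - target) / (target * (1.0 - base_rate))


# Scenario parameters: 10 malicious URLs among 10,000 analyzed per day.
base_rate = 10 / 10_000

# How quickly trust in an alarm erodes as the false alarm rate grows,
# even with a perfect hit rate.
for far in (1e-2, 1e-3, 1e-4):
    p = bayesian_detection_rate(base_rate, hit_rate=1.0, false_alarm_rate=far)
    print(f"false alarm rate {far:.0e} -> P(incident | alarm) = {p:.3f}")

# False alarm rate required for 80% of alarms to be real incidents.
needed = max_false_alarm_rate(base_rate, hit_rate=1.0, target=0.8)
print(f"false alarm rate needed for an 80% detection rate: {needed:.2e}")
```

Varying `base_rate` shows why the class imbalance matters: the rarer the incidents, the smaller the false alarm rate has to be before an alarm is worth acting on.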

Visualizing Accuracy

One way to actually “see” this is with a chart designed to visually depict the accuracy of our classifier (called a receiver operating characteristic—or, ROC—curve):

[Figure: ROC curve, from "Proper Use of ROC Curves in Intrusion/Anomaly Detection"]

As we train, test and use a model, we want the ratio of true positives to false positives to be better than chance and also accurate enough to make it worthwhile using (in whatever context that happens to be).
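To make the idea concrete, here is a minimal sketch of how the points on a ROC curve can be computed from a classifier's scores. The scores and labels are made up for illustration, and the helper names are hypothetical. In practice a library routine would normally be used instead.

```python
def roc_points(scores, labels):
    """Return (false_positive_rate, true_positive_rate) pairs, one per threshold.

    labels: 1 = malicious, 0 = benign.
    A sample is flagged as malicious when its score is >= the threshold.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing is flagged
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule (0.5 is chance, 1.0 is perfect)."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Invented classifier scores (higher = more suspicious) and ground-truth labels.
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20]
labels = [1, 1, 0, 1, 0, 0, 1, 0]

points = roc_points(scores, labels)
for fpr, tpr in points:
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
print(f"AUC={auc(points):.2f}")
```

Each threshold trades hit rate against false alarm rate. A curve that hugs the diagonal is no better than guessing, while one that rises steeply near the left edge is what a detector needs when incidents are rare.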

In the real world, detection hit rates are much lower and false alarm rates are much higher than in this idealized example. Thus, class representation bias in the security context can make machine learning algorithms inaccurate and untrustworthy. When models are trained on only a few examples of one class but many examples of another, the bar for reasonable accuracy is extremely high, and in some cases unachievable. Predictive algorithms run the risk of being "the boy who cried wolf": annoying and prone to desensitizing security professionals to incident alerts [2]. The last thing you want to do is create a fancy new system that only exacerbates the problem that was identified at the core of the Target/Home Depot breaches.

Avoiding the Pitfalls

Security data scientists can avoid these obstacles with a few measures:

  1. Train models with large and balanced data that are representative of all output classes. Take balanced subsamples of your data if necessary and use available techniques to get an understanding of the efficacy of your data sets.
  2. Focus on getting a plethora of labeled data. Amazon's Mechanical Turk, which is used by many researchers outside of security, is one useful tool for this. Look at open-source data, and encourage data gathering expeditions.
  3. Encourage security expertise on the team. Domain expertise is crucial to the performance of machine learning algorithms applied in the security space. To keep up with the changing threat landscape, one must have security experience.
  4. Incorporate unsupervised methods into the solution of the data science problem. Focus on organization, presentation, visualization, and filtering of data, not just prediction. Check out this handy tutorial on self-taught learning by Stanford.
  5. Weigh the tradeoff of accuracy (i.e. getting all the “guesses” right) vs. coverage. You can think of this in terms of a Bloom filter, which may report false positives but never false negatives. In the case of search, it's more important that all the matching elements are returned even if that means some incorrect elements are returned. Depending on the application of your classification algorithm, you may be able to make similar tradeoffs.
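As an illustration of the first measure in the list above, the following sketch takes a balanced subsample by randomly undersampling every class down to the size of the rarest one. It is one simple option among several. The `balanced_subsample` helper and the toy URL data are hypothetical.

```python
import random
from collections import Counter


def balanced_subsample(samples, seed=0):
    """Randomly undersample each class to the size of the rarest class.

    samples: list of (item, label) pairs.
    """
    rng = random.Random(seed)
    by_label = {}
    for sample in samples:
        by_label.setdefault(sample[1], []).append(sample)

    smallest = min(len(group) for group in by_label.values())

    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, smallest))  # sample without replacement
    rng.shuffle(balanced)
    return balanced


# Toy data mirroring the IDS scenario: 10 malicious URLs among 10,000.
urls = [(f"url-{i}", "malicious" if i < 10 else "benign") for i in range(10_000)]

subset = balanced_subsample(urls)
print(Counter(label for _, label in subset))
```

Undersampling discards most of the majority class, so it is worth keeping a held-out set with the original class proportions to measure how the model would behave on realistic traffic.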

Machine learning has the potential to revolutionize how we detect and respond to malicious activity in our networks. It can separate signal from noise to help incident responders focus on what's truly important and help administrators discover patterns in network activity never seen before. However, when applying these algorithms to security, we must be aware of the caveats of the approach so that we can overcome them.