Machine Learning in Cybersecurity – Demystifying Buzzwords & Getting to the Truth

PUBLISHED ON December 27, 2018
LAST UPDATED August 2, 2021

Earlier this month, I had the opportunity to discuss the role of machine learning in security with Dave Shackleford from SANS. It was a fun discussion, and if you have the time, I encourage you to check it out here.

One of the recurring themes throughout our discussion was the need to separate the marketing hype from the reality when it comes to the capabilities (and use) of data science, particularly in security. I think this is a really important topic and one I’d like to dig into more here. New analytical and detection models are absolutely changing the world of security. We are transitioning from a time of static signatures to more complex multi-dimensional detection models that can understand the behavior of an attacker. To say this is important is an understatement.

But on the other hand, I think many security vendors have gotten a bit drunk on AI buzzwords, and worse still, are treating their algorithms like magical black boxes. If you can’t see how the detection system works and vendors are playing fast and loose with terminology, how can you have confidence in your security? So with that in mind, I’d like to offer a take that tries to demystify some of the terminology and focus on the practical side of what matters for your security when it comes to data science and machine learning.

AI Disambiguation

Artificial Intelligence has become a strange term in society. You see it referred to in all sorts of marketing, including security marketing. But if you ask someone to explain what that AI means, you typically get vague answers. That is because there is a mismatch between the cultural and technical uses of the term Artificial Intelligence. Culturally, when we say AI, we often think of Skynet or Ex Machina, depending on which decade you get your science fiction from. This sort of AI is referred to as “General AI” and describes the ability of a machine to solve virtually any problem that a human could. This form of AI doesn’t exist today.

The AI that we have today is called “Narrow AI,” and it refers to teaching a machine to tackle a specific problem – playing chess, recognizing a voice, or detecting an application attack. This notion of AI has been around since the 1960s and is used to describe a host of analytical techniques, including machine learning.

And this is the disconnect. If you ask most people about AI and machine learning, they often think that AI is more sophisticated than machine learning. But in terms of the narrow AI that we have today, AI is actually the more generic term. And this is why I hate seeing AI being overused in marketing. It sounds cool, but it almost never means something concrete.

It’s All About Work Reduction

The reality is that data science and machine learning should be extremely tangible to your organization. Most enterprises are generating far more data than their analysts could ever analyze. By 2020, it’s estimated that 1.7MB of data will be created every second for every person on earth. That’s staggering. This trend is particularly true in security, where threat feeds, intelligence, endless reputation lists, IOCs, signatures, and more are constantly being updated. Additionally, threats are constantly adapting to avoid detection – moving IP addresses, repacking their payloads, obfuscating their attack code. The combined result is that security teams have more data than they can manually analyze, and adversaries are evolving too fast to keep up with. This is a real-world, practical case for new analytical models.

You’ll notice that I didn’t jump to just saying this is a use case for machine learning. Machine learning is a great tool, and it’s one we use. But it doesn’t have to be the only tool. Sometimes good old-fashioned statistical analysis can be very effective. K-means clustering is a great example. To me, k-means doesn’t really qualify as machine learning, but it can be incredibly useful for identifying groups within a large data set. That information can then be used to inform a risk engine of which group a session matches; in our case, that might be normal, abnormal, or malicious. The point isn’t to brag about k-means per se, but rather to remind ourselves not to get overly attached to certain terms. In the same way that the term AI can be oversold, sometimes “statistics” is undersold. If it solves a problem and helps us get answers out of our data and make better decisions, then we want to use it.
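To make that concrete, here’s a minimal sketch of the idea in Python. The session features, the toy data, and the three-cluster choice are all illustrative assumptions on my part – not our actual production model:

```python
# Minimal sketch: grouping web sessions with k-means.
# The features (requests/min, error rate, distinct paths) and the
# three-cluster assumption are illustrative, not a production model.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Toy per-session features: [requests per minute, 4xx/5xx error rate, distinct URL paths]
sessions = np.array([
    [12,  0.01,   8],   # typical browsing
    [15,  0.02,  10],
    [14,  0.00,   9],
    [300, 0.40,   2],   # hammering one endpoint -- scanner-like
    [280, 0.55,   3],
    [45,  0.10, 120],   # crawling many paths -- abnormal but maybe benign
])

# Scale features so raw request volume doesn't dominate the distance metric
scaled = StandardScaler().fit_transform(sessions)

# Ask for three groups; a risk engine could then map clusters to
# normal / abnormal / malicious after inspecting their centroids.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(scaled)
print(labels)
```

Even a simple grouping like this gives a risk engine something far richer than a single static signature to reason about.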

And this is ultimately where any use of data science in security reaches its moment of truth: is the technology able to make trustworthy decisions that reduce the workload on security staff? This need is particularly acute at the WAF, where for years organizations have relied on human effort to tune signatures and rules, only to still deal with false positives and false negatives.
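If you want to put numbers on that tuning problem, precision and recall are the classic yardsticks. A quick sketch, with made-up alert labels purely for illustration:

```python
# Minimal sketch: quantifying detection quality from labeled traffic.
# The alert and ground-truth values below are invented for illustration.
def detection_metrics(alerts, ground_truth):
    """alerts / ground_truth: parallel lists of booleans, one per request."""
    tp = sum(a and g for a, g in zip(alerts, ground_truth))        # true positives
    fp = sum(a and not g for a, g in zip(alerts, ground_truth))    # false positives
    fn = sum(g and not a for a, g in zip(alerts, ground_truth))    # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0  # how many alerts were real
    recall = tp / (tp + fn) if tp + fn else 0.0     # how many attacks we caught
    return precision, recall

alerts       = [True, True, False, False, True]
ground_truth = [True, False, False, True, True]
p, r = detection_metrics(alerts, ground_truth)
print(f"precision={p:.2f} recall={r:.2f}")
```

Every false positive in that tally is analyst time burned; every false negative is an attack that got through. That’s the workload any detection model has to actually reduce.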

We are solving this problem at ThreatX. We aren’t doing it with a magical black box algorithm. We use a variety of techniques, each suited to its particular job, that work together to produce a final answer. We use machine learning to train algorithms on massive amounts of attack data. We profile applications to understand their normal behavior and find deviations. We apply statistical analysis, active deception, threat intelligence, and other techniques as well. We use all of these components with a simple goal in mind: building a WAF that does its job and actually defends your organization. A WAF that reduces work for your security team so you can actually defend all of your applications and APIs. And that is something I think is both exciting and very practical.
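To illustrate (and only illustrate) what “working together for a final answer” can look like, here’s a hypothetical weighted-signal combiner. The signal names and weights are invented for this sketch and are not our actual engine:

```python
# Illustrative sketch only: combining independent detection signals into a
# single risk score. Signal names and weights are hypothetical.
SIGNAL_WEIGHTS = {
    "ml_attack_score":     0.35,  # trained classifier's confidence
    "behavior_deviation":  0.25,  # distance from the app's normal profile
    "stat_anomaly":        0.15,  # e.g., cluster membership / outlier score
    "deception_triggered": 0.15,  # interacted with a decoy endpoint
    "threat_intel_match":  0.10,  # known-bad IP or fingerprint
}

def risk_score(signals: dict) -> float:
    """Weighted combination of per-entity signals, each expected in [0, 1]."""
    return sum(SIGNAL_WEIGHTS[name] * signals.get(name, 0.0)
               for name in SIGNAL_WEIGHTS)

suspect = {"ml_attack_score": 0.9, "behavior_deviation": 0.7,
           "deception_triggered": 1.0}
print(f"risk: {risk_score(suspect):.2f}")  # block or challenge above a threshold
```

The point of a structure like this is transparency: each signal is inspectable on its own, so the final decision never has to be a black box.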
