How to Detect Botnet Traffic with Behavioral Analysis

PUBLISHED ON June 6, 2018
LAST UPDATED April 4, 2022

The following is a three-part series on bot detection and neutralization based on botnet behavioral analysis. The series begins by addressing commodity form/comment spam.

One of the unfortunate realities of running a site on the Internet is the amount of “background noise” — the automated, unsophisticated, poorly targeted attacks, which make up the bulk of malicious web traffic. For the sake of this series, we’re calling this ‘botnet’ traffic. 

Botnet traffic can be a nuisance, no doubt, but it isn't necessarily interesting or deserving of action until viewed in aggregate. The posts in this series describe methods for identifying botnet traffic, aggregating that data, and blocking offenders, each through a different case:
1. Detecting Bot Behavior – Form Spam
2. Bot Behavior – Distributed Attacks
3. Behavioral Analysis – Grouping Bot Actors

Detecting Bot Behavior – Form Spam 

The Behavior

We've written before about active interrogation as a method for distinguishing bots from human users, but sometimes telling the two apart requires far less precision and no active measures at all.

A great example of easily identified bot behavior is form spam or comment spam, where the botnet’s activities are pretty straightforward: identify a webform and `POST` data to it in hopes that the content will end up displayed somewhere on the unwitting website. This is very easy to automate with numerous tools — a simple piece of software and a botnet or a list of open proxies and you’re in business.
So what does a mass form spammer post? Links! Links inline, links in a designated ‘website’ field (how has this field ever been a good idea?), links instead of names, link after link from bot after bot. Links all posted with the intention of raising the profile of the spammer’s website via SEO or getting in front of a human to click.

Tracking Spammers 

Form spam looks very different from legitimate use of a form — a legitimate user will post once or twice, at a rate consistent with hand-typing messages, consistent with, you know, having other things to do, sleeping, etc. A spammer or commodity bot, on the other hand, just doesn't look right: POSTs are sustained, even if at a low rate, the content doesn't match the site it's posted to, and the posts overwhelmingly include links.
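
To make the contrast concrete, here's a minimal sketch of a posting-rate and link-density check. It is an illustration only, not the ThreatX implementation; the actor identifier, thresholds, and in-memory history store are all assumptions chosen for readability.

import re
import time
from collections import defaultdict

LINK_RE = re.compile(r'https?://|\[url=', re.IGNORECASE)

# hypothetical in-memory history of POST timestamps per actor
post_history = defaultdict(list)

def looks_like_form_spam(actor_id, body, now=None, window=3600,
                         max_posts_per_window=5, max_links=2):
    """Flag an actor whose POST rate or link density doesn't look hand-typed."""
    now = now if now is not None else time.time()

    # record this POST and drop history outside the sliding window
    post_history[actor_id].append(now)
    post_history[actor_id] = [t for t in post_history[actor_id] if now - t <= window]

    too_fast = len(post_history[actor_id]) > max_posts_per_window
    too_many_links = len(LINK_RE.findall(body)) > max_links
    return too_fast or too_many_links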

Some form spam examples:

comment=
331>Air+Max+90+Gialle +Make+existence+skills+an+element+of+your+home+schooling+encounter.
+Training+a+young+child+to+harmony+a+checkbook,+prepare+a+dish+or+shingle+a+roof+top+
has+many+worth.+Moreover,+various+subject+areas,+which+include+math+concepts,+looking+
at+and+technology+may+be+incorporated+into+these+ability+lessons.+It+is+a+smart+way+for
+a+child+to+have+genuine-world+encounter,+achieve+a+valuable+expertise+and+require+a+
hands-on+strategy+to+their+discovering+path. +http://www.some-more-spam[.]com/balenciaga
-envelope-clutch-with-strap-review-545.html +Engage+in+golf+using+a+buddy+rather+than+single+
if+you+really+want+to+boost+your+video+game.+Not+only+will+you+be+capable+of+talk+about+tips+
and+phrases+of+assistance+having+a+close+friend,+and+viceversa,+but+there’s+yet+another+tiny+
rivalry+there+that+will+draw+out+the+most+effective+within+you.
[.]it/nike-free-5.0-uomo-nere>Nike+Free+5.0+Uomo+Nere +http://wow-i-guess-youre-going-
for-3-spam-sites[.]com/635-converse-black-high-tops-mens.html
 
author=EugeneLieft&email=inbox458@a-mail-host[.]&comment=it’s+my+first+time+visiting+
your+site+and+I+am+very+fascinated.+Thanks+for+sharing+and+keep+up+;)+ [url=http://www.
spam-site[.]nl/2016/03/12/essay-writing-online-service-custom-paper-writing/]http://www.
spam-site[.]nl/2016/03/12/essay-writing-online-service-custom-paper-writing/[/url]
&recaptcha_challenge_field=it’s+my+first+time+visiting+your+site+and+I+am+very+
fascinated.+Thanks+for+sharing+and+keep+up+;)+ [url=http://www.spam-site[.]nl/2016/03/12/
essay-writing-online-service-custom-paper-writing/]http://www.spam-site[.].nl/2016/03/12/
essay-writing-online-service-custom-paper-writing/[/url] &recaptcha_response_field=
manual_challenge&submit=Post+Comment&comment_post_ID=841&comment_parent=0
 
ohid=709498&chkshowshipaddr=1&savestep=g1&nextstep=&offer_2=offer_2_182_US_IGZN&ship_
firstname=daytona&ship_lastname=180&ship_address=180&ship_address2=180&ship_city=New+
York&ship_country=US&ship_state=68&ship_zip=180&countryinput=180&giftemail=1&
giftemailaddress=11849@another-mail-host[.]com&giftdate=60165@another-mail-host[.]&
gifttext=[b]daytona[/b][b][url=http://
spam-site[.]com/cgi_bin/]daytona[/url][/b][b][url=http://spam-site[.]com/cgi_bin/]
rolex+oyster+perpetual+date+price+list[/url][/b]

spam-site[.]com/cgi_bin/”>daytona

com/cgi_bin/”>daytona

cgi_bin/”>rolex+oyster+perpetual+date+price+list
&giftfromname
=41053@a-mail-host[.]com&submit=Next+Recipient

These posts are easily recognized as spam by a human reader, and many popular blogging platforms can detect them and send them to a spam folder if configured to do so, but this typically relies on either IP reputation (error-prone), analysis of each and every post (inefficient), or a phrase denylist (inefficient AND error-prone).

Protecting Your Application 

For many applications, an IP reputation list or comment denylist is difficult to implement or simply not available; enter behavioral analysis. The ThreatX WAAP combines analysis of individual form `POSTs` with long-lived reputation for a given actor and analysis of all of your application's traffic to quickly identify bot behavior and block the spammer.
We automatically deploy signatures to detect common spam behavior, like including multiple embedded links (no matter what field they're in), posting at a sustained rate over time, and including content that doesn't match the corresponding page content (character sets, keyword density, and nonsense phrases). When an actor's behavior looks like spam over time, the actor is automatically blocked.
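
As a rough sketch of how such signals might be expressed (again, an illustration rather than the ThreatX signature set; the field layout, regex, and 5% overlap threshold are assumptions), consider scoring each POST on link count and topical mismatch with the page it targets:

import re

LINK_RE = re.compile(r'https?://|\[url=', re.IGNORECASE)

def spam_signals(form_fields, page_text):
    """Score a single form POST on the signals described above.

    form_fields: dict of field name -> submitted value (links may hide in any field)
    page_text:   visible text of the page being posted to
    """
    all_values = " ".join(str(v) for v in form_fields.values())

    # embedded links, no matter which field they appear in
    link_count = len(LINK_RE.findall(all_values))

    # crude topical-mismatch check: share of post words that also appear on the page
    post_words = set(re.findall(r"[a-z]{4,}", all_values.lower()))
    page_words = set(re.findall(r"[a-z]{4,}", page_text.lower()))
    overlap = len(post_words & page_words) / max(len(post_words), 1)

    return {"link_count": link_count, "content_mismatch": overlap < 0.05}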

Bot Behavior – Distributed Attacks

Distributed Attack Behavior

Botnets enable an attacker to avoid basic volumetric detection of their malicious traffic by spreading the same overall number of malicious requests across multiple bots under their control. To a web server with no capability to aggregate or group these bots (because each reports a unique source IP address), a distributed attack can look very similar to legitimate user traffic.

As a security analyst or site administrator, this simply means you can't ask, "Why is this IP address sending us 1000 times the requests of a regular user?" Nor can you ask, "This source IP address has failed login 100 times in the past few minutes; can we block it?" This leaves your web application open to several classes of distributed attacks, illustrated below:

Case #1: Distributed Password Guessing/Brute Force Attack

The following example traffic shows a number of HTTP requests to a WordPress login page. Note that each request comes from a different IP address but sends the same log=Administrator username field.

10.73.140.3 POST https://wordpress.threatxlabs.local/wp-login.php log=Administrator&pwd=Password1
10.73.140.4 POST https://wordpress.threatxlabs.local/wp-login.php log=Administrator&pwd=Abcdefg
10.73.140.5 POST https://wordpress.threatxlabs.local/wp-login.php log=Administrator&pwd=1qwerty
10.74.220.1 POST https://wordpress.threatxlabs.local/wp-login.php log=Administrator&pwd=@admin@
10.74.220.2 POST https://wordpress.threatxlabs.local/wp-login.php log=Administrator&pwd=Administrator99
Common components of a distributed password guessing or brute force attack like this include:
1. Individual requests don't show clear signs of malicious traffic without viewing them in aggregate (no `<script>` tags, no `../dir/ect/ory/../tra/ver/sal`, etc.).
2. No single source IP address is sending a high volume of traffic. Even in a relatively favorable distributed attack scenario, where each bot sends 2-3x more requests than a "normal" user, you quickly find that "normal" request volumes are difficult to make blocking decisions around.
3. We could always lock out the Administrator account after too many failed logins to mitigate the attack, but what operational impact might that have? A gentler alternative, sketched just after this list, is to aggregate failures by targeted account rather than by source IP.
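
The sketch below illustrates that idea; it is not a production detection, and the event format, window, and threshold are assumptions.

from collections import Counter, defaultdict

def flag_targeted_accounts(failed_logins, window=300, account_threshold=20):
    """failed_logins: list of (epoch_seconds, source_ip, username) tuples.

    Flags accounts that accumulate many failures in a short window even though
    each individual source IP stays below any reasonable per-IP threshold."""
    by_account = defaultdict(list)
    for ts, ip, user in failed_logins:
        by_account[user].append((ts, ip))

    flagged = {}
    for user, events in by_account.items():
        events.sort()
        start = 0
        for end in range(len(events)):
            # shrink the window until it spans at most `window` seconds
            while events[end][0] - events[start][0] > window:
                start += 1
            if end - start + 1 >= account_threshold:
                # per-IP counts inside the window are typically all small
                flagged[user] = Counter(ip for _, ip in events[start:end + 1])
                break
    return flagged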

Case #2: Distributed Parameter Fuzzing (local file inclusion)

Our second example demonstrates several bot actors attempting to use a debugging script to access sensitive system data using a potential local file inclusion vulnerability.

10.120.10.6 GET https://www.threatxlabs.local/path/to/debug/script.php?path=../../../../../../../etc/passwd
10.120.10.6 GET https://www.threatxlabs.local/path/to/debug/script.php?path=../../../../../../etc/passwd 
10.115.70.8 GET https://securecheckout.threatxlabs.local/path/to/debug/script.php?path=../../../../../../etc/passwd
10.115.70.8 GET https://securecheckout.threatxlabs.local/path/to/debug/script.php?path=’/system/.restricted’
10.115.70.4 GET https://securecheckout.threatxlabs.local/path/to/debug/script.php?path=/system/.restricted
10.115.70.4 GET https://securecheckout.threatxlabs.local/path/to/debug/script.php?path=../../../etc/passwd
1. This attack is typically easier to identify than one against /wp-login.php on a WordPress site or a similar login page that is intended to be publicly available; a debugging script is probably not meant to be accessed externally, if it's meant to be deployed in production at all.
2. The individual requests do look malicious in and of themselves (../etc/passwd).
3. We still don't have a great way to stop the attack from new bots, short of blocking all requests to the debugging script, and certainly not in an automated way. One small step, sketched below, is a per-request signature that flags traversal-style parameters regardless of which bot sends them.
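
A minimal per-request check along those lines might look like the following; the indicator list is assumed and deliberately simple, and real rules would be much broader.

import re
from urllib.parse import urlparse, parse_qsl

# deliberately simple traversal / LFI indicators for illustration
TRAVERSAL_RE = re.compile(r'(\.\./|%2e%2e%2f|/etc/passwd)', re.IGNORECASE)

def request_looks_like_lfi(url):
    """Return True if any query parameter value contains a traversal indicator."""
    query = urlparse(url).query
    return any(TRAVERSAL_RE.search(value) for _, value in parse_qsl(query))

# e.g. request_looks_like_lfi("https://www.threatxlabs.local/path/to/debug/script.php?path=../../../etc/passwd") -> True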

Gaining Visibility Into Malicious Traffic

Blocking individual bots or source IP addresses is a losing battle against a distributed attack. If the attack traffic has any particularly distinctive characteristics, it may make sense to block requests matching those (perhaps an unusual User-Agent HTTP header or request URL). Ultimately, however, we need a more robust approach.

So how can you detect and block a distributed attack? 

To detect and block these kinds of distributed attacks, we instead need to look for trends in the distribution of traffic among the pages of a given site. Simple traffic baselining tells us how much traffic a page typically receives, as well as how many requests a typical client sends to it. A basic heuristic for alerting, then, might be: "We usually receive 10 login requests a minute; suddenly we're seeing 1000."
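
A minimal sketch of that heuristic, assuming we've already collected per-path baselines and per-minute counts (both hypothetical structures here):

# hypothetical baseline: typical requests per minute for each path, learned from history
baseline_rpm = {"/wp-login.php": 10, "/checkout": 40}

def baseline_alerts(counts_this_minute, multiplier=10):
    """counts_this_minute: dict of path -> requests observed in the last minute.

    Alert when a path receives `multiplier` times its usual per-minute volume."""
    alerts = []
    for path, count in counts_this_minute.items():
        expected = baseline_rpm.get(path)
        if expected and count >= expected * multiplier:
            alerts.append((path, count, expected))
    return alerts

# e.g. baseline_alerts({"/wp-login.php": 1000}) -> [("/wp-login.php", 1000, 10)]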

ThreatX goes further than visibility into top URLs and average traffic patterns, and can automatically throttle, or “tarpit,” the rate of requests to a page. This, combined with our suspicious traffic interrogation capabilities, can detect and stop distributed attacks without adversely affecting legitimate user traffic.
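
Tarpitting itself can be approximated with something as simple as a per-path token bucket. The sketch below is a generic illustration of the idea, not a description of ThreatX's mechanism, and the rate and burst values are arbitrary.

import time

class Tarpit:
    """Per-path throttle: once a path exceeds its allowed rate, further requests
    are deprioritized or delayed rather than hard-blocked."""

    def __init__(self, rate_per_sec=5.0, burst=10.0):
        self.rate = rate_per_sec
        self.burst = burst
        self.tokens = {}     # path -> remaining tokens
        self.updated = {}    # path -> last refill time

    def allow(self, path, now=None):
        now = now if now is not None else time.time()
        elapsed = now - self.updated.get(path, now)
        tokens = min(self.burst, self.tokens.get(path, self.burst) + elapsed * self.rate)
        self.updated[path] = now
        if tokens >= 1.0:
            self.tokens[path] = tokens - 1.0
            return True      # serve normally
        self.tokens[path] = tokens
        return False         # candidate for delay, challenge, or interrogation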

Behavioral Analysis – Grouping Bot Actors

Tracking Behavior Across Multiple Actors

As we saw in the previous section, Bot Behavior – Distributed Attacks, bots can be used to evade basic volumetric attack detection. Because each bot may use a unique source IP address, aggregating the traffic for alerting, reporting, or blocking can be difficult.

We can try to work around this by baselining traffic patterns for a given URL within a web application, but developing rules around these traffic norms is not always sufficient to distinguish botnet traffic and take appropriate action for offenders.

To complement web application baselining we also need a method for grouping attackers by their behavior. Below is an example heuristic for clustering attacks based on common characteristics: the first step in using behavior to identify likely botnets.

Clustering Attacks – Analysis

For the purpose of this analysis, the overall profile for an attacker or "threat actor" is made up of all of their attacks, which are just a subset of HTTP requests (those we've already decided are malicious in some way).

If one attacker's attacks are similar enough to another threat actor's, we can group them and call the combined entity a 'botnet', ascribing control of each of the threat actors to a single entity or individual. In other words, they're bots, and each of their individual risks, behaviors, etc. makes up some component of the whole entity's profile.

Example attacks.json

The following records are a stripped-down and sanitized set of attack logs from the ThreatX WAF, containing only enough information for the clustering example in the next section. In practice, each of these attack logs contains over 15 fields on which we could potentially make clustering decisions.

[
  {"ip":"10.120.10.6","timestamp":"2018-06-25T13:09:03+00:00","hostname":"www.threatxlabs.local","path":"/path/to/debug/script.php"},
  {"ip":"10.120.10.6","timestamp":"2018-06-25T13:09:03+00:00","hostname":"www.threatxlabs.local","path":"/path/to/debug/script.php"},
  {"ip":"10.254.40.5","timestamp":"2018-06-25T13:13:51+00:00","hostname":"app-dev.threatxlabs.local","path":"/"},
  {"ip":"10.254.40.5","timestamp":"2018-06-25T13:17:33+00:00","hostname":"app-dev.threatxlabs.local","path":"/"},
  {"ip":"10.115.70.8","timestamp":"2018-06-25T13:23:54+00:00","hostname":"securecheckout.threatxlabs.local","path":"/path/to/debug/script.php"},
  {"ip":"10.115.70.8","timestamp":"2018-06-25T13:23:54+00:00","hostname":"securecheckout.threatxlabs.local","path":"/"},
  {"ip":"10.115.70.8","timestamp":"2018-06-25T13:23:54+00:00","hostname":"securecheckout.threatxlabs.local","path":"/path/to/debug/script.php"},
  {"ip":"10.115.70.4","timestamp":"2018-06-25T13:23:54+00:00","hostname":"securecheckout.threatxlabs.local","path":"/path/to/debug/script.php"},
  {"ip":"10.115.70.4","timestamp":"2018-06-25T13:23:55+00:00","hostname":"securecheckout.threatxlabs.local","path":"/path/to/debug/script.php"},
  {"ip":"10.73.140.3","timestamp":"2018-06-25T13:24:02+00:00","hostname":"wordpress.threatxlabs.local","path":"/wp-login.php"},
  {"ip":"10.73.140.4","timestamp":"2018-06-25T13:24:21+00:00","hostname":"wordpress.threatxlabs.local","path":"/wp-login.php"},
  {"ip":"10.73.140.5","timestamp":"2018-06-25T13:24:30+00:00","hostname":"wordpress.threatxlabs.local","path":"/wp-login.php"},
  {"ip":"10.73.150.8","timestamp":"2018-06-25T13:32:14+00:00","hostname":"payments.threatxlabs.local","path":"/includes"},
  {"ip":"10.73.150.8","timestamp":"2018-06-25T13:35:08+00:00","hostname":"payments.threatxlabs.local","path":"/wwwstats"},
  {"ip":"10.121.30.9","timestamp":"2018-06-25T13:36:07+00:00","hostname":"store.threatxlabs.local","path":"/reply"},
  {"ip":"10.73.150.8","timestamp":"2018-06-25T13:39:45+00:00","hostname":"payments.threatxlabs.local","path":"/redirect"},
  {"ip":"10.55.130.2","timestamp":"2018-06-25T13:40:11+00:00","hostname":"wordpress.threatxlabs.local","path":"/wp-content/plugins/lazy-content-slider/lzcs_admin.php"},
  {"ip":"10.55.130.2","timestamp":"2018-06-25T13:40:11+00:00","hostname":"wordpress.threatxlabs.local","path":"/wp-content/plugins/lazy-content-slider/lzcs_admin.php"},
  {"ip":"10.121.30.9","timestamp":"2018-06-25T13:43:55+00:00","hostname":"store.threatxlabs.local","path":"/wp-config.php"},
  {"ip":"10.212.40.7","timestamp":"2018-06-25T15:56:37+00:00","hostname":"securecheckout.threatxlabs.local","path":"/backup"},
  {"ip":"10.212.40.7","timestamp":"2018-06-25T15:56:51+00:00","hostname":"securecheckout.threatxlabs.local","path":"/htaccess"},
  {"ip":"10.74.220.1","timestamp":"2018-06-25T15:57:07+00:00","hostname":"wordpress.threatxlabs.local","path":"/wp-login.php"},
  {"ip":"10.74.220.2","timestamp":"2018-06-25T15:57:14+00:00","hostname":"wordpress.threatxlabs.local","path":"/wp-login.php"}
]

Clustering Attacks – Python Example

The example code below clusters the attacks.json data provided in the previous section. To cluster attacks we're using a very basic set of rules:

1. Start each attack with a likeness score of 0.0 against each existing cluster.

2. If the attack hostname (the HTTP `Host:` header) matches an existing cluster, add 0.25 to likeness.

3. If the attack path (the resource path portion of the request URL) matches an existing cluster, add 0.50 to likeness.

4. If the attack timestamp is within ~16 hours (60000s) of the last timestamp on an existing cluster, add 0.25 to likeness.

5. If the attack's total likeness is >= 0.75 for an existing cluster, add the attack to that cluster (and stop checking likeness against the rest of the clusters).

6. If the attack's total likeness is < 0.75 for every existing cluster, create a new cluster for the attack, using the attack's hostname, path, and timestamp as metadata for the new cluster.

Clustering based primarily on resource path gives us a pretty simple approximation of intent:

The actor intended to perform some kind of attack against the resource located at this path.

At this point, we don’t really know if the actor was successful, the type of attack (we’ve stripped those parts out), or who actually controls the actor, but adding the other scores and clustering based on total likeness allows us to at least state:

These actors intended to perform attacks against the same resource path, possibly around the same time, and possibly against the same host.

Which, though a big leap, is enough to say for the purpose of this example:

These actors could be coordinated, let’s consider them to be controlled by the same entity (botnet).

Example Code

#! /usr/bin/env python3
# cluster-example.py

import argparse, datetime, hashlib, json

# get attacks source file
parser = argparse.ArgumentParser()
parser.add_argument('-a', '--attack_file', required=True, help="attacks.json")
args = parser.parse_args()

# attempt to cluster the attack with others
def cluster(attack, clusters):

    # get an epoch timestamp we can work with easily
    last_timestamp = int(datetime.datetime.strptime(attack['timestamp'], '%Y-%m-%dT%H:%M:%S+00:00').strftime('%s'))

    for existing in clusters:

        likeness = 0.0

        # test attack likeness to existing cluster based on field matches
        if attack['hostname'] == existing['prototype_hostname']:
            likeness += 0.25
        if attack['path'] == existing['prototype_path']:
            likeness += 0.50
        if abs(last_timestamp - existing['last_timestamp']) < 60000:
            likeness += 0.25

        # if alike enough, add the attack to the cluster
        if likeness >= 0.75:
            existing['attacks'].append(attack)

            # update the cluster last_timestamp if newer
            if last_timestamp > existing['last_timestamp']:
                existing['last_timestamp'] = last_timestamp

            return clusters

    # if the attack didn't match a cluster, it is unique enough to prototype a new cluster
    new_cluster = {}
    new_cluster['attacks'] = [attack]

    # cluster metadata
    cid = "%s:%s:%s" % (attack['ip'], attack['hostname'], attack['path'])
    new_cluster['cluster_id'] = hashlib.sha1(cid.encode('utf-8')).hexdigest()
    new_cluster['last_timestamp'] = last_timestamp

    # prototype defines the cluster
    new_cluster['prototype_hostname'] = attack['hostname']
    new_cluster['prototype_path'] = attack['path']

    # add new cluster to list of clusters
    clusters.append(new_cluster)

    return clusters


def main():
    # init clusters data
    clusters = []

    # load the attack data
    with open(args.attack_file, 'r') as data:
        attack_data = json.load(data)

    # cluster the attacks
    for attack in attack_data:
        clusters = cluster(attack, clusters)

    # print the clusters
    print(json.dumps(clusters, indent=4))


if __name__ == '__main__':
    main()

Example Clustering Results

Below are selected results from our basic clustering algorithm. It successfully identified the behavior seen in Case #1 – Distributed Password Guessing (/wp-login.php) and Case #2 – Distributed Parameter Fuzzing (/path/to/debug/script.php) as related, even though each attack came from multiple source IP addresses and, in the case of the debug script, also targeted different hosts.

[
    {
        "cluster_id": "2d303dc16dc0a40c5c2b97d28a3c2d32cf881362",
        "last_timestamp": 1529963834,
        "prototype_path": "/wp-login.php",
        "prototype_hostname": "wordpress.threatxlabs.local",
        "attacks": [
            {"ip":"10.73.140.3","timestamp":"2018-06-25T13:24:02+00:00","path":"/wp-login.php","hostname":"wordpress.threatxlabs.local"},
            {"ip":"10.73.140.4","timestamp":"2018-06-25T13:24:21+00:00","path":"/wp-login.php","hostname":"wordpress.threatxlabs.local"},
            {"ip":"10.73.140.5","timestamp":"2018-06-25T13:24:30+00:00","path":"/wp-login.php","hostname":"wordpress.threatxlabs.local"},
            {"ip":"10.74.220.1","timestamp":"2018-06-25T15:57:07+00:00","path":"/wp-login.php","hostname":"wordpress.threatxlabs.local"},
            {"ip":"10.74.220.2","timestamp":"2018-06-25T15:57:14+00:00","path":"/wp-login.php","hostname":"wordpress.threatxlabs.local"}
        ]
    },
    {
        "cluster_id": "94d0b5a7b4f6ff580d7474f26eac0a6a3ecc9c10",
        "last_timestamp": 1529954635,
        "prototype_path": "/path/to/debug/script.php",
        "prototype_hostname": "www.threatxlabs.local",
        "attacks": [
            {"ip":"10.120.10.6","timestamp":"2018-06-25T13:09:03+00:00","path":"/path/to/debug/script.php","hostname":"www.threatxlabs.local"},
            {"ip":"10.120.10.6","timestamp":"2018-06-25T13:09:03+00:00","path":"/path/to/debug/script.php","hostname":"www.threatxlabs.local"},
            {"ip":"10.115.70.8","timestamp":"2018-06-25T13:23:54+00:00","path":"/path/to/debug/script.php","hostname":"securecheckout.threatxlabs.local"},
            {"ip":"10.115.70.8","timestamp":"2018-06-25T13:23:54+00:00","path":"/path/to/debug/script.php","hostname":"securecheckout.threatxlabs.local"},
            {"ip":"10.115.70.4","timestamp":"2018-06-25T13:23:54+00:00","path":"/path/to/debug/script.php","hostname":"securecheckout.threatxlabs.local"},
            {"ip":"10.115.70.4","timestamp":"2018-06-25T13:23:55+00:00","path":"/path/to/debug/script.php","hostname":"securecheckout.threatxlabs.local"}
        ]
    },
    {
        "cluster_id": "adc0522852e356492dbc3939658ebe3c6ad5db27",
        "last_timestamp": 1529955308,
        "prototype_path": "/wwwstats",
        "prototype_hostname": "payments.threatxlabs.local",
        "attacks": [
            {"ip":"10.73.150.8","timestamp":"2018-06-25T13:35:08+00:00","path":"/wwwstats","hostname":"payments.threatxlabs.local"}
        ]
    },
    {
        "cluster_id": "1f1d730e936e217d903c895f7f8ff5485b7725ef",
        "last_timestamp": 1529963811,
        "prototype_path": "/htaccess",
        "prototype_hostname": "securecheckout.threatxlabs.local",
        "attacks": [
            {"ip":"10.212.40.7","timestamp":"2018-06-25T15:56:51+00:00","path":"/htaccess","hostname":"securecheckout.threatxlabs.local"}
        ]
    }
]

We also see examples of one-off attacks from the example data set; these were sorted into clusters of their own.

Taking Action on the Entire Botnet

What does grouping attackers get us?

The clustering example included in this post is just the beginning of the techniques that can be applied to identify related attack traffic. By clustering attacks and grouping attackers, we gain the ability to make decisions based on the entire group's behavior. For botnet traffic, this means quickly detecting, tarpitting, and/or blocking malicious traffic from all identified members of the botnet, potentially before an individual member is able to fully participate in the attack.
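
For example, the output of cluster-example.py above could feed a group-level decision directly. The sketch below (the three-member minimum is an arbitrary assumption) collects the source IPs of every multi-member cluster so a block or tarpit action can be applied to the whole group at once:

import json

def botnet_blocklist(clusters, min_members=3):
    """Return the source IPs of every cluster with at least `min_members` distinct members."""
    blocklist = set()
    for c in clusters:
        ips = {attack["ip"] for attack in c["attacks"]}
        if len(ips) >= min_members:
            blocklist.update(ips)
    return sorted(blocklist)

# e.g. with the clustering results above (loaded into a hypothetical results_json string),
# botnet_blocklist(json.loads(results_json)) would return the five wp-login.php bot IPs
# and the three debug-script bot IPs as a single actionable list.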

ThreatX is continuously updating our capability to perform this kind of clustering based on attacker behavior. This, combined with active interrogation of suspicious actors and dynamic site profiling, ensures malicious bots are quickly identified and stopped.

About the Author

Will Woodson

Will's background is in security operations, working in the financial services sector and as a federal employee in engineering & analytical roles. He holds several industry certifications including a CISSP and is active in multiple information security community groups.