LAST UPDATED Apr 03, 2023
One of the main challenges that security operations centers (SOCs) and threat hunting teams run into is determining what is noise versus a targeted attack when looking at requests at scale (thousands, millions, or even billions!) in their logs. You can use automation to detect anomalous or malformed traffic, but accounting for every modification to headers and body content is an ordeal.
How do you separate good scanning from bad scanning amidst an overwhelming amount of scanner and bot traffic? In this post, we talk about all the different types of “noise” and discuss some ways to identify good vs. bad.
Types of Scanners
Today, there are bots and scanners that look for common endpoints (i.e., files, directories, API paths) across the IPv4 address space, and ongoing efforts from nation-state entities and cybersecurity companies are already moving into the IPv6 realm. These scanners can be categorized as follows:
Active host scanners: These rely only on Layer 3 (ICMP) and Layer 4 (TCP/UDP/QUIC) information to determine whether an IP address is attached to a machine or device that is up and reachable.
Endpoint and directory enumerators: These scanners are focused on taking common endpoints, URI paths, and filenames and determining if they exist on targeted IP addresses.
Vulnerability scanners: The most aggressive scanners, vulnerability scanners target hosts looking for common (or not!) vulnerabilities that can be exploited. These can range from injection of malformed SQL (e.g., sqlmap) to arbitrary file reads and remote command execution.
Types of Bots
Bots can vary from simple scripts that reuse common tools with default configurations to complex networks of compromised or borrowed IPs tailored to specific purposes (such as DDoS) and/or applications (e.g., targeting only WordPress sites or checking vulnerabilities only applicable to WordPress).
Learn more about how bots are used in attacks in The Role of Bots in API Attacks.
Data Analysis Pain Points
This volume of scanners and bots introduces several pain points for those trying to analyze the data:
- How do we classify and detect common scanners and tools?
- How do we determine if an attack or payload is targeting the application’s technology (e.g., nginx and Apache for web servers; WordPress and Rails for application frameworks)?
- How do we find out if the attack is tailored to the application or just looking for generic entries?
Identifying Attacks Among the Noise
We can try to answer some of the questions above by looking at parts of the request, identifying what information they provide, and figuring out how to use them in an investigation or hunting scenario.
The following HTTP traffic metadata can be beneficial in isolating the attacks within the noise:
User Agent: While attackers can easily rotate user agents, most “benign” and/or popular scanners and bots will customize their user agent to a specific string to provide some sort of identification to the hosts being scanned.
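Because well-known scanners tend to self-identify this way, a simple substring match against logged user agents can separate recognizable survey traffic from everything else. A minimal sketch follows; the signature list is illustrative, not a complete or authoritative inventory of scanner user agents.

```python
# Hypothetical, incomplete list of substrings seen in self-identifying
# scanner user agents; a real deployment would maintain a curated feed.
KNOWN_SCANNER_UAS = {
    "zgrab": "research scanner",
    "censys": "internet-wide survey",
    "nmap scripting engine": "active scan tool",
    "nuclei": "vulnerability scan tool",
}

def classify_user_agent(ua: str) -> str:
    """Return a coarse label for a logged User-Agent string."""
    ua_lower = ua.lower()
    for needle, label in KNOWN_SCANNER_UAS.items():
        if needle in ua_lower:
            return label
    return "unclassified"
```

Anything labeled "unclassified" is not necessarily malicious, but it no longer hides among the self-identifying scanner noise.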
TLS Fingerprint: TLS fingerprints, as provided by tools such as JA3/JA3S/JARM, can offer insight on what low-level libraries are being used on both client and server sides, supplying a more complete picture of the real actor behind the requests being made. Although rotating these fingerprints is possible, this information can still be used to detect less-sophisticated attacks.
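To make the fingerprinting concrete: JA3 reduces the fields of a TLS ClientHello to a single MD5 digest, so two clients built on the same TLS library produce the same hash regardless of what their user agent claims. A rough sketch of the computation, assuming the ClientHello fields have already been parsed into decimal values:

```python
import hashlib

def ja3_digest(tls_version, ciphers, extensions, curves, point_formats):
    """Sketch of the JA3 client fingerprint: join the parsed ClientHello
    fields with commas (lists joined with dashes) and MD5 the result.
    Assumes GREASE values have already been filtered out upstream."""
    fields = [
        str(tls_version),
        "-".join(map(str, ciphers)),
        "-".join(map(str, extensions)),
        "-".join(map(str, curves)),
        "-".join(map(str, point_formats)),
    ]
    return hashlib.md5(",".join(fields).encode()).hexdigest()
```

The resulting 32-character hash can be grouped and counted across logs: a single JA3 value behind thousands of rotating user agents is a strong hint that one tool is behind them all.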
Targeted URIs: Here is where knowledge of the application or host that the requests are trying to reach comes in handy. For example, a request tries to access .htpasswd or to log in against WordPress's wp-login endpoint, but the targeted application uses neither, say, an nginx server proxying traffic to a Python-based web server. This is not normal user behavior, so you can treat the IP that traffic is coming from with more suspicion.
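One way to operationalize this is an allowlist of path prefixes the application actually serves, plus a denylist of endpoints commonly probed by scanners. The prefixes and paths below are hypothetical placeholders for an nginx-fronted Python app; tune them to your own deployment.

```python
# Illustrative values only: replace with the routes your application serves.
EXPECTED_PREFIXES = ("/api/", "/static/", "/login")
# Endpoints frequently probed by scanners but absent from this stack.
SUSPICIOUS_PATHS = {"/.htpasswd", "/.env", "/wp-login.php", "/xmlrpc.php"}

def is_suspicious_uri(path: str) -> bool:
    """Flag requests for endpoints this application does not serve."""
    if path in SUSPICIOUS_PATHS:
        return True
    return not path.startswith(EXPECTED_PREFIXES)
```

A few hits against flagged paths from one source IP is routine internet background noise; the same IP then pivoting to valid endpoints is worth a closer look.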
Arguments and Body Content: Like URI, knowledge is key to detecting what should be allowed against the application you are trying to protect/investigate. This is one of the main pain points for automated traffic analysis, since the variability of endpoints and payloads that can be accepted by the application must be properly tracked and understood to provide better protection coverage.
Moreover, the type of arguments or data that can be sent to the application can provide insight on what normal traffic should look like, and reveal when an attack is trying to bypass current protections by manipulating those values.
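A simple form of this tracking is a per-endpoint schema of expected parameter shapes, flagging anything that deviates. The endpoint and patterns below are hypothetical; as the post notes, real coverage requires tracking every endpoint and payload shape the application actually accepts.

```python
import re

# Hypothetical per-endpoint expectations for query parameters.
PARAM_SCHEMA = {
    "/api/users": {
        "id": re.compile(r"\d+"),          # numeric identifier only
        "sort": re.compile(r"name|date"),  # fixed set of sort keys
    },
}

def anomalous_params(path: str, params: dict) -> list:
    """Return parameter names whose values fall outside the schema."""
    schema = PARAM_SCHEMA.get(path, {})
    flagged = []
    for key, value in params.items():
        rule = schema.get(key)
        # Unknown parameters and non-matching values are both anomalies.
        if rule is None or not rule.fullmatch(value):
            flagged.append(key)
    return flagged
```

An injection attempt such as `id=1 OR 1=1` fails the numeric pattern and is flagged, while well-formed requests pass silently, which is exactly the separation of noise from attack the analysis is after.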
Source IP: Basic checks on the source IP address for proxy or VPN capabilities (including Tor) can help determine the intent of a request when combined with other metadata, such as geolocation and whether the IP has previously generated anomalous traffic. Nowadays, this is becoming more difficult to use as a clear separator due to the ubiquity of hosting and cloud providers (the AWS EC2 platform, VPS services like DigitalOcean, etc.). The distribution of source IPs can also help us understand the speed and magnitude of an attack or scan.
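These IP-level signals can be combined into a small enrichment record per source address. A minimal sketch, assuming the network ranges and prior-offender set come from external feeds; the addresses used here are reserved documentation ranges (TEST-NET), standing in for real data.

```python
import ipaddress

# Placeholder feeds: in practice these come from cloud-provider range
# publications and your own history of anomalous traffic.
CLOUD_NETWORKS = [ipaddress.ip_network("203.0.113.0/24")]  # TEST-NET-3 example
PRIOR_OFFENDERS = {"198.51.100.7"}                         # TEST-NET-2 example

def enrich_ip(addr: str) -> dict:
    """Attach basic context signals to a source IP for later correlation."""
    ip = ipaddress.ip_address(addr)
    return {
        "cloud_hosted": any(ip in net for net in CLOUD_NETWORKS),
        "prior_anomalies": addr in PRIOR_OFFENDERS,
    }
```

Neither signal is conclusive on its own; their value comes from correlating them with the user agent, TLS fingerprint, and URI signals discussed above.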
While this is not an exhaustive list, it provides a baseline to start digging into the traffic and isolating the real attacks from the noise.
Get more of my analysis on bot-based attacks in the new ThreatX Labs research paper, Anatomy of a Targeted Credential Stuffing Attack.