sungjunleeee/dns-exfiltration

DNS Exfiltration Analysis

This document summarizes the analysis of the PCAP and CSV files in this project, with a focus on identifying potential DNS exfiltration techniques.

Repository Structure

This repository contains scripts that analyze the .pcap files and their companion .csv files (which contain extended fields) located in the /captures directory. Note that payloads in the PCAP files are truncated to protect the privacy of the collected data.

dns-exfiltration/
├── captures/
│   └── ... (PCAP and CSV files)
├── detect_extended_label.py
├── inspect_packet.py
├── advanced_analysis.py
└── README.md

Investigation of DNS Queries

To analyze the DNS queries for potential exfiltration, the advanced_analysis.py script is used. This script analyzes DNS CSV files for several indicators of DNS tunneling.

advanced_analysis.py

This script provides an automated way to scan for common signs of DNS exfiltration within the provided CSV files.

Features

  • High-Entropy Subdomain Detection: Calculates the Shannon entropy of the first subdomain in a query name. A high entropy score can indicate that the subdomain is not a human-readable name but rather encoded data, a common technique in DNS tunneling. The default threshold is 4.0.
  • Suspicious Query Type Identification: It tracks the usage of DNS query types like TXT (16), NULL (10), and CNAME (5), which can be used to carry data payloads in exfiltration attempts.
  • Long Query Name Detection: The script identifies and reports the longest query name found in each file, as unusually long queries can be used to exfiltrate larger chunks of data.
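The entropy check described above can be sketched as follows. This is a minimal illustration rather than the script's actual code: the function names are hypothetical, and only the 4.0 default threshold and the "first subdomain" rule come from the feature description.

```python
import math
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Return the Shannon entropy of text in bits per character."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def first_label(qname: str) -> str:
    """Return the leftmost label of a query name, ignoring a trailing dot."""
    return qname.rstrip(".").split(".")[0]


def is_high_entropy(qname: str, threshold: float = 4.0) -> bool:
    """Flag a query whose first label exceeds the entropy threshold."""
    return shannon_entropy(first_label(qname)) > threshold
```

Because the entropy of a string is bounded by log2 of the number of distinct characters it contains, a label needs more than 16 distinct characters to exceed the 4.0 threshold, so short labels can never be flagged by this check alone.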

Usage

To run the script, provide the path to a single CSV file or a directory containing multiple CSV files. You can also customize the entropy threshold.

python3 archive/advanced_analysis.py <path_to_csv_or_directory> [--entropy <threshold>]
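A command-line interface of this shape, together with the query-type and longest-name checks, could be sketched as below. This is an assumption-laden outline, not the script itself: the function names are hypothetical, dns_qry_type is an assumed column name (only dns_qry_name is confirmed by the CSV files), and rows are taken as dictionaries such as those produced by csv.DictReader.

```python
import argparse
from collections import Counter
from pathlib import Path

# Query types the README lists as worth tracking.
SUSPICIOUS_TYPES = {5: "CNAME", 10: "NULL", 16: "TXT"}


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that mirrors the usage line above."""
    parser = argparse.ArgumentParser(description="Scan DNS CSV exports for tunneling indicators.")
    parser.add_argument("path", type=Path, help="CSV file or directory of CSV files")
    parser.add_argument("--entropy", type=float, default=4.0, help="subdomain entropy threshold")
    return parser


def collect_csv_files(path: Path) -> list[Path]:
    """Return the CSV files to analyze: every *.csv in a directory, or the file itself."""
    if path.is_dir():
        return sorted(path.glob("*.csv"))
    return [path]


def summarize(rows) -> dict:
    """Count queries, track the longest name, and tally suspicious query types."""
    total = 0
    longest = ""
    type_counts = Counter()
    for row in rows:
        name = (row.get("dns_qry_name") or "").strip()
        if not name:
            continue  # skip rows without a query name
        total += 1
        if len(name) > len(longest):
            longest = name
        qtype = str(row.get("dns_qry_type") or "").strip()  # assumed column name
        if qtype.isdigit() and int(qtype) in SUSPICIOUS_TYPES:
            type_counts[int(qtype)] += 1
    return {"total": total, "longest": longest, "types": dict(type_counts)}
```

Keeping summarize independent of file handling means the same function can be fed rows from a single file or from every file returned by collect_csv_files.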

Example Output

Here is an example of the output from the script, showing high-entropy queries and suspicious query types found in one of the capture files:

--- Analyzing file: dns3-2024-01-28-1000_20240128162143-extrafields.csv ---
  Total queries found: 41068
  Longest query found: '160-39-*-*_s-23-44-*-*_ts-1706459418-clienttons-s.akamaihd.net' (69 chars)
  Suspicious query types (TXT, NULL, CNAME):
    Type 16: 32 queries
  High-entropy queries (subdomain entropy > 4.0):
    Entropy=4.53: apimgmthsnz4xwudphzahgsnqqkbyboaubdk6nrhzfej17zahk.cloudapp.net
    Entropy=4.53: apimgmthsnz4xwudphzahgsnqqkbyboaubdk6nrhzfej17zahk.cloudapp.net
    Entropy=4.53: apimgmthsnz4xwudphzahgsnqqkbyboaubdk6nrhzfej17zahk.cloudapp.net
    Entropy=4.53: apimgmthsnz4xwudphzahgsnqqkbyboaubdk6nrhzfej17zahk.cloudapp.net
    Entropy=4.44: uatr7fyxfte2kznwqena-p4ap9t-bdae74924-clientnsv4-s.akamaihd.net

Observations

Running the advanced_analysis.py script across the dataset reveals several interesting patterns:

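  • Highly Suspicious Queries: The domain abcdefghijklmnopqrstuvwxyz012345.plex.direct appears in multiple capture files. Its 32-character subdomain scores an entropy of exactly 5.0 because every character is distinct, which is the maximum for a label of that length. The sequential alphabet-then-digits pattern suggests a user-defined placeholder or client-side testing rather than encoded data.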
  • High-Entropy Subdomains in Legitimate Services: Many of the high-entropy queries are directed to subdomains of legitimate services like Akamai (*.akamaihd.net), Microsoft Azure (*.cloudapp.net), and Heroku (*.herokudns.com). While this traffic could be benign, the use of high-entropy strings in subdomains is a known technique for data exfiltration, where attackers leverage the reputation of these cloud services to evade detection. This demonstrates the script's ability to identify potentially malicious patterns even in traffic to legitimate services.
  • Use of TXT Queries: TXT queries (Type 16) are observed in many of the capture files. Although TXT records have many legitimate uses (e.g., SPF, DKIM), they are also a popular choice for DNS exfiltration because they can hold more data than other record types. Further investigation is challenging as the payloads were truncated during data collection.

Source

Data exfiltration via DNS tunnelling

Investigation of Original PCAP Files

Originally, attempts were made to examine every potentially abusable field. Because of the volume of data, this requires automated approaches to identify anomalies or outliers. No significant patterns were found in the PCAP files; finding subtler ones would require more sophisticated statistical analysis or machine learning techniques. In principle, exfiltration outside the query name is still possible through the following approaches:

  • Header Field Abuses: Some advanced techniques repurpose reserved header fields, such as the “Z” bits in the DNS header (three bits in RFC 1035, two of which were later reassigned as the DNSSEC AD and CD flags). Reserved bits are supposed to be set to zero but are generally ignored by servers and middleboxes. Attackers can use them for sequence numbers or as end-of-stream markers, especially when chunking large payloads across several queries.
  • DNS Opcode: The DNS opcode (normally 0 for a standard query) can be set to rarely used values to signal a special meaning, though such packets risk being dropped by standards-compliant infrastructure.
  • Record Class and Type: The “question class” field is almost always “IN” (Internet), but exfiltration traffic has appeared with other values, such as “NONE”, to evade simple filters or to encode extra bits of data. Alternative record types (e.g., CNAME, MX, SRV, or even experimental types) allow for a slightly larger or more structured payload but are less commonly abused than TXT records or query names.

However, these approaches are impractical for an attacker. Most observed DNS data exfiltration still relies on the query name and RDATA fields because they pass reliably through recursive DNS infrastructure. Abuse of other fields rarely occurs in normal traffic, so any occurrence is suspicious and merits special attention during forensic or hunting activities.
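As a rough sketch of how these fields could be checked, the snippet below decodes the opcode, the Z bits, and the first question's type and class from a raw DNS message. It assumes the full DNS payload is available (the PCAP files in this repository are truncated) and that the question name is uncompressed; the function names are hypothetical.

```python
import struct


def parse_query_fields(message: bytes) -> dict:
    """Decode the header and first-question fields discussed above."""
    if len(message) < 12:
        raise ValueError("DNS header needs at least 12 bytes")
    _msg_id, flags, qdcount = struct.unpack("!HHH", message[:6])
    fields = {
        "opcode": (flags >> 11) & 0xF,     # 0 = standard query
        "z_field": (flags >> 4) & 0x7,     # RFC 1035 three-bit Z field (Z, AD, CD today)
        "reserved_z": (flags >> 6) & 0x1,  # the one bit that is still reserved
        "qtype": None,
        "qclass": None,
    }
    if qdcount == 0:
        return fields
    pos = 12
    while True:  # walk the labels of the first question name
        if pos >= len(message):
            raise ValueError("truncated question name")
        length = message[pos]
        if length & 0xC0:
            raise ValueError("pointer or non-standard label in question name")
        pos += 1 + length
        if length == 0:
            break
    if pos + 4 > len(message):
        raise ValueError("truncated question")
    fields["qtype"], fields["qclass"] = struct.unpack("!HH", message[pos:pos + 4])
    return fields


def header_anomalies(fields: dict) -> list[str]:
    """List the field values that deviate from an ordinary IN-class standard query."""
    findings = []
    if fields["opcode"] != 0:
        findings.append(f"opcode={fields['opcode']}")
    if fields["reserved_z"] != 0:
        findings.append("reserved Z bit set")
    if fields["qclass"] not in (None, 1):
        findings.append(f"qclass={fields['qclass']}")
    return findings
```

The AD and CD bits are deliberately not flagged, since DNSSEC-aware clients set them legitimately.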

Source

DNS Packet Inspection for Network Threat Hunters

Findings on "<Unknown extended label>"

During the analysis, the string <Unknown extended label> was found in the dns_qry_name field of the CSV files. This initially suggested a non-standard DNS label type.

This seemed suspicious; however, the label appears only once across all analyzed files. It is most likely a parsing artifact that arises when Wireshark's DNS parser encounters an incorrectly formatted name. The parser interprets the first byte of the name as a label length byte. If the top two bits of that byte are 01, they indicate an extended label type, which the parser does not recognize, triggering the "unknown extended label" error.
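The bit test described above is small enough to sketch directly. The function name and return strings are hypothetical; the mapping of the top two bits follows the DNS label encoding (00 for an ordinary label, 11 for a compression pointer, 01 for the extended label type, 10 reserved).

```python
def classify_label_byte(first_byte: int) -> str:
    """Classify a DNS label-length byte by its top two bits."""
    if not 0 <= first_byte <= 0xFF:
        raise ValueError("expected a single byte value")
    top_bits = first_byte >> 6
    if top_bits == 0b00:
        return "label"      # ordinary label; low six bits give the length
    if top_bits == 0b11:
        return "pointer"    # compression pointer
    if top_bits == 0b01:
        return "extended"   # extended label type
    return "reserved"       # 0b10 is unassigned
```

One consequence worth noting: every ASCII letter (0x41 to 0x5A and 0x61 to 0x7A) has 01 as its top two bits, so a parser that starts reading a name at the wrong offset and lands on a letter will report an extended label.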

Example

CSV Entry:

  • File: captures/dns3-2024-01-28-1000_20240128193522-extrafields.csv
  • Line:
2265    Jan 28, 2024 14:35:52.866437000 EST     DNS     54.203.*.*   160.39.23.0     13682569        14157750                                <Unknown extended label>        29813   0

Limitations

Confirming whether the QNAME field actually contains a malformed label encoding requires recapturing the network traffic with full packet capture enabled (snaplen=0 or snaplen=65535), so that Wireshark can parse the complete DNS layer.

Additional Information

UUID-like Domain Queries

During the analysis of the provided data, a significant number of DNS queries to domains with a UUID-like format were observed. These invalid queries, such as those to f82f1060-58b7-45b2-b075-06287004d60c.com and 3b2d589e-bc35-4928-af8c-58eb6f6cd052.com, were directed to Columbia's DNS servers.

Further investigation revealed that Columbia's DNS servers are accessible from outside the university's network, which means that external entities can query Columbia's DNS infrastructure directly. The combination of UUID-like domain queries and externally accessible DNS servers could indicate DNS tunneling or other exfiltration techniques in which data is encoded into the domain names themselves. However, these queries were exchanged between only two entities, so they most likely reflect some unidentified network maintenance by Columbia CUIT rather than exfiltration.
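A simple way to isolate such queries and see how many endpoints are involved is sketched below. The helper names are hypothetical, ip_src is an assumed column name (dns_qry_name and ip_dst appear in the CSV files), and rows are taken as dictionaries.

```python
import re

# Matches the 8-4-4-4-12 hexadecimal layout of a UUID.
UUID_LABEL = re.compile(
    r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$",
    re.IGNORECASE,
)


def has_uuid_label(qname: str) -> bool:
    """Return True if any label of the query name looks like a UUID."""
    return any(UUID_LABEL.match(label) for label in qname.rstrip(".").split("."))


def uuid_query_endpoints(rows) -> set:
    """Collect the (source, destination) pairs that exchanged UUID-like queries."""
    pairs = set()
    for row in rows:
        if has_uuid_label(row.get("dns_qry_name") or ""):
            pairs.add((row.get("ip_src"), row.get("ip_dst")))  # ip_src is assumed
    return pairs
```

A very small set of endpoint pairs, as observed here, points toward a single tool or appliance rather than widespread client behavior.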

Outgoing Destination IP Analysis

To understand the destinations of outgoing traffic, an analysis was performed on the ip_dst column, excluding internal IP ranges. The top 5 most frequent IP addresses are listed explicitly, while the remaining IPs are aggregated into /16 subnets to provide a broader overview.

The following list shows the top 5 outgoing IP addresses, followed by the most frequent /16 subnets for the rest of the traffic:

113278 8.8.8.8
2429 8.8.4.4
510 1.1.1.1
180 172.26.38.1
90 208.67.220.220
68 208.83.*.*
58 208.67.*.*
53 114.114.*.*
42 205.251.*.*
41 10.0.*.*
36 211.115.*.*
34 162.253.*.*
21 80.82.*.*
15 185.46.*.*
12 156.145.*.*
10 172.18.*.*
9 54.203.*.*
9 140.163.*.*
8 192.168.*.*
6 10.115.*.*
6 1.0.*.*
4 44.228.*.*
3 45.227.*.*
2 94.74.*.*
2 43.225.*.*
2 100.20.*.*
1 194.26.*.*
1 180.76.*.*
1 100.64.*.*

Observations:

  • The vast majority of outgoing traffic is directed to Google's DNS servers (8.8.8.8 and 8.8.4.4).
  • Cloudflare's DNS server (1.1.1.1) is also a frequent destination.
  • The aggregated subnets provide a high-level view of the other services being accessed, which can be useful for identifying broader patterns of external communication.
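The aggregation behind the list above (top 5 addresses kept as-is, the rest grouped by their first two octets) could be reproduced roughly as follows. The function name and the internal_prefixes parameter are hypothetical; which ranges count as internal is not stated here, so the default excludes nothing.

```python
from collections import Counter


def summarize_destinations(dst_ips, top_n=5, internal_prefixes=()):
    """Return (top individual IPs, remaining traffic grouped by /16)."""
    counts = Counter(
        ip for ip in dst_ips
        if not (internal_prefixes and ip.startswith(tuple(internal_prefixes)))
    )
    top = counts.most_common(top_n)
    top_ips = {ip for ip, _ in top}
    subnets = Counter()
    for ip, n in counts.items():
        if ip in top_ips:
            continue
        octets = ip.split(".")
        # Keep the first two octets; this also works for addresses already masked as a.b.*.*
        subnets[f"{octets[0]}.{octets[1]}.*.*"] += n
    return top, subnets.most_common()
```

For example, summarize_destinations(ips, internal_prefixes=("160.39.",)) would drop destinations in that range before counting, assuming that prefix is the one treated as internal.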

Things I Overlooked

Initially, I thought that DNS exfiltration could only be accomplished by querying a malicious DNS server directly rather than a legitimate recursive resolver (such as Google DNS or Cloudflare 1.1.1.1). However, when a public, benign resolver receives a query for a subdomain it has not cached, it forwards the query to the domain's authoritative nameserver, which the attacker controls. Therefore, it is worth looking into all DNS queries, regardless of the resolver used.

Update on November 24, 2025

  • Read the paper "Monitoring Enterprise DNS Queries for Detecting Data Exfiltration from Internal Hosts" (https://www2.ee.unsw.edu.au/~vijay/pubs/jrnl/20TNSMexfil.pdf)

    • The authors used a lightweight machine learning algorithm that can be deployed in real time.

      • Mentioned that the code was available to the public, but the website is down.
    • They used the Majestic dataset to build a whitelist, complemented by additional domains commonly used within their organizations (such as the anti-virus software they were using).

      • They trusted this dataset because
    • They also used Shannon entropy to filter out malicious domains.

    • They addressed an issue we also ran into: in-addr.arpa queries, which are used for reverse DNS lookups (for mail servers).

  • Read the blog post "Using the power of Cloudflare’s global network to detect malicious domains using machine learning" from Cloudflare (https://blog.cloudflare.com/threat-detection-machine-learning-models/)

    • The blog post also discusses using machine learning models to detect malicious domains.
    • Most of the DNS exfiltration uses
    • Described use cases for such queries, including UUID-like domains (which we also discovered):
      Queries like this are often used by security and monitoring applications and network appliances to check in.
      
      • Admitted that there are benign use cases of UUID-like domains, making it difficult to classify them as malicious or benign.
      • They paired a human element with their ML model to deliver protection.
    • Mentioned that responses to these queries can use TXT records to deliver data back to the requester.

Next Steps

  • Explore machine learning techniques, including the ones mentioned in the paper or similar.
  • Filter out known benign domains using public datasets such as Majestic, paired with some human review.
    • For Columbia, perhaps CUIT can provide a list of known benign domains used within the university network.
    • The Majestic top 100 dataset doesn't include legitimate domains such as columbia.edu or any educational service, so it needs to be complemented with additional known benign domains.
    • From there, it would be possible to build a whitelist for training or filtering, or it may be easier for a human to review the remaining unknown domains.
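The whitelist step could look something like the sketch below, which is only one possible design. The function names are hypothetical, and the whitelist is assumed to be a set of lowercase domain names loaded from Majestic plus locally supplied entries.

```python
def is_whitelisted(qname: str, whitelist: set) -> bool:
    """Return True if the name or any parent domain of it is in the whitelist."""
    labels = qname.lower().rstrip(".").split(".")
    # Compare whole-label suffixes so that "evilcolumbia.edu" does not match "columbia.edu".
    return any(".".join(labels[i:]) in whitelist for i in range(len(labels)))


def filter_unknown(qnames, whitelist: set) -> list:
    """Keep only the query names that still need review."""
    return [name for name in qnames if not is_whitelisted(name, whitelist)]
```

One trade-off to keep in mind: whitelisting an entire shared hosting domain such as akamaihd.net or cloudapp.net would also hide the high-entropy subdomains noted in the observations, so those domains may be better left for the entropy check or human review.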
