Benchmark overview
We compare four subdomain discovery pipelines per apex domain. They are designed to test the subwiz model in a realistic setting, where a user first discovers a set of subdomains with traditional tools (e.g. Subfinder, Amass, Gobuster) and then feeds that output to the subwiz model. The task is difficult because these tools may already have found most of the subdomains for a given apex domain; even so, the model remains useful when the goal is exhaustive subdomain discovery.
Here are the specific quantities compared:
- Subfinder subdomains: This baseline is the set of unique resolved subdomains identified by running the `subfinder` tool for each apex domain. Since running subfinder across many domains can be very time-consuming, we have already performed this step and stored the results in `benchmark_dataset.json`, but feel free to do it yourself. This data is as of May 2025.
- Subfinder subdomains --> subwiz v0: Starting with the subdomains discovered by Subfinder as seed inputs, subwiz v0 (version 0.4.1) generates additional candidate subdomains. This version uses only the input subdomains themselves as context for generation, without incorporating the apex domain information. We use `max-recursion=1` because v0 originally had no recursion; recursion was introduced later with v1.
- Subfinder subdomains --> subwiz v1: Starting with the Subfinder results as seed inputs, subwiz v1 generates candidates using the improved model with apex domain context. Additionally, this pipeline uses the default recursive generation (`max_recursion=5`): newly discovered subdomains from each iteration are automatically fed back as inputs for the next iteration, allowing the model to discover deeper nested subdomains that can only be found by building upon previously generated candidates.
- Subfinder subdomains --> subwiz v1 + maximum recursion: Starting with the Subfinder results as seed inputs, subwiz v1 generates candidates using the improved model with apex domain context. Additionally, this pipeline uses the maximum allowed recursion (`max_recursion=50`) to show the model's full potential.
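
The recursive feedback described for the v1 pipelines can be illustrated with a minimal sketch. This is not subwiz's actual implementation: `recursive_discovery`, `toy_generate`, and `toy_resolves` are hypothetical names, the generator is a stand-in for the model, and the exact way subwiz counts iterations against `max_recursion` is an assumption here.

```python
from typing import Callable, Iterable, Set


def recursive_discovery(
    seeds: Iterable[str],
    generate: Callable[[Set[str]], Set[str]],
    resolves: Callable[[str], bool],
    max_recursion: int,
) -> Set[str]:
    """Return the resolving subdomains found beyond the seed set.

    Hypothetical sketch: each iteration feeds every known subdomain
    (seeds plus earlier discoveries) back into the generator.
    """
    known = set(seeds)
    found: Set[str] = set()
    for _ in range(max_recursion):
        # Only keep candidates that are not already known.
        candidates = generate(known) - known
        new = {c for c in candidates if resolves(c)}
        if not new:
            break  # nothing new resolved, so further iterations cannot help
        found |= new
        known |= new
    return found


# Toy stand-ins (assumptions, not the real model or a real DNS resolver).
LIVE = {
    "api.example.com",
    "dev.api.example.com",
    "staging.dev.api.example.com",
}


def toy_generate(known: Set[str]) -> Set[str]:
    # Propose one extra label in front of every known subdomain.
    return {f"{label}.{name}" for name in known for label in ("dev", "staging")}


def toy_resolves(name: str) -> bool:
    return name in LIVE


seeds = {"api.example.com"}
one_pass = recursive_discovery(seeds, toy_generate, toy_resolves, max_recursion=1)
deep = recursive_discovery(seeds, toy_generate, toy_resolves, max_recursion=5)
print(sorted(one_pass))
print(sorted(deep))
```

In this toy setup, a single pass can only reach `dev.api.example.com`, while further iterations also reach `staging.dev.api.example.com`, because that name is only proposed once its parent has been discovered. This mirrors why the higher recursion settings can surface deeper nested subdomains.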