diff --git a/.github/workflows/ruby.yml b/.github/workflows/ruby.yml
index 37ca9fe..1ec5062 100644
--- a/.github/workflows/ruby.yml
+++ b/.github/workflows/ruby.yml
@@ -15,7 +15,7 @@ jobs:
runs-on: ubuntu-latest
strategy:
matrix:
- ruby-version: ["3.2", "3.4"]
+ ruby-version: ["3.3", "3.4", "4.0.3"]
steps:
- uses: actions/checkout@v4
diff --git a/README.md b/README.md
index e172560..2566bbd 100644
--- a/README.md
+++ b/README.md
@@ -32,6 +32,8 @@ The default rule set includes:
## Installation
+Crawlscope requires Ruby 3.3 or newer.
+
Add this line to your application's Gemfile:
```ruby
@@ -166,6 +168,7 @@ Available environment overrides:
- `TIMEOUT=30`
- `NETWORK_IDLE_TIMEOUT=10`
- `CONCURRENCY=5`
+- `FETCH_EXECUTOR=threaded` or `FETCH_EXECUTOR=async`
Available tasks:
@@ -196,6 +199,12 @@ bundle exec rake 'crawlscope:validate:ldjson[https://example.com/article]'
Plain `rake` does not pass `--url` style flags to tasks. Use `URL=...` or the
task-argument form above instead.
+`FETCH_EXECUTOR=async` is the default for HTTP crawling. It uses Ruby's fiber
+scheduler and Async::HTTP through Faraday, preserving the same `CONCURRENCY`
+bound. Use `FETCH_EXECUTOR=threaded` or `--fetch-executor threaded` for the
+thread-pool executor. Browser rendering uses the threaded executor by default
+because async fetch execution is only supported with HTTP rendering.
+
`crawlscope:validate:ldjson` is separate because it directly checks the URL or semicolon-separated URLs in `URL`; it does not crawl the sitemap. Without `URL`, it checks the configured base URL, falling back to `http://localhost:3000`.
### Structured Data URL Audit
diff --git a/Rakefile b/Rakefile
index 840ad74..7f3ae01 100644
--- a/Rakefile
+++ b/Rakefile
@@ -183,7 +183,7 @@ namespace :release do
end
namespace :crawlscope do
- desc "Validate URLs with all default Crawlscope rules. Args: [url,sitemap,rules]. ENV: URL, SITEMAP, RULES, JS=1, TIMEOUT, NETWORK_IDLE_TIMEOUT, CONCURRENCY"
+ desc "Validate URLs with all default Crawlscope rules. Args: [url,sitemap,rules]. ENV: URL, SITEMAP, RULES, JS=1, TIMEOUT, NETWORK_IDLE_TIMEOUT, CONCURRENCY, FETCH_EXECUTOR"
task :validate, [:url, :sitemap, :rules] do |_task, args|
Crawlscope::RakeTasks.validate(url: args[:url], sitemap_path: args[:sitemap], rule_names: args[:rules])
end
diff --git a/UPGRADE.md b/UPGRADE.md
index 3a6cb49..f3c9415 100644
--- a/UPGRADE.md
+++ b/UPGRADE.md
@@ -4,4 +4,19 @@ Use this file for host-app migration notes when a release changes public
contracts, required setup, component locals, generated assets, or runtime
behavior.
-No special upgrade notes have been published yet.
+## Next Release
+
+### Ruby 3.3 is now required
+
+Crawlscope now depends on the current Async runtime for production async HTTP
+fetching. Host applications must run Ruby 3.3 or newer before upgrading.
+
+Recommended migration:
+
+1. Upgrade the host application runtime to Ruby 3.3 or newer.
+2. Run `bundle update crawlscope async async-http async-http-faraday`.
+3. Crawlscope now uses `FETCH_EXECUTOR=async` by default for HTTP crawling.
+4. Set `FETCH_EXECUTOR=threaded` or pass `--fetch-executor threaded` for a
+ conservative rollout or for explicit thread-pool execution.
+5. Browser rendering continues to use threaded execution by default because
+ async fetch execution is only supported with HTTP rendering.
diff --git a/async-performance-assessment.md b/async-performance-assessment.md
new file mode 100644
index 0000000..ef7c30c
--- /dev/null
+++ b/async-performance-assessment.md
@@ -0,0 +1,167 @@
+# Async Performance Assessment
+
+Date: 2026-06-01
+
+## Conclusion
+
+Async HTTP is not at least 2x faster than Crawlscope's existing threaded fetch
+baseline at the same concurrency.
+
+It is much faster than sequential fetching, but Crawlscope was already parallel
+before this work. The meaningful comparison is therefore:
+
+- `fetch_executor: :threaded`, `CONCURRENCY=N`
+- `fetch_executor: :async`, `CONCURRENCY=N`
+
+On that comparison, async is roughly 1.01x to 1.11x faster in most local
+delayed-response scenarios, with one Ruby 3.4.8 short-delay/high-concurrency
+case reaching 1.35x. It does not reach 2x.
+
+The additional production work did produce a 2x+ improvement in a different
+place: child sitemap expansion now uses the bounded fetch executor instead of
+walking sitemap indexes serially. In the local delayed sitemap benchmark,
+bounded expansion is about 7.4x to 7.6x faster than sequential expansion on
+Ruby 3.4.8 and Ruby 4.0.3.
+
+## Benchmark Setup
+
+Benchmarks use a local delayed HTTP server and full `Crawlscope::Crawl` runs,
+not isolated HTTP calls.
+
+Measured scenarios:
+
+- direct sitemap pages,
+- uncrawled internal link targets resolved by the links rule,
+- 48 delayed pages or targets,
+- 20ms and 80ms response delays,
+- concurrency 8 and 16,
+- median of 3 runs per executor.
+
+Scripts:
+
+- `test/performance/async_fetch_benchmark.rb`
+- `test/performance/fetch_executor_matrix.rb`
+- `test/performance/sitemap_expansion_benchmark.rb`
+
+## Results
+
+### Ruby 4.0.3
+
+| Scenario | Threaded | Async | Async vs Threaded |
+| --- | ---: | ---: | ---: |
+| direct pages, 20ms, c8 | 0.159s | 0.147s | 1.08x |
+| direct pages, 20ms, c16 | 0.092s | 0.088s | 1.04x |
+| direct pages, 80ms, c8 | 0.528s | 0.521s | 1.01x |
+| direct pages, 80ms, c16 | 0.280s | 0.275s | 1.02x |
+| link targets, 20ms, c8 | 0.179s | 0.170s | 1.06x |
+| link targets, 80ms, c8 | 0.616s | 0.601s | 1.02x |
+
+### Ruby 3.4.8
+
+| Scenario | Threaded | Async | Async vs Threaded |
+| --- | ---: | ---: | ---: |
+| direct pages, 20ms, c8 | 0.163s | 0.159s | 1.03x |
+| direct pages, 20ms, c16 | 0.110s | 0.082s | 1.35x |
+| direct pages, 80ms, c8 | 0.524s | 0.518s | 1.01x |
+| direct pages, 80ms, c16 | 0.278s | 0.271s | 1.02x |
+| link targets, 20ms, c8 | 0.190s | 0.170s | 1.11x |
+| link targets, 80ms, c8 | 0.610s | 0.607s | 1.01x |
+
+The earlier simple benchmark also showed the same pattern:
+
+| Runtime | Sequential threaded | Threaded c8 | Async c8 |
+| --- | ---: | ---: | ---: |
+| Ruby 4.0.3 | 2.141s | 0.284s | 0.305s |
+| Ruby 3.4.8 | 2.133s | 0.272s | 0.328s |
+
+### Child Sitemap Expansion
+
+This benchmark measures a sitemap index with eight delayed child sitemaps.
+The first row forces a sequential executor by setting concurrency to 1. The
+bounded rows use the same crawl path with concurrency 8.
+
+| Runtime | Sequential | Threaded c8 | Async c8 | Best Speedup |
+| --- | ---: | ---: | ---: | ---: |
+| Ruby 4.0.3 | 2.123s | 0.280s | 0.285s | 7.58x |
+| Ruby 3.4.8 | 2.100s | 0.284s | 0.276s | 7.61x |
+
+## Why Async Is Not 2x Faster
+
+The current threaded implementation is already near the latency lower bound.
+
+For 48 pages at 80ms delay and concurrency 16, the theoretical network floor is
+roughly:
+
+```text
+ceil(48 / 16) * 0.08s = 0.24s
+```
+
+Measured results:
+
+- threaded: 0.279s on Ruby 4.0.3
+- async: 0.267s on Ruby 4.0.3
+
+That leaves only about 39ms of overhead for threaded execution to eliminate.
+Async cannot produce a 2x improvement when the threaded baseline is already
+within about 16% of the ideal network floor.
+
+The main reasons:
+
+1. Crawlscope already had bounded parallel fetching through
+ `Concurrent::FixedThreadPool`.
+2. The benchmark is I/O latency dominated, not CPU or thread scheduling
+ dominated.
+3. Faraday still adds request/response abstraction overhead in both modes.
+4. Local HTTP/1.1 requests do not create a major multiplexing advantage for
+ async.
+5. The async executor improves resource shape more than wall-clock time when
+ the thread pool is already sized correctly.
+
+## What Did Improve
+
+This work still improves the architecture and specific crawl paths:
+
+- async HTTP is now a real transport through `async-http-faraday`,
+- fetch executor selection is explicit and documented,
+- browser rendering is guarded from async misuse,
+- output ordering remains stable,
+- uncrawled link targets are resolved as a bounded batch instead of repeatedly
+ through the single-target path,
+- child sitemap indexes are expanded through the same bounded executor,
+- canonical target resolution reuses the link-target cache,
+- per-page link extraction and uniqueness summaries can run in bounded parallel
+ when the selected executor supports it.
+
+The biggest speed improvement is not async versus threads. It is batching extra
+network-bound and per-page rule work through the same bounded executor. In slow
+child-sitemap or link-target cases, the batch shape turns what would otherwise
+be a serial second wave into roughly `ceil(items / concurrency) * latency`.
+
+## Recommendation
+
+Keep the async implementation, but do not claim a 2x wall-clock speedup over
+the current threaded executor.
+
+Position it as:
+
+- production-ready for HTTP crawling,
+- useful for fiber-scheduler deployments,
+- potentially better at higher concurrency with lower thread pressure,
+- equivalent-to-slightly-faster than threaded for the measured local workloads.
+
+If the product goal is a reliable 2x improvement over the previous threaded
+baseline, the next performance work should not be "more async." It should
+target:
+
+1. persistent fetch/result caching across rule phases,
+2. optional higher concurrency with per-host rate limits,
+3. streaming per-page analysis so CPU work overlaps network waits,
+4. reducing retained response bodies when rules only need parsed document state.
+
+## Decision
+
+Async is production-ready, but it is not a 2x speed feature against the existing
+threaded baseline. The production claim should be about scalability and
+executor choice. The measured 2x+ speedup comes from applying bounded
+parallelism to previously serial crawl phases, especially child sitemap
+expansion.
diff --git a/crawlscope.gemspec b/crawlscope.gemspec
index ccc80c5..5306e0c 100644
--- a/crawlscope.gemspec
+++ b/crawlscope.gemspec
@@ -12,7 +12,7 @@ Gem::Specification.new do |spec|
spec.description = "A small Ruby gem for sitemap-driven SEO validation with structured issues, configurable rules and schema registries, optional browser rendering, and Rails rake task integration."
spec.homepage = "https://www.ethos-link.com/opensource/crawlscope"
spec.license = "MIT"
- spec.required_ruby_version = ">= 3.2.0"
+ spec.required_ruby_version = ">= 3.3.0"
repo = "https://github.com/ethos-link/crawlscope"
branch = "main"
@@ -46,6 +46,8 @@ Gem::Specification.new do |spec|
spec.require_paths = ["lib"]
spec.add_dependency "concurrent-ruby", ">= 1.3"
+ spec.add_dependency "async", ">= 2.0"
+ spec.add_dependency "async-http-faraday", ">= 0.22"
spec.add_dependency "faraday", ">= 2.0"
spec.add_dependency "faraday-follow_redirects", ">= 0.3"
spec.add_dependency "json-schema", ">= 5.0"
diff --git a/lib/crawlscope/cli.rb b/lib/crawlscope/cli.rb
index 39f9ac3..638db9a 100644
--- a/lib/crawlscope/cli.rb
+++ b/lib/crawlscope/cli.rb
@@ -134,6 +134,8 @@ def run_validate
configure_renderer(resolved_renderer)
@configuration.concurrency = resolved_concurrency
+ fetch_executor_configured = !normalized_string(ENV["FETCH_EXECUTOR"]).nil?
+ @configuration.fetch_executor = resolved_fetch_executor
@configuration.network_idle_timeout_seconds = resolved_integer("NETWORK_IDLE_TIMEOUT", default: @configuration.network_idle_timeout_seconds, minimum: 1)
@configuration.timeout_seconds = resolved_integer("TIMEOUT", default: @configuration.timeout_seconds, minimum: 1)
@@ -167,9 +169,15 @@ def run_validate
opts.on("--concurrency COUNT", Integer, "Set crawl concurrency") do |value|
@configuration.concurrency = integer_option(value, minimum: 1, name: "concurrency")
end
+
+ opts.on("--fetch-executor NAME", "Use threaded or async fetch execution") do |value|
+ fetch_executor_configured = true
+ @configuration.fetch_executor = value
+ end
end
parser.parse!(@argv)
+ @configuration.fetch_executor = :threaded if @configuration.renderer == :browser && !fetch_executor_configured
result = task.validate(
base_url: options[:url],
@@ -221,6 +229,14 @@ def resolved_concurrency
end
end
+ def resolved_fetch_executor
+ configured_executor = normalized_string(ENV["FETCH_EXECUTOR"])
+ return configured_executor if configured_executor
+ return :threaded if @configuration.renderer == :browser
+
+ @configuration.fetch_executor
+ end
+
def resolved_integer(name, default:, minimum:)
raw_value = normalized_string(ENV[name])
return default if raw_value.nil?
diff --git a/lib/crawlscope/configuration.rb b/lib/crawlscope/configuration.rb
index dd6af3e..76df814 100644
--- a/lib/crawlscope/configuration.rb
+++ b/lib/crawlscope/configuration.rb
@@ -7,10 +7,11 @@ class Configuration
DEFAULT_BROWSER_NETWORK_IDLE_TIMEOUT_SECONDS = 5
DEFAULT_BROWSER_SCROLL_PAGE = true
DEFAULT_CONCURRENCY = 10
+ DEFAULT_FETCH_EXECUTOR = :async
RENDERERS = %i[http browser].freeze
DEFAULT_TIMEOUT_SECONDS = 20
- attr_writer :allowed_statuses, :base_url, :browser_factory, :concurrency, :network_idle_timeout_seconds, :output, :renderer, :rule_registry, :schema_registry, :scroll_page, :site_name, :sitemap_path, :timeout_seconds
+ attr_writer :allowed_statuses, :base_url, :browser_factory, :concurrency, :fetch_executor, :network_idle_timeout_seconds, :output, :renderer, :rule_registry, :schema_registry, :scroll_page, :site_name, :sitemap_path, :timeout_seconds
def allowed_statuses
value = resolve(@allowed_statuses)
@@ -30,6 +31,13 @@ def concurrency
positive_integer(value, default: DEFAULT_CONCURRENCY, name: "concurrency")
end
+ def fetch_executor
+ value = resolve(@fetch_executor)
+ default = (renderer == :browser) ? :threaded : DEFAULT_FETCH_EXECUTOR
+
+ FetchExecutor.normalize(value.nil? ? default : value)
+ end
+
def browser_concurrency
value = concurrency
default_value = DEFAULT_BROWSER_CONCURRENCY
@@ -83,6 +91,7 @@ def audit(base_url: self.base_url, sitemap_path: self.sitemap_path, rule_names:
sitemap_path: sitemap_path,
browser_factory: browser_factory,
concurrency: concurrency,
+ fetch_executor: fetch_executor,
network_idle_timeout_seconds: network_idle_timeout_seconds,
renderer: renderer,
timeout_seconds: timeout_seconds,
diff --git a/lib/crawlscope/context.rb b/lib/crawlscope/context.rb
index 5335eca..1e9d2c4 100644
--- a/lib/crawlscope/context.rb
+++ b/lib/crawlscope/context.rb
@@ -1,7 +1,7 @@
# frozen_string_literal: true
module Crawlscope
- Context = Data.define(:allowed_statuses, :base_url, :resolve_target, :schema_registry) do
+ Context = Data.define(:allowed_statuses, :base_url, :concurrency, :fetch_executor, :resolve_target, :resolve_targets, :schema_registry) do
def fetch(name)
public_send(name)
end
diff --git a/lib/crawlscope/crawl.rb b/lib/crawlscope/crawl.rb
index ad3af48..4bb42dc 100644
--- a/lib/crawlscope/crawl.rb
+++ b/lib/crawlscope/crawl.rb
@@ -2,7 +2,7 @@
module Crawlscope
class Crawl
- def initialize(base_url:, sitemap_path:, rules:, schema_registry:, browser_factory: nil, concurrency: Configuration::DEFAULT_CONCURRENCY, network_idle_timeout_seconds: Configuration::DEFAULT_BROWSER_NETWORK_IDLE_TIMEOUT_SECONDS, renderer: :http, scroll_page: Configuration::DEFAULT_BROWSER_SCROLL_PAGE, timeout_seconds: Configuration::DEFAULT_TIMEOUT_SECONDS, allowed_statuses: Configuration::DEFAULT_ALLOWED_STATUSES)
+ def initialize(base_url:, sitemap_path:, rules:, schema_registry:, browser_factory: nil, concurrency: Configuration::DEFAULT_CONCURRENCY, fetch_executor: nil, network_idle_timeout_seconds: Configuration::DEFAULT_BROWSER_NETWORK_IDLE_TIMEOUT_SECONDS, renderer: :http, scroll_page: Configuration::DEFAULT_BROWSER_SCROLL_PAGE, timeout_seconds: Configuration::DEFAULT_TIMEOUT_SECONDS, allowed_statuses: Configuration::DEFAULT_ALLOWED_STATUSES)
@base_url = base_url
@sitemap_path = sitemap_path
@rules = Array(rules)
@@ -11,16 +11,19 @@ def initialize(base_url:, sitemap_path:, rules:, schema_registry:, browser_facto
@concurrency = concurrency
@network_idle_timeout_seconds = network_idle_timeout_seconds
@renderer = renderer.to_sym
+ @fetch_executor = fetch_executor || default_fetch_executor
@scroll_page = scroll_page
@timeout_seconds = timeout_seconds
@allowed_statuses = allowed_statuses
end
def call
+ validate_fetch_executor!
+
urls = sitemap_urls
@page_fetcher = page
- pages = Crawler.new(page_fetcher: @page_fetcher, concurrency: @concurrency).call(urls)
+ pages = Crawler.new(page_fetcher: @page_fetcher, concurrency: @concurrency, fetch_executor: @fetch_executor).call(urls)
issues = IssueCollection.new
collect(pages, issues)
@@ -41,7 +44,13 @@ def call
private
def sitemap_urls
- urls = Sitemap.new(path: @sitemap_path).urls(base_url: @base_url)
+ urls = Sitemap.new(
+ path: @sitemap_path,
+ adapter: http_adapter,
+ concurrency: @concurrency,
+ fetch_executor: @fetch_executor,
+ timeout_seconds: @timeout_seconds
+ ).urls(base_url: @base_url)
raise ValidationError, "No URLs found in sitemap: #{@sitemap_path}" if urls.empty?
urls
@@ -62,15 +71,35 @@ def page
if @renderer == :browser
(@browser_factory || method(:browser)).call
else
- Http.new(base_url: @base_url, timeout_seconds: @timeout_seconds)
+ Http.new(base_url: @base_url, timeout_seconds: @timeout_seconds, adapter: http_adapter)
end
end
+ def http_adapter
+ return unless FetchExecutor.normalize(@fetch_executor) == :async
+
+ require "async/http/faraday"
+ :async_http
+ end
+
+ def validate_fetch_executor!
+ return unless @renderer == :browser && FetchExecutor.normalize(@fetch_executor) == :async
+
+ raise ConfigurationError, "Async fetch execution is only supported with http rendering"
+ end
+
+ def default_fetch_executor
+ (@renderer == :browser) ? :threaded : Configuration::DEFAULT_FETCH_EXECUTOR
+ end
+
def context
Context.new(
allowed_statuses: @allowed_statuses,
base_url: @base_url,
+ concurrency: @concurrency,
+ fetch_executor: @fetch_executor,
resolve_target: method(:resolve),
+ resolve_targets: method(:resolve_all),
schema_registry: @schema_registry
)
end
@@ -93,11 +122,15 @@ def cache(pages)
@targets = {}
pages.each do |page|
- @pages[page.normalized_url] = page unless page.normalized_url.to_s.empty?
- @pages[page.normalized_final_url] = page unless page.normalized_final_url.to_s.empty?
+ cache_page(page)
end
end
+ def cache_page(page)
+ @pages[page.normalized_url] = page unless page.normalized_url.to_s.empty?
+ @pages[page.normalized_final_url] = page unless page.normalized_final_url.to_s.empty?
+ end
+
def scan(urls, pages, issues)
@rules.each do |rule|
rule.call(urls: urls, pages: pages, issues: issues, context: context)
@@ -105,17 +138,40 @@ def scan(urls, pages, issues)
end
def resolve(target_url)
- normalized_url = Url.normalize(target_url, base_url: @base_url)
- return @targets[normalized_url] if @targets.key?(normalized_url)
+ resolve_all([target_url]).fetch(target_url)
+ end
+
+ def resolve_all(target_urls)
+ normalized_by_url = Array(target_urls).to_h do |target_url|
+ [target_url, Url.normalize(target_url, base_url: @base_url)]
+ end
+ normalized_urls = normalized_by_url.values.compact.uniq
+ missing_urls = []
+
+ normalized_urls.each do |normalized_url|
+ next if @targets.key?(normalized_url)
+
+ resolved = resolved_page(normalized_url)
+ if resolved
+ @targets[normalized_url] = resolved
+ else
+ missing_urls << normalized_url
+ end
+ end
+
+ fetched_pages(missing_urls).each do |page|
+ normalized_url = Url.normalize(page.url, base_url: @base_url)
+ cache_page(page)
+ @targets[normalized_url] = resolution(page, normalized_url, crawled: false)
+ end
- @targets[normalized_url] = resolved_page(normalized_url) || fetched_page(normalized_url)
+ normalized_by_url.to_h { |target_url, normalized_url| [target_url, @targets[normalized_url]] }
end
- def fetched_page(normalized_url)
- page = @page_fetcher.fetch(normalized_url)
- @pages[page.normalized_url] = page unless page.normalized_url.to_s.empty?
- @pages[page.normalized_final_url] = page unless page.normalized_final_url.to_s.empty?
- resolution(page, normalized_url, crawled: false)
+ def fetched_pages(normalized_urls)
+ return [] if normalized_urls.empty?
+
+ Crawler.new(page_fetcher: @page_fetcher, concurrency: @concurrency, fetch_executor: @fetch_executor).call(normalized_urls)
end
def resolved_page(normalized_url)
diff --git a/lib/crawlscope/crawler.rb b/lib/crawlscope/crawler.rb
index 5da2007..967c308 100644
--- a/lib/crawlscope/crawler.rb
+++ b/lib/crawlscope/crawler.rb
@@ -1,28 +1,14 @@
# frozen_string_literal: true
-require "concurrent"
-
module Crawlscope
class Crawler
- def initialize(page_fetcher:, concurrency:)
+ def initialize(page_fetcher:, concurrency:, fetch_executor: :threaded)
@page_fetcher = page_fetcher
- @concurrency = concurrency
+ @fetch_executor = FetchExecutor.build(name: fetch_executor, concurrency: concurrency)
end
def call(urls)
- pages = Concurrent::Array.new
- pool = Concurrent::FixedThreadPool.new(@concurrency)
-
- urls.each do |url|
- pool.post do
- pages << fetch(url)
- end
- end
-
- pool.shutdown
- pool.wait_for_termination
-
- pages.to_a
+ @fetch_executor.call(urls) { |url| fetch(url) }
end
private
diff --git a/lib/crawlscope/fetch_executor.rb b/lib/crawlscope/fetch_executor.rb
new file mode 100644
index 0000000..c462a71
--- /dev/null
+++ b/lib/crawlscope/fetch_executor.rb
@@ -0,0 +1,43 @@
+# frozen_string_literal: true
+
+module Crawlscope
+ module FetchExecutor
+ NAMES = %i[threaded async].freeze
+
+ module_function
+
+ def build(name:, concurrency:)
+ return name if name.respond_to?(:call)
+
+ case normalized_name(name)
+ when :threaded
+ Threaded.new(concurrency: concurrency)
+ when :async
+ Async.new(concurrency: concurrency)
+ end
+ end
+
+ def map(name:, concurrency:, items:, &block)
+ items = Array(items)
+ return items.map(&block) if items.size < 2 || concurrency.to_i <= 1
+
+ build(name: name, concurrency: concurrency).call(items, &block)
+ end
+
+ def normalize(name)
+ return name if name.respond_to?(:call)
+
+ normalized_name(name)
+ end
+
+ def normalized_name(name)
+ normalized = name.to_s.strip
+ normalized = "threaded" if normalized.empty?
+
+ value = normalized.to_sym
+ return value if NAMES.include?(value)
+
+ raise ConfigurationError, "Crawlscope fetch_executor must be threaded or async"
+ end
+ end
+end
diff --git a/lib/crawlscope/fetch_executor/async.rb b/lib/crawlscope/fetch_executor/async.rb
new file mode 100644
index 0000000..5bdf563
--- /dev/null
+++ b/lib/crawlscope/fetch_executor/async.rb
@@ -0,0 +1,32 @@
+# frozen_string_literal: true
+
+require "async"
+require "async/semaphore"
+
+module Crawlscope
+ module FetchExecutor
+ class Async
+ def initialize(concurrency:)
+ @concurrency = concurrency
+ end
+
+ def call(items)
+ indexed_items = Array(items).each_with_index.to_a
+ results = Array.new(indexed_items.size)
+
+ Sync do |parent|
+ semaphore = ::Async::Semaphore.new(@concurrency)
+ tasks = indexed_items.map do |item, index|
+ semaphore.async(parent: parent) do
+ results[index] = yield(item)
+ end
+ end
+
+ tasks.each(&:wait)
+ end
+
+ results
+ end
+ end
+ end
+end
diff --git a/lib/crawlscope/fetch_executor/threaded.rb b/lib/crawlscope/fetch_executor/threaded.rb
new file mode 100644
index 0000000..d819104
--- /dev/null
+++ b/lib/crawlscope/fetch_executor/threaded.rb
@@ -0,0 +1,32 @@
+# frozen_string_literal: true
+
+require "concurrent"
+
+module Crawlscope
+ module FetchExecutor
+ class Threaded
+ def initialize(concurrency:)
+ @concurrency = concurrency
+ end
+
+ def call(items)
+ indexed_items = Array(items).each_with_index.to_a
+ results = Array.new(indexed_items.size)
+ mutex = Mutex.new
+ pool = Concurrent::FixedThreadPool.new(@concurrency)
+
+ indexed_items.each do |item, index|
+ pool.post do
+ result = yield(item)
+ mutex.synchronize { results[index] = result }
+ end
+ end
+
+ pool.shutdown
+ pool.wait_for_termination
+
+ results
+ end
+ end
+ end
+end
diff --git a/lib/crawlscope/http.rb b/lib/crawlscope/http.rb
index c170770..d9eaed7 100644
--- a/lib/crawlscope/http.rb
+++ b/lib/crawlscope/http.rb
@@ -10,13 +10,18 @@ class Http
MAX_REDIRECTS = 5
USER_AGENT = "Mozilla/5.0 (compatible; Crawlscope/1.0)"
- def initialize(base_url:, timeout_seconds:)
+ def initialize(base_url:, timeout_seconds:, adapter: nil)
@base_url = base_url
@timeout_seconds = timeout_seconds
+ @adapter = adapter
@connections_by_thread = Concurrent::Map.new
end
def close
+ @connections_by_thread.each_value do |connection|
+ connection.close if connection.respond_to?(:close)
+ end
+
@connections_by_thread.clear
end
@@ -65,6 +70,7 @@ def connection
faraday.response :follow_redirects, limit: MAX_REDIRECTS
faraday.options.timeout = @timeout_seconds
faraday.options.open_timeout = @timeout_seconds
+ faraday.adapter @adapter if @adapter
end
end
end
diff --git a/lib/crawlscope/reporter.rb b/lib/crawlscope/reporter.rb
index 9425c6f..4c8e1c0 100644
--- a/lib/crawlscope/reporter.rb
+++ b/lib/crawlscope/reporter.rb
@@ -4,6 +4,8 @@
module Crawlscope
class Reporter
+ MAX_ISSUES_PER_GROUP = 20
+
def initialize(io:)
@io = io
end
@@ -15,13 +17,13 @@ def report(result)
@io.puts("URLs: #{result.urls.size}")
@io.puts("Pages: #{result.pages.size}")
- if result.ok?
+ if result.issues.size.zero?
@io.puts("Status: OK")
return
end
- @io.puts("Status: FAILED")
- @io.puts("Issues: #{result.issues.size} #{severity_summary(result.issues)}")
+ @io.puts("Status: #{status_for(result.issues)}")
+ @io.puts("Issues: #{result.issues.size} total (#{severity_summary(result.issues)})")
@io.puts("")
report_summary(result.issues)
@@ -31,6 +33,18 @@ def report(result)
private
+ def status_for(issues)
+ grouped = issues.by_severity
+
+ if grouped.key?(:error)
+ "FAILED"
+ elsif grouped.key?(:warning)
+ "WARNINGS"
+ else
+ "NOTICES"
+ end
+ end
+
def severity_summary(issues)
grouped = issues.by_severity
return "" if grouped.empty?
@@ -59,10 +73,12 @@ def report_issue_groups(issues, base_url:)
.each do |(category, code), grouped_issues|
@io.puts("#{category} / #{code}: #{grouped_issues.size}")
- grouped_issues.each do |issue|
+ grouped_issues.first(MAX_ISSUES_PER_GROUP).each do |issue|
@io.puts(" - #{compact_issue(issue, base_url: base_url)}")
end
+ remaining_count = grouped_issues.size - MAX_ISSUES_PER_GROUP
+ @io.puts(" ... #{remaining_count} more") if remaining_count.positive?
@io.puts("")
end
end
@@ -127,7 +143,7 @@ def relative_url(url, base_url:)
end
def format_number(value)
- return format("%.2f", value) if value.is_a?(Float)
+ return format("%.3f", value) if value.is_a?(Float)
value.to_s
end
diff --git a/lib/crawlscope/result.rb b/lib/crawlscope/result.rb
index eef888e..bf3f522 100644
--- a/lib/crawlscope/result.rb
+++ b/lib/crawlscope/result.rb
@@ -3,7 +3,7 @@
module Crawlscope
Result = Data.define(:base_url, :sitemap_path, :urls, :pages, :issues) do
def ok?
- issues.none?(&:error?) && issues.none?(&:warning?) && issues.none?(&:notice?)
+ issues.none?(&:error?)
end
end
end
diff --git a/lib/crawlscope/rules/links.rb b/lib/crawlscope/rules/links.rb
index 86cbe75..bc14b78 100644
--- a/lib/crawlscope/rules/links.rb
+++ b/lib/crawlscope/rules/links.rb
@@ -21,8 +21,13 @@ def initialize
def call(urls:, pages:, issues:, context:)
@allowed_statuses = context.fetch(:allowed_statuses)
@base_url = context.fetch(:base_url)
+ @concurrency = context_value(context, :concurrency, default: 1)
+ @fetch_executor = context_value(context, :fetch_executor)
@resolve_target = context.fetch(:resolve_target)
+ @resolve_targets = context.resolve_targets if context.respond_to?(:resolve_targets)
+ @resolve_targets ||= context[:resolve_targets] if context.respond_to?(:[])
@base_host = URI.parse(@base_url).host
+ @resolved_targets_by_url = {}
links = extract_links(pages)
validate_url_hygiene(urls, links, issues)
@@ -42,7 +47,12 @@ def contextual_links(doc)
end
def extract_links(pages)
- pages.select(&:html?).flat_map { |page| page_links(page) }
+ html_pages = pages.select(&:html?)
+ FetchExecutor.map(
+ name: @fetch_executor,
+ concurrency: @concurrency,
+ items: html_pages
+ ) { |page| page_links(page) }.flatten
end
def page_links(page)
@@ -175,9 +185,12 @@ def report_unresolved_target(target_url, grouped_links, issues, resolution)
def resolve_links(links, issues)
resolved_links = []
+ grouped_links_by_target = links.group_by { |link| link[:target_url] }
+ targets_by_url = resolve_targets(grouped_links_by_target.keys)
+ @resolved_targets_by_url.merge!(targets_by_url)
- links.group_by { |link| link[:target_url] }.each do |target_url, grouped_links|
- target = resolve_target(target_url)
+ grouped_links_by_target.each do |target_url, grouped_links|
+ target = target_for(target_url)
if target.unresolved?
report_unresolved_target(target_url, grouped_links, issues, target.resolution)
@@ -221,6 +234,23 @@ def resolve_target(target_url)
LinkTarget.new(target_url: target_url, resolution: resolution)
end
+ def resolve_targets(target_urls)
+ if @resolve_targets
+ @resolve_targets.call(target_urls).to_h do |target_url, resolution|
+ [target_url, LinkTarget.new(target_url: target_url, resolution: resolution)]
+ end
+ else
+ target_urls.to_h { |target_url| [target_url, resolve_target(target_url)] }
+ end
+ end
+
+ def target_for(target_url)
+ @resolved_targets_by_url.fetch(target_url) do
+ target = resolve_target(target_url)
+ @resolved_targets_by_url[target_url] = target
+ end
+ end
+
LinkTarget = Data.define(:target_url, :resolution) do
def allowed?(statuses)
statuses.include?(status)
@@ -424,7 +454,7 @@ def validate_indexable_pages_missing_from_sitemap(urls, resolved_links, issues)
next if reported_urls.include?(final_url)
next unless crawlable_path?(link[:final_path])
- target = resolve_target(final_url)
+ target = target_for(final_url)
next unless target.allowed?(@allowed_statuses) && target.html?
next if target.noindex?
@@ -506,16 +536,14 @@ def validate_canonical_targets(urls, pages, resolved_links, issues)
return if sitemap_pages.size < 2
dofollow_counts_by_path = dofollow_counts_by_final_path(resolved_links)
-
- sitemap_pages.each do |page|
- canonical_url = canonical_url_for(page)
- next if canonical_url.nil?
-
- target_uri = URI.parse(canonical_url)
- next if target_uri.host != @base_host
-
- canonical_path = Url.path(canonical_url)
- if canonical_path && dofollow_counts_by_path[canonical_path].zero?
+ canonical_entries = canonical_entries_for(sitemap_pages)
+ @resolved_targets_by_url.merge!(resolve_targets(canonical_entries.map { |entry| entry.fetch(:canonical_url) }))
+
+ canonical_entries.each do |entry|
+ page = entry.fetch(:page)
+ canonical_url = entry.fetch(:canonical_url)
+ canonical_path = entry.fetch(:canonical_path)
+ if canonical_path && !root_path?(canonical_path) && dofollow_counts_by_path[canonical_path].zero?
issues.add(
code: :canonical_no_internal_inlinks,
severity: :warning,
@@ -527,8 +555,24 @@ def validate_canonical_targets(urls, pages, resolved_links, issues)
end
validate_canonical_target_status(page, canonical_url, issues)
+ end
+ end
+
+ def canonical_entries_for(pages)
+ pages.filter_map do |page|
+ canonical_url = canonical_url_for(page)
+ next if canonical_url.nil?
+
+ target_uri = URI.parse(canonical_url)
+ next if target_uri.host != @base_host
+
+ {
+ canonical_path: Url.path(canonical_url),
+ canonical_url: canonical_url,
+ page: page
+ }
rescue URI::InvalidURIError
- next
+ nil
end
end
@@ -548,8 +592,12 @@ def canonical_url_for(page)
Url.normalize(canonical, base_url: page.url)
end
+ def root_path?(path)
+ path == "/"
+ end
+
def validate_canonical_target_status(page, canonical_url, issues)
- target = resolve_target(canonical_url)
+ target = target_for(canonical_url)
if target.unresolved? || target.ignored_error?
return
@@ -575,6 +623,16 @@ def validate_canonical_target_status(page, canonical_url, issues)
)
end
end
+
+ def context_value(context, name, default: nil)
+ if context.respond_to?(name)
+ context.public_send(name)
+ elsif context.respond_to?(:key?) && context.key?(name)
+ context[name]
+ else
+ default
+ end
+ end
end
end
end
diff --git a/lib/crawlscope/rules/uniqueness.rb b/lib/crawlscope/rules/uniqueness.rb
index 95519f9..3a3dc07 100644
--- a/lib/crawlscope/rules/uniqueness.rb
+++ b/lib/crawlscope/rules/uniqueness.rb
@@ -26,11 +26,10 @@ def initialize(
end
def call(urls:, pages:, issues:, context:)
- page_summaries = pages.filter_map do |page|
- next unless page.html?
+ @concurrency = context_value(context, :concurrency, default: 1)
+ @fetch_executor = context_value(context, :fetch_executor)
- summary_for(page)
- end
+ page_summaries = summarize_pages(pages)
validate_duplicates(page_summaries, issues)
validate_near_duplicates(page_summaries, issues)
@@ -66,6 +65,16 @@ def summary_for(page)
}
end
+ def summarize_pages(pages)
+ html_pages = pages.select(&:html?)
+
+ FetchExecutor.map(
+ name: @fetch_executor,
+ concurrency: @concurrency,
+ items: html_pages
+ ) { |page| summary_for(page) }
+ end
+
def validate_duplicates(page_summaries, issues)
duplicates_for(page_summaries, :title).each do |value, urls|
issues.add(
@@ -177,6 +186,16 @@ def shingle_similarity(left, right)
intersection_size.to_f / smaller_set_size
end
+
+ def context_value(context, name, default: nil)
+ if context.respond_to?(name)
+ context.public_send(name)
+ elsif context.respond_to?(:key?) && context.key?(name)
+ context[name]
+ else
+ default
+ end
+ end
end
end
end
diff --git a/lib/crawlscope/sitemap.rb b/lib/crawlscope/sitemap.rb
index b2b1eef..8ebfb33 100644
--- a/lib/crawlscope/sitemap.rb
+++ b/lib/crawlscope/sitemap.rb
@@ -9,20 +9,31 @@ module Crawlscope
class Sitemap
SITEMAP_NAMESPACE = {"xmlns" => "http://www.sitemaps.org/schemas/sitemap/0.9"}.freeze
- def initialize(path:)
+ def initialize(path:, adapter: nil, concurrency: Configuration::DEFAULT_CONCURRENCY, fetch_executor: Configuration::DEFAULT_FETCH_EXECUTOR, timeout_seconds: Configuration::DEFAULT_TIMEOUT_SECONDS)
@path = path
+ @adapter = adapter
+ @concurrency = concurrency
+ @fetch_executor = fetch_executor
+ @timeout_seconds = timeout_seconds
end
def urls(base_url:)
- collect_urls(@path, base_url: base_url, visited: Set.new).uniq
+ collect_urls(@path, base_url: base_url, visited: Set.new, visited_mutex: Mutex.new).uniq
end
private
- def collect_urls(source, base_url:, visited:)
- return [] if visited.include?(source)
+ def collect_urls(source, base_url:, visited:, visited_mutex:)
+ already_visited = visited_mutex.synchronize do
+ if visited.include?(source)
+ true
+ else
+ visited.add(source)
+ false
+ end
+ end
+ return [] if already_visited
- visited.add(source)
document = Nokogiri::XML(read(source))
root_name = document.root&.name
unless %w[sitemapindex urlset].include?(root_name)
@@ -30,10 +41,13 @@ def collect_urls(source, base_url:, visited:)
end
if root_name == "sitemapindex"
- document.xpath("//xmlns:sitemap/xmlns:loc", SITEMAP_NAMESPACE).flat_map do |node|
- child_source = resolve_child_source(source, node.text.to_s.strip, base_url: base_url)
- collect_urls(child_source, base_url: base_url, visited: visited)
+ child_sources = document.xpath("//xmlns:sitemap/xmlns:loc", SITEMAP_NAMESPACE).map do |node|
+ resolve_child_source(source, node.text.to_s.strip, base_url: base_url)
end
+
+ fetch_executor.call(child_sources) do |child_source|
+ collect_urls(child_source, base_url: base_url, visited: visited, visited_mutex: visited_mutex)
+ end.flatten
else
document.xpath("//xmlns:url/xmlns:loc", SITEMAP_NAMESPACE).map do |node|
Url.normalize_for_base(node.text.to_s.strip, base_url: base_url)
@@ -77,11 +91,16 @@ def local_child_path(parent_source, child_loc)
end
def connection
- @connection ||= Faraday.new do |faraday|
+ Faraday.new do |faraday|
faraday.response :follow_redirects, limit: Http::MAX_REDIRECTS
- faraday.options.timeout = 20
- faraday.options.open_timeout = 20
+ faraday.options.timeout = @timeout_seconds
+ faraday.options.open_timeout = @timeout_seconds
+ faraday.adapter @adapter if @adapter
end
end
+
+ def fetch_executor
+ @fetch_executor_instance ||= FetchExecutor.build(name: @fetch_executor, concurrency: @concurrency)
+ end
end
end
diff --git a/lib/tasks/crawlscope_tasks.rake b/lib/tasks/crawlscope_tasks.rake
index 80ad80c..ae12dd1 100644
--- a/lib/tasks/crawlscope_tasks.rake
+++ b/lib/tasks/crawlscope_tasks.rake
@@ -1,5 +1,5 @@
namespace :crawlscope do
- desc "Validate URLs with all default Crawlscope rules. Args: [url,sitemap,rules]. ENV: URL, SITEMAP, RULES, JS=1, TIMEOUT, NETWORK_IDLE_TIMEOUT, CONCURRENCY"
+ desc "Validate URLs with all default Crawlscope rules. Args: [url,sitemap,rules]. ENV: URL, SITEMAP, RULES, JS=1, TIMEOUT, NETWORK_IDLE_TIMEOUT, CONCURRENCY, FETCH_EXECUTOR"
task :validate, [:url, :sitemap, :rules] => :environment do |_task, args|
Crawlscope::RakeTasks.validate(url: args[:url], sitemap_path: args[:sitemap], rule_names: args[:rules])
end
diff --git a/perf-improvement.md b/perf-improvement.md
new file mode 100644
index 0000000..1b5482f
--- /dev/null
+++ b/perf-improvement.md
@@ -0,0 +1,366 @@
+# Crawlscope Performance Improvement Plans
+
+Date: 2026-06-01
+
+## Executive Summary
+
+Crawlscope is already partially parallel: `Crawlscope::Crawler` uses a
+`Concurrent::FixedThreadPool` to fetch sitemap URLs, and HTTP connections are
+cached per worker thread. The remaining time growth is likely coming from three
+places:
+
+1. Sitemap expansion is recursive and sequential.
+2. Rule execution is sequential after the crawl completes.
+3. The links rule can trigger additional target resolution fetches after the
+ main crawl, also through a sequential rule path.
+
+The best next step is not "parallelize everything." The best next step is a
+bounded async fetch layer that unifies page fetches, child sitemap fetches, and
+link target resolution behind one concurrency budget, while keeping analysis
+mostly deterministic and sequential until rules are explicitly classified as
+parallel-safe.
+
+Recommendation: implement Plan 1 first, then adopt the safe parts of Plan 2.
+Defer Plan 3 until benchmarks prove the graph-level complexity is needed.
+
+## Implementation Update
+
+Plan 1 has been implemented with the deliberate Ruby `>= 3.3` contract change.
+The default executor is now async for HTTP crawls through
+`async-http-faraday`, with `fetch_executor: :threaded` retained as an explicit
+fallback.
+
+The safe follow-on parallelization work is also implemented:
+
+- child sitemap indexes expand through the bounded fetch executor,
+- extra link target fetches are resolved in bounded batches,
+- canonical target checks reuse the link target cache,
+- link extraction and uniqueness summaries use bounded parallel maps.
+
+Measured outcome: async is production-ready, but async alone is not 2x faster
+than the existing threaded executor. The 2x+ improvement comes from applying
+bounded parallelism to previously serial phases, especially child sitemap
+expansion. See `async-performance-assessment.md` for current benchmark numbers.
+
+## Current Implementation Evidence
+
+- `lib/crawlscope/crawler.rb` delegates URL scheduling to
+ `Crawlscope::FetchExecutor`.
+- `lib/crawlscope/http.rb` uses Faraday and can run through the async HTTP
+ adapter for `fetch_executor: :async`.
+- `lib/crawlscope/crawl.rb` currently runs in this order: read sitemap URLs,
+ crawl all pages, collect fetch errors, cache pages, then scan rules.
+- `lib/crawlscope/crawl.rb` calls each rule sequentially in `scan`.
+- `lib/crawlscope/sitemap.rb` recursively expands child sitemaps in sequence.
+- `lib/crawlscope/rules/links.rb` groups links and resolves unique internal
+ targets through the crawl context, which can batch-fetch uncrawled target
+ pages.
+- `lib/crawlscope/issue_collection.rb` is a simple collection around an Array,
+ suitable for current sequential rule execution but not safe as a shared sink
+ for parallel rule writers without synchronization or per-task collections.
+- The gemspec currently requires Ruby `>= 3.3.0`.
+
+## Async Ecosystem Notes
+
+Async is now a mature fiber-scheduler stack, not a speculative experiment. Its
+current docs describe it as a composable asynchronous I/O framework based on
+`io-event`, with lightweight fiber concurrency and multi-thread/process
+containers available when actual parallelism is needed.
+
+Relevant current constraints:
+
+- `async` 2.39.0 was released on April 5, 2026 and requires Ruby `>= 3.3`.
+- `async-http` 0.95.1 was released on May 6, 2026 and requires Ruby `>= 3.3`.
+- `async-http-faraday` 0.22.2 was released on April 10, 2026 and requires Ruby
+ `>= 3.3`.
+- Current Crawlscope now supports Ruby `>= 3.3.0`, so adding the latest Async
+ gems as runtime dependencies is an explicit public compatibility change.
+
+Implication: async work should either:
+
+- bump Crawlscope's Ruby requirement to `>= 3.3` in a planned contract change,
+ with `UPGRADE.md` guidance; or
+- keep the current threaded Faraday path as the default while exposing async as
+ an opt-in executor.
+
+## Plan 1: Bounded Async Fetch Layer
+
+Replace the current fetch executor with a small fetch orchestration abstraction:
+
+- Keep `Crawlscope::Crawler.call(urls)` as the public boundary.
+- Add a `FetchExecutor` interface with threaded and async implementations.
+- Use the threaded implementation as the default.
+- Add an async implementation using `async` and either `async-http` directly or
+ `async-http-faraday`.
+- Apply one shared concurrency budget to:
+ - sitemap URL page fetches,
+ - child sitemap fetches,
+ - link target resolution fetches.
+- Deduplicate normalized URLs before scheduling fetches.
+- Preserve deterministic output by sorting returned pages back to input URL
+ order, or by sorting by normalized URL before rule analysis.
+- Keep rule execution sequential for now.
+
+### Why This Fits The Current Code
+
+The existing code already has a natural page-fetch boundary:
+`Crawler.new(page_fetcher:, concurrency:).call(urls)`. The HTTP fetcher already
+encapsulates transport details. The crawl context already resolves extra link
+targets through one `resolve_target` callable. That means the highest-leverage
+change is to make all URL I/O share a smarter scheduler, not to rewrite every
+rule.
+
+### Expected Measurable Effect
+
+- Wall-clock time should improve most on I/O-heavy crawls with many URLs,
+ redirects, missing pages, or internal links not already in the sitemap.
+- CPU-heavy rule time will not improve much.
+- Memory usage should stay bounded because one concurrency budget controls all
+ network fan-out.
+- Determinism can remain high if page ordering and issue ordering are
+ normalized before reporting.
+
+### Reductio Ad Absurdum
+
+Assume this plan is pushed to the extreme: every discovered URL and link target
+is scheduled immediately with unbounded async tasks. The result is faster only
+on a toy site. On a real site it creates too many open sockets, hits rate
+limits, destabilizes browser mode, and makes failures look nondeterministic.
+
+Therefore the core of the plan cannot be "async means unlimited." It must be
+"async with explicit global and per-host bounds." If that constraint is kept,
+the contradiction disappears and the plan remains valid.
+
+### Implementation Steps
+
+1. Add a benchmark fixture with delayed sitemap pages and delayed link targets.
+2. Extract current thread-pool logic into `Crawlscope::FetchExecutor::Threaded`.
+3. Add a shared URL cache keyed by normalized requested URL and normalized final
+ URL.
+4. Route `Crawler` and `Crawl#resolve` through the shared executor/cache.
+5. Add optional `Crawlscope::FetchExecutor::Async` behind a feature flag or
+ renderer/transport setting.
+6. Keep issue collection and rule execution unchanged.
+7. Document Ruby 3.3 requirements for the async transport and release
+ contract.
+
+## Plan 2: Split Crawl And Analyze Pipeline
+
+Turn the current "crawl everything, then analyze everything" flow into a
+two-phase pipeline:
+
+- Fetch pages concurrently.
+- As each page finishes, run per-page rules immediately:
+ - indexability,
+ - metadata,
+ - structured data,
+ - content quality checks that inspect only one page.
+- Store per-page issues in page-local collections.
+- After all pages finish, run aggregate rules:
+ - uniqueness,
+ - link graph checks,
+ - orphan/internal inlink checks.
+- Merge issue collections in deterministic order.
+
+### Why This Fits The Current Code
+
+Several rules already accept the same broad signature:
+
+```ruby
+rule.call(urls: urls, pages: pages, issues: issues, context: context)
+```
+
+That interface is convenient, but it hides whether a rule is per-page or
+aggregate. Splitting rules by capability would let Crawlscope overlap useful
+analysis with network wait time while keeping global checks accurate.
+
+### Expected Measurable Effect
+
+- Users get earlier partial results in future streaming/reporting modes.
+- Wall-clock time improves when per-page parsing/checking is non-trivial and
+ network latency leaves idle CPU time.
+- Total runtime may improve less than Plan 1 if most time is still network.
+- Implementation risk is moderate because rule contracts need classification.
+
+### Reductio Ad Absurdum
+
+Assume every rule is treated as page-local and run as soon as each page arrives.
+Uniqueness becomes wrong because it needs all pages. Link graph checks become
+wrong because inbound counts need all resolved links. Orphan-page detection can
+report false positives before all links have been seen.
+
+Therefore the plan is only valid if Crawlscope makes rule phase explicit:
+`per_page`, `post_crawl`, or `graph`. Without that classification, faster
+analysis produces faster wrong answers.
+
+### Implementation Steps
+
+1. Add rule metadata, for example `phase: :per_page` or `phase: :aggregate`.
+2. Keep existing rules aggregate by default.
+3. Move only obviously local rules into `:per_page` after focused tests.
+4. Change `IssueCollection` usage to per-task collections, then merge in stable
+ order.
+5. Add tests proving aggregate rules still see the complete page set.
+6. Add timing instrumentation for crawl, per-page analysis, and aggregate
+ analysis.
+
+## Plan 3: Unified Async Crawl Graph
+
+Model the crawl as a bounded async graph rather than a linear crawl:
+
+- Fetch the root sitemap.
+- Fetch child sitemaps concurrently.
+- Schedule page fetches as URLs are discovered.
+- Extract links as pages arrive.
+- Schedule uncrawled internal link targets through the same bounded queue.
+- Run page-local checks as soon as a page arrives.
+- Run graph/global checks after the queue drains.
+- Maintain one canonical `PageStore` and `TargetResolutionStore`.
+
+### Why This Fits The Problem
+
+The slowest large crawls are not just "N sitemap URLs." They are closer to a
+graph:
+
+- a root sitemap points to child sitemaps,
+- pages point to internal targets,
+- redirects create final URLs,
+- rules need both requested and final URL identity.
+
+A graph executor can remove duplicate work and prevent the current shape where
+extra link-target fetches happen later as a second network wave.
+
+### Expected Measurable Effect
+
+- Best theoretical wall-clock reduction for large sites with many child
+ sitemaps and link targets.
+- Better dedupe because sitemap URLs, final URLs, and link targets share one
+ store.
+- More complex failure semantics: cancellation, retries, partial graph state,
+ and stable issue ordering must be designed up front.
+
+### Reductio Ad Absurdum
+
+Assume the crawl graph eagerly follows every internal link, every redirect, and
+every discovered target until no new URL remains. Crawlscope stops being a
+sitemap validation tool and becomes a general web crawler. Runtime becomes
+unbounded, robots/rate-limit behavior becomes harder to reason about, and the
+reported URL set drifts away from the sitemap contract.
+
+Therefore the graph must be constrained by Crawlscope's product boundary:
+sitemap URLs are the primary audit set; extra target resolution exists only to
+validate links from those pages. The graph executor is valid only if it refuses
+to become an unlimited site crawler.
+
+### Implementation Steps
+
+1. Define graph scope formally:
+ - audit URLs from sitemap,
+ - child sitemap URLs,
+ - internal link targets from audited pages,
+ - no recursive full-site expansion unless explicitly configured.
+2. Add `PageStore` and `ResolutionStore` with stable URL normalization.
+3. Add a bounded async work queue with global and per-host limits.
+4. Add cancellation and timeout policy at the graph level.
+5. Add stable result materialization before calling aggregate rules.
+6. Add benchmark coverage for:
+ - many child sitemaps,
+ - many repeated internal links,
+ - redirects,
+ - broken targets,
+ - slow responses.
+
+## Ratings
+
+Scores are 1-5, where 5 is best.
+
+| Aspect | Plan 1: Bounded Async Fetch Layer | Plan 2: Split Crawl/Analyze | Plan 3: Unified Async Crawl Graph |
+| --- | ---: | ---: | ---: |
+| Expected wall-clock improvement | 4 | 3 | 5 |
+| Time to implement | 4 | 3 | 1 |
+| Correctness risk | 4 | 3 | 2 |
+| Deterministic reporting | 4 | 3 | 2 |
+| Public API stability | 4 | 3 | 2 |
+| Ruby 3.3 compatibility path | 4 | 5 | 3 |
+| Async ecosystem leverage | 4 | 2 | 5 |
+| Memory/socket boundability | 4 | 4 | 3 |
+| Testability | 4 | 3 | 2 |
+| Observability potential | 4 | 5 | 5 |
+| Overall score | 40 | 34 | 30 |
+
+## Recommendation
+
+Implement Plan 1 first.
+
+It attacks the most likely bottleneck, keeps the current public shape intact,
+and can be shipped in small steps. It also creates the infrastructure needed by
+the other two plans: a shared fetch executor, URL dedupe, timing metrics, and
+transport abstraction.
+
+Use this sequence:
+
+1. Add timing instrumentation and benchmark fixtures before changing behavior.
+2. Refactor the existing threaded fetcher into an explicit executor.
+3. Share the executor/cache between initial sitemap page fetches and link
+ target resolution.
+4. Add async transport with the deliberate Ruby `>= 3.3` contract change
+ documented in `UPGRADE.md`.
+5. After the fetch layer is measurable, implement the safe subset of Plan 2:
+ classify rules by phase and move only clearly per-page rules into overlapped
+ analysis.
+
+Do not start with Plan 3. It is attractive architecturally, but it creates too
+many new semantics before we have measurements. The reductio failure mode is
+also severe: Crawlscope could quietly become an unbounded crawler instead of a
+sitemap-driven validator.
+
+## Metrics To Track
+
+Add these measurements before and after each step:
+
+- total runtime,
+- sitemap expansion time,
+- audited page fetch time,
+- link target resolution time,
+- per-rule analysis time,
+- total requested URLs,
+- unique fetched URLs,
+- duplicate suppressed URLs,
+- max active fetches,
+- max active fetches per host,
+- socket/open connection count if available,
+- issue count by code,
+- output ordering stability across repeated runs.
+
+## Acceptance Criteria
+
+Plan 1 is successful when:
+
+- a delayed-response benchmark shows reduced wall-clock time at the same issue
+ count,
+- output is stable across repeated runs,
+- `CONCURRENCY=1` preserves current sequential semantics,
+- `CONCURRENCY=N` never schedules more than N active network fetches globally,
+- link target resolution reuses already fetched pages,
+- Ruby 3.3+ users have a working threaded default and an opt-in async executor.
+
+## Sources
+
+- Local implementation: `lib/crawlscope/crawler.rb`,
+ `lib/crawlscope/crawl.rb`, `lib/crawlscope/http.rb`,
+ `lib/crawlscope/sitemap.rb`, `lib/crawlscope/rules/links.rb`,
+ `lib/crawlscope/issue_collection.rb`, `crawlscope.gemspec`.
+- Async docs: https://socketry.github.io/async/
+- Async getting started: https://socketry.github.io/async/guides/getting-started/
+- Async scheduler docs:
+ https://socketry.github.io/async/source/Async/Scheduler/index.html
+- Async::HTTP getting started:
+ https://socketry.github.io/async-http/guides/getting-started/index
+- Async::HTTP::Faraday getting started:
+ https://socketry.github.io/async-http-faraday/guides/getting-started/index.html
+- RubyGems async 2.39.0:
+ https://rubygems.org/gems/async/versions/2.39.0
+- RubyGems async-http 0.95.1:
+ https://rubygems.org/gems/async-http
+- RubyGems async-http-faraday 0.22.2:
+ https://rubygems.org/gems/async-http-faraday/versions/0.22.2
diff --git a/test/crawlscope/cli_test.rb b/test/crawlscope/cli_test.rb
index b520755..1a33968 100644
--- a/test/crawlscope/cli_test.rb
+++ b/test/crawlscope/cli_test.rb
@@ -4,11 +4,12 @@
class CrawlscopeCliTest < Minitest::Test
class FakeConfiguration
- attr_accessor :base_url, :concurrency, :network_idle_timeout_seconds, :output, :renderer, :timeout_seconds
+ attr_accessor :base_url, :concurrency, :fetch_executor, :network_idle_timeout_seconds, :output, :renderer, :timeout_seconds
def initialize
@base_url = nil
@concurrency = 10
+ @fetch_executor = :async
@network_idle_timeout_seconds = 5
@renderer = :http
@timeout_seconds = 20
@@ -95,7 +96,7 @@ def test_validate_passes_arguments_to_task
err = StringIO.new
status = Crawlscope::Cli.start(
- ["validate", "--url", "https://example.com", "--sitemap", "https://example.com/sitemap-pages.xml", "--rules", "metadata,links", "--renderer", "browser", "--timeout", "30", "--network-idle-timeout", "9", "--concurrency", "3"],
+ ["validate", "--url", "https://example.com", "--sitemap", "https://example.com/sitemap-pages.xml", "--rules", "metadata,links", "--renderer", "browser", "--timeout", "30", "--network-idle-timeout", "9", "--concurrency", "3", "--fetch-executor", "async"],
out: out,
err: err,
configuration: configuration,
@@ -115,10 +116,35 @@ def test_validate_passes_arguments_to_task
assert_equal 30, configuration.timeout_seconds
assert_equal 9, configuration.network_idle_timeout_seconds
assert_equal 3, configuration.concurrency
+ assert_equal "async", configuration.fetch_executor
assert_same out, configuration.output
assert_empty err.string
end
+ def test_validate_reads_fetch_executor_from_environment
+ configuration = FakeConfiguration.new
+ task = FakeTask.new
+
+ with_env("FETCH_EXECUTOR" => "async") do
+ status = Crawlscope::Cli.start(["validate", "--url", "https://example.com"], out: StringIO.new, err: StringIO.new, configuration: configuration, task: task)
+
+ assert_equal 0, status
+ end
+
+ assert_equal "async", configuration.fetch_executor
+ end
+
+ def test_validate_uses_threaded_executor_for_browser_rendering_by_default
+ configuration = FakeConfiguration.new
+ task = FakeTask.new
+
+ status = Crawlscope::Cli.start(["validate", "--url", "https://example.com", "--renderer", "browser"], out: StringIO.new, err: StringIO.new, configuration: configuration, task: task)
+
+ assert_equal 0, status
+ assert_equal :browser, configuration.renderer
+ assert_equal :threaded, configuration.fetch_executor
+ end
+
def test_ldjson_reads_urls_from_environment
configuration = FakeConfiguration.new
task = FakeTask.new
diff --git a/test/crawlscope/configuration_test.rb b/test/crawlscope/configuration_test.rb
index 31ac7ed..48d490c 100644
--- a/test/crawlscope/configuration_test.rb
+++ b/test/crawlscope/configuration_test.rb
@@ -13,6 +13,7 @@ def test_audit_builds_from_configured_callables
config.sitemap_path = -> { "/tmp/sitemap.xml" }
config.site_name = -> { "Example" }
config.concurrency = -> { 4 }
+ config.fetch_executor = -> { :threaded }
end
audit = Crawlscope.configuration.audit
@@ -20,6 +21,7 @@ def test_audit_builds_from_configured_callables
assert_equal "https://example.com", audit.instance_variable_get(:@base_url)
assert_equal "/tmp/sitemap.xml", audit.instance_variable_get(:@sitemap_path)
assert_equal 4, audit.instance_variable_get(:@concurrency)
+ assert_equal :threaded, audit.instance_variable_get(:@fetch_executor)
assert_equal %i[
indexability
metadata
@@ -55,6 +57,7 @@ def test_defaults_are_normalized
assert_equal [200, 301, 302], config.allowed_statuses
assert_equal 10, config.concurrency
+ assert_equal :async, config.fetch_executor
assert_equal 4, config.browser_concurrency
assert_equal 5, config.network_idle_timeout_seconds
assert_equal :http, config.renderer
@@ -63,10 +66,18 @@ def test_defaults_are_normalized
assert config.scroll_page?
end
+ def test_browser_renderer_defaults_to_threaded_fetch_executor
+ config = Crawlscope::Configuration.new
+ config.renderer = :browser
+
+ assert_equal :threaded, config.fetch_executor
+ end
+
def test_configured_values_are_normalized
config = Crawlscope::Configuration.new
config.allowed_statuses = ["200", "404"]
config.concurrency = "2"
+ config.fetch_executor = "async"
config.network_idle_timeout_seconds = "7"
config.renderer = "browser"
config.timeout_seconds = "9"
@@ -74,6 +85,7 @@ def test_configured_values_are_normalized
assert_equal [200, 404], config.allowed_statuses
assert_equal 2, config.concurrency
+ assert_equal :async, config.fetch_executor
assert_equal 2, config.browser_concurrency
assert_equal 7, config.network_idle_timeout_seconds
assert_equal :browser, config.renderer
@@ -90,6 +102,15 @@ def test_renderer_must_be_supported
assert_equal "Crawlscope renderer must be http or browser", error.message
end
+ def test_fetch_executor_must_be_supported
+ config = Crawlscope::Configuration.new
+ config.fetch_executor = "processes"
+
+ error = assert_raises(Crawlscope::ConfigurationError) { config.fetch_executor }
+
+ assert_equal "Crawlscope fetch_executor must be threaded or async", error.message
+ end
+
def test_numeric_values_must_be_positive_integers
config = Crawlscope::Configuration.new
config.concurrency = "0"
diff --git a/test/crawlscope/crawl_test.rb b/test/crawlscope/crawl_test.rb
index 6940107..353cebd 100644
--- a/test/crawlscope/crawl_test.rb
+++ b/test/crawlscope/crawl_test.rb
@@ -3,6 +3,36 @@
require "test_helper"
class CrawlscopeCrawlTest < Minitest::Test
+ class RecordingExecutor
+ attr_reader :batches
+
+ def initialize
+ @batches = []
+ end
+
+ def call(urls)
+ @batches << urls
+ urls.map { |url| yield(url) }
+ end
+ end
+
+ class PageMapFetcher
+ attr_reader :closed
+
+ def initialize(pages)
+ @pages = pages
+ @closed = false
+ end
+
+ def close
+ @closed = true
+ end
+
+ def fetch(url)
+ @pages.fetch(url)
+ end
+ end
+
def setup
@tmp_dir = Dir.mktmpdir
@sitemap_path = File.join(@tmp_dir, "sitemap.xml")
@@ -12,6 +42,29 @@ def teardown
FileUtils.rm_rf(@tmp_dir)
end
+ def test_http_renderer_defaults_to_async_executor
+ crawl = Crawlscope::Crawl.new(
+ base_url: "https://example.com",
+ sitemap_path: @sitemap_path,
+ rules: [],
+ schema_registry: Crawlscope::SchemaRegistry.default
+ )
+
+ assert_equal :async, crawl.instance_variable_get(:@fetch_executor)
+ end
+
+ def test_browser_renderer_defaults_to_threaded_executor
+ crawl = Crawlscope::Crawl.new(
+ base_url: "https://example.com",
+ sitemap_path: @sitemap_path,
+ rules: [],
+ schema_registry: Crawlscope::SchemaRegistry.default,
+ renderer: :browser
+ )
+
+ assert_equal :threaded, crawl.instance_variable_get(:@fetch_executor)
+ end
+
def test_returns_ok_when_metadata_is_valid
File.write(
@sitemap_path,
@@ -56,7 +109,8 @@ def test_returns_ok_when_metadata_is_valid
base_url: "https://example.com",
sitemap_path: @sitemap_path,
rules: Crawlscope::RuleRegistry.default(site_name: "Example").rules,
- schema_registry: Crawlscope::SchemaRegistry.default
+ schema_registry: Crawlscope::SchemaRegistry.default,
+ fetch_executor: :threaded
).call
assert result.ok?
@@ -97,10 +151,11 @@ def test_collects_metadata_issues_for_invalid_page
base_url: "https://example.com",
sitemap_path: @sitemap_path,
rules: Crawlscope::RuleRegistry.default(site_name: "Example").rules,
- schema_registry: Crawlscope::SchemaRegistry.default
+ schema_registry: Crawlscope::SchemaRegistry.default,
+ fetch_executor: :threaded
).call
- refute result.ok?
+ assert result.ok?
assert_equal %i[
incomplete_open_graph_tags
meta_description_too_long
@@ -189,6 +244,31 @@ def fetch(url)
assert fake_browser.closed
end
+ def test_async_executor_requires_http_renderer
+ File.write(
+ @sitemap_path,
+ <<~XML
+
+
#{path} benchmark page
#{path} benchmark page
#{links}