feat: use HTML parser (scraper crate) instead of regex for URL extraction #40

Merged
ChefControl merged 1 commit into master from issue-26
Feb 22, 2026

Summary

  • Replace custom regex (r"https?://[\w\-.]+(?::\d+)?") with the scraper crate to parse HTML and extract URLs from <a href="..."> tags
  • Resolve relative URLs against the page's base URL using the url crate
  • Remove regex dependency from workspace and shared crate, add scraper and url
  • Update callers in feeder and manager to pass base URL to the new extract_urls(html, base_url) signature
  • Expand test suite from 6 to 14 tests covering relative paths, query strings, fragments, protocol-relative URLs, non-http scheme filtering, and invalid base URL handling

Test plan

  • All 40 shared crate tests pass (cargo test -p shared)
  • Full project compiles (cargo check)
  • Manual integration test: crawl a site and verify URLs with paths/query strings are now discovered

Closes #26

🤖 Generated with Claude Code

…tion

Replace the custom regex URL extractor with the `scraper` crate to
properly parse HTML and extract URLs from `<a href="...">` tags.
Relative URLs are now resolved against the page's base URL using the
`url` crate. This discovers more URLs (paths, query strings, fragments)
and produces more accurate results.

Closes #26

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
ChefControl merged commit 1b1bf65 into master on Feb 22, 2026
1 check passed
ChefControl deleted the issue-26 branch on February 22, 2026, 19:07