
fix: XML-entity-decode sitemap <loc> URLs (#50) #51

Merged
arberx merged 1 commit into main from arberx/windhoek-v2
Jun 17, 2026


arberx commented on Jun 17, 2026

Member

Fixes #50.

Spec-compliant sitemaps escape `&` in `<loc>` URLs as `&amp;` (per sitemaps.org), but `parseSitemapXml` passed the literal `...&amp;...` to the fetcher, which the origin treats as a different (usually empty) request. As a result, a sitemap index in which every child `<loc>` carries query params (e.g. BigCommerce) aborted with `No auditable URLs found in sitemap.`, and flat `<urlset>` pages were silently dropped.

Both the `urlset` and `sitemapindex` branches now decode the five predefined XML entities plus decimal/hex numeric character references. `&amp;` is resolved last, and out-of-range refs are left untouched so a malformed sitemap can't throw. There is no API or scoring change.

Adds 4 unit tests (urlset query params, the sitemap-index repro, and numeric/hex/ampersand-last ordering) and bumps the version to 4.0.1 with a CHANGELOG entry. Full suite: 363/363 pass, typecheck and lint clean.
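
For readers without the diff open, the decoding described above can be sketched roughly as follows. This is an illustration only, not the code merged in this PR: `decodeXmlEntities` and `NAMED` are hypothetical names, and the sketch uses a single regex pass instead of sequential replacements. A single pass never re-scans its own output, so it gives the same no-double-decoding guarantee that resolving `&amp;` last is meant to provide.

```typescript
// Hypothetical helper for illustration; not taken from the PR's diff.
// The five predefined XML entities, keyed by name.
const NAMED: Record<string, string> = {
  amp: "&",
  lt: "<",
  gt: ">",
  quot: '"',
  apos: "'",
};

function decodeXmlEntities(raw: string): string {
  // One pass over the input: each reference is decoded exactly once,
  // so "&amp;lt;" becomes "&lt;" and is never decoded again to "<".
  return raw.replace(
    /&(?:#x([0-9a-fA-F]+)|#([0-9]+)|(amp|lt|gt|quot|apos));/g,
    (match: string, hex?: string, dec?: string, name?: string): string => {
      if (name !== undefined) {
        return NAMED[name] ?? match;
      }
      const codePoint =
        hex !== undefined ? parseInt(hex, 16) : parseInt(dec as string, 10);
      // String.fromCodePoint throws a RangeError above 0x10FFFF, so an
      // out-of-range reference is returned unchanged instead.
      if (codePoint > 0x10ffff) {
        return match;
      }
      return String.fromCodePoint(codePoint);
    },
  );
}

// Example with a made-up sitemap-index child URL.
const loc = "https://example.com/xmlsitemap.php?type=products&amp;page=1";
console.log(decodeXmlEntities(loc));
// prints "https://example.com/xmlsitemap.php?type=products&page=1"
```

Unknown named entities such as `&nbsp;` do not match the pattern and pass through unchanged, which keeps the helper from throwing on a malformed sitemap.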

🤖 Generated with Claude Code

Spec-compliant sitemaps escape `&` in `<loc>` URLs as `&amp;` per
sitemaps.org. parseSitemapXml passed the literal `...&amp;...` to the
fetcher, which the origin treats as a different (usually empty) request.
On a sitemap index — where every child <loc> carries query params
(BigCommerce, paginated CMS sitemaps) — every child fetch failed and the
audit aborted with "No auditable URLs found in sitemap."; flat <urlset>
pages were silently dropped.

Both the urlset and sitemapindex branches now decode the five predefined
XML entities plus decimal/hex numeric character references (&amp; last,
out-of-range refs left untouched so a malformed sitemap can't throw).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
arberx merged commit c87fcfd into main on Jun 17, 2026
2 checks passed
Development

Successfully merging this pull request may close these issues.

Sitemap <loc> URLs are not XML-entity-decoded: &amp; query params break fetch (sitemap-index returns "No auditable URLs found")
