Switch URL discovery to sitemap-driven ingestion #18
Co-authored-by: Shinoryo <78910349+Shinoryo@users.noreply.github.com>
Pull request overview
This pull request switches the URL discovery mechanism from the Blogger API to sitemap-based ingestion, making the tool applicable to any website with a sitemap. The core Indexing API workflow remains unchanged.
Changes:
- Replaced Blogger API integration with sitemap XML parsing using queue-based traversal
- Added security features including defusedxml for safe XML parsing, SSL verification, and HTTPS-only URL filtering
- Implemented URL normalization to reduce duplicate registrations
- Updated environment variables from Blogger-specific keys to `SITEMAP_URL`
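
The queue-based traversal and HTTPS-only filtering listed above could look roughly like the sketch below. This is not the PR's code: `collect_urls`, the injected `fetch` callable, and the in-memory `FAKE_SITE` are hypothetical names chosen for illustration, and the fallback to the standard-library parser exists only so the sketch runs when defusedxml is not installed.

```python
from collections import deque
from typing import Callable, List

try:
    # defusedxml is the dependency this PR adds for safe XML parsing.
    from defusedxml import ElementTree as ET
except ImportError:
    # Fallback so the sketch still runs; not suitable for untrusted XML.
    import xml.etree.ElementTree as ET

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def collect_urls(root_url: str, fetch: Callable[[str], bytes]) -> List[str]:
    """Walk a sitemap (or sitemap index) breadth-first and return HTTPS page URLs."""
    pending = deque([root_url])
    visited = set()      # sitemap URLs already fetched
    seen_pages = set()   # page URLs already collected
    pages: List[str] = []
    while pending:
        sitemap_url = pending.popleft()
        if sitemap_url in visited:
            continue
        visited.add(sitemap_url)
        root = ET.fromstring(fetch(sitemap_url))
        is_index = root.tag == SITEMAP_NS + "sitemapindex"
        for loc in root.iter(SITEMAP_NS + "loc"):
            url = (loc.text or "").strip()
            if not url.startswith("https://"):
                continue  # HTTPS-only filtering
            if is_index:
                pending.append(url)  # child sitemap: fetch it later
            elif url not in seen_pages:
                seen_pages.add(url)
                pages.append(url)
    return pages


# Hypothetical in-memory site standing in for real HTTP requests.
FAKE_SITE = {
    "https://example.com/sitemap.xml": b"""<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://example.com/posts.xml</loc></sitemap>
</sitemapindex>""",
    "https://example.com/posts.xml": b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/a</loc></url>
  <url><loc>http://example.com/insecure</loc></url>
  <url><loc>https://example.com/b</loc></url>
  <url><loc>https://example.com/a</loc></url>
</urlset>""",
}

result = collect_urls("https://example.com/sitemap.xml", FAKE_SITE.__getitem__)
```

Passing `fetch` in as a parameter keeps the traversal testable without network access; a real implementation would presumably wrap an HTTP client with SSL verification enabled.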
Reviewed changes
Copilot reviewed 3 out of 4 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| requirements.txt | Added defusedxml dependency for secure XML parsing |
| blogger_register/blogger_register.py | Replaced Blogger API calls with sitemap fetching/parsing functions, updated environment variables, added URL normalization and security features |
| README.md | Updated documentation to reflect sitemap-based URL discovery instead of Blogger API |
| LICENSES/defusedxml==0.7.1 | Added license file for new defusedxml dependency |
    if (
        child_url not in visited_sitemaps
        and child_url not in pending_sitemaps
    ):
The `in` operator on a `deque` has O(n) time complexity. For large sitemaps with many child sitemaps, this check could become a performance bottleneck. Consider replacing the `pending_sitemaps` membership check with a set lookup for O(1) performance, or maintain a separate set that tracks pending sitemap URLs alongside the deque.
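
One way to apply this suggestion is to pair the deque with a set that records every URL ever enqueued. The sketch below is only an illustration: `SitemapQueue` is a hypothetical helper, not a name from the PR.

```python
from collections import deque


class SitemapQueue:
    """FIFO queue of sitemap URLs with O(1) duplicate checks."""

    def __init__(self) -> None:
        self._pending = deque()
        # Every URL ever enqueued, whether still pending or already popped.
        self._seen = set()

    def push(self, url: str) -> bool:
        """Enqueue url unless it was enqueued before; return True if added."""
        if url in self._seen:  # set lookup: O(1) on average
            return False
        self._seen.add(url)
        self._pending.append(url)
        return True

    def pop(self) -> str:
        """Remove and return the oldest pending URL."""
        return self._pending.popleft()

    def __bool__(self) -> bool:
        return bool(self._pending)
```

Because a URL stays in `_seen` after it is popped, a single set covers both the "visited" and the "pending" checks from the reviewed condition.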
Add the content_encoding parameter in the Args section
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
The tool now sources registration URLs from sitemaps instead of the Blogger API to broaden applicability while keeping the Indexing API workflow unchanged.
Sitemap ingestion
Security & robustness
Normalization
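
The PR text does not spell out which normalization rules it applies, so the following is only a sketch under assumed rules: lowercase the scheme and host, drop the default port, drop the fragment, and treat an empty path as `/`. The function name `normalize_url` is hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit

# Default ports that can be dropped without changing the target (assumed rule).
DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize_url(url: str) -> str:
    """Return a canonical form of url so equivalent spellings compare equal."""
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    # hostname is already lowercased by urlsplit; userinfo is dropped here.
    host = parts.hostname or ""
    if parts.port is not None and parts.port != DEFAULT_PORTS.get(scheme):
        host = f"{host}:{parts.port}"
    path = parts.path or "/"
    # The fragment is discarded because it never reaches the server.
    return urlunsplit((scheme, host, path, parts.query, ""))
```

Two spellings of the same page, such as `HTTPS://Example.com:443/post#top` and `https://example.com/post`, then map to one string, which is what reduces duplicate registrations.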
Configuration & docs
Replace the Blogger-specific environment variables with `SITEMAP_URL` and update README + dependency list. Example:
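
A minimal sketch of how a script might read this setting. Only the variable name `SITEMAP_URL` comes from the PR; the helper `load_sitemap_url`, the error messages, and the requirement that the sitemap URL itself be HTTPS are assumptions.

```python
import os
from typing import Mapping, Optional


def load_sitemap_url(env: Optional[Mapping[str, str]] = None) -> str:
    """Read SITEMAP_URL from the environment and apply basic validation."""
    env = os.environ if env is None else env
    url = (env.get("SITEMAP_URL") or "").strip()
    if not url:
        raise RuntimeError("SITEMAP_URL is not set")
    # Assumption: the sitemap itself must be served over HTTPS.
    if not url.startswith("https://"):
        raise RuntimeError("SITEMAP_URL must be an https:// URL")
    return url
```

Accepting an optional mapping makes the function easy to exercise without touching the real process environment.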