Switch URL discovery to sitemap-driven ingestion #18

Open
Copilot wants to merge 24 commits into main from copilot/update-url-extraction-from-sitemap

Conversation

Copilot AI commented Feb 1, 2026

Contributor

The tool now sources registration URLs from sitemaps instead of the Blogger API to broaden applicability while keeping the Indexing API workflow unchanged.

  • Sitemap ingestion

    • Fetch sitemap + sitemap index URLs (queue-based traversal) and register extracted URLs into Firestore.
    • Warn on non-HTTPS sitemap URLs and skip non-HTTPS content URLs.
  • Security & robustness

    • Safe XML parsing via defusedxml, gzip/encoding-aware sitemap decoding, SSL bundle verification.
    • Clear error messages for fetch, SSL, parse, and decompression failures.
  • Normalization

    • Trim whitespace, remove query/fragment, and unify trailing slashes to reduce duplicate URL registrations.
  • Configuration & docs

    • Replace Blogger env vars with SITEMAP_URL and update README + dependency list.
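The normalization step listed above could be sketched roughly as follows. This is not the code from the PR: the function name `normalize_url` is hypothetical, and "unify trailing slashes" is assumed here to mean stripping them from every path except the root.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(raw_url: str) -> str:
    """Trim whitespace, drop query and fragment, and unify trailing slashes."""
    parts = urlsplit(raw_url.strip())
    # Assumption: trailing slashes are removed, but the root path stays "/".
    path = parts.path.rstrip("/") or "/"
    # Empty strings for query and fragment remove both components.
    return urlunsplit((parts.scheme, parts.netloc, path, "", ""))
```

With this rule, `https://example.com/post/?utm=1#top` and `https://example.com/post` collapse to the same key, which is what keeps them from being registered twice.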

Example:

# URL discovery now starts from a sitemap URL
register_sitemap_urls_to_firestore(env["sitemap_url"])
Original prompt

This section details the original issue you should resolve

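<issue_title>[Feature request] Get the URLs to register from a sitemap</issue_title>
<issue_description>## Overview
Change the specification so that the URLs to register are obtained from a sitemap.
Because the tool will no longer be tied to Blogger, also rename it so that the name emphasizes its purpose of registering URLs with Google Search Console.

Background

Keep Blogger supported while making the tool more general-purpose.</issue_description>

<agent_instructions>The tool name does not need to be changed. In addition to modifying the Python code, update README.md as needed.</agent_instructions>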

Comments on the Issue (you are @copilot in this section)


Copilot AI and others added 12 commits February 1, 2026 15:37
Co-authored-by: Shinoryo <78910349+Shinoryo@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Update URL extraction from sitemap for tool flexibility" to "Switch URL discovery to sitemap-driven ingestion" Feb 1, 2026
Copilot AI requested a review from Shinoryo February 1, 2026 15:48
@Shinoryo Shinoryo marked this pull request as ready for review February 1, 2026 16:15

@Shinoryo Shinoryo left a comment

Owner

@copilot Please check the comments and make fixes where needed.

Comment thread blogger_register/blogger_register.py Outdated
Comment thread blogger_register/blogger_register.py
Comment thread blogger_register/blogger_register.py Outdated
Comment thread blogger_register/blogger_register.py
Comment thread blogger_register/blogger_register.py Outdated
Comment thread blogger_register/blogger_register.py Outdated
Copilot AI and others added 7 commits February 1, 2026 16:28
Co-authored-by: Shinoryo <78910349+Shinoryo@users.noreply.github.com>
Copilot AI requested a review from Shinoryo February 1, 2026 16:35

@Shinoryo Shinoryo left a comment

Owner

@copilot Please fix these.

Comment thread blogger_register/blogger_register.py Outdated
Comment thread blogger_register/blogger_register.py
Comment thread blogger_register/blogger_register.py Outdated
Comment thread blogger_register/blogger_register.py Outdated
Copilot AI and others added 2 commits February 2, 2026 00:34
Co-authored-by: Shinoryo <78910349+Shinoryo@users.noreply.github.com>

@Shinoryo Shinoryo left a comment

Owner

I have reviewed the changes.

Copilot AI left a comment

Contributor

Pull request overview

This pull request switches the URL discovery mechanism from the Blogger API to sitemap-based ingestion, making the tool applicable to any website with a sitemap. The core Indexing API workflow remains unchanged.

Changes:

  • Replaced Blogger API integration with sitemap XML parsing using queue-based traversal
  • Added security features including defusedxml for safe XML parsing, SSL verification, and HTTPS-only URL filtering
  • Implemented URL normalization to reduce duplicate registrations
  • Updated environment variables from Blogger-specific keys to SITEMAP_URL
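As a rough illustration of the parsing side of these changes, the sketch below reads one sitemap document, handles a gzip-compressed body, and separates child sitemap URLs from content URLs. It is not taken from the PR: `parse_sitemap` and its return shape are hypothetical, gzip is detected by magic bytes as an assumption, and the fallback to the standard-library parser exists only so the sketch runs without `defusedxml` installed (the standard parser is not hardened against malicious XML).

```python
import gzip

try:
    # Hardened parser, the dependency this PR adds.
    from defusedxml.ElementTree import fromstring
except ImportError:
    # Illustration-only fallback; not safe for untrusted XML.
    from xml.etree.ElementTree import fromstring

GZIP_MAGIC = b"\x1f\x8b"


def _local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def parse_sitemap(body: bytes) -> tuple[list[str], list[str]]:
    """Return (child_sitemap_urls, content_urls) for one sitemap document."""
    if body.startswith(GZIP_MAGIC):
        body = gzip.decompress(body)
    root = fromstring(body)
    # Only <loc> elements directly under each <sitemap> or <url> entry count.
    locs = [
        field.text.strip()
        for entry in root
        for field in entry
        if _local_name(field.tag) == "loc" and field.text
    ]
    if _local_name(root.tag) == "sitemapindex":
        return locs, []
    # Content URLs that are not HTTPS are skipped, as the PR describes.
    return [], [loc for loc in locs if loc.startswith("https://")]
```

A sitemap index yields only child sitemap URLs, which a caller would feed back into its traversal queue; a regular `urlset` yields only content URLs.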

Reviewed changes

Copilot reviewed 3 out of 4 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| requirements.txt | Added defusedxml dependency for secure XML parsing |
| blogger_register/blogger_register.py | Replaced Blogger API calls with sitemap fetching/parsing functions, updated environment variables, added URL normalization and security features |
| README.md | Updated documentation to reflect sitemap-based URL discovery instead of Blogger API |
| LICENSES/defusedxml==0.7.1 | Added license file for new defusedxml dependency |

Comment thread blogger_register/blogger_register.py
Comment thread blogger_register/blogger_register.py
Comment thread blogger_register/blogger_register.py
Comment on lines +424 to +427
if (
    child_url not in visited_sitemaps
    and child_url not in pending_sitemaps
):

Copilot AI Feb 2, 2026

The in operator on a deque has O(n) time complexity. For large sitemaps with many child sitemaps, this check could become a performance bottleneck. Consider converting pending_sitemaps membership checks to use a set for O(1) lookups, or maintain a separate set to track pending sitemap URLs.
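One possible shape for that suggestion is sketched below, under assumptions: `traverse_sitemaps` and the injected `fetch_children` callable are hypothetical names, and a single `seen` set stands in for both `visited_sitemaps` and the pending check, since every URL is added to it at the moment it is enqueued.

```python
from collections import deque
from typing import Callable, Iterable


def traverse_sitemaps(
    root_url: str,
    fetch_children: Callable[[str], Iterable[str]],
) -> list[str]:
    """Visit each sitemap once, breadth-first, with O(1) duplicate checks."""
    pending = deque([root_url])  # FIFO order of sitemaps still to fetch
    seen = {root_url}            # every URL ever enqueued (visited or pending)
    visited_order = []
    while pending:
        current = pending.popleft()
        visited_order.append(current)
        for child_url in fetch_children(current):
            # Set membership is O(1); `in` on the deque would scan it.
            if child_url not in seen:
                seen.add(child_url)
                pending.append(child_url)
    return visited_order
```

The deque still provides the traversal order, while the set answers the membership question, so cycles between sitemap indexes do not cause repeated fetches.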
Add the content_encoding parameter in the Args section

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Development

Successfully merging this pull request may close these issues.

[Feature request] Get the URLs to register from a sitemap

3 participants