
Integrate Scrapy-based scrapers into Archi interfaces#547

Open
nausikt wants to merge 63 commits into archi-physics:dev from nausikt:ref/scrapers-to-scrapy

Conversation


@nausikt nausikt commented Apr 8, 2026

Resolves #464. Part of #546.

This PR includes:

Migrate the web scraping backend from custom scrapers to the Scrapy framework.

Architecture

  • Replace legacy scraper.py, scraper_manager.py, and integrations/ (sso_scraper, git_scraper) with a Scrapy-based spider architecture
  • New ScraperManager orchestrates all web crawls via a single CrawlerProcess (one Twisted reactor, all spiders concurrent)
  • Ships with safe-by-default crawler settings in src/data_manager/collectors/scrapers/settings.py,
    e.g. CONCURRENT_REQUESTS=1, CONCURRENT_REQUESTS_PER_DOMAIN=1, etc.

Spiders can tighten these further via custom_settings (e.g. TWiki sets DOWNLOAD_DELAY: 60, Discourse sets CLOSESPIDER_PAGECOUNT: 500).

  • Decouple Git collection into standalone GitManager with its own GitResource model
  • New AuthDownloaderMiddleware + CERNSSOProvider for SSO authentication (replaces sso_scraper)
  • Resource Adapters act as a clear boundary/bridge between Scrapy Items and Archi's ScrapedResources in the PersistencePipeline.
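The spider-level overrides mentioned above follow Scrapy's settings-precedence model, where a spider's custom_settings win over project-level defaults. A minimal self-contained sketch (PROJECT_DEFAULTS, TwikiSpiderSettings, and effective_settings are illustrative stand-ins, not the PR's actual code):

```python
# Hypothetical defaults mirroring the safe-by-default settings.py values.
PROJECT_DEFAULTS = {
    "CONCURRENT_REQUESTS": 1,
    "CONCURRENT_REQUESTS_PER_DOMAIN": 1,
    "DOWNLOAD_DELAY": 10,
}

class TwikiSpiderSettings:
    # Per-spider overrides, in the spirit of Scrapy's Spider.custom_settings
    custom_settings = {"DOWNLOAD_DELAY": 60}

def effective_settings(defaults, custom):
    """Spider-level custom settings take priority over project defaults."""
    merged = dict(defaults)
    merged.update(custom or {})
    return merged

settings = effective_settings(PROJECT_DEFAULTS, TwikiSpiderSettings.custom_settings)
# TWiki's 60 s delay wins; untouched defaults remain at their safe values.
```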

Spiders

  • LinkSpider — generic link-following spider with configurable allow/deny, max_depth, domain scoping
  • TwikiSpider — extends LinkSpider with TWiki-specific URL normalization, deny patterns, and parser
  • DiscourseSpider — JSON API pagination spider for Discourse forums, seeded by category URLs from input lists
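The allow/deny, max_depth, and domain-scoping behavior of LinkSpider might gate candidate links roughly like the sketch below (should_follow and its parameters are illustrative, not the PR's implementation):

```python
import re
from urllib.parse import urlparse

def should_follow(url, depth, *, allow=(), deny=(), allowed_domains=(), max_depth=2):
    """Illustrative gate for a LinkSpider-style crawler: depth cutoff,
    domain scoping, then deny-before-allow regex filters."""
    if depth > max_depth:
        return False
    host = urlparse(url).netloc
    if allowed_domains and host not in allowed_domains:
        return False
    if any(re.search(pattern, url) for pattern in deny):
        return False
    if allow and not any(re.search(pattern, url) for pattern in allow):
        return False
    return True
```

For example, a TWiki-flavored configuration could scope to twiki.cern.ch and deny revision-history URLs via a deny pattern like r"\?rev=".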

UX: Unified web source configuration

  • Single web.input_lists for all web URLs — domain-based routing automatically assigns the correct spider per URL
  • Per-site behavior overrides under web.sites (twiki, discourse) for delay, allow/deny, keywords, etc.
  • Fallback spider (default: link) for URLs that don't match any configured site domain
  • Same input_lists pattern for git sources
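The domain-based routing described above could look like this in spirit (the registry contents and the route_url helper are illustrative; the PR's actual registry maps domains to spider classes):

```python
from urllib.parse import urlparse

# Hypothetical registry; spider names stand in for the real spider classes.
DOMAIN_SPIDER_REGISTRY = {
    "twiki.cern.ch": "twiki",
    "cms-talk.web.cern.ch": "discourse",
}

def route_url(url, fallback="link"):
    """Pick a spider for a URL by its domain; URLs whose domain matches
    no configured site fall back to the generic link spider."""
    return DOMAIN_SPIDER_REGISTRY.get(urlparse(url).netloc, fallback)
```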

Pipelines & processing

  • PersistencePipeline — adapts Scrapy Items to Archi's ScrapedResource and persists via PersistenceService
  • AnonymizationPipeline — inline data anonymization during crawl using existing Anonymization utils.
  • MarkitdownPipeline — HTML/RSS to Markdown conversion as a second pass (builds partially on top of the Markitdown work in Liv's feat(data-manager): add Indico scraper integration #550)
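The adapter boundary inside PersistencePipeline can be pictured as a small translation function. ScrapedResource's fields and adapt_item below are assumptions for the sketch, not the PR's real models:

```python
from dataclasses import dataclass

@dataclass
class ScrapedResource:
    """Illustrative stand-in for Archi's persistence model."""
    url: str
    title: str
    content: str
    source_kind: str = "web"

def adapt_item(item: dict) -> ScrapedResource:
    """Boundary function in the spirit of the PR's resource adapter:
    translate a Scrapy-style item dict into the persistence model,
    tolerating optional fields."""
    return ScrapedResource(
        url=item["url"],
        title=item.get("title", ""),
        content=item.get("content", ""),
    )
```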

Example & config

  • examples/deployments/basic-scraping/ — complete deployment example with twiki.list, cms-talk.list, miscellanea.list, git.list
  • Updated base-config.yaml template with new web.input_lists + web.sites schema
  • New Indico scraper params from feat(data-manager): add Indico scraper integration #550, for reference.

Tests

  • Offline unit tests for TWiki parser with real HTML fixtures
  • Scrapy contract checks for LinkSpider and TwikiSpider
  • Unit tests for resource adapter (Scrapy Item → ScrapedResource)

53 files changed, ~3970 insertions, ~1848 deletions

End-to-End Integrated Archi Infrastructure Test

Please feel free to adjust examples/deployments/basic-scraping/config.yaml as you see fit. Currently, it might take a very long time to pass through every web source.

  • LinkSpider for MIT sources

    may take less than 1 hr.

  • BUT... TWiki HeavyIon might take

    at least (300 + 100 from CRAB) docs x 60 seconds (crawl delay)
    ~= 24,000 seconds (6-7 hrs)

  • Discourse may take

    10 seconds (crawl delay) each for 47 pages + at least 500 docs
    ~= 1 hr 40 mins

  • new GitManager, as I've tested on dmwm/CRABServer and dmwm/CRABClient:

    the 2 repos have ~400 files;
    it takes less than an hour and works as efficiently as before.

Expected Result
[1] All sources are ingested, which may take hours to finish.
[Screenshot: 2026-04-08 18:53]
[2] Most of our docs should be in anonymized, Markdown format right away.
P.S. Anonymized as well as the NLP module and our heuristic-based (hard-coded) per-source name-replacement patterns allow.
[Screenshot]

Standalone Scrapy Test

Spiders include Scrapy contracts — lightweight inline assertions that validate against real endpoints without a full deployment:

# Run all spider contracts (link, twiki, discourse)
scrapy check
# Run a specific spider's contracts
scrapy check link
scrapy check twiki

Run individual spiders standalone, independently of the Archi architecture, with -a args:

# Link spider with custom depth and delay
scrapy crawl link -a start_urls='["https://quotes.toscrape.com/"]' -a max_depth=2 -a delay=1

# TWiki spider (public page, no SSO)
scrapy crawl twiki -a start_urls='["https://twiki.cern.ch/twiki/bin/view/CMSPublic/SWGuideCrab"]' -a max_depth=1 -a max_pages=5

# Discourse spider with category URLs
scrapy crawl discourse -a start_urls='["https://cms-talk.web.cern.ch/c/offcomp/ais/150"]' -a domain=cms-talk.web.cern.ch -a max_pages=10

Unit tests (no network required):

pytest tests/unit/test_twiki_parser.py tests/unit/test_scrapers_resource_adapter.py -v

Misc

There's one caveat for SSO Playwright + Chromium to work as expected inside containers. Please be aware:

  • On macOS with an ARM-based CPU, make sure Rosetta is enabled for your Podman machine (podman machine set --rootful and verify Rosetta is available in the VM).

Otherwise, Playwright may fall back to a QEMU-emulated Chromium instance, which causes race conditions and flaky SSO authentication!

  • Linux x86_64, production machines, and VMs should work fine out of the box.

@nausikt nausikt mentioned this pull request Apr 8, 2026
Author

nausikt commented Apr 8, 2026

Note that we still lack the mit_sso auth provider, which will be added ASAP.

For now, we can work around this by testing MIT sources and CERN sources in a mutually exclusive manner.

Collaborator

pmlugato commented Apr 8, 2026

Hi @nausikt, thanks for this PR! Looking into it and testing now. One thing I noticed right away: please add any new packages to the requirements files. Having them in the pyproject is fine for now for testing, but this way the images will be updated accordingly once merged to main.

Author

nausikt commented Apr 8, 2026

Oh dear, let me fix it now!

Comment on lines +44 to +51
urls:
- https://ppc.mit.edu/news/
max_depth: 2
max_pages: 100
delay: 10
markitdown: true
input_lists:
- examples/deployments/basic-scraping/miscellanea.list
Collaborator


So you can pass urls either directly in the config or via a .list file as before? For the latter case, the file can contain both public and SSO URLs, without any need for a prefix, right?

Author


Exactly, Pietro! For all scrapers (spiders), by design.

Comment on lines +92 to +95
git:
  urls:
  - https://github.com/dmwm/CRABServer
  - https://github.com/dmwm/CRABClient
Collaborator

@pmlugato pmlugato Apr 8, 2026


Unlike web.links, do web.git and web.twiki only take urls directly in the config? If you could also support an input_list here, it would be nice for people who want to keep it separate.

Also, is there a reason twiki takes a separate list? I understand the specific configuration options it needs, but why can't the twiki urls live in the same list as the other links, with the twiki options under web.links.twiki? To support different delay times more easily, maybe?

Comment: I think it's better to have, e.g., links.list and git.list, as it would be in this case, than how we had it before with the prefixes, so separate input lists is good.

Author


unlike web.links, web.git and web.twiki only take the urls directly in the config? if you can also support an input_list here, would be nice for people who want to keep it separate.

For web.git, yes, it currently only takes urls. Roger that, will support input_list there!

Author


web.link, web.twiki and web.discourse support both urls and input_list.

Author

@nausikt nausikt Apr 8, 2026


Also, is there a reason twiki takes a separate list? I understand the specific configuration options it needs, but why can't the twiki urls live in same list as other links and twiki options would be under web.links.twiki? to support different delay times more easily maybe?

@pmlugato Got your point! Agree the UX can be better by not taking a separate list; will converge! 🫡

For clarity: only one web.links.input_lists as the main portal, with site/scraper-specific settings alongside under the web.links.<site|spider> pattern.

    web:
        links:
          ### Global default Link configs go here, implicitly set
          max_depth: 2
          max_pages: 100
          delay: 10
          markitdown: true
          input_lists:  # Only 1 portal, we pour every link here... as long as it fits the scraper's nature.
          - examples/deployments/basic-scraping/miscellanea.list
          ### Site/Spider non-list-related/specific configuration goes below.
          # <spider/site>:
          twiki:
              delay: 60  # <---- overrides global
              deny: ....
              allow: ....
              # .... w/o list
          discourse:
              category_paths:
                - ....
              delay: 10
              keywords: ...
          indico:   # <--- the upcoming IndicoSpider will stay at this level as well.
              keywords:

Then, behind the scenes, Archi's scraper_manager is automatically aware of each Spider and its config via a newly introduced registry instead.

DOMAIN_SPIDER_REGISTRY = {
    "twiki.cern.ch": TWikiSpider,
    "cms-talk.web.cern.ch": DiscourseSpider,
    "indico.cern.ch": IndicoSpider,
}

Is this what you had in mind? But this registry has to be configured by users somewhere anyhow 🤔

Otherwise, how can scraper_manager resolve a link and know its source_kind from a flat list of urls? But I don't like the prefix approach! IMO it deteriorates the UX.

Author

@nausikt nausikt Apr 8, 2026


How about just adding a domain key for scraper_manager to pick up as the registry!

          links:
              input_lists:
              - ....
          twiki:
              domain: "twiki.cern.ch"
              delay: 60  # <---- overrides global
              deny: ....
              allow: ....
              # .... w/o list
          discourse:
              domain: "cms-talk.web.cern.ch"
              category_paths:
                - ....
              delay: 10
              keywords: ...
          indico:   # <--- the upcoming IndicoSpider will stay at this level as well.
              domain: "indico.cern.ch"
              keywords:
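A sketch of how scraper_manager could derive the registry from domain keys like the ones above, so users never edit a registry directly (build_registry and the config keys are assumptions based on this thread, not the final schema):

```python
def build_registry(web_cfg, static_keys=("links",)):
    """Collect domain -> spider-name mappings from per-spider sections,
    skipping non-spider keys such as the shared "links" section."""
    registry = {}
    for spider_name, sub in web_cfg.items():
        if spider_name in static_keys or not isinstance(sub, dict):
            continue
        domain = sub.get("domain")
        if domain:
            registry[domain] = spider_name
    return registry

web_cfg = {
    "links": {"input_lists": ["miscellanea.list"]},
    "twiki": {"domain": "twiki.cern.ch", "delay": 60},
    "discourse": {"domain": "cms-talk.web.cern.ch", "delay": 10},
}
# build_registry(web_cfg) ->
# {'twiki.cern.ch': 'twiki', 'cms-talk.web.cern.ch': 'discourse'}
```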

Author

@nausikt nausikt Apr 8, 2026


This way we give users the best UX: spider resolution is transparent and has the smallest footprint in the user's config.

With probably only one caveat[?] for now: we can't have two twiki/discourse/indico source instances at the same time.

          indico:
              domain: "indico.cern.ch", "indico.mit.edu"
              keywords: ...

Author


Comment: I think it's better to have, e.g., links.list and git.list, as it would be in this case, than how we had it before with the prefixes, so separate input lists is good.

Ack on the comment.

BTW, about naming:

  1. Would you prefer me to rename git.input_lists -> git.list as well? Or avoid the word "list" and keep it the same?
  2. Would you like me to completely drop git.urls and web.links.urls and switch to the input_lists style only, to avoid confusing users? Preserving only one standard way?
  • Although, personally, web.urls is fairly convenient for peeking at/editing everything at a glance in the same config.yaml, so it may just be better UX for debugging.

Comment thread src/cli/managers/config_manager.py Outdated
Comment on lines +271 to +283
if not isinstance(sources_section, dict):
    continue
web = sources_section.get("web", {}) or {}
if not isinstance(web, dict):
    continue
for spider_key, sub in web.items():
    if spider_key in _WEB_TOP_LEVEL_STATIC_KEYS:
        continue
    if not isinstance(sub, dict):
        continue
    wlists = sub.get("input_lists") or []
    if isinstance(wlists, list):
        collected.extend(wlists)
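For illustration, a self-contained rendition of the collection loop above; the contents of _WEB_TOP_LEVEL_STATIC_KEYS here are assumed, not copied from the PR:

```python
_WEB_TOP_LEVEL_STATIC_KEYS = {"links", "urls", "input_lists"}  # assumed contents

def collect_web_input_lists(sources_section):
    """Gather input_lists from every per-spider sub-section of web,
    skipping the static top-level keys, as the snippet above does."""
    collected = []
    if not isinstance(sources_section, dict):
        return collected
    web = sources_section.get("web", {}) or {}
    if not isinstance(web, dict):
        return collected
    for spider_key, sub in web.items():
        if spider_key in _WEB_TOP_LEVEL_STATIC_KEYS:
            continue
        if not isinstance(sub, dict):
            continue
        wlists = sub.get("input_lists") or []
        if isinstance(wlists, list):
            collected.extend(wlists)
    return collected
```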
Author

@nausikt nausikt Apr 8, 2026


@pmlugato input_lists are understood by all scrapers (Spiders) here, and no prefix like sso-, eos-, or git-* is required in any urls.

For SSO-protected urls, though, we must explicitly provide the right auth_provider_name: cern_sso|.... ourselves. Currently the SSO-protected url has to come first, but by design this should not be a caveat: URLs should work in any order!

Collaborator

pmlugato commented Apr 8, 2026

@nausikt thanks a lot for this PR, it looks really nice! I have tested it with a few different source types and configurations and everything seems to be working smoothly which is great! Thanks a lot also for the active iteration offline :)

I would be happy to merge this into dev soon; maybe we can have one more person look at it, but that can also be done after the fact...

One request before doing so: if you could write some nice documentation about all of this in docs/, it would be great, including deprecating the old version of things there. Once that's done, I think we should be almost good to go into dev.

Thanks a lot again for all the hard work!

Author

nausikt commented Apr 8, 2026

@pmlugato Thanks to you guys for having me on board!

One last thing! I've summarized all my micro-responses for you to finalize here, before I dive into converging on your review and the docs.

    web:
        links:
          ### Global default Link configs go here, implicitly set
          max_depth: 2
          max_pages: 100
          delay: 10
          markitdown: true
          # Only 1 input_lists portal, we pour every link here..
          input_lists:
          - examples/deployments/basic-scraping/miscellanea.list
          ### Site/Spider non-list-related/specific configuration goes below.
          # <spider/site>:
          twiki:
              domain: "twiki.cern.ch"
              delay: 60  # <---- overrides global
              deny: ....
              allow: ....
              # .... w/o list
          discourse:
              domain: "cms-talk.web.cern.ch"
              category_paths:
                - ....
              delay: 10
              keywords: ...
          indico:   # <--- the upcoming IndicoSpider will stay at this level as well.
              domain: "indico.cern.ch"
              keywords:
  1. Is this the design you had in mind?
  2. I will deprecate <manager>.urls and stick with the web.links.input_lists / git.input_lists pattern.
  3. Shall we reduce to web.input_lists? No need for the redundant links under web.links (I highly encourage this; nesting/indentation has a significant chance of confusing users, so this should be better UX).

Author

nausikt commented Apr 8, 2026

Agreed with what we discussed offline! converging...

nausikt added 19 commits April 14, 2026 19:40
…C via parsers, introduce extension points parse_item, parse_follow_links, default implementation, toscrape and twiki example.
…-world twiki crawling contracts with safest default values.
…ace & instantiations to GitManager and Scrapy's ScraperManager.
nausikt added 26 commits April 14, 2026 19:58
…sible, should refactoring later to have structure/unstructure redact more separately.
@nausikt nausikt force-pushed the ref/scrapers-to-scrapy branch from 9d4f69c to 429de8d Compare April 14, 2026 18:47
Author

nausikt commented Apr 14, 2026

@pmlugato I've resolved the conflicts, tested [1], and noted the key details needed to migrate Liv's IndicoScraper back into the current structure. See other details in this PR's updated write-up.

But I have made only a small update to the docs, just the bare minimum necessary for now.
Let me follow up real quick with:

  • [PR#2] to bring back Liv's IndicoScraper (guided), preserving Liv's original authorship.
  • Then [PR#3]: unified docs, examples, and final bug fixes, which might be fixed on Liv's side once things are fully integrated.

Please let me know if everything is O.K. for merging!

[1]
[Screenshot: 2026-04-14 21:09]

@nausikt nausikt changed the title [Ref] Integrate Scrapy-based scrapers into Archi interfaces Integrate Scrapy-based scrapers into Archi interfaces Apr 14, 2026