
Integrate Scrapy-based scrapers into Archi interfaces#547

Open
nausikt wants to merge 63 commits into archi-physics:dev from nausikt:ref/scrapers-to-scrapy

Conversation


@nausikt nausikt commented Apr 8, 2026

Resolves #464. Part of #546.

This PR includes:

Migrate the web scraping backend from custom scrapers to the Scrapy framework.

Architecture

  • Replace legacy scraper.py, scraper_manager.py, and integrations/ (sso_scraper, git_scraper) with a Scrapy-based spider architecture
  • New ScraperManager orchestrates all web crawls via a single CrawlerProcess (one Twisted reactor, all spiders concurrent)
  • Ships with safe-by-default crawler settings in src/data_manager/collectors/scrapers/settings.py,
    e.g. CONCURRENT_REQUESTS=1, CONCURRENT_REQUESTS_PER_DOMAIN=1, etc.

Spiders can tighten these further via custom_settings (e.g. TWiki sets DOWNLOAD_DELAY: 60, Discourse sets CLOSESPIDER_PAGECOUNT: 500).

  • Decouple Git collection into standalone GitManager with its own GitResource model
  • New AuthDownloaderMiddleware + CERNSSOProvider for SSO authentication (replaces sso_scraper)
  • Resource Adapters act as a clear boundary/bridge between Scrapy Items and Archi's ScrapedResources in the PersistencePipeline.
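The spider-level overrides mentioned above follow Scrapy's settings-precedence model, where a spider's custom_settings win over project-level defaults. A minimal self-contained sketch (PROJECT_DEFAULTS, TwikiSpiderSettings, and effective_settings are illustrative stand-ins, not the PR's actual code):

```python
# Hypothetical defaults mirroring the safe-by-default settings.py values.
PROJECT_DEFAULTS = {
    "CONCURRENT_REQUESTS": 1,
    "CONCURRENT_REQUESTS_PER_DOMAIN": 1,
    "DOWNLOAD_DELAY": 10,
}

class TwikiSpiderSettings:
    # Per-spider overrides, in the spirit of Scrapy's Spider.custom_settings
    custom_settings = {"DOWNLOAD_DELAY": 60}

def effective_settings(defaults, custom):
    """Spider-level custom settings take priority over project defaults."""
    merged = dict(defaults)
    merged.update(custom or {})
    return merged

settings = effective_settings(PROJECT_DEFAULTS, TwikiSpiderSettings.custom_settings)
# TWiki's 60 s delay wins; untouched defaults remain at their safe values.
```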

Spiders

  • LinkSpider — generic link-following spider with configurable allow/deny, max_depth, domain scoping
  • TwikiSpider — extends LinkSpider with TWiki-specific URL normalization, deny patterns, and parser
  • DiscourseSpider — JSON API pagination spider for Discourse forums, seeded by category URLs from input lists
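The allow/deny, max_depth, and domain-scoping behavior of LinkSpider might gate candidate links roughly like the sketch below (should_follow and its parameters are illustrative, not the PR's implementation):

```python
import re
from urllib.parse import urlparse

def should_follow(url, depth, *, allow=(), deny=(), allowed_domains=(), max_depth=2):
    """Illustrative gate for a LinkSpider-style crawler: depth cutoff,
    domain scoping, then deny-before-allow regex filters."""
    if depth > max_depth:
        return False
    host = urlparse(url).netloc
    if allowed_domains and host not in allowed_domains:
        return False
    if any(re.search(pattern, url) for pattern in deny):
        return False
    if allow and not any(re.search(pattern, url) for pattern in allow):
        return False
    return True
```

For example, a TWiki-flavored configuration could scope to twiki.cern.ch and deny revision-history URLs via a deny pattern like r"\?rev=".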

UX: Unified web source configuration

  • Single web.input_lists for all web URLs — domain-based routing automatically assigns the correct spider per URL
  • Per-site behavior overrides under web.sites (twiki, discourse) for delay, allow/deny, keywords, etc.
  • Fallback spider (default: link) for URLs that don't match any configured site domain
  • Same input_lists pattern for git sources
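The domain-based routing described above could look like this in spirit (the registry contents and the route_url helper are illustrative; the PR's actual registry maps domains to spider classes):

```python
from urllib.parse import urlparse

# Hypothetical registry; spider names stand in for the real spider classes.
DOMAIN_SPIDER_REGISTRY = {
    "twiki.cern.ch": "twiki",
    "cms-talk.web.cern.ch": "discourse",
}

def route_url(url, fallback="link"):
    """Pick a spider for a URL by its domain; URLs whose domain matches
    no configured site fall back to the generic link spider."""
    return DOMAIN_SPIDER_REGISTRY.get(urlparse(url).netloc, fallback)
```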

Pipelines & processing

  • PersistencePipeline — adapts Scrapy Items to Archi's ScrapedResource and persists via PersistenceService
  • AnonymizationPipeline — inline data anonymization during crawl using existing Anonymization utils.
  • MarkitdownPipeline — HTML/RSS to Markdown conversion as a second pass (builds partially on top of the Markitdown work in Liv's feat(data-manager): add Indico scraper integration #550)
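The adapter boundary inside PersistencePipeline can be pictured as a small translation function. ScrapedResource's fields and adapt_item below are assumptions for the sketch, not the PR's real models:

```python
from dataclasses import dataclass

@dataclass
class ScrapedResource:
    """Illustrative stand-in for Archi's persistence model."""
    url: str
    title: str
    content: str
    source_kind: str = "web"

def adapt_item(item: dict) -> ScrapedResource:
    """Boundary function in the spirit of the PR's resource adapter:
    translate a Scrapy-style item dict into the persistence model,
    tolerating optional fields."""
    return ScrapedResource(
        url=item["url"],
        title=item.get("title", ""),
        content=item.get("content", ""),
    )
```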

Example & config

  • examples/deployments/basic-scraping/ — complete deployment example with twiki.list, cms-talk.list, miscellanea.list, git.list
  • Updated base-config.yaml template with new web.input_lists + web.sites schema
  • New Indico scraper params from feat(data-manager): add Indico scraper integration #550, for reference.

Tests

  • Offline unit tests for TWiki parser with real HTML fixtures
  • Scrapy contract checks for LinkSpider and TwikiSpider
  • Unit tests for resource adapter (Scrapy Item → ScrapedResource)

53 files changed, ~3970 insertions, ~1848 deletions

End-to-End Integrated Archi Infrastructure Test

Please feel free to adjust examples/deployments/basic-scraping/config.yaml as you see fit. Currently, it might take a very long time to pass through every web source.

  • LinkSpider for MIT sources

    may take less than 1 hr.

  • BUT... TWiki HeavyIon might take

    at least (300 + 100 from CRAB) docs x 60 seconds (crawl delay)
    ~= 24,000 seconds (6-7 hrs)

  • Discourse may take

    10 seconds (crawl delay) each for 47 pages + at least 500 docs
    ~= 1 hr 40 mins

  • new GitManager, as I've tested on dmwm/CRABServer and dmwm/CRABClient:

    the 2 repos have ~400 files;
    it takes less than an hour and works as efficiently as before.

Expected Result
[1] All sources are ingested, which may take hours to finish.
[Screenshot: 2026-04-08 18:53]
[2] Most of our docs should be in anonymized, Markdown format right away.
P.S. Anonymized as well as the NLP module and our heuristic-based (hard-coded) per-source name-replacement patterns allow.
[Screenshot]

Standalone Scrapy Test

Spiders include Scrapy contracts — lightweight inline assertions that validate against real endpoints without a full deployment:

# Run all spider contracts (link, twiki, discourse)
scrapy check
# Run a specific spider's contracts
scrapy check link
scrapy check twiki

Run individual spiders standalone, independently of the Archi architecture, with -a args:

# Link spider with custom depth and delay
scrapy crawl link -a start_urls='["https://quotes.toscrape.com/"]' -a max_depth=2 -a delay=1

# TWiki spider (public page, no SSO)
scrapy crawl twiki -a start_urls='["https://twiki.cern.ch/twiki/bin/view/CMSPublic/SWGuideCrab"]' -a max_depth=1 -a max_pages=5

# Discourse spider with category URLs
scrapy crawl discourse -a start_urls='["https://cms-talk.web.cern.ch/c/offcomp/ais/150"]' -a domain=cms-talk.web.cern.ch -a max_pages=10

Unit tests (no network required):

pytest tests/unit/test_twiki_parser.py tests/unit/test_scrapers_resource_adapter.py -v

Misc

There's one caveat for SSO Playwright + Chromium to work as expected inside containers. Please be aware:

  • On macOS with an ARM-based CPU, make sure Rosetta is enabled for your Podman machine (podman machine set --rootful and verify Rosetta is available in the VM).

Otherwise, Playwright may fall back to a QEMU-emulated Chromium instance, which causes race conditions and flaky SSO authentication!

  • Linux x86_64, production machines, and VMs should work fine out of the box.

@nausikt nausikt mentioned this pull request Apr 8, 2026
Author

nausikt commented Apr 8, 2026

Note that we still lack the mit_sso auth provider, which will be added ASAP.

For now, we can work around this by testing MIT sources and CERN sources in a mutually exclusive manner.

Collaborator

pmlugato commented Apr 8, 2026

Hi @nausikt, thanks for this PR! Looking into it and testing now. One thing I noticed right away: please add any new packages to the requirements files. Having them in the pyproject is fine for now for testing, but this way the images will be updated accordingly once merged to main.

Author

nausikt commented Apr 8, 2026

Oh dear, let me fix it now!

Comment on lines +44 to +51
urls:
- https://ppc.mit.edu/news/
max_depth: 2
max_pages: 100
delay: 10
markitdown: true
input_lists:
- examples/deployments/basic-scraping/miscellanea.list
Collaborator


So you can pass urls either directly in the config or via a .list file as before? For the latter case, the file can contain both public and SSO URLs, without any need for a prefix, right?

Author


Exactly, Pietro! For all scrapers (spiders), by design.

Comment on lines +92 to +95
git:
  urls:
  - https://github.com/dmwm/CRABServer
  - https://github.com/dmwm/CRABClient
Collaborator

@pmlugato pmlugato Apr 8, 2026


Unlike web.links, do web.git and web.twiki only take urls directly in the config? If you could also support an input_list here, it would be nice for people who want to keep it separate.

Also, is there a reason twiki takes a separate list? I understand the specific configuration options it needs, but why can't the twiki urls live in the same list as the other links, with the twiki options under web.links.twiki? To support different delay times more easily, maybe?

Comment: I think it's better to have, e.g., links.list and git.list, as it would be in this case, than how we had it before with the prefixes, so separate input lists is good.

Author


unlike web.links, web.git and web.twiki only take the urls directly in the config? if you can also support an input_list here, would be nice for people who want to keep it separate.

For web.git, yes, it currently only takes urls. Roger that, will support input_list there!

Author


web.link, web.twiki and web.discourse support both urls and input_list.

Author

@nausikt nausikt Apr 8, 2026


Also, is there a reason twiki takes a separate list? I understand the specific configuration options it needs, but why can't the twiki urls live in same list as other links and twiki options would be under web.links.twiki? to support different delay times more easily maybe?

@pmlugato Got your point! Agree the UX can be better by not taking a separate list; will converge! 🫡

For clarity: only one web.links.input_lists as the main portal, with site/scraper-specific settings alongside under the web.links.<site|spider> pattern.

    web:
        links:
          ### Global default Link configs go here, implicitly set
          max_depth: 2
          max_pages: 100
          delay: 10
          markitdown: true
          input_lists:  # Only 1 portal, we pour every link here... as long as it fits the scraper's nature.
          - examples/deployments/basic-scraping/miscellanea.list
          ### Site/Spider non-list-related/specific configuration goes below.
          # <spider/site>:
          twiki:
              delay: 60  # <---- overrides global
              deny: ....
              allow: ....
              # .... w/o list
          discourse:
              category_paths:
                - ....
              delay: 10
              keywords: ...
          indico:   # <--- the upcoming IndicoSpider will stay at this level as well.
              keywords:

Then, behind the scenes, Archi's scraper_manager is automatically aware of each Spider and its config via a newly introduced registry instead.

DOMAIN_SPIDER_REGISTRY = {
    "twiki.cern.ch": TWikiSpider,
    "cms-talk.web.cern.ch": DiscourseSpider,
    "indico.cern.ch": IndicoSpider,
}

Is this what you had in mind? But this registry has to be configured by users somewhere anyhow 🤔

Otherwise, how can scraper_manager resolve a link and know its source_kind from a flat list of urls? But I don't like the prefix approach! IMO it deteriorates the UX.

Author

@nausikt nausikt Apr 8, 2026


How about just adding a domain key for scraper_manager to pick up as the registry!

          links:
              input_lists:
              - ....
          twiki:
              domain: "twiki.cern.ch"
              delay: 60  # <---- overrides global
              deny: ....
              allow: ....
              # .... w/o list
          discourse:
              domain: "cms-talk.web.cern.ch"
              category_paths:
                - ....
              delay: 10
              keywords: ...
          indico:   # <--- the upcoming IndicoSpider will stay at this level as well.
              domain: "indico.cern.ch"
              keywords:
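A sketch of how scraper_manager could derive the registry from domain keys like the ones above, so users never edit a registry directly (build_registry and the config keys are assumptions based on this thread, not the final schema):

```python
def build_registry(web_cfg, static_keys=("links",)):
    """Collect domain -> spider-name mappings from per-spider sections,
    skipping non-spider keys such as the shared "links" section."""
    registry = {}
    for spider_name, sub in web_cfg.items():
        if spider_name in static_keys or not isinstance(sub, dict):
            continue
        domain = sub.get("domain")
        if domain:
            registry[domain] = spider_name
    return registry

web_cfg = {
    "links": {"input_lists": ["miscellanea.list"]},
    "twiki": {"domain": "twiki.cern.ch", "delay": 60},
    "discourse": {"domain": "cms-talk.web.cern.ch", "delay": 10},
}
# build_registry(web_cfg) ->
# {'twiki.cern.ch': 'twiki', 'cms-talk.web.cern.ch': 'discourse'}
```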

Author

@nausikt nausikt Apr 8, 2026


This way we give users the best UX: spider resolution is transparent and has the smallest footprint in the user's config.

With probably only one caveat[?] for now: we can't have two twiki/discourse/indico source instances at the same time.

          indico:
              domain: "indico.cern.ch", "indico.mit.edu"
              keywords: ...

Author


Comment: I think it's better to have, e.g., links.list and git.list, as it would be in this case, than how we had it before with the prefixes, so separate input lists is good.

Ack on the comment.

BTW, about naming:

  1. Would you prefer me to rename git.input_lists -> git.list as well? Or avoid the word "list" and keep it the same?
  2. Would you like me to completely drop git.urls and web.links.urls and switch to the input_lists style only, to avoid confusing users? Preserving only one standard way?
  • Although, personally, web.urls is fairly convenient for peeking at/editing everything at a glance in the same config.yaml, so it may just be better UX for debugging.

Comment thread src/cli/managers/config_manager.py Outdated
Comment on lines +271 to +283
if not isinstance(sources_section, dict):
    continue
web = sources_section.get("web", {}) or {}
if not isinstance(web, dict):
    continue
for spider_key, sub in web.items():
    if spider_key in _WEB_TOP_LEVEL_STATIC_KEYS:
        continue
    if not isinstance(sub, dict):
        continue
    wlists = sub.get("input_lists") or []
    if isinstance(wlists, list):
        collected.extend(wlists)
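For illustration, a self-contained rendition of the collection loop above; the contents of _WEB_TOP_LEVEL_STATIC_KEYS here are assumed, not copied from the PR:

```python
_WEB_TOP_LEVEL_STATIC_KEYS = {"links", "urls", "input_lists"}  # assumed contents

def collect_web_input_lists(sources_section):
    """Gather input_lists from every per-spider sub-section of web,
    skipping the static top-level keys, as the snippet above does."""
    collected = []
    if not isinstance(sources_section, dict):
        return collected
    web = sources_section.get("web", {}) or {}
    if not isinstance(web, dict):
        return collected
    for spider_key, sub in web.items():
        if spider_key in _WEB_TOP_LEVEL_STATIC_KEYS:
            continue
        if not isinstance(sub, dict):
            continue
        wlists = sub.get("input_lists") or []
        if isinstance(wlists, list):
            collected.extend(wlists)
    return collected
```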
Author

@nausikt nausikt Apr 8, 2026


@pmlugato input_lists are understood by all scrapers (Spiders) here, and no prefix like sso-, eos-, or git-* is required in any urls.

For SSO-protected urls, though, we must explicitly provide the right auth_provider_name: cern_sso|.... ourselves. Currently the SSO-protected url has to come first, but by design this should not be a caveat: URLs should work in any order!

Collaborator

pmlugato commented Apr 8, 2026

@nausikt thanks a lot for this PR, it looks really nice! I have tested it with a few different source types and configurations and everything seems to be working smoothly which is great! Thanks a lot also for the active iteration offline :)

I would be happy to merge this into dev soon; maybe we can have one more person look at it, but that can also be done after the fact...

One request before doing so: if you could write some nice documentation about all of this in docs/, it would be great, including deprecating the old version of things there. Once that's done, I think we should be almost good to go into dev.

Thanks a lot again for all the hard work!

Author

nausikt commented Apr 8, 2026

@pmlugato Thanks to you guys for having me on board!

One last thing! I've summarized all my micro-responses for you to finalize here, before I dive into converging on your review and the docs.

    web:
        links:
          ### Global default Link configs go here, implicitly set
          max_depth: 2
          max_pages: 100
          delay: 10
          markitdown: true
          # Only 1 input_lists portal, we pour every link here..
          input_lists:
          - examples/deployments/basic-scraping/miscellanea.list
          ### Site/Spider non-list-related/specific configuration goes below.
          # <spider/site>:
          twiki:
              domain: "twiki.cern.ch"
              delay: 60  # <---- overrides global
              deny: ....
              allow: ....
              # .... w/o list
          discourse:
              domain: "cms-talk.web.cern.ch"
              category_paths:
                - ....
              delay: 10
              keywords: ...
          indico:   # <--- the upcoming IndicoSpider will stay at this level as well.
              domain: "indico.cern.ch"
              keywords:
  1. Is this the design you had in mind?
  2. I will deprecate <manager>.urls and stick with the web.links.input_lists / git.input_lists pattern.
  3. Shall we reduce to web.input_lists? No need for the redundant links under web.links (I highly encourage this; nesting/indentation has a significant chance of confusing users, so this should be better UX).

Author

nausikt commented Apr 8, 2026

Agreed with what we discussed offline! converging...

nausikt added 19 commits April 14, 2026 19:40
…C via parsers, introduce extension points parse_item, parse_follow_links, default implementation, toscrape and twiki example.
…-world twiki crawling contracts with safest default values.
…ace & instantiations to GitManager and Scrapy's ScraperManager.
nausikt added 26 commits April 14, 2026 19:58
…sible, should refactoring later to have structure/unstructure redact more separately.
@nausikt nausikt force-pushed the ref/scrapers-to-scrapy branch from 9d4f69c to 429de8d Compare April 14, 2026 18:47
Author

nausikt commented Apr 14, 2026

@pmlugato I've resolved the conflicts, tested [1], and noted the key details needed to migrate Liv's IndicoScraper back into the current structure. See other details in this PR's updated write-up.

But I have made only a small update to the docs, just the bare minimum necessary for now.
Let me follow up real quick with:

  • [PR#2] to bring back Liv's IndicoScraper (guided), preserving Liv's original authorship.
  • Then [PR#3]: unified docs, examples, and final bug fixes, which might be fixed on Liv's side once things are fully integrated.

Please let me know if everything is O.K. for merging!

[1]
[Screenshot: 2026-04-14 21:09]

@nausikt nausikt changed the title [Ref] Integrate Scrapy-based scrapers into Archi interfaces Integrate Scrapy-based scrapers into Archi interfaces Apr 14, 2026