Integrate Scrapy-based scrapers into Archi interfaces #547
nausikt wants to merge 63 commits into archi-physics:dev
Conversation
Note that we still lack the […]. For now, we can work around this by testing MIT sources and CERN sources in a mutually exclusive manner.

Hi @nausikt, thanks for this PR! Looking into it and testing now -- one thing I noticed right away: please add any new packages to the […].

Oh dear, let me fix it now!
```yaml
urls:
  - https://ppc.mit.edu/news/
max_depth: 2
max_pages: 100
delay: 10
markitdown: true
input_lists:
  - examples/deployments/basic-scraping/miscellanea.list
```
So you can either pass urls directly in the config, or via a .list file with urls as before? For the latter case, the file can contain both public and SSO urls, without any need for a prefix, right?
Exactly, Pietro! That holds for all scrapers (spiders) by design.
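For illustration, such a mixed `.list` file might look like this (the URLs below are placeholders; the point is that public and SSO entries coexist with no `sso-`/`git-` prefixes):

```
# miscellanea.list — one URL per line, no prefixes required
https://ppc.mit.edu/news/
https://twiki.cern.ch/twiki/bin/view/Main/WebHome
```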
```yaml
git:
  urls:
    - https://github.com/dmwm/CRABServer
    - https://github.com/dmwm/CRABClient
```
Unlike `web.links`, do `web.git` and `web.twiki` only take the urls directly in the config? If you could also support an `input_list` here, it would be nice for people who want to keep it separate.
Also, is there a reason twiki takes a separate list? I understand the specific configuration options it needs, but why can't the twiki urls live in the same list as the other links, with the twiki options under `web.links.twiki`? To support different delay times more easily, maybe?
Comment: I think it's better to have, e.g., `links.list` and `git.list`, as it would be in this case, than how we had it before with the prefixes, so separate input lists are good.
> Unlike `web.links`, do `web.git` and `web.twiki` only take the urls directly in the config? If you could also support an `input_list` here, it would be nice for people who want to keep it separate.
For `web.git`, yes, currently it only takes urls. Roger that, will support `input_list` there!
`web.links`, `web.twiki` and `web.discourse` support both `urls` and `input_list`.
> Also, is there a reason twiki takes a separate list? I understand the specific configuration options it needs, but why can't the twiki urls live in the same list as the other links, with the twiki options under `web.links.twiki`? To support different delay times more easily, maybe?
@pmlugato Got your point! Agreed that the UX is better without a separate list, will converge! 🫡
For clarity: only one `web.links.input_lists` acts as the main portal, with site/scraper-specific asides under the `web.links.<site|spider>` pattern.
```yaml
web:
  links:
    ### Global default link configs go here, implicitly set
    max_depth: 2
    max_pages: 100
    delay: 10
    markitdown: true
    input_lists: # Only one portal; we pour every link here... as long as it fits the scraper's nature.
      - examples/deployments/basic-scraping/miscellanea.list
    ### Site/spider non-list-related/specific configuration goes below.
    # <spider/site>:
    twiki:
      delay: 60 # <---- override global
      deny: ....
      allow: ....
      # .... w/o list
    discourse:
      category_paths:
        - ....
      delay: 10
      keywords: ...
    indico: # <--- the upcoming IndicoSpider will stay at this level as well.
      keywords:
```
Then, behind the scenes, Archi's `scraper_manager` is automatically aware of each Spider and its config via a newly introduced registry instead.
```python
DOMAIN_SPIDER_REGISTRY = {
    "twiki.cern.ch": TWikiSpider,
    "cms-talk.web.cern.ch": DiscourseSpider,
    "indico.cern.ch": IndicoSpider,
}
```

Is this what you had in mind? But this registry has to be configured by users somewhere anyhow 🤔
Otherwise, how can `scraper_manager` resolve a link and know its `source_kind` from a flat list of urls? But I don't like the prefix approach! IMO it deteriorates UX.
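A minimal sketch of how such a registry could resolve a spider from a bare URL by hostname. The spider classes and the `LinkSpider` fallback are placeholders, not Archi's actual implementation:

```python
from urllib.parse import urlparse

# Placeholder classes for illustration; the real spiders live in Archi's codebase.
class TWikiSpider: pass
class DiscourseSpider: pass
class IndicoSpider: pass
class LinkSpider: pass  # hypothetical generic fallback spider

DOMAIN_SPIDER_REGISTRY = {
    "twiki.cern.ch": TWikiSpider,
    "cms-talk.web.cern.ch": DiscourseSpider,
    "indico.cern.ch": IndicoSpider,
}

def resolve_spider(url: str):
    """Map a URL to a spider class by hostname; unknown hosts fall back to the generic spider."""
    host = urlparse(url).netloc.lower()
    return DOMAIN_SPIDER_REGISTRY.get(host, LinkSpider)
```

With this, a flat url list needs no prefixes: the hostname alone determines the spider.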
How about just adding a `domain` key for `scraper_manager` to pick up as the registry?
```yaml
links:
  input_lists:
    - ....
  twiki:
    domain: "twiki.cern.ch"
    delay: 60 # <---- override global
    deny: ....
    allow: ....
    # .... w/o list
  discourse:
    domain: "cms-talk.web.cern.ch"
    - ....
    delay: 10
    keywords: ...
  indico: # <--- the upcoming IndicoSpider will stay at this level as well.
    domain: "indico.cern.ch"
    keywords:
```
This way we give users the best UX: spider resolution is transparent and has the least footprint in the user's config.
Probably the only caveat [?] for now is that we can't have two twiki/discourse/indico source instances at the same time.
```yaml
indico:
  domain: "indico.cern.ch", "indico.mit.edu"
  keywords: ...
```
> Comment: I think it's better to have, e.g., `links.list` and `git.list`, as it would be in this case, than how we had it before with the prefixes, so separate input lists are good.
Ack for the comment.
BTW, about naming:
- Would you prefer me to rename `git.input_lists` -> `git.list` as well? Or avoid the word `list` and keep it the same?
- Would you like me to completely drop `git.urls` and `web.links.urls` and just change to the `input_lists` style, to avoid confusing users and preserve only one standard way?
- Although, personally, `web.urls` is fairly convenient for me, letting me peek at/edit everything in Archi at a glance in the same `config.yaml`; but that may just be better UX for debugging things.
```python
if not isinstance(sources_section, dict):
    continue
web = sources_section.get("web", {}) or {}
if not isinstance(web, dict):
    continue
for spider_key, sub in web.items():
    if spider_key in _WEB_TOP_LEVEL_STATIC_KEYS:
        continue
    if not isinstance(sub, dict):
        continue
    wlists = sub.get("input_lists") or []
    if isinstance(wlists, list):
        collected.extend(wlists)
```
@pmlugato `input_lists` are understood by all scrapers (Spiders) here, and no prefix like `sso-`, `eos-`, or `git-*` is required in any urls.
For SSO-protected urls, though, we must explicitly provide the right `auth_provider_name: cern_sso|....` ourselves. The SSO-protected url currently has to come first, but by design this should not be a caveat: URLs should work in any order!
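Once the list files are collected as above, each one still has to be read into urls. A minimal sketch of such a reader (hypothetical helper, assuming one URL per line with `#` comments allowed):

```python
from pathlib import Path

def read_url_list(path: str) -> list[str]:
    """Read one URL per line from a .list file, skipping blanks and '#' comments."""
    urls = []
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            urls.append(line)
    return urls
```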
@nausikt thanks a lot for this PR, it looks really nice! I have tested it with a few different source types and configurations and everything seems to be working smoothly, which is great! Thanks a lot also for the active iteration offline :)

I would be happy to merge this soon into dev; maybe we can have one more person look at it, but that can also be done after the fact... One request before doing so: if you could write some nice documentation about all of this in the […].

Thanks a lot again for all the hard work!
@pmlugato Thanks to you guys for having me on board! One last thing: I've summarized all my micro-responses for you to finalize here.

```yaml
web:
  links:
    ### Global default link configs go here, implicitly set
    max_depth: 2
    max_pages: 100
    delay: 10
    markitdown: true
    # Only one input_lists portal; we pour every link here..
    input_lists:
      - examples/deployments/basic-scraping/miscellanea.list
    ### Site/spider non-list-related/specific configuration goes below.
    # <spider/site>:
    twiki:
      domain: "twiki.cern.ch"
      delay: 60 # <---- override global
      deny: ....
      allow: ....
      # .... w/o list
    discourse:
      domain: "cms-talk.web.cern.ch"
      - ....
      delay: 10
      keywords: ...
    indico: # <--- the upcoming IndicoSpider will stay at this level as well.
      domain: "indico.cern.ch"
      keywords:
```
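The "override global" comment above can be sketched as a simple lookup rule (hypothetical helper, not Archi's actual code): a value under `web.links.<spider>` wins, otherwise the global `web.links` default applies.

```python
def effective_setting(links_cfg: dict, spider: str, key: str, default=None):
    """Site-specific value under web.links.<spider> wins over the global default."""
    site = links_cfg.get(spider) or {}
    if isinstance(site, dict) and key in site:
        return site[key]
    return links_cfg.get(key, default)
```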
Agreed with what we discussed offline! Converging...
@pmlugato I've resolved the conflicts, tested [1], and noted the key details needed to migrate Liv's IndicoScraper back into the current structure. See the other details in this PR's updated write-up. I have made only a small update to the docs, just the bare minimum necessary for now.
Please let me know if everything is O.K. for you to merge!

Resolves #464; part of #546.
This PR includes

Migrate web scraping backend from custom scrapers to Scrapy framework.

Architecture

e.g. `CONCURRENT_REQUESTS=1`, `CONCURRENT_REQUESTS_PER_DOMAIN=1`, etc.

Spiders

UX: Unified web source configuration

`web.input_lists` for all web URLs — domain-based routing automatically assigns the correct spider per URL

Pipelines & processing

Example & config

`examples/deployments/basic-scraping/` — complete deployment example with `twiki.list`, `cms-talk.list`, `miscellanea.list`, `git.list`; `web.input_lists` + `web.sites` schema

Tests
53 files changed, ~3970 insertions, ~1848 deletions
End-to-End Integrated Archi Infrastructure Test

Please feel free to adjust the `examples/deployments/basic-scraping/config.yaml` as you see fit. Currently, it might take a very long time to pass through every web source, `dmwm/CRABServer` and `dmwm/CRABClient`.

Expected Result


[1] All comprehensive sources are ingesting, which might take hours to finish.
[2] Also, most of our docs should be in anonymized, markdown format right away.
P.S. Everything has been anonymized as well as the NLP module and our heuristic-based (hard-coded) name-replacement patterns per source allow.
Standalone Scrapy Test
Spiders include Scrapy contracts — lightweight inline assertions that validate against real endpoints without a full deployment:
Run individual spiders standalone, independently of the Archi architecture, with `-a` args:
Unit tests (no network required):
Misc
There's one caveat for getting SSO Playwright + Chromium to work as expected inside containers. Please beware: