feat(data-manager): add Indico scraper integration by livaage · Pull Request #550 · archi-physics/archi

livaage · 2026-04-13T11:43:59Z

Adds support for scraping Indico events and meeting materials, alongside the existing link/git/sso/elog scrapers.

Scraper (indico_scraper.py)

Fetches event metadata, contributions, and slide attachments via the Indico REST API.
Converts PDF/PPTX/PPT/ODP slides to Markdown via MarkItDown; strips embedded `<latexit>` blocks that inflate chunk counts on formula-heavy slides (slide_converter.py).
Deduplicates attachments when the same slides are uploaded in multiple formats (e.g. PDF + PPTX): keeps the higher-priority format.
Detects SSO-protected events and authenticates via CERNSSOScraper.
Stores speaker affiliation alongside speaker name in resource metadata.
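
The format deduplication described above can be sketched roughly as follows. This is an illustration, not the code in indico_scraper.py: the `FORMAT_PRIORITY` order, the `dedupe_attachments` name, and the rule that two files with the same stem are the same deck are all assumptions, since the PR only states that the higher-priority format is kept.

```python
from pathlib import PurePosixPath

# Hypothetical priority order (earlier entry wins). The PR does not say
# which format actually ranks highest.
FORMAT_PRIORITY = ["pdf", "pptx", "ppt", "odp"]


def dedupe_attachments(filenames):
    """Keep one file per slide deck, preferring the higher-priority format."""
    best = {}
    for name in filenames:
        path = PurePosixPath(name)
        ext = path.suffix.lower().lstrip(".")
        if ext not in FORMAT_PRIORITY:
            continue  # not a slide format this sketch knows about
        key = path.stem.lower()  # assumption: same stem means same deck
        rank = FORMAT_PRIORITY.index(ext)
        if key not in best or rank < best[key][0]:
            best[key] = (rank, name)
    return [name for _, name in best.values()]
```

With this ordering, `dedupe_attachments(["talk.pptx", "talk.pdf"])` keeps only `talk.pdf`. A real implementation may need a looser match than the file stem, for example when uploads are named `talk_v2.pdf` and `talk.pptx`.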

ScraperManager integration

collect_indico() / schedule_collect_indico() hooks.
URL routing: explicit "indico-" prefix in weblists, plus auto-detection for URLs with "indico" in the hostname and /event/ in the path.
Indico documents use source_type="web" (matching the existing CHECK constraint) with a "scraper": "indico" metadata field for filtering.
Most of the changes to scraper_manager.py are just linting.
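
A minimal sketch of the URL routing rule, for reference. The function name `is_indico_entry` and the assumption that the "indico-" prefix sits directly in front of the URL on a weblist line are mine; the actual logic lives in scraper_manager.py and may differ.

```python
from urllib.parse import urlparse

INDICO_PREFIX = "indico-"


def is_indico_entry(entry):
    """Return (is_indico, url) for a single weblist line."""
    entry = entry.strip()
    # Explicit routing: the prefix forces the Indico scraper.
    if entry.startswith(INDICO_PREFIX):
        return True, entry[len(INDICO_PREFIX):]
    # Auto-detection: "indico" in the hostname and /event/ in the path.
    parsed = urlparse(entry)
    host = (parsed.hostname or "").lower()
    return ("indico" in host and "/event/" in parsed.path), entry
```

Under this rule a category URL on an Indico host is not auto-detected, which matches the note below that only event URLs are documented; the explicit prefix would still route it.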

Vectorstore

Prepends a one-line metadata header to each Indico chunk (event title, date, contribution, speaker, affiliation, start time, duration, session) so BM25 retrieval can match on speaker name, time of day, etc. Gated on metadata "scraper"="indico"; no other sources affected.
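
To make the header idea concrete, here is a small sketch of the gating and prepending step. The metadata key names in `HEADER_FIELDS`, the `key: value | key: value` layout, and the function name are assumptions for illustration; only the list of fields and the "scraper"="indico" gate come from the description above.

```python
# Hypothetical metadata keys, in the order listed in the PR description.
HEADER_FIELDS = [
    "event_title", "date", "contribution", "speaker",
    "affiliation", "start_time", "duration", "session",
]


def with_indico_header(text, metadata):
    """Prepend a one-line metadata header to an Indico chunk."""
    # Gate on the scraper field so other sources are left untouched.
    if metadata.get("scraper") != "indico":
        return text
    parts = [
        f"{field}: {metadata[field]}"
        for field in HEADER_FIELDS
        if metadata.get(field)  # skip missing or empty fields
    ]
    if not parts:
        return text
    return " | ".join(parts) + "\n" + text
```

Because the header is plain text inside the chunk, a lexical retriever such as BM25 can match a query like a speaker name against it without any change to the index schema.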

Config / docs / examples

base-config.yaml: indico source block (disabled by default).
docs/docs/data_sources.md: Indico section.
examples/agents/indico-assistant.md: agent spec for Indico queries.
examples/deployments/basic-agent/indico_example.list: example weblist.
SourceRegistry: register "indico" source (depends on links).

Dependencies

pyproject.toml + requirements-base.txt: add markitdown[pdf,pptx].

Known limitations

Images and figures in slides are not extracted or described; only text content is converted to Markdown. Slides that communicate primarily through plots/diagrams will produce thin or empty chunks.
LaTeX in slides: embedded `<latexit>` blocks are stripped (they are base64-encoded and useless for retrieval), but inline LaTeX notation and formula-heavy slides may still produce low-quality Markdown that chunks poorly.
Slide context is per-page, not per-deck: each chunk comes from one page/section of the converted Markdown. There is no cross-slide summarisation, so a narrative that spans multiple slides may be split across chunks without connecting context.
Category URLs (`/category/<id>/`) are handled in the code but not yet tested end-to-end; only event URLs are documented.
SSO authentication is CERN-specific (CERNSSOScraper). Other Indico instances with different login flows would need a different auth path.
No rate limiting or incremental scraping; large events with many attachments are processed in a single run.
indico_scraper.py is large (~1300 lines); a follow-up refactor could split API interaction from resource construction.
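
For the `<latexit>` stripping mentioned in the LaTeX limitation above, a regex-based approach could look like the sketch below. This is an assumption about how such blocks can be removed, not a copy of slide_converter.py; the pattern and function name are hypothetical.

```python
import re

# Matches an opening <latexit ...> tag, its base64 payload, and the closing
# tag. DOTALL lets the payload span multiple lines.
LATEXIT_BLOCK = re.compile(
    r"<latexit\b[^>]*>.*?</latexit>", re.DOTALL | re.IGNORECASE
)


def strip_latexit(markdown):
    """Remove embedded <latexit> blocks from converted slide Markdown."""
    return LATEXIT_BLOCK.sub("", markdown)
```

The non-greedy `.*?` keeps two separate blocks on the same slide from being merged into one match, so the text between them survives.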


pmlugato

thanks @livaage, great to see this, will get a lot of use from people I think!

Krittin gave this a review and is happy with it, and he will make the minor changes needed to integrate it into the scraping refactoring he's done in #547.

Good to go into dev for me then! Merging now

nausikt mentioned this pull request Apr 13, 2026

Migrate Scrapers to Scrapy #546

Open

7 tasks

pmlugato approved these changes Apr 13, 2026


pmlugato merged commit 1e2e857 into archi-physics:dev Apr 13, 2026

nausikt added a commit to nausikt/archi that referenced this pull request Apr 14, 2026

added back Liv's indico scraper archi-physics#550 params for ref.

8bdfa6f

nausikt mentioned this pull request Apr 14, 2026

Integrate Scrapy-based scrapers into Archi interfaces #547

Open

pmlugato merged 1 commit into archi-physics:dev from livaage:feature/scrape-indico-on-dev
