
feat(data-manager): add Indico scraper integration #550

Merged
pmlugato merged 1 commit into archi-physics:dev from livaage:feature/scrape-indico-on-dev
Apr 13, 2026


@livaage commented Apr 13, 2026

Adds support for scraping Indico events and meeting materials, alongside the existing scrapers.

Scraper (indico_scraper.py)

  • Fetches event metadata, contributions, and slide attachments via the Indico REST API.
  • Converts PDF/PPTX/PPT/ODP slides to Markdown via MarkItDown; strips embedded <latexit> blocks that inflate chunk counts on formula-heavy slides (slide_converter.py).
  • Deduplicates attachments when the same slides are uploaded in multiple formats (e.g. PDF + PPTX): keeps the higher-priority format.
  • Detects SSO-protected events and authenticates via CERNSSOScraper.
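
The format-priority deduplication described above could look roughly like the following. This is a minimal sketch, not the code in indico_scraper.py: the priority order (PDF before PPTX before PPT before ODP), the grouping key (lowercased filename stem), and the name `dedupe_attachments` are assumptions made for illustration.

```python
from pathlib import PurePosixPath

# Assumed priority order (lower number wins); the PR does not state the real one.
FORMAT_PRIORITY = {".pdf": 0, ".pptx": 1, ".ppt": 2, ".odp": 3}


def dedupe_attachments(filenames):
    """Keep one file per filename stem, preferring the higher-priority format."""
    best = {}  # stem -> filename currently chosen for that stem
    for name in filenames:
        path = PurePosixPath(name)
        suffix = path.suffix.lower()
        if suffix not in FORMAT_PRIORITY:
            continue  # not a slide format this sketch knows about
        key = path.stem.lower()
        current = best.get(key)
        if current is None:
            best[key] = name
            continue
        current_suffix = PurePosixPath(current).suffix.lower()
        if FORMAT_PRIORITY[suffix] < FORMAT_PRIORITY[current_suffix]:
            best[key] = name  # same slides, better format
    return list(best.values())
```

Grouping by filename stem is only one possible heuristic; two uploads of the same deck with different names would not be merged by this sketch.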

ScraperManager integration

  • collect_indico() / schedule_collect_indico() hooks.
  • URL routing: explicit "indico-" prefix in weblists, plus auto-detection for URLs with "indico" in the hostname and /event/ in the path.
  • Indico documents use source_type="web" (matching the existing CHECK constraint) with a "scraper": "indico" metadata field for filtering.
  • Most of the changes to scraper_manager.py are just linting.
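
The URL routing rule can be sketched as below. It is an illustration under assumptions, not the ScraperManager code: the function name `classify_weblist_entry` is hypothetical, and the sketch assumes the "indico-" prefix is written directly in front of the URL in the weblist.

```python
from urllib.parse import urlparse

INDICO_PREFIX = "indico-"  # assumed to be prepended directly to the URL


def classify_weblist_entry(entry):
    """Return (is_indico, url) for a single weblist line."""
    entry = entry.strip()
    # Explicit routing: the prefix wins regardless of hostname or path.
    if entry.startswith(INDICO_PREFIX):
        return True, entry[len(INDICO_PREFIX):]
    # Auto-detection: "indico" in the hostname and /event/ in the path.
    parsed = urlparse(entry)
    host = (parsed.hostname or "").lower()
    return ("indico" in host and "/event/" in parsed.path), entry
```

With this rule, a category URL on an Indico host is only routed to the Indico scraper when it carries the explicit prefix.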

Vectorstore

  • Prepends a one-line metadata header to each Indico chunk (event title, date, contribution, speaker, affiliation, start time, duration, session) so BM25 retrieval can match on speaker name, time of day, etc. Gated on metadata "scraper"="indico"; no other sources affected.
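
As a rough illustration of the header mechanism, the sketch below prepends a one-line header only when the chunk metadata marks the scraper as "indico". The metadata key names, the labels, and the bracketed " | "-separated layout are assumptions; the PR only lists which fields appear in the header.

```python
# Hypothetical metadata keys and labels, in the order the PR lists the fields.
HEADER_FIELDS = [
    ("event_title", "Event"),
    ("date", "Date"),
    ("contribution", "Contribution"),
    ("speaker", "Speaker"),
    ("affiliation", "Affiliation"),
    ("start_time", "Start"),
    ("duration", "Duration"),
    ("session", "Session"),
]


def with_indico_header(chunk_text, metadata):
    """Prepend a one-line metadata header to Indico chunks only."""
    if metadata.get("scraper") != "indico":
        return chunk_text  # other sources are left untouched
    parts = [
        f"{label}: {metadata[key]}"
        for key, label in HEADER_FIELDS
        if metadata.get(key)  # skip missing or empty fields
    ]
    if not parts:
        return chunk_text
    return "[" + " | ".join(parts) + "]\n" + chunk_text
```

Putting the fields in the chunk text itself, rather than only in metadata, is what lets a lexical retriever such as BM25 match a query on a speaker name.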

Config / docs / examples

  • base-config.yaml: indico source block (disabled by default).
  • docs/docs/data_sources.md: Indico section.
  • examples/agents/indico-assistant.md: agent spec for Indico queries.
  • examples/deployments/basic-agent/indico_example.list: example weblist.
  • SourceRegistry: register "indico" source (depends on links).

Dependencies

  • pyproject.toml + requirements-base.txt: add markitdown[pdf,pptx].

Known limitations

  • Images and figures in slides are not extracted or described; only text content is converted to Markdown. Slides that communicate primarily through plots/diagrams will produce thin or empty chunks.
  • LaTeX in slides: embedded <latexit> blocks are stripped (they are base64-encoded and useless for retrieval), but inline LaTeX notation and formula-heavy slides may still produce low-quality Markdown that chunks poorly.
  • Slide context is per-page, not per-deck: each chunk comes from one page/section of the converted Markdown. There is no cross-slide summarisation, so a narrative that spans multiple slides may be split across chunks without connecting context.
  • Category URLs (/category/<id>/) are handled in the code but not yet tested end-to-end; only event URLs are documented.
  • SSO authentication is CERN-specific (CERNSSOScraper). Other Indico instances with different login flows would need a different auth path.
  • No rate limiting or incremental scraping; large events with many attachments are processed in a single run.
  • indico_scraper.py is large (~1300 lines); a follow-up refactor could
    split API interaction from resource construction.

Adds support for scraping Indico events and meeting materials, alongside
the existing link/git/sso/elog scrapers.

Scraper (indico_scraper.py)
- Fetches event metadata, contributions, and slide attachments via the
  Indico REST API.
- Converts PDF/PPTX/PPT/ODP slides to Markdown via MarkItDown; strips
  embedded <latexit> blocks that inflate chunk counts on formula-heavy
  slides (slide_converter.py).
- Deduplicates attachments when the same slides are uploaded in multiple
  formats (e.g. PDF + PPTX): keeps the higher-priority format.
- Detects SSO-protected events and authenticates via CERNSSOScraper.
- Stores speaker affiliation alongside speaker name in resource metadata.
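
A minimal sketch of the <latexit> stripping step, assuming the blocks appear in the converted Markdown as paired <latexit ...>...</latexit> tags wrapping a base64 payload. The regular expression and the name `strip_latexit` are illustrative and not taken from slide_converter.py.

```python
import re

# Non-greedy match so that each block is removed on its own;
# DOTALL lets the payload span several lines.
LATEXIT_BLOCK = re.compile(
    r"<latexit\b[^>]*>.*?</latexit>",
    re.DOTALL | re.IGNORECASE,
)


def strip_latexit(markdown):
    """Remove embedded <latexit> blocks from converted slide text."""
    return LATEXIT_BLOCK.sub("", markdown)
```

Text outside the tags, including any inline LaTeX notation, is left as it is.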

ScraperManager integration
- collect_indico() / schedule_collect_indico() hooks.
- URL routing: explicit "indico-" prefix in weblists, plus auto-detection
  for URLs with "indico" in the hostname and /event/ in the path.
- Indico documents use source_type="web" (matching the existing CHECK
  constraint) with a "scraper": "indico" metadata field for filtering.

Vectorstore
- Prepends a one-line metadata header to each Indico chunk (event title,
  date, contribution, speaker, affiliation, start time, duration, session)
  so BM25 retrieval can match on speaker name, time of day, etc.
  Gated on metadata "scraper"="indico"; no other sources affected.

Config / docs / examples
- base-config.yaml: indico source block (disabled by default).
- docs/docs/data_sources.md: Indico section.
- examples/agents/indico-assistant.md: agent spec for Indico queries.
- examples/deployments/basic-agent/indico_example.list: example weblist.
- SourceRegistry: register "indico" source (depends on links).

Dependencies
- pyproject.toml + requirements-base.txt: add markitdown[pdf,pptx].

Known limitations
- Images and figures in slides are not extracted or described; only text
  content is converted to Markdown. Slides that communicate primarily
  through plots/diagrams will produce thin or empty chunks.
- LaTeX in slides: embedded <latexit> blocks are stripped (they are
  base64-encoded and useless for retrieval), but inline LaTeX notation
  and formula-heavy slides may still produce low-quality Markdown that
  chunks poorly.
- Slide context is per-page, not per-deck: each chunk comes from one
  page/section of the converted Markdown. There is no cross-slide
  summarisation, so a narrative that spans multiple slides may be split
  across chunks without connecting context.
- Category URLs (/category/<id>/) are handled in the code but not yet
  tested end-to-end; only event URLs are documented.
- SSO authentication is CERN-specific (CERNSSOScraper). Other Indico
  instances with different login flows would need a different auth path.
- No rate limiting or incremental scraping; large events with many
  attachments are processed in a single run.
@nausikt mentioned this pull request Apr 13, 2026
Collaborator

@pmlugato left a comment

Thanks @livaage, great to see this. I think it will get a lot of use!

Krittin reviewed this and is happy with it, and he will make the minor changes he needs to integrate it into the scraping refactoring he's done in #547.

Good to go into dev for me then! Merging now

@pmlugato pmlugato merged commit 1e2e857 into archi-physics:dev Apr 13, 2026
nausikt added a commit to nausikt/archi that referenced this pull request Apr 14, 2026