feat(data-manager): add Indico scraper integration #550
Merged
pmlugato merged 1 commit into archi-physics:dev on Apr 13, 2026
Adds support for scraping Indico events and meeting materials, alongside the existing link/git/sso/elog scrapers.

Scraper (indico_scraper.py)
- Fetches event metadata, contributions, and slide attachments via the Indico REST API.
- Converts PDF/PPTX/PPT/ODP slides to Markdown via MarkItDown; strips embedded <latexit> blocks that inflate chunk counts on formula-heavy slides (slide_converter.py).
- Deduplicates attachments when the same slides are uploaded in multiple formats (e.g. PDF + PPTX): keeps the higher-priority format.
- Detects SSO-protected events and authenticates via CERNSSOScraper.
- Stores speaker affiliation alongside speaker name in resource metadata.

ScraperManager integration
- collect_indico() / schedule_collect_indico() hooks.
- URL routing: explicit "indico-" prefix in weblists, plus auto-detection for URLs with "indico" in the hostname and /event/ in the path.
- Indico documents use source_type="web" (matching the existing CHECK constraint) with a "scraper": "indico" metadata field for filtering.

Vectorstore
- Prepends a one-line metadata header to each Indico chunk (event title, date, contribution, speaker, affiliation, start time, duration, session) so BM25 retrieval can match on speaker name, time of day, etc. Gated on metadata "scraper"="indico"; no other sources affected.

Config / docs / examples
- base-config.yaml: indico source block (disabled by default).
- docs/docs/data_sources.md: Indico section.
- examples/agents/indico-assistant.md: agent spec for Indico queries.
- examples/deployments/basic-agent/indico_example.list: example weblist.
- SourceRegistry: register "indico" source (depends on links).

Dependencies
- pyproject.toml + requirements-base.txt: add markitdown[pdf,pptx].

Known limitations
- Images and figures in slides are not extracted or described; only text content is converted to Markdown. Slides that communicate primarily through plots/diagrams will produce thin or empty chunks.
- LaTeX in slides: embedded <latexit> blocks are stripped (they are base64-encoded and useless for retrieval), but inline LaTeX notation and formula-heavy slides may still produce low-quality Markdown that chunks poorly.
- Slide context is per-page, not per-deck: each chunk comes from one page/section of the converted Markdown. There is no cross-slide summarisation, so a narrative that spans multiple slides may be split across chunks without connecting context.
- Category URLs (/category/<id>/) are handled in the code but not yet tested end-to-end; only event URLs are documented.
- SSO authentication is CERN-specific (CERNSSOScraper). Other Indico instances with different login flows would need a different auth path.
- No rate limiting or incremental scraping; large events with many attachments are processed in a single run.
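
For reviewers unfamiliar with the <latexit> cleanup, here is a minimal sketch of how such stripping might look. It assumes the blocks appear in the converted Markdown as literal <latexit ...>...</latexit> tags; the function name strip_latexit and the regular expression are illustrative and not taken from slide_converter.py.

```python
import re

# Assumption: blocks look like <latexit attr="...">base64 payload</latexit>
# and may span several lines. The real converter may match differently.
_LATEXIT_RE = re.compile(r"<latexit\b[^>]*>.*?</latexit>", re.DOTALL | re.IGNORECASE)


def strip_latexit(markdown: str) -> str:
    """Remove embedded <latexit> blocks, then collapse leftover blank lines."""
    cleaned = _LATEXIT_RE.sub("", markdown)
    # Runs of three or more newlines become a single blank line.
    return re.sub(r"\n{3,}", "\n\n", cleaned)
```

The non-greedy match keeps two separate blocks on one slide from being merged into a single deletion that would also swallow the text between them.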
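
The attachment deduplication listed under Scraper could be sketched roughly as follows. The priority order (PDF first) and the grouping by lowercased file stem are assumptions made for illustration; the PR does not state the actual ranking or matching rule, and dedupe_attachments is a hypothetical name.

```python
from pathlib import PurePosixPath

# Hypothetical ranking, lower is preferred. Not confirmed by the PR.
FORMAT_PRIORITY = {".pdf": 0, ".pptx": 1, ".ppt": 2, ".odp": 3}


def dedupe_attachments(filenames):
    """Keep one file per stem, choosing the highest-priority slide format."""
    best = {}
    for name in filenames:
        path = PurePosixPath(name)
        ext = path.suffix.lower()
        if ext not in FORMAT_PRIORITY:
            continue  # not a slide format this sketch knows about
        key = path.stem.lower()
        current = best.get(key)
        if current is None:
            best[key] = name
            continue
        current_ext = PurePosixPath(current).suffix.lower()
        if FORMAT_PRIORITY[ext] < FORMAT_PRIORITY[current_ext]:
            best[key] = name
    return list(best.values())
```

With this ranking, an upload of talk.pptx and talk.pdf yields only talk.pdf, so the same slides are not chunked twice.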
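
The URL routing rule under ScraperManager integration can be illustrated with a short sketch. It assumes the explicit prefix is written directly in front of the URL in the weblist (for example indico-https://...); the helper name is_indico_url is hypothetical and the real routing code may differ.

```python
from urllib.parse import urlparse

INDICO_PREFIX = "indico-"  # explicit weblist prefix named in the PR description


def is_indico_url(entry: str) -> bool:
    """Return True if a weblist entry should be routed to the Indico scraper."""
    entry = entry.strip()
    if entry.startswith(INDICO_PREFIX):
        return True  # explicit opt-in, regardless of hostname
    parsed = urlparse(entry)
    host = (parsed.hostname or "").lower()
    # Auto-detection: "indico" in the hostname and /event/ in the path.
    return "indico" in host and "/event/" in parsed.path
```

Under this rule a category URL on an Indico host is not auto-detected, which is consistent with the note that only event URLs are documented.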
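
The Vectorstore change, a one-line metadata header gated on "scraper"="indico", might look something like the sketch below. The metadata key names, the labels, and the " | " separator are assumptions chosen for illustration; only the gating condition and the list of fields come from the description above.

```python
# Hypothetical metadata keys and labels; the real field names may differ.
HEADER_FIELDS = [
    ("event_title", "Event"),
    ("date", "Date"),
    ("contribution", "Contribution"),
    ("speaker", "Speaker"),
    ("affiliation", "Affiliation"),
    ("start_time", "Start"),
    ("duration", "Duration"),
    ("session", "Session"),
]


def with_indico_header(chunk_text: str, metadata: dict) -> str:
    """Prepend a one-line header to Indico chunks; leave other sources untouched."""
    if metadata.get("scraper") != "indico":
        return chunk_text
    parts = [
        f"{label}: {metadata[key]}"
        for key, label in HEADER_FIELDS
        if metadata.get(key)  # skip missing or empty fields
    ]
    if not parts:
        return chunk_text
    return " | ".join(parts) + "\n" + chunk_text
```

Because the header is plain text inside the chunk, a keyword query for a speaker name can match the chunk even when the slide body never mentions that name.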
pmlugato approved these changes on Apr 13, 2026
split API interaction from resource construction.