Skip to content

Generalized ZIM extraction pipeline #1

@kmbandy

Description

@kmbandy

Summary

Current npc/pipeline/corpus/zim_extractor.py is Stack Exchange-specific (parses .question, .accepted-answer, rel="tag" etc.). Doesn't work on Wikipedia or other ZIM sources.

Goal

Build a generalized ZIM pipeline with:

  • Auto-detection of ZIM type (Stack Exchange, Wikipedia, wiki-style, etc.)
  • Pluggable site-specific parsers that all emit the same ShareGPT JSONL output
  • Single entry point that routes to the right parser automatically

Parsers needed

  • Stack Exchange (exists — zim_extractor.py)
  • Wikipedia — targeted topic extraction (GPU/compute/architecture articles)
  • Generic wiki fallback for other ZIM sources

Notes

  • Wikipedia ZIM is 115GB — targeted extraction by article title list is the right approach, not full scan
  • Output format must be ShareGPT JSONL compatible with dataset.py local_file loading
  • Good candidate for a free big model to build autonomously while training is running

Metadata

Metadata

Assignees

No one assigned

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions