Summary
Current npc/pipeline/corpus/zim_extractor.py is Stack Exchange-specific (parses .question, .accepted-answer, rel="tag" etc.). Doesn't work on Wikipedia or other ZIM sources.
Goal
Build a generalized ZIM pipeline with:
- Auto-detection of ZIM type (Stack Exchange, Wikipedia, wiki-style, etc.)
- Pluggable site-specific parsers that all emit the same ShareGPT JSONL output
- Single entry point that routes to the right parser automatically
Parsers needed
Notes
- Wikipedia ZIM is 115GB — targeted extraction by article title list is the right approach, not full scan
- Output format must be ShareGPT JSONL compatible with
dataset.py local_file loading
- Good candidate for a free big model to build autonomously while training is running
Summary
Current
npc/pipeline/corpus/zim_extractor.pyis Stack Exchange-specific (parses.question,.accepted-answer,rel="tag"etc.). Doesn't work on Wikipedia or other ZIM sources.Goal
Build a generalized ZIM pipeline with:
Parsers needed
zim_extractor.py)Notes
dataset.pylocal_fileloading