Generalized ZIM extraction pipeline

## Summary

Current `npc/pipeline/corpus/zim_extractor.py` is Stack Exchange-specific (parses `.question`, `.accepted-answer`, `rel="tag"` etc.). Doesn't work on Wikipedia or other ZIM sources.

## Goal

Build a generalized ZIM pipeline with:
- Auto-detection of ZIM type (Stack Exchange, Wikipedia, wiki-style, etc.)
- Pluggable site-specific parsers that all emit the same ShareGPT JSONL output
- Single entry point that routes to the right parser automatically

## Parsers needed

- [x] Stack Exchange (exists — `zim_extractor.py`)
- [ ] Wikipedia — targeted topic extraction (GPU/compute/architecture articles)
- [ ] Generic wiki fallback for other ZIM sources

## Notes

- Wikipedia ZIM is 115GB — targeted extraction by article title list is the right approach, not full scan
- Output format must be ShareGPT JSONL compatible with `dataset.py` `local_file` loading
- Good candidate for a free big model to build autonomously while training is running


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Generalized ZIM extraction pipeline #1

Summary

Goal

Parsers needed

Notes

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Generalized ZIM extraction pipeline #1

Description

Summary

Goal

Parsers needed

Notes

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions