Record- and token-level data provenance for AI training datasets.
When a data contributor requests removal, model trainers face a practical gap: unlearning algorithms require a forget set, yet no tool can locate which training records belong to a given author. Existing provenance systems operate at file or dataset level, forcing catastrophic over-deletion. We present ob, a record- and token-level data provenance system that propagates author identity through data processing pipelines and resolves revocation requests into precise forget sets via deterministic queries. Evaluation on 219,555 Wikipedia pages demonstrates that record-level provenance eliminates the over-deletion of dataset-level removal (between 1.3× and 101×, depending on the author's share of the data), while integration adds 1.3–4.0% throughput overhead with HuggingFace and 2.1–19.0% with Datatrove on wiki data. On a 1.7B model, provenance-based forget sets improve unlearning by 42% over random baselines.
Source code lives in the rust-originblame repository, which contains both the Rust native implementation and the Python package.
# Rust binary (recommended for performance)
cd rust-originblame && cargo build --release
# Python package (with optional Rust backend)
cd rust-originblame/python && pip install .
The CLI requires Python >= 3.12 and typer. When the Rust binary is available, the Python package automatically delegates to it for performance-critical operations.
# 1. Initialize tracking in your dataset directory
ob init
# 2. Register an author (e.g., a data source you scraped from)
ob author.add "Wikimedia" "wikimedia@example.com"
# 3. Register a section (a file + author + license combination)
ob register.add \
--path raw/wiki_en.xml \
--authors wikimedia@example.com \
--license CC-BY-SA-4.0 \
--year 2024
# 4. Track data lines as you process them (Python API)
python << 'EOF'
import sys
sys.path.insert(0, "path/to/rust-originblame/python/src")
from ob import source, track
source.append("raw/wiki_en.xml") # activate the section
for line in open("data.jsonl"):
    record = {"text": line.strip(), "lang": "en"}
    track(record, file="data.jsonl")
EOF
# 5. Check provenance for a specific line
ob blame data.jsonl 42
That's it. ob blame tells you exactly which author/section a line came from.
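To make the track-then-blame flow concrete, here is a toy in-memory model of it. It is a sketch for intuition only, not ob's implementation: the class `ToyProvenance`, its method names, the 1-based line numbering, and the second source `raw/other.xml` are all assumptions introduced for this example.

```python
import hashlib
from collections import defaultdict


class ToyProvenance:
    """In-memory stand-in for the .ob/ store (illustrative only)."""

    def __init__(self):
        self.stack = []                   # currently active sources, innermost last
        self.by_file = defaultdict(list)  # file -> one entry per tracked line

    def push(self, source_path):
        self.stack.append(source_path)

    def pop(self):
        return self.stack.pop()

    def track(self, record, file):
        # Hash the record text and remember every source active right now.
        digest = hashlib.sha256(record["text"].encode("utf-8")).hexdigest()
        self.by_file[file].append((digest, tuple(self.stack)))

    def blame(self, file, line_no):
        # Line numbers are 1-based in this sketch (an assumption).
        return self.by_file[file][line_no - 1][1]


prov = ToyProvenance()
prov.push("raw/wiki_en.xml")
prov.track({"text": "first article"}, file="data.jsonl")
prov.pop()
prov.push("raw/other.xml")  # hypothetical second source
prov.track({"text": "second article"}, file="data.jsonl")
prov.pop()

print(prov.blame("data.jsonl", 1))  # ('raw/wiki_en.xml',)
```

The point of the stack is that attribution is decided when a line is tracked, which is why provenance has to be recorded at collection time.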
OriginBlame tracks provenance at data collection time -- not retroactively. It stores metadata in .ob/ inside your repository, using plain JSONL files organized into three layers:
.ob/
authors/ # who: name, email, id = sha256(name+email)
sections/ # what: file path + authors + license, sharded by sha256
document-index/ # which line came from where: (line_hash, file, sources)
token-index.gpt2/ # how many tokens each document contributed (per tokenizer)
index/ # binary index: id → refs[] (bucket prefixes + token ranges)
log # operation audit trail
- Content-addressable: every record is indexed by SHA-256 hash. No IDs to manage, no central database.
- Decentralized: metadata lives in your repo. No server, no config files, no external state.
- Zero ML dependencies: the core ob package has no ML imports. Optional ob-util adds parsers and embedding-based reconciliation.
- Reconcile after edits: when data files change, ob reconcile uses hash matching (Pass 1) and optional embedding similarity (Pass 2) to re-link provenance to modified lines.
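The content-addressable idea can be sketched with the standard library alone. The layout above gives the author ID as sha256(name+email); the exact byte encoding, the absence of a separator, and the way a line is normalized before hashing are not specified here, so the choices below (UTF-8, plain concatenation, trailing newline stripped) are assumptions, as are the function names.

```python
import hashlib


def author_id(name: str, email: str) -> str:
    """Hex SHA-256 of name+email (assumed: plain concatenation, UTF-8)."""
    return hashlib.sha256((name + email).encode("utf-8")).hexdigest()


def line_hash(text: str) -> str:
    """Hex SHA-256 of a record's text (assumed: trailing newline stripped)."""
    return hashlib.sha256(text.rstrip("\n").encode("utf-8")).hexdigest()


wikimedia = author_id("Wikimedia", "wikimedia@example.com")

# The same inputs always yield the same 64-character ID,
# so no counter or central database is needed to assign it.
assert wikimedia == author_id("Wikimedia", "wikimedia@example.com")
assert len(wikimedia) == 64

# Identical content hashes identically, whichever file it sits in.
assert line_hash("hello\n") == line_hash("hello")
```

Because IDs are derived from content, two repositories that register the same author independently end up with the same key, which is what makes decentralized metadata mergeable.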
ob init Initialize .ob/ tracking directory
ob author.add NAME EMAIL Register an author
ob register.add --path PATH --authors EMAIL --license LICENSE --year YEAR
Register a section (path + authors + license + year)
ob blame FILE LINE Show which section(s) a specific line belongs to
ob show [--author NAME] [--email EMAIL] [--section HASH] [--license NAME] [--revoked] [--index]
Show provenance metadata by various dimensions
ob revoke (--author NAME | --email EMAIL | --section HASH | --line-hash HASH --file FILE)
Revoke author/section claims at multiple granularities
ob purge --file FILE [--author EMAIL] [--dry-run] [--index]
Physically delete revoked data from tracked files
ob clean Merge PID files, archive revoked records, rotate log
ob merge absorb PATH Absorb provenance from another repository
ob reconcile FILE Reconcile provenance after data edits (hash + optional embedding)
--model/-m MODEL Embedding model name (enables Pass 2 semantic matching)
--threshold/-t FLOAT Cosine similarity threshold (default: 0.85)
--embedding-api/-e URL OpenAI-compatible embedding API URL
--compute-all-embeddings Compute and store embeddings for ALL lines
ob index build Build provenance index for fast lookups
ob status Show summary statistics (authors, sections, records)
ob log Show operation audit trail
ob version Show ob version
# Token-level provenance (add --tokenizer to show/revoke/status)
ob show --author Alice --tokenizer gpt2 Show token counts by author
ob revoke --author Alice --tokenizer gpt2 Revoke author's token entries
ob status --tokenizer gpt2 Show token-index statistics
ob generate-set --tokenizer gpt2 -o forget.bin Generate binary forget set (bitmask)
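As a rough illustration of what a binary forget set expressed as a bitmask could look like, the sketch below packs one bit per token position. The on-disk layout of forget.bin is not described in this README, so the bit order, the half-open ranges, and the function names here are hypothetical.

```python
def build_forget_mask(total_tokens: int, revoked_ranges: list[tuple[int, int]]) -> bytes:
    """Pack one bit per token; bit i is 1 if token i lies in a revoked [start, end) range."""
    mask = bytearray((total_tokens + 7) // 8)
    for start, end in revoked_ranges:
        for i in range(max(start, 0), min(end, total_tokens)):
            mask[i // 8] |= 1 << (i % 8)  # least-significant bit first (arbitrary choice)
    return bytes(mask)


def is_forgotten(mask: bytes, token_index: int) -> bool:
    """Check whether a single token position is marked for forgetting."""
    return bool((mask[token_index // 8] >> (token_index % 8)) & 1)


# 20 tokens in total; tokens 3..6 and 16..19 belong to the revoked author.
mask = build_forget_mask(20, [(3, 7), (16, 20)])
print(mask.hex())                                  # 78000f
print(sum(is_forgotten(mask, i) for i in range(20)))  # 8
```

A bitmask costs one bit per token regardless of how many authors are revoked, which suits token counts in the hundreds of millions.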
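import json
import sys
sys.path.insert(0, "path/to/rust-originblame/python/src")
from ob import init, author_add, register_section, source, track

def read_jsonl(path):
    # Helper for this example only (not part of ob): yield one dict per JSONL line
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)

init()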
# Register provenance metadata
author_add("Wikimedia", "wikimedia@example.com")
register_section("raw/wiki.xml", ["wikimedia@example.com"], "CC-BY-SA-4.0", "2024")
# Track data lines (uses source stack for attribution)
source.append("raw/wiki.xml")
for record in read_jsonl("data.jsonl"):
    track(record, file="data.jsonl")
source.pop()
# Section filtering (optional - activate specific section by hash)
source.append("raw/wiki.xml", section="abcd1234efgh5678")
track(record, file="data.jsonl")
source.pop()
# Scoped tracking with context manager
with source.sources("raw/wiki.xml"):
    for record in read_jsonl("data.jsonl"):
        track(record, file="data.jsonl")
# Query provenance via CLI (delegates to native Rust binary)
# ob blame data.jsonl 42
# ob show --author Wikimedia
# ob revoke --author Wikimedia --tokenizer gpt2
The ob-util package lives in rust-originblame/python/packages/ob-util/. It adds parsers, embedding reconciliation, and copyright export:
pip install ./rust-originblame/python/packages/ob-util
pip install "./rust-originblame/python/packages/ob-util[reconcile]" # with torch + sentence-transformers (quoted so shells such as zsh do not glob the brackets)
ob parse --format mediawiki --input raw/wiki.xml --output parsed.jsonl
# Reconcile with local embedding model
ob reconcile data.jsonl --model all-MiniLM-L6-v2
# Reconcile via embedding API (e.g. LM Studio, vLLM)
ob reconcile data.jsonl -m nomic-embed-text-v1.5 -e http://localhost:1234/v1
# Compute embeddings for all lines (prepare for future reconcile)
ob reconcile data.jsonl -m nomic-embed-text-v1.5 -e http://localhost:1234/v1 --compute-all-embeddings
ob export-copyright --format dep5 --output debian/copyright
OriginBlame: Record- and Token-Level Data Provenance for AI Training Datasets
See paper/originblame.tex for the full paper (LaTeX source). Artifact at Zenodo. Repository archived at Zenodo.
Paper compiles with pdflatex originblame.tex && bibtex originblame && pdflatex originblame.tex && pdflatex originblame.tex. Benchmark reproduction requires:
| Resource | Size | Source |
|---|---|---|
| zhwiki XML dump | ~2 GB | Wikimedia dumps (8 .7z files; URLs in benchmarks/README.md) |
| Qwen3-1.7B model | 3.8 GB | huggingface-cli download Qwen/Qwen3-1.7B --local-dir benchmarks/models/qwen3-1.7b |
| Linux kernel | ~4 GB | git clone https://mirrors.ustc.edu.cn/linux.git && git checkout e75a43c7cec459a07d91ed17de4de13ede2b7758 |
| Zhipu API key | — | Required for QA data generation in MU experiments (set ZHIPU_API_KEY in .env) |
| Embedding API | — | Required for semantic reconcile only: OpenAI-compatible API at http://localhost:1234/v1 (LM Studio / vLLM with nomic-embed-text-v1.5) |
All pipeline and machine unlearning (MU) results are fully deterministic given the same QA data, seed (42), and model weights. Hash-only reconcile and all query benchmarks require no API keys. See benchmarks/README.md for full setup instructions.
Evaluated on a Chinese Wikipedia dump (219,555 pages, 482,543 contributors) at four scales (1k–220k pages):
Revocation Precision (10k scale) — Line-level provenance eliminates over-deletion:
| Revoking Author | Share | Lines Removed (ob) | Over-deletion (dataset-level) |
|---|---|---|---|
| InternetArchiveBot | 79.5% | 7,953 | 1.3× |
| Walter Grassroot | 17.1% | 1,712 | 5.8× |
| KLBot2 | 5.0% | 499 | 20.0× |
| HuangQQ | 1.0% | 99 | 101.0× |
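The over-deletion column follows directly from the line counts: deleting the whole dataset to honor one author's request removes total/owned times as many lines as needed. The sketch below reproduces the table's figures under the assumption that the 10k-scale dataset holds exactly 10,000 lines; the function name is ours.

```python
def over_deletion(total_lines: int, author_lines: int) -> float:
    """Lines removed by a dataset-level deletion, relative to the lines the author owns."""
    return total_lines / author_lines


TOTAL = 10_000  # assumed line count of the 10k-scale dataset
lines_removed = {
    "InternetArchiveBot": 7_953,
    "Walter Grassroot": 1_712,
    "KLBot2": 499,
    "HuangQQ": 99,
}

for author, n in lines_removed.items():
    print(f"{author}: {over_deletion(TOTAL, n):.1f}x")
# InternetArchiveBot: 1.3x
# Walter Grassroot: 5.8x
# KLBot2: 20.0x
# HuangQQ: 101.0x
```

The factor grows as the author's share shrinks, so dataset-level deletion is most wasteful for exactly the small contributors who make up most of the author list.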
Reconcile Recovery (after 10% edit + 5% delete + 5% insert mutation):
| Scale | Hash Match | Semantic Match | Recovery |
|---|---|---|---|
| 1k | 865 | 103 | 96.3% |
| 10k | 8,479 | 1,294 | 98.1% |
| 100k | 84,821 | 13,222 | 98.2% |
| 100k | 84,821 | — | 84.9%† |
†Hash-only (Pass 1). Semantic matching was not measured at these scales due to embedding API throughput constraints.
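The two passes behind the Hash Match and Semantic Match columns can be sketched as follows. This is a simplified illustration, not the ob reconcile implementation: the bag-of-words `toy_embed` stands in for a real embedding model, the greedy best-match strategy is an assumption, and only the 0.85 default threshold is taken from the CLI reference.

```python
import hashlib
import math
from collections import Counter


def sha(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def toy_embed(text: str) -> Counter:
    """Bag-of-words stand-in for a real embedding model (illustration only)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[k] * b[k] for k in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def reconcile(old_lines, new_lines, threshold=0.85, embed=toy_embed):
    """Map each new line index to (match_kind, old_index); (None, None) if unmatched."""
    by_hash = {sha(t): i for i, t in enumerate(old_lines)}
    result, claimed = {}, set()
    # Pass 1: exact content-hash match.
    for j, text in enumerate(new_lines):
        i = by_hash.get(sha(text))
        if i is not None:
            result[j] = ("hash", i)
            claimed.add(i)
    # Pass 2: best cosine similarity among old lines not yet claimed.
    for j, text in enumerate(new_lines):
        if j in result:
            continue
        candidates = [(cosine(embed(text), embed(old)), i)
                      for i, old in enumerate(old_lines) if i not in claimed]
        score, i = max(candidates, default=(0.0, None))
        if i is not None and score >= threshold:
            result[j] = ("semantic", i)
            claimed.add(i)
        else:
            result[j] = (None, None)
    return result


old = [
    "the quick brown fox jumps over the lazy dog",
    "provenance is tracked per record",
    "this line will be deleted",
]
new = [
    "provenance is tracked per record",                   # moved, unchanged
    "the quick brown fox jumps over the lazy dog today",  # lightly edited
    "a brand new line",                                   # inserted
]
print(reconcile(old, new))
# {0: ('hash', 1), 1: ('semantic', 0), 2: (None, None)}
```

Running the cheap hash pass first means the embedding model only sees the lines that actually changed, which matters when embedding throughput is the bottleneck.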
Scalability (3-run avg., ms; native implementation with mmap, rayon, binary index):
| Scale | blame | show | show_idx | revoke | purge | purge_idx |
|---|---|---|---|---|---|---|
| 1k | 1 | 3 | 3 | <1 | 0.6 | 3 |
| 10k | 1 | 9 | 10 | <1 | 0.7 | 41 |
| 100k | 1 | 33 | 34 | <1 | 5.8 | 106 |
| 220k | 3 | 80 | 78 | <1 | 12 | 190 |
†Synthetic benchmark. All operations except purge_idx (106 ms at 100k, 190 ms at 220k) complete in under 100 ms at 220k lines.
Storage overhead: decreases with scale from 0.32× at 1k lines to 0.22× at 220k lines. Line coverage: 100% at all scales.
Token-Level Streaming Benchmark — Real gpt2 tokenization on zhwiki data, no JSONL produced:
| Pages | Tokens | Datatrove Drop | HF Drop | Storage (Datatrove) | Query (ms) |
|---|---|---|---|---|---|
| 1k | 2.8M | −13.8% | −2.0% | 1.33× | 3 |
| 10k | 25.9M | −19.0% | −2.5% | 1.29× | 9 |
| 100k | 302.4M | −13.4% | −1.3% | 1.24× | 33 |
| 219,555 | 712.4M | −2.1% | −4.0% | 1.23× | 69 |
Machine Unlearning Evaluation: 8 unlearning experiments (2 forget set types × 2 algorithms × 2 authors) test whether ob's line-level provenance produces better forget sets than a random baseline. RMU is included as a known-limitation baseline (it is incompatible with QLoRA; see benchmarks/README.md for details). For NPO, line-level forget sets dominate the random baseline on all four metrics (PPL, ROUGE-L), demonstrating that provenance-based localization directly affects unlearning quality.
Cross-Domain Generalization — Linux kernel source code with git blame attribution (3 scales):
| Files | Authors | Datatrove Drop | HF Drop | Storage (Datatrove) | Over-deletion (file vs record) |
|---|---|---|---|---|---|
| 1,000 | 671 | −25.7% | −0.2% | 1.06× | 9× |
| 10,000 | 5,285 | −40.9% | −1.0% | 1.02× | — |
| 44,222 | 6,964 | — | −2.5% | 1.01× | 1.3× |
Attribution uses git blame (line-level authorship), not git log commit authors, on the top N C/H files from a full (non-shallow) clone of the Linux kernel repository. File-level deletion remains wasteful even with accurate attribution: at the smallest scale, revoking Linus Torvalds at file granularity would delete 9× as many lines as necessary.
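One way to read the file-versus-record over-deletion figure is as the ratio between the lines in every file an author touched and the lines that author actually wrote. The sketch below computes that ratio from per-line authorship of the kind git blame yields; the function name, the file paths, and the author names are hypothetical, and the benchmark's exact definition may differ.

```python
def over_deletion_file_vs_record(files: dict[str, list[str]], author: str) -> float:
    """Ratio of lines deleted at file granularity to lines the author wrote.

    `files` maps a path to its per-line author list (one entry per line).
    """
    own = sum(authors.count(author) for authors in files.values())
    if own == 0:
        raise ValueError(f"{author!r} wrote no lines")
    # File-level revocation drops every file the author touched, whole.
    file_level = sum(len(authors) for authors in files.values() if author in authors)
    return file_level / own


# Hypothetical blame output for three tiny files.
blame = {
    "kernel/a.c": ["alice", "bob", "bob", "bob"],
    "kernel/b.h": ["bob", "bob"],
    "kernel/c.c": ["alice", "carol", "carol", "carol", "carol", "carol"],
}
print(over_deletion_file_vs_record(blame, "alice"))  # 5.0
print(over_deletion_file_vs_record(blame, "bob"))    # 1.2
```

Authors who contribute a few lines to many files, a common pattern in kernel development, drive this ratio up the most.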
# Rust implementation
cd rust-originblame && cargo test # 71 tests
# Python tests (ob-util package only; core ob delegates to Rust)
cd python && pytest packages/ob-util/tests/ # 76 tests
ruff check src/
This repository (originblame) contains the paper and benchmarks:
- paper/: LaTeX source for the paper
- benchmarks/: evaluation scripts and results
Source code lives in rust-originblame:
- Rust native implementation (src/)
- Python package (python/src/ob/) with optional Rust backend
- Optional utilities (python/packages/ob-util/)
- Rust native implementation — completed: independent rust-originblame repository with mmap, rayon parallelism, and binary index. All queries sub-100ms at 220k lines.
- Token-level provenance — completed: independent token-index layer with streaming mode, binary forget-set generation, and framework integration (Datatrove, HuggingFace) with provenance tracking overhead as low as 1.3% at 100k scale.
- Multi-level revocation — completed: three-level revocation model (author, section, line-hash) with lazy cascade and reversible tags.
- Cross-domain evaluation — completed: Linux kernel source code with git blame attribution, demonstrating generalization beyond wiki-style text.
- Machine unlearning validation — completed: experimental validation showing that line-level provenance produces better forget sets than random baselines (42% improvement in forgetting, 23% in utility preservation).
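The multi-level revocation item above can be pictured with a small toy model. It reflects one possible reading of "lazy cascade" and "reversible tags": a revocation only records a tag, its effect on each line is resolved at query time, and removing the tag undoes it. The class, its fields, and the rule that any revoked author revokes the whole section are assumptions of this sketch, not a description of ob's internals.

```python
from dataclasses import dataclass, field


@dataclass
class Revocations:
    """Reversible revocation tags at three levels (toy model)."""
    authors: set = field(default_factory=set)
    sections: set = field(default_factory=set)
    line_hashes: set = field(default_factory=set)

    def is_revoked(self, line_hash: str, section: str, section_authors: list[str]) -> bool:
        # Lazy cascade (as interpreted here): adding a tag rewrites nothing;
        # each line's status is worked out only when it is queried.
        return (
            line_hash in self.line_hashes
            or section in self.sections
            or any(a in self.authors for a in section_authors)
        )


rev = Revocations()
line = ("hash-of-line-42", "section-1", ["wikimedia@example.com"])

rev.authors.add("wikimedia@example.com")      # author-level tag
revoked_now = rev.is_revoked(*line)           # True, resolved at query time
rev.authors.discard("wikimedia@example.com")  # reversing is just removing the tag
revoked_after = rev.is_revoked(*line)         # False
print(revoked_now, revoked_after)             # True False
```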
Target venue: EDBT 2027 (CCF-B).