This document describes the provenance tracking and integrity verification system implemented in Phase 7.
Every file ingested into the Oraculus-DI-Auditor system is cryptographically verified using SHA-256 checksums and tracked in a provenance ledger. This ensures:
- Data Integrity: Files can be verified against their original checksums
- Audit Trail: Complete lineage from source to ingestion
- Reproducibility: Re-running ingestion produces verifiable results
- Tamper Detection: Any modification to files is immediately detectable
Provenance records are stored in JSONL (JSON Lines) format, with one record per line:
{
"file": "/absolute/path/to/source/file.xml",
"sha256": "a3b2c1d4e5f6789...",
"source": "https://uscode.house.gov/download/...",
"jurisdiction": "federal",
"size": 12345,
"metadata": {
"format": "xml",
"filename": "title42.xml",
"title": "42",
"section": "1983"
}
}file(required): Absolute path to the source filesha256(required): SHA-256 hash of the file contentssource(required): URL or description of the data sourcejurisdiction(required): Legal jurisdiction (e.g., "federal", "california")size(required): File size in bytesmetadata(optional): Additional metadata specific to the document type
When XML files are ingested via scripts/ingest_and_index.py --format xml, provenance is automatically recorded:
from oraculus_di_auditor.ingestion.checksum import record_provenance
record_provenance(
path=xml_file,
source_url="https://uscode.house.gov/...",
output="data/provenance.jsonl",
jurisdiction="federal",
metadata={"format": "xml", "filename": "title42.xml"}
)For custom ingestion workflows:
from pathlib import Path
from oraculus_di_auditor.ingestion.checksum import record_provenance
# Record provenance for a file
record = record_provenance(
path="/data/legal/uscode/title42.xml",
source_url="https://uscode.house.gov/download/releasepoints/us/pl/118/xml_uscAll.zip",
output="data/provenance.jsonl",
jurisdiction="federal",
metadata={
"format": "xml",
"title": "42",
"description": "Public Health and Welfare"
}
)
print(f"Recorded provenance for {record['file']}")
print(f"SHA-256: {record['sha256']}")The recommended method for verification:
# Verify all files in provenance log
python scripts/verify_integrity.py --input data/provenance.jsonl
# Show detailed results for each file
python scripts/verify_integrity.py --input data/provenance.jsonl --verboseSample output:
======================================================================
File Integrity Verification
======================================================================
Verifying files from: data/provenance.jsonl
======================================================================
Verification Results
======================================================================
Total records: 150
✓ Verified: 148
✗ Failed: 1
⚠ Missing: 1
Success rate: 98.7%
⚠ Some files failed verification or are missing
Detailed output (--verbose):
======================================================================
Detailed Results
======================================================================
✓ /data/legal/uscode/title42.xml
Status: verified
Expected: a3b2c1d4e5f6789...
Actual: a3b2c1d4e5f6789...
✗ /data/legal/cfr/title21.xml
Status: failed
Expected: b4c5d6e7f8a9012...
Actual: c5d6e7f8a9012b4...
⚠ /data/legal/ca/penal_code.xml
Status: missing
Expected: d6e7f8a9012b4c5...
from oraculus_di_auditor.ingestion.checksum import verify_integrity
# Verify all files
results = verify_integrity("data/provenance.jsonl")
print(f"Total: {results['total']}")
print(f"Verified: {results['verified']}")
print(f"Failed: {results['failed']}")
print(f"Missing: {results['missing']}")
# Check specific files
for detail in results['details']:
if detail['status'] != 'verified':
print(f"Issue with {detail['file']}: {detail['status']}")Calculate SHA-256 hash for any file:
from oraculus_di_auditor.ingestion.checksum import file_checksum
# Calculate checksum
hash_value = file_checksum("/data/legal/uscode/title42.xml")
print(f"SHA-256: {hash_value}")Implementation details:
- Reads file in 8KB chunks to handle large files efficiently
- Returns lowercase hexadecimal string
- Raises
FileNotFoundErrorif file doesn't exist
For each provenance record:
- Load the expected SHA-256 hash from the record
- Check if the file still exists at the recorded path
- Calculate the current SHA-256 hash
- Compare expected vs. actual hash
- Record verification status (verified, failed, or missing)
Default location: data/provenance.jsonl
This file is gitignored to prevent committing large provenance logs to the repository.
JSON Lines (JSONL) format allows:
- Streaming: Process one record at a time
- Append-only: New records added without reading entire file
- Human-readable: Each line is valid JSON
- Tool-friendly: Compatible with jq, grep, etc.
# Count total records
cat data/provenance.jsonl | wc -l
# Find all federal documents
cat data/provenance.jsonl | jq 'select(.jurisdiction == "federal")'
# Extract all checksums
cat data/provenance.jsonl | jq -r '.sha256'
# Find large files (>1MB)
cat data/provenance.jsonl | jq 'select(.size > 1048576)'
# Group by jurisdiction
cat data/provenance.jsonl | jq -r '.jurisdiction' | sort | uniq -cAfter downloading a legal corpus, verify integrity before ingestion:
# Download USC
wget https://uscode.house.gov/download/.../xml_uscAll.zip
# Ingest with provenance tracking
python scripts/ingest_and_index.py \
--source /data/legal/uscode \
--format xml \
--jurisdiction federal
# Verify all files
python scripts/verify_integrity.py --input data/provenance.jsonlPeriodically verify that source files haven't been modified:
# Run weekly verification
python scripts/verify_integrity.py --input data/provenance.jsonl
# Alert if any files fail
if [ $? -ne 0 ]; then
echo "WARNING: Some files failed integrity verification!"
# Send alert, log incident, etc.
fiProve that ingested data matches original sources:
# Re-download corpus
wget https://uscode.house.gov/download/.../xml_uscAll.zip -O uscode_new.zip
# Verify checksums match
unzip uscode_new.zip -d /tmp/uscode_verify
python scripts/verify_integrity.py --input data/provenance.jsonl --verboseTrack the complete lineage of ingested documents:
import json
# Read provenance log
with open("data/provenance.jsonl") as f:
for line in f:
record = json.loads(line)
print(f"Document: {record['file']}")
print(f" Source: {record['source']}")
print(f" Jurisdiction: {record['jurisdiction']}")
print(f" SHA-256: {record['sha256']}")
print(f" Size: {record['size']} bytes")
print()- Collision-resistant: Virtually impossible to create two files with same hash
- One-way: Cannot recover file contents from hash
- Deterministic: Same file always produces same hash
- Avalanche effect: Tiny change in file produces completely different hash
- Does not encrypt: Hashes are public, files are not protected
- Does not sign: Cannot verify who created the hash
- Requires secure storage: Provenance log must be protected from tampering
- Digital Signatures: Sign provenance records with GPG
- Blockchain Ledger: Store hashes in blockchain for tamper-evidence
- Timestamping: Add RFC 3161 timestamps for legal compliance
- Key Management: Integrate with HSM or key vault
Phase 7 provenance tracking is optional and backward-compatible:
- Standard ingestion (TXT/JSON) works without provenance
- XML ingestion automatically creates provenance records
- Existing workflows are unaffected
Source Files (XML) → Checksum → Provenance Record → Normalized JSON → Embeddings
↓
data/provenance.jsonl
Provenance Record → File Lookup → Checksum → Comparison → Report
↓
✓ / ✗ / ⚠
- Verify before ingestion: Check source files match expected checksums
- Keep provenance log safe: Back up
data/provenance.jsonlregularly - Run periodic checks: Schedule weekly/monthly integrity verification
- Document sources: Include detailed source URLs in provenance records
- Version control: Track provenance log changes in separate repository
- Audit compliance: Use provenance for regulatory audit requirements
Problem: File checksum doesn't match provenance record
Possible causes:
- File was modified after ingestion
- File was corrupted during download/transfer
- Original source data was updated
Solution:
# Re-download original file
# Re-verify checksums
# If checksums still don't match, investigate source updatesProblem: File in provenance record no longer exists
Possible causes:
- File was deleted or moved
- Mount point not available (for external storage)
- Path changed due to directory restructuring
Solution:
# Check if file was moved
find /data/legal -name "filename.xml"
# Update provenance record with new path
# Or restore file from backupProblem: Provenance log growing too large
Solutions:
- Split by jurisdiction:
provenance_federal.jsonl,provenance_state.jsonl - Compress old records:
gzip data/provenance_2024.jsonl - Archive completed corpora: Move to long-term storage
- Use SQLite: Migrate to database for efficient querying
- PHASE7_CORPUS.md - Phase 7 corpus integration guide
- ARCHITECTURE.md - System architecture
- SHA-256: NIST FIPS 180-4
- JSONL: JSON Lines Specification