# Indexing, Compression & Diff
**Tier:** Intermediate
**Commands covered:** `index`, `diff`, `blake3`, `extsort`, `extdedup`, `snappy`
Per-command flag reference lives in /docs/help/. This page is the workflow layer — when to reach for each command and how they compose.
These are the performance primitives. None of them transform data — they make subsequent commands faster, smaller, or auditable. `extsort` / `extdedup` / `snappy` are also covered in their respective category pages; this page collects them for the "performance / ops" mindset.
| If you want to… | Use | Notes |
|---|---|---|
| Make `count`, `sample`, `slice`, `stats`, `frequency`, `split`, `schema` faster | `index` | One-time cost; speeds up 9 commands by 2-9× |
| Auto-index files over a size threshold | `QSV_AUTOINDEX_SIZE` env var | See Environment Variables |
| Compute a cryptographic fingerprint of a file | `blake3` | Cache keys, pipeline gating, integrity checks |
| Diff two CSVs ludicrously fast | `diff` | 1M × 9 cols in <600 ms with a primary key |
| Sort a file > RAM | `extsort` | See Aggregation & Statistics |
| Dedup a file > RAM | `extdedup` | See Aggregation & Statistics |
| Stream-compress / decompress Snappy | `snappy` | Most qsv commands auto-handle `.sz`; this is the explicit interface |
## index

Builds a `<file>.idx` sidecar that lets random-access commands (and multithreaded ones) skip parsing rows they don't need. The headline numbers from the README:
- 14 seconds to index the 15 GB / 28M-row NYC 311 dataset
- After indexing:
  - `count`, `sample`, `slice` are instantaneous
  - `frequency`, `split`, `stats`, `schema` become multithreaded with an index (the 🏎️ symbol in the README legend)
  - `luau` gets random-access mode with an index (otherwise sequential only)
The index is automatically used by any command that benefits. If the source file changes, the index goes stale — re-run `qsv index`. If `QSV_AUTOINDEX_SIZE` is set, qsv auto-creates / refreshes the index for files larger than the threshold.
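When `QSV_AUTOINDEX_SIZE` isn't set, the staleness check can be scripted by hand. `needs_reindex` below is an illustrative helper (not a qsv feature) that relies on the shell's `-nt` (newer-than) file test:

```shell
# Illustrative helper: true when the .idx sidecar is missing or older
# than the CSV it indexes.
needs_reindex() {
    csv=$1
    [ ! -f "$csv.idx" ] || [ "$csv" -nt "$csv.idx" ]
}

# Usage: re-index only when needed (assumes qsv is on PATH)
# needs_reindex nyc311-full.csv && qsv index nyc311-full.csv
```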
**Example: one-time index of the NYC 311 full export**

```shell
qsv index nyc311-full.csv   # ~14s for 15 GB
ls -lh nyc311-full.csv*
# 15G nyc311-full.csv
# 27M nyc311-full.csv.idx
```

**Example: index a CSV via shell, then enjoy the speedup**

```shell
qsv index NYC_311_SR_2010-2020-sample-1M.csv
qsv count NYC_311_SR_2010-2020-sample-1M.csv                         # instant
qsv slice -i 500000 NYC_311_SR_2010-2020-sample-1M.csv | qsv flatten # parses 1 row, not 500k
```

**Example: turn on auto-indexing for any file > 50 MB**

```shell
export QSV_AUTOINDEX_SIZE=50000000
# Now qsv index runs automatically on big files when needed
```

See also: /docs/help/index.md, Environment Variables — QSV_AUTOINDEX_SIZE, Performance Tuning, Stats Cache & Caching.
## diff

The headline number: 1M × 9 columns in under 600 ms. `diff` is not a line-level differ — it's a primary-key-aware differ. Each row in left/right is identified by one or more primary-key columns; the differ classifies each row as added, removed, or modified.
**Primary-key uniqueness is required.** Use `qsv extdedup --select keycol data.csv --no-output` to check.
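If qsv isn't handy, the same uniqueness check can be sketched with plain awk. `dupe_keys` is an illustrative helper (column 1 as the key, comma delimiter, no quoted-field handling), not a qsv command:

```shell
# Illustrative cross-check: print any key values that appear more than
# once in the first column of a comma-delimited CSV (header skipped).
# Caveat: naive split on ',', so quoted fields containing commas break it.
dupe_keys() {
    awk -F, 'NR > 1 { seen[$1]++ } END { for (k in seen) if (seen[k] > 1) print k }' "$1"
}
```

An empty result means the key column is unique and safe to use with `diff --select`.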
**Example: weekly regulatory CSV diff**

```shell
qsv diff --select id last_week.csv this_week.csv > delta.csv
```

Output columns include `diffresult` (Add / Remove / Modify), the primary key, and a side-by-side view of changed fields.
**Example: diff with different delimiters on each side**

```shell
qsv diff --delimiter-left '\t' --delimiter-right ',' \
    --select 'BBL' \
    last.tsv this.csv
```

**Example: handle differently-structured headers**

```shell
qsv diff --no-headers-right \
    --select 1 \
    last.csv right_no_headers.csv
```

**Example: gate a CI job on no-changes**

```shell
qsv diff --select id reference.csv candidate.csv \
  | qsv count
# A non-zero count means the diff is non-empty
```

See also: /docs/help/diff.md, extdedup — for verifying primary-key uniqueness, blake3 — for fingerprint-level "did anything change at all" checks, Recipe: Diff & Audit.
## blake3

Multithreaded, mmap-backed BLAKE3 hashing. Functionally similar to `b3sum` but bundled into qsv. Supports keyed hashing, key derivation, variable-length output, and checksum verification.
The CSV-relevant uses are cache keys ("did the input change between runs?"), pipeline gating ("only run downstream if upstream changed"), and integrity checks ("is this file the same as the one signed off last week?").
**Example: fingerprint a CSV for a pipeline-gating cache key**

```shell
KEY=$(qsv blake3 input.csv | awk '{print $1}')
echo "Cache key: $KEY"
```

**Example: verify a downloaded .zip matches an expected hash**

```shell
echo "abc123... downloaded.zip" | qsv blake3 --check -
```

**Example: short hash for use as a column in a CSV**

```shell
qsv blake3 -l 8 input.csv
```

`-l N` truncates the output to N bytes before hex encoding.
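Because BLAKE3 is an extendable-output function whose shorter digests are prefixes of longer ones, `-l 8` yields the first 16 hex characters of the full digest. The equivalent truncation in plain shell (`short_id` is an illustrative helper, not a qsv command):

```shell
# Take the first 16 hex characters (8 bytes) of a full-length hex digest.
short_id() {
    printf '%s\n' "$1" | cut -c1-16
}

# Usage (assumes qsv is on PATH):
# short_id "$(qsv blake3 input.csv | awk '{print $1}')"
```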
**Example: keyed BLAKE3 for keyed-hash-based pseudonymization**

```shell
echo -n 'my-32-byte-secret-key-............' \
  | qsv blake3 --keyed sensitive.csv
```

See also: /docs/help/blake3.md, pseudo — for reversible pseudonymization, diff — for row-level diff after a hash-level detection, BLAKE3 spec.
## extsort

External merge-sort for files > RAM. Multithreaded. See Aggregation & Statistics → extsort for the full treatment.

```shell
qsv extsort --select 'Created Date' nyc311-full.csv > nyc311-by-date.csv
```

## extdedup

On-disk hash-table dedup. Preserves input order (unlike `dedup`). See Aggregation & Statistics → extdedup.

```shell
qsv extdedup --select session_id huge_clickstream.csv > unique.csv
```

## snappy

Streaming Snappy compression. Multithreaded compression is 5-6× faster than the auto-compression most qsv commands do on `.sz` output. See Conversion & I/O → snappy.

```shell
qsv snappy compress nyc311-full.csv > nyc311-full.csv.sz
```

## See also

- Command Reference (index)
- Performance Tuning — when to index, how multithreading works
- Environment Variables — QSV_AUTOINDEX_SIZE, cache config
- Stats Cache & Caching
- Aggregation & Statistics — extsort / extdedup
- Conversion & I/O — snappy deep-dive
- Cookbook → Diff & Audit
- Cookbook → Larger-than-RAM CSV