
Joel Natividad edited this page May 13, 2026 · 2 revisions

Indexing, Compression & Diff

Tier: Intermediate
Commands covered: index, diff, blake3, extsort, extdedup, snappy

Per-command flag reference lives in /docs/help/. This page is the workflow layer — when to reach for each command and how they compose.

These are the performance primitives. None of them transform data — they make subsequent commands faster, smaller, or auditable. extsort / extdedup / snappy are also covered in their respective category pages; this page collects them for the "performance / ops" mindset.

Quick decision table

| If you want to… | Use | Notes |
| --- | --- | --- |
| Make count, sample, slice, stats, frequency, split, schema faster | index | One-time cost; speeds up 9 commands by 2-9× |
| Auto-index files over a size threshold | QSV_AUTOINDEX_SIZE env var | See Environment Variables |
| Compute a cryptographic fingerprint of a file | blake3 | Cache keys, pipeline gating, integrity checks |
| Diff two CSVs ludicrously fast | diff | 1M × 9 cols in <600 ms with a primary key |
| Sort a file > RAM | extsort | See Aggregation & Statistics |
| Dedup a file > RAM | extdedup | See Aggregation & Statistics |
| Stream-compress / decompress Snappy | snappy | Most qsv commands auto-handle .sz; this is the explicit interface |

index

Builds a <file>.idx sidecar that lets random-access commands (and multithreaded ones) skip parsing rows they don't need. The headline numbers from the README:

  • 14 seconds to index the 15 GB / 28 M-row NYC 311 dataset
  • After indexing: count, sample, slice are instantaneous
  • frequency, split, stats, schema become multithreaded with an index (the 🏎️ symbol in the README legend)
  • luau gets random-access mode with an index (otherwise sequential only)

The index is automatically used by any command that benefits. If the source file changes, the index goes stale — re-run qsv index. If QSV_AUTOINDEX_SIZE is set, qsv auto-creates / refreshes the index for files larger than the threshold.

Example: one-time index of the NYC 311 full export

```shell
qsv index nyc311-full.csv     # ~14s for 15 GB
ls -lh nyc311-full.csv*
# 15G  nyc311-full.csv
# 27M  nyc311-full.csv.idx
```

Example: index a CSV via shell, then enjoy the speedup

```shell
qsv index NYC_311_SR_2010-2020-sample-1M.csv
qsv count NYC_311_SR_2010-2020-sample-1M.csv      # instant
qsv slice -i 500000 NYC_311_SR_2010-2020-sample-1M.csv | qsv flatten  # parses 1 row, not 500k
```

Example: turn on auto-indexing for any file > 50 MB

```shell
export QSV_AUTOINDEX_SIZE=50000000
# qsv now auto-creates (and refreshes) the index for any file over 50 MB
# the first time an index-aware command touches it
```
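If you cannot set the environment variable globally, the same "index only when it pays off, and only when stale" logic can be scripted by hand. A minimal sketch, assuming a 50 MB threshold; `needs_reindex` and `reindex_if_stale` are illustrative helper names, not qsv commands:

```shell
# Sketch of what QSV_AUTOINDEX_SIZE does for you automatically:
# (re)index a CSV only if it is big enough and the .idx sidecar
# is missing or older than the data file.
THRESHOLD=50000000   # 50 MB, same as the example above

needs_reindex() {
  csv="$1"
  idx="$csv.idx"
  size=$(wc -c < "$csv")
  [ "$size" -lt "$THRESHOLD" ] && return 1   # too small to bother
  [ ! -f "$idx" ] && return 0                # no index yet
  [ "$csv" -nt "$idx" ]                      # data newer than index = stale
}

reindex_if_stale() {
  if needs_reindex "$1"; then
    qsv index "$1"
  fi
}
```

The `-nt` test is the key trick: it makes the wrapper idempotent, so you can call it at the top of every pipeline run without paying the indexing cost twice.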

See also: /docs/help/index.md, Environment Variables → QSV_AUTOINDEX_SIZE, Performance Tuning, Stats Cache & Caching.

diff

The headline number: 1M × 9 columns in under 600 ms. diff is not a line-level differ — it's a primary-key-aware differ. Each row in left/right is identified by one or more primary key columns; the differ classifies each row as added, removed, or modified.

Primary-key uniqueness is required. Use qsv extdedup --select keycol data.csv --no-output to check.
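The uniqueness check can be automated by comparing the total row count against the count after deduplicating on the key column. A sketch, assuming that mismatched counts mean duplicate keys; `assert_unique_key` is an illustrative helper, not a qsv command:

```shell
# Sketch: fail fast if the chosen key column is not unique.
# Feed it two numbers, e.g.:
#   total=$(qsv count data.csv)
#   unique=$(qsv extdedup --select keycol data.csv | qsv count)
assert_unique_key() {
  total="$1"
  unique="$2"
  if [ "$total" -ne "$unique" ]; then
    echo "key is not unique: $total rows, $unique distinct keys" >&2
    return 1
  fi
}
```

Running this before the diff turns a confusing wrong-answer failure mode into a clear error message.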

Example: weekly regulatory CSV diff

```shell
qsv diff --select id last_week.csv this_week.csv > delta.csv
```

Output columns include diffresult (Add / Remove / Modify), the primary key, and a side-by-side view of changed fields.

Example: diff with different delimiters on each side

```shell
qsv diff --delimiter-left '\t' --delimiter-right ',' \
  --select 'BBL' \
  last.tsv this.csv
```

Example: handle differently-structured headers

```shell
qsv diff --no-headers-right \
  --select 1 \
  last.csv right_no_headers.csv
```

Example: gate a CI job on no-changes

```shell
qsv diff --select id reference.csv candidate.csv \
  | qsv count
# A non-zero count means the diff is non-empty
```
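The count printed above can be turned into an explicit pass/fail exit status for CI. A minimal sketch; `gate_on_count` is an illustrative helper, not part of qsv:

```shell
# Sketch: convert a diff row count into a CI exit status.
# Feed it the output of:
#   qsv diff --select id reference.csv candidate.csv | qsv count
gate_on_count() {
  count="$1"
  if [ "$count" -gt 0 ]; then
    echo "FAIL: $count row(s) differ between reference and candidate" >&2
    return 1
  fi
  echo "OK: no differences"
}
```

Most CI systems fail a job on any non-zero exit status, so returning 1 here is enough to stop the pipeline.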

See also: /docs/help/diff.md, extdedup — for verifying primary-key uniqueness, blake3 — for fingerprint-level "did anything change at all" checks, Recipe: Diff & Audit.

blake3

Multithreaded, mmap-backed BLAKE3 hashing. Functionally similar to b3sum but bundled into qsv. Supports keyed hashing, key derivation, variable-length output, and checksum verification.

The CSV-relevant uses are cache keys ("did the input change between runs?"), pipeline gating ("only run downstream if upstream changed"), and integrity checks ("is this file the same as the one signed off last week?").

Example: fingerprint a CSV for a pipeline-gating cache key

```shell
KEY=$(qsv blake3 input.csv | awk '{print $1}')
echo "Cache key: $KEY"
```
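A cache key is only useful with the comparison logic around it. A sketch of the gating pattern, assuming the previous key is stored in a sidecar file; `run_if_changed` and the `.b3` file name are illustrative conventions, not qsv features:

```shell
# Sketch: run a pipeline stage only when the input fingerprint changed.
run_if_changed() {
  keyfile="$1"; newkey="$2"; shift 2
  if [ -f "$keyfile" ] && [ "$(cat "$keyfile")" = "$newkey" ]; then
    echo "unchanged, skipping: $*"
    return 0
  fi
  "$@" && printf '%s' "$newkey" > "$keyfile"
}

# Usage with qsv (hypothetical stage):
#   KEY=$(qsv blake3 input.csv | awk '{print $1}')
#   run_if_changed input.b3 "$KEY" qsv stats input.csv
```

The key file is only updated after the command succeeds, so a failed stage is retried on the next run rather than silently skipped.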

Example: verify a downloaded .zip matches an expected hash

```shell
echo "abc123...  downloaded.zip" | qsv blake3 --check -
```

Example: short hash for use as a column in a CSV

```shell
qsv blake3 -l 8 input.csv
```

-l N truncates the output to N bytes before hex encoding, so -l 8 yields 16 hex characters.

Example: keyed BLAKE3 for keyed-hash-based pseudonymization

```shell
echo -n 'my-32-byte-secret-key-............' \
  | qsv blake3 --keyed sensitive.csv
```

See also: /docs/help/blake3.md, pseudo — for reversible pseudonymization, diff — for row-level diff after a hash-level detection, BLAKE3 spec.

extsort (cross-reference)

External merge-sort for files > RAM. Multithreaded. See Aggregation & Statistics → extsort for the full treatment.

```shell
qsv extsort --select 'Created Date' nyc311-full.csv > nyc311-by-date.csv
```

extdedup (cross-reference)

On-disk hash-table dedup. Preserves input order (unlike dedup). See Aggregation & Statistics → extdedup.

```shell
qsv extdedup --select session_id huge_clickstream.csv > unique.csv
```

snappy (cross-reference)

Streaming Snappy compression. Multithreaded compression is 5-6× faster than the auto-compression most qsv commands do on .sz output. See Conversion & I/O → snappy.

```shell
qsv snappy compress nyc311-full.csv > nyc311-full.csv.sz
```
