Joel Natividad edited this page May 13, 2026 · 14 revisions

Cookbook

Tier: Beginner

Task-oriented recipes. Each recipe follows the same shape: Problem → Data → Solution → Variations → Performance notes → See also. Pick by goal:

Recipe index

Inspect & profile

| Recipe | Tier | Anchor dataset | Commands used |
| --- | --- | --- | --- |
| Inspect an Unknown CSV | Beginner | wcp.csv | sniff, headers, count, stats, frequency, sample, table |
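For orientation, the first-look questions these commands answer can be sketched with plain POSIX tools on a toy file. This is a stand-in sketch only, not the recipe itself: the real recipe runs qsv's sniff, stats, frequency, and friends against wcp.csv.

```shell
# Toy CSV standing in for an unknown file
printf 'city,country,population\nParis,FR,2148000\nLyon,FR,513000\nTokyo,JP,13960000\n' > toy.csv

head -n 1 toy.csv                                  # column names (cf. qsv headers)
tail -n +2 toy.csv | wc -l                         # row count (cf. qsv count)
cut -d, -f2 toy.csv | tail -n +2 | sort | uniq -c  # value frequencies (cf. qsv frequency)
```

Unlike these stand-ins, the qsv commands handle quoting, embedded delimiters, and multi-gigabyte inputs; the sketch is only to show what each profiling step is asking of the data.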

Clean & normalize

| Recipe | Tier | Anchor dataset | Commands used |
| --- | --- | --- | --- |
| Clean & Normalize | Beginner | Boston 311 | input, safenames, replace, fill, dedup |

Enrich

| Recipe | Tier | Anchor dataset | Commands used |
| --- | --- | --- | --- |
| Geographic Enrichment | Intermediate | NYC 311 | geocode, luau, lookup tables |
| Date Enrichment | Intermediate | NYC 311 | datefmt, luau, partition |
| Fetch & Cache | Advanced | NOAA GHCN, GitHub stargazers | fetch, --disk-cache, --jaq |

Validate

| Recipe | Tier | Anchor dataset | Commands used |
| --- | --- | --- | --- |
| JSON Schema Validation | Intermediate | NYC 311 (1M sample) | schema, validate (custom keywords) |

Aggregate & analyze

| Recipe | Tier | Anchor dataset | Commands used |
| --- | --- | --- | --- |
| Stats → Insights | Intermediate | NYC 311 | stats, moarstats, pragmastat, describegpt |
| Build a Data Pipeline | Advanced | Allegheny property sales | safenames, schema, validate, sqlp, pivotp, template |

Combine & compare

| Recipe | Tier | Anchor dataset | Commands used |
| --- | --- | --- | --- |
| Multi-Table Joins | Intermediate | NYC 311 + NOAA | join, joinp (asof) |
| Diff & Audit | Intermediate | Weekly regulatory CSVs | sortcheck, extsort, diff, blake3 |

Big files

| Recipe | Tier | Anchor dataset | Commands used |
| --- | --- | --- | --- |
| Larger-than-RAM CSV | Advanced | NYC 311 (16 GB / 27M rows) | index, extsort, sqlp, extdedup, jemalloc tuning |

Integration

| Recipe | Tier | Anchor dataset | Commands used |
| --- | --- | --- | --- |
| CKAN Integration | Intermediate | any CKAN portal | safenames, applydp, to postgres, DataPusher+ |

For a per-command reference (vs. per-task), see the Command Reference.

Anchor datasets

Examples in the recipes use these consistently:

  • wcp.csv — World Cities Population, 2.7M rows / 124 MB. files/wcp.zip.
  • NYC 311 Service Requests — 1M sample at files/nyc311samp.csv; full 16 GB / 27M from NYC Open Data; bundled in resources/test/NYC_311_SR_2010-2020-sample-1M.csv.
  • Boston 311 — small fixtures in qsv's resources/test/ (also .gz, .zst, .parquet).
  • Allegheny County property sales — resources/test/allegheny_property_sales*.csv with pre-built .idx and .stats.csv.
  • NOAA GHCN-Daily — fetched via qsv fetch from https://www.ncei.noaa.gov/data/global-historical-climatology-network-daily/.
  • GitHub stargazers — https://api.github.com/repos/dathere/qsv/stargazers.
  • country_continent.csv — small lookup, used in joins.

Contributing

Have a recipe to share? See Contributing to the Wiki for how to add one. Recipes are also welcome in the main qsv CONTRIBUTING.md.


Legacy recipes (preserved)

These are the original wiki Cookbook recipes from before the Phase C expansion. Each is being expanded into a dedicated Recipe page with the template above — links at the top of this index. Originals are kept here as a backup until each replacement is verified.

CKAN (legacy) — see Recipe: CKAN Integration

Using qsv, ckanapi, jq, and xargs:

  • get a CSV of datasets/users/groups/orgs in a CKAN instance

```shell
ckanapi -r https://demo.ckan.org dump datasets --all | qsv jsonl > datasets.csv
ckanapi -r https://demo.ckan.org dump users --all | qsv jsonl > users.csv
ckanapi -r https://demo.ckan.org dump groups --all | qsv jsonl > groups.csv
ckanapi -r https://demo.ckan.org dump organizations --all | qsv jsonl > organizations.csv
```

  • get a CSV of resources for a given dataset

```shell
ckanapi -r https://catalog.data.gov action package_show \
  id=low-altitude-aerial-imagery-obtained-with-unmanned-aerial-systems-uas-flights-over-black-beach \
  | jq -c '.resources[]' \
  | qsv jsonl \
  > resources.csv
```

  • download a dataset's CSV resource and profile it

```shell
ckanapi -r https://data.cnra.ca.gov action package_show id="wellstar-oil-and-gas-wells1" \
  > wellstar-oil-and-gas-wells.json
cat wellstar-oil-and-gas-wells.json \
  | jq -c '.resources[] | select(.name=="CSV") | .url' \
  | xargs -L 1 wget -O wellstar.csv
qsv stats --everything wellstar.csv > wellstar-stats.csv
```
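The shape of that last pipeline (JSON → jq filter → one URL per line → xargs download) is worth isolating. Here is just the fan-out step, with wget stubbed out by echo so it runs offline; the URLs are placeholders, not real resources:

```shell
# xargs -L 1 runs the command once per input line,
# appending that line as the final argument
printf '%s\n' 'https://example.org/a.csv' 'https://example.org/b.csv' \
  | xargs -L 1 echo wget -O wellstar.csv
```

Drop the echo to actually download. Note that with a fixed -O filename each download overwrites the last, which is fine in the recipe above because the jq select yields a single URL.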

Date Enrichment (legacy) — see Recipe: Date Enrichment

(Snippet kept for back-compat with old external links. See the new recipe page for an expanded version using current command syntax.)

```shell
qsv luau map Quarter -x -f getquarter.lua nyc311samp.csv > result-qtr.csv
qsv partition Quarter nyc311byqtr result-qtr.csv
qsv datefmt "Created Date" nyc311samp.csv > result-iso8601.csv
qsv datefmt "Created Date" --formatstr "%a %b %e %T %Y %z" nyc311samp.csv > result-datefmt.csv
qsv datefmt "Created Date" --formatstr "%Y" --new-column "Created Year" nyc311samp.csv > result-year.csv
qsv luau map TAT -x -f turnaroundtime.lua nyc311samp.csv > result-tat.csv
```

Files: nyc311samp.csv, getquarter.lua, turnaroundtime.lua.
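The --formatstr values are strftime patterns, so they can be previewed on a fixed timestamp before committing to a full qsv datefmt run. A sketch assuming GNU date (the -d flag is a GNU extension; BSD/macOS date differs):

```shell
# Preview the "%a %b %e %T %Y %z" pattern used above
LC_ALL=C date -u -d '2015-09-17 12:34:56' '+%a %b %e %T %Y %z'
# Preview the year-only pattern
LC_ALL=C date -u -d '2015-09-17 12:34:56' '+%Y'
```

LC_ALL=C pins the %a/%b abbreviations to English regardless of the shell's locale.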

Multi-table join avoiding repeated columns (legacy) — see Recipe: Multi-Table Joins

```shell
cp table_A.csv combined.csv
for NEXT in table_B.csv table_C.csv table_D.csv; do
  qsv join 2 combined.csv 1 <(qsv select 2,11- "$NEXT") > new.csv
  mv new.csv combined.csv
done
```
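The accumulate-into-combined.csv pattern is independent of qsv. A toy version with paste standing in for qsv join, and cut -f2- standing in for the qsv select that drops the repeated key column, shows the shape:

```shell
# Three toy tables sharing an id column
printf 'id,a\n1,x\n2,y\n' > table_A.csv
printf 'id,b\n1,p\n2,q\n' > table_B.csv
printf 'id,c\n1,m\n2,n\n' > table_C.csv

cp table_A.csv combined.csv
for NEXT in table_B.csv table_C.csv; do
  cut -d, -f2- "$NEXT" > next_cols.csv     # drop the repeated key column
  paste -d, combined.csv next_cols.csv > new.csv
  mv new.csv combined.csv
done
head -n 1 combined.csv
```

Caveat: paste matches rows by position, which only works here because the toy files are pre-aligned; the real recipe's qsv join matches on key values, so row order and missing keys are handled correctly.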

qsv cat rows of many CSVs with varying columns (legacy)

CSVs from a third party are unreliable: over time, header formatting may change, and headers may be added or removed. The following script concatenates hundreds of CSV files that share the same basic set of headers, with marginal differences:

```shell
#!/bin/bash
# Usage: bash scripts/fix.sh ./fixed "*.csv"
DIST=$1
TARGETS=$2

mkdir -p "$DIST"

# Step 1: collect the unique header names across all files
# ($TARGETS is deliberately left unquoted so the glob expands here)
unique_headers_list=$(qsv headers -j --union $TARGETS | tr '[:upper:]' '[:lower:]' | sort -u)
unique_headers_comma=$(echo "$unique_headers_list" | paste -sd, -)

# Step 2: for each CSV file, add missing headers with default values
FILES=($TARGETS)
for file in "${FILES[@]}"; do
  FILENAME=${file##*/}
  OUT="$DIST/$FILENAME"
  # normalize the input and lowercase its header row
  qsv input "$file" | awk 'NR==1 {print tolower($0)} NR>1 {print}' > "$OUT.lowercase"
  mv "$OUT.lowercase" "$OUT"
  # headers appearing exactly once across both lists are missing from this file
  MISSING_HEADERS=$(echo -e "$(qsv headers -j "$OUT")\n$unique_headers_list" | sort | uniq -u | tr '\n' ',' | sed 's/,$//' | sed 's/^,//')
  # pad the file with empty columns for the missing headers
  echo "$MISSING_HEADERS" | qsv cat columns -o "$OUT.full" -p "$OUT" -
  mv "$OUT.full" "$OUT"
  # put every file's columns in the same canonical order
  qsv select "$unique_headers_comma" -o "$OUT.ordered" "$OUT"
  mv "$OUT.ordered" "$OUT"
done
```

Note: the original recipe used --intersect. In qsv 19+ the flag was renamed to --union since it always computed a union. Updated above.
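The core trick in Step 2, finding headers that are in the union but absent from a given file, is the `sort | uniq -u` idiom: concatenate the two lists and keep only the lines that appear exactly once. In isolation, with toy header lists and no qsv needed:

```shell
union='city
country
population'
file_headers='city
population'
# Headers in both lists appear twice and are dropped by uniq -u;
# headers missing from the file appear once and survive
missing=$(printf '%s\n%s\n' "$file_headers" "$union" | sort | uniq -u)
echo "$missing"
```

This relies on each file's headers being a subset of the union, which holds by construction since the union was built from the same files.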

Geocoding (legacy) — see Recipe: Geographic Enrichment

Old qsv apply geocode syntax has been superseded by the dedicated geocode command.

```shell
# Old (still works, but prefer the new geocode command):
qsv apply geocode Location --new-column City nyc311samp.csv > result-geocoded.csv

# New:
qsv geocode reverse Location --new-column City --formatstr "%name" --country US nyc311samp.csv > result-geocoded.csv
```
