Cookbook
Tier: Beginner
Task-oriented recipes. Each recipe follows the same shape: Problem → Data → Solution → Variations → Performance notes → See also. Pick by goal:
| Recipe | Tier | Anchor dataset | Commands used |
|---|---|---|---|
| Inspect an Unknown CSV | Beginner | wcp.csv | `sniff`, `headers`, `count`, `stats`, `frequency`, `sample`, `table` |
| Clean & Normalize | Beginner | Boston 311 | `input`, `safenames`, `replace`, `fill`, `dedup` |
| Geographic Enrichment | Intermediate | NYC 311 | `geocode`, `luau`, lookup tables |
| Date Enrichment | Intermediate | NYC 311 | `datefmt`, `luau`, `partition` |
| Fetch & Cache | Advanced | NOAA GHCN, GitHub stargazers | `fetch`, `--disk-cache`, `--jaq` |
| JSON Schema Validation | Intermediate | NYC 311 (1M sample) | `schema`, `validate` (custom keywords) |
| Stats → Insights | Intermediate | NYC 311 | `stats`, `moarstats`, `pragmastat`, `describegpt` |
| Build a Data Pipeline | Advanced | Allegheny property sales | `safenames`, `schema`, `validate`, `sqlp`, `pivotp`, `template` |
| Multi-Table Joins | Intermediate | NYC 311 + NOAA | `join`, `joinp` (asof) |
| Diff & Audit | Intermediate | Weekly regulatory CSVs | `sortcheck`, `extsort`, `diff`, `blake3` |
| Larger-than-RAM CSV | Advanced | NYC 311 (16 GB / 27M rows) | `index`, `extsort`, `sqlp`, `extdedup`, jemalloc tuning |
| CKAN Integration | Intermediate | any CKAN portal | `safenames`, `applydp`, `to postgres`, DataPusher+ |
For a per-command reference (vs. per-task), see the Command Reference.
Examples in the recipes use these consistently:
- `wcp.csv` — World Cities Population, 2.7M rows / 124 MB. `files/wcp.zip`.
- NYC 311 Service Requests — 1M sample at `files/nyc311samp.csv`; full 16 GB / 27M rows from NYC Open Data; bundled in `resources/test/NYC_311_SR_2010-2020-sample-1M.csv`.
- Boston 311 — small fixtures in qsv's `resources/test/` (also `.gz`, `.zst`, `.parquet`).
- Allegheny County property sales — `resources/test/allegheny_property_sales*.csv` with pre-built `.idx` and `.stats.csv`.
- NOAA GHCN-Daily — fetched via `qsv fetch` from https://www.ncei.noaa.gov/data/global-historical-climatology-network-daily/.
- GitHub stargazers — https://api.github.com/repos/dathere/qsv/stargazers.
- `country_continent.csv` — small lookup table, used in joins.
Have a recipe to share? See Contributing to the Wiki for how to add one. Recipes are also welcome in the main qsv CONTRIBUTING.md.
These are the original wiki Cookbook recipes from before the Phase C expansion. Each is being expanded into a dedicated Recipe page with the template above — links at the top of this index. Originals are kept here as a backup until each replacement is verified.
CKAN (legacy) — see Recipe: CKAN Integration

Using qsv, ckanapi, jq and xargs.

- Get a CSV of datasets/users/groups/orgs in a CKAN instance:

```bash
ckanapi -r https://demo.ckan.org dump datasets --all | qsv jsonl > datasets.csv
ckanapi -r https://demo.ckan.org dump users --all | qsv jsonl > users.csv
ckanapi -r https://demo.ckan.org dump groups --all | qsv jsonl > groups.csv
ckanapi -r https://demo.ckan.org dump organizations --all | qsv jsonl > organizations.csv
```

- Get a CSV of resources for a given dataset:

```bash
ckanapi -r https://catalog.data.gov action package_show \
  id=low-altitude-aerial-imagery-obtained-with-unmanned-aerial-systems-uas-flights-over-black-beach \
  | jq -c '.resources[]' \
  | qsv jsonl \
  > resources.csv
```

- Get the latest version of a CSV resource for `wellstar-oil-and-gas-wells1` on https://data.cnra.ca.gov, then run stats on it:

```bash
ckanapi -r https://data.cnra.ca.gov action package_show id="wellstar-oil-and-gas-wells1" \
  > wellstar-oil-and-gas-wells.json
cat wellstar-oil-and-gas-wells.json \
  | jq -c '.resources[] | select(.name=="CSV") | .url' \
  | xargs -L 1 wget -O wellstar.csv
qsv stats --everything wellstar.csv > wellstar-stats.csv
```

Date Enrichment (legacy) — see Recipe: Date Enrichment
(Snippet kept for back-compat with old external links. See the new recipe page for an expanded version using current command syntax.)

```bash
qsv luau map Quarter -x -f getquarter.lua nyc311samp.csv > result-qtr.csv
qsv partition Quarter nyc311byqtr result-qtr.csv
qsv datefmt "Created Date" nyc311samp.csv > result-iso8601.csv
qsv datefmt "Created Date" --formatstr "%a %b %e %T %Y %z" nyc311samp.csv > result-datefmt.csv
qsv datefmt "Created Date" --formatstr "%Y" --new-column "Created Year" nyc311samp.csv > result-year.csv
qsv luau map TAT -x -f turnaroundtime.lua nyc311samp.csv > result-tat.csv
```

Files: `nyc311samp.csv`, `getquarter.lua`, `turnaroundtime.lua`.
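The `getquarter.lua` helper itself is not reproduced here. As a rough illustration of the month-to-quarter mapping such a helper would perform, the same arithmetic can be sketched in plain shell (the date value is a made-up example, not from the dataset):

```shell
# Hypothetical sketch of the month -> quarter mapping a helper like
# getquarter.lua would compute (not the actual Lua source).
d="2015-07-22"                       # example ISO-8601 date
month=${d:5:2}                       # extract "07"
quarter=$(( (${month#0} + 2) / 3 ))  # months 1-3 -> 1, 4-6 -> 2, 7-9 -> 3, 10-12 -> 4
echo "Q$quarter"
```

The `${month#0}` expansion strips a leading zero so the shell does not parse "08"/"09" as invalid octal.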
Multi-table join avoiding repeated columns (legacy) — see Recipe: Multi-Table Joins
```bash
cp table_A.csv combined.csv
for NEXT in table_B.csv table_C.csv table_D.csv; do
  qsv join 2 combined.csv 1 <(qsv select 2,11- $NEXT) > new.csv
  mv new.csv combined.csv
done
```

CSVs from a 3rd party are unreliable. Over time, the header formatting may change, or a header may be added or removed. The following script concatenates hundreds of CSV files that share the same basic set of headers, with marginal differences:
```bash
#!/bin/bash
# bash scripts/fix.sh ./fixed "*.csv"
DIST=$1
TARGETS=$2
mkdir -p "$DIST"

# Step 1: collect the unique header names across all target files
unique_headers_list=$(qsv headers -j --union $TARGETS | tr '[:upper:]' '[:lower:]' | sort -u)
unique_headers_comma=$(echo $unique_headers_list | tr ' ' ',')

# Step 2: for each CSV file, add missing headers with default values
FILES=($TARGETS)
for file in "${FILES[@]}"; do
  FILENAME=${file##*/}
  OUT="$DIST/$FILENAME"
  # lowercase the header row only
  qsv input "$file" | awk 'NR==1 {print tolower($0)} NR>1 {print}' > "$OUT.lowercase"
  mv "$OUT.lowercase" "$OUT"
  # headers present in the union but absent from this file
  MISSING_HEADERS=$(echo -e "$(qsv headers -j "$OUT")\n$unique_headers_list" | sort | uniq -u | tr '\n' ',' | sed 's/,$//' | sed 's/^,//')
  # append the missing columns (empty), then reorder to the canonical order
  echo $MISSING_HEADERS | qsv cat columns -o "$OUT.full" -p "$OUT" -
  mv "$OUT.full" "$OUT"
  qsv select $unique_headers_comma -o "$OUT.ordered" "$OUT"
  mv "$OUT.ordered" "$OUT"
done
```

Note: the original recipe used `--intersect`. In qsv 19+ the flag was renamed to `--union` since it always computed a union. Updated above.
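The set logic in Steps 1–2 can be exercised without qsv: the union is just `sort -u` over the concatenated header lists, and a file's missing headers fall out of the `sort | uniq -u` trick (lines appearing exactly once across the file's own deduplicated headers and the union). A self-contained sketch with two hypothetical header lists:

```shell
# Two hypothetical files' header rows, one header per line.
printf 'id\nname\ncity\n' > /tmp/h_a.txt
printf 'ID\nname\nzip\n'  > /tmp/h_b.txt

# Step 1 equivalent: lowercase everything, then take the union.
union=$(cat /tmp/h_a.txt /tmp/h_b.txt | tr '[:upper:]' '[:lower:]' | sort -u)

# Step 2 equivalent: headers in the union but absent from h_a.txt.
# uniq -u keeps lines that occur exactly once, i.e. only in the union.
missing_a=$(printf '%s\n%s\n' "$(cat /tmp/h_a.txt)" "$union" | sort | uniq -u)
echo "$missing_a"
```

Note the `uniq -u` step assumes each side is already deduplicated, which `sort -u` guarantees for the union.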
Geocoding (legacy) — see Recipe: Geographic Enrichment

Old `qsv apply geocode` syntax has been superseded by the dedicated `geocode` command.

```bash
# Old (still works, but prefer the new geocode command):
qsv apply geocode Location --new-column City nyc311samp.csv > result-geocoded.csv

# New:
qsv geocode reverse Location --new-column City --formatstr "%name" --country US nyc311samp.csv > result-geocoded.csv
```
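What `geocode reverse` does conceptually — match a coordinate to the nearest known place — can be illustrated without qsv. A toy sketch over a two-row, made-up lookup table, using squared planar distance in awk (real reverse geocoding uses geodesic distance and a full cities database):

```shell
# Tiny hypothetical lookup table (illustrative only, not a qsv fixture).
cat > /tmp/cities.csv <<'EOF'
name,lat,lon
New York,40.7128,-74.0060
Boston,42.3601,-71.0589
EOF

# Nearest city to the query point (40.6, -73.9) by squared distance.
nearest=$(awk -F, -v qlat=40.6 -v qlon=-73.9 'NR>1 {
  dlat = $2 - qlat; dlon = $3 - qlon; d = dlat*dlat + dlon*dlon
  if (best == "" || d < bd) { bd = d; best = $1 }
} END { print best }' /tmp/cities.csv)
echo "$nearest"
```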