Recipe: Clean and Normalize
Tier: Beginner
Commands used: input, safenames, replace, fill, dedup, sortcheck
Anchor dataset: Boston 311 (resources/test/boston311-100.csv from the qsv repo, ~100 rows; also available in .gz, .zst, .parquet)
A vendor CSV arrives with:
- non-UTF-8 encoding and a 3-line preamble
- inconsistent column names (Customer Name, customer_id, _id, Column with Spaces!@#)
- sentinel values masquerading as missing data ("N/A", "null", "-", empty strings)
- mostly-filled categorical columns with sparse holes
- exact-duplicate rows from a bad export pipeline
You want a clean, DB-ready CSV for downstream analytics.
Grab the Boston 311 sample:
curl -LO https://raw.githubusercontent.com/dathere/qsv/master/resources/test/boston311-100.csv
ls -lh boston311-100.csv
For this recipe we'll simulate the messy input by mangling a few values (see the sketch below), then walk through the cleanup. In real life, you'd swap boston311-100.csv for whatever vendor file you're cleaning.
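Here is one way to fabricate the messy vendor file from the clean sample. The messy.csv name, the preamble text, and which cells get mangled are all invented for illustration, and the bad-encoding part isn't simulated at all:

# prepend a fake 3-line preamble (text is made up)
{ printf 'Boston 311 Export\nGenerated by VendorTool 2.1\n\n'; cat boston311-100.csv; } > messy.csv
# duplicate the last five data rows to simulate a bad export pipeline
tail -n 5 boston311-100.csv >> messy.csv
# turn the first empty field on each line (where present) into the "N/A" sentinel
sed -i.bak 's/,,/,N\/A,/' messy.csv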
The cleanup is a six-step pipeline. Each step writes to a new file so you can diff between stages while debugging.
Step 1: Normalize encoding, skip the preamble, and trim whitespace with input
qsv input \
--auto-skip \
--trim-headers \
--trim-fields \
--encoding-errors replace \
  boston311-100.csv > step1.csv
- --auto-skip sniffs and skips preamble lines (overrides --skip-lines)
- --trim-headers / --trim-fields strip leading/trailing whitespace
- --encoding-errors replace substitutes invalid UTF-8 bytes with � (alternatives: skip, strict)
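If --auto-skip guesses wrong on a messier file, the --skip-lines flag mentioned above lets you state the preamble length explicitly. A minimal variant, using the simulated messy.csv from earlier (swap in your own vendor file):

qsv input --skip-lines 3 --trim-headers --trim-fields messy.csv > step1.csv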
For true transcoding from a known encoding like ISO-8859-1, run iconv before qsv input (qsv input is lossy, not transcoding-correct).
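A sketch of that transcoding step, assuming the vendor file (vendor.csv here is a placeholder name) really is ISO-8859-1; iconv emits UTF-8 and qsv input reads it from stdin via the - argument:

iconv -f ISO-8859-1 -t UTF-8 vendor.csv \
  | qsv input --auto-skip --trim-headers --trim-fields - > step1.csv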
Step 2: Sanitize column names with safenames
qsv safenames step1.csv > step2.csv
Customer Name → customer_name; Column!@# → column__; _id → reserved__id (CKAN-required); duplicates get a numeric suffix.
Audit-only mode (no rewrite):
qsv safenames --mode V step1.csv
# stderr: 4 unsafe header/s: ["Column with Spaces", "_id", "Phone Number", ""]
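A quick way to eyeball the renames is to print the header row before and after with qsv headers:

qsv headers step1.csv   # original vendor names
qsv headers step2.csv   # names after safenames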
Step 3: Blank sentinel values with replace
qsv replace -i '^(N/A|null|none|-|unknown)$' '' step2.csv > step3.csv
-i makes the regex case-insensitive. The ^...$ anchors ensure we only match cells that are exactly one of those sentinels, not cells that contain them as substrings.
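To confirm the sentinels are gone, a frequency table of a suspect column should no longer list N/A, null, or - among its values. The column name here follows the status column used in the fill step below; pick whichever sparse categorical column your file has:

qsv frequency --select status step3.csv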
Step 4: Fill sparse categorical holes with fill
qsv fill --groupby case_status status step3.csv > step4.csv
If status is sometimes blank within a case_status group, fill from the last non-empty value in the same group (see the toy example below).
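A minimal sketch of the group-wise behavior, using a throwaway two-column file whose name and values are invented for illustration:

printf 'case_status,status\nOpen,New\nOpen,\nClosed,Resolved\nClosed,\n' > toy.csv
qsv fill --groupby case_status status toy.csv
# each blank status is filled from the last non-empty value seen in the same
# case_status group: New for the second Open row, Resolved for the second Closed row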
Variants:
qsv fill --first status step3.csv # use the first non-empty value
qsv fill --backfill status step3.csv # fill leading empties from the first valid value
qsv fill --default 'unknown' status step3.csv # constant default
Step 5: Sort and deduplicate with sort + dedup
qsv sort --select 'case_enquiry_id,open_dt' step4.csv > step5_sorted.csv
qsv dedup --sorted --select 'case_enquiry_id' --dupes-output dupes.csv step5_sorted.csv > step5.csv
- --sorted enables streaming dedup (constant memory)
- --dupes-output dupes.csv keeps an audit trail of every row that got dropped, invaluable for explaining "where did N rows go?" later
For files larger than RAM, swap sort → extsort; dedup --sorted works the same way.
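To answer the "where did N rows go?" question concretely, count and preview the audit file (the --len flag on slice just limits the preview):

qsv count dupes.csv          # how many rows were dropped
qsv slice --len 5 dupes.csv  # peek at the first few dropped rows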
Step 6: Verify sort order with sortcheck
qsv sortcheck --select 'case_enquiry_id' step5.csv
# Sorted: ✓
The whole pipeline as a one-liner:
qsv input --auto-skip --trim-headers --trim-fields boston311-100.csv \
| qsv safenames - \
| qsv replace -i '^(N/A|null|none|-|unknown)$' '' \
| qsv fill --groupby case_status status \
| qsv sort --select 'case_enquiry_id,open_dt' \
| qsv dedup --sorted --select 'case_enquiry_id' --dupes-output dupes.csv \
> cleaned.csv
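A quick sanity check after the one-liner is to compare row counts; the drop is what dedup removed (the rows now in dupes.csv), plus anything input skipped as preamble:

qsv count boston311-100.csv
qsv count cleaned.csv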
If you're cleaning CSVs as part of a CKAN / DataPusher+ pipeline, use the qsvdp variant with applydp; it has the same trim/safen/cast operations in a tiny binary:
qsvdp applydp operations trim,lower 'email' input.csv \
| qsvdp applydp operations cast 'amount' --comparand integer \
| qsvdp safenames - > cleaned.csv
If the file contains PII you can't ship downstream, swap names for stable identifiers:
qsv pseudo customer_name --formatstr 'CUST-{}' --start 1000 cleaned.csv > deidentified.csv
The same input value always maps to the same ID, which keeps referential integrity across multiple exports.
qsv apply operations censor description cleaned.csv > censored.csv
# Replaces profanity with asterisks
Or with a custom regex:
qsv replace '\b\d{3}-\d{2}-\d{4}\b' '<SSN_REDACTED>' cleaned.csv > redacted.csv
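To verify the redaction, search the output for the same pattern; the match file (ssn_rows.csv is just a scratch name) should come back with zero rows:

qsv search '\b\d{3}-\d{2}-\d{4}\b' redacted.csv > ssn_rows.csv
qsv count ssn_rows.csv   # expect 0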
To lock in the cleaned shape, infer a JSON Schema and validate against it:
qsv schema cleaned.csv # produces cleaned.csv.schema.json
# Edit the schema to tighten the rules, then:
qsv validate cleaned.csv cleaned.csv.schema.json
See Recipe: JSON Schema Validation.
Performance notes:
- Each step is O(rows), streaming where possible. The whole pipeline runs in well under a second on Boston 311 (100 rows) and in seconds on millions of rows.
- dedup --sorted is streaming (constant memory). dedup without --sorted loads the whole CSV into memory to sort it first.
- For files > RAM, use extsort and extdedup; both are multithreaded and on-disk.
- The pipeline above does six sequential CSV passes. If you care about throughput, you can fuse replace and apply into one luau script for a single pass (see the sketch below).
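A rough sketch of that fusion. It assumes qsv luau map exposes each column as a Luau variable named after its (safe) header and appends the script's return value as the new column; the positional qsv luau map <new-column> '<script>' <input> form, the fused.csv name, and the variable binding are assumptions here, so check qsv luau --help for your build before relying on it:

qsv luau map status_clean '
  local s = string.gsub(status or "", "^%s*(.-)%s*$", "%1")  -- trim whitespace
  local l = string.lower(s)
  if l == "n/a" or l == "null" or l == "none" or l == "-" or l == "unknown" then
    s = ""                                                    -- blank sentinels
  end
  return s
' step2.csv > fused.csv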
See also:
- Transform & Reshape (every command used here)
- Recipe: CKAN Integration (the safenames workflow taken further)
- Recipe: JSON Schema Validation (verify the cleaned output)
- Recipe: Larger-than-RAM CSV (same pipeline, ext-* variants)
- Aggregation & Statistics → dedup
- Troubleshooting (UTF-8, BOM, weird delimiters)