Recipe: Clean & Normalize

Tier: Beginner
Commands used: input, safenames, replace, fill, sort, dedup, sortcheck
Anchor dataset: Boston 311 (resources/test/boston311-100.csv from the qsv repo, ~100 rows; also available in .gz, .zst, .parquet)

Problem

A vendor CSV arrives with:

  • non-UTF-8 encoding and a 3-line preamble
  • inconsistent column names (Customer Name, customer_id, _id, Column with Spaces!@#)
  • sentinel values masquerading as missing data ("N/A", "null", "-", empty strings)
  • mostly-filled categorical columns with sparse holes
  • exact-duplicate rows from a bad export pipeline

You want a clean, DB-ready CSV for downstream analytics.

Data

Grab the Boston 311 sample:

curl -LO https://raw.githubusercontent.com/dathere/qsv/master/resources/test/boston311-100.csv
ls -lh boston311-100.csv

For this recipe we'll simulate the messy input by mangling a few values, then walk through the cleanup. In real life, you'd swap boston311-100.csv for whatever vendor file you're cleaning.
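One throwaway way to manufacture that mess, purely for illustration (the duplicated rows and the sed substitution target are assumptions about what the sample contains):

cp boston311-100.csv messy.csv
tail -n 3 boston311-100.csv >> messy.csv     # plant exact-duplicate rows
sed -i.bak 's|Closed|N/A|' messy.csv         # plant "N/A" sentinels (assumes "Closed" appears)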

Solution

A six-step pipeline. Each step writes to a new file so you can diff between stages while debugging.
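Once a couple of stage files exist, plain diff is enough to eyeball what a step changed:

diff step2.csv step3.csv | head    # e.g. spot-check what the replace step touched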

1. Normalize encoding / preamble / trim

qsv input \
  --auto-skip \
  --trim-headers \
  --trim-fields \
  --encoding-errors replace \
  boston311-100.csv > step1.csv
  • --auto-skip sniffs and skips preamble lines (overrides --skip-lines)
  • --trim-headers / --trim-fields strip leading/trailing whitespace
  • --encoding-errors replace substitutes invalid UTF-8 bytes with the Unicode replacement character (alternatives: skip, strict)

For true transcoding from a known encoding like ISO-8859-1, run iconv before qsv input (qsv input is lossy, not transcoding-correct).
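For example, assuming the vendor file is ISO-8859-1 (substitute the real encoding), a minimal transcoding front-end looks like:

iconv -f ISO-8859-1 -t UTF-8 vendor.csv \
  | qsv input --auto-skip --trim-headers --trim-fields > step1.csv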

2. Safen column names

qsv safenames step1.csv > step2.csv

Customer Name → customer_name; Column!@# → column__; _id → reserved__id (CKAN-required); duplicates get a numeric suffix.

Audit-only mode (no rewrite):

qsv safenames --mode V step1.csv
# stderr: 4 unsafe header/s: ["Column with Spaces", "_id", "Phone Number", ""]

3. Replace sentinel-null patterns with true empties

qsv replace -i '^(N/A|null|none|-|unknown)$' '' step2.csv > step3.csv

-i makes the regex case-insensitive. The ^...$ anchors ensure we only match cells that are exactly one of those sentinels — not cells that contain them as substrings.
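To spot-check that the sentinels are gone, one option is to search for them again and count the surviving rows (expect 0; qsv search exiting non-zero on "no match" is the outcome you want here):

qsv search -i '^(N/A|null|none|-|unknown)$' step3.csv | qsv count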

4. Forward-fill sparse categorical columns

qsv fill --groupby case_status status step3.csv > step4.csv

If status is sometimes blank within a case_status group, fill from the last non-empty value in the same group. Variants:

qsv fill --first status step3.csv          # use the first non-empty value
qsv fill --backfill status step3.csv       # fill leading empties from the first valid value
qsv fill --default 'unknown' status step3.csv  # constant default
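If the grouped forward-fill behavior is unclear, a tiny made-up file (mini.csv, hypothetical values) shows it:

cat > mini.csv <<'EOF'
case_status,status
Open,New
Open,
Closed,Resolved
Closed,
EOF
qsv fill --groupby case_status status mini.csv
# the blank status in each group is filled with the last non-empty value
# seen for that group: "New" for Open, "Resolved" for Closed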

5. Sort + dedup

qsv sort --select 'case_enquiry_id,open_dt' step4.csv > step5_sorted.csv
qsv dedup --sorted --select 'case_enquiry_id' --dupes-output dupes.csv step5_sorted.csv > step5.csv
  • --sorted enables streaming dedup (constant memory)
  • --dupes-output dupes.csv keeps an audit trail of every row that got dropped — invaluable for explaining "where did N rows go?" later

For files larger than RAM, swap sort → extsort; dedup --sorted works the same way.

6. Verify with sortcheck (optional, for CI)

qsv sortcheck --select 'case_enquiry_id' step5.csv
# Sorted: ✓
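In CI you would typically gate on the exit code; this sketch assumes sortcheck exits non-zero when the file is not sorted on the selection:

qsv sortcheck --select 'case_enquiry_id' step5.csv \
  || { echo "step5.csv is not sorted by case_enquiry_id" >&2; exit 1; }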

Final pipeline (one shot)

qsv input --auto-skip --trim-headers --trim-fields \
  --encoding-errors replace boston311-100.csv \
  | qsv safenames - \
  | qsv replace -i '^(N/A|null|none|-|unknown)$' '' \
  | qsv fill --groupby case_status status \
  | qsv sort --select 'case_enquiry_id,open_dt' \
  | qsv dedup --sorted --select 'case_enquiry_id' --dupes-output dupes.csv \
  > cleaned.csv
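A few quick sanity checks on the result (counts will obviously differ for your own data):

qsv headers cleaned.csv    # confirm the safened column names
qsv count cleaned.csv      # rows kept after dedup
qsv count dupes.csv        # rows dropped as duplicates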

Variations

Use applydp (instead of apply) inside DataPusher+

If you're cleaning CSVs as part of a CKAN / DataPusher+ pipeline, use the qsvdp variant with applydp — it has the same trim/safen/cast operations but in a tiny binary:

qsvdp applydp operations trim,lower 'email' input.csv \
  | qsvdp applydp operations cast 'amount' --comparand integer \
  | qsvdp safenames - > cleaned.csv

See Recipe: CKAN Integration.

Pseudonymize sensitive columns

If the file contains PII you can't ship downstream, swap names for stable identifiers:

qsv pseudo customer_name --formatstr 'CUST-{}' --start 1000 cleaned.csv > deidentified.csv

Same input value always maps to the same ID — useful for keeping referential integrity across multiple exports.

Censor / redact text columns

qsv apply operations censor description cleaned.csv > censored.csv
# Replaces profanity with asterisks

Or with custom regex:

qsv replace '\b\d{3}-\d{2}-\d{4}\b' '<SSN_REDACTED>' cleaned.csv > redacted.csv

Validate the output against an expected schema

qsv schema cleaned.csv     # produces cleaned.csv.schema.json
# Edit the schema to tighten the rules, then:
qsv validate cleaned.csv cleaned.csv.schema.json

See Recipe: JSON Schema Validation.

Performance notes

  • Each step is O(rows), streaming where possible. The whole pipeline runs in well under a second on Boston 311 (100 rows) and in seconds on millions of rows.
  • dedup --sorted is streaming (constant memory). dedup without --sorted loads the whole CSV into memory to sort it first.
  • For files > RAM, use extsort and extdedup — both are multithreaded and on-disk.
  • The pipeline above does six sequential CSV passes. If you care about throughput, you can fuse replace and apply into one qsv luau script for a single pass (sketched below).
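A rough single-pass sketch along those lines (the column name, the sentinel list, and the col table binding are assumptions; check qsv luau --help for the exact API):

qsv luau map status_clean '
  -- illustrative only: empty out sentinel values, then lowercase
  local v = col["status"] or ""
  if v == "N/A" or v == "null" or v == "-" then v = "" end
  return string.lower(v)
' step2.csv > fused.csv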

See also

  • Recipe: CKAN Integration
  • Recipe: JSON Schema Validation