Recipe: Clean & Normalize

Tier: Beginner
Commands used: input, safenames, replace, fill, sort, dedup, sortcheck
Anchor dataset: Boston 311 (resources/test/boston311-100.csv from the qsv repo, ~100 rows; also available in .gz, .zst, .parquet)

Problem

A vendor CSV arrives with:

  • non-UTF-8 encoding and a 3-line preamble
  • inconsistent column names (Customer Name, customer_id, _id, Column with Spaces!@#)
  • sentinel values masquerading as missing data ("N/A", "null", "-", empty strings)
  • mostly-filled categorical columns with sparse holes
  • exact-duplicate rows from a bad export pipeline

You want a clean, DB-ready CSV for downstream analytics.

Data

Grab the Boston 311 sample:

curl -LO https://raw.githubusercontent.com/dathere/qsv/master/resources/test/boston311-100.csv
ls -lh boston311-100.csv

For this recipe we'll simulate the messy input by mangling a few values, then walk through the cleanup. In real life, you'd swap boston311-100.csv for whatever vendor file you're cleaning.
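One throwaway way to manufacture that mess, purely for illustration (the duplicated rows and the sed substitution target are assumptions about what the sample contains):

cp boston311-100.csv messy.csv
tail -n 3 boston311-100.csv >> messy.csv     # plant exact-duplicate rows
sed -i.bak 's|Closed|N/A|' messy.csv         # plant "N/A" sentinels (assumes "Closed" appears)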

Solution

A six-step pipeline. Each step writes to a new file so you can diff between stages while debugging.
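Once a couple of stage files exist, plain diff is enough to eyeball what a step changed:

diff step2.csv step3.csv | head    # e.g. spot-check what the replace step touched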

1. Normalize encoding / preamble / trim

qsv input \
  --auto-skip \
  --trim-headers \
  --trim-fields \
  --encoding-errors replace \
  boston311-100.csv > step1.csv
  • --auto-skip sniffs and skips preamble lines (overrides --skip-lines)
  • --trim-headers / --trim-fields strip leading/trailing whitespace
  • --encoding-errors replace substitutes invalid UTF-8 bytes with the Unicode replacement character (alternatives: skip, strict)

For true transcoding from a known encoding like ISO-8859-1, run iconv before qsv input (qsv input is lossy, not transcoding-correct).
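For example, assuming the vendor file is ISO-8859-1 (substitute the real encoding), a minimal transcoding front-end looks like:

iconv -f ISO-8859-1 -t UTF-8 vendor.csv \
  | qsv input --auto-skip --trim-headers --trim-fields > step1.csv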

2. Safen column names

qsv safenames step1.csv > step2.csv

Customer Name → customer_name; Column!@# → column__; _id → reserved__id (CKAN-required); duplicates get a numeric suffix.

Audit-only mode (no rewrite):

qsv safenames --mode V step1.csv
# stderr: 4 unsafe header/s: ["Column with Spaces", "_id", "Phone Number", ""]

3. Replace sentinel-null patterns with true empties

qsv replace -i '^(N/A|null|none|-|unknown)$' '' step2.csv > step3.csv

-i makes the regex case-insensitive. The ^...$ anchors ensure we only match cells that are exactly one of those sentinels — not cells that contain them as substrings.
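To spot-check that the sentinels are gone, one option is to search for them again and count the surviving rows (expect 0; qsv search exiting non-zero on "no match" is the outcome you want here):

qsv search -i '^(N/A|null|none|-|unknown)$' step3.csv | qsv count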

4. Forward-fill sparse categorical columns

qsv fill --groupby case_status status step3.csv > step4.csv

If status is sometimes blank within a case_status group, fill from the last non-empty value in the same group. Variants:

qsv fill --first status step3.csv          # use the first non-empty value
qsv fill --backfill status step3.csv       # fill leading empties from the first valid value
qsv fill --default 'unknown' status step3.csv  # constant default
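If the grouped forward-fill behavior is unclear, a tiny made-up file (mini.csv, hypothetical values) shows it:

cat > mini.csv <<'EOF'
case_status,status
Open,New
Open,
Closed,Resolved
Closed,
EOF
qsv fill --groupby case_status status mini.csv
# the blank status in each group is filled with the last non-empty value
# seen for that group: "New" for Open, "Resolved" for Closed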

5. Sort + dedup

qsv sort --select 'case_enquiry_id,open_dt' step4.csv > step5_sorted.csv
qsv dedup --sorted --select 'case_enquiry_id' --dupes-output dupes.csv step5_sorted.csv > step5.csv
  • --sorted enables streaming dedup (constant memory)
  • --dupes-output dupes.csv keeps an audit trail of every row that got dropped — invaluable for explaining "where did N rows go?" later

For files larger than RAM, swap sort → extsort; dedup --sorted works the same way.

6. Verify with sortcheck (optional, for CI)

qsv sortcheck --select 'case_enquiry_id' step5.csv
# Sorted: ✓
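In CI you would typically gate on the exit code; this sketch assumes sortcheck exits non-zero when the file is not sorted on the selection:

qsv sortcheck --select 'case_enquiry_id' step5.csv \
  || { echo "step5.csv is not sorted by case_enquiry_id" >&2; exit 1; }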

Final pipeline (one shot)

qsv input --auto-skip --trim-headers --trim-fields \
  --encoding-errors replace boston311-100.csv \
  | qsv safenames - \
  | qsv replace -i '^(N/A|null|none|-|unknown)$' '' \
  | qsv fill --groupby case_status status \
  | qsv sort --select 'case_enquiry_id,open_dt' \
  | qsv dedup --sorted --select 'case_enquiry_id' --dupes-output dupes.csv \
  > cleaned.csv
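A few quick sanity checks on the result (counts will obviously differ for your own data):

qsv headers cleaned.csv    # confirm the safened column names
qsv count cleaned.csv      # rows kept after dedup
qsv count dupes.csv        # rows dropped as duplicates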

Variations

Use applydp (instead of apply) inside DataPusher+

If you're cleaning CSVs as part of a CKAN / DataPusher+ pipeline, use the qsvdp variant with applydp — it has the same trim/safen/cast operations but in a tiny binary:

qsvdp applydp operations trim,lower 'email' input.csv \
  | qsvdp applydp operations cast 'amount' --comparand integer \
  | qsvdp safenames - > cleaned.csv

See Recipe: CKAN Integration.

Pseudonymize sensitive columns

If the file contains PII you can't ship downstream, swap names for stable identifiers:

qsv pseudo customer_name --formatstr 'CUST-{}' --start 1000 cleaned.csv > deidentified.csv

Same input value always maps to the same ID — useful for keeping referential integrity across multiple exports.

Censor / redact text columns

qsv apply operations censor description cleaned.csv > censored.csv
# Replaces profanity with asterisks

Or with custom regex:

qsv replace '\b\d{3}-\d{2}-\d{4}\b' '<SSN_REDACTED>' cleaned.csv > redacted.csv

Validate the output against an expected schema

qsv schema cleaned.csv     # produces cleaned.csv.schema.json
# Edit the schema to tighten the rules, then:
qsv validate cleaned.csv cleaned.csv.schema.json

See Recipe: JSON Schema Validation.

Performance notes

  • Each step is O(rows), streaming where possible. The whole pipeline runs in well under a second on Boston 311 (100 rows) and in seconds on millions of rows.
  • dedup --sorted is streaming (constant memory). dedup without --sorted loads the whole CSV into memory to sort it first.
  • For files > RAM, use extsort and extdedup — both are multithreaded and on-disk.
  • The pipeline above does six sequential CSV passes. If you care about throughput, you can fuse replace and apply into one qsv luau script for a single pass (sketched below).
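A rough single-pass sketch along those lines (the column name, the sentinel list, and the col table binding are assumptions; check qsv luau --help for the exact API):

qsv luau map status_clean '
  -- illustrative only: empty out sentinel values, then lowercase
  local v = col["status"] or ""
  if v == "N/A" or v == "null" or v == "-" then v = "" end
  return string.lower(v)
' step2.csv > fused.csv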

See also

  • Recipe: CKAN Integration
  • Recipe: JSON Schema Validation