
Joel Natividad edited this page May 13, 2026 · 2 revisions

Recipe: JSON Schema Validation

Tier: Intermediate
Commands used: stats, schema, validate, extdedup
Anchor dataset: NYC 311 (1M-row sample) — resources/test/NYC_311_SR_2010-2020-sample-1M.csv

Problem

You receive a CSV every week from a partner and need to know — fast — whether every row conforms to expectations:

  • types (numeric where numeric, dates where dates)
  • ranges (zip codes between 00001 and 99999, ages between 0 and 120)
  • enums (only allowed values for status, borough, agency)
  • composite uniqueness (no duplicate (case_enquiry_id, open_dt) pairs)
  • currency formats (ISO 4217 codes, optional symbols)

qsv does this at up to 780,000 rows / sec against a JSON Schema 2020-12 spec, with three custom keywords that no other JSON Schema validator has: currency, dynamicEnum, uniqueCombinedWith.
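In JSON Schema terms, each expectation above corresponds to a keyword. A hypothetical fragment (field names invented for illustration; currency and uniqueCombinedWith are qsv's custom keywords):

```json
{
  "properties": {
    "zip":    { "type": "string", "pattern": "^\\d{5}$" },
    "age":    { "type": "integer", "minimum": 0, "maximum": 120 },
    "status": { "type": "string", "enum": ["Open", "Closed", "Pending"] },
    "amount": { "type": "string", "format": "currency" },
    "case_enquiry_id": { "uniqueCombinedWith": ["open_dt"] }
  }
}
```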

Data

# 1M-row NYC 311 sample (bundled with qsv)
ls resources/test/NYC_311_SR_2010-2020-sample-1M.csv

# A pre-generated schema for it
ls resources/test/311_Service_Requests_from_2010_to_Present-2022-03-04.csv.schema.json

If you're running outside the qsv repo, fetch both:

curl -LO https://raw.githubusercontent.com/dathere/qsv/master/resources/test/NYC_311_SR_2010-2020-sample-1M.csv
curl -LO https://raw.githubusercontent.com/dathere/qsv/master/resources/test/311_Service_Requests_from_2010_to_Present-2022-03-04.csv.schema.json

Solution

1. Pre-populate the stats cache for speed

qsv stats --cardinality --infer-dates --infer-boolean --stats-jsonl \
  NYC_311_SR_2010-2020-sample-1M.csv
ls NYC_311_SR_2010-2020-sample-1M.*
# .csv  .stats.csv  .stats.csv.data.jsonl

Both schema and validate look for these sidecar files and reuse them; on the 1M sample, that cuts schema generation from ~30 s to ~5 s.

2. Generate a JSON Schema from representative data

qsv schema NYC_311_SR_2010-2020-sample-1M.csv
# Writes NYC_311_SR_2010-2020-sample-1M.csv.schema.json

The output includes inferred types, min/max ranges (for numerics), date formats, enum values (for low-cardinality columns), and pattern regex (when --pattern-columns is used).
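As a rough illustration of the shape (not the literal generated output — the enum values and types come from your data), a low-cardinality column might be rendered as:

```json
{
  "Borough": {
    "type": ["string", "null"],
    "enum": ["BRONX", "BROOKLYN", "MANHATTAN", "QUEENS", "STATEN ISLAND", null]
  }
}
```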

3. Edit the schema to tighten rules

Open NYC_311_SR_2010-2020-sample-1M.csv.schema.json in your editor. Typical edits (the // comments below are annotations — strip them before saving, since JSON does not allow comments):

{
  "properties": {
    "Status": {
      "type": "string",
      "enum": ["Open", "Closed", "Pending", "In Progress"]    // restrict to known values
    },
    "Incident Zip": {
      "type": "string",
      "pattern": "^\\d{5}(-\\d{4})?$"                          // 5-digit or 5-4 ZIP
    },
    "Agency": {
      "type": "string",
      "dynamicEnum": "https://data.cityofnewyork.us/api/views/nyc_agencies.csv"
    },
    "Created Date": {
      "type": "string",
      "format": "date-time"
    }
  },
  "required": ["Unique Key", "Created Date", "Complaint Type", "Borough"],
  "additionalProperties": false
}

The schema is itself a JSON Schema 2020-12 file — so you can validate that it's syntactically correct:

qsv validate schema NYC_311_SR_2010-2020-sample-1M.csv.schema.json

4. Run validation

qsv validate \
  NYC_311_SR_2010-2020-sample-1M.csv \
  NYC_311_SR_2010-2020-sample-1M.csv.schema.json

When everything passes:

  • exit code 0
  • no extra files produced

When some rows fail:

  • exit code non-zero
  • NYC_311_SR_2010-2020-sample-1M.csv.valid.csv — rows that passed
  • NYC_311_SR_2010-2020-sample-1M.csv.invalid.csv — rows that failed
  • NYC_311_SR_2010-2020-sample-1M.csv.validation-errors.tsv — one row per error: row_number, field, error

On the 1M-row sample, expect ~1.5 seconds for a clean run on an M2 Pro. The benchmarked peak is 780k rows/sec.
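Since the errors file is plain tab-separated text with row_number, field, and error columns, it's easy to post-process. A minimal Python sketch (the sample rows and the header line are hypothetical):

```python
import csv
import io
from collections import Counter

# Hypothetical sample of a validation-errors.tsv: one row per error,
# tab-separated, with row_number, field, and error columns.
sample = (
    "row_number\tfield\terror\n"
    "12\tIncident Zip\tdoes not match pattern\n"
    "57\tStatus\tnot one of the allowed enum values\n"
    "98\tIncident Zip\tdoes not match pattern\n"
)

def errors_by_field(tsv_text):
    """Count validation errors per offending field."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return Counter(row["field"] for row in reader)

print(errors_by_field(sample).most_common())
# [('Incident Zip', 2), ('Status', 1)]
```

A per-field tally like this is often enough to tell a systemic schema problem (every row fails one field) from a handful of genuinely bad records.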

5. Use the three custom keywords

currency — ISO 4217 validation

{
  "properties": {
    "Sale Price": {
      "type": "string",
      "format": "currency"
    }
  }
}

Matches $1,000.00, USD1000.00, (€100,00), -USD100.00, ¥10000, and many more. Rejects raw integers without a currency context.
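The actual currency check is implemented inside qsv's validator; as a rough Python approximation of the family of strings it accepts (this regex is illustrative, not qsv's real pattern):

```python
import re

# Rough approximation only — optional parentheses (accounting negative)
# or sign, an ISO 4217 code or a symbol, digits with optional thousands
# separators, and optional decimals. Not qsv's actual implementation.
CURRENCY = re.compile(
    r"^\(?-?(?:[A-Z]{3}|[$€£¥])\s?"
    r"(?:\d{1,3}(?:,\d{3})+|\d+)(?:[.,]\d{1,2})?\)?$"
)

for s in ["$1,000.00", "USD1000.00", "(€100,00)", "-USD100.00", "¥10000", "1000"]:
    print(s, bool(CURRENCY.match(s)))
```

The first five examples match; the bare "1000" does not, mirroring the "rejects raw integers without a currency context" behavior described above.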

dynamicEnum — enum from a CSV (local, HTTP, dathere, CKAN)

{
  "properties": {
    "Agency": {
      "type": "string",
      "dynamicEnum": "NYC_agencies.csv"
    }
  }
}

The dynamicEnum value can be a local path or a URL. Supported schemes:

  • file:// (or just a bare path)
  • http:// / https://
  • dathere:// — qsv's curated lookup-tables repo
  • ckan:// — a CKAN resource ID

The first column of the referenced CSV is the value list (additional columns are ignored).
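To see what this amounts to, here is the same check sketched in Python — read the first column of the lookup CSV into a set and test membership (the lookup file is hypothetical, and whether dynamicEnum excludes a header row is an assumption here; check your lookup file):

```python
import csv
import io

# Hypothetical lookup CSV: the first column is the value list, extra
# columns are ignored (mirroring dynamicEnum's behavior).
lookup_csv = (
    "agency,full_name\n"
    "NYPD,New York Police Department\n"
    "DOT,Department of Transportation\n"
)

def allowed_values(csv_text):
    """Collect the first column of a CSV, skipping the header row."""
    rows = csv.reader(io.StringIO(csv_text))
    next(rows, None)  # assumption: header row is not a valid value
    return {row[0] for row in rows if row}

allowed = allowed_values(lookup_csv)
print("NYPD" in allowed, "FDNY" in allowed)  # True False
```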

uniqueCombinedWith — composite-key uniqueness

{
  "properties": {
    "Created Date": {
      "uniqueCombinedWith": ["Complaint Type", "Incident Address"]
    }
  }
}

This enforces uniqueness of the combination of those three columns across the file.
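The equivalent check in plain Python — hash the combined key and flag repeats — looks like this (toy data and a hypothetical helper, not qsv's implementation):

```python
import csv
import io

# Toy data: the third data row repeats the combined key of the first.
data = (
    "Created Date,Complaint Type,Incident Address\n"
    "2020-01-01,Noise,1 Main St\n"
    "2020-01-01,Noise,2 Main St\n"
    "2020-01-01,Noise,1 Main St\n"
)

def duplicate_combos(csv_text, columns):
    """Return 1-based data-row numbers whose combined key repeats."""
    seen, dupes = set(), []
    for i, row in enumerate(csv.DictReader(io.StringIO(csv_text)), start=1):
        key = tuple(row[c] for c in columns)
        if key in seen:
            dupes.append(i)
        seen.add(key)
    return dupes

cols = ["Created Date", "Complaint Type", "Incident Address"]
print(duplicate_combos(data, cols))  # [3]
```

Note the set of seen keys grows with the file, which is why the performance notes below suggest extdedup when this keyword causes memory pressure.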

6. RFC 4180 / UTF-8-only sanity check (no schema)

qsv validate raw_export.csv
# Exit 0 = the CSV is well-formed and UTF-8.

This is the cheap precondition before running stats (which assumes well-formed input for max performance).
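qsv validate does the real RFC 4180 check; as a conceptual Python sketch of what "well-formed" means here — UTF-8 decodable, consistent field counts per record (the helper name and the ragged-row focus are assumptions, not qsv's definition):

```python
import csv

def is_wellformed_utf8_csv(path):
    """Cheap sanity check: the file decodes as UTF-8 and every record
    has the same field count as the header row. A conceptual sketch,
    not a full RFC 4180 validator."""
    try:
        with open(path, encoding="utf-8", newline="") as f:
            rows = csv.reader(f)
            header = next(rows, None)
            if header is None:
                return False  # empty file
            return all(len(row) == len(header) for row in rows)
    except (UnicodeDecodeError, csv.Error):
        return False
```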

7. Validate a directory of CSVs

qsv validate exports/2024/ schema.json

validate has Extended input support (🗄️) — directories, .infile-list files, snappy-compressed inputs all work.

Variations

Use the --fancy-regex engine for look-around / backreferences

qsv validate --fancy-regex data.csv strict-schema.json

The default regex engine is the fast one (regex crate, same as ripgrep). --fancy-regex opts into fancy-regex for schemas that need look-around / backreferences (rare in JSON Schema, but possible with pattern).

Polars schema for sqlp / joinp / pivotp speed

qsv schema --polars NYC_311_SR_2010-2020-sample-1M.csv
# Writes NYC_311_SR_2010-2020-sample-1M.pschema.json

Polars commands automatically pick this up and skip their own type inference scan.

CI gate

# Inside .github/workflows/data-quality.yml
- run: |
    if ! qsv validate weekly_export.csv schema.json; then
      echo "::error::Validation failed. See weekly_export.csv.validation-errors.tsv"
      exit 1
    fi

Validate a CSV against a CKAN-hosted reference list

In your schema:

{
  "properties": {
    "borough": {
      "type": "string",
      "dynamicEnum": "ckan://nyc-boroughs-resource-id"
    }
  }
}

qsv resolves the ckan:// URL via the CKAN action API. See Recipe: CKAN Integration.

Performance notes

  • 780,000 rows/sec is the validated peak benchmark — see the validate_index benchmark on qsv.dathere.com/benchmarks.
  • Pre-populating the stats cache (step 1) shaves seconds off schema generation.
  • validate is multithreaded (🚀 in the README legend).
  • --fancy-regex is slower than the default; use only when the schema actually needs it.
  • For files > RAM, validate still works (it streams) — but use extdedup for the precondition check of primary key uniqueness if uniqueCombinedWith causes memory pressure.

See also

  • Recipe: CKAN Integration (for ckan:// dynamicEnum lookups)
