Recipe JSON Schema Validate
Tier: Intermediate
Commands used: stats, schema, validate, extdedup
Anchor dataset: NYC 311 (1M-row sample) — resources/test/NYC_311_SR_2010-2020-sample-1M.csv
You receive a CSV every week from a partner and need to know — fast — whether every row conforms to expectations:
- types (numeric where numeric, dates where dates)
- ranges (zip codes between 00001 and 99999, ages between 0 and 120)
- enums (only allowed values for status, borough, agency)
- composite uniqueness (no duplicate (case_enquiry_id, open_dt) pairs)
- currency formats (ISO 4217 codes, optional symbols)
qsv validates against a JSON Schema 2020-12 spec at up to 780,000 rows/sec, and adds three custom extensions that no other JSON Schema validator has: the currency format, and the dynamicEnum and uniqueCombinedWith keywords.
# 1M-row NYC 311 sample (bundled with qsv)
ls resources/test/NYC_311_SR_2010-2020-sample-1M.csv
# A pre-generated schema for it
ls resources/test/311_Service_Requests_from_2010_to_Present-2022-03-04.csv.schema.json

If you're running outside the qsv repo, fetch both:
curl -LO https://raw.githubusercontent.com/dathere/qsv/master/resources/test/NYC_311_SR_2010-2020-sample-1M.csv
curl -LO https://raw.githubusercontent.com/dathere/qsv/master/resources/test/311_Service_Requests_from_2010_to_Present-2022-03-04.csv.schema.json

qsv stats --cardinality --infer-dates --infer-boolean --stats-jsonl \
  NYC_311_SR_2010-2020-sample-1M.csv
ls NYC_311_SR_2010-2020-sample-1M.*
# .csv .stats.csv .stats.csv.data.jsonl

schema and validate both look for these sidecars and reuse them: schema generation takes ~5 s instead of ~30 s on the 1M sample.
qsv schema NYC_311_SR_2010-2020-sample-1M.csv
# Writes NYC_311_SR_2010-2020-sample-1M.csv.schema.json

The output includes inferred types, min/max ranges (for numerics), date formats, enum values (for low-cardinality columns), and pattern regex (when --pattern-columns is used).
Open NYC_311_SR_2010-2020-sample-1M.csv.schema.json in your editor. Typical edits:
The schema is itself a JSON Schema 2020-12 file — so you can validate that it's syntactically correct:
qsv validate schema NYC_311_SR_2010-2020-sample-1M.csv.schema.json

qsv validate \
  NYC_311_SR_2010-2020-sample-1M.csv \
  NYC_311_SR_2010-2020-sample-1M.csv.schema.json

When everything passes:
- exit code 0
- no extra files produced
When some rows fail:
- exit code non-zero
- NYC_311_SR_2010-2020-sample-1M.csv.valid.csv: rows that passed
- NYC_311_SR_2010-2020-sample-1M.csv.invalid.csv: rows that failed
- NYC_311_SR_2010-2020-sample-1M.csv.validation-errors.tsv: one row per error (row_number, field, error)
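When a run fails, a quick per-field tally of the error file usually shows where to look first. The following is a minimal Python sketch, not part of qsv. It assumes the TSV starts with a header row naming the three columns listed above (row_number, field, error); the sample rows and their error messages are hypothetical.

```python
import csv
import io
from collections import Counter


def count_errors_by_field(tsv_text: str) -> Counter:
    """Tally validation errors per field.

    Assumption: the TSV has a header row with the columns
    row_number, field, error.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return Counter(row["field"] for row in reader)


# Hypothetical sample content; real qsv error messages will differ.
sample = (
    "row_number\tfield\terror\n"
    "12\tIncident Zip\tdoes not match pattern\n"
    "12\tStatus\tnot one of the allowed values\n"
    "40\tIncident Zip\tdoes not match pattern\n"
)

# Print fields from most to least frequent offender.
for field, count in count_errors_by_field(sample).most_common():
    print(f"{field}: {count}")
```

To run it on a real file, read the file's text and pass it to count_errors_by_field in place of the sample string.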
On the 1M-row sample, expect ~1.5 seconds for a clean run on an M2 Pro. The benchmarked peak is 780k rows/sec.
{
"properties": {
"Sale Price": {
"type": "string",
"format": "currency"
}
}
}

Matches $1,000.00, USD1000.00, (€100,00), -USD100.00, ¥10000, and many more. Rejects raw integers without a currency context.
{
"properties": {
"Agency": {
"type": "string",
"dynamicEnum": "NYC_agencies.csv"
}
}
}

The dynamicEnum value can be a local path or a URL. Supported schemes:
- file:// (or just a bare path)
- http:// / https://
- dathere:// (qsv's curated lookup-tables repo)
- ckan:// (a CKAN resource ID)
The first column of the referenced CSV is the value list (additional columns are ignored).
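The lookup semantics described above (first column is the allowed list, other columns ignored) can be sketched in Python as follows. This is an illustration of the idea only, not qsv's implementation. The has_header parameter is an assumption made for the sketch, since the text above does not say how the first row is treated, and the lookup content is hypothetical.

```python
import csv
import io


def load_allowed_values(csv_text: str, has_header: bool = True) -> set[str]:
    """Collect the first column of a lookup CSV as the set of allowed values.

    Assumption: the first row is a header and is skipped; pass
    has_header=False if the lookup file has no header row.
    """
    reader = csv.reader(io.StringIO(csv_text))
    if has_header:
        next(reader, None)
    # Additional columns are ignored; only row[0] matters.
    return {row[0] for row in reader if row}


def invalid_values(values, allowed):
    """Return the values that are not in the allowed set, in input order."""
    return [v for v in values if v not in allowed]


# Hypothetical lookup table standing in for NYC_agencies.csv.
lookup = "agency,agency_name\nNYPD,Police\nDOT,Transportation\n"
allowed = load_allowed_values(lookup)
print(invalid_values(["NYPD", "DOT", "XYZ"], allowed))
```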
{
"properties": {
"Created Date": {
"uniqueCombinedWith": ["Complaint Type", "Incident Address"]
}
}
}

This enforces uniqueness of the combination of those three columns across the file.
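To make the composite-uniqueness rule concrete, here is a small Python sketch that reports which column combinations repeat. It only illustrates the constraint and says nothing about how qsv implements or reports it. The function name and sample rows are hypothetical.

```python
import csv
import io
from collections import defaultdict


def duplicate_combinations(csv_text: str, columns: list[str]) -> dict:
    """Map each repeated combination of `columns` to the data rows it appears on.

    Row numbers are 1-based and count data rows only (the header is excluded).
    """
    seen = defaultdict(list)
    reader = csv.DictReader(io.StringIO(csv_text))
    for row_number, row in enumerate(reader, start=1):
        key = tuple(row[c] for c in columns)
        seen[key].append(row_number)
    # Keep only combinations that occur more than once.
    return {key: rows for key, rows in seen.items() if len(rows) > 1}


# Hypothetical sample: rows 1 and 2 share all three values.
sample = (
    "Created Date,Complaint Type,Incident Address\n"
    "2020-01-01,Noise,1 Main St\n"
    "2020-01-01,Noise,1 Main St\n"
    "2020-01-02,Noise,1 Main St\n"
)
columns = ["Created Date", "Complaint Type", "Incident Address"]
print(duplicate_combinations(sample, columns))
```

Holding every seen combination in memory is what makes this kind of check costly on very large files.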
qsv validate raw_export.csv
# Exit 0 = the CSV is well-formed and UTF-8.

This is the cheap precondition before running stats (which assumes well-formed input for max performance).
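As a rough picture of what "well-formed and UTF-8" means, the Python sketch below checks two things: the bytes decode as UTF-8, and every record has as many fields as the header. This is a simplified stand-in. The checks qsv actually performs may be stricter or different, and the function name is hypothetical.

```python
import csv
import io


def check_csv_bytes(data: bytes) -> tuple[bool, str]:
    """Return (ok, message) for a simplified well-formedness check."""
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return False, f"not valid UTF-8: {exc.reason} at byte {exc.start}"

    # newline="" lets the csv module handle line endings inside quoted fields.
    reader = csv.reader(io.StringIO(text, newline=""))
    header = next(reader, None)
    if header is None:
        return False, "empty input"

    # Record 1 is the header, so data records start at 2.
    for record_number, record in enumerate(reader, start=2):
        if len(record) != len(header):
            return False, (
                f"record {record_number} has {len(record)} fields, "
                f"expected {len(header)}"
            )
    return True, "ok"


print(check_csv_bytes(b"a,b\n1,2\n"))
print(check_csv_bytes(b"a,b\n1,2,3\n"))
```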
qsv validate exports/2024/ schema.json

validate has Extended input support (🗄️): directories, .infile-list files, and snappy-compressed inputs all work.
qsv validate --fancy-regex data.csv strict-schema.json

The default regex engine is the fast one (regex crate, same as ripgrep). --fancy-regex opts into fancy-regex for schemas that need look-around / backreferences (rare in JSON Schema, but possible with pattern).
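For a sense of the kind of pattern that needs look-around, here is a short Python sketch. Python's re module is used purely as a stand-in because it supports look-ahead; it is neither of the Rust engines named above. The rule itself (at least six characters from A-Z and 0-9, with at least one letter and one digit) is a hypothetical example.

```python
import re

# Two look-aheads assert "contains a digit" and "contains a letter"
# before the main character class consumes the value.
# re.ASCII keeps \d limited to 0-9.
PATTERN = re.compile(r"^(?=.*\d)(?=.*[A-Z])[A-Z0-9]{6,}$", re.ASCII)


def matches(value: str) -> bool:
    """True if the value satisfies the hypothetical mixed-code rule."""
    return PATTERN.search(value) is not None


for candidate in ["AB1234", "ABCDEF", "123456", "AB12"]:
    print(candidate, matches(candidate))
```

A rule like this can often be expressed without look-around by splitting it into several simpler constraints, which keeps the default engine usable.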
qsv schema --polars NYC_311_SR_2010-2020-sample-1M.csv
# Writes NYC_311_SR_2010-2020-sample-1M.pschema.json

Polars commands automatically pick this up and skip their own type inference scan.
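# Inside .github/workflows/data-quality.yml
- run: |
    # validate exits non-zero when any row fails, so branch on its exit code;
    # a bare failing command would stop the step before the message is printed
    if ! qsv validate weekly_export.csv schema.json; then
      echo "::error::Validation failed. See weekly_export.csv.validation-errors.tsv"
      exit 1
    fi

In your schema: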
{
"properties": {
"borough": {
"type": "string",
"dynamicEnum": "ckan://nyc-boroughs-resource-id"
}
}
}

qsv resolves the ckan:// URL via the CKAN action API. See Recipe: CKAN Integration.
- 780,000 rows/sec is the validated peak benchmark — see the validate_index benchmark on qsv.dathere.com/benchmarks.
- Pre-populating the stats cache (step 1) shaves seconds off schema generation.
- validate is multithreaded (🚀 in the README legend).
- --fancy-regex is slower than the default; use only when the schema actually needs it.
- For files > RAM, validate still works (it streams), but use extdedup for the precondition check of primary key uniqueness if uniqueCombinedWith causes memory pressure.
- Validation & Schema → validate
- Validation & Schema → schema
- docs/Validate.md (canonical reference for custom keywords)
- JSON Schema 2020-12 spec
- Stats Cache & Caching
- Lookup Tables (dynamicEnum deep-dive)
- Recipe: CKAN Integration (dynamicEnum against CKAN)
- Recipe: Clean & Normalize (clean before validating)
- resources/test/311_Service_Requests_from_2010_to_Present-2022-03-04.csv.schema.json (example schema)
Example schema after typical edits:
{
  "properties": {
    "Status": {
      "type": "string",
      "enum": ["Open", "Closed", "Pending", "In Progress"]  // restrict to known values
    },
    "Incident Zip": {
      "type": "string",
      "pattern": "^\\d{5}(-\\d{4})?$"  // 5-digit or 5-4 ZIP
    },
    "Agency": {
      "type": "string",
      "dynamicEnum": "https://data.cityofnewyork.us/api/views/nyc_agencies.csv"
    },
    "Created Date": {
      "type": "string",
      "format": "date-time"
    }
  },
  "required": ["Unique Key", "Created Date", "Complaint Type", "Borough"],
  "additionalProperties": false
}
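The enum and pattern constraints in the schema excerpt above can be tried out on single rows with a short Python sketch. It is a hand-rolled illustration of those two rules only, not a JSON Schema validator, and check_row is a hypothetical helper. The error strings are made up for the example.

```python
import re

# Same pattern as the "Incident Zip" property; re.ASCII keeps \d to 0-9.
ZIP_PATTERN = re.compile(r"^\d{5}(-\d{4})?$", re.ASCII)
# Same values as the "Status" enum.
ALLOWED_STATUS = {"Open", "Closed", "Pending", "In Progress"}


def check_row(row: dict) -> list[str]:
    """Return a list of human-readable problems for one row (empty if none)."""
    errors = []
    if row.get("Status") not in ALLOWED_STATUS:
        errors.append(f"Status: {row.get('Status')!r} is not an allowed value")
    if not ZIP_PATTERN.search(row.get("Incident Zip", "")):
        errors.append(f"Incident Zip: {row.get('Incident Zip')!r} does not match")
    return errors


print(check_row({"Status": "Open", "Incident Zip": "10001-1234"}))
print(check_row({"Status": "Unknown", "Incident Zip": "1000"}))
```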