# SQL and Polars
Tier: Advanced
Commands covered: sqlp, joinp, pivotp, scoresql
Per-command flag references live in `/docs/help/`. This page is the workflow layer — when to reach for each command and how they compose.
These commands are powered by the Polars vectorized query engine. They handle larger-than-RAM data, are multithreaded by default, and accept Polars' PostgreSQL dialect of SQL. If you currently reach for pandas or DuckDB for CSV analytics, this page covers the workflows that can replace both.
The joinp command is also covered in Joins & Set Ops; this page focuses on sqlp, pivotp, and scoresql.
| If you want to… | Use | Notes |
|---|---|---|
| Run a SQL query across one or many CSVs / Parquet / JSONL / Arrow | `sqlp` | Polars SQL (PostgreSQL dialect), larger-than-RAM, multithreaded |
| Score a SQL query BEFORE running it | `scoresql` | Plan analysis, anti-pattern detection, cache freshness |
| Pivot a CSV (wide ↔ long) with smart aggregation | `pivotp` | Group-by mode when `<on-cols>` is omitted |
| Join 2+ CSVs (asof, non-equi, larger-than-RAM) | `joinp` | See Joins & Set Ops |
## sqlp

Runs Polars SQL — a PostgreSQL dialect — over CSV, Parquet, JSONL, and Arrow files. Polars converts the query into a lazy expression and streams the execution. Output can be CSV, JSON, JSONL, Parquet, Arrow IPC, or AVRO. The result's (rows, cols) shape is written to stderr.
The optimization secret: feed it a Polars schema (`.pschema.json`, generated by `schema --polars`) — Polars then knows column types without an inference scan. Combined with a stats cache and `scoresql`, `sqlp` becomes the most performant SQL engine for CSV-shaped data we know of.
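A minimal sketch of that workflow (the filename is illustrative): generate the schema sidecar once, then query as usual.

```shell
# One-time: writes mydata.pschema.json next to the CSV (filename illustrative)
qsv schema --polars mydata.csv

# Subsequent sqlp runs pick up the sidecar and skip the type-inference scan
qsv sqlp mydata.csv "SELECT COUNT(*) AS n FROM mydata"
```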
Example: aggregate NYC 311 by Borough × Month, output Parquet
```shell
qsv sqlp NYC_311_SR_2010-2020-sample-1M.csv \
"SELECT
   Borough,
   strftime('%Y-%m', \"Created Date\") AS year_month,
   COUNT(*) AS complaints
 FROM read_csv('NYC_311_SR_2010-2020-sample-1M.csv')
 WHERE Borough != ''
 GROUP BY Borough, year_month
 ORDER BY year_month, complaints DESC" \
--format parquet --output nyc311_by_borough_month.parquet
```

Example: join two CSVs in SQL
```shell
qsv sqlp wcp.csv country_continent.csv \
"SELECT
   wcp.AccentCity, wcp.Population, country_continent.continent
 FROM wcp
 JOIN country_continent ON wcp.Country = country_continent.country
 WHERE wcp.Population > 1000000
 ORDER BY wcp.Population DESC"
```

When you pass multiple inputs, each file is registered as a table named after its file stem (so wcp.csv → wcp). You can also reference them as _t_1, _t_2, …
Example: UNION ALL across yearly exports + a window function
```shell
qsv sqlp nyc311-2023.csv nyc311-2024.csv \
"SELECT * FROM nyc311_2023
 UNION ALL BY NAME
 SELECT * FROM nyc311_2024" \
| qsv sqlp - \
"SELECT *, ROW_NUMBER() OVER (PARTITION BY Borough ORDER BY \"Created Date\") AS borough_rownum
 FROM _t_1
 LIMIT 100"
```

Example: subquery / IN clause
```shell
qsv sqlp wcp.csv country_continent.csv \
"SELECT * FROM wcp WHERE Country IN (SELECT country FROM country_continent WHERE continent = 'EU')"
```

Example: dollar-quoting for awkward literals
```shell
qsv sqlp tweets.csv "SELECT * FROM tweets WHERE handle = \$\$Diane's@Twitter\$\$"
```

Example: read Parquet directly via read_parquet
```shell
qsv sqlp - "SELECT continent, SUM(Population) AS total_pop
 FROM read_parquet('cities.parquet')
 GROUP BY continent
 ORDER BY total_pop DESC"
```

The `read_csv`, `read_ndjson`, `read_parquet`, and `read_ipc` table functions let you mix-and-match formats inline.
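For instance, a registered CSV table can be joined against a Parquet file in the same query. The filenames and columns here are illustrative:

```shell
# cities.csv is registered as table "cities"; the Parquet file is read inline
qsv sqlp cities.csv \
"SELECT c.AccentCity, c.Population, p.continent
 FROM cities c
 JOIN read_parquet('continents.parquet') p ON c.Country = p.country"
```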
See also: /docs/help/sqlp.md, Polars SQL reference, scoresql, pivotp, joinp, Recipe: Build a Data Pipeline.
## scoresql

Analyze a SQL query against the stats / frequency / moarstats caches and the query plan — before you run it. Output is a human-readable performance report (default) or JSON. Supports both Polars (default) and DuckDB modes for the underlying plan analysis.
If the caches are missing, they are auto-generated: `qsv stats --everything --stats-jsonl` and `qsv frequency --frequency-jsonl` run automatically when their outputs aren't already present.
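You can also pre-warm those caches yourself, so the first scoring run doesn't pay the generation cost. A sketch, with an illustrative filename:

```shell
# Build the stats and frequency caches ahead of time;
# the main CSV output goes to stdout, the caches are written as sidecars
qsv stats --everything --stats-jsonl mydata.csv > /dev/null
qsv frequency --frequency-jsonl mydata.csv > /dev/null
```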
Example: score a join query before running it
```shell
qsv scoresql wcp.csv country_continent.csv \
"SELECT wcp.AccentCity, country_continent.continent
 FROM wcp JOIN country_continent ON wcp.Country = country_continent.country"
```

The report includes: query plan (EXPLAIN output), type optimization warnings, join key cardinality, filter selectivity, anti-pattern detection (SELECT *, missing LIMIT, cartesian joins, …), and cache freshness.
Example: get JSON output for CI gating
```shell
qsv scoresql --json wcp.csv "SELECT * FROM wcp" | jq '.score, .warnings'
```

A SELECT * without LIMIT will earn an anti-pattern warning.
Example: use DuckDB for plan analysis instead of Polars
```shell
qsv scoresql --duckdb nyc311.csv "SELECT * FROM nyc311 WHERE status = 'active'"
```

Example: score a query loaded from a .sql file
```shell
qsv scoresql data.csv complex_query.sql
```

Only the last query in the file is scored.
See also: /docs/help/scoresql.md, sqlp, stats, frequency, Stats Cache & Caching.
## pivotp

Pivot CSV data using Polars. Two modes:

- **Pivot mode** — when you pass `<on-cols>`: one or more index columns become rows, the `<on-cols>` values become new columns, and a values column is aggregated.
- **Group-by mode** — when `<on-cols>` is omitted: a simple `GROUP BY` aggregation. Counting rows per group is the canonical use case.
Smart aggregation auto-selects `sum`, `mean`, `first`, or `len` based on the data type of the values column (or you can specify explicitly via `--agg`).
Example: pivot NYC 311 by Complaint Type × Year (count)
```shell
qsv pivotp 'Complaint Type' NYC_311_SR_2010-2020-sample-1M.csv \
  --index 'YearReported' \
  --values 'Unique Key' \
  --agg len \
  --output nyc311_pivot.csv
```

(Assumes a YearReported column — derive it first with `datefmt --formatstr '%Y' --new-column YearReported`.)
Example: group-by mode — count rows per Borough
```shell
qsv pivotp NYC_311_SR_2010-2020-sample-1M.csv \
  --index 'Borough' \
  --output borough_counts.csv
```

Example: pivot Allegheny property sales: Sale Year × Property Class, average Sale Price
```shell
qsv pivotp 'Property Class' allegheny_property_sales.csv \
  --index 'Sale Year' \
  --values 'Sale Price' \
  --agg mean
```

pivotp will pick up a `<filename>.pschema.json` if one exists (generated by `schema --polars`). That tells Polars the column types without re-inferring.
See also: /docs/help/pivotp.md, sqlp — alternative via GROUP BY and PIVOT, transpose, schema --polars.
## joinp

See Joins & Set Ops → joinp for the full treatment. The headline capabilities for SQL-style work:
- Asof joins — time-series-aware, matches each left row to the nearest prior right row.
- Non-equi joins — range-based matches via a Polars SQL `WHERE` clause with `_left`/`_right` column suffixes.
- Pre-join filtering — `--filter-left`/`--filter-right` evaluate Polars SQL before the join, cheaper than filtering after.
- Column coalescing — `--coalesce` merges identically-named columns post-join.
- Command Reference (index)
- Joins & Set Ops — `joinp` in depth
- Aggregation & Statistics — the `stats` cache feeds `scoresql`
- Validation & Schema — `schema --polars` produces `.pschema.json` for `sqlp`/`joinp`/`pivotp`
- Stats Cache & Caching
- Polars SQL docs
- Cookbook → Build a Data Pipeline
- Cookbook → Larger-than-RAM CSV