Joel Natividad edited this page May 13, 2026 · 2 revisions

SQL & Polars

Tier: Advanced
Commands covered: sqlp, joinp, pivotp, scoresql

Per-command flag reference lives in /docs/help/. This page is the workflow layer — when to reach for each command and how they compose.

These commands are powered by the Polars vectorized query engine. They handle larger-than-RAM data, are multithreaded by default, and accept Polars SQL, a PostgreSQL-flavored dialect. If you normally reach for pandas or DuckDB for CSV analytics, this is the page that replaces both for many workflows.

The joinp command is also covered in Joins & Set Ops; this page focuses on sqlp, pivotp, and scoresql.

Quick decision table

| If you want to… | Use | Notes |
| --- | --- | --- |
| Run a SQL query across one or many CSVs / Parquet / JSONL / Arrow files | sqlp | Polars SQL (PostgreSQL dialect), larger-than-RAM, multithreaded |
| Score a SQL query *before* running it | scoresql | Plan analysis, anti-pattern detection, cache freshness |
| Pivot a CSV (wide ↔ long) with smart aggregation | pivotp | Group-by mode when <on-cols> is omitted |
| Join 2+ CSVs (asof, non-equi, larger-than-RAM) | joinp | See Joins & Set Ops |

sqlp

Polars SQL — a PostgreSQL dialect — over CSV, Parquet, JSONL, and Arrow files. Polars converts the query into a lazy expression and streams the execution. Output can be CSV, JSON, JSONL, Parquet, Arrow IPC, or AVRO. Returns (rows, cols) to stderr.

The optimization secret: feed it a Polars schema (.pschema.json generated by schema --polars) — Polars then knows column types without an inference scan. Combined with a stats cache and scoresql, sqlp becomes the most performant SQL engine for CSV-shaped data we know of.
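
A minimal warm-up sketch (assumes qsv is on PATH, and uses the sample filename from the example below):

```shell
# One-time: generate the Polars schema sidecar
# (writes NYC_311_SR_2010-2020-sample-1M.pschema.json next to the CSV).
qsv schema --polars NYC_311_SR_2010-2020-sample-1M.csv

# Subsequent sqlp runs pick up the sidecar and skip the type-inference scan.
qsv sqlp NYC_311_SR_2010-2020-sample-1M.csv \
  "SELECT Borough, COUNT(*) AS n FROM _t_1 GROUP BY Borough"
```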

Example: aggregate NYC 311 by Borough × Month, output Parquet

qsv sqlp NYC_311_SR_2010-2020-sample-1M.csv \
  "SELECT
     Borough,
     strftime('%Y-%m', \"Created Date\") AS year_month,
     COUNT(*) AS complaints
   FROM _t_1
   WHERE Borough != ''
   GROUP BY Borough, year_month
   ORDER BY year_month, complaints DESC" \
  --format parquet --output nyc311_by_borough_month.parquet

Example: join two CSVs in SQL

qsv sqlp wcp.csv country_continent.csv \
  "SELECT
     wcp.AccentCity, wcp.Population, country_continent.continent
   FROM wcp
   JOIN country_continent ON wcp.Country = country_continent.country
   WHERE wcp.Population > 1000000
   ORDER BY wcp.Population DESC"

When you pass multiple inputs, each file is registered as a table named after its file stem (so wcp.csv → wcp). You can also reference them positionally as _t_1, _t_2, … in the order they were given.
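
The positional names are handy when file stems are long or contain characters that would need quoting. A sketch of the same join using _t_1 / _t_2:

```shell
qsv sqlp wcp.csv country_continent.csv \
  "SELECT _t_1.AccentCity, _t_2.continent
   FROM _t_1
   JOIN _t_2 ON _t_1.Country = _t_2.country"
```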

Example: UNION ALL across yearly exports + a window function

qsv sqlp nyc311-2023.csv nyc311-2024.csv \
  "SELECT * FROM nyc311_2023
   UNION ALL BY NAME
   SELECT * FROM nyc311_2024" \
  | qsv sqlp - \
    "SELECT *, ROW_NUMBER() OVER (PARTITION BY Borough ORDER BY \"Created Date\") AS borough_rownum
     FROM _t_1
     LIMIT 100"

Example: subquery / IN clause

qsv sqlp wcp.csv country_continent.csv \
  "SELECT * FROM wcp WHERE Country IN (SELECT country FROM country_continent WHERE continent = 'EU')"

Example: dollar-quoting for awkward literals

qsv sqlp tweets.csv "SELECT * FROM tweets WHERE handle = \$\$Diane's@Twitter\$\$"
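
The backslashes are shell escaping, not SQL: inside double quotes the shell turns \$\$ into a literal $$. A quick sanity check of what actually reaches qsv:

```shell
# Inside double quotes, \$\$ becomes the literal $$ delimiter of a
# PostgreSQL dollar-quoted string; the apostrophe needs no escaping.
query="SELECT * FROM tweets WHERE handle = \$\$Diane's@Twitter\$\$"
printf '%s\n' "$query"
# prints: SELECT * FROM tweets WHERE handle = $$Diane's@Twitter$$
```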

Example: read Parquet directly via read_parquet

qsv sqlp SKIP_INPUT "SELECT continent, SUM(Population) AS total_pop
                     FROM read_parquet('cities.parquet')
                     GROUP BY continent
                     ORDER BY total_pop DESC"

The read_csv, read_ndjson, read_parquet, and read_ipc table functions let you mix-and-match formats inline.
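
For example, joining a CSV input against a Parquet file in a single query (file and column names are illustrative):

```shell
qsv sqlp wcp.csv \
  "SELECT wcp.AccentCity, cc.continent
   FROM wcp
   JOIN read_parquet('country_continent.parquet') AS cc
     ON wcp.Country = cc.country"
```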

See also: /docs/help/sqlp.md, Polars SQL reference, scoresql, pivotp, joinp, Recipe: Build a Data Pipeline.

scoresql

Analyze a SQL query against the stats / frequency / moarstats caches and the query plan — before you run it. Output is a human-readable performance report (default) or JSON. Supports both Polars (default) and DuckDB modes for the underlying plan analysis.

Caches are auto-generated when missing: scoresql runs qsv stats --everything --stats-jsonl and qsv frequency --frequency-jsonl for you if their outputs aren't already there.
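
To warm the caches ahead of time (e.g. in a nightly job) instead of paying the cost on the first scoresql run, a sketch:

```shell
# Build the stats and frequency caches explicitly; scoresql then
# reuses them instead of regenerating on demand.
qsv stats --everything --stats-jsonl wcp.csv > /dev/null
qsv frequency --frequency-jsonl wcp.csv > /dev/null
qsv scoresql wcp.csv "SELECT Country, COUNT(*) AS n FROM wcp GROUP BY Country"
```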

Example: score a join query before running it

qsv scoresql wcp.csv country_continent.csv \
  "SELECT wcp.AccentCity, country_continent.continent
   FROM wcp JOIN country_continent ON wcp.Country = country_continent.country"

Report includes: query plan (EXPLAIN output), type optimization warnings, join key cardinality, filter selectivity, anti-pattern detection (SELECT *, missing LIMIT, cartesian joins, …), and cache freshness.

Example: get JSON output for CI gating

qsv scoresql --json wcp.csv "SELECT * FROM wcp" | jq '.score, .warnings'

A SELECT * without LIMIT will earn an anti-pattern warning.

Example: use DuckDB for plan analysis instead of Polars

qsv scoresql --duckdb nyc311.csv "SELECT * FROM nyc311 WHERE status = 'active'"

Example: score a query loaded from a .sql file

qsv scoresql data.csv complex_query.sql

Only the last query in the file is scored.
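
So a file that stages data before a final SELECT has only that last statement analyzed. A sketch (the scoresql call is shown commented out since it needs a real data.csv):

```shell
cat > complex_query.sql <<'SQL'
-- earlier statements are ignored by scoresql
SELECT * FROM data WHERE status = 'open';
-- only this final query is scored
SELECT Borough, COUNT(*) AS n FROM data GROUP BY Borough;
SQL
# qsv scoresql data.csv complex_query.sql
```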

See also: /docs/help/scoresql.md, sqlp, stats, frequency, Stats Cache & Caching.

pivotp

Pivot CSV data using Polars. Two modes:

  • Pivot mode — when you pass <on-cols>: one or more index columns become rows, the <on-cols> values become new columns, and a values column is aggregated.
  • Group-by mode — when <on-cols> is omitted: a simple GROUP BY aggregation. Counting rows per group is the canonical use case.

Smart aggregation auto-selection picks sum, mean, first, or len based on the data type of the values column (or you can specify explicitly via --agg).

Example: pivot NYC 311 by Complaint Type × Year (count)

qsv pivotp 'Complaint Type' NYC_311_SR_2010-2020-sample-1M.csv \
  --index 'YearReported' \
  --values 'Unique Key' \
  --agg len \
  --output nyc311_pivot.csv

(Assumes a YearReported column — derive it first with qsv datefmt 'Created Date' --formatstr '%Y' --new-column YearReported.)

Example: group-by mode — count rows per Borough

qsv pivotp NYC_311_SR_2010-2020-sample-1M.csv \
  --index 'Borough' \
  --output borough_counts.csv

Example: pivot Allegheny property sales: Sale Year × Property Class, average Sale Price

qsv pivotp 'Property Class' allegheny_property_sales.csv \
  --index 'Sale Year' \
  --values 'Sale Price' \
  --agg mean

pivotp will pick up a <filename>.pschema.json if one exists (generated by schema --polars). That tells Polars the column types without re-inferring.

See also: /docs/help/pivotp.md, sqlp — alternative via GROUP BY and PIVOT, transpose, schema --polars.

joinp (cross-reference)

See Joins & Set Ops → joinp for the full treatment. The headline capabilities for SQL-style work:

  • Asof joins — time-series-aware, matches each left row to the nearest prior right row.
  • Non-equi joins — range-based matches via a Polars SQL WHERE clause with _left / _right column suffixes.
  • Pre-join filtering — --filter-left / --filter-right evaluate Polars SQL expressions before the join, which is cheaper than filtering afterwards.
  • Column coalescing — --coalesce merges identically-named columns post-join.
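
A hedged sketch combining pre-join filtering with coalescing (column and file names assumed from the sqlp examples above; see Joins & Set Ops for the full flag reference):

```shell
qsv joinp Country wcp.csv country country_continent.csv \
  --filter-left "Population > 1000000" \
  --coalesce \
  --output big_cities_with_continent.csv
```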
