Skip to content

ImpossibleForge/pfc-duckdb

Repository files navigation

pfc β€” DuckDB Extension for PFC-JSONL

You have compressed log archives on disk. To query them you normally decompress everything first β€” even if you only need one hour out of thirty days.

This extension changes that. Query .pfc files directly from DuckDB SQL. A block index tells the extension exactly which chunks of the file to decompress β€” the rest stays compressed.

Requires: The pfc_jsonl binary installed on your machine (Step 1 below). The extension calls it for decompression.

Platform: Linux x86_64 and macOS Apple Silicon (ARM64). No native Windows binary β€” Windows users must use WSL2 or a Linux machine.

INSTALL pfc FROM community;
LOAD pfc;
LOAD json;

SELECT
    line->>'$.level'   AS level,
    line->>'$.message' AS message
FROM read_pfc_jsonl('/var/log/events.pfc')
WHERE line->>'$.level' = 'ERROR';

Awesome DuckDB

What is PFC-JSONL?

PFC-JSONL is a high-performance compressed log format built for structured (JSONL) data. It achieves better compression than gzip and zstd on real log data while supporting random block access β€” meaning you can decompress only the time range you need.

Key properties:

  • Each file is split into independently compressible blocks
  • A .pfc.bidx binary index stores the byte offset and timestamp range of every block
  • The PFC binary can decompress any subset of blocks in a single call
  • Free for personal and open-source use β€” no account, no signup required

How It Works (Architecture)

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ DuckDB                                                       β”‚
β”‚                                                              β”‚
β”‚  SELECT * FROM read_pfc_jsonl('events.pfc', ts_from=...)     β”‚
β”‚           β”‚                                                  β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    reads     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚
β”‚  β”‚  pfc extension    │─────────────▢│  events.pfc.bidx    β”‚  β”‚
β”‚  β”‚  (MIT, open src)  β”‚  block index β”‚  (block timestamps) β”‚  β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜              β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚
β”‚           β”‚ popen() / subprocess                              β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
            β”‚
            β–Ό
  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
  β”‚  pfc_jsonl binary   β”‚  ← proprietary, closed source
  β”‚  (v3.4+, local)     β”‚    contains BWT+rANS compression
  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
            β”‚
            β–Ό
  decompressed JSON lines β†’ back to DuckDB

The extension is a thin open-source wrapper β€” it reads the .bidx index in C++ to select which blocks are needed, then calls the PFC binary once to decompress only those blocks. The compression algorithm stays closed.

Installation

Step 1 β€” Install the PFC binary (once per machine)

The extension calls the pfc_jsonl binary for decompression. Download the latest release for your platform:

Linux x64:

curl -L https://github.com/ImpossibleForge/pfc-jsonl/releases/latest/download/pfc_jsonl-linux-x64 \
     -o /usr/local/bin/pfc_jsonl
chmod +x /usr/local/bin/pfc_jsonl
pfc_jsonl --help      # verify install

macOS (Apple Silicon M1/M2/M3/M4):

curl -L https://github.com/ImpossibleForge/pfc-jsonl/releases/latest/download/pfc_jsonl-macos-arm64 \
     -o /usr/local/bin/pfc_jsonl
chmod +x /usr/local/bin/pfc_jsonl
pfc_jsonl --help      # verify install

macOS Intel (x64): Binary coming soon.

Custom path: Set PFC_JSONL_BINARY=/path/to/pfc_jsonl in your environment to override the default /usr/local/bin/pfc_jsonl.

Step 2 β€” Install the DuckDB extension

INSTALL pfc FROM community;
LOAD pfc;

Build from source (developers / early access)

git clone --recurse-submodules https://github.com/ImpossibleForge/pfc-duckdb
cd pfc-duckdb
GEN=ninja make release
# Extension at: build/release/extension/pfc/pfc.duckdb_extension

Usage

Basic query

LOAD pfc;

SELECT line FROM read_pfc_jsonl('/path/to/file.pfc');

Each row contains one raw JSON string in the line column. Use the DuckDB json extension to parse fields:

LOAD json;

SELECT
    line->>'$.timestamp' AS ts,
    line->>'$.level'     AS level,
    line->>'$.message'   AS message,
    line->>'$.service'   AS service
FROM read_pfc_jsonl('/path/to/file.pfc');

Timestamp-based block filtering

PFC files include a .pfc.bidx index with the timestamp range of each block. Pass ts_from and/or ts_to (Unix seconds) to skip entire blocks before decompression:

-- Only decompress blocks that overlap the given time window
SELECT line
FROM read_pfc_jsonl(
    '/path/to/file.pfc',
    ts_from = 1735689600,   -- 2026-01-01 00:00:00 UTC
    ts_to   = 1735775999    -- 2026-01-01 23:59:59 UTC
);

Convert a timestamp string to Unix seconds with epoch():

SELECT line
FROM read_pfc_jsonl(
    '/path/to/file.pfc',
    ts_from = epoch(TIMESTAMPTZ '2026-01-01 00:00:00+00'),
    ts_to   = epoch(TIMESTAMPTZ '2026-01-02 00:00:00+00')
);

Combining block filter and row filter

ts_from/ts_to skip entire blocks (coarse, fast). Add a WHERE clause for row-level precision:

LOAD json;

SELECT line->>'$.message' AS msg
FROM read_pfc_jsonl(
    '/var/log/api.pfc',
    ts_from = epoch(TIMESTAMPTZ '2026-03-15 08:00:00+00'),
    ts_to   = epoch(TIMESTAMPTZ '2026-03-15 10:00:00+00')
)
WHERE line->>'$.level' = 'ERROR';

Analytics examples

LOAD json;

-- Error rate per hour
SELECT
    strftime(to_timestamp((line->>'$.ts')::BIGINT), '%Y-%m-%d %H:00') AS hour,
    count(*) FILTER (WHERE line->>'$.level' = 'ERROR') AS errors,
    count(*)                                            AS total
FROM read_pfc_jsonl('/var/log/api.pfc')
GROUP BY hour ORDER BY hour;

-- Top 10 slowest endpoints
SELECT
    line->>'$.path'                              AS endpoint,
    avg((line->>'$.duration_ms')::DOUBLE)        AS avg_ms,
    count(*)                                     AS requests
FROM read_pfc_jsonl('/var/log/api.pfc')
GROUP BY endpoint ORDER BY avg_ms DESC LIMIT 10;

API Reference

read_pfc_jsonl(path [, ts_from, ts_to])

Parameter Type Default Description
path VARCHAR β€” Path to the .pfc file. A .pfc.bidx index must exist at path + ".bidx".
ts_from BIGINT 0 Lower bound for block selection (Unix seconds). 0 = no lower bound.
ts_to BIGINT 0 Upper bound for block selection (Unix seconds). 0 = no upper bound.

Returns: table with one column line VARCHAR β€” one row per decompressed JSON line.

Block filtering semantics: A block is included if its timestamp range [ts_start, ts_end] overlaps [ts_from, ts_to]. Blocks with unknown timestamps are always included. If both ts_from and ts_to are 0, all blocks are read.

File Requirements

File Required Description
file.pfc yes Compressed PFC-JSONL file
file.pfc.bidx yes Binary block index (requires PFC-JSONL v3.4+)

Generate both with the PFC binary:

pfc_jsonl compress input.jsonl output.pfc
# Produces: output.pfc  +  output.pfc.bidx

Note: The Docker image on Docker Hub (impossibleforge/pfc-jsonl) is a server-side compression tool. It is not required for using the DuckDB extension β€” you only need the standalone pfc_jsonl binary from GitHub Releases.

Performance

Block-level filtering can skip the majority of a file. Example: 30-day log file, 720 hourly blocks β€” a 1-hour query reads 1 block instead of 720.

Query range Blocks read Speedup (720-block file)
30 days 720/720 1Γ—
1 day ~24/720 ~30Γ—
1 hour ~1/720 ~720Γ—

Troubleshooting

Cannot open index file: /path/to/file.pfc.bidx The .pfc.bidx index is missing. Compress with PFC-JSONL v3.4+:

pfc_jsonl compress input.jsonl output.pfc

PFC binary not found at '/usr/local/bin/pfc_jsonl' Binary is missing or not executable. Re-run the curl install command, or set PFC_JSONL_BINARY=/path/to/pfc_jsonl.

popen() failed β€” could not start PFC binary subprocess The extension uses popen() to call the PFC binary. Windows is not supported β€” use WSL2 or a Linux machine.

ts_from (...) must be <= ts_to (...) You passed an inverted time range. Swap the values so ts_from comes before ts_to.

Related Projects

Project Description
pfc-jsonl The core binary β€” compress, decompress, query
pfc-gateway HTTP query + ingest API for .pfc archives β€” enables Grafana and remote access
pfc-fluentbit Stream Fluent Bit logs directly to .pfc archives
pfc-migrate Convert existing gzip/zstd/lz4 archives to PFC β€” local, S3, Azure, GCS
pfc-jsonl (PyPI) Python package β€” pip install pfc-jsonl
pfc-vector High-performance Rust ingest daemon for Vector.dev
pfc-otel-collector OpenTelemetry OTLP/HTTP log exporter
pfc-kafka-consumer Kafka / Redpanda consumer
pfc-telegraf Telegraf HTTP output plugin β†’ PFC
pfc-grafana Grafana data source plugin for PFC archives

Disclaimer

PFC-DuckDB is an independent open-source project and is not affiliated with, endorsed by, or associated with the DuckDB Foundation or DuckDB Labs.

License

The pfc DuckDB extension (this repository) is released under the MIT License β€” see LICENSE.

The PFC-JSONL binary (pfc_jsonl) is proprietary software β€” free for personal and open-source use. Commercial use requires a license: info@impossibleforge.com

About

DuckDB extension to read PFC-JSONL compressed log files with block-level timestamp filtering

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors