
pfc-migrate — Move any JSONL log or event data to PFC cold storage


Export any JSONL data directly to PFC cold storage — or convert existing compressed JSONL archives from local disk, S3, Azure, or GCS. No intermediate files, no schema changes, no pipelines.


What this does

| Command | What it does |
| --- | --- |
| pfc-migrate cratedb | Stream a CrateDB table directly to a .pfc archive |
| pfc-migrate questdb | Stream a QuestDB table directly to a .pfc archive |
| pfc-migrate convert | Convert gzip/zstd/bzip2/lz4/JSONL files to PFC |
| pfc-migrate s3 | Convert JSONL archives in S3 in-place |
| pfc-migrate glacier | Restore + convert S3 Glacier archives to PFC |
| pfc-migrate azure | Convert JSONL archives in Azure Blob Storage |
| pfc-migrate gcs | Convert JSONL archives in Google Cloud Storage |

Why convert?

Once your archives are in PFC format, DuckDB can query them directly — without decompressing the whole file first:

INSTALL pfc FROM community;
LOAD pfc;
LOAD json;

-- Query just one hour from a 30-day archive
SELECT line->>'$.level' AS level, line->>'$.message' AS message
FROM read_pfc_jsonl(
    '/var/log/pfc/app_2026-03-01.pfc',
    ts_from = epoch(TIMESTAMPTZ '2026-03-01 14:00:00+00'),
    ts_to   = epoch(TIMESTAMPTZ '2026-03-01 15:00:00+00')
);
| Tool | 1h query on 30-day archive | Storage vs gzip |
| --- | --- | --- |
| gzip | Decompress the full 30-day file | (baseline) |
| zstd | Decompress the full 30-day file | |
| PFC-JSONL | Decompress ~1/720 of the file | 25% smaller than gzip |

A 30-day archive spans 720 hours, so a one-hour window touches roughly 1/720 of the file. Typical JSONL log data compresses to ~6–11% of its original size with PFC (25–40% smaller than the same data gzipped).


Zero egress cost

Run the cloud subcommands on a machine in the same region as your storage and the whole pipeline stays in-region: each object is downloaded to a local temp directory, converted, and uploaded back, so nothing routes through your laptop and no egress is billed.
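
For example, a minimal sketch of an in-region run against S3 (the VM and bucket names are illustrative; the flags are the same ones documented below):

# On a VM in the same region as both buckets (e.g. an EC2 instance)
pip install pfc-migrate[s3]

# Objects are fetched to a local temp dir, converted, and uploaded back;
# since the VM and the buckets share a region, no egress is billed
pfc-migrate s3 \
  --bucket my-logs \
  --prefix archive/2026-03/ \
  --out-bucket my-logs-pfc \
  --out-prefix converted/2026-03/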


Input Formats (file conversion)

| Format | Extension | Extra dependency |
| --- | --- | --- |
| gzip | .jsonl.gz | stdlib ✅ |
| bzip2 | .jsonl.bz2 | stdlib ✅ |
| zstd | .jsonl.zst | pip install pfc-migrate[zstd] |
| lz4 | .jsonl.lz4 | pip install pfc-migrate[lz4] |
| Plain JSONL | .jsonl | stdlib ✅ |

Requirements

The pfc_jsonl binary must be installed on the machine running the export:

# Linux x64:
curl -L https://github.com/ImpossibleForge/pfc-jsonl/releases/latest/download/pfc_jsonl-linux-x64 \
     -o /usr/local/bin/pfc_jsonl && chmod +x /usr/local/bin/pfc_jsonl

# macOS (Apple Silicon M1–M4):
curl -L https://github.com/ImpossibleForge/pfc-jsonl/releases/latest/download/pfc_jsonl-macos-arm64 \
     -o /usr/local/bin/pfc_jsonl && chmod +x /usr/local/bin/pfc_jsonl

License note: This tool requires the pfc_jsonl binary. pfc_jsonl is free for personal and open-source use — commercial use requires a separate license. See pfc-jsonl for details.

macOS Intel (x64): Binary coming soon. Windows: No native binary. Use WSL2 or a Linux machine.


Install

pip install pfc-migrate

# With zstd support
pip install pfc-migrate[zstd]

# With S3/Glacier support
pip install pfc-migrate[s3]

# With Azure Blob Storage support
pip install pfc-migrate[azure]

# With Google Cloud Storage support
pip install pfc-migrate[gcs]

# For CrateDB direct export
pip install pfc-migrate[postgres]

# For QuestDB direct export
pip install pfc-migrate[questdb]
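
Extras combine in the usual pip way, e.g. for zstd-compressed archives stored in S3:

pip install pfc-migrate[s3,zstd]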

Usage — CrateDB direct export

Stream rows directly from a CrateDB table into a .pfc archive. No intermediate files.

pip install pfc-migrate[postgres]

# Export one week of logs
pfc-migrate cratedb \
  --host crate.example.com \
  --user crate \
  --dbname doc \
  --schema doc \
  --table logs \
  --ts-column ts \
  --from-ts "2026-03-01" --to-ts "2026-03-08" \
  --output logs_2026-03-01.pfc \
  --verbose

# Auto-named output: logs_20260301_20260308.pfc
pfc-migrate cratedb --host localhost --table logs \
  --from-ts "2026-03-01" --to-ts "2026-03-08" --verbose

Verbose output:

  -> Connecting to CrateDB at localhost:5432 (db: doc) ...
  -> Columns (6): ts, level, message, host, service, duration_ms
  -> Streaming rows (batch size: 10,000) ...
     100,000 rows  (17.4 MiB) ...
     200,000 rows  (34.8 MiB) ...
  -> Exported 250,000 rows  (43.7 MiB JSONL)
  -> Compressing with pfc_jsonl ...
  ✓ 250,000 rows  |  JSONL 43.7 MiB  ->  PFC 2.6 MiB  (5.9%)  ->  logs_20260301_20260308.pfc

| Option | Default | Description |
| --- | --- | --- |
| --host | localhost | CrateDB host |
| --port | 5432 | PostgreSQL wire port |
| --user | crate | Username |
| --password | (empty) | Password |
| --dbname | doc | Database name |
| --schema | doc | Schema name |
| --table | required | Table to export |
| --ts-column | None | Timestamp column for WHERE filter and ORDER BY |
| --from-ts | None | Start of range (inclusive, ISO 8601) |
| --to-ts | None | End of range (exclusive, ISO 8601) |
| --batch-size | 10000 | Rows per fetch (memory-safe batching) |
| --output | (auto) | Output .pfc file |
| --verbose | false | Show row progress and size stats |
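
The flags above compose into recurring jobs as well. A minimal sketch of a nightly export of yesterday's rows (host, table, and output path are illustrative):

#!/usr/bin/env bash
# Nightly CrateDB archive: export yesterday's rows into one .pfc file
set -euo pipefail

FROM=$(date -u -d yesterday +%F)   # GNU date; on macOS use: date -u -v-1d +%F
TO=$(date -u +%F)

pfc-migrate cratedb \
  --host crate.example.com \
  --table logs \
  --ts-column ts \
  --from-ts "$FROM" --to-ts "$TO" \
  --output "/var/log/pfc/logs_${FROM}.pfc"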

Usage — QuestDB direct export

Stream rows directly from a QuestDB table into a .pfc archive. No intermediate files.

pip install pfc-migrate[questdb]

# Export one week of trades
pfc-migrate questdb \
  --host quest.example.com \
  --table trades \
  --ts-column timestamp \
  --from-ts "2026-03-01" --to-ts "2026-03-08" \
  --output trades_2026-03-01.pfc \
  --verbose

# Auto-named output: trades_20260301_20260308.pfc
pfc-migrate questdb --host localhost --table trades \
  --from-ts "2026-03-01" --to-ts "2026-03-08" --verbose

Verbose output:

  -> Connecting to QuestDB at localhost:8812 (db: qdb) ...
  -> Columns (5): timestamp, symbol, price, volume, side
  -> Streaming rows (batch size: 10,000) ...
     100,000 rows  (18.1 MiB) ...
  -> Exported 120,000 rows  (21.7 MiB JSONL)
  -> Compressing with pfc_jsonl ...
  ✓ 120,000 rows  |  JSONL 21.7 MiB  ->  PFC 1.3 MiB  (6.0%)  ->  trades_20260301_20260308.pfc

| Option | Default | Description |
| --- | --- | --- |
| --host | localhost | QuestDB host |
| --port | 8812 | PostgreSQL wire port |
| --user | admin | Username |
| --password | quest | Password |
| --dbname | qdb | Database name |
| --table | required | Table to export (no schema prefix) |
| --ts-column | None | Timestamp column for WHERE filter and ORDER BY |
| --from-ts | None | Start of range (inclusive, ISO 8601) |
| --to-ts | None | End of range (exclusive, ISO 8601) |
| --batch-size | 10000 | Rows per fetch (memory-safe batching) |
| --output | (auto) | Output .pfc file |
| --verbose | false | Show row progress and size stats |

Note: QuestDB has no schema concept — tables are referenced by name only. There is no --schema option.
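
Since --ts-column, --from-ts, and --to-ts all default to None, omitting them exports the whole table. A sketch (--output is passed explicitly here because auto-naming is derived from the date range):

# Full-table export, no time filter
pfc-migrate questdb --host localhost --table trades --output trades_full.pfc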


Usage — Local filesystem

# Single file (output auto-named: logs.pfc + logs.pfc.bidx)
pfc-migrate convert logs.jsonl.gz

# Explicit output
pfc-migrate convert logs.jsonl.gz logs.pfc

# Entire directory
pfc-migrate convert --dir /var/log/archive/ --output-dir /var/log/pfc/

# Recursive + verbose
pfc-migrate convert --dir /mnt/logs/ -r -v
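
The single-file form also composes with standard shell tools, for example to convert only archives older than 30 days (cutoff and path are illustrative):

# Convert cold archives, leaving recent ones as gzip
find /var/log/archive/ -name '*.jsonl.gz' -mtime +30 \
  -exec pfc-migrate convert {} \;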

Usage — Amazon S3 / S3 Glacier

Conversion happens in-region (download to temp dir → convert → upload). No egress charges.

# Single object
pfc-migrate s3 \
  --bucket my-logs \
  --key archive/app_2026-03.jsonl.gz \
  --out-bucket my-logs-pfc \
  --out-prefix converted/

# All objects matching a prefix
pfc-migrate s3 \
  --bucket my-logs \
  --prefix archive/2026-03/ \
  --out-bucket my-logs-pfc \
  --out-prefix converted/2026-03/ \
  --format gz \
  --verbose

# Glacier (Expedited retrieval)
pfc-migrate glacier \
  --bucket my-glacier-logs \
  --prefix 2025/ \
  --out-bucket my-glacier-pfc \
  --retrieval-tier Expedited

Usage — Azure Blob Storage

# All blobs matching a prefix
pfc-migrate azure \
  --container my-logs \
  --prefix archive/2026-03/ \
  --out-container my-logs-pfc \
  --connection-string "DefaultEndpointsProtocol=https;AccountName=...;AccountKey=...;"

Usage — Google Cloud Storage

# All objects matching a prefix
pfc-migrate gcs \
  --bucket my-logs \
  --prefix archive/2026-03/ \
  --out-bucket my-logs-pfc \
  --verbose

Hybrid queries: CrateDB live + PFC cold storage

Query CrateDB live data and cold PFC archives in a single DuckDB SQL statement:

import duckdb
import pandas as pd
import psycopg2

con = duckdb.connect()
con.execute("INSTALL pfc FROM community; LOAD pfc; LOAD json;")

# Register CrateDB live data with DuckDB (register() needs a DataFrame
# or Arrow table, not a list of tuples, so build one with column names)
cratedb_conn = psycopg2.connect(host="localhost", user="crate", dbname="doc")
cur = cratedb_conn.cursor()
cur.execute("SELECT * FROM logs WHERE ts >= '2026-04-01'")
live_logs = pd.DataFrame(cur.fetchall(), columns=[d[0] for d in cur.description])
con.register("live_logs", live_logs)

# Query cold PFC archives + hot live data in one SQL
result = con.execute("""
    SELECT ts, level, message
    FROM pfc_scan([
        '/archives/logs_2026-01.pfc',
        '/archives/logs_2026-02.pfc',
        '/archives/logs_2026-03.pfc'
    ])
    UNION ALL
    SELECT ts, level, message FROM live_logs
    ORDER BY ts
""").fetchall()

See examples/cratedb_archive_explorer.py for a complete demo.


Hybrid queries: QuestDB live + PFC cold storage

Query QuestDB live data and cold PFC archives in a single DuckDB SQL statement:

import duckdb
import pandas as pd
import psycopg2

con = duckdb.connect()
con.execute("INSTALL pfc FROM community; LOAD pfc; LOAD json;")

# Register QuestDB live data with DuckDB (register() needs a DataFrame
# or Arrow table, not a list of tuples, so build one with column names)
questdb_conn = psycopg2.connect(host="localhost", port=8812,
                                user="admin", password="quest", dbname="qdb")
cur = questdb_conn.cursor()
cur.execute("SELECT * FROM trades WHERE timestamp >= '2026-04-01'")
live_trades = pd.DataFrame(cur.fetchall(), columns=[d[0] for d in cur.description])
con.register("live_trades", live_trades)

# Query cold PFC archives + hot live data in one SQL
result = con.execute("""
    SELECT timestamp, symbol, price, volume
    FROM pfc_scan([
        '/archives/trades_2026-01.pfc',
        '/archives/trades_2026-02.pfc',
        '/archives/trades_2026-03.pfc'
    ])
    UNION ALL
    SELECT timestamp, symbol, price, volume FROM live_trades
    ORDER BY timestamp
""").fetchall()

Lossless guarantee

Every conversion is verified by full decompression and MD5 check before output is written. If anything doesn't match, the output file is deleted and the error is reported — the original is never modified. For S3, GCS, and Azure subcommands, --delete removes the original cloud object only after successful verification.
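
If the digest is the MD5 of the decompressed JSONL bytes, as the description suggests, the original side of the comparison can be reproduced by hand (a sketch; pfc-migrate handles the .pfc side internally):

# MD5 of the original archive's decompressed bytes
gzip -dc app_2026-03.jsonl.gz | md5sum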


Related Projects

| Project | Description |
| --- | --- |
| pfc-jsonl | Core binary — compress, decompress, query |
| pfc-duckdb | DuckDB Community Extension (INSTALL pfc FROM community) |
| pfc-fluentbit | Fluent Bit → PFC forwarder for live pipelines |
| pfc-archiver-cratedb | Autonomous daemon: archive old CrateDB partitions automatically |
| pfc-archiver-questdb | Autonomous daemon: archive old QuestDB partitions automatically |
| pfc-vector | High-performance Rust ingest daemon for Vector.dev and Telegraf |
| pfc-otel-collector | OpenTelemetry OTLP/HTTP log exporter |
| pfc-kafka-consumer | Kafka / Redpanda consumer |
| pfc-telegraf | Telegraf HTTP output plugin → PFC |
| pfc-grafana | Grafana data source plugin for PFC archives |

License

pfc-migrate (this repository) is released under the MIT License — see LICENSE.

The PFC-JSONL binary (pfc_jsonl) is proprietary software — free for personal and open-source use. Commercial use requires a license: info@impossibleforge.com
