This document keeps only the shortest path:
- Start HMS and Trino
- Re-import sf10 into S3
- Re-register the tables in Trino
- Validate with Trino CLI
- Start CDC updates
- Open the monitor for overall CPU / memory / job status
Start HMS:
/home/ubuntu/disk1/opt/run/start-metastore.sh
Start Trino:
/home/ubuntu/disk1/opt/run/start-trino.sh
If Trino needs to query Delta tables on S3, confirm:
/home/ubuntu/disk1/opt/trino-server-466/etc/catalog/delta_lake.properties
contains at least:
connector.name=delta_lake
hive.metastore.uri=thrift://127.0.0.1:9083
delta.register-table-procedure.enabled=true
delta.enable-non-concurrent-writes=true
fs.native-s3.enabled=true
s3.aws-access-key=...
s3.aws-secret-key=...
s3.region=us-east-2
s3.endpoint=https://s3.us-east-2.amazonaws.com
Template file:
Project configuration:
Notes:
- both shell scripts and application code now read this same file by default
- defaults for run-import-hybench-sf10.sh, run-import-hybench-sf1000.sh, run-cdc-hybench-sf10.sh, and run-single-cdc-foreground.sh are centralized here
- avoid splitting the same operational settings across multiple .env files
- benchmark definition files:
- these files define table names, input file names, delimiters, Spark schemas, and primary-key columns
- ImportApp and CDC now share the same local definitions instead of fetching primary keys dynamically from the Pixels metadata service
Example table definition:
tables=customer,company
table.customer.file=customer.csv
table.customer.delimiter=,
table.customer.primary-keys=custID
table.customer.schema=custID:int,name:string,freshness_ts:timestamp
The schema field uses a comma-separated column:type format. The currently supported types are:
- int
- long
- float
- double
- string
- date
- timestamp
- boolean
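The column:type format can be illustrated with a small parser. This is a minimal Python sketch for checking a definition file by hand, not the project's own parser (which lives in the Java application and may behave differently). The function name `parse_schema` is hypothetical, and treating type names as case-sensitive lowercase is an assumption.

```python
from typing import List, Tuple

# Types listed in this document as currently supported.
SUPPORTED_TYPES = {
    "int", "long", "float", "double",
    "string", "date", "timestamp", "boolean",
}


def parse_schema(schema: str) -> List[Tuple[str, str]]:
    """Split a 'name:type,name:type' string into (name, type) pairs."""
    columns = []
    for entry in schema.split(","):
        name, sep, type_name = entry.strip().partition(":")
        name, type_name = name.strip(), type_name.strip()
        if not sep or not name:
            raise ValueError(f"expected column:type, got {entry!r}")
        # Assumption: type names are matched exactly, in lowercase.
        if type_name not in SUPPORTED_TYPES:
            raise ValueError(f"unsupported type {type_name!r} for column {name!r}")
        columns.append((name, type_name))
    return columns
```

Calling `parse_schema("custID:int,name:string,freshness_ts:timestamp")` on the example definition yields three (name, type) pairs in declaration order; an entry such as `amount:decimal` raises `ValueError`.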
Confirm the import-related settings:
pixels.spark.delta.enable-deletion-vectors=true
pixels.spark.import.csv.chunk-rows=2560000
pixels.spark.import.count-rows=false
Notes:
- pixels.spark.delta.hash-bucket.count is deprecated and should no longer be configured
- _pixels_bucket_id is no longer computed with Spark pmod(hash(pk), x)
- the authoritative bucket configuration now comes from node.bucket.num in $PIXELS_HOME/etc/pixels.properties
- both import and CDC compute bucket ids from canonical primary-key bytes -> ByteString -> RetinaUtils, matching the server
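The two configuration rules in these notes can be checked mechanically. The following is a rough Python sketch, assuming the deprecated key would appear in the project properties file and that both files use plain key=value lines. The helper names are hypothetical, and the sketch deliberately does not reproduce the bucket-id computation itself, which is defined by the server-side code.

```python
DEPRECATED_KEY = "pixels.spark.delta.hash-bucket.count"
BUCKET_KEY = "node.bucket.num"


def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith(("#", "!")):
            key, sep, value = line.partition("=")
            if sep:
                props[key.strip()] = value.strip()
    return props


def check_bucket_config(spark_props_text, pixels_props_text):
    """Return (bucket_count or None, list of warning strings)."""
    warnings = []
    if DEPRECATED_KEY in parse_properties(spark_props_text):
        warnings.append(f"{DEPRECATED_KEY} is deprecated; remove it")
    raw = parse_properties(pixels_props_text).get(BUCKET_KEY)
    if raw is None:
        warnings.append(f"{BUCKET_KEY} is not set")
        return None, warnings
    return int(raw), warnings
```

The first argument would be the contents of the project properties file and the second the contents of $PIXELS_HOME/etc/pixels.properties.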
Run the full import:
./scripts/run-import-hybench-sf10.sh \
/home/ubuntu/disk1/hybench_sf10 \
s3a://home-zinuo/deltalake/hybench_sf10
Run the CHBenCHMark w1 import:
./scripts/run-import-chbenchmark-w1.sh \
/home/ubuntu/disk1/ch_w1 \
s3a://home-zinuo/deltalake/chbenchmark_w1
Notes:
- the import uses overwrite
- the import writes a persistent _pixels_bucket_id column
- the bucket value is computed with the same server-side bucket algorithm, not Spark hash()
- new tables are created with delta.enableDeletionVectors=true
- imports do not run count() by default
- CSV data is read in chunks controlled by pixels.spark.import.csv.chunk-rows, then written to Delta chunk by chunk
- CDC source batch sizing can be controlled through pixels.spark.source.max-rows-per-batch, pixels.spark.source.max-wait-ms-per-batch, and pixels.spark.source.empty-poll-sleep-ms
- all import entrypoints now use the Java app:
io.pixelsdb.spark.app.PixelsBenchmarkDeltaImportApp
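The chunk-by-chunk behavior described in the notes above can be illustrated outside Spark. This is a minimal Python sketch of the idea only, not the Java import app: rows are read in groups of at most `chunk_rows` (the role played by pixels.spark.import.csv.chunk-rows) and each group is handed to a writer before the next one is read. The names `iter_csv_chunks`, `import_in_chunks`, and `write_chunk` are hypothetical.

```python
import csv
from itertools import islice


def iter_csv_chunks(lines, chunk_rows, delimiter=","):
    """Yield lists of parsed rows, each holding at most chunk_rows rows."""
    reader = csv.reader(lines, delimiter=delimiter)
    while True:
        chunk = list(islice(reader, chunk_rows))
        if not chunk:
            return
        yield chunk


def import_in_chunks(lines, chunk_rows, write_chunk, delimiter=","):
    """Pass each chunk to write_chunk in order; return total rows written."""
    total = 0
    for chunk in iter_csv_chunks(lines, chunk_rows, delimiter):
        write_chunk(chunk)  # stand-in for the per-chunk Delta write
        total += len(chunk)
    return total
```

Because only one chunk is held at a time, peak memory depends on the chunk size rather than on the size of the input file.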
If you want DV enabled at table-creation time, the core Delta table property is:
delta.enableDeletionVectors=true
In this project, the recommended switch is:
pixels.spark.delta.enable-deletion-vectors=true
This applies to both:
- CSV-import table creation
- CDC auto-create table creation
After a re-import or a partition change, re-register:
/home/ubuntu/disk1/opt/trino-cli/trino --server http://127.0.0.1:8080 \
--execute "CREATE SCHEMA IF NOT EXISTS delta_lake.hybench_sf10"
for table_name in customer company savingaccount checkingaccount transfer checking loanapps loantrans; do
  /home/ubuntu/disk1/opt/trino-cli/trino --server http://127.0.0.1:8080 \
    --execute "DROP TABLE IF EXISTS delta_lake.hybench_sf10.${table_name}"
done
/home/ubuntu/disk1/opt/trino-cli/trino --server http://127.0.0.1:8080 \
--execute "CALL delta_lake.system.register_table(schema_name => 'hybench_sf10', table_name => 'customer', table_location => 's3://home-zinuo/deltalake/hybench_sf10/customer')"
/home/ubuntu/disk1/opt/trino-cli/trino --server http://127.0.0.1:8080 \
--execute "CALL delta_lake.system.register_table(schema_name => 'hybench_sf10', table_name => 'company', table_location => 's3://home-zinuo/deltalake/hybench_sf10/company')"
/home/ubuntu/disk1/opt/trino-cli/trino --server http://127.0.0.1:8080 \
--execute "CALL delta_lake.system.register_table(schema_name => 'hybench_sf10', table_name => 'savingaccount', table_location => 's3://home-zinuo/deltalake/hybench_sf10/savingaccount')"
/home/ubuntu/disk1/opt/trino-cli/trino --server http://127.0.0.1:8080 \
--execute "CALL delta_lake.system.register_table(schema_name => 'hybench_sf10', table_name => 'checkingaccount', table_location => 's3://home-zinuo/deltalake/hybench_sf10/checkingaccount')"
/home/ubuntu/disk1/opt/trino-cli/trino --server http://127.0.0.1:8080 \
--execute "CALL delta_lake.system.register_table(schema_name => 'hybench_sf10', table_name => 'transfer', table_location => 's3://home-zinuo/deltalake/hybench_sf10/transfer')"
/home/ubuntu/disk1/opt/trino-cli/trino --server http://127.0.0.1:8080 \
--execute "CALL delta_lake.system.register_table(schema_name => 'hybench_sf10', table_name => 'checking', table_location => 's3://home-zinuo/deltalake/hybench_sf10/checking')"
/home/ubuntu/disk1/opt/trino-cli/trino --server http://127.0.0.1:8080 \
--execute "CALL delta_lake.system.register_table(schema_name => 'hybench_sf10', table_name => 'loanapps', table_location => 's3://home-zinuo/deltalake/hybench_sf10/loanapps')"
/home/ubuntu/disk1/opt/trino-cli/trino --server http://127.0.0.1:8080 \
--execute "CALL delta_lake.system.register_table(schema_name => 'hybench_sf10', table_name => 'loantrans', table_location => 's3://home-zinuo/deltalake/hybench_sf10/loantrans')"
List the tables:
/home/ubuntu/disk1/opt/trino-cli/trino \
--server http://127.0.0.1:8080 \
--execute "SHOW TABLES FROM delta_lake.hybench_sf10"
Read one row:
/home/ubuntu/disk1/opt/trino-cli/trino \
--server http://127.0.0.1:8080 \
--execute "SELECT * FROM delta_lake.hybench_sf10.customer LIMIT 1"
If SHOW TABLES works but SELECT fails:
- check whether delta_lake.properties really contains the S3 settings
- confirm Trino has been restarted and loaded the new configuration
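The first check can be scripted. The following is a minimal Python sketch, assuming the catalog file uses plain key=value lines and that the keys listed earlier in this document are the ones to verify. The function name `missing_keys` is hypothetical. It only confirms that each key is present with a non-empty value; it does not validate credentials or prove that Trino has reloaded the file.

```python
# Keys this document says delta_lake.properties must contain at least.
REQUIRED_KEYS = (
    "connector.name",
    "hive.metastore.uri",
    "delta.register-table-procedure.enabled",
    "delta.enable-non-concurrent-writes",
    "fs.native-s3.enabled",
    "s3.aws-access-key",
    "s3.aws-secret-key",
    "s3.region",
    "s3.endpoint",
)


def missing_keys(properties_text, required=REQUIRED_KEYS):
    """Return the required keys that are absent or have an empty value."""
    present = set()
    for line in properties_text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", "!")):
            continue  # skip blank lines and comments
        key, sep, value = line.partition("=")
        if sep and value.strip():
            present.add(key.strip())
    return [key for key in required if key not in present]
```

Reading /home/ubuntu/disk1/opt/trino-server-466/etc/catalog/delta_lake.properties and passing its contents to `missing_keys` should return an empty list when all settings are in place.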
If you see:
No factory for location: s3://.../_delta_log
the current Trino instance still cannot read S3.
If you see:
Error getting snapshot for hybench_sf10.customer
the Trino-side S3 / Delta read configuration is still not effective.
If you need to enable DV on an existing table, you can run:
/home/ubuntu/disk1/opt/trino-cli/trino \
--server http://127.0.0.1:8080 \
--execute "ALTER TABLE delta_lake.hybench_sf10.customer SET PROPERTIES delta.enableDeletionVectors = true"
Or in Spark SQL:
ALTER TABLE delta.`s3a://home-zinuo/deltalake/hybench_sf10/customer`
SET TBLPROPERTIES ('delta.enableDeletionVectors'='true');
Start the local dependency services first:
./scripts/start-local-cdc-stack.sh
Then start the full sf10 CDC workload:
./scripts/run-cdc-hybench-sf10.sh
The local benchmark definitions used by CDC are controlled in etc/pixels-spark.properties:
pixels.cdc.benchmark=hybench
To switch to CHBenCHMark:
pixels.cdc.benchmark=chbenchmark
The sf10 CDC workload starts one Spark CDC job for each of:
- customer
- company
- savingaccount
- checkingaccount
- transfer
- checking
- loanapps
- loantrans
If you only want to validate source polling without executing the Delta merge, run:
./scripts/run-delta-merge.sh \
--database pixels_bench \
--table savingaccount \
--rpc-host localhost \
--rpc-port 9091 \
--metadata-host localhost \
--metadata-port 18888 \
--mode polling \
--trigger-mode processing-time \
--trigger-interval "10 seconds" \
--sink-mode noop
By default, CDC pulls all source buckets defined by node.bucket.num in $PIXELS_HOME/etc/pixels.properties; do not pass --buckets manually.
Notes:
- CDC source schemas come from the benchmark definition files
- CDC merge primary-key columns also come from the benchmark definition files
- CDC no longer depends on the Pixels metadata service for schema or primary-key definitions
Start metric collection:
./scripts/collect-cdc-metrics.sh
Start the web monitor:
python3 ./scripts/cdc_web_monitor.py
Open:
http://127.0.0.1:8084
The monitor reports:
- dependency service status
- per-table CDC job status
- CPU / RSS / uptime for each Spark job
- machine-wide load1
- machine-wide used and available memory
- disk usage for the filesystem under /tmp
If you care about overall CPU and memory rather than just one job, look at the System section at the top of the page.
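For a quick cross-check of the System section from a terminal, the same kinds of figures can be read directly on Linux. This is a rough Python sketch, not the monitor's own code: the function names are hypothetical, and defining used memory as MemTotal minus MemAvailable is an assumption that may differ from what the monitor reports.

```python
import os
import shutil


def parse_meminfo(text):
    """Return (used_kib, available_kib) from /proc/meminfo-style text."""
    fields = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts:
            fields[key.strip()] = int(parts[0])
    # Assumption: used = MemTotal - MemAvailable.
    available = fields["MemAvailable"]
    return fields["MemTotal"] - available, available


def system_snapshot(tmp_path="/tmp", meminfo_path="/proc/meminfo"):
    """Collect load1, memory, and disk usage for the filesystem under tmp_path."""
    with open(meminfo_path, encoding="utf-8") as handle:
        used_kib, available_kib = parse_meminfo(handle.read())
    disk = shutil.disk_usage(tmp_path)
    return {
        "load1": os.getloadavg()[0],
        "mem_used_kib": used_kib,
        "mem_available_kib": available_kib,
        "tmp_disk_used_fraction": disk.used / disk.total,
    }
```

Calling `system_snapshot()` on the benchmark host returns a dictionary that can be compared against the values shown at the top of the monitor page.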