feat(perf): performance improvements for parallel reading, indexing, and aggregation by rayandrew · Pull Request #65 · llnl/dftracer-utils

rayandrew · 2026-05-18T05:53:54Z

Performance improvements for parallel reading, indexing, and aggregation

Closes #47 as WONTFIX

Summary

End-to-end performance work across the dftracer-utils stack: indexer, aggregator/MPI driver, trace reader, replay, organize, and surrounding infrastructure. The goal is to scale indexing and analysis from single-node to 32-node MPI/Dask runs on multi-TB traces (h5bench and DLIO targets) while keeping memory bounded.

Motivation

dftracer_index was CPU-bound (~3000s for the h5bench 2TB trace) and the existing distributed path did not scale linearly. The Python/Dask build was ~10x slower than a pure MPI build (2746s vs 277s at 32 nodes). Replay and trace-reader query paths also had hot-path overheads (DOM-wrapped simdjson, string churn, no predicate compilation). This PR addresses those bottlenecks together because they share infrastructure (RocksDB CFs, SST staging, coroutine pipelines, gzip member layout).

Indexer

Streaming parse-and-emit worker pipeline with bounded memory usage; write workers decoupled from parse workers.
Concurrent SST artifact ingestion with staging support; per-CF artifact collection uses an atomic counter + mutex.
Gzip member slicing so a single .pfw.gz can be indexed in parallel across its members. Writer now tracks gzip member offsets.
Lazy decoding for compressed value counts.
simdjson on_demand path bypasses the DOM wrapper on the indexer hot path.
--rebuild-summaries flag; root summary rebuild is now incremental rather than full-scan.
Distributed aggregation tracking unified; manifest consolidation removed.
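
As a rough illustration of the gzip member slicing idea above (not the PR's C++ implementation), the sketch below writes one gzip member per chunk, records each member's starting byte offset, and then inflates members independently. The function names `write_members`, `read_member`, and `read_all_parallel` are hypothetical, and a thread pool stands in for the indexer's workers.

```python
import gzip
import io
import zlib
from concurrent.futures import ThreadPoolExecutor


def write_members(chunks):
    """Concatenate one gzip member per chunk; return (blob, member start offsets)."""
    buf = io.BytesIO()
    offsets = []
    for chunk in chunks:
        offsets.append(buf.tell())       # byte offset where this member begins
        buf.write(gzip.compress(chunk))  # each call emits a complete gzip member
    return buf.getvalue(), offsets


def read_member(blob, offset):
    """Inflate exactly one member starting at `offset`; later members are ignored."""
    d = zlib.decompressobj(wbits=zlib.MAX_WBITS | 16)  # +16 selects gzip framing
    return d.decompress(blob[offset:])   # stops at end of stream; rest is unused_data


def read_all_parallel(blob, offsets, workers=4):
    """Inflate every member independently, preserving member order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda off: read_member(blob, off), offsets))


chunks = [b'{"id":1}\n', b'{"id":2}\n', b'{"id":3}\n']
blob, offsets = write_members(chunks)
```

Because every member is a self-contained deflate stream, a worker needs only a start offset to begin decoding, which is why the writer has to record those offsets when the file is produced.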

Aggregator and MPI driver

Aggregator pipeline restructured into a task-based DAG (agg_mpi), inlining phase bodies into make_task lambdas to avoid the GCC 12 ASan SIGILL/hang seen with the previous task_* coroutine-wrapper pattern.
Shared staging for multi-node artifact relocation; artifact move now indexed by path.
Per-node thread scaling to avoid oversubscription on Flux/Corona.
AGGREGATION and SYSTEM_METRICS CFs distributed via SstFileWriter::Merge operands; per-worker memory is bounded to 300-500 MB on a 2 TB trace.
Deterministic aggregation + intra-file parallelism in the Dask path (prerequisite for closing the Python-vs-MPI gap).
CMake: MPI sources excluded from PCH when MPI is disabled.
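
To show why merge operands suit distributed aggregation, here is a minimal sketch of the semantics only. It does not use the RocksDB API, and the `Partial` record and `merge` function are hypothetical stand-ins for whatever merge operator the PR defines. Each worker emits partial aggregates, and an associative, commutative merge folds them into the same result regardless of arrival order. Integer values are used so that the result is exactly order-independent.

```python
from dataclasses import dataclass
from functools import reduce


@dataclass(frozen=True)
class Partial:
    """Hypothetical partial aggregate for one key (e.g. durations of one event name)."""
    count: int
    total: int
    lo: int
    hi: int

    @staticmethod
    def of(value):
        return Partial(1, value, value, value)


def merge(a, b):
    """Associative and commutative: operand order does not affect the result."""
    return Partial(a.count + b.count, a.total + b.total,
                   min(a.lo, b.lo), max(a.hi, b.hi))


def fold(operands):
    """Combine a non-empty sequence of operands into one aggregate."""
    return reduce(merge, operands)


worker_a = [Partial.of(5), Partial.of(7)]  # operands emitted by one worker
worker_b = [Partial.of(2)]                 # operands emitted by another
combined = fold(worker_a + worker_b)
```

Since no worker has to hold the full aggregate for a key, each one can write its operands and move on, which is consistent with the bounded per-worker memory described above.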

Trace reader and query

Compiled predicate evaluation for AND-of-EQ queries.
Uniform-match shortcut for AND-of-EQ queries against uniformly-valued chunks.
Line-range support in work items and checkpoint processing.
Chunk pruning and checkpoint handling optimized.
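
The two AND-of-EQ optimizations above can be sketched as follows. This is an assumption-laden illustration rather than the trace reader's actual code: the chunk layout (`uniform` metadata plus an `events` list) and the names `compile_and_eq` and `filter_chunk` are hypothetical.

```python
_MISSING = object()  # sentinel so an absent field never equals a wanted value


def compile_and_eq(conditions):
    """Resolve the query shape once and return a closure evaluated per event."""
    items = tuple(conditions.items())

    def predicate(event):
        return all(event.get(field, _MISSING) == wanted for field, wanted in items)

    return predicate


def filter_chunk(chunk, conditions, predicate):
    """chunk: {"uniform": {field: value}, "events": [dict, ...]} (hypothetical layout).

    `uniform` lists fields that hold the same value in every event of the chunk.
    """
    uniform = chunk["uniform"]
    undecided = 0
    for field, wanted in conditions.items():
        if field in uniform:
            if uniform[field] != wanted:
                return []                # no event can match: prune the chunk
        else:
            undecided += 1
    if undecided == 0:
        return list(chunk["events"])     # every event matches: skip per-event checks
    return [e for e in chunk["events"] if predicate(e)]


chunk = {
    "uniform": {"cat": "POSIX"},
    "events": [{"cat": "POSIX", "name": "read"}, {"cat": "POSIX", "name": "write"}],
}
query = {"cat": "POSIX", "name": "read"}
matches = filter_chunk(chunk, query, compile_and_eq(query))
```

The shortcut pays off when a query touches only fields that are constant within a chunk, since the whole chunk is then accepted or rejected from metadata alone.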

Replay

Pipelined replay using coroutines and channels.
Trace processing migrated to JsonParser.
String handling and I/O buffering tightened to reduce per-event allocation.
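
A pipelined replay built from coroutines and channels might look roughly like the sketch below. It uses Python's `asyncio.Queue` as the channel, whereas the PR itself uses C++ coroutines. The stage names and the event fields `name` and `dur` are assumptions for illustration, and the final stage only records events instead of replaying I/O.

```python
import asyncio
import json

_DONE = object()  # end-of-stream marker passed down the pipeline


async def read_lines(lines, out):
    for line in lines:
        await out.put(line)   # suspends while the channel is full (backpressure)
    await out.put(_DONE)


async def parse(inp, out):
    while (line := await inp.get()) is not _DONE:
        await out.put(json.loads(line))
    await out.put(_DONE)


async def execute(inp, results):
    # Stand-in for replaying the operation: just record what would be executed.
    while (event := await inp.get()) is not _DONE:
        results.append((event["name"], event["dur"]))


async def replay(lines, capacity=2):
    raw, parsed = asyncio.Queue(capacity), asyncio.Queue(capacity)
    results = []
    # All three stages run concurrently; bounded queues keep memory in check.
    await asyncio.gather(read_lines(lines, raw), parse(raw, parsed),
                         execute(parsed, results))
    return results


trace = ['{"name": "open", "dur": 3}', '{"name": "read", "dur": 10}']
replayed = asyncio.run(replay(trace))
```

The bounded channel capacity is the design point: a slow downstream stage stalls the reader instead of letting unparsed lines pile up.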

Organize, writer, dft

Parallel slice creation and merging in the organize visitor; batch flushing added.
Inline indexer in organize so a single pass produces both reorganized output and index artifacts.
Writer keeps track of gzip members (consumed by the indexer's member-slicing path).
Event dispatcher converted to coroutines with parse logic extracted into parse_inflated.h.
Backpressure: wants_drain / drain_pending hooks on reorganize.
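
Batch flushing, as mentioned for the organize visitor, can be illustrated with a small buffer that hands events to a sink once a size threshold is reached. The `BatchWriter` class and its parameters are hypothetical and do not reflect the visitor's real interface or the semantics of the wants_drain / drain_pending hooks.

```python
class BatchWriter:
    """Buffer events and pass them to `sink` in batches of `batch_size`."""

    def __init__(self, sink, batch_size):
        self.sink = sink              # callable that receives a list of events
        self.batch_size = batch_size
        self.pending = []

    def add(self, event):
        self.pending.append(event)
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self):
        """Emit whatever is buffered; call once more at end of input."""
        if self.pending:
            self.sink(self.pending)
            self.pending = []         # start a fresh list; the sink keeps the old one


flushed = []
writer = BatchWriter(flushed.append, batch_size=2)
for event in (1, 2, 3):
    writer.add(event)
writer.flush()  # drain the final partial batch
```

Batching trades a bounded amount of buffered data for fewer, larger writes.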

Arrow and call_tree

Arrow conversion path optimized.
call_tree save/load refactored, with Arrow IPC support added; perf improvements in tree construction.

Build and infrastructure

Option to use zlib-ng instead of madler/zlib.
System simdjson::simdjson fallback in link_simdjson.
cgroup v1/v2 memory-limit detection and parsing.
Auto-computed per-file memory estimates and batch sizes (used to right-size workers).
CI: perf branch trigger, workflow reformat, longer/retry budgets on flaky tests (dftracer_tar, dftracer_server, dftracer_replay with leak detection disabled for OpenMPI compatibility).
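
For the cgroup memory-limit detection item, a minimal sketch of the parsing differences between v1 and v2 is given below. It is not the PR's implementation. It assumes the conventional mount point `/sys/fs/cgroup` and reads the limit files directly under it, whereas a fuller version would first resolve the process's own cgroup path from `/proc/self/cgroup`. The function names and the unlimited threshold are assumptions.

```python
from pathlib import Path

# cgroup v1 reports "no limit" as a very large number rather than a keyword,
# so treat anything at or above this threshold as unlimited (assumed cutoff).
_V1_UNLIMITED_THRESHOLD = 1 << 60


def parse_v2(text):
    """cgroup v2 memory.max: the literal 'max' or a byte count."""
    value = text.strip()
    return None if value == "max" else int(value)


def parse_v1(text):
    """cgroup v1 memory.limit_in_bytes: always a number, huge when unlimited."""
    value = int(text.strip())
    return None if value >= _V1_UNLIMITED_THRESHOLD else value


def detect_memory_limit(root="/sys/fs/cgroup"):
    """Return the limit in bytes, or None when unlimited or undetectable."""
    root = Path(root)
    v2 = root / "memory.max"
    v1 = root / "memory" / "memory.limit_in_bytes"
    if v2.is_file():
        return parse_v2(v2.read_text())
    if v1.is_file():
        return parse_v1(v1.read_text())
    return None
```

A detected limit can then feed the per-file memory estimates and batch sizes so that workers are sized to the container rather than to the host's physical memory.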

Docs

Indexer and trace reader API references rewritten for the new interfaces.

Copilot

Copilot wasn't able to review this pull request because it exceeds the maximum number of files (300). Try reducing the number of changed files and requesting a review from Copilot again.

Copilot AI review requested due to automatic review settings May 18, 2026 05:53

Copilot AI reviewed May 18, 2026

rayandrew requested review from Copilot and hariharan-devarajan May 18, 2026 05:54

hariharan-devarajan approved these changes May 18, 2026

hariharan-devarajan merged commit 896c045 into llnl:develop May 18, 2026
47 of 50 checks passed

rayandrew deleted the feat/perf-improvement branch May 18, 2026 15:38
