Skip to content

feat(perf): performance improvements for parallel reading, indexing, and aggregation#65

Merged
hariharan-devarajan merged 1 commit into
llnl:developfrom
rayandrew:feat/perf-improvement
May 18, 2026
Merged

feat(perf): performance improvements for parallel reading, indexing, and aggregation#65
hariharan-devarajan merged 1 commit into
llnl:developfrom
rayandrew:feat/perf-improvement

Conversation

@rayandrew
Copy link
Copy Markdown
Collaborator

@rayandrew rayandrew commented May 18, 2026

Performance improvements for parallel reading, indexing, and aggregation

Closes #47 as WONTFIX

Summary

End-to-end performance work across the dftracer-utils stack: indexer, aggregator/MPI driver, trace reader, replay, organize, and surrounding infrastructure. The goal is to scale indexing and analysis from single-node to 32-node MPI/Dask runs on multi-TB traces (h5bench and DLIO targets) while keeping memory bounded.

Motivation

dftracer_index was CPU-bound (~3000s for the h5bench 2TB trace) and the existing distributed path did not scale linearly. The Python/Dask build was ~10x slower than a pure MPI build (2746s vs 277s at 32 nodes). Replay and trace-reader query paths also had hot-path overheads (DOM-wrapped simdjson, string churn, no predicate compilation). This PR addresses those bottlenecks together because they share infrastructure (RocksDB CFs, SST staging, coroutine pipelines, gzip member layout).

Indexer

  • Streaming parse-and-emit worker pipeline with bounded memory usage; write workers decoupled from parse workers.
  • Concurrent SST artifact ingestion with staging support; per-CF artifact collection uses an atomic counter + mutex.
  • Gzip member slicing so a single .pfw.gz can be indexed in parallel across its members. Writer now tracks gzip member offsets.
  • Lazy decoding for compressed value counts.
  • simdjson on_demand path bypasses the DOM wrapper on the indexer hot path.
  • --rebuild-summaries flag; root summary rebuild is now incremental rather than full-scan.
  • Distributed aggregation tracking unified; manifest consolidation removed.

Aggregator and MPI driver

  • Aggregator pipeline restructured into a task-based DAG (agg_mpi), inlining phase bodies into make_task lambdas to avoid the GCC 12 ASan SIGILL/hang seen with the previous task_* coroutine-wrapper pattern.
  • Shared staging for multi-node artifact relocation; artifact move now indexed by path.
  • Per-node thread scaling to avoid oversubscription on Flux/Corona.
  • AGGREGATION and SYSTEM_METRICS CFs distributed via SstFileWriter::Merge operands, per-worker memory bounded to 300-500MB on a 2TB trace.
  • Deterministic aggregation + intra-file parallelism in the Dask path (prerequisite for closing the Python-vs-MPI gap).
  • CMake: MPI sources excluded from PCH when MPI is disabled.

Trace reader and query

  • Compiled predicate evaluation for AND-of-EQ queries.
  • Uniform-match shortcut for AND-of-EQ queries against uniformly-valued chunks.
  • Line-range support in work items and checkpoint processing.
  • Chunk pruning and checkpoint handling optimized.

Replay

  • Pipelined replay using coroutines and channels.
  • Trace processing migrated to JsonParser.
  • String handling and i/o buffering tightened to reduce per-event allocation.

Organize, writer, dft

  • Parallel slice creation and merging in the organize visitor; batch flushing added.
  • Inline indexer in organize so a single pass produces both reorganized output and index artifacts.
  • Writer keeps track of gzip members (consumed by the indexer's member-slicing path).
  • Event dispatcher converted to coroutines with parse logic extracted into parse_inflated.h.
  • Backpressure: wants_drain / drain_pending hooks on reorganize.

Arrow and call_tree

  • Arrow conversion path optimized.
  • call_tree save/load refactored and gains Arrow IPC support; perf improvements in tree construction.

Build and infrastructure

  • Option to use zlib-ng instead of madler/zlib.
  • System simdjson::simdjson fallback in link_simdjson.
  • cgroup v1/v2 memory-limit detection and parsing.
  • Auto-computed per-file memory estimates and batch sizes (used to right-size workers).
  • CI: perf branch trigger, workflow reformat, longer/retry budgets on flaky tests (dftracer_tar, dftracer_server, dftracer_replay with leak detection disabled for OpenMPI compatibility).

Docs

  • Indexer and trace reader API references rewritten for the new interfaces.

…and aggregation

Indexer
- Streaming parse-and-emit worker pipeline with bounded memory usage
- Concurrent SST artifact ingestion with staging support
- Gzip member slicing for parallel indexing
- Lazy decoding for compressed value counts
- Bypass DOM wrapper for indexer hot path (simdjson on_demand)
- Decoupled write workers from parse workers
- --rebuild-summaries flag and optimized root summary rebuild

Aggregator / MPI
- Task-based DAG execution for aggregator pipeline
- Shared staging for multi-node artifact relocation
- Per-node thread scaling to avoid oversubscription
- Unified distributed aggregation tracking, removed manifest consolidation
- Deterministic aggregation and intra-file parallelism

Trace reader / query
- Compiled predicate evaluation for AND-of-EQ queries
- Uniform-match shortcut for AND-of-EQ queries
- Line-range support for work items and checkpoint processing
- Optimized chunk pruning and checkpoint handling

Replay
- Pipelined replay with coroutines and channels
- JsonParser-based trace processing
- Optimized string handling and i/o buffering

Organize / writer / dft
- Parallel slice creation and merging in organize visitor
- Inline indexer in organize
- Gzip member tracking in writer
- Coroutine-based event dispatcher with extracted parse logic
- Batch flushing in organize visitor

Arrow / call_tree
- Optimized arrow conversion
- Arrow IPC support and improved save/load in call_tree

Build / infrastructure
- zlib-ng option, system simdjson fallback
- cgroup v1/v2 memory limit detection
- Auto-computed per-file memory estimates and batch sizes
- CI: perf branch trigger, formatting

Docs
- Rewritten indexer and trace reader API references
Copilot AI review requested due to automatic review settings May 18, 2026 05:53
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot wasn't able to review this pull request because it exceeds the maximum number of files (300). Try reducing the number of changed files and requesting a review from Copilot again.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot wasn't able to review this pull request because it exceeds the maximum number of files (300). Try reducing the number of changed files and requesting a review from Copilot again.

@hariharan-devarajan hariharan-devarajan merged commit 896c045 into llnl:develop May 18, 2026
47 of 50 checks passed
@rayandrew rayandrew deleted the feat/perf-improvement branch May 18, 2026 15:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Change argparse to lyra

3 participants