feat(perf): performance improvements for parallel reading, indexing, and aggregation #65
Merged
hariharan-devarajan merged 1 commit on May 18, 2026
…and aggregation

Indexer
- Streaming parse-and-emit worker pipeline with bounded memory usage
- Concurrent SST artifact ingestion with staging support
- Gzip member slicing for parallel indexing
- Lazy decoding for compressed value counts
- Bypass DOM wrapper for indexer hot path (simdjson on_demand)
- Decoupled write workers from parse workers
- --rebuild-summaries flag and optimized root summary rebuild

Aggregator / MPI
- Task-based DAG execution for aggregator pipeline
- Shared staging for multi-node artifact relocation
- Per-node thread scaling to avoid oversubscription
- Unified distributed aggregation tracking, removed manifest consolidation
- Deterministic aggregation and intra-file parallelism

Trace reader / query
- Compiled predicate evaluation for AND-of-EQ queries
- Uniform-match shortcut for AND-of-EQ queries
- Line-range support for work items and checkpoint processing
- Optimized chunk pruning and checkpoint handling

Replay
- Pipelined replay with coroutines and channels
- JsonParser-based trace processing
- Optimized string handling and I/O buffering

Organize / writer / dft
- Parallel slice creation and merging in organize visitor
- Inline indexer in organize
- Gzip member tracking in writer
- Coroutine-based event dispatcher with extracted parse logic
- Batch flushing in organize visitor

Arrow / call_tree
- Optimized arrow conversion
- Arrow IPC support and improved save/load in call_tree

Build / infrastructure
- zlib-ng option, system simdjson fallback
- cgroup v1/v2 memory limit detection
- Auto-computed per-file memory estimates and batch sizes
- CI: perf branch trigger, formatting

Docs
- Rewritten indexer and trace reader API references
hariharan-devarajan
approved these changes
May 18, 2026
Performance improvements for parallel reading, indexing, and aggregation
Closes #47 as WONTFIX
Summary
End-to-end performance work across the dftracer-utils stack: indexer, aggregator/MPI driver, trace reader, replay, organize, and surrounding infrastructure. The goal is to scale indexing and analysis from single-node to 32-node MPI/Dask runs on multi-TB traces (h5bench and DLIO targets) while keeping memory bounded.
Motivation
dftracer_index was CPU-bound (~3000s for the h5bench 2TB trace) and the existing distributed path did not scale linearly. The Python/Dask build was ~10x slower than a pure MPI build (2746s vs 277s at 32 nodes). Replay and trace-reader query paths also had hot-path overheads (DOM-wrapped simdjson, string churn, no predicate compilation). This PR addresses those bottlenecks together because they share infrastructure (RocksDB CFs, SST staging, coroutine pipelines, gzip member layout).
Indexer
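To illustrate why tracked gzip member offsets make a single compressed file indexable in parallel, here is a minimal sketch. It is not the dftracer-utils implementation: Python is used only for brevity, and `write_members`, `read_member`, and `count_lines_parallel` are hypothetical names. It relies only on the fact that each gzip member is an independent stream, so a worker that knows a member's byte range can inflate it without touching the rest of the file.

```python
import gzip
import zlib
from concurrent.futures import ThreadPoolExecutor


def write_members(chunks):
    """Write one gzip member per chunk and record each member's start offset."""
    blob = bytearray()
    offsets = []
    for chunk in chunks:
        offsets.append(len(blob))      # byte offset where this member begins
        blob += gzip.compress(chunk)   # each call emits a complete gzip member
    return bytes(blob), offsets


def read_member(blob, start, end):
    """Inflate exactly one member, given its byte range in the file."""
    # wbits = 16 + MAX_WBITS tells zlib to expect gzip framing.
    inflater = zlib.decompressobj(wbits=16 + zlib.MAX_WBITS)
    return inflater.decompress(blob[start:end]) + inflater.flush()


def count_lines_parallel(blob, offsets, workers=4):
    """Stand-in for indexing: count lines per member, one task per member."""
    bounds = list(zip(offsets, offsets[1:] + [len(blob)]))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(
            pool.map(lambda b: read_member(blob, *b).count(b"\n"), bounds)
        )
```

Without the offsets, a reader has to inflate the file sequentially to discover where each member starts, which is why the writer-side tracking matters.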
- … .pfw.gz can be indexed in parallel across its members. Writer now tracks gzip member offsets.
- … on_demand path bypasses the DOM wrapper on the indexer hot path.
- --rebuild-summaries flag; root summary rebuild is now incremental rather than full-scan.
Aggregator and MPI driver
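As a rough picture of task-based DAG execution with phase bodies written inline as lambdas, the sketch below runs a dependency graph on a thread pool. It is an illustration only, written in Python rather than the project's own code; `run_dag` and the phase names in the usage note are hypothetical and do not reflect the actual `make_task` API.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from graphlib import TopologicalSorter


def run_dag(tasks, deps, workers=4):
    """Run callables in dependency order, starting each as soon as it is ready.

    tasks: name -> zero-argument callable
    deps:  name -> set of names that must finish first
    """
    sorter = TopologicalSorter({name: deps.get(name, ()) for name in tasks})
    sorter.prepare()                   # raises CycleError on a cyclic graph
    results, running = {}, {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while sorter.is_active():
            for name in sorter.get_ready():        # all prerequisites done
                running[pool.submit(tasks[name])] = name
            done, _ = wait(running, return_when=FIRST_COMPLETED)
            for future in done:
                name = running.pop(future)
                results[name] = future.result()    # re-raises task errors
                sorter.done(name)                  # unblocks dependents
    return results
```

A call such as `run_dag({"parse": lambda: 1, "merge": lambda: 2}, {"merge": {"parse"}})` runs `parse` before `merge`, while independent tasks overlap on the pool.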
- … agg_mpi), inlining phase bodies into make_task lambdas to avoid the GCC 12 ASan SIGILL/hang seen with the previous task_* coroutine-wrapper pattern.
- … SstFileWriter::Merge operands, per-worker memory bounded to 300-500MB on a 2TB trace.
Trace reader and query
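The commit summary lists compiled predicate evaluation and a uniform-match shortcut for AND-of-EQ queries. The sketch below shows one plausible reading of both ideas, in Python for brevity. It assumes each chunk carries a per-field set of distinct values; that summary format and the names `compile_and_eq`, `summarize`, `classify_chunk`, and `query_chunk` are hypothetical and not taken from the trace reader.

```python
_MISSING = object()  # marks "field absent" so absence never equals a value


def compile_and_eq(conditions):
    """Turn {field: value} into a predicate, resolving the pairs only once."""
    pairs = tuple(conditions.items())

    def predicate(event):
        for field, expected in pairs:
            if event.get(field, _MISSING) != expected:
                return False
        return True

    return predicate


def summarize(events, fields):
    """Per-chunk summary: the distinct values seen for each field."""
    return {f: {e.get(f, _MISSING) for e in events} for f in fields}


def classify_chunk(conditions, distinct):
    """Return 'none' (prune), 'all' (uniform match) or 'scan'."""
    verdict = "all"
    for field, expected in conditions.items():
        values = distinct[field]
        if expected not in values:
            return "none"        # no event in this chunk can match
        if len(values) > 1:
            verdict = "scan"     # mixed values, per-event check needed
    return verdict


def query_chunk(events, conditions, distinct):
    """Evaluate an AND-of-EQ query, skipping per-event work when possible."""
    verdict = classify_chunk(conditions, distinct)
    if verdict == "none":
        return []
    if verdict == "all":
        return list(events)      # every event matches, no evaluation needed
    predicate = compile_and_eq(conditions)
    return [e for e in events if predicate(e)]
```

The shortcut pays off when a chunk is homogeneous in the queried fields, since the whole chunk is accepted or rejected from its summary alone.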
Replay
- … JsonParser.
Organize, writer, dft
- … parse_inflated.h.
- wants_drain/drain_pending hooks on reorganize.
Arrow and call_tree
- call_tree save/load refactored and gains Arrow IPC support; perf improvements in tree construction.
Build and infrastructure
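For the cgroup v1/v2 memory limit detection mentioned in the commit summary, a minimal sketch of the idea follows. It is not the project's code: `cgroup_memory_limit` is a hypothetical name, and the function only checks the two standard control files under one mount point. It does not resolve nested cgroup paths from /proc/self/cgroup, which a real implementation may need to do.

```python
from pathlib import Path


def cgroup_memory_limit(root="/sys/fs/cgroup"):
    """Return the cgroup memory limit in bytes, or None if unlimited/unknown."""
    base = Path(root)
    candidates = (
        base / "memory.max",                        # cgroup v2
        base / "memory" / "memory.limit_in_bytes",  # cgroup v1
    )
    for path in candidates:
        try:
            raw = path.read_text().strip()
        except OSError:
            continue                 # file absent or unreadable, try the next
        if raw == "max":             # v2 spelling of "no limit"
            return None
        value = int(raw)
        # v1 reports a huge page-aligned number when no limit is configured.
        if value >= 1 << 60:
            return None
        return value
    return None
```

A caller would typically take the minimum of this value and physical memory before deriving per-file memory estimates or batch sizes.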
- … simdjson::simdjson fallback in link_simdjson.
- … dftracer_tar, dftracer_server, dftracer_replay with leak detection disabled for OpenMPI compatibility).
Docs