Skip to content

feat(dfanalyzer): integration of bridge module, exact-interval reindex, cat normalization#69

Merged
hariharan-devarajan merged 1 commit into
llnl:developfrom
rayandrew:feat/dfanalyzer-integration
May 21, 2026
Merged

feat(dfanalyzer): integration of bridge module, exact-interval reindex, cat normalization#69
hariharan-devarajan merged 1 commit into
llnl:developfrom
rayandrew:feat/dfanalyzer-integration

Conversation

@rayandrew
Copy link
Copy Markdown
Collaborator

Summary

Folds the dftracer-utils <-> dfanalyzer integration logic into dftracer-utils so that dfanalyzer stays lean and the integration code lives next to the Indexer and Arrow plumbing it depends on. Three independent pieces:

  1. dftracer.utils.dfanalyzer bridge module: the dfanalyzer-facing helpers (index orchestration, Arrow-IPC plumbing, HLM groupby) now live here instead of being vendored in dfanalyzer.
  2. Exact aggregation-interval handling: an index built at a different time_interval is now detected and rebuilt automatically; no external sidecar file is needed.
  3. cat normalization at source: the event category is lowercased where the aggregation key is built, so grouping, queries, and Arrow output all agree.

Changes

New: dftracer/utils/dfanalyzer.py bridge module

An isolated module (intentionally not re-exported from dftracer.utils) holding the dfanalyzer integration surface:

  • Index orchestration: index_path_for, build_index_distributed, ensure_index, resolve_trace_inputs.
  • Arrow-IPC plumbing: scan_to_ipc, batches_to_ipc, ipc_to_pandas.
  • HLM groupby: distributed_hlm, worker_hlm_partial, make_empty_hlm, partial_arrow_view_groupby, finalize_view_partials, build_partial_meta, build_final_meta.
  • DataFrame shaping: coerce_profile_dtypes, coerce_arrow_numerics_to_pandas_native, normalize_arrow_dtypes.

All dfanalyzer-specific schema (column dtype maps, HLM aggregation spec, set-flatten function) is passed in as parameters, so the module itself stays schema-agnostic.

dftracer/utils/dask.py

  • register_auto_thread_plugin: registers the worker plugin sizing each worker's C++ Runtime threads as hw_concurrency / workers_on_node; idempotent per scheduler address.
  • resolve_local_staging: derives node-local SST scratch from each worker's own scratch directory.

Exact aggregation-interval reindex (C++)

  • resolve_and_build_index (resolve_and_build.cpp): when the resolver reports a cached aggregation tier built at a different time_interval, the index is discarded and rebuilt at the requested interval. The stored interval lives in the index itself (AGG_GLOBAL_CONFIG_KEY), so no external .meta sidecar is required.
  • index_resolver_utility.cpp: propagates stored_time_interval_us whenever a compatible aggregation config exists.
  • batch_indexer.cpp: resolve() now reports aggregation_interval_us and needs_rebuild; ensure_indexed() rebuilds when an interval mismatch is detected.
  • indexer.py: IndexStatus gains aggregation_interval_us.

cat normalization at source (C++)

  • internal/utils.{h,cpp}: new to_lower_ascii; is_data_transfer_op is now case-insensitive.
  • aggregation_logic.cpp / aggregation_visitor.cpp: the cat field is lowercased where the aggregation key is built, so the AGGREGATION key, group-by output, query filters, and Arrow output are all consistent.

Copilot AI review requested due to automatic review settings May 21, 2026 04:39
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Integrates dfanalyzer-facing orchestration and Arrow/HLM helpers directly into dftracer-utils, while improving aggregation index correctness (rebuild on time-interval mismatch) and normalizing event categories to lowercase across aggregation/query/Arrow output.

Changes:

  • Added a new python/dftracer/utils/dfanalyzer.py “bridge” module for dfanalyzer integration (index orchestration, Arrow IPC conversion, distributed HLM helpers).
  • Implemented exact aggregation interval mismatch detection and automatic rebuild (C++ resolver/build + Python status plumbing).
  • Normalized cat to lowercase at aggregation-key construction and updated tests/queries/docs accordingly; added Dask worker plugin auto-thread sizing utilities.

Reviewed changes

Copilot reviewed 17 out of 17 changed files in this pull request and generated 9 comments.

Show a summary per file
File Description
tests/utilities/composites/dft/aggregators/test_chunk_aggregator_utility.cpp Updates expectations for lowercased cat output.
tests/utilities/composites/dft/aggregators/test_aggregator_utility.cpp Updates expectations for lowercased cat output.
tests/python/test_indexer.py Adds coverage for reporting aggregation interval and rebuild behavior; updates query strings for lowercased cat.
tests/python/test_aggregator.py Updates expected cat values and query filters to lowercase.
src/dftracer/utils/utilities/composites/dft/internal/utils.cpp Adds ASCII lowercasing helper and makes is_data_transfer_op case-insensitive.
include/dftracer/utils/utilities/composites/dft/internal/utils.h Declares to_lower_ascii helper for cat normalization.
src/dftracer/utils/utilities/composites/dft/indexing/resolve_and_build.cpp Rebuilds index when cached aggregation interval differs from requested interval.
src/dftracer/utils/utilities/composites/dft/indexing/index_resolver_utility.cpp Propagates stored aggregation interval info from cached tiers.
src/dftracer/utils/utilities/composites/dft/aggregators/aggregation_visitor.cpp Lowercases cat when building serialized aggregation keys.
src/dftracer/utils/utilities/composites/dft/aggregators/aggregation_logic.cpp Lowercases cat when constructing AggregationKey for map aggregation.
src/dftracer/utils/python/batch_indexer.cpp Exposes aggregation_interval_us in resolve() and triggers rebuild when interval mismatch is detected.
python/dftracer/utils/indexer.py Extends IndexStatus with aggregation_interval_us and plumbs it through resolve().
python/dftracer/utils/dfanalyzer.py New bridge module implementing index orchestration, Arrow IPC plumbing, and distributed HLM helpers.
python/dftracer/utils/dask.py Adds resolve_local_staging() and idempotent register_auto_thread_plugin().
docs/source/cpp_api/utilities.rst Adjusts doc links to use .../index pages.
docs/source/conf.py Mocks numpy/pandas for doc builds.
docs/source/api/index.rst Adds dfanalyzer to the Python API docs toctree.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/dftracer/utils/utilities/composites/dft/indexing/resolve_and_build.cpp Outdated
Comment thread src/dftracer/utils/utilities/composites/dft/indexing/resolve_and_build.cpp Outdated
Comment thread src/dftracer/utils/utilities/composites/dft/internal/utils.cpp Outdated
Comment thread src/dftracer/utils/utilities/composites/dft/aggregators/aggregation_visitor.cpp Outdated
Comment thread python/dftracer/utils/dfanalyzer.py Outdated
Comment thread python/dftracer/utils/dfanalyzer.py
Comment thread python/dftracer/utils/indexer.py
Comment thread docs/source/api/index.rst
@rayandrew rayandrew force-pushed the feat/dfanalyzer-integration branch from 66f99ca to 18dfeae Compare May 21, 2026 05:23
@hariharan-devarajan hariharan-devarajan self-requested a review May 21, 2026 15:34
@hariharan-devarajan hariharan-devarajan merged commit 13b43a8 into llnl:develop May 21, 2026
50 of 51 checks passed
@rayandrew rayandrew deleted the feat/dfanalyzer-integration branch May 21, 2026 16:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants