feat(dfanalyzer): integration of bridge module, exact-interval reindex, cat normalization#69
Merged
hariharan-devarajan merged 1 commit intoMay 21, 2026
Conversation
There was a problem hiding this comment.
Pull request overview
Integrates dfanalyzer-facing orchestration and Arrow/HLM helpers directly into dftracer-utils, while improving aggregation index correctness (rebuild on time-interval mismatch) and normalizing event categories to lowercase across aggregation/query/Arrow output.
Changes:
- Added a new
python/dftracer/utils/dfanalyzer.py“bridge” module for dfanalyzer integration (index orchestration, Arrow IPC conversion, distributed HLM helpers). - Implemented exact aggregation interval mismatch detection and automatic rebuild (C++ resolver/build + Python status plumbing).
- Normalized
catto lowercase at aggregation-key construction and updated tests/queries/docs accordingly; added Dask worker plugin auto-thread sizing utilities.
Reviewed changes
Copilot reviewed 17 out of 17 changed files in this pull request and generated 9 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/utilities/composites/dft/aggregators/test_chunk_aggregator_utility.cpp | Updates expectations for lowercased cat output. |
| tests/utilities/composites/dft/aggregators/test_aggregator_utility.cpp | Updates expectations for lowercased cat output. |
| tests/python/test_indexer.py | Adds coverage for reporting aggregation interval and rebuild behavior; updates query strings for lowercased cat. |
| tests/python/test_aggregator.py | Updates expected cat values and query filters to lowercase. |
| src/dftracer/utils/utilities/composites/dft/internal/utils.cpp | Adds ASCII lowercasing helper and makes is_data_transfer_op case-insensitive. |
| include/dftracer/utils/utilities/composites/dft/internal/utils.h | Declares to_lower_ascii helper for cat normalization. |
| src/dftracer/utils/utilities/composites/dft/indexing/resolve_and_build.cpp | Rebuilds index when cached aggregation interval differs from requested interval. |
| src/dftracer/utils/utilities/composites/dft/indexing/index_resolver_utility.cpp | Propagates stored aggregation interval info from cached tiers. |
| src/dftracer/utils/utilities/composites/dft/aggregators/aggregation_visitor.cpp | Lowercases cat when building serialized aggregation keys. |
| src/dftracer/utils/utilities/composites/dft/aggregators/aggregation_logic.cpp | Lowercases cat when constructing AggregationKey for map aggregation. |
| src/dftracer/utils/python/batch_indexer.cpp | Exposes aggregation_interval_us in resolve() and triggers rebuild when interval mismatch is detected. |
| python/dftracer/utils/indexer.py | Extends IndexStatus with aggregation_interval_us and plumbs it through resolve(). |
| python/dftracer/utils/dfanalyzer.py | New bridge module implementing index orchestration, Arrow IPC plumbing, and distributed HLM helpers. |
| python/dftracer/utils/dask.py | Adds resolve_local_staging() and idempotent register_auto_thread_plugin(). |
| docs/source/cpp_api/utilities.rst | Adjusts doc links to use .../index pages. |
| docs/source/conf.py | Mocks numpy/pandas for doc builds. |
| docs/source/api/index.rst | Adds dfanalyzer to the Python API docs toctree. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
66f99ca to
18dfeae
Compare
hariharan-devarajan
approved these changes
May 21, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Folds the dftracer-utils <-> dfanalyzer integration logic into dftracer-utils so that dfanalyzer stays lean and the integration code lives next to the Indexer and Arrow plumbing it depends on. Three independent pieces:
dftracer.utils.dfanalyzerbridge module: the dfanalyzer-facing helpers (index orchestration, Arrow-IPC plumbing, HLM groupby) now live here instead of being vendored in dfanalyzer.time_intervalis now detected and rebuilt automatically; no external sidecar file is needed.catnormalization at source: the event category is lowercased where the aggregation key is built, so grouping, queries, and Arrow output all agree.Changes
New:
dftracer/utils/dfanalyzer.pybridge moduleAn isolated module (intentionally not re-exported from
dftracer.utils) holding the dfanalyzer integration surface:index_path_for,build_index_distributed,ensure_index,resolve_trace_inputs.scan_to_ipc,batches_to_ipc,ipc_to_pandas.distributed_hlm,worker_hlm_partial,make_empty_hlm,partial_arrow_view_groupby,finalize_view_partials,build_partial_meta,build_final_meta.coerce_profile_dtypes,coerce_arrow_numerics_to_pandas_native,normalize_arrow_dtypes.All dfanalyzer-specific schema (column dtype maps, HLM aggregation spec, set-flatten function) is passed in as parameters, so the module itself stays schema-agnostic.
dftracer/utils/dask.pyregister_auto_thread_plugin: registers the worker plugin sizing each worker's C++ Runtime threads ashw_concurrency / workers_on_node; idempotent per scheduler address.resolve_local_staging: derives node-local SST scratch from each worker's own scratch directory.Exact aggregation-interval reindex (C++)
resolve_and_build_index(resolve_and_build.cpp): when the resolver reports a cached aggregation tier built at a differenttime_interval, the index is discarded and rebuilt at the requested interval. The stored interval lives in the index itself (AGG_GLOBAL_CONFIG_KEY), so no external.metasidecar is required.index_resolver_utility.cpp: propagatesstored_time_interval_uswhenever a compatible aggregation config exists.batch_indexer.cpp:resolve()now reportsaggregation_interval_usandneeds_rebuild;ensure_indexed()rebuilds when an interval mismatch is detected.indexer.py:IndexStatusgainsaggregation_interval_us.catnormalization at source (C++)internal/utils.{h,cpp}: newto_lower_ascii;is_data_transfer_opis now case-insensitive.aggregation_logic.cpp/aggregation_visitor.cpp: thecatfield is lowercased where the aggregation key is built, so the AGGREGATION key, group-by output, query filters, and Arrow output are all consistent.