
feat(aggregator): offset metrics, per-event-name system metrics, and time-bucket persistence #68

Merged
hariharan-devarajan merged 3 commits into llnl:develop from rayandrew:feat/dfanalyzer-integration
May 20, 2026

@rayandrew
Collaborator

Summary

Extends the aggregator and dfanalyzer integration path with three independent
improvements:

  1. Offset metric tracking alongside the existing duration and size metrics.
  2. Per-event-name keying for the SYSTEM_METRICS column family so distinct
    counters (cpu, memory, etc.) keep separate buckets and can be pivoted into
    named columns on the dfanalyzer side.
  3. Explicit time-bound persistence so a read-only reopen of an
    SST-built index can recover the trace origin.

Changes

Offset metric

  • New MetricStats offset field on AggregationMetrics, wired through the
    copy/assign/merge paths and update_offset().
  • Serialization formats (AggMetricsView, AggMetricsFullView, the fast and
    full parsers, and serialize_agg_value_into / deserialize_agg_value) now
    carry offset_total/min/max plus mean/m2 and an offset_stddev() helper.
  • Aggregation logic and the visitor read offset / offset_sum /
    offset_min / offset_max event args. Offset has no meaningful sum, so
    ingestion triggers on any of the offset args being present. The reserved-arg
    filters in both aggregation_logic.cpp and aggregation_visitor.cpp exclude
    these keys from custom-metric tracking.
  • apply_preaggregated_metric no longer requires sum to exist; it now also
    fires when only min/max are present.
  • dfanalyzer schema gains offset_min / offset_max columns. They are emitted
    as null when no offset was ever recorded (MetricStats default
    min=UINT64_MAX, max=0), since 0 is itself a valid offset.
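
The null-emission rule for offset_min / offset_max can be illustrated with a small sketch. OffsetStats, record_offset, and the two *_column helpers are hypothetical stand-ins, not the PR's actual MetricStats or schema code; only the default values (min=UINT64_MAX, max=0) come from the description above.

```cpp
#include <algorithm>
#include <cstdint>
#include <limits>
#include <optional>

// Hypothetical stand-in for the offset part of MetricStats.
// Defaults mirror the description: min=UINT64_MAX, max=0.
struct OffsetStats {
    std::uint64_t min = std::numeric_limits<std::uint64_t>::max();
    std::uint64_t max = 0;
};

// Fold one observed offset into the running min/max.
inline void record_offset(OffsetStats& s, std::uint64_t offset) {
    s.min = std::min(s.min, offset);
    s.max = std::max(s.max, offset);
}

// True only in the untouched default state. After any record_offset call
// min <= max holds, so a recorded offset of 0 is never mistaken for "unset".
inline bool offset_unset(const OffsetStats& s) {
    return s.min == std::numeric_limits<std::uint64_t>::max() && s.max == 0;
}

// std::nullopt models a null cell in the exported column.
inline std::optional<std::uint64_t> offset_min_column(const OffsetStats& s) {
    if (offset_unset(s)) return std::nullopt;
    return s.min;
}

inline std::optional<std::uint64_t> offset_max_column(const OffsetStats& s) {
    if (offset_unset(s)) return std::nullopt;
    return s.max;
}
```

Checking the sentinel pair rather than a single field is what lets 0 remain a valid offset: a row that recorded only offset 0 has min=0 and max=0, which differs from the default state.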

Per-event-name system metrics

  • SystemMetricKey gains a name field; key serialization becomes
    [hhash][name][time_bucket].
  • handle_system_event keys buffers by ev.name.
  • New EventAggregator::scan_system_metrics_raw[_fn] provides a sequential
    scan of the SYSTEM_METRICS CF (its keys carry no shard prefix).
  • New scan_system_metrics_buffer() in the Python binding does a two-pass
    scan: pass 1 discovers the dynamic metric column names, pass 2 emits rows.
    The schema must be declared up front for RecordBatchBuilder. Results are
    appended to Indexer_iter_arrow_dfanalyzer_all.
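
To make the [hhash][name][time_bucket] layout concrete, here is a minimal round-trip sketch. The byte-level encoding is an assumption: the PR text does not say how name is delimited or which endianness is used, so this sketch picks a 2-byte big-endian length prefix and big-endian 8-byte integers. SketchSystemMetricKey, serialize_key, and deserialize_key are hypothetical names, not the PR's API.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <stdexcept>
#include <string>
#include <string_view>

// Hypothetical mirror of SystemMetricKey with the new name field.
struct SketchSystemMetricKey {
    std::uint64_t hhash = 0;
    std::string name;
    std::uint64_t time_bucket = 0;
};

// Append a 64-bit value in big-endian byte order.
inline void append_be64(std::string& out, std::uint64_t v) {
    for (int shift = 56; shift >= 0; shift -= 8) {
        out.push_back(static_cast<char>((v >> shift) & 0xFF));
    }
}

// Read a big-endian 64-bit value starting at pos (caller checks bounds).
inline std::uint64_t read_be64(std::string_view in, std::size_t pos) {
    std::uint64_t v = 0;
    for (std::size_t i = 0; i < 8; ++i) {
        v = (v << 8) | static_cast<unsigned char>(in[pos + i]);
    }
    return v;
}

// Layout (assumed): [hhash: 8][name length: 2][name bytes][time_bucket: 8].
inline std::string serialize_key(const SketchSystemMetricKey& k) {
    if (k.name.size() > 0xFFFF) {
        throw std::length_error("name does not fit a 2-byte length prefix");
    }
    std::string out;
    out.reserve(8 + 2 + k.name.size() + 8);
    append_be64(out, k.hhash);
    const auto len = static_cast<std::uint16_t>(k.name.size());
    out.push_back(static_cast<char>((len >> 8) & 0xFF));
    out.push_back(static_cast<char>(len & 0xFF));
    out.append(k.name);
    append_be64(out, k.time_bucket);
    return out;
}

// Returns std::nullopt when the buffer does not match the assumed layout.
inline std::optional<SketchSystemMetricKey> deserialize_key(std::string_view in) {
    constexpr std::size_t kFixed = 8 + 2 + 8;
    if (in.size() < kFixed) return std::nullopt;
    const std::size_t len =
        (static_cast<std::size_t>(static_cast<unsigned char>(in[8])) << 8) |
        static_cast<std::size_t>(static_cast<unsigned char>(in[9]));
    if (in.size() != kFixed + len) return std::nullopt;
    SketchSystemMetricKey k;
    k.hhash = read_be64(in, 0);
    k.name = std::string(in.substr(10, len));
    k.time_bucket = read_be64(in, 10 + len);
    return k;
}
```

With this assumed encoding, all keys for one host and one metric name share a common prefix, and under a bytewise comparator they sort by time bucket, which suits the sequential scan described above.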

Time-bucket persistence and bucket alignment

  • New EventAggregator::persist_time_bounds() writes the in-memory min/max
    time bucket to the AGGREGATION CF. The SST build path now calls it
    explicitly after merge_chunk() in resolve_and_build_index, so a later
    read-only reopen recovers the trace origin instead of emitting time_range
    as an absolute bucket index.
  • Counter (ph="C") events report stats for the period ending at ev.ts, so
    a boundary-aligned timestamp is assigned to the bucket it summarizes (the
    one before it). Plain events keep their own timestamp.
  • In the dfanalyzer scan, counter/profile rows align time_start/time_end
    to the bucket grid; plain events keep precise min/max event timestamps.
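
The counter alignment rule can be sketched as follows. This assumes a bucket index is computed as ts / interval with interval > 0, and that a counter at ts = 0 stays in bucket 0; bucket_index is a hypothetical helper, not a function from the PR.

```cpp
#include <cstdint>

// Hypothetical helper: map a timestamp to a time-bucket index.
// Assumes interval > 0 and that buckets are [n*interval, (n+1)*interval).
inline std::uint64_t bucket_index(std::uint64_t ts, std::uint64_t interval,
                                  bool is_counter) {
    std::uint64_t bucket = ts / interval;
    // A counter (ph="C") event reports the period ending at ts. When ts sits
    // exactly on a bucket boundary, that period is the previous bucket.
    if (is_counter && ts % interval == 0 && bucket > 0) {
        --bucket;
    }
    return bucket;
}
```

For example, with interval = 1000 a counter event at ts = 2000 lands in bucket 1, while a plain event at the same timestamp lands in bucket 2. A counter at ts = 2500 is not boundary-aligned and stays in bucket 2.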

Hash resolution

  • Unresolved file/host hashes now resolve to an empty string view instead of
    the hash itself. The dfanalyzer side treats empty file_name/host_name
    as missing (NA).
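
A minimal sketch of the new resolution behavior, assuming hashes are 64-bit integers looked up in an in-memory table (the PR does not state the hash type or storage). HashTable, resolve_hash, and is_missing are hypothetical names.

```cpp
#include <cstdint>
#include <string>
#include <string_view>
#include <unordered_map>

// Hypothetical lookup table from hash to resolved file or host name.
using HashTable = std::unordered_map<std::uint64_t, std::string>;

// Returns a view into the table entry, or an empty view when unresolved.
// The view is valid only while the table entry is alive and unmodified.
inline std::string_view resolve_hash(const HashTable& table, std::uint64_t hash) {
    const auto it = table.find(hash);
    if (it == table.end()) return {};
    return it->second;
}

// Consumer-side rule: an empty name means "missing" (NA).
inline bool is_missing(std::string_view resolved) { return resolved.empty(); }
```

Returning an empty view instead of the hash keeps name columns free of values that look like names but are not, at the cost of losing the raw hash in that column.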

Rename: lustre_staging -> shared_staging

  • distributed_index and _build_sst_task rename the lustre_staging
    parameter to shared_staging (it need not be Lustre, only a shared FS).
    Docstrings and tests updated accordingly.

Tests

  • test_system_metrics.cpp: key round-trip test updated for the new name
    field.
  • test_distributed_manifest.py: updated for the shared_staging rename.

Copilot AI review requested due to automatic review settings May 20, 2026 13:29

Copilot AI left a comment

Pull request overview

This PR extends the DFT aggregator + dfanalyzer export path to support (1) offset metric aggregation, (2) per-event-name keying and scanning for system metrics stored in the SYSTEM_METRICS RocksDB column family, and (3) persistence of time-bucket bounds so read-only reopens can recover the trace’s time origin.

Changes:

  • Add offset metric tracking throughout aggregation, merge, and (de)serialization, and expose offset_min/offset_max in the dfanalyzer Arrow schema/export.
  • Change system-metrics key serialization to include an event name, add raw scanning APIs for the SYSTEM_METRICS CF, and export those metrics via a two-pass Arrow builder in the Python binding.
  • Persist min/max time-bucket bounds during SST index build and align counter/profile bucket timestamps to intended bucket boundaries; rename lustre_staging → shared_staging in the Dask distributed index path.
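
The two-pass export of system metrics summarized in the list above can be sketched with plain containers. This is not the binding's Arrow code: RawMetric, PivotedTable, and pivot_two_pass are hypothetical, and std::optional<double> stands in for a nullable Arrow cell. The point is only the shape of the algorithm, where column names must be known before any row is emitted.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <map>
#include <optional>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical decoded entry from a scan of the SYSTEM_METRICS CF.
struct RawMetric {
    std::uint64_t hhash;
    std::string name;
    std::uint64_t time_bucket;
    double value;
};

// One row per (hhash, time_bucket); one nullable cell per metric name.
struct PivotedTable {
    std::vector<std::string> columns;  // sorted metric names
    std::map<std::pair<std::uint64_t, std::uint64_t>,
             std::vector<std::optional<double>>> rows;
};

inline PivotedTable pivot_two_pass(const std::vector<RawMetric>& scan) {
    PivotedTable out;

    // Pass 1: discover the dynamic column names so the schema is fixed
    // before any row is built.
    std::set<std::string> names;
    for (const auto& m : scan) names.insert(m.name);
    out.columns.assign(names.begin(), names.end());

    // Pass 2: emit rows, placing each value in its named column.
    for (const auto& m : scan) {
        const std::pair<std::uint64_t, std::uint64_t> key{m.hhash, m.time_bucket};
        auto& cells = out.rows.try_emplace(key, out.columns.size()).first->second;
        const auto it =
            std::lower_bound(out.columns.begin(), out.columns.end(), m.name);
        const auto col = static_cast<std::size_t>(it - out.columns.begin());
        cells[col] = m.value;  // cells not written stay null
    }
    return out;
}
```

A host that reports cpu in every bucket but mem only in some buckets produces rows where the mem cell is left empty, rather than forcing a schema change mid-stream.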

Reviewed changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| tests/utilities/composites/dft/aggregators/test_system_metrics.cpp | Updates key round-trip test for the new system-metrics key format including name. |
| tests/python/test_distributed_manifest.py | Updates tests for lustre_staging → shared_staging rename. |
| src/dftracer/utils/utilities/composites/dft/indexing/resolve_and_build.cpp | Calls persist_time_bounds() after SST merge/build so read-only reopen can recover origin. |
| src/dftracer/utils/utilities/composites/dft/aggregators/system_metrics_serialization.cpp | Extends system-metrics key serialization/deserialization to [hhash][name][time_bucket]. |
| src/dftracer/utils/utilities/composites/dft/aggregators/event_aggregator.cpp | Adds scan_system_metrics_raw_fn and persist_time_bounds() implementations. |
| src/dftracer/utils/utilities/composites/dft/aggregators/aggregation_visitor.cpp | Adds reserved offset args, aligns profile bucket timestamps, ingests offset args, and keys system metrics by event name. |
| src/dftracer/utils/utilities/composites/dft/aggregators/aggregation_serialization.cpp | Includes offset metric stats in aggregation value serialization/deserialization. |
| src/dftracer/utils/utilities/composites/dft/aggregators/aggregation_metrics.cpp | Implements AggregationMetrics::update_offset and merges offset stats. |
| src/dftracer/utils/utilities/composites/dft/aggregators/aggregation_logic.cpp | Adjusts preaggregated metric application and adds offset ingestion + reserved-arg filtering. |
| src/dftracer/utils/python/batch_indexer.cpp | Extends dfanalyzer schema/export with offset columns, time alignment for profile rows, unresolved-hash behavior, and a two-pass export for SYSTEM_METRICS. |
| python/dftracer/utils/dask.py | Renames lustre_staging → shared_staging and updates movement logic/docstrings. |
| include/dftracer/utils/utilities/composites/dft/aggregators/system_metrics_serialization.h | Updates SystemMetricKey and key serialization API to include name. |
| include/dftracer/utils/utilities/composites/dft/aggregators/event_aggregator.h | Declares new system-metrics scan and persist_time_bounds() APIs. |
| include/dftracer/utils/utilities/composites/dft/aggregators/aggregation_serialization.h | Extends AggMetrics view structs/parsers with offset fields and stddev helper. |
| include/dftracer/utils/utilities/composites/dft/aggregators/aggregation_metrics.h | Adds MetricStats offset and update_offset() to AggregationMetrics. |
Comments suppressed due to low confidence (1)

src/dftracer/utils/utilities/composites/dft/aggregators/aggregation_logic.cpp:30

  • apply_preaggregated_metric now updates stats.count even when the pre-aggregated *_sum/plain value is missing and only *_min/*_max are present. In that case total is forced to 0, which will skew mean (and any later derived stats) by increasing the denominator without increasing the numerator. Consider only incrementing stats.count/stats.total (and recomputing mean) when sum_val.exists(), while still allowing min/max to update independently when present.
    if (!sum_val.exists() && !min_val.exists() && !max_val.exists()) return;

    const auto total = sum_val.exists() ? sum_val.get<std::uint64_t>() : 0;
    stats.count += ev_count;
    stats.total += total;
    if (min_val.exists()) {
        stats.min = std::min(stats.min, min_val.get<std::uint64_t>());
    }
    if (max_val.exists()) {
        stats.max = std::max(stats.max, max_val.get<std::uint64_t>());
    }

    if (stats.count > 0) {
        stats.mean =
            static_cast<double>(stats.total) / static_cast<double>(stats.count);
        stats.m2 = 0.0;
    }
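
One way the reviewer's suggestion could look is sketched below. This is not the merged code: SketchStats and apply_preaggregated_sketch are hypothetical, and std::optional<std::uint64_t> stands in for the event-arg accessor with exists()/get() shown in the snippet above.

```cpp
#include <algorithm>
#include <cstdint>
#include <limits>
#include <optional>

// Hypothetical stand-in for the stats struct used in the snippet above.
struct SketchStats {
    std::uint64_t count = 0;
    std::uint64_t total = 0;
    std::uint64_t min = std::numeric_limits<std::uint64_t>::max();
    std::uint64_t max = 0;
    double mean = 0.0;
    double m2 = 0.0;
};

// Variant following the review comment: min/max update independently,
// while count/total/mean change only when a sum is actually present.
inline void apply_preaggregated_sketch(SketchStats& stats,
                                       std::optional<std::uint64_t> sum_val,
                                       std::optional<std::uint64_t> min_val,
                                       std::optional<std::uint64_t> max_val,
                                       std::uint64_t ev_count) {
    if (min_val) stats.min = std::min(stats.min, *min_val);
    if (max_val) stats.max = std::max(stats.max, *max_val);

    // Without a sum, leave the denominator alone so the mean is not diluted.
    if (!sum_val) return;

    stats.count += ev_count;
    stats.total += *sum_val;
    if (stats.count > 0) {
        stats.mean =
            static_cast<double>(stats.total) / static_cast<double>(stats.count);
        stats.m2 = 0.0;
    }
}
```

The trade-off is that a metric which never carries a sum, such as offset as described in this PR, would keep count at zero under this variant. Whether that is acceptable depends on how downstream consumers interpret count for such metrics.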

Comment thread src/dftracer/utils/utilities/composites/dft/aggregators/event_aggregator.cpp Outdated
Comment thread src/dftracer/utils/python/batch_indexer.cpp Outdated
@hariharan-devarajan hariharan-devarajan merged commit 9d6bf82 into llnl:develop May 20, 2026
26 of 29 checks passed
@rayandrew rayandrew deleted the feat/dfanalyzer-integration branch May 21, 2026 00:01