feat(aggregator): offset metrics, per-event-name system metrics, and time-bucket persistence by rayandrew · Pull Request #68 · llnl/dftracer-utils

rayandrew · 2026-05-20T13:29:49Z

Summary

Extends the aggregator and dfanalyzer integration path with three independent
improvements:

- Offset metric tracking alongside the existing duration and size metrics.
- Per-event-name keying for the `SYSTEM_METRICS` column family so distinct
  counters (cpu, memory, etc.) keep separate buckets and can be pivoted into
  named columns on the dfanalyzer side.
- Explicit time-bound persistence so a read-only reopen of an
  SST-built index can recover the trace origin.

Changes

Offset metric

- New `MetricStats offset` field on `AggregationMetrics`, wired through the
  copy/assign/merge paths and `update_offset()`.
- Serialization formats (`AggMetricsView`, `AggMetricsFullView`, the fast and
  full parsers, and `serialize_agg_value_into` / `deserialize_agg_value`) now
  carry `offset_total/min/max` plus `mean/m2` and an `offset_stddev()` helper.
- Aggregation logic and the visitor read `offset` / `offset_sum` /
  `offset_min` / `offset_max` event args. Offset has no meaningful sum, so
  ingestion triggers on any of the offset args being present. The reserved-arg
  filters in both `aggregation_logic.cpp` and `aggregation_visitor.cpp` exclude
  these keys from custom-metric tracking.
- `apply_preaggregated_metric` no longer requires sum to exist; it now also
  fires when only min/max are present.
- dfanalyzer schema gains `offset_min` / `offset_max` columns. They are emitted
  as null when no offset was ever recorded (`MetricStats` default
  `min=UINT64_MAX`, `max=0`), since 0 is itself a valid offset.
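
The null-emission rule can be illustrated with a minimal Python sketch. `OffsetStats` and `offset_columns` are hypothetical names used only for illustration; they are not the actual C++ `MetricStats` type or the binding code, and only the sentinel defaults (`min=UINT64_MAX`, `max=0`) are taken from the description above.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

UINT64_MAX = 2**64 - 1


@dataclass
class OffsetStats:
    """Hypothetical stand-in for the offset stats (not the real C++ type)."""
    count: int = 0
    min_value: int = UINT64_MAX  # sentinel default: nothing recorded yet
    max_value: int = 0           # sentinel default: nothing recorded yet

    def update(self, offset: int) -> None:
        """Record one observed offset."""
        self.count += 1
        self.min_value = min(self.min_value, offset)
        self.max_value = max(self.max_value, offset)

    def merge(self, other: "OffsetStats") -> None:
        """Fold another bucket's stats in; untouched sentinels stay neutral."""
        self.count += other.count
        self.min_value = min(self.min_value, other.min_value)
        self.max_value = max(self.max_value, other.max_value)


def offset_columns(stats: OffsetStats) -> Tuple[Optional[int], Optional[int]]:
    """Return (offset_min, offset_max), with None meaning 'never recorded'."""
    # Any real update leaves min_value <= max_value, so the untouched default
    # pair cannot be confused with a genuine offset of 0.
    if stats.min_value == UINT64_MAX and stats.max_value == 0:
        return None, None
    return stats.min_value, stats.max_value
```

In this sketch an event at offset 0 yields `(0, 0)`, while a bucket that never saw an offset yields `(None, None)`, which is the distinction the schema change relies on.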

Per-event-name system metrics

- `SystemMetricKey` gains a `name` field; key serialization becomes
  `[hhash][name][time_bucket]`.
- `handle_system_event` keys buffers by `ev.name`.
- New `EventAggregator::scan_system_metrics_raw[_fn]` provides a sequential
  scan of the `SYSTEM_METRICS` CF (its keys carry no shard prefix).
- New `scan_system_metrics_buffer()` in the Python binding does a two-pass
  scan: pass 1 discovers the dynamic metric column names, pass 2 emits rows.
  The schema must be declared up front for `RecordBatchBuilder`. Results are
  appended to `Indexer_iter_arrow_dfanalyzer_all`.
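
As a rough illustration of the key layout and the two-pass pivot, here is a small Python sketch. The byte widths, the length-prefixed name encoding, and grouping rows by `(hhash, time_bucket)` are all assumptions made for this example; the description only states the field order `[hhash][name][time_bucket]`. The functions `encode_key`, `decode_key`, and `pivot_system_metrics` are hypothetical and do not exist in the library.

```python
import struct

# Assumed layout: 8-byte big-endian hhash, 2-byte name length, UTF-8 name,
# then an 8-byte big-endian time bucket. Only the field order comes from
# the PR description; the widths are illustrative.
_HEADER = ">QH"
_HEADER_SIZE = struct.calcsize(_HEADER)


def encode_key(hhash: int, name: str, time_bucket: int) -> bytes:
    raw = name.encode("utf-8")
    return struct.pack(_HEADER, hhash, len(raw)) + raw + struct.pack(">Q", time_bucket)


def decode_key(key: bytes):
    hhash, name_len = struct.unpack_from(_HEADER, key, 0)
    name_end = _HEADER_SIZE + name_len
    name = key[_HEADER_SIZE:name_end].decode("utf-8")
    (time_bucket,) = struct.unpack_from(">Q", key, name_end)
    return hhash, name, time_bucket


def pivot_system_metrics(entries):
    """Pivot (key, value) pairs into one row per (hhash, time_bucket)."""
    # Pass 1: discover the dynamic metric names so the full column list is
    # known before any row is built.
    names = sorted({decode_key(key)[1] for key, _ in entries})
    columns = ["hhash", "time_bucket"] + names

    # Pass 2: emit rows, leaving None where a metric has no value.
    rows = {}
    for key, value in entries:
        hhash, name, bucket = decode_key(key)
        row = rows.setdefault((hhash, bucket), dict.fromkeys(names))
        row[name] = value
    table = [
        [hhash, bucket] + [row[n] for n in names]
        for (hhash, bucket), row in sorted(rows.items())
    ]
    return columns, table
```

Because each event name contributes its own key, a `cpu` sample and a `memory` sample in the same bucket no longer collide, and the pivot turns them into separate named columns.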

Time-bucket persistence and bucket alignment

- New `EventAggregator::persist_time_bounds()` writes the in-memory min/max
  time bucket to the `AGGREGATION` CF. The SST build path now calls it
  explicitly after `merge_chunk()` in `resolve_and_build_index`, so a later
  read-only reopen recovers the trace origin instead of emitting `time_range`
  as an absolute bucket index.
- Counter (`ph="C"`) events report stats for the period ending at `ev.ts`, so
  a boundary-aligned timestamp is assigned to the bucket it summarizes (the
  one before it). Plain events keep their own timestamp.
- In the dfanalyzer scan, counter/profile rows align `time_start`/`time_end`
  to the bucket grid; plain events keep precise min/max event timestamps.
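
The counter-alignment rule can be sketched as follows. This is an illustrative Python model, not the C++ code in `aggregation_visitor.cpp`: the function names, the integer-division bucketing, and the guard that keeps a timestamp of 0 in bucket 0 are assumptions made for the example.

```python
def bucket_for_event(ts: int, bucket_width: int, is_counter: bool) -> int:
    """Pick the time bucket an event is aggregated into.

    Assumes buckets are half-open intervals [k * width, (k + 1) * width).
    """
    bucket = ts // bucket_width
    # A counter sample describes the period ending at ts. When ts sits exactly
    # on a boundary, that period is the previous bucket, not the one starting
    # at ts. The bucket > 0 guard is an assumption to avoid a negative index.
    if is_counter and ts % bucket_width == 0 and bucket > 0:
        bucket -= 1
    return bucket


def bucket_bounds(bucket: int, bucket_width: int):
    """Grid-aligned (time_start, time_end) for a counter/profile row."""
    start = bucket * bucket_width
    return start, start + bucket_width
```

With a bucket width of 1000, a counter event at `ts=2000` lands in bucket 1 and is reported with bounds `(1000, 2000)`, whereas a plain event at the same timestamp stays in bucket 2.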

Hash resolution

Unresolved file/host hashes now resolve to an empty string view instead of
the hash itself. The dfanalyzer side treats empty file_name/host_name
as missing (NA).

Rename: `lustre_staging` -> `shared_staging`

distributed_index and _build_sst_task rename the lustre_staging
parameter to shared_staging (it need not be Lustre, only a shared FS).
Docstrings and tests updated accordingly.

Tests

- `test_system_metrics.cpp`: key round-trip test updated for the new `name`
  field.
- `test_distributed_manifest.py`: updated for the `shared_staging` rename.

Copilot

Pull request overview

This PR extends the DFT aggregator + dfanalyzer export path to support (1) offset metric aggregation, (2) per-event-name keying and scanning for system metrics stored in the SYSTEM_METRICS RocksDB column family, and (3) persistence of time-bucket bounds so read-only reopens can recover the trace’s time origin.

Changes:

- Add offset metric tracking throughout aggregation, merge, and (de)serialization, and expose `offset_min`/`offset_max` in the dfanalyzer Arrow schema/export.
- Change system-metrics key serialization to include an event name, add raw scanning APIs for the `SYSTEM_METRICS` CF, and export those metrics via a two-pass Arrow builder in the Python binding.
- Persist min/max time-bucket bounds during SST index build and align counter/profile bucket timestamps to intended bucket boundaries; rename `lustre_staging` → `shared_staging` in the Dask distributed index path.

Reviewed changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| tests/utilities/composites/dft/aggregators/test_system_metrics.cpp | Updates key round-trip test for the new system-metrics key format including `name`. |
| tests/python/test_distributed_manifest.py | Updates tests for `lustre_staging` → `shared_staging` rename. |
| src/dftracer/utils/utilities/composites/dft/indexing/resolve_and_build.cpp | Calls `persist_time_bounds()` after SST merge/build so read-only reopen can recover origin. |
| src/dftracer/utils/utilities/composites/dft/aggregators/system_metrics_serialization.cpp | Extends system-metrics key serialization/deserialization to `[hhash][name][time_bucket]`. |
| src/dftracer/utils/utilities/composites/dft/aggregators/event_aggregator.cpp | Adds `scan_system_metrics_raw_fn` and `persist_time_bounds()` implementations. |
| src/dftracer/utils/utilities/composites/dft/aggregators/aggregation_visitor.cpp | Adds reserved offset args, aligns profile bucket timestamps, ingests offset args, and keys system metrics by event name. |
| src/dftracer/utils/utilities/composites/dft/aggregators/aggregation_serialization.cpp | Includes `offset` metric stats in aggregation value serialization/deserialization. |
| src/dftracer/utils/utilities/composites/dft/aggregators/aggregation_metrics.cpp | Implements `AggregationMetrics::update_offset` and merges `offset` stats. |
| src/dftracer/utils/utilities/composites/dft/aggregators/aggregation_logic.cpp | Adjusts preaggregated metric application and adds offset ingestion + reserved-arg filtering. |
| src/dftracer/utils/python/batch_indexer.cpp | Extends dfanalyzer schema/export with offset columns, time alignment for profile rows, unresolved-hash behavior, and a two-pass export for `SYSTEM_METRICS`. |
| python/dftracer/utils/dask.py | Renames `lustre_staging` → `shared_staging` and updates movement logic/docstrings. |
| include/dftracer/utils/utilities/composites/dft/aggregators/system_metrics_serialization.h | Updates `SystemMetricKey` and key serialization API to include `name`. |
| include/dftracer/utils/utilities/composites/dft/aggregators/event_aggregator.h | Declares new system-metrics scan and `persist_time_bounds()` APIs. |
| include/dftracer/utils/utilities/composites/dft/aggregators/aggregation_serialization.h | Extends AggMetrics view structs/parsers with offset fields and stddev helper. |
| include/dftracer/utils/utilities/composites/dft/aggregators/aggregation_metrics.h | Adds `MetricStats offset` and `update_offset()` to `AggregationMetrics`. |

Comments suppressed due to low confidence (1)

src/dftracer/utils/utilities/composites/dft/aggregators/aggregation_logic.cpp:30

apply_preaggregated_metric now updates stats.count even when the pre-aggregated *_sum/plain value is missing and only *_min/*_max are present. In that case total is forced to 0, which will skew mean (and any later derived stats) by increasing the denominator without increasing the numerator. Consider only incrementing stats.count/stats.total (and recomputing mean) when sum_val.exists(), while still allowing min/max to update independently when present.

    if (!sum_val.exists() && !min_val.exists() && !max_val.exists()) return;

    const auto total = sum_val.exists() ? sum_val.get<std::uint64_t>() : 0;
    stats.count += ev_count;
    stats.total += total;
    if (min_val.exists()) {
        stats.min = std::min(stats.min, min_val.get<std::uint64_t>());
    }
    if (max_val.exists()) {
        stats.max = std::max(stats.max, max_val.get<std::uint64_t>());
    }

    if (stats.count > 0) {
        stats.mean =
            static_cast<double>(stats.total) / static_cast<double>(stats.count);
        stats.m2 = 0.0;
    }
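
The alternative proposed in this comment can be sketched as below. It is written in Python rather than the project's C++ to keep it short, and `Stats` and `apply_preaggregated` are hypothetical stand-ins for `MetricStats` and `apply_preaggregated_metric`. It models the reviewer's suggestion only; whether the merged code adopted it is not stated in this thread.

```python
from dataclasses import dataclass
from typing import Optional

UINT64_MAX = 2**64 - 1


@dataclass
class Stats:
    """Hypothetical mirror of the fields touched in the quoted snippet."""
    count: int = 0
    total: int = 0
    min_value: int = UINT64_MAX
    max_value: int = 0
    mean: float = 0.0
    m2: float = 0.0


def apply_preaggregated(
    stats: Stats,
    ev_count: int,
    sum_val: Optional[int] = None,
    min_val: Optional[int] = None,
    max_val: Optional[int] = None,
) -> None:
    """Apply pre-aggregated values without skewing the mean."""
    if sum_val is None and min_val is None and max_val is None:
        return

    # Only grow the denominator when there is a numerator to go with it.
    if sum_val is not None:
        stats.count += ev_count
        stats.total += sum_val
        if stats.count > 0:
            stats.mean = stats.total / stats.count
            stats.m2 = 0.0

    # min/max still update independently when present.
    if min_val is not None:
        stats.min_value = min(stats.min_value, min_val)
    if max_val is not None:
        stats.max_value = max(stats.max_value, max_val)
```

With this guard, an event carrying only `*_min`/`*_max` widens the range but leaves `count`, `total`, and `mean` untouched.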

rayandrew added 2 commits May 19, 2026 20:11

feat(aggregator): add system metrics scan and serialization with per-event-name keying

3c9c481

feat(aggregator): add offset metric tracking and time bucket persistence

a4eaed4

Copilot AI review requested due to automatic review settings May 20, 2026 13:29

Copilot started reviewing on behalf of rayandrew May 20, 2026 13:30

Copilot AI reviewed May 20, 2026

feat(aggregator): improve system metrics scanning and persistence error handling

359e836

rayandrew requested a review from hariharan-devarajan May 20, 2026 22:27

hariharan-devarajan approved these changes May 20, 2026

hariharan-devarajan merged commit 9d6bf82 into llnl:develop May 20, 2026
26 of 29 checks passed

rayandrew deleted the feat/dfanalyzer-integration branch May 21, 2026 00:01
