feat(dlio): DLIO config generator#66

Merged: hariharan-devarajan merged 2 commits into llnl:develop from rayandrew:feat/dlio-config-generator on May 19, 2026

@rayandrew commented May 19, 2026

Closes #58.

Summary

Native C++ binary that consumes raw DFTracer logs end-to-end and emits a DLIO YAML config with train.computation_time and reader.preprocess_time blocks.

Users run:

dftracer_gen_dlio_config -d ./traces -o dlio_config.yaml

The binary indexes the inputs, aggregates them into the internal AGGREGATION column family (with DDSketch forced on), fits per-component timing distributions, refines max_bound against an in-process barrier simulator, and emits the YAML. RocksDB is never visible to the user.

What's new

Binary

  • src/dftracer/utils/binaries/dftracer_gen_dlio_config.cpp: new binary, uses cli::ArgParse with DirectoryArgs / PipelineArgs / IndexingArgs (same surface as dftracer_aggregator).

Aggregation pipeline refactor

  • aggregation_runner.{h,cpp}: extracted the ~400 LoC pipeline that lived inside dftracer_aggregator.cpp into a reusable library function run_aggregation(AggregationRunInput). Both dftracer_aggregator and dftracer_gen_dlio_config now call it. The aggregator binary shrank to CLI parsing + delegation. Behavior preserved: all 12 aggregator-related tests still pass.

DLIO module (utilities/dlio/)

  • trace_loader.h/.cpp: reads the AGGREGATION CF read-only (re-attaches the merge operator at open; latent bug fixed in passing), groups entries by (cat, name, pid, time_bucket), synthesizes per-call samples (inverse-CDF from DDSketch when available, mean replication as fallback).
  • barrier_simulator.h/.cpp: discrete-event simulator across ranks/steps with a worker-queue prefetch model. Also provides the free helper functions sweep_union, cdf_similarity, and variance.
  • worker_queue.h: DataLoader prefetch model.
  • statistic.h: DLIO-side alias around common::statistics::Statistic plus ComponentTimeMetrics and Boundary.
  • optimizer.h/.cpp: sequential momentum-based loop tuning max_bound percentile.
  • yaml_emit.h/.cpp: DLIO YAML emit for single distributions and Gaussian mixtures via yaml-cpp.
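
To make the barrier idea concrete, here is a minimal sketch of the synchronous case only. It is not the PR's BarrierSimulator: the function name simulate_sync_barrier and the times[rank][step] layout are assumptions made for illustration, and the worker-queue prefetch model is left out entirely. It shows the one property a synchronous barrier imposes: each step lasts as long as the slowest rank.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical helper, not the API from barrier_simulator.h.
// times[rank][step] holds the duration (seconds) one rank spends in one step.
// With a barrier after every step, a step takes as long as its slowest rank,
// so the total runtime is the sum of the per-step maxima.
double simulate_sync_barrier(const std::vector<std::vector<double>>& times) {
    if (times.empty()) return 0.0;

    // Only simulate the steps that every rank has a sample for.
    std::size_t steps = times.front().size();
    for (const auto& rank : times) steps = std::min(steps, rank.size());

    double total = 0.0;
    for (std::size_t s = 0; s < steps; ++s) {
        double slowest = 0.0;
        for (const auto& rank : times) slowest = std::max(slowest, rank[s]);
        total += slowest;  // all ranks wait here for the slowest one
    }
    return total;
}
```

For two ranks with step times {1, 2} and {3, 1}, the per-step maxima are 3 and 2, so the sketch returns 5. A single slow rank in any step therefore stretches the whole job, which is why the tail of the fitted distribution (and hence max_bound) matters for the simulated runtime.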

General statistics utilities (common/statistics/)

These are generic (no DLIO coupling), so they live under common:

  • statistic.h: min/max/mean/count accumulator with optional DDSketch backing for quantile queries.
  • distributions.{h,cpp}: MLE fitting for Normal, Lognormal, Gamma, Exponential, Weibull. KS goodness-of-fit + BIC. make_sampler() factory backed by <random>. Boost.Math used for CDF/PDF/quantile (standalone mode).
  • mixture.{h,cpp}: univariate Gaussian Mixture EM (K=2 and K=3) with log-sum-exp responsibilities, quantile-spread init, variance floor. BIC-based select_best_model() across single + mixture candidates returning a std::variant<FittedDistribution, FittedMixture>.
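
As an illustration of the fit-then-rank approach described in the list above, the following sketch fits a single Lognormal by maximum likelihood and computes its BIC. It is standalone and does not use the PR's distributions.h interface; LognormalFit and fit_lognormal are hypothetical names, and only the textbook closed-form estimators are used.

```cpp
#include <cmath>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical result type, not the PR's FittedDistribution.
struct LognormalFit {
    double mu;              // mean of log(x)
    double sigma;           // MLE standard deviation of log(x), 1/n normalised
    double log_likelihood;  // log-likelihood at the MLE
    double bic;             // k*ln(n) - 2*logL with k = 2 parameters
};

// Closed-form Lognormal MLE. Requires at least two strictly positive samples
// that are not all identical (otherwise sigma would be zero).
LognormalFit fit_lognormal(const std::vector<double>& xs) {
    if (xs.size() < 2) throw std::invalid_argument("need at least 2 samples");
    const double n = static_cast<double>(xs.size());

    double sum_log = 0.0;
    for (double x : xs) {
        if (x <= 0.0) throw std::invalid_argument("samples must be positive");
        sum_log += std::log(x);
    }
    const double mu = sum_log / n;

    double ss = 0.0;  // sum of squared deviations in log space
    for (double x : xs) {
        const double d = std::log(x) - mu;
        ss += d * d;
    }
    if (ss <= 0.0) throw std::invalid_argument("samples must not be identical");
    const double sigma = std::sqrt(ss / n);

    // logL = -sum(ln x) - n ln(sigma) - (n/2) ln(2 pi) - ss / (2 sigma^2),
    // and ss / (2 sigma^2) equals n/2 at the MLE.
    const double two_pi = 6.283185307179586;
    const double ll =
        -sum_log - n * std::log(sigma) - 0.5 * n * std::log(two_pi) - 0.5 * n;
    const double bic = 2.0 * std::log(n) - 2.0 * ll;
    return {mu, sigma, ll, bic};
}
```

Ranking candidates then amounts to computing the same BIC for every family (or mixture) and keeping the smallest value. The extra parameters of a K=2 or K=3 mixture raise the k*ln(n) penalty, so a mixture only wins when it improves the likelihood enough to pay for them.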

CMake / dependencies

  • cmake/modules/Dependencies.cmake:
    • need_boost_math(): CPM fetch of boostorg/math@boost-1.91.0, header-only standalone.
    • link_boost_math(target) helper: applies headers + BOOST_MATH_STANDALONE define as PRIVATE so the install/export set stays clean.
    • need_yaml_cpp() / link_yaml_cpp(target): yaml-cpp 0.9.0, linked PRIVATE.
  • src/CMakeLists.txt: registers the new sources, links Boost.Math + yaml-cpp to the utilities library, adds the binary.

Comparison harness

  • scripts/compare_dlio_yamls.py: side-by-side YAML diff with two checks: parameter-tolerance (--rtol) and two-sample KS on samples drawn from each fitted distribution (--ks-threshold). PEP-723 shebang installs deps via uv run. Exit code reflects the KS check (the authoritative one); parameter divergence is informational since two different fits can produce statistically interchangeable distributions.
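
For readers unfamiliar with the two-sample KS check, here is a minimal sketch of the statistic itself. The harness is a Python script and its implementation is not shown in this description; the version below is written in C++ to match the rest of the PR, and ks_two_sample is a hypothetical name. It computes D, the largest vertical gap between the two empirical CDFs.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical helper: two-sample Kolmogorov-Smirnov statistic
// D = max over x of |F_a(x) - F_b(x)|, where F_a and F_b are empirical CDFs.
// The vectors are taken by value because they are sorted in place.
double ks_two_sample(std::vector<double> a, std::vector<double> b) {
    if (a.empty() || b.empty()) {
        throw std::invalid_argument("both samples must be non-empty");
    }
    std::sort(a.begin(), a.end());
    std::sort(b.begin(), b.end());

    const double na = static_cast<double>(a.size());
    const double nb = static_cast<double>(b.size());
    std::size_t i = 0, j = 0;
    double d = 0.0;

    // Walk both sorted samples together. At each distinct value x, advance
    // past every element <= x in both samples (this handles ties), then
    // compare the two empirical CDF values at x.
    while (i < a.size() && j < b.size()) {
        const double x = std::min(a[i], b[j]);
        while (i < a.size() && a[i] <= x) ++i;
        while (j < b.size() && b[j] <= x) ++j;
        d = std::max(d, std::fabs(i / na - j / nb));
    }
    return d;
}
```

Identical samples give D = 0 and fully separated samples give D = 1. In the validation table further down, the similarity column is consistent with (1 - D) expressed as a percentage, although that relationship is inferred from the reported numbers rather than stated by the author.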

Tests

  • tests/utilities/dlio/test_barrier_simulator.cpp: 18 cases: sweep_union, cdf_similarity, variance, WorkerQueue, BarrierSimulator async / sync / throughput.
  • tests/utilities/common/statistics/test_distributions.cpp: 14 cases: MLE recovers parameters on all 5 families within 5%, ranking invariants, sampler clamping, error paths.
  • tests/utilities/common/statistics/test_mixture.cpp: 11 cases: GMM EM recovers K=2/K=3 means+stddev+weights, pdf/cdf/sampler dispatch, select_best_model picks the right family on unimodal vs. bimodal data, variant dispatch.
  • tests/binaries/test_dftracer_gen_dlio_config.cpp: 6 integration cases: binary lookup, --help, missing --output rejection, non-DLIO directory graceful failure, happy path (YAML schema + content), worker/prefetch flag acceptance.
  • All 12 pre-existing aggregator tests still pass after the run_aggregation extraction.

Validation against Python pipeline

On a synthetic raw trace of 4 ranks × 400 steps × lognormal fetch.block + lognormal preprocess:

| Block | Both selected | mean Δ | sigma Δ | max_bound Δ | 2-sample KS | similarity |
| --- | --- | --- | --- | --- | --- | --- |
| computation_time | Lognormal | 0.02 % | 3.79 % | 2.90 % | 0.060 | 94.0 % |
| preprocess_time | Lognormal | 0.04 % | 7.86 % | 6.01 % | 0.082 | 91.8 % |

Both families match; mean is recovered to 4 decimal places; sigma is within 8 %; remaining gap is from MLE/EM/PRNG numerics. Distributions are statistically interchangeable for DLIO sampling.

Docs

  • docs/source/cli.rst: new dftracer_gen_dlio_config section with full option table, examples, output schema, and comparison-script pointer.
  • docs/source/utilities/dlio.rst (new): module-level guide covering each piece of the pipeline and the comparison harness.
  • docs/source/utilities/common.rst: Statistic, Distributions, Mixture subsections with usage examples.
  • docs/source/utilities.rst: toctree entry, statistics overview expanded, DLIO category added to the architecture diagram.

Copilot AI review requested due to automatic review settings May 19, 2026 03:26
@rayandrew changed the title from "feat(dlio): DLIO config generator (dftracer_gen_dlio_config)" to "feat(dlio): DLIO config generator" on May 19, 2026

Copilot AI left a comment

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@rayandrew force-pushed the feat/dlio-config-generator branch from 7e894b1 to eafcd87 on May 19, 2026 03:58
@hariharan-devarajan merged commit 6c9aaa7 into llnl:develop on May 19, 2026
28 of 29 checks passed
@rayandrew deleted the feat/dlio-config-generator branch on May 19, 2026 19:55
Development

Successfully merging this pull request may close these issues.

Add dftracer AI annotation conversion to DLIO YAML configuration

3 participants