feat(dlio): DLIO config generator #66
Merged
hariharan-devarajan merged 2 commits on May 19, 2026
hariharan-devarajan
approved these changes
May 19, 2026
Closes #58.
Summary
Native C++ binary that consumes raw DFTracer logs end-to-end and emits a DLIO train.computation_time / reader.preprocess_time YAML.
Users run the binary directly: it indexes the inputs, aggregates them into the internal AGGREGATION column family (with DDSketch forced on), fits per-component timing distributions, refines max_bound against an in-process barrier simulator, and emits the YAML. RocksDB is never visible to the user.
What's new
Binary
- src/dftracer/utils/binaries/dftracer_gen_dlio_config.cpp: new binary; uses cli::ArgParse with DirectoryArgs / PipelineArgs / IndexingArgs (same surface as dftracer_aggregator).
Aggregation pipeline refactor
- aggregation_runner.{h,cpp}: extracted the ~400 LoC pipeline that lived inside dftracer_aggregator.cpp into a reusable library function, run_aggregation(AggregationRunInput). Both dftracer_aggregator and dftracer_gen_dlio_config now call it. The aggregator binary shrank to CLI parsing + delegation. Behavior preserved: all 12 aggregator-related tests still pass.
DLIO module (utilities/dlio/)
- trace_loader.h/.cpp: reads the AGGREGATION CF read-only (re-attaches the merge operator at open; latent bug fixed in passing), groups entries by (cat, name, pid, time_bucket), and synthesizes per-call samples (inverse-CDF from DDSketch when available, mean replication as fallback).
- barrier_simulator.h/.cpp: discrete-event simulator across ranks/steps with a worker-queue prefetch model. Includes the sweep_union / cdf_similarity / variance free helpers.
- worker_queue.h: DataLoader prefetch model.
- statistic.h: DLIO-side alias around common::statistics::Statistic, plus ComponentTimeMetrics and Boundary.
- optimizer.h/.cpp: sequential momentum-based loop tuning the max_bound percentile.
- yaml_emit.h/.cpp: DLIO YAML emit for single distributions and Gaussian mixtures via yaml-cpp.
General statistics utilities (common/statistics/)
These are generic, with no DLIO coupling, so they live under common:
- statistic.h: min/max/mean/count accumulator with optional DDSketch backing for quantile queries.
- distributions.{h,cpp}: MLE fitting for Normal, Lognormal, Gamma, Exponential, Weibull. KS goodness-of-fit + BIC. make_sampler() factory backed by <random>. Boost.Math used for CDF/PDF/quantile (standalone mode).
- mixture.{h,cpp}: univariate Gaussian Mixture EM (K=2 and K=3) with log-sum-exp responsibilities, quantile-spread init, variance floor. BIC-based select_best_model() across single + mixture candidates, returning a std::variant<FittedDistribution, FittedMixture>.
CMake / dependencies
- cmake/modules/Dependencies.cmake:
  - need_boost_math(): CPM fetch of boostorg/math@boost-1.91.0, header-only standalone.
  - link_boost_math(target) helper: applies headers + BOOST_MATH_STANDALONE define as PRIVATE so the install/export set stays clean.
  - need_yaml_cpp() / link_yaml_cpp(target): yaml-cpp 0.9.0, linked PRIVATE.
- src/CMakeLists.txt: registers the new sources, links Boost.Math + yaml-cpp to the utilities library, adds the binary.
Comparison harness
- scripts/compare_dlio_yamls.py: side-by-side YAML diff with two checks: parameter tolerance (--rtol) and a two-sample KS test on samples drawn from each fitted distribution (--ks-threshold). PEP-723 shebang installs deps via uv run. The exit code reflects the KS check (the authoritative one); parameter divergence is informational, since two different fits can produce statistically interchangeable distributions.
Tests
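
As a rough illustration of the parameter-recovery checks in test_distributions.cpp ("MLE recovers parameters ... within 5%"), here is a minimal sketch for the Lognormal family only. It assumes the closed-form MLE (mean and population standard deviation of log x). The names LognormalFit, fit_lognormal_mle, and draw_lognormal are hypothetical and are not the API in distributions.{h,cpp}.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical result type for this sketch (not the PR's FittedDistribution).
struct LognormalFit {
    double mu;     // mean of log(x)
    double sigma;  // standard deviation of log(x)
};

// Closed-form lognormal MLE: fit a normal distribution to log(x).
LognormalFit fit_lognormal_mle(const std::vector<double>& xs) {
    assert(!xs.empty());
    double sum = 0.0;
    for (double x : xs) {
        assert(x > 0.0);  // lognormal support is strictly positive
        sum += std::log(x);
    }
    const double mu = sum / static_cast<double>(xs.size());
    double ss = 0.0;
    for (double x : xs) {
        const double d = std::log(x) - mu;
        ss += d * d;
    }
    // The MLE divides by n (not n - 1).
    return {mu, std::sqrt(ss / static_cast<double>(xs.size()))};
}

// Draw n lognormal samples with a fixed seed so the check is repeatable.
std::vector<double> draw_lognormal(double mu, double sigma, std::size_t n,
                                   unsigned seed) {
    std::mt19937_64 rng(seed);
    std::lognormal_distribution<double> dist(mu, sigma);
    std::vector<double> xs(n);
    for (double& x : xs) x = dist(rng);
    return xs;
}
```

A recovery check then draws a large sample with known (mu, sigma), fits it, and asserts that each recovered parameter is within a 5% relative tolerance of the true value.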
- tests/utilities/dlio/test_barrier_simulator.cpp: 18 cases: sweep_union, cdf_similarity, variance, WorkerQueue, BarrierSimulator async / sync / throughput.
- tests/utilities/common/statistics/test_distributions.cpp: 14 cases: MLE recovers parameters on all 5 families within 5%, ranking invariants, sampler clamping, error paths.
- tests/utilities/common/statistics/test_mixture.cpp: 11 cases: GMM EM recovers K=2/K=3 means + stddev + weights, pdf/cdf/sampler dispatch, select_best_model picks the right family on unimodal vs. bimodal data, variant dispatch.
- tests/binaries/test_dftracer_gen_dlio_config.cpp: 6 integration cases: binary lookup, --help, missing --output rejection, non-DLIO directory graceful failure, happy path (YAML schema + content), worker/prefetch flag acceptance.
- run_aggregation extraction.
Validation against Python pipeline
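
The KS check in the comparison harness is one way to judge whether two fitted distributions are interchangeable. A minimal C++ sketch of the two-sample KS statistic follows. ks_two_sample is a hypothetical helper written for illustration; how scripts/compare_dlio_yamls.py turns the statistic into a pass/fail decision via --ks-threshold is not shown here.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Two-sample Kolmogorov-Smirnov statistic: the largest vertical gap between
// the empirical CDFs of a and b. Inputs are taken by value so they can be
// sorted without touching the caller's data.
double ks_two_sample(std::vector<double> a, std::vector<double> b) {
    assert(!a.empty() && !b.empty());
    std::sort(a.begin(), a.end());
    std::sort(b.begin(), b.end());

    std::size_t i = 0, j = 0;
    double d = 0.0;
    while (i < a.size() && j < b.size()) {
        // Advance both samples past the next smallest value (handles ties).
        const double x = std::min(a[i], b[j]);
        while (i < a.size() && a[i] <= x) ++i;
        while (j < b.size() && b[j] <= x) ++j;
        const double fa = static_cast<double>(i) / static_cast<double>(a.size());
        const double fb = static_cast<double>(j) / static_cast<double>(b.size());
        d = std::max(d, std::fabs(fa - fb));
    }
    // Once one sample is exhausted its ECDF is 1 and the gap can only shrink,
    // so the maximum has already been recorded.
    return d;
}
```

The statistic is 0 for identical samples and 1 for samples with disjoint ranges; values in between measure how far the two empirical CDFs drift apart.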
On a synthetic raw trace of 4 ranks × 400 steps × lognormal fetch.block + lognormal preprocess:
[Comparison table for computation_time and preprocess_time; values not preserved]
Both families match; the mean is recovered to 4 decimal places; sigma is within 8%; the remaining gap is from MLE/EM/PRNG numerics. The distributions are statistically interchangeable for DLIO sampling.
Docs
- docs/source/cli.rst: new dftracer_gen_dlio_config section with full option table, examples, output schema, and comparison-script pointer.
- docs/source/utilities/dlio.rst (new): module-level guide covering each piece of the pipeline and the comparison harness.
- docs/source/utilities/common.rst: Statistic, Distributions, Mixture subsections with usage examples.
- docs/source/utilities.rst: toctree entry, statistics overview expanded, DLIO category added to the architecture diagram.
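
For readers of the Mixture documentation, the "log-sum-exp responsibilities" mentioned for mixture.{h,cpp} can be sketched as follows. This is an illustrative E-step for a single sample under assumed names (Component, log_normal_pdf, responsibilities); it is not the code in this PR.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical mixture component for this sketch.
struct Component {
    double weight;  // mixing weight, > 0
    double mean;
    double stddev;  // > 0 (an EM loop would enforce a variance floor)
};

// Log-density of Normal(mean, stddev) at x.
double log_normal_pdf(double x, double mean, double stddev) {
    const double kHalfLog2Pi = 0.9189385332046727;  // 0.5 * log(2 * pi)
    const double z = (x - mean) / stddev;
    return -0.5 * z * z - std::log(stddev) - kHalfLog2Pi;
}

// E-step for one sample: posterior probability of each component given x.
// Working in log space and subtracting the maximum before exponentiating
// avoids the 0/0 that naive densities produce far out in the tails.
std::vector<double> responsibilities(double x,
                                     const std::vector<Component>& comps) {
    assert(!comps.empty());
    std::vector<double> logp(comps.size());
    for (std::size_t k = 0; k < comps.size(); ++k) {
        logp[k] = std::log(comps[k].weight) +
                  log_normal_pdf(x, comps[k].mean, comps[k].stddev);
    }
    const double m = *std::max_element(logp.begin(), logp.end());
    double sum = 0.0;
    for (double lp : logp) sum += std::exp(lp - m);
    const double log_total = m + std::log(sum);  // log-sum-exp

    std::vector<double> r(comps.size());
    for (std::size_t k = 0; k < comps.size(); ++k) {
        r[k] = std::exp(logp[k] - log_total);
    }
    return r;
}
```

With components at means 0 and 10 (unit stddev), a sample at x = 100 has densities that underflow to zero in double precision, yet the log-space version still assigns essentially all responsibility to the nearer component.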