Designing configuration infrastructure for mlcast python package #5

@leifdenby

@franchg and I have just been discussing how to design the configuration in mlcast. I used ChatGPT to summarise our conversation and produce a summary of the two approaches, with tables of pros and cons, below:

Summary of the Two Configuration Approaches

We are considering two alternative approaches for handling configuration in the project:

  1. Hydra + Pydantic
    Uses Hydra to manage hierarchical YAML configurations and dynamically instantiate Python objects via _target_. Pydantic is used for schema validation. This approach treats the YAML config as both configuration and a class instantiation template, enabling highly flexible and composable setups.

  2. Dataclasses + dataclasses-wizard
    Treats configuration as Python code expressed through dataclasses, with dataclasses-wizard providing YAML serialization/deserialization. YAML is a passive storage format rather than an active driver of object creation. This approach emphasizes type safety, static tooling support, and conceptual simplicity by avoiding dynamic class instantiation in configuration files.

Hydra + Pydantic

| Category | Pros | Cons |
| --- | --- | --- |
| Flexibility | Highly flexible configuration system; _target_ allows substituting whole classes without code changes | Flexibility comes at cost of complexity; config effectively becomes executable code |
| Modularity | Strong support for hierarchical config composition and experiment overrides | Hard to trace which classes are instantiated until runtime |
| Ecosystem | Mature tooling and widely used in ML training frameworks | Requires learning Hydra concepts (composition, instantiation, overrides) in addition to PyTorch/Lightning |
| User Experience | Powerful for advanced users who need dynamic architecture changes | Steep learning curve; difficult for newcomers and harder to teach (as observed in AIWCAS school) |
| Type Safety | Pydantic provides runtime validation | In practice, config typing often degrades to Any, reducing discoverability and static guarantees |
| Tooling Support | Hydra CLI utilities, sweeps, config stores | IDEs can't reliably infer instantiated types → limited autocomplete and code navigation |
| Separation of Concerns | YAML can define whole object graphs | Mixing Python class references into YAML blurs boundary between config and code |

Examples:
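To make the `_target_` idea concrete, here is a minimal sketch of the mechanism using only the standard library. It is a simplified stand-in, not Hydra's implementation (Hydra ships its own `hydra.utils.instantiate` for this). The `instantiate` function and the config contents below are assumptions made up for illustration; the dict plays the role of a parsed YAML file.

```python
import importlib
from typing import Any


def instantiate(node: Any) -> Any:
    """Recursively build objects from a config tree that uses `_target_` keys.

    Illustrative stand-in for the idea only, not Hydra's actual code.
    """
    if isinstance(node, list):
        return [instantiate(item) for item in node]
    if not isinstance(node, dict):
        return node  # plain value: pass through unchanged
    # Resolve nested nodes first so inner objects exist before the outer one.
    kwargs = {k: instantiate(v) for k, v in node.items() if k != "_target_"}
    if "_target_" not in node:
        return kwargs  # ordinary mapping without a class reference
    # Split "package.module.ClassName" into the module path and attribute name.
    module_path, _, attr_name = node["_target_"].rpartition(".")
    target = getattr(importlib.import_module(module_path), attr_name)
    return target(**kwargs)


# Hypothetical config, as it might look after loading a YAML file.
config = {
    "window": {"_target_": "datetime.timedelta", "hours": 6},
    "precision": {
        "_target_": "fractions.Fraction",
        "numerator": 1,
        "denominator": 4,
    },
}

objects = instantiate(config)
print(type(objects["window"]).__name__, objects["window"])
print(type(objects["precision"]).__name__, objects["precision"])
```

Note that the classes being built appear only as strings in the config, so a type checker or IDE cannot tell what `objects["window"]` is until the code runs. That is the traceability and tooling cost listed in the table above.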

Dataclasses + dataclasses-wizard

| Category | Pros | Cons |
| --- | --- | --- |
| Mental Model | Configuration is plain Python code; easier to reason about | Lacks Hydra’s built-in config composition and sweeping mechanisms out of the box |
| Type Safety | Strong typing and inheritance via dataclasses makes config structure explicit | Requires disciplined design if config grows large or complex |
| Tooling Support | Excellent IDE support: autocomplete, type hints, static analysis, navigation | Fewer higher-level utilities; some features must be implemented manually |
| User Experience | Lower cognitive load; no hidden instantiation magic | Less dynamic than _target_-based systems if users expect full pluggability |
| Serialization | dataclasses-wizard handles YAML round-tripping seamlessly | Not as feature-rich for configuration lifecycle management as Hydra |
| Debuggability | No dynamic class instantiation hidden in config files; code paths are explicit | Fewer "batteries included" for large-scale experiment management |
| Separation of Concerns | Config remains config; no Python identifiers or imports leaked into YAML | If the project requires dynamic class selection at runtime, patterns must be implemented intentionally |

Examples:
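A rough sketch of what the dataclass approach could look like, using only the standard library. Everything named here (`TrainingConfig`, `OptimizerConfig`, `UNet`, `ConvLSTM`, `MODEL_REGISTRY`, `build_model`) is hypothetical and not taken from mlcast. Plain dicts stand in for the parsed YAML that dataclasses-wizard would handle in practice. The registry at the end is one possible way to do the "intentional" runtime class selection mentioned in the table.

```python
from dataclasses import asdict, dataclass, field


@dataclass
class OptimizerConfig:
    learning_rate: float = 1e-3
    weight_decay: float = 0.0


@dataclass
class TrainingConfig:
    model_name: str = "unet"  # key into MODEL_REGISTRY below
    batch_size: int = 16
    optimizer: OptimizerConfig = field(default_factory=OptimizerConfig)

    @classmethod
    def from_dict(cls, data: dict) -> "TrainingConfig":
        """Build a config from a nested dict (e.g. the result of parsing YAML)."""
        data = dict(data)  # shallow copy so the caller's dict is not mutated
        optimizer = OptimizerConfig(**data.pop("optimizer", {}))
        return cls(optimizer=optimizer, **data)


# Placeholder model classes, purely for illustration.
class UNet:
    pass


class ConvLSTM:
    pass


# Explicit registry: the set of selectable classes is visible in Python code,
# and the config file only ever contains a short name.
MODEL_REGISTRY = {"unet": UNet, "convlstm": ConvLSTM}


def build_model(config: TrainingConfig):
    try:
        model_cls = MODEL_REGISTRY[config.model_name]
    except KeyError:
        known = ", ".join(sorted(MODEL_REGISTRY))
        raise ValueError(
            f"Unknown model {config.model_name!r}; expected one of: {known}"
        ) from None
    return model_cls()


# Stand-in for a loaded YAML file; unspecified fields fall back to defaults.
raw = {"model_name": "convlstm", "optimizer": {"learning_rate": 0.01}}
config = TrainingConfig.from_dict(raw)
model = build_model(config)

# Round trip: dataclass -> nested dict -> dataclass gives an equal object.
restored = TrainingConfig.from_dict(asdict(config))
print(config)
print(type(model).__name__, restored == config)
```

A misspelled key in the input (say `batchsize`) fails at load time with a `TypeError` from the dataclass constructor, and `config.optimizer.learning_rate` autocompletes in an IDE because the structure is declared in code.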

If anyone has further thoughts on this please join in the discussion here by posting comments ☺️
