
Add pluggable artifact serializer framework #3117

Merged
romain-intel merged 13 commits into Netflix:master from saeidbarati157:feat/pluggable-serializers on May 5, 2026


Conversation

saeidbarati157 (Contributor) commented Apr 15, 2026

Summary

Replace the hardcoded pickle logic in TaskDataStore with a pluggable artifact serializer framework. Existing pickle behavior is preserved exactly — PickleSerializer is a built-in universal fallback so flows that never register a custom serializer behave identically to before.

  • ArtifactSerializer ABC with priority-based dispatch, SerializationMetadata namedtuple, SerializedBlob value type.
  • SerializerStore metaclass auto-registers subclasses; caches the sorted dispatch list and invalidates on new registration.
  • Extensions register serializers via ARTIFACT_SERIALIZERS_DESC (standard Metaflow plugin pattern).
  • A lazy import-hook registry lets extensions defer serializer-module imports (e.g. torch, pyarrow) until the user's own code imports the target type.

Design

  • Dispatch. Serializers are sorted by PRIORITY (lower first; ties broken by registration order). On save, the first can_serialize(obj) == True wins. On load, can_deserialize(metadata) routes by the encoding field in artifact metadata.
  • Wire vs storage format. serialize(obj, format=...) / deserialize(data, format=..., **kw) accept a format kwarg — STORAGE (default) returns (List[SerializedBlob], SerializationMetadata) for the datastore save path; WIRE returns a str for CLI args, protobuf payloads, and cross-process IPC. One class owns both representations (see the sketch after this list). PickleSerializer implements STORAGE only — pickle bytes are not a wire-safe payload — and raises NotImplementedError for WIRE.
  • Side-effect-free contract. serialize() must not perform I/O or mutate global state: it may be called multiple times (caching, retries, parallel dispatch) and must stay safely idempotent. Side effects belong in hooks, not serializers.
  • Lazy registration. metaflow.datastore.artifacts.lazy_registry.register_serializer_for_type("torch.Tensor", "my_ext.TorchSerializer") stores a declarative config; an importlib.abc.MetaPathFinder watches for torch to be imported and only then loads the serializer class. If the target module is already in sys.modules, registration is immediate.
  • Backward compatible. Old artifacts with gzip+pickle-v2 / gzip+pickle-v4 encoding load correctly. Missing encoding defaults to gzip+pickle-v2.
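
To make the dispatch and format contract above concrete, here is a minimal sketch of a custom serializer. Illustrative only: the imported names (ArtifactSerializer, SerializationFormat, SerializedBlob, SerializationMetadata) follow the interfaces described in this PR, while JsonDictSerializer itself and the exact blob shape handed to deserialize are assumptions; defer to the merged code for authoritative signatures.

# Hypothetical custom serializer; a sketch against the interfaces described
# in this PR, not code taken from it.
import json

from metaflow.datastore.artifacts.serializer import (
    ArtifactSerializer,
    SerializationFormat,
    SerializationMetadata,
    SerializedBlob,
)


class JsonDictSerializer(ArtifactSerializer):
    TYPE = "json-dict"
    PRIORITY = 100  # tried before PickleSerializer's fallback priority of 9999

    @classmethod
    def can_serialize(cls, obj):
        # Claim only plain dicts with string keys; everything else falls
        # through to lower-priority serializers (ultimately pickle).
        return isinstance(obj, dict) and all(isinstance(k, str) for k in obj)

    @classmethod
    def can_deserialize(cls, metadata):
        # Load-side routing keys off the encoding recorded at save time.
        return getattr(metadata, "encoding", None) == "json-v1"

    @classmethod
    def serialize(cls, obj, format=SerializationFormat.STORAGE):
        payload = json.dumps(obj, sort_keys=True)
        if format == SerializationFormat.WIRE:
            # WIRE: a plain string usable as a CLI arg or IPC payload.
            return payload
        blob = payload.encode("utf-8")
        return (
            [SerializedBlob(blob, is_reference=False)],
            SerializationMetadata(
                obj_type=str(type(obj)),
                size=len(blob),
                encoding="json-v1",
                serializer_info={},
            ),
        )

    @classmethod
    def deserialize(cls, data, format=SerializationFormat.STORAGE, **kwargs):
        if format == SerializationFormat.WIRE:
            return json.loads(data)
        # STORAGE: this sketch assumes data is the single UTF-8 JSON blob
        # written by serialize(); the merged contract defines the exact shape.
        return json.loads(data.decode("utf-8"))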

Error paths

  • save_artifacts: if no serializer claims an object, raise DataException with a message pointing at the PickleSerializer fallback invariant (instead of the misleading UnpicklableArtifactException for a non-pickle failure).
  • load_artifacts: if no deserializer claims an artifact, the error includes serializer_info["source"] when present, hinting at a missing extension.

Test plan

  • 29 unit tests for ArtifactSerializer / SerializerStore / SerializedBlob / SerializationMetadata, including wire/storage dispatch and the ABC's abstract-method enforcement.
  • 36 unit tests for PickleSerializer (round-trips across scalars, containers, custom classes; WIRE rejection; encoding validation).
  • 9 unit tests for the lazy registry (config validation, eager registration, deferred registration via the import hook, recursion guard).
  • 6 integration tests for TaskDataStore round-trips, custom serializer priority, and backward compat.
  • 245 existing unit tests pass (only pre-existing spin failures remain, unrelated).

greptile-apps Bot commented Apr 15, 2026

Greptile Summary

Replaces the hardcoded pickle logic in TaskDataStore with a pluggable serializer framework. PickleSerializer is preserved as the universal fallback so existing flows are unaffected. The _serializers property now re-queries the live registry on every access, meaning serializers activated by the lazy import hook after TaskDataStore construction are automatically visible.

  • New framework (ArtifactSerializer, SerializerStore, SerializationMetadata, SerializedBlob): priority-based dispatch, ABCMeta enforcement, state-machine lifecycle, setup_imports hook with sys.meta_path retry.
  • PickleSerializer moved to metaflow/plugins/datastores/serializers/ and wired in via ARTIFACT_SERIALIZERS_DESC; legacy gzip+pickle-v{2,4} encodings are accepted for backward compat on load.
  • task_datastore.py: save_artifacts dispatches via the registry, validates blob shape before mutating _info, auto-injects source into serializer_info; load_artifacts routes by encoding with graceful fallback to gzip+pickle-v2 for ancient artifacts.

Confidence Score: 5/5

The change is safe to merge. Backward compatibility is preserved: PickleSerializer remains the universal fallback, legacy encodings are accepted on load, and the old gzip+pickle artifact format is handled transparently.

All previously flagged critical paths are properly addressed. The two remaining comments are narrow robustness gaps in edge-case retry scenarios that do not affect the primary dispatch path or any existing artifact.

The lazy_import retry idempotency logic and the _call_setup_imports helper in serializer.py are the most likely places to surface unexpected behavior for extension authors writing multi-dependency serializers.

Important Files Changed

Filename Overview
metaflow/datastore/artifacts/serializer.py Core serializer framework: SerializerStore metaclass (now correctly using ABCMeta), dispatch logic, lazy-import helpers, and retry state machine. Two minor robustness gaps: lazy_import idempotency check uses module.__name__ which can differ from the import path, and _call_setup_imports gives a confusing error when setup_imports lacks @classmethod.
metaflow/datastore/task_datastore.py Save/load artifact paths refactored to dispatch through the pluggable serializer registry. _serializers is now a property that re-queries the live registry on each access, correctly handling lazy-registered serializers for long-lived datastore instances. Empty/multi-blob validation and source auto-injection are well-guarded.
metaflow/plugins/datastores/serializers/pickle_serializer.py Straightforward PickleSerializer using protocol 4. Legacy gzip+pickle-v{2,4} encodings are accepted for backward compat in can_deserialize. WIRE format correctly raises NotImplementedError.
metaflow/datastore/artifacts/lazy_registry.py sys.meta_path interceptor that fires SerializerStore._on_module_imported when a watched module finishes loading. _WrappedLoader correctly restores spec.loader on both success and exception paths.
metaflow/plugins/__init__.py Adds ARTIFACT_SERIALIZERS_DESC declaring PickleSerializer as the core fallback and triggers SerializerStore.bootstrap() so the dispatch pool is populated before any TaskDataStore is constructed.
metaflow/datastore/artifacts/diagnostic.py New SerializerRecord dataclass and SerializerState enum for per-serializer lifecycle tracking. Clean, standalone implementation.
metaflow/datastore/exceptions.py UnpicklableArtifactException now accepts optional artifact_name so PickleSerializer can raise it without knowing the artifact name, re-raised with the name attached by the datastore.
test/unit/test_serializer_integration.py Integration tests covering round-trips, backward compat, priority dispatch, blob-count validation, exception propagation, and post-init dynamic registration.
test/unit/test_artifact_serializer.py Unit tests for ABC enforcement, lazy_import mechanics, priority ordering, and the double-assignment guard. Tests remain correct after the idempotency refinement.

Reviews (15): Last reviewed commit: "Address saikonen review: pickle owns Unp..."

Comment thread metaflow/datastore/artifacts/serializer.py Outdated
Comment thread metaflow/datastore/task_datastore.py Outdated
Comment thread metaflow/datastore/artifacts/serializer.py Outdated
Comment thread metaflow/datastore/task_datastore.py Outdated
saeidbarati157 pushed a commit to saeidbarati157/metaflow that referenced this pull request Apr 17, 2026
This PR, on top of Netflix#3117, adds the tiny contract that extension authors
target when they want to ship typed artifacts. Only the abstract base
lives in OSS; concrete scalar/tensor/struct/etc. types and the bridge
serializer live downstream (per review with Romain — core should not
carry opinions about which types exist, how enums are wire-encoded,
tensor byte order, dataclass inference, etc.).

Standard Python primitives (``int``, ``str``, ``list``, ``dict``, ...)
continue to flow through ``PickleSerializer`` unchanged. Wrapping is
opt-in, for types that need metadata or invariants attached.

IOType contract
- ``serialize(format=...)`` / ``deserialize(data, format=..., **kw)``
  mirror the signature that Netflix#3117 added on ``ArtifactSerializer``. The
  same ``WIRE`` / ``STORAGE`` constants govern dispatch so a single
  subclass owns both representations:
  - ``STORAGE`` → ``(List[SerializedBlob], metadata_dict)`` for the
    datastore save path.
  - ``WIRE`` → ``str`` for CLI args, protobuf payloads, and
    cross-process IPC.
- Subclasses implement four hooks: ``_wire_serialize``,
  ``_wire_deserialize``, ``_storage_serialize``,
  ``_storage_deserialize``.
- ``type_name`` + ``to_spec()`` support JSON schema generation.
- ``IOType`` itself is abstract; instantiating without implementing
  the four hooks raises ``TypeError``.

Tests
- ``test/unit/io_types/test_base.py`` — covers abstract instantiation,
  WIRE and STORAGE round-trips, default format, invalid format
  rejection, descriptor-mode spec output, and equality/hash.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread metaflow/datastore/artifacts/lazy_registry.py Outdated
saeidbarati157 pushed a commit to saeidbarati157/metaflow that referenced this pull request Apr 17, 2026
This PR, on top of Netflix#3117, adds the minimal contract that extension
authors target when they want to ship typed artifacts. Only the
abstract base lives in OSS; concrete scalar/tensor/struct/etc. types
and the bridge serializer belong downstream, where deployment-specific
opinions about encoding, byte order, enum wire representation, and
dataclass inference can live without being forced on every Metaflow
user.

Standard Python primitives (``int``, ``str``, ``list``, ``dict``, ...)
continue to flow through ``PickleSerializer`` unchanged. Wrapping is
opt-in, for types that need metadata or invariants attached.

IOType contract
- ``serialize(format=...)`` / ``deserialize(data, format=..., **kw)``
  mirror the signature that Netflix#3117 added on ``ArtifactSerializer``. The
  same ``WIRE`` / ``STORAGE`` constants govern dispatch so a single
  subclass owns both representations:
  - ``STORAGE`` → ``(List[SerializedBlob], metadata_dict)`` for the
    datastore save path.
  - ``WIRE`` → ``str`` for CLI args, protobuf payloads, and
    cross-process IPC.
- Subclasses implement four hooks: ``_wire_serialize``,
  ``_wire_deserialize``, ``_storage_serialize``,
  ``_storage_deserialize``.
- ``type_name`` + ``to_spec()`` support JSON schema generation.
- ``IOType`` itself is abstract; instantiating without implementing
  the four hooks raises ``TypeError``.

Tests
- ``test/unit/io_types/test_base.py`` — covers abstract instantiation,
  WIRE and STORAGE round-trips, default format, invalid format
  rejection, descriptor-mode spec output, and equality/hash.
@saeidbarati157 saeidbarati157 force-pushed the feat/pluggable-serializers branch from 4c17f76 to 79ed3ef Compare April 17, 2026 16:21
saeidbarati157 pushed a commit to saeidbarati157/metaflow that referenced this pull request Apr 18, 2026
On top of Netflix#3117, this ships the typed-artifact layer. Scope kept
deliberately small:

- ``IOType`` ABC — the contract extension authors target.
- ``Json`` and ``Struct`` — two concrete types with clear standalone
  value in core (wire format for CLI/IPC, cross-language JSON bytes
  on storage, self-describing schemas via ``to_spec()``, no pickle
  code-execution risk).
- ``IOTypeSerializer`` — the bridge that plugs any ``IOType`` instance
  into the ``ArtifactSerializer`` dispatch added by Netflix#3117 so save/load
  through the datastore just works.

What's intentionally *not* in this PR

- Primitive wrappers (Int32/Int64/Float32/Float64/Bool/Text). Standard
  Python numbers and strings flow through ``PickleSerializer``
  unchanged. Wrapping is opt-in, for cases where you want
  constraints/metadata attached.
- ``Tensor``. Pulls in numpy + byte-order/dtype opinions; belongs in an
  extension that can own those choices.
- ``List`` / ``Map`` / ``Enum``. Thin wrappers whose value over plain
  JSON is mostly ``to_spec()`` — not enough on their own for core.

Contract

``serialize(format=...)`` / ``deserialize(data, format=..., **kw)``
mirror the ``ArtifactSerializer`` signature from Netflix#3117 and use the same
``WIRE`` / ``STORAGE`` constants, so one subclass owns both
representations:

- ``STORAGE`` → ``(List[SerializedBlob], metadata_dict)`` for
  persisting through the datastore.
- ``WIRE`` → ``str`` for CLI args, protobuf payloads, and
  cross-process IPC.

Subclasses implement four hooks (``_wire_serialize``,
``_wire_deserialize``, ``_storage_serialize``,
``_storage_deserialize``). ``type_name`` + ``to_spec()`` support JSON
schema generation. Instantiating without the hooks raises
``TypeError``.

``IOTypeSerializer`` is registered via ``ARTIFACT_SERIALIZERS_DESC``
with ``PRIORITY=50`` — ahead of the default 100 so it catches
``IOType`` instances before a generic catch-all, and always ahead of
the ``PickleSerializer`` fallback (9999). It implements only
``STORAGE``; wire encoding is produced by calling
``IOType.serialize(format=WIRE)`` directly.

Tests

- ``test/unit/io_types/test_base.py`` — abstract instantiation,
  WIRE/STORAGE dispatch, invalid format, equality/hash, spec.
- ``test/unit/io_types/test_json_type.py`` — round-trips.
- ``test/unit/io_types/test_struct_type.py`` — dataclass round-trip,
  dict round-trip, ``to_spec()`` with dataclass fields.
- ``test/unit/io_types/test_iotype_serializer.py`` — bridge
  ``can_serialize``/``can_deserialize``, round-trip through
  dataclass reconstruction, rejection of non-IOType classes in
  metadata (security), WIRE not supported on the bridge.
@saeidbarati157 saeidbarati157 force-pushed the feat/pluggable-serializers branch from 79ed3ef to 0044f28 Compare April 18, 2026 01:49
saeidbarati157 pushed a commit to saeidbarati157/metaflow that referenced this pull request Apr 18, 2026
On top of Netflix#3117, this ships the typed-artifact layer. Scope kept
deliberately small:

- ``IOType`` ABC — the contract extension authors target.
- ``Json`` and ``Struct`` — two concrete types with clear standalone
  value in core: wire format for CLI/IPC, cross-language JSON bytes
  on storage, no pickle code-execution risk. ``Struct`` also walks
  directly-nested ``@dataclass`` fields so ``Outer(inner=Inner(...))``
  round-trips back to its original type (generic containers like
  ``List[Inner]`` come back as raw JSON values — wrap those
  explicitly when you need richer reconstruction).
- ``IOTypeSerializer`` — the bridge that plugs any ``IOType`` instance
  into the ``ArtifactSerializer`` dispatch added by Netflix#3117 so save/load
  through the datastore just works.

What's intentionally *not* in this PR

- Primitive wrappers (Int32/Int64/Float32/Float64/Bool/Text). Standard
  Python numbers and strings flow through ``PickleSerializer``
  unchanged. Wrapping is opt-in, for cases where you want
  constraints/metadata attached.
- ``Tensor``. Pulls in numpy + byte-order/dtype opinions; belongs in an
  extension that can own those choices.
- ``List`` / ``Map`` / ``Enum``. Thin wrappers whose value over plain
  JSON is mostly schema emission — not enough on their own for core.
- Rich schema emission from ``Struct.to_spec()``. Extensions that ship
  primitive wrappers can override to emit fully-typed schemas; core
  just returns ``{"type": "struct"}``.

Contract

``serialize(format=...)`` / ``deserialize(data, format=..., **kw)``
mirror the ``ArtifactSerializer`` signature from Netflix#3117 and use the same
``WIRE`` / ``STORAGE`` constants, so one subclass owns both
representations:

- ``STORAGE`` → ``(List[SerializedBlob], metadata_dict)`` for
  persisting through the datastore.
- ``WIRE`` → ``str`` for CLI args, protobuf payloads, and
  cross-process IPC.

Subclasses implement four hooks (``_wire_serialize``,
``_wire_deserialize``, ``_storage_serialize``,
``_storage_deserialize``). Instantiating without the hooks raises
``TypeError``.

``IOTypeSerializer`` is registered via ``ARTIFACT_SERIALIZERS_DESC``
with ``PRIORITY=50`` — ahead of the default 100 so it catches
``IOType`` instances before a generic catch-all, and always ahead of
the ``PickleSerializer`` fallback (9999). It implements only
``STORAGE``; wire encoding is produced by calling
``IOType.serialize(format=WIRE)`` directly.

Safety

- ``Struct._storage_deserialize`` and ``IOTypeSerializer.deserialize``
  both require the class named in artifact metadata to be an actual
  class (``isinstance(..., type)``) before any further checks. This
  excludes module-level dataclass *instances* (``is_dataclass`` alone
  returns ``True`` for those) and other callables that could be
  invoked with attacker-controlled kwargs.
- Importing the metadata-named module can still run module-level
  side-effect code; the ``Struct`` docstring calls this out so callers
  don't load artifacts from untrusted sources.

Tests

- ``test_base.py`` — abstract instantiation, WIRE/STORAGE dispatch,
  invalid format, equality/hash, spec.
- ``test_json_type.py`` — wire and storage round-trips.
- ``test_struct_type.py`` — dataclass round-trip, dict round-trip,
  directly-nested dataclass round-trip, container-field pass-through,
  rejection of non-dataclass and dataclass-instance metadata.
- ``test_iotype_serializer.py`` — bridge
  ``can_serialize``/``can_deserialize``, round-trip through dataclass
  reconstruction, rejection of non-IOType classes in metadata, WIRE
  not supported on the bridge.
@saeidbarati157 saeidbarati157 force-pushed the feat/pluggable-serializers branch from 0044f28 to dcccf73 Compare April 18, 2026 02:58
Comment thread metaflow/datastore/task_datastore.py Outdated
@saeidbarati157 saeidbarati157 force-pushed the feat/pluggable-serializers branch from dcccf73 to 3e9a437 Compare April 18, 2026 03:16
codecov Bot commented Apr 23, 2026

Codecov Report

❌ Patch coverage is 80.69217% with 106 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (master@352f643). Learn more about missing BASE report.

Files with missing lines Patch % Lines
metaflow/datastore/artifacts/serializer.py 78.11% 55 Missing and 17 partials ⚠️
metaflow/datastore/artifacts/lazy_registry.py 75.71% 10 Missing and 7 partials ⚠️
metaflow/datastore/task_datastore.py 79.76% 10 Missing and 7 partials ⚠️
Additional details and impacted files
@@            Coverage Diff            @@
##             master    #3117   +/-   ##
=========================================
  Coverage          ?   28.00%           
=========================================
  Files             ?      381           
  Lines             ?    52347           
  Branches          ?     9238           
=========================================
  Hits              ?    14661           
  Misses            ?    36750           
  Partials          ?      936           


romain-intel (Contributor) left a comment

Minor comments.

Comment thread metaflow/datastore/artifacts/serializer.py
Comment thread metaflow/datastore/artifacts/serializer.py Outdated
Comment thread metaflow/datastore/artifacts/serializer.py Outdated
Comment thread metaflow/datastore/task_datastore.py
Comment thread metaflow/datastore/task_datastore.py Outdated
Comment thread metaflow/datastore/task_datastore.py Outdated
saeidbarati157 (Contributor, Author) replied:
Thanks Romain! Addressed in fed2c78:

  • serializer.py method declarations — added full type annotations to the ABC method signatures.
  • context param — stripped from deserialize across the ABC, PickleSerializer, the TaskDataStore caller, and tests. Can re-add if/when we have a concrete use case.
  • STORAGE/WIRE as an enum — promoted to SerializationFormat(str, Enum). Subclassing str keeps existing format == 'storage' comparisons working so no downstream migration is needed (see the sketch after this comment).
  • _serializers_override comment — reworded to describe the test override; moved the property-rationale block next to the property definition itself.
  • Multi-blob error — appended "at this time. If you have a need for multi blob serializers, please reach out to the Metaflow team."
  • Deserializer-not-found error — now prints the serializer_info dict alongside encoding.

All 77 serializer/pickle/lazy/integration unit tests pass.
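
A standalone illustration of the str-backed enum behavior (the 'wire' value below is an assumption; only 'storage' is quoted in this thread):

# Not the Metaflow code: a str-backed Enum compares equal to its raw string
# value, so pre-enum call sites that compare against 'storage' keep working.
from enum import Enum


class SerializationFormat(str, Enum):
    STORAGE = "storage"  # value quoted above
    WIRE = "wire"        # assumed value for the wire format


assert SerializationFormat.STORAGE == "storage"                 # old string comparison still holds
assert SerializationFormat("wire") is SerializationFormat.WIRE  # lookup by value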

@romain-intel romain-intel force-pushed the feat/pluggable-serializers branch from fed2c78 to b6f6cf4 Compare April 24, 2026 07:04
Comment thread metaflow/datastore/artifacts/serializer.py
Comment thread metaflow/plugins/datastores/serializers/pickle_serializer.py
Comment thread test/unit/test_artifact_serializer.py
@saeidbarati157 saeidbarati157 force-pushed the feat/pluggable-serializers branch from 813442e to f062b36 Compare April 29, 2026 18:46
Comment on lines +45 to +54
blob = pickle.dumps(obj, protocol=4)
return (
[SerializedBlob(blob, is_reference=False)],
SerializationMetadata(
obj_type=str(type(obj)),
size=len(blob),
encoding="pickle-v4",
serializer_info={},
),
)

P1 Protocol-2 fallback removed — behavioral regression for some objects

The old save_artifacts caught SystemError and OverflowError from pickle.dumps(protocol=4) and automatically retried with protocol=2. That fallback is gone here, so any object that triggers one of those errors under protocol=4 (e.g., certain C-extension types or deeply nested structures) will now raise a raw SystemError/OverflowError instead of being serialized successfully. Artifacts that Metaflow previously saved without issue may now fail to save.

@classmethod
def serialize(cls, obj, format=SerializationFormat.STORAGE):
    if format == SerializationFormat.WIRE:
        raise NotImplementedError(
            "PickleSerializer does not support the WIRE format; pickle "
            "produces opaque binary bytes that are not safe to pass as "
            "CLI args or inline IPC payloads."
        )
    try:
        blob = pickle.dumps(obj, protocol=4)
        encoding = "pickle-v4"
    except (SystemError, OverflowError):
        blob = pickle.dumps(obj, protocol=2)
        encoding = "pickle-v2"
    return (
        [SerializedBlob(blob, is_reference=False)],
        SerializationMetadata(
            obj_type=str(type(obj)),
            size=len(blob),
            encoding=encoding,
            serializer_info={},
        ),
    )

Collaborator commented:

possibly relevant still.

Collaborator commented:

actually strike that, I'm forgetting the context for keeping the protocol-2 encoding here but it seems to have been tied to python <=3.6 so dropping the fallback should be fine at this point.

Comment thread metaflow/datastore/artifacts/serializer.py
Comment thread metaflow/datastore/artifacts/serializer.py
return not self.is_reference


class SerializerStore(ABCMeta):

would this serializer store be accessible from other metaflow plugins ? i'm working on updating the checkpoint decorator and it'd be super useful if we could re-use this to serialize artifacts for checkpointing purposes !

we'd have to be able to (1) access the serializations for an artifact someone wants to checkpoint , and (2) save the artifacts in a different location from the usual metaflow artifacts

saeidbarati157 (Contributor, Author) replied:

Yes, that's exactly what the decoupling enables. The serializer produces (blobs, metadata) independently of where you persist them:

from metaflow.datastore.artifacts.serializer import SerializerStore

# (1) serialize
for s in SerializerStore.get_ordered_serializers():
    if s.can_serialize(obj):
        blobs, metadata = s.serialize(obj)
        break

# (2) deserialize from your own storage
for s in SerializerStore.get_ordered_serializers():
    if s.can_deserialize(metadata):
        return s.deserialize([blob], metadata)

You hand blobs to whatever storage you want (your own S3 path, a separate CAS, etc.).

nice ! this is great , thanks for the detailed example !

except TypeError as e:
# Preserve the historical "couldn't pickle this" wrapper so
# existing consumers still see ``UnpicklableArtifactException``.
raise UnpicklableArtifactException(name) from e
Collaborator commented:

couldn't a non-pickle serializer also raise a TypeError ? One option could be to treat any raised MetaflowException as pass-through, and have the pickle-serializer raise the UnpicklableArtifactException instead?

Contributor replied:

I think the idea was to try to return the same exception as before. I think what we could do is specially catch UnpicklableArtifactException (and yes, raise it directly in the pickle serializer) and for everything else, fall through to the DataException which would maintain backwards compatibility with exception types. Would that work for you?

saeidbarati157 (Contributor, Author) replied:

Moved pickle.dumps's TypeError handling into PickleSerializer itself. It now raises UnpicklableArtifactException (with artifact_name optional). TaskDataStore.save_artifacts catches it specifically to re-attach the artifact name, lets any other MetaflowException pass through unchanged, and only wraps non-MetaflowException failures in DataException for context. Added three tests covering the three branches.
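
A sketch of the resulting exception routing on the save path. The helper, its arguments, and the messages are hypothetical; import paths follow the modules referenced in this PR, and the merged task_datastore.py is the source of truth.

from metaflow.exception import MetaflowException
from metaflow.datastore.exceptions import DataException, UnpicklableArtifactException
from metaflow.datastore.artifacts.serializer import SerializationFormat


def save_one_artifact(serializer, name, obj):
    # Hypothetical helper mirroring the three branches described above.
    try:
        return serializer.serialize(obj, format=SerializationFormat.STORAGE)
    except UnpicklableArtifactException as e:
        # PickleSerializer raised this without knowing the artifact name;
        # re-raise with the name attached for the user-facing message.
        raise UnpicklableArtifactException(name) from e
    except MetaflowException:
        # A custom serializer's own Metaflow exceptions pass through unchanged.
        raise
    except Exception as e:
        # Anything else is wrapped so the error carries artifact and
        # serializer context instead of a bare traceback.
        raise DataException(
            "Serializer %s failed for artifact '%s'" % (serializer.__name__, name)
        ) from e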


Saeid Barati and others added 13 commits May 5, 2026 17:17
Introduce ArtifactSerializer ABC with priority-based dispatch, enabling
custom serializers to be plugged in alongside the default pickle path.
This is the foundation for the IOType system which adds typed
serialization for Metaflow artifacts.

New abstractions:
- ArtifactSerializer: base class with can_serialize/can_deserialize/
  serialize/deserialize interface
- SerializerStore: metaclass for auto-registration and deterministic
  priority-ordered dispatch
- SerializationMetadata: namedtuple for artifact metadata routing
- SerializedBlob: supports both new bytes and references to
  already-stored data
- PickleSerializer: universal fallback (PRIORITY=9999) wrapping
  existing pickle logic

No existing code is modified. These are inert until wired into
TaskDataStore in a follow-up commit.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Register PickleSerializer through the standard Metaflow plugin
mechanism so extensions (e.g., mli-metaflow-custom) can add their
own serializers via ARTIFACT_SERIALIZERS_DESC.

- Add artifact_serializer category to _plugin_categories
- Add ARTIFACT_SERIALIZERS_DESC with PickleSerializer in plugins/__init__.py
- Resolve via ARTIFACT_SERIALIZERS = resolve_plugins("artifact_serializer")

Importing metaflow.plugins now triggers PickleSerializer registration
in SerializerStore, ensuring the registry is populated before
TaskDataStore needs it (commit 3).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
save_artifacts now loops through priority-ordered serializers to find
one that can_serialize the object. load_artifacts routes deserialization
through metadata.encoding via can_deserialize. PickleSerializer handles
all existing Python objects as the universal fallback.

Behavioral changes:
- New artifacts get encoding "pickle-v4" (was "gzip+pickle-v4")
- _info[name] gains optional "serializer_info" dict for custom serializers
- Removed hardcoded pickle import from task_datastore.py

Backward compatible:
- Old artifacts with "gzip+pickle-v2" or "gzip+pickle-v4" encoding
  load correctly (PickleSerializer.can_deserialize handles both)
- Missing encoding defaults to "gzip+pickle-v2"
- Missing serializer_info defaults to {}
- _objects[name] stays as single string (multi-blob deferred to PR 1.5)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- SerializerStore now extends ABCMeta (not type) so @abstractmethod
  is enforced — incomplete subclasses raise TypeError at definition
- Validate blobs non-empty before accessing blobs[0] in save_artifacts
- Add NOTE on compress_method not yet wired into save path

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Wire vs storage format
- Added WIRE and STORAGE constants in metaflow.datastore.artifacts.serializer.
- ArtifactSerializer.serialize/deserialize now accept a ``format`` kwarg so a
  single class can own both the storage path (datastore blobs + metadata) and
  the wire path (string for CLI args, protobuf payloads, cross-process IPC).
- PickleSerializer implements STORAGE only; WIRE raises NotImplementedError
  with an explanation (pickle bytes are not safe as a wire payload).
- Serializers that want wire support implement it on the same class; there's
  no need for a second class per format.

Lazy import-hook registry
- New module metaflow/datastore/artifacts/lazy_registry.py with
  SerializerConfig, an importlib.abc.MetaPathFinder interceptor, and public
  register_serializer_for_type / load_serializer_class entry points.
- If the target type's module is already in sys.modules, registration is
  immediate. Otherwise a hook is installed on sys.meta_path and registration
  fires the first time the user's code imports the target module. This
  defers the cost of serializer-module imports (torch, pyarrow, fastavro,
  ...) until those dependencies are actually in play.
- find_spec temporarily removes the interceptor from sys.meta_path during
  its lookup to avoid recursion.
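
For illustration, a standalone sketch of the sys.meta_path interception pattern (not the lazy_registry.py implementation: the real code wraps the module loader and fires its callback only after the module finishes executing):

import importlib.abc
import importlib.util
import sys


class ImportWatcher(importlib.abc.MetaPathFinder):
    # Fire a callback the first time one of the watched modules is imported.
    def __init__(self, watched_modules, callback):
        self._watched = set(watched_modules)
        self._callback = callback

    def find_spec(self, fullname, path=None, target=None):
        if fullname not in self._watched:
            return None
        # Step aside while delegating to the regular finders so this call
        # does not recurse back into find_spec().
        sys.meta_path.remove(self)
        try:
            spec = importlib.util.find_spec(fullname)
        finally:
            sys.meta_path.insert(0, self)
        if spec is not None:
            self._watched.discard(fullname)
            self._callback(fullname)
        return spec


# Example: fire when user code first imports torch.
sys.meta_path.insert(0, ImportWatcher({"torch"}, lambda name: print("now active:", name)))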

Review nits
- serializer.py: rename SerializationMetadata.type field to obj_type to avoid
  shadowing the type() builtin (dict key inside _info stays "type" for
  backward compatibility with existing datastores).
- serializer.py: SerializerStore skips TYPE=None AND any subclass that is
  still abstract, via inspect.isabstract().
- serializer.py: get_ordered_serializers memoizes the sorted list and
  invalidates on new registration. Drops the O(n²) list.index tiebreaker —
  Python 3.7+ dicts preserve insertion order.
- task_datastore.py: lift SerializerStore / SerializationMetadata imports to
  module top (were on hot paths: __init__ and load_artifacts).
- task_datastore.py: the "no serializer claimed this artifact" branch now
  raises DataException with a message pointing at the PickleSerializer
  fallback invariant, instead of the misleading UnpicklableArtifactException.
- task_datastore.py: "no deserializer claimed this artifact" now looks up
  ``serializer_info["source"]`` and hints at a missing extension.
- SerializedBlob.compress_method: removed. It was documented as "not wired
  into the save path" — shipping an unwired knob invites extension authors
  to rely on it and discover it is a no-op. Will come back with its
  consuming code in a later change.
- serialize() docstring rewritten: the side-effect-free contract is there
  to support retries, caching, and parallel dispatch, and to keep
  serializers testable; I/O belongs in hooks.

Tests
- New test/unit/test_lazy_serializer_registry.py (9 tests covering config
  validation, eager registration for already-imported types, deferred
  registration through the import hook, and the recursion guard).
- test_artifact_serializer.py gains format-dispatch coverage (STORAGE and
  WIRE round-trip on a toy dual-format serializer; PickleSerializer raises
  on WIRE).
- Removed tests for compress_method and adjusted tests that poked at
  _registration_order so they go through the public API only.
- Convert STORAGE/WIRE module constants to a SerializationFormat enum
  (str-backed so existing equality checks keep working). Module-level
  STORAGE/WIRE aliases are dropped.
- Add Python type hints to can_serialize, can_deserialize, serialize,
  and deserialize so editors surface the types documented in docstrings.
- Drop the unused `context` parameter from deserialize() across the ABC,
  PickleSerializer, the TaskDataStore call site, and tests.
- Move the `_serializers` property explanation onto the property itself;
  leave only a short note next to `_serializers_override`.
- save_artifacts now validates blob count BEFORE mutating `_info[name]`,
  so a rejected artifact never leaves partial metadata behind.
- Multi-blob error message now points users at the Metaflow team for
  multi-blob use cases.
- "No deserializer claimed artifact" error now includes the full
  serializer_info dict in addition to the source hint.
- SerializerStore.get_ordered_serializers tracks already-materialized
  lazy classes so the ordered cache is only rebuilt when a new lazy
  class actually becomes importable — previously, any registered lazy
  config forced a rebuild on every call.

Tests
- New regression test for the ordered-cache steady-state behavior.
- New regression tests verifying `_info[name]` is not populated when
  save_artifacts rejects empty or multi-blob serializer output.
- Updated existing tests for the enum, type hints, and context removal.
Replace the two-path registration scheme (ARTIFACT_SERIALIZERS_DESC +
register_serializer_for_type) with a single declarative mechanism.
Extensions ship one serializer file per serializer; heavy imports
live in a setup_imports() classmethod whose ModuleNotFoundError
parks the serializer on an import hook and retries once the awaited
module appears in sys.modules.

Author-facing API
-----------------

    class TorchSerializer(ArtifactSerializer):
        TYPE = "torch"
        PRIORITY = 50

        @classmethod
        def setup_imports(cls, context=None):
            cls.lazy_import("torch")
            cls.lazy_import("pyarrow", alias="pa")

        @classmethod
        def can_serialize(cls, obj):
            return isinstance(obj, cls.torch.Tensor)
        ...

    # mfextinit_myext.py
    ARTIFACT_SERIALIZERS_DESC = [
        ("torch", "my_ext.serializers.torch_serializer.TorchSerializer"),
    ]

Mechanism
---------

- SerializerStore.bootstrap() walks every extension's
  ARTIFACT_SERIALIZERS_DESC directly (the serializer category owns its
  own lifecycle rather than flowing through resolve_plugins), applies
  ENABLED_ARTIFACT_SERIALIZER +/- toggles, and drives each entry
  through a state machine:
      known → importing → class_loaded → importing_deps → active
  with terminal states `broken` / `disabled` and the retry state
  `pending_on_imports`.
- lazy_import(module_path, alias=None) imports the module, stashes it
  on cls, returns it. Rejects reserved names (TYPE, PRIORITY, dispatch
  methods, leading underscore) and double-assignment within one
  setup_imports() call.
- A sys.meta_path interceptor watches the pending modules; when any
  imports, SerializerStore._on_module_imported re-runs the lifecycle
  for every parked entry waiting on that module. Loop guard: the same
  ImportError.name raising twice → broken.
- Dispatch reads _active_serializers only. Exceptions raised by
  can_serialize / can_deserialize on a currently-active serializer are
  caught, counted in the record's dispatch_error_count, and the
  serializer is skipped for that artifact only — preserving the
  PickleSerializer fallback.
- On PRIORITY ties, last-registered wins; a lexicographic tiebreak on
  class_path provides cross-environment reproducibility.

Diagnostic API
--------------

metaflow.datastore.artifacts.list_serializer_status() returns a list
of per-entry dicts with name, class_path, state, awaiting_modules,
last_error, priority, type, import_trigger, and dispatch_error_count.
Primary use case: answering "why isn't my custom serializer active?"
without reading source.
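
A hypothetical diagnostic check using the fields listed above (treat the exact call shape and state values as assumptions):

from metaflow.datastore.artifacts import list_serializer_status

for rec in list_serializer_status():
    if rec["state"] != "active":
        # Surface why a serializer never became active.
        print(rec["name"], rec["state"], rec["awaiting_modules"], rec["last_error"])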

Removed
-------

- register_serializer_for_type() (public API)
- SerializerConfig, register_serializer_config, iter_registered_configs,
  load_serializer_class, plus the backing declarative-config plumbing
  inside lazy_registry.py — nothing in the unified path uses them.
- ARTIFACT_SERIALIZERS = resolve_plugins("artifact_serializer") in
  metaflow/plugins/__init__.py (no readers in the codebase).
- "artifact_serializer" from _plugin_categories in
  metaflow/extension_support/plugins.py.

Testing
-------

- 102 unit tests across test_artifact_serializer, test_pickle_serializer,
  test_serializer_integration, test_serializer_lifecycle, and
  test_serializer_public_api. Coverage: state-machine transitions,
  retry and loop guard, disabled toggle, dispatch exception handling,
  lazy_import reserved-name + collision guards, setup_imports signature
  compatibility (both def setup_imports(cls) and
  def setup_imports(cls, context=None) are supported), subclass
  inheritance, and public API surface assertions.
- SerializerStore._reset_for_tests() clears registry + interceptor state
  and walks MRO to delattr lazy-imported attributes for test isolation.
When save_artifacts dispatches to a serializer, inject the serializer's
source label into ``serializer_info["source"]`` so the "no deserializer
claimed artifact" load error can point at the package providing the
missing serializer. The source is derived at bootstrap time:
``"metaflow"`` for core serializers, the extension's ``package_name``
(via ``metaflow.extension_support.plugins.get_modules``) for extensions.

Authors who set their own ``source`` in the returned ``serializer_info``
are not overridden, so an extension can publish a different identifier
(e.g. a logical name rather than the pip distribution name) by passing
it through ``serialize()``.

The source is also exposed on each diagnostic record returned by
``list_serializer_status()``.
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
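
The non-overriding source injection described above can be pictured with a hypothetical helper (not the merged task_datastore.py code):

def inject_source(serializer_info, default_source):
    # Fill in "source" only when the serializer author did not set one;
    # default_source is "metaflow" for core serializers or the extension's
    # package_name for extension-provided ones.
    info = dict(serializer_info or {})
    info.setdefault("source", default_source)
    return info
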
- PickleSerializer.serialize labelled output as "gzip+pickle-v4" but the
  serializer only pickles; gzip is applied later by ContentAddressedStore.
  Tests assert the serializer reports its own encoding ("pickle-v4"); the
  legacy "gzip+pickle-v4" label is kept in _ENCODINGS for backward-compat
  reads of master-era artifacts.

- ArtifactSerializer.lazy_import silently no-op'd when an alias was reused
  with a different module path. Now: same module_path is still idempotent
  (same module's __name__ matches), different module_path raises
  ValueError so accidental aliasing fails loudly.
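
A standalone sketch of that alias-reuse rule (illustrative only; the real lazy_import also rejects reserved attribute names and leading underscores):

import importlib


class LazyImportHost:
    # Standalone sketch; not the ArtifactSerializer.lazy_import implementation.
    @classmethod
    def lazy_import(cls, module_path, alias=None):
        attr = alias or module_path
        existing = getattr(cls, attr, None)
        if existing is not None:
            if getattr(existing, "__name__", None) == module_path:
                return existing  # same module under the same name: idempotent
            raise ValueError(
                "%r is already bound to %r; refusing to rebind it to %r"
                % (attr, getattr(existing, "__name__", existing), module_path)
            )
        module = importlib.import_module(module_path)
        setattr(cls, attr, module)
        return module
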
…atch hardening

Round-3 fixes triggered by an external code review:

Import hook (lazy_registry.py)
- _WrappedLoader.exec_module restores module.__spec__.loader to the
  original on both success and failure, so pkgutil/importlib.resources/
  test-runners that introspect spec.loader stop seeing the wrapper.
- On exec failure the success callback is skipped and the module is not
  marked processed, so a future re-import of the same name retries the
  full path instead of silently no-op'ing.
- _ensure_interceptor_installed() short-circuits when already at
  sys.meta_path[0], removing the brief "interceptor missing" window the
  remove+reinsert pattern created.
- The SerializerStore registry callback exception path now emits a
  RuntimeWarning instead of swallowing silently.

Registry / metaclass (serializer.py)
- SerializerStore.__init__ rejects subclasses that inherit a registered
  parent's TYPE without declaring their own — would otherwise overwrite
  the parent entry in _all_serializers and corrupt dispatch.
- _retry_bootstrap mirrors bootstrap_entries: TYPE-vs-name validation,
  rec.priority/rec.type capture, and last_error reset on success.
- _on_module_imported wraps each per-record retry in its own try/except
  so a defect in one record cannot strand the others as PENDING.
- bootstrap()'s extension-discovery and config-toggle except blocks now
  emit RuntimeWarning so a broken extension is visible instead of
  silently disabling every artifact serializer it ships.
- _reset_for_tests delegates to lazy_registry._reset_for_tests() so both
  _watched and _processed (and the sys.meta_path entry) are cleared.

Datastore (task_datastore.py)
- save_artifacts catches non-TypeError serializer exceptions and wraps
  them with artifact-name + serializer-class context via DataException.
- Both save and load paths pass format=SerializationFormat.STORAGE
  explicitly, so a serializer that overrides the default does not
  break tuple-unpacking.
- load_artifacts surfaces an _info/_objects divergence as a
  DataException with context instead of a bare KeyError.
- _record_dispatch_error emits a RuntimeWarning for buggy
  can_serialize/can_deserialize so silent fallthrough to PickleSerializer
  is observable.

Documentation
- SerializationMetadata.__doc__ documents the encoding-string contract
  (describes serializer output, not on-disk format; CAS owns gzip).
- ArtifactSerializer.serialize docstring repeats the no-double-compress
  rule and references the metadata contract.

Tests
- Updated _JsonStringSerializer.serialize signature to accept the
  format kwarg, matching the abstract contract now enforced by the
  framework's explicit format=STORAGE call.
…tegration.py

Round-3 commit landed two lines that do not conform to the project's black configuration:
the ``inherited_from`` generator and the ``ValueError`` message in
``lazy_import``. Run black==25.12.0 (the version pinned in
``.pre-commit-config.yaml``) to reformat in place. No functional change.
Move TypeError handling out of TaskDataStore's dispatch loop so a
non-pickle serializer's TypeError is no longer mislabeled as
"unpicklable". PickleSerializer now wraps its own pickle.dumps TypeError
into UnpicklableArtifactException; TaskDataStore re-raises it with the
artifact name attached and lets any other MetaflowException propagate.
@saeidbarati157 saeidbarati157 force-pushed the feat/pluggable-serializers branch from 9653844 to c79fa79 Compare May 5, 2026 17:29
@romain-intel romain-intel merged commit de3fcc7 into Netflix:master May 5, 2026
38 checks passed