Add pluggable artifact serializer framework #3117
romain-intel merged 13 commits into Netflix:master
Conversation
Greptile Summary
Replaces the hardcoded pickle logic in TaskDataStore with a pluggable artifact serializer framework.
Confidence Score: 5/5. The change is safe to merge. Backward compatibility is preserved: PickleSerializer remains the universal fallback, legacy encodings are accepted on load, and the old gzip+pickle artifact format is handled transparently. All previously flagged critical paths are properly addressed. The two remaining comments are narrow robustness gaps in edge-case retry scenarios that do not affect the primary dispatch path or any existing artifact. The lazy_import retry idempotency logic and the _call_setup_imports helper in serializer.py are the most likely places to surface unexpected behavior for extension authors writing multi-dependency serializers.
Reviews (15). Last reviewed commit: "Address saikonen review: pickle owns Unp..."
This PR, on top of Netflix#3117, adds the tiny contract that extension authors target when they want to ship typed artifacts. Only the abstract base lives in OSS; concrete scalar/tensor/struct/etc. types and the bridge serializer live downstream (per review with Romain — core should not carry opinions about which types exist, how enums are wire-encoded, tensor byte order, dataclass inference, etc.). Standard Python primitives (``int``, ``str``, ``list``, ``dict``, ...) continue to flow through ``PickleSerializer`` unchanged. Wrapping is opt-in, for types that need metadata or invariants attached.

IOType contract
- ``serialize(format=...)`` / ``deserialize(data, format=..., **kw)`` mirror the signature that Netflix#3117 added on ``ArtifactSerializer``. The same ``WIRE`` / ``STORAGE`` constants govern dispatch so a single subclass owns both representations:
  - ``STORAGE`` → ``(List[SerializedBlob], metadata_dict)`` for the datastore save path.
  - ``WIRE`` → ``str`` for CLI args, protobuf payloads, and cross-process IPC.
- Subclasses implement four hooks: ``_wire_serialize``, ``_wire_deserialize``, ``_storage_serialize``, ``_storage_deserialize``.
- ``type_name`` + ``to_spec()`` support JSON schema generation.
- ``IOType`` itself is abstract; instantiating without implementing the four hooks raises ``TypeError``.

Tests
- ``test/unit/io_types/test_base.py`` — covers abstract instantiation, WIRE and STORAGE round-trips, default format, invalid format rejection, descriptor-mode spec output, and equality/hash.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Force-pushed from 4c17f76 to 79ed3ef.
On top of Netflix#3117, this ships the typed-artifact layer. Scope kept deliberately small:
- ``IOType`` ABC — the contract extension authors target.
- ``Json`` and ``Struct`` — two concrete types with clear standalone value in core (wire format for CLI/IPC, cross-language JSON bytes on storage, self-describing schemas via ``to_spec()``, no pickle code-execution risk).
- ``IOTypeSerializer`` — the bridge that plugs any ``IOType`` instance into the ``ArtifactSerializer`` dispatch added by Netflix#3117 so save/load through the datastore just works.

What's intentionally *not* in this PR
- Primitive wrappers (Int32/Int64/Float32/Float64/Bool/Text). Standard Python numbers and strings flow through ``PickleSerializer`` unchanged. Wrapping is opt-in, for cases where you want constraints/metadata attached.
- ``Tensor``. Pulls in numpy + byte-order/dtype opinions; belongs in an extension that can own those choices.
- ``List`` / ``Map`` / ``Enum``. Thin wrappers whose value over plain JSON is mostly ``to_spec()`` — not enough on their own for core.

Contract
``serialize(format=...)`` / ``deserialize(data, format=..., **kw)`` mirror the ``ArtifactSerializer`` signature from Netflix#3117 and use the same ``WIRE`` / ``STORAGE`` constants, so one subclass owns both representations:
- ``STORAGE`` → ``(List[SerializedBlob], metadata_dict)`` for persisting through the datastore.
- ``WIRE`` → ``str`` for CLI args, protobuf payloads, and cross-process IPC.
Subclasses implement four hooks (``_wire_serialize``, ``_wire_deserialize``, ``_storage_serialize``, ``_storage_deserialize``). ``type_name`` + ``to_spec()`` support JSON schema generation. Instantiating without the hooks raises ``TypeError``.

``IOTypeSerializer`` is registered via ``ARTIFACT_SERIALIZERS_DESC`` with ``PRIORITY=50`` — ahead of the default 100 so it catches ``IOType`` instances before a generic catch-all, and always ahead of the ``PickleSerializer`` fallback (9999). It implements only ``STORAGE``; wire encoding is produced by calling ``IOType.serialize(format=WIRE)`` directly.

Tests
- ``test/unit/io_types/test_base.py`` — abstract instantiation, WIRE/STORAGE dispatch, invalid format, equality/hash, spec.
- ``test/unit/io_types/test_json_type.py`` — round-trips.
- ``test/unit/io_types/test_struct_type.py`` — dataclass round-trip, dict round-trip, ``to_spec()`` with dataclass fields.
- ``test/unit/io_types/test_iotype_serializer.py`` — bridge ``can_serialize``/``can_deserialize``, round-trip through dataclass reconstruction, rejection of non-IOType classes in metadata (security), WIRE not supported on the bridge.
Force-pushed from 79ed3ef to 0044f28.
On top of Netflix#3117, this ships the typed-artifact layer. Scope kept deliberately small:
- ``IOType`` ABC — the contract extension authors target.
- ``Json`` and ``Struct`` — two concrete types with clear standalone value in core: wire format for CLI/IPC, cross-language JSON bytes on storage, no pickle code-execution risk. ``Struct`` also walks directly-nested ``@dataclass`` fields so ``Outer(inner=Inner(...))`` round-trips back to its original type (generic containers like ``List[Inner]`` come back as raw JSON values — wrap those explicitly when you need richer reconstruction).
- ``IOTypeSerializer`` — the bridge that plugs any ``IOType`` instance into the ``ArtifactSerializer`` dispatch added by Netflix#3117 so save/load through the datastore just works.

What's intentionally *not* in this PR
- Primitive wrappers (Int32/Int64/Float32/Float64/Bool/Text). Standard Python numbers and strings flow through ``PickleSerializer`` unchanged. Wrapping is opt-in, for cases where you want constraints/metadata attached.
- ``Tensor``. Pulls in numpy + byte-order/dtype opinions; belongs in an extension that can own those choices.
- ``List`` / ``Map`` / ``Enum``. Thin wrappers whose value over plain JSON is mostly schema emission — not enough on their own for core.
- Rich schema emission from ``Struct.to_spec()``. Extensions that ship primitive wrappers can override to emit fully-typed schemas; core just returns ``{"type": "struct"}``.

Contract
``serialize(format=...)`` / ``deserialize(data, format=..., **kw)`` mirror the ``ArtifactSerializer`` signature from Netflix#3117 and use the same ``WIRE`` / ``STORAGE`` constants, so one subclass owns both representations:
- ``STORAGE`` → ``(List[SerializedBlob], metadata_dict)`` for persisting through the datastore.
- ``WIRE`` → ``str`` for CLI args, protobuf payloads, and cross-process IPC.
Subclasses implement four hooks (``_wire_serialize``, ``_wire_deserialize``, ``_storage_serialize``, ``_storage_deserialize``). Instantiating without the hooks raises ``TypeError``. (A minimal subclass sketch follows after this description.)

``IOTypeSerializer`` is registered via ``ARTIFACT_SERIALIZERS_DESC`` with ``PRIORITY=50`` — ahead of the default 100 so it catches ``IOType`` instances before a generic catch-all, and always ahead of the ``PickleSerializer`` fallback (9999). It implements only ``STORAGE``; wire encoding is produced by calling ``IOType.serialize(format=WIRE)`` directly.

Safety
- ``Struct._storage_deserialize`` and ``IOTypeSerializer.deserialize`` both require the class named in artifact metadata to be an actual class (``isinstance(..., type)``) before any further checks. This excludes module-level dataclass *instances* (``is_dataclass`` alone returns ``True`` for those) and other callables that could be invoked with attacker-controlled kwargs.
- Importing the metadata-named module can still run module-level side-effect code; the ``Struct`` docstring calls this out so callers don't load artifacts from untrusted sources.

Tests
- ``test_base.py`` — abstract instantiation, WIRE/STORAGE dispatch, invalid format, equality/hash, spec.
- ``test_json_type.py`` — wire and storage round-trips.
- ``test_struct_type.py`` — dataclass round-trip, dict round-trip, directly-nested dataclass round-trip, container-field pass-through, rejection of non-dataclass and dataclass-instance metadata.
- ``test_iotype_serializer.py`` — bridge ``can_serialize``/``can_deserialize``, round-trip through dataclass reconstruction, rejection of non-IOType classes in metadata, WIRE not supported on the bridge.
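A minimal sketch of a subclass under this contract. Only the four hook names, the STORAGE/WIRE shapes, and ``SerializedBlob`` come from the description above; the exact hook signatures, the constructor, and the class itself are assumptions for illustration.

import json

# Assumes IOType is importable from the (downstream) io_types package and
# SerializedBlob from metaflow.datastore.artifacts.serializer per Netflix#3117.
class JsonDocument(IOType):
    """Hypothetical IOType wrapping a JSON-serializable value."""

    type_name = "json_document"

    def __init__(self, value):
        self.value = value

    # WIRE: a plain string, safe for CLI args / IPC payloads
    def _wire_serialize(self):
        return json.dumps(self.value)

    @classmethod
    def _wire_deserialize(cls, data):
        return cls(json.loads(data))

    # STORAGE: (List[SerializedBlob], metadata_dict) for the datastore save path
    def _storage_serialize(self):
        blob = json.dumps(self.value).encode("utf-8")
        return [SerializedBlob(blob, is_reference=False)], {"type": self.type_name}

    @classmethod
    def _storage_deserialize(cls, blobs, metadata):
        # blobs[0] is assumed to arrive as raw bytes here
        return cls(json.loads(blobs[0].decode("utf-8")))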
Force-pushed from 0044f28 to dcccf73.
Force-pushed from dcccf73 to 3e9a437.
Codecov Report
❌ Patch coverage is …
Additional details and impacted files:
@@ Coverage Diff @@
## master #3117 +/- ##
=========================================
Coverage ? 28.00%
=========================================
Files ? 381
Lines ? 52347
Branches ? 9238
=========================================
Hits ? 14661
Misses ? 36750
Partials ? 936
☔ View full report in Codecov by Sentry.
Thanks Romain! Addressed in fed2c78:
All 77 serializer/pickle/lazy/integration unit tests pass.
Force-pushed from fed2c78 to b6f6cf4.
Force-pushed from 813442e to f062b36.
blob = pickle.dumps(obj, protocol=4)
return (
    [SerializedBlob(blob, is_reference=False)],
    SerializationMetadata(
        obj_type=str(type(obj)),
        size=len(blob),
        encoding="pickle-v4",
        serializer_info={},
    ),
)
Protocol-2 fallback removed — behavioral regression for some objects
The old save_artifacts caught SystemError and OverflowError from pickle.dumps(protocol=4) and automatically retried with protocol=2. That fallback is gone here, so any object that triggers one of those errors under protocol=4 (e.g., certain C-extension types or deeply nested structures) will now raise a raw SystemError/OverflowError instead of being serialized successfully. Artifacts that Metaflow previously saved without issue may now fail to save.
@classmethod
def serialize(cls, obj, format=SerializationFormat.STORAGE):
    if format == SerializationFormat.WIRE:
        raise NotImplementedError(
            "PickleSerializer does not support the WIRE format; pickle "
            "produces opaque binary bytes that are not safe to pass as "
            "CLI args or inline IPC payloads."
        )
    try:
        blob = pickle.dumps(obj, protocol=4)
        encoding = "pickle-v4"
    except (SystemError, OverflowError):
        blob = pickle.dumps(obj, protocol=2)
        encoding = "pickle-v2"
    return (
        [SerializedBlob(blob, is_reference=False)],
        SerializationMetadata(
            obj_type=str(type(obj)),
            size=len(blob),
            encoding=encoding,
            serializer_info={},
        ),
    )
actually strike that, I'm forgetting the context for keeping the protocol-2 encoding here but it seems to have been tied to python <=3.6 so dropping the fallback should be fine at this point.
return not self.is_reference


class SerializerStore(ABCMeta):
Would this serializer store be accessible from other metaflow plugins? I'm working on updating the checkpoint decorator and it'd be super useful if we could re-use this to serialize artifacts for checkpointing purposes!
We'd have to be able to (1) access the serializations for an artifact someone wants to checkpoint, and (2) save the artifacts in a different location from the usual metaflow artifacts.
Yes, that's exactly what the decoupling enables. The serializer produces (blobs, metadata) independently of where you persist them:
from metaflow.datastore.artifacts.serializer import SerializerStore

# (1) serialize
for s in SerializerStore.get_ordered_serializers():
    if s.can_serialize(obj):
        blobs, metadata = s.serialize(obj)
        break

# (2) deserialize from your own storage
for s in SerializerStore.get_ordered_serializers():
    if s.can_deserialize(metadata):
        return s.deserialize([blob], metadata)

You hand blobs to whatever storage you want (your own S3 path, a separate CAS, etc.).
nice! this is great, thanks for the detailed example!
except TypeError as e:
    # Preserve the historical "couldn't pickle this" wrapper so
    # existing consumers still see ``UnpicklableArtifactException``.
    raise UnpicklableArtifactException(name) from e
couldn't a non-pickle serializer also raise a TypeError? One option could be to treat any raised MetaflowException as pass-through, and have the pickle serializer raise the UnpicklableArtifactException instead?
I think the idea was to try to return the same exception as before. I think what we could do is specially catch UnpicklableArtifactException (and yes, raise it directly in the pickle serializer) and for everything else, fall through to the DataException which would maintain backwards compatibility with exception types. Would that work for you?
Moved pickle.dumps's TypeError handling into PickleSerializer itself. It now raises UnpicklableArtifactException (with artifact_name optional). TaskDataStore.save_artifacts catches it specifically to re-attach the artifact name, lets any other MetaflowException pass through unchanged, and only wraps non-MetaflowException failures in DataException for context. Added three tests covering the three branches.
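A rough sketch of the resulting three branches in ``save_artifacts``. The exception classes are the ones named in this thread; the surrounding variable names and the error message are placeholders, not the actual code.

try:
    blobs, metadata = serializer.serialize(obj, format=SerializationFormat.STORAGE)
except UnpicklableArtifactException as e:
    # raised by PickleSerializer itself; re-raise with the artifact name attached
    raise UnpicklableArtifactException(name) from e
except MetaflowException:
    # custom serializers' own Metaflow exceptions pass through unchanged
    raise
except Exception as e:
    # anything else is wrapped so the failure carries artifact + serializer context
    raise DataException(
        "Serializer %s failed while saving artifact '%s'" % (serializer.__name__, name)
    ) from e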
Introduce ArtifactSerializer ABC with priority-based dispatch, enabling custom serializers to be plugged in alongside the default pickle path. This is the foundation for the IOType system which adds typed serialization for Metaflow artifacts.

New abstractions:
- ArtifactSerializer: base class with can_serialize/can_deserialize/serialize/deserialize interface
- SerializerStore: metaclass for auto-registration and deterministic priority-ordered dispatch
- SerializationMetadata: namedtuple for artifact metadata routing
- SerializedBlob: supports both new bytes and references to already-stored data
- PickleSerializer: universal fallback (PRIORITY=9999) wrapping existing pickle logic

No existing code is modified. These are inert until wired into TaskDataStore in a follow-up commit.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
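To make the interface concrete, a sketch of a custom subclass. The field names on SerializedBlob/SerializationMetadata follow the diff hunks shown in this PR; the JSON serializer itself, its TYPE/PRIORITY values, and its "json-v1" encoding label are made up for illustration.

import json

class JsonListDictSerializer(ArtifactSerializer):
    TYPE = "json"
    PRIORITY = 100  # consulted before the PickleSerializer fallback at 9999

    @classmethod
    def can_serialize(cls, obj):
        return isinstance(obj, (dict, list))

    @classmethod
    def can_deserialize(cls, metadata):
        # routing happens on the encoding recorded at save time
        return metadata.encoding == "json-v1"

    @classmethod
    def serialize(cls, obj):
        blob = json.dumps(obj).encode("utf-8")
        return (
            [SerializedBlob(blob, is_reference=False)],
            SerializationMetadata(
                obj_type=str(type(obj)),
                size=len(blob),
                encoding="json-v1",
                serializer_info={},
            ),
        )

    @classmethod
    def deserialize(cls, blobs, metadata):
        return json.loads(blobs[0])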
Register PickleSerializer through the standard Metaflow plugin
mechanism so extensions (e.g., mli-metaflow-custom) can add their
own serializers via ARTIFACT_SERIALIZERS_DESC.
- Add artifact_serializer category to _plugin_categories
- Add ARTIFACT_SERIALIZERS_DESC with PickleSerializer in plugins/__init__.py
- Resolve via ARTIFACT_SERIALIZERS = resolve_plugins("artifact_serializer")
Importing metaflow.plugins now triggers PickleSerializer registration
in SerializerStore, ensuring the registry is populated before
TaskDataStore needs it (commit 3).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
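Roughly what the wiring looks like; only the names ARTIFACT_SERIALIZERS_DESC, resolve_plugins("artifact_serializer"), and PickleSerializer come from this commit message, while the module path and exact entry shape below are assumptions.

# metaflow/plugins/__init__.py (sketch)
ARTIFACT_SERIALIZERS_DESC = [
    # (name, class path); extensions append their own entries here
    ("pickle", "metaflow.datastore.artifacts.serializer.PickleSerializer"),
]

ARTIFACT_SERIALIZERS = resolve_plugins("artifact_serializer")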
save_artifacts now loops through priority-ordered serializers to find
one that can_serialize the object. load_artifacts routes deserialization
through metadata.encoding via can_deserialize. PickleSerializer handles
all existing Python objects as the universal fallback.
Behavioral changes:
- New artifacts get encoding "pickle-v4" (was "gzip+pickle-v4")
- _info[name] gains optional "serializer_info" dict for custom serializers
- Removed hardcoded pickle import from task_datastore.py
Backward compatible:
- Old artifacts with "gzip+pickle-v2" or "gzip+pickle-v4" encoding
load correctly (PickleSerializer.can_deserialize handles both)
- Missing encoding defaults to "gzip+pickle-v2"
- Missing serializer_info defaults to {}
- _objects[name] stays as single string (multi-blob deferred to PR 1.5)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
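A sketch of the load-side routing and the backward-compat defaults described above. The _info keys "type", "encoding", and "serializer_info" are named in this PR; "size", the local variable names, and the surrounding code shape are placeholders.

# Rebuild routing metadata from the per-artifact _info entry, applying the
# backward-compat defaults for artifacts written before this change.
info = self._info[name]
metadata = SerializationMetadata(
    obj_type=info.get("type"),
    size=info.get("size", 0),
    encoding=info.get("encoding", "gzip+pickle-v2"),
    serializer_info=info.get("serializer_info", {}),
)
for serializer in SerializerStore.get_ordered_serializers():
    if serializer.can_deserialize(metadata):
        obj = serializer.deserialize([blob], metadata)
        break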
- SerializerStore now extends ABCMeta (not type) so @abstractmethod is enforced — incomplete subclasses raise TypeError at definition
- Validate blobs non-empty before accessing blobs[0] in save_artifacts
- Add NOTE on compress_method not yet wired into save path

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Wire vs storage format
- Added WIRE and STORAGE constants in metaflow.datastore.artifacts.serializer.
- ArtifactSerializer.serialize/deserialize now accept a ``format`` kwarg so a single class can own both the storage path (datastore blobs + metadata) and the wire path (string for CLI args, protobuf payloads, cross-process IPC).
- PickleSerializer implements STORAGE only; WIRE raises NotImplementedError with an explanation (pickle bytes are not safe as a wire payload).
- Serializers that want wire support implement it on the same class; there's no need for a second class per format.

Lazy import-hook registry
- New module metaflow/datastore/artifacts/lazy_registry.py with SerializerConfig, an importlib.abc.MetaPathFinder interceptor, and public register_serializer_for_type / load_serializer_class entry points.
- If the target type's module is already in sys.modules, registration is immediate. Otherwise a hook is installed on sys.meta_path and registration fires the first time the user's code imports the target module. This defers the cost of serializer-module imports (torch, pyarrow, fastavro, ...) until those dependencies are actually in play. (Usage sketched after this commit message.)
- find_spec temporarily removes the interceptor from sys.meta_path during its lookup to avoid recursion.

Review nits
- serializer.py: rename SerializationMetadata.type field to obj_type to avoid shadowing the type() builtin (dict key inside _info stays "type" for backward compatibility with existing datastores).
- serializer.py: SerializerStore skips TYPE=None AND any subclass that is still abstract, via inspect.isabstract().
- serializer.py: get_ordered_serializers memoizes the sorted list and invalidates on new registration. Drops the O(n²) list.index tiebreaker — Python 3.7+ dicts preserve insertion order.
- task_datastore.py: lift SerializerStore / SerializationMetadata imports to module top (were on hot paths: __init__ and load_artifacts).
- task_datastore.py: the "no serializer claimed this artifact" branch now raises DataException with a message pointing at the PickleSerializer fallback invariant, instead of the misleading UnpicklableArtifactException.
- task_datastore.py: "no deserializer claimed this artifact" now looks up ``serializer_info["source"]`` and hints at a missing extension.
- SerializedBlob.compress_method: removed. It was documented as "not wired into the save path" — shipping an unwired knob invites extension authors to rely on it and discover it is a no-op. Will come back with its consuming code in a later change.
- serialize() docstring rewritten: the side-effect-free contract is there to support retries, caching, and parallel dispatch, and to keep serializers testable; I/O belongs in hooks.

Tests
- New test/unit/test_lazy_serializer_registry.py (9 tests covering config validation, eager registration for already-imported types, deferred registration through the import hook, and the recursion guard).
- test_artifact_serializer.py gains format-dispatch coverage (STORAGE and WIRE round-trip on a toy dual-format serializer; PickleSerializer raises on WIRE).
- Removed tests for compress_method and adjusted tests that poked at _registration_order so they go through the public API only.
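Usage sketch for the lazy import-hook registry described in this commit message. The entry-point name and its semantics come from the commit text; the argument strings are illustrative.

from metaflow.datastore.artifacts.lazy_registry import register_serializer_for_type

# If torch is already in sys.modules this loads the serializer class right away;
# otherwise a sys.meta_path hook waits for the user's own `import torch` and only
# then imports and registers my_ext.serializers.TorchSerializer.
register_serializer_for_type("torch.Tensor", "my_ext.serializers.TorchSerializer")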
- Convert STORAGE/WIRE module constants to a SerializationFormat enum (str-backed so existing equality checks keep working; sketched below). Module-level STORAGE/WIRE aliases are dropped.
- Add Python type hints to can_serialize, can_deserialize, serialize, and deserialize so editors surface the types documented in docstrings.
- Drop the unused `context` parameter from deserialize() across the ABC, PickleSerializer, the TaskDataStore call site, and tests.
- Move the `_serializers` property explanation onto the property itself; leave only a short note next to `_serializers_override`.
- save_artifacts now validates blob count BEFORE mutating `_info[name]`, so a rejected artifact never leaves partial metadata behind.
- Multi-blob error message now points users at the Metaflow team for multi-blob use cases.
- "No deserializer claimed artifact" error now includes the full serializer_info dict in addition to the source hint.
- SerializerStore.get_ordered_serializers tracks already-materialized lazy classes so the ordered cache is only rebuilt when a new lazy class actually becomes importable — previously, any registered lazy config forced a rebuild on every call.

Tests
- New regression test for the ordered-cache steady-state behavior.
- New regression tests verifying `_info[name]` is not populated when save_artifacts rejects empty or multi-blob serializer output.
- Updated existing tests for the enum, type hints, and context removal.
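What a str-backed format enum looks like; only the member names and the str-backing are stated above, and the member values here are assumptions.

from enum import Enum

class SerializationFormat(str, Enum):
    STORAGE = "storage"
    WIRE = "wire"

# str-backed, so existing equality checks against plain strings keep working
assert SerializationFormat.STORAGE == "storage"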
Replace the two-path registration scheme (ARTIFACT_SERIALIZERS_DESC +
register_serializer_for_type) with a single declarative mechanism.
Extensions ship one serializer file per serializer; heavy imports
live in a setup_imports() classmethod whose ModuleNotFoundError
parks the serializer on an import hook and retries once the awaited
module appears in sys.modules.
Author-facing API
-----------------
class TorchSerializer(ArtifactSerializer):
    TYPE = "torch"
    PRIORITY = 50

    @classmethod
    def setup_imports(cls, context=None):
        cls.lazy_import("torch")
        cls.lazy_import("pyarrow", alias="pa")

    @classmethod
    def can_serialize(cls, obj):
        return isinstance(obj, cls.torch.Tensor)

    ...

# mfextinit_myext.py
ARTIFACT_SERIALIZERS_DESC = [
    ("torch", "my_ext.serializers.torch_serializer.TorchSerializer"),
]
Mechanism
---------
- SerializerStore.bootstrap() walks every extension's
ARTIFACT_SERIALIZERS_DESC directly (the serializer category owns its
own lifecycle rather than flowing through resolve_plugins), applies
ENABLED_ARTIFACT_SERIALIZER +/- toggles, and drives each entry
through a state machine:
known → importing → class_loaded → importing_deps → active
with terminal states `broken` / `disabled` and the retry state
`pending_on_imports`.
- lazy_import(module_path, alias=None) imports the module, stashes it
on cls, returns it. Rejects reserved names (TYPE, PRIORITY, dispatch
methods, leading underscore) and double-assignment within one
setup_imports() call.
- A sys.meta_path interceptor watches the pending modules; when any
imports, SerializerStore._on_module_imported re-runs the lifecycle
for every parked entry waiting on that module. Loop guard: the same
ImportError.name raising twice → broken.
- Dispatch reads _active_serializers only. Exceptions raised by
can_serialize / can_deserialize on a currently-active serializer are
caught, counted in the record's dispatch_error_count, and the
serializer is skipped for that artifact only — preserving the
PickleSerializer fallback.
- On PRIORITY ties, last-registered wins; a lexicographic tiebreak on
class_path provides cross-environment reproducibility.
Diagnostic API
--------------
metaflow.datastore.artifacts.list_serializer_status() returns a list
of per-entry dicts with name, class_path, state, awaiting_modules,
last_error, priority, type, import_trigger, and dispatch_error_count.
Primary use case: answering "why isn't my custom serializer active?"
without reading source.
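A usage sketch of the diagnostic call; the entry point and the record keys are listed above, while the filtering is just an example.

from metaflow.datastore.artifacts import list_serializer_status

for rec in list_serializer_status():
    if rec["state"] != "active":
        # e.g. why a custom serializer is parked: pending_on_imports, broken, disabled
        print(rec["name"], rec["state"], rec["awaiting_modules"], rec["last_error"])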
Removed
-------
- register_serializer_for_type() (public API)
- SerializerConfig, register_serializer_config, iter_registered_configs,
load_serializer_class, plus the backing declarative-config plumbing
inside lazy_registry.py — nothing in the unified path uses them.
- ARTIFACT_SERIALIZERS = resolve_plugins("artifact_serializer") in
metaflow/plugins/__init__.py (no readers in the codebase).
- "artifact_serializer" from _plugin_categories in
metaflow/extension_support/plugins.py.
Testing
-------
- 102 unit tests across test_artifact_serializer, test_pickle_serializer,
test_serializer_integration, test_serializer_lifecycle, and
test_serializer_public_api. Coverage: state-machine transitions,
retry and loop guard, disabled toggle, dispatch exception handling,
lazy_import reserved-name + collision guards, setup_imports signature
compatibility (both def setup_imports(cls) and
def setup_imports(cls, context=None) are supported), subclass
inheritance, and public API surface assertions.
- SerializerStore._reset_for_tests() clears registry + interceptor state
and walks MRO to delattr lazy-imported attributes for test isolation.
When save_artifacts dispatches to a serializer, inject the serializer's source label into ``serializer_info["source"]`` so the "no deserializer claimed artifact" load error can point at the package providing the missing serializer.

The source is derived at bootstrap time: ``"metaflow"`` for core serializers, the extension's ``package_name`` (via ``metaflow.extension_support.plugins.get_modules``) for extensions. Authors who set their own ``source`` in the returned ``serializer_info`` are not overridden, so an extension can publish a different identifier (e.g. a logical name rather than the pip distribution name) by passing it through ``serialize()``. The source is also exposed on each diagnostic record returned by ``list_serializer_status()``.
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
- PickleSerializer.serialize labelled output as "gzip+pickle-v4" but the
serializer only pickles; gzip is applied later by ContentAddressedStore.
Tests assert the serializer reports its own encoding ("pickle-v4"); the
legacy "gzip+pickle-v4" label is kept in _ENCODINGS for backward-compat
reads of master-era artifacts.
- ArtifactSerializer.lazy_import silently no-op'd when an alias was reused
with a different module path. Now: same module_path is still idempotent
(same module's __name__ matches), different module_path raises
ValueError so accidental aliasing fails loudly.
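Illustration of the alias rule above, inside a serializer's setup_imports; the module names are just examples.

@classmethod
def setup_imports(cls, context=None):
    cls.lazy_import("pyarrow", alias="pa")
    cls.lazy_import("pyarrow", alias="pa")   # same module_path again: idempotent no-op
    cls.lazy_import("polars", alias="pa")    # same alias, different module_path: ValueError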
…atch hardening

Round-3 fixes triggered by an external code review:

Import hook (lazy_registry.py)
- _WrappedLoader.exec_module restores module.__spec__.loader to the original on both success and failure, so pkgutil/importlib.resources/test-runners that introspect spec.loader stop seeing the wrapper.
- On exec failure the success callback is skipped and the module is not marked processed, so a future re-import of the same name retries the full path instead of silently no-op'ing.
- _ensure_interceptor_installed() short-circuits when already at sys.meta_path[0], removing the brief "interceptor missing" window the remove+reinsert pattern created.
- The SerializerStore registry callback exception path now emits a RuntimeWarning instead of swallowing silently.

Registry / metaclass (serializer.py)
- SerializerStore.__init__ rejects subclasses that inherit a registered parent's TYPE without declaring their own — would otherwise overwrite the parent entry in _all_serializers and corrupt dispatch.
- _retry_bootstrap mirrors bootstrap_entries: TYPE-vs-name validation, rec.priority/rec.type capture, and last_error reset on success.
- _on_module_imported wraps each per-record retry in its own try/except so a defect in one record cannot strand the others as PENDING.
- bootstrap()'s extension-discovery and config-toggle except blocks now emit RuntimeWarning so a broken extension is visible instead of silently disabling every artifact serializer it ships.
- _reset_for_tests delegates to lazy_registry._reset_for_tests() so both _watched and _processed (and the sys.meta_path entry) are cleared.

Datastore (task_datastore.py)
- save_artifacts catches non-TypeError serializer exceptions and wraps them with artifact-name + serializer-class context via DataException.
- Both save and load paths pass format=SerializationFormat.STORAGE explicitly, so a serializer that overrides the default does not break tuple-unpacking.
- load_artifacts surfaces an _info/_objects divergence as a DataException with context instead of a bare KeyError.
- _record_dispatch_error emits a RuntimeWarning for buggy can_serialize/can_deserialize so silent fallthrough to PickleSerializer is observable.

Documentation
- SerializationMetadata.__doc__ documents the encoding-string contract (describes serializer output, not on-disk format; CAS owns gzip).
- ArtifactSerializer.serialize docstring repeats the no-double-compress rule and references the metadata contract.

Tests
- Updated _JsonStringSerializer.serialize signature to accept the format kwarg, matching the abstract contract now enforced by the framework's explicit format=STORAGE call.
…tegration.py

The round-3 commit landed two lines that fall outside the project's black configuration: the ``inherited_from`` generator and the ``ValueError`` message in ``lazy_import``. Run black==25.12.0 (the version pinned in ``.pre-commit-config.yaml``) to reformat in place. No functional change.
Move TypeError handling out of TaskDataStore's dispatch loop so a non-pickle serializer's TypeError is no longer mislabeled as "unpicklable". PickleSerializer now wraps its own pickle.dumps TypeError into UnpicklableArtifactException; TaskDataStore re-raises it with the artifact name attached and lets any other MetaflowException propagate.
Force-pushed from 9653844 to c79fa79.
Summary

Replace the hardcoded pickle logic in TaskDataStore with a pluggable artifact serializer framework. Existing pickle behavior is preserved exactly — PickleSerializer is a built-in universal fallback so flows that never register a custom serializer behave identically to before.
- ArtifactSerializer ABC with priority-based dispatch, SerializationMetadata namedtuple, SerializedBlob value type.
- SerializerStore metaclass auto-registers subclasses; caches the sorted dispatch list and invalidates on new registration.
- Registration through ARTIFACT_SERIALIZERS_DESC (standard Metaflow plugin pattern).
- Lazy registration defers heavy serializer imports (torch, pyarrow) until the user's own code imports the target type.

Design
- Serializers are ordered by PRIORITY (lower first; ties broken by registration order). On save, the first can_serialize(obj) == True wins. On load, can_deserialize(metadata) routes by the encoding field in artifact metadata.
- serialize(obj, format=...) / deserialize(data, format=..., **kw) accept a format kwarg — STORAGE (default) returns (List[SerializedBlob], SerializationMetadata) for the datastore save path; WIRE returns a str for CLI args, protobuf payloads, and cross-process IPC. One class owns both representations.
- PickleSerializer implements STORAGE only — pickle bytes are not a wire-safe payload — and raises NotImplementedError for WIRE.
- serialize() must not perform I/O or mutate global state: it may be called multiple times (caching, retries, parallel dispatch) and must stay safely idempotent. Side effects belong in hooks, not serializers.
- metaflow.datastore.artifacts.lazy_registry.register_serializer_for_type("torch.Tensor", "my_ext.TorchSerializer") stores a declarative config; an importlib.abc.MetaPathFinder watches for torch to be imported and only then loads the serializer class. If the target module is already in sys.modules, registration is immediate.
- Old artifacts with gzip+pickle-v2 / gzip+pickle-v4 encoding load correctly. Missing encoding defaults to gzip+pickle-v2.

Error paths
- save_artifacts: if no serializer claims an object, raise DataException with a message pointing at the PickleSerializer fallback invariant (instead of the misleading UnpicklableArtifactException for a non-pickle failure).
- load_artifacts: if no deserializer claims an artifact, the error includes serializer_info["source"] when present, hinting at a missing extension.

Test plan
- Unit tests for ArtifactSerializer / SerializerStore / SerializedBlob / SerializationMetadata, including wire/storage dispatch and the ABC's abstract-method enforcement.
- PickleSerializer (round-trips across scalars, containers, custom classes; WIRE rejection; encoding validation).
- TaskDataStore round-trips, custom serializer priority, and backward compat.
- (spin failures remain, unrelated).
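To make the backward-compat bullet under Design concrete, a sketch of PickleSerializer's load-side acceptance. The _ENCODINGS name appears in a commit above; its exact contents and the method body here are assumptions.

class PickleSerializer(ArtifactSerializer):
    ...
    # Encodings this serializer will claim on load. The gzip+ labels are legacy:
    # gzip was applied by the content-addressed store, not by the serializer.
    _ENCODINGS = {"pickle-v4", "gzip+pickle-v4", "gzip+pickle-v2"}

    @classmethod
    def can_deserialize(cls, metadata):
        return metadata.encoding in cls._ENCODINGS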