fix: patch litellm exceptions for cloudpickle serialization#234
fix: patch litellm exceptions for cloudpickle serialization#234
Conversation
🛡️ Jit Security Scan Results✅ No security findings were detected in this PR
Security scan by Jit
|
| except (TypeError, pickle.PicklingError): | ||
| # Fall back to string representation for unpicklable attributes | ||
| state[key] = repr(value) | ||
| return (_reconstruct_litellm_exception, (type(self), state)) |
There was a problem hiding this comment.
Potential Bug: Exception chaining attributes are not preserved
The current implementation only captures __dict__ attributes, but Python exceptions have special attributes for exception chaining that are not stored in __dict__:
__cause__(set byraise ... from ...)__context__(implicit exception context)__traceback__(traceback object)
When these exceptions are pickled and unpickled, this chaining information will be lost, making debugging more difficult.
Suggested improvement:
def _litellm_reduce(self):
"""Custom __reduce__ for litellm exceptions.
Captures __dict__ state, filtering out any attributes that themselves
cannot be pickled (e.g. httpx connection pools from real responses).
Also preserves exception chaining attributes.
"""
import pickle
state = {}
for key, value in self.__dict__.items():
try:
pickle.dumps(value)
state[key] = value
except (TypeError, pickle.PicklingError):
# Fall back to string representation for unpicklable attributes
state[key] = repr(value)
# Preserve exception chaining and args
state['_exc_args'] = self.args
state['_exc_cause'] = self.__cause__
state['_exc_context'] = self.__context__
# Note: __traceback__ is typically not pickled as it contains frame objects
return (_reconstruct_litellm_exception, (type(self), state))And update the reconstruction function:
def _reconstruct_litellm_exception(cls, state):
"""Reconstruct a litellm exception without calling __init__.
Uses Exception.__new__ to create the instance, then restores __dict__
and sets Exception.args for proper str() representation.
"""
exc = Exception.__new__(cls)
# Extract special exception attributes before updating __dict__
exc_args = state.pop('_exc_args', (state.get('message', ''),))
exc_cause = state.pop('_exc_cause', None)
exc_context = state.pop('_exc_context', None)
exc.__dict__.update(state)
exc.args = exc_args
exc.__cause__ = exc_cause
exc.__context__ = exc_context
return excThis ensures that exception chains are preserved, which is important for debugging background task failures.
There was a problem hiding this comment.
Pull request overview
This PR addresses a Docket/cloudpickle interoperability issue where LiteLLM exception instances can’t be reliably deserialized (due to required __init__ positional args), causing failed worker tasks to effectively “disappear” instead of being reported.
Changes:
- Adds a monkey-patch module that implements a custom pickle reduction path for LiteLLM exception classes (bypassing
__init__on unpickle). - Ensures the patch is applied by importing it during Docket task module import.
- Introduces a dedicated test suite to validate cloudpickle serialization of task arguments and multiple exception types (including patched LiteLLM exceptions).
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
agent_memory_server/litellm_pickle_compat.py |
Introduces the LiteLLM exception __reduce__ patch + reconstruction helper for cloudpickle round-tripping. |
agent_memory_server/docket_tasks.py |
Imports the compat module so the patch is applied when the worker loads task definitions. |
tests/test_docket_serialization.py |
Adds tests for argument serialization, baseline exception serialization, and the LiteLLM exception bug + patch verification. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, have a team admin enable autofix in the Cursor dashboard.
0ac9bc9 to
708f91a
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated 8 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| _LITELLM_EXCEPTION_CLASSES = [ | ||
| litellm.exceptions.AuthenticationError, | ||
| litellm.exceptions.NotFoundError, | ||
| litellm.exceptions.BadRequestError, | ||
| litellm.exceptions.UnprocessableEntityError, | ||
| litellm.exceptions.Timeout, | ||
| litellm.exceptions.PermissionDeniedError, | ||
| litellm.exceptions.RateLimitError, | ||
| litellm.exceptions.ContextWindowExceededError, | ||
| litellm.exceptions.ContentPolicyViolationError, | ||
| litellm.exceptions.ServiceUnavailableError, | ||
| litellm.exceptions.BadGatewayError, | ||
| litellm.exceptions.InternalServerError, | ||
| litellm.exceptions.APIError, | ||
| litellm.exceptions.APIConnectionError, | ||
| litellm.exceptions.APIResponseValidationError, |
There was a problem hiding this comment.
_LITELLM_EXCEPTION_CLASSES is built via direct attribute access at import time. Because pyproject.toml allows litellm>=1.80.11 (no upper bound), a future LiteLLM release that renames/removes any of these exception classes would raise AttributeError during module import and could prevent the API server/worker from starting. Consider building this list defensively (e.g., getattr with a default + filter Nones, or discovering exception subclasses dynamically) so missing classes don’t crash startup.
| _LITELLM_EXCEPTION_CLASSES = [ | |
| litellm.exceptions.AuthenticationError, | |
| litellm.exceptions.NotFoundError, | |
| litellm.exceptions.BadRequestError, | |
| litellm.exceptions.UnprocessableEntityError, | |
| litellm.exceptions.Timeout, | |
| litellm.exceptions.PermissionDeniedError, | |
| litellm.exceptions.RateLimitError, | |
| litellm.exceptions.ContextWindowExceededError, | |
| litellm.exceptions.ContentPolicyViolationError, | |
| litellm.exceptions.ServiceUnavailableError, | |
| litellm.exceptions.BadGatewayError, | |
| litellm.exceptions.InternalServerError, | |
| litellm.exceptions.APIError, | |
| litellm.exceptions.APIConnectionError, | |
| litellm.exceptions.APIResponseValidationError, | |
| _LITELLM_EXCEPTION_CLASS_NAMES = [ | |
| "AuthenticationError", | |
| "NotFoundError", | |
| "BadRequestError", | |
| "UnprocessableEntityError", | |
| "Timeout", | |
| "PermissionDeniedError", | |
| "RateLimitError", | |
| "ContextWindowExceededError", | |
| "ContentPolicyViolationError", | |
| "ServiceUnavailableError", | |
| "BadGatewayError", | |
| "InternalServerError", | |
| "APIError", | |
| "APIConnectionError", | |
| "APIResponseValidationError", | |
| ] | |
| _LITELLM_EXCEPTION_CLASSES = [ | |
| cls | |
| for name in _LITELLM_EXCEPTION_CLASS_NAMES | |
| for cls in (getattr(litellm.exceptions, name, None),) | |
| if isinstance(cls, type) and issubclass(cls, Exception) |
| import pickle | ||
|
|
||
| state = {} | ||
| for key, value in self.__dict__.items(): | ||
| try: | ||
| pickle.dumps(value) | ||
| state[key] = value | ||
| except Exception: | ||
| # Fall back to string representation for any unpicklable attributes | ||
| state[key] = repr(value) | ||
|
|
There was a problem hiding this comment.
The picklability check uses stdlib pickle.dumps(), but Docket uses cloudpickle. Some values may be picklable by cloudpickle but not by pickle, causing avoidable loss of state by converting them to repr(). Using cloudpickle.dumps() for the check (or otherwise matching the actual serializer) would preserve more exception context.
| # Extract special exception attributes before updating __dict__ | ||
| exc_args = state.pop("_exc_args", (state.get("message", ""),)) | ||
| exc_cause = state.pop("_exc_cause", None) | ||
| exc_context = state.pop("_exc_context", None) | ||
|
|
||
| exc.__dict__.update(state) | ||
| exc.args = exc_args | ||
| exc.__cause__ = exc_cause | ||
| exc.__context__ = exc_context | ||
| return exc |
There was a problem hiding this comment.
Chaining restoration handles cause and context, but it doesn’t preserve suppress_context. If suppress_context was set on the original exception, that semantic will be lost after unpickling. Consider capturing/restoring suppress_context alongside the other chaining attributes.
| if not hasattr(cls, "_pickle_patched"): | ||
| cls.__reduce__ = _litellm_reduce | ||
| cls._pickle_patched = True |
There was a problem hiding this comment.
patch() skips patching if the class merely has an attribute named _pickle_patched, even if it’s False or set by LiteLLM for another purpose. Using getattr(cls, "_pickle_patched", False) (or a less collision-prone sentinel name) would make the idempotence check more reliable.
| if not hasattr(cls, "_pickle_patched"): | |
| cls.__reduce__ = _litellm_reduce | |
| cls._pickle_patched = True | |
| # Use a less collision-prone attribute name and a value-based check. | |
| if getattr(cls, "_docket_pickle_patched", False): | |
| continue | |
| cls.__reduce__ = _litellm_reduce | |
| cls._docket_pickle_patched = True |
| from docket import Docket | ||
|
|
||
| import agent_memory_server.litellm_pickle_compat # noqa: F401 — patches litellm exceptions for cloudpickle | ||
| from agent_memory_server.config import settings |
There was a problem hiding this comment.
Importing litellm_pickle_compat at module import time means the LiteLLM monkey-patch runs whenever agent_memory_server.docket_tasks is imported (agent_memory_server/main.py imports register_tasks unconditionally), not just in the Docket worker. If the intent is to scope this to workers or only when settings.use_docket is enabled, consider moving this import/patch inside the worker entrypoint and/or inside register_tasks() after the use_docket guard.
| try: | ||
| MemoryMessage(role=123, content=456) # type: ignore | ||
| except Exception as e: | ||
| try: | ||
| data = cloudpickle.dumps(e) | ||
| cloudpickle.loads(data) | ||
| except TypeError as pickle_err: | ||
| pytest.fail( | ||
| f"MemoryMessage validation exception cannot be pickled: {pickle_err}" | ||
| ) | ||
|
|
There was a problem hiding this comment.
If MemoryMessage(role=123, content=456) does not raise (e.g., due to coercion changes in Pydantic), this test will silently pass without asserting anything. Consider using pytest.raises(...) to assert the validation error occurs, then verify that exception can be cloudpickled.
| try: | |
| MemoryMessage(role=123, content=456) # type: ignore | |
| except Exception as e: | |
| try: | |
| data = cloudpickle.dumps(e) | |
| cloudpickle.loads(data) | |
| except TypeError as pickle_err: | |
| pytest.fail( | |
| f"MemoryMessage validation exception cannot be pickled: {pickle_err}" | |
| ) | |
| with pytest.raises(Exception) as excinfo: | |
| MemoryMessage(role=123, content=456) # type: ignore | |
| e = excinfo.value | |
| try: | |
| data = cloudpickle.dumps(e) | |
| cloudpickle.loads(data) | |
| except TypeError as pickle_err: | |
| pytest.fail( | |
| f"MemoryMessage validation exception cannot be pickled: {pickle_err}" | |
| ) |
| with pytest.raises(TypeError, match="missing.*required"): | ||
| exc_class() |
There was a problem hiding this comment.
This assertion hard-codes the current LiteLLM behavior that these exception classes require positional args. Since the dependency is unbounded (litellm>=1.80.11), an upstream fix that makes init no-arg-safe would cause this test to fail even though the serialization behavior is fine. Consider skipping/relaxing this test when the signature indicates defaultable args, or pinning an upper bound on LiteLLM if this behavior is relied upon.
| exc = _make_litellm_exc(exc_class) | ||
| data = cloudpickle.dumps(exc) | ||
| restored = cloudpickle.loads(data) | ||
| assert isinstance(restored, Exception) |
There was a problem hiding this comment.
Since the patch reconstructs using type(self), the restored exception should be an instance of exc_class. Asserting only Exception is too weak and could miss regressions where the wrong class is reconstructed; consider asserting isinstance(restored, exc_class).
| assert isinstance(restored, Exception) | |
| assert isinstance(restored, exc_class) |
tylerhutcherson
left a comment
There was a problem hiding this comment.
This might be too complex here. Is there a way we could catch LiteLLM exceptions on the worker process and then use some approach to raise our own exception class (with an encoded message) and pass that back to the task managers instead of hacking the litellm exception serializers and using cloudpickle. Also for failures on LiteLLM/LLM extraction, where do those surface to clients of AMS?

Summary
Closes #231
Docket worker tasks that fail with LiteLLM exceptions (rate limit, timeout, connection error, etc.) silently disappear because cloudpickle cannot deserialize them. LiteLLM exception classes require positional args (
message,model,llm_provider) in__init__, but cloudpickle calls__init__()with no args during deserialization, causingTypeError.agent_memory_server/litellm_pickle_compat.py— patches__reduce__on all 15 LiteLLM exception classes to bypass__init__during deserialization, usingException.__new__()+__dict__restorationagent_memory_server/docket_tasks.py— imports the patch module at worker startup so it's applied before any task can failtests/test_docket_serialization.py— comprehensive tests proving the bug and verifying the fixTest plan
TestTaskArgumentSerialization— verifies MemoryRecord, MemoryMessage, and lists serialize through cloudpickleTestExceptionSerializationBaseline— verifies standard Python/httpx exceptions serialize correctly (baseline sanity)TestLiteLLMExceptionBugProof— proves the underlying bug: LiteLLM__init__requires positional args that cloudpickle doesn't preserveTestLiteLLMExceptionPatched— verifies all 15 LiteLLM exception classes roundtrip through cloudpickle after patching, preserving message, model, llm_provider, and status_codeNote
Medium Risk
Medium risk because it monkey-patches third-party LiteLLM exception classes globally via
__reduce__, which could affect serialization/representation across the process. The change is scoped to error (de)serialization paths and is covered by targeted tests.Overview
Fixes a Docket failure mode where tasks that raise LiteLLM exceptions can’t be deserialized from the result queue (due to required
__init__args), causing failures to disappear.Adds
agent_memory_server/litellm_pickle_compat.pyto monkey-patchlitellm.exceptions.*with a custom__reduce__that reconstructs viaException.__new__, restores safe picklable state (and preserves chaining when possible), and auto-applies on import;docket_tasks.pyimports this module to enable the patch for workers.Introduces
tests/test_docket_serialization.pyto validate cloudpickle round-tripping for key task args and to prove/guard the LiteLLM exception serialization fix across the supported exception types.Written by Cursor Bugbot for commit 708f91a. This will update automatically on new commits. Configure here.