fix: patch litellm exceptions for cloudpickle serialization by bsbodden · Pull Request #234 · redis/agent-memory-server

bsbodden · 2026-03-20T18:53:43Z

Summary

Closes #231

Docket worker tasks that fail with LiteLLM exceptions (rate limit, timeout, connection error, etc.) silently disappear because cloudpickle cannot deserialize them. LiteLLM exception classes require positional args (message, model, llm_provider) in __init__, but cloudpickle calls __init__() with no args during deserialization, causing TypeError.

agent_memory_server/litellm_pickle_compat.py — patches __reduce__ on all 15 LiteLLM exception classes to bypass __init__ during deserialization, using Exception.__new__() + __dict__ restoration
agent_memory_server/docket_tasks.py — imports the patch module at worker startup so it's applied before any task can fail
tests/test_docket_serialization.py — comprehensive tests proving the bug and verifying the fix

Test plan

TestTaskArgumentSerialization — verifies MemoryRecord, MemoryMessage, and lists serialize through cloudpickle
TestExceptionSerializationBaseline — verifies standard Python/httpx exceptions serialize correctly (baseline sanity)
TestLiteLLMExceptionBugProof — proves the underlying bug: LiteLLM __init__ requires positional args that cloudpickle doesn't preserve
TestLiteLLMExceptionPatched — verifies all 15 LiteLLM exception classes roundtrip through cloudpickle after patching, preserving message, model, llm_provider, and status_code
Full test suite passes (805 passed, 105 skipped)

Note

Medium Risk
Medium risk because it monkey-patches third-party LiteLLM exception classes globally via __reduce__, which could affect serialization/representation across the process. The change is scoped to error (de)serialization paths and is covered by targeted tests.

Overview
Fixes a Docket failure mode where tasks that raise LiteLLM exceptions can’t be deserialized from the result queue (due to required __init__ args), causing failures to disappear.

Adds agent_memory_server/litellm_pickle_compat.py to monkey-patch litellm.exceptions.* with a custom __reduce__ that reconstructs via Exception.__new__, restores safe picklable state (and preserves chaining when possible), and auto-applies on import; docket_tasks.py imports this module to enable the patch for workers.

Introduces tests/test_docket_serialization.py to validate cloudpickle round-tripping for key task args and to prove/guard the LiteLLM exception serialization fix across the supported exception types.

^{Written by Cursor Bugbot for commit 708f91a. This will update automatically on new commits. Configure here.}

jit-ci · 2026-03-20T18:55:06Z

🛡️ Jit Security Scan Results

✅ No security findings were detected in this PR

^{Security scan by Jit}

github-actions

Review completed. I found one potential issue with exception chaining that should be addressed.

🤖 Automated review complete. Please react with 👍 or 👎 on the individual review comments to provide feedback on their usefulness.

github-actions · 2026-03-20T18:55:26Z

agent_memory_server/litellm_pickle_compat.py

+        except (TypeError, pickle.PicklingError):
+            # Fall back to string representation for unpicklable attributes
+            state[key] = repr(value)
+    return (_reconstruct_litellm_exception, (type(self), state))


Potential Bug: Exception chaining attributes are not preserved

The current implementation only captures __dict__ attributes, but Python exceptions have special attributes for exception chaining that are not stored in __dict__:

__cause__ (set by raise ... from ...)

__context__ (implicit exception context)

__traceback__ (traceback object)

When these exceptions are pickled and unpickled, this chaining information will be lost, making debugging more difficult.

Suggested improvement:

def _litellm_reduce(self): """Custom __reduce__ for litellm exceptions. Captures __dict__ state, filtering out any attributes that themselves cannot be pickled (e.g. httpx connection pools from real responses). Also preserves exception chaining attributes. """ import pickle state = {} for key, value in self.__dict__.items(): try: pickle.dumps(value) state[key] = value except (TypeError, pickle.PicklingError): # Fall back to string representation for unpicklable attributes state[key] = repr(value) # Preserve exception chaining and args state['_exc_args'] = self.args state['_exc_cause'] = self.__cause__ state['_exc_context'] = self.__context__ # Note: __traceback__ is typically not pickled as it contains frame objects return (_reconstruct_litellm_exception, (type(self), state))

And update the reconstruction function:

def _reconstruct_litellm_exception(cls, state): """Reconstruct a litellm exception without calling __init__. Uses Exception.__new__ to create the instance, then restores __dict__ and sets Exception.args for proper str() representation. """ exc = Exception.__new__(cls) # Extract special exception attributes before updating __dict__ exc_args = state.pop('_exc_args', (state.get('message', ''),)) exc_cause = state.pop('_exc_cause', None) exc_context = state.pop('_exc_context', None) exc.__dict__.update(state) exc.args = exc_args exc.__cause__ = exc_cause exc.__context__ = exc_context return exc

This ensures that exception chains are preserved, which is important for debugging background task failures.

Copilot

Pull request overview

This PR addresses a Docket/cloudpickle interoperability issue where LiteLLM exception instances can’t be reliably deserialized (due to required __init__ positional args), causing failed worker tasks to effectively “disappear” instead of being reported.

Changes:

Adds a monkey-patch module that implements a custom pickle reduction path for LiteLLM exception classes (bypassing __init__ on unpickle).
Ensures the patch is applied by importing it during Docket task module import.
Introduces a dedicated test suite to validate cloudpickle serialization of task arguments and multiple exception types (including patched LiteLLM exceptions).

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File	Description
`agent_memory_server/litellm_pickle_compat.py`	Introduces the LiteLLM exception `__reduce__` patch + reconstruction helper for cloudpickle round-tripping.
`agent_memory_server/docket_tasks.py`	Imports the compat module so the patch is applied when the worker loads task definitions.
`tests/test_docket_serialization.py`	Adds tests for argument serialization, baseline exception serialization, and the LiteLLM exception bug + patch verification.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

agent_memory_server/litellm_pickle_compat.py

tests/test_docket_serialization.py

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

^{Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, have a team admin enable autofix in the Cursor dashboard.}

agent_memory_server/litellm_pickle_compat.py

Copilot

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 8 comments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-03-20T22:14:14Z

agent_memory_server/litellm_pickle_compat.py

+_LITELLM_EXCEPTION_CLASSES = [
+    litellm.exceptions.AuthenticationError,
+    litellm.exceptions.NotFoundError,
+    litellm.exceptions.BadRequestError,
+    litellm.exceptions.UnprocessableEntityError,
+    litellm.exceptions.Timeout,
+    litellm.exceptions.PermissionDeniedError,
+    litellm.exceptions.RateLimitError,
+    litellm.exceptions.ContextWindowExceededError,
+    litellm.exceptions.ContentPolicyViolationError,
+    litellm.exceptions.ServiceUnavailableError,
+    litellm.exceptions.BadGatewayError,
+    litellm.exceptions.InternalServerError,
+    litellm.exceptions.APIError,
+    litellm.exceptions.APIConnectionError,
+    litellm.exceptions.APIResponseValidationError,


_LITELLM_EXCEPTION_CLASSES is built via direct attribute access at import time. Because pyproject.toml allows litellm>=1.80.11 (no upper bound), a future LiteLLM release that renames/removes any of these exception classes would raise AttributeError during module import and could prevent the API server/worker from starting. Consider building this list defensively (e.g., getattr with a default + filter Nones, or discovering exception subclasses dynamically) so missing classes don’t crash startup.

Suggested change

_LITELLM_EXCEPTION_CLASSES = [

litellm.exceptions.AuthenticationError,

litellm.exceptions.NotFoundError,

litellm.exceptions.BadRequestError,

litellm.exceptions.UnprocessableEntityError,

litellm.exceptions.Timeout,

litellm.exceptions.PermissionDeniedError,

litellm.exceptions.RateLimitError,

litellm.exceptions.ContextWindowExceededError,

litellm.exceptions.ContentPolicyViolationError,

litellm.exceptions.ServiceUnavailableError,

litellm.exceptions.BadGatewayError,

litellm.exceptions.InternalServerError,

litellm.exceptions.APIError,

litellm.exceptions.APIConnectionError,

litellm.exceptions.APIResponseValidationError,

_LITELLM_EXCEPTION_CLASS_NAMES = [

"AuthenticationError",

"NotFoundError",

"BadRequestError",

"UnprocessableEntityError",

"Timeout",

"PermissionDeniedError",

"RateLimitError",

"ContextWindowExceededError",

"ContentPolicyViolationError",

"ServiceUnavailableError",

"BadGatewayError",

"InternalServerError",

"APIError",

"APIConnectionError",

"APIResponseValidationError",

]

_LITELLM_EXCEPTION_CLASSES = [

cls

for name in _LITELLM_EXCEPTION_CLASS_NAMES

for cls in (getattr(litellm.exceptions, name, None),)

if isinstance(cls, type) and issubclass(cls, Exception)

Copilot · 2026-03-20T22:14:15Z

agent_memory_server/litellm_pickle_compat.py

+    import pickle
+
+    state = {}
+    for key, value in self.__dict__.items():
+        try:
+            pickle.dumps(value)
+            state[key] = value
+        except Exception:
+            # Fall back to string representation for any unpicklable attributes
+            state[key] = repr(value)
+


The picklability check uses stdlib pickle.dumps(), but Docket uses cloudpickle. Some values may be picklable by cloudpickle but not by pickle, causing avoidable loss of state by converting them to repr(). Using cloudpickle.dumps() for the check (or otherwise matching the actual serializer) would preserve more exception context.

Copilot · 2026-03-20T22:14:15Z

agent_memory_server/litellm_pickle_compat.py

+    # Extract special exception attributes before updating __dict__
+    exc_args = state.pop("_exc_args", (state.get("message", ""),))
+    exc_cause = state.pop("_exc_cause", None)
+    exc_context = state.pop("_exc_context", None)
+
+    exc.__dict__.update(state)
+    exc.args = exc_args
+    exc.__cause__ = exc_cause
+    exc.__context__ = exc_context
+    return exc


Chaining restoration handles cause and context, but it doesn’t preserve suppress_context. If suppress_context was set on the original exception, that semantic will be lost after unpickling. Consider capturing/restoring suppress_context alongside the other chaining attributes.

Copilot · 2026-03-20T22:14:15Z

agent_memory_server/litellm_pickle_compat.py

+        if not hasattr(cls, "_pickle_patched"):
+            cls.__reduce__ = _litellm_reduce
+            cls._pickle_patched = True


patch() skips patching if the class merely has an attribute named _pickle_patched, even if it’s False or set by LiteLLM for another purpose. Using getattr(cls, "_pickle_patched", False) (or a less collision-prone sentinel name) would make the idempotence check more reliable.

Suggested change

if not hasattr(cls, "_pickle_patched"):

cls.__reduce__ = _litellm_reduce

cls._pickle_patched = True

# Use a less collision-prone attribute name and a value-based check.

if getattr(cls, "_docket_pickle_patched", False):

continue

cls.__reduce__ = _litellm_reduce

cls._docket_pickle_patched = True

Copilot · 2026-03-20T22:14:15Z

agent_memory_server/docket_tasks.py

 from docket import Docket

+import agent_memory_server.litellm_pickle_compat  # noqa: F401 — patches litellm exceptions for cloudpickle
 from agent_memory_server.config import settings


Importing litellm_pickle_compat at module import time means the LiteLLM monkey-patch runs whenever agent_memory_server.docket_tasks is imported (agent_memory_server/main.py imports register_tasks unconditionally), not just in the Docket worker. If the intent is to scope this to workers or only when settings.use_docket is enabled, consider moving this import/patch inside the worker entrypoint and/or inside register_tasks() after the use_docket guard.

Copilot · 2026-03-20T22:14:15Z

tests/test_docket_serialization.py

+        try:
+            MemoryMessage(role=123, content=456)  # type: ignore
+        except Exception as e:
+            try:
+                data = cloudpickle.dumps(e)
+                cloudpickle.loads(data)
+            except TypeError as pickle_err:
+                pytest.fail(
+                    f"MemoryMessage validation exception cannot be pickled: {pickle_err}"
+                )
+


If MemoryMessage(role=123, content=456) does not raise (e.g., due to coercion changes in Pydantic), this test will silently pass without asserting anything. Consider using pytest.raises(...) to assert the validation error occurs, then verify that exception can be cloudpickled.

Suggested change

try:

MemoryMessage(role=123, content=456) # type: ignore

except Exception as e:

try:

data = cloudpickle.dumps(e)

cloudpickle.loads(data)

except TypeError as pickle_err:

pytest.fail(

f"MemoryMessage validation exception cannot be pickled: {pickle_err}"

)

with pytest.raises(Exception) as excinfo:

MemoryMessage(role=123, content=456) # type: ignore

e = excinfo.value

try:

data = cloudpickle.dumps(e)

cloudpickle.loads(data)

except TypeError as pickle_err:

pytest.fail(

f"MemoryMessage validation exception cannot be pickled: {pickle_err}"

)

Copilot · 2026-03-20T22:14:15Z

tests/test_docket_serialization.py

+        with pytest.raises(TypeError, match="missing.*required"):
+            exc_class()


This assertion hard-codes the current LiteLLM behavior that these exception classes require positional args. Since the dependency is unbounded (litellm>=1.80.11), an upstream fix that makes init no-arg-safe would cause this test to fail even though the serialization behavior is fine. Consider skipping/relaxing this test when the signature indicates defaultable args, or pinning an upper bound on LiteLLM if this behavior is relied upon.

Copilot · 2026-03-20T22:14:16Z

tests/test_docket_serialization.py

+        exc = _make_litellm_exc(exc_class)
+        data = cloudpickle.dumps(exc)
+        restored = cloudpickle.loads(data)
+        assert isinstance(restored, Exception)


Since the patch reconstructs using type(self), the restored exception should be an instance of exc_class. Asserting only Exception is too weak and could miss regressions where the wrong class is reconstructed; consider asserting isinstance(restored, exc_class).

Suggested change

assert isinstance(restored, Exception)

assert isinstance(restored, exc_class)

tylerhutcherson

This might be too complex here. Is there a way we could catch LiteLLM exceptions on the worker process and then use some approach to raise our own exception class (with an encoded message) and pass that back to the task managers instead of hacking the litellm exception serializers and using cloudpickle. Also for failures on LiteLLM/LLM extraction, where do those surface to clients of AMS?

Copilot AI review requested due to automatic review settings March 20, 2026 18:53

Copilot started reviewing on behalf of bsbodden March 20, 2026 18:54 View session

github-actions bot reviewed Mar 20, 2026

View reviewed changes

Copilot AI reviewed Mar 20, 2026

View reviewed changes

agent_memory_server/litellm_pickle_compat.py Outdated Show resolved Hide resolved

tests/test_docket_serialization.py Outdated Show resolved Hide resolved

tests/test_docket_serialization.py Outdated Show resolved Hide resolved

cursor bot reviewed Mar 20, 2026

View reviewed changes

agent_memory_server/litellm_pickle_compat.py Outdated Show resolved Hide resolved

bsbodden added 3 commits March 20, 2026 12:14

fix: patch litellm exceptions for cloudpickle serialization

2b7febd

fix: import litellm pickle patch at worker startup

90a8122

test: verify litellm exception cloudpickle roundtripping

708f91a

bsbodden force-pushed the bsb/fix-litellm-pickle-231 branch from 0ac9bc9 to 708f91a Compare March 20, 2026 19:14

bsbodden requested a review from Copilot March 20, 2026 22:10

Copilot started reviewing on behalf of bsbodden March 20, 2026 22:11 View session

bsbodden self-assigned this Mar 20, 2026

bsbodden requested a review from abrookins March 20, 2026 22:13

Copilot AI reviewed Mar 20, 2026

View reviewed changes

tylerhutcherson requested changes Mar 24, 2026

View reviewed changes

-_LITELLM_EXCEPTION_CLASSES = [
-    litellm.exceptions.AuthenticationError,
-    litellm.exceptions.NotFoundError,
-    litellm.exceptions.BadRequestError,
-    litellm.exceptions.UnprocessableEntityError,
-    litellm.exceptions.Timeout,
-    litellm.exceptions.PermissionDeniedError,
-    litellm.exceptions.RateLimitError,
-    litellm.exceptions.ContextWindowExceededError,
-    litellm.exceptions.ContentPolicyViolationError,
-    litellm.exceptions.ServiceUnavailableError,
-    litellm.exceptions.BadGatewayError,
-    litellm.exceptions.InternalServerError,
-    litellm.exceptions.APIError,
-    litellm.exceptions.APIConnectionError,
-    litellm.exceptions.APIResponseValidationError,
+_LITELLM_EXCEPTION_CLASS_NAMES = [
+    "AuthenticationError",
+    "NotFoundError",
+    "BadRequestError",
+    "UnprocessableEntityError",
+    "Timeout",
+    "PermissionDeniedError",
+    "RateLimitError",
+    "ContextWindowExceededError",
+    "ContentPolicyViolationError",
+    "ServiceUnavailableError",
+    "BadGatewayError",
+    "InternalServerError",
+    "APIError",
+    "APIConnectionError",
+    "APIResponseValidationError",
+]
+_LITELLM_EXCEPTION_CLASSES = [
+    cls
+    for name in _LITELLM_EXCEPTION_CLASS_NAMES
+    for cls in (getattr(litellm.exceptions, name, None),)
+    if isinstance(cls, type) and issubclass(cls, Exception)

		with pytest.raises(TypeError, match="missing.*required"):
		exc_class()

	assert isinstance(restored, Exception)
	assert isinstance(restored, exc_class)

Conversation

bsbodden commented Mar 20, 2026 • edited by cursor bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test plan

Uh oh!

jit-ci bot commented Mar 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🛡️ Jit Security Scan Results

Uh oh!

github-actions bot left a comment

Choose a reason for hiding this comment

Uh oh!

github-actions bot Mar 20, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

cursor bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Copilot AI Mar 20, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 20, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 20, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 20, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 20, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 20, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 20, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 20, 2026

Choose a reason for hiding this comment

Uh oh!

tylerhutcherson left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

bsbodden commented Mar 20, 2026 •

edited by cursor bot

Loading

jit-ci bot commented Mar 20, 2026 •

edited

Loading