
fix: patch litellm exceptions for cloudpickle serialization #234

Open
bsbodden wants to merge 3 commits into main from bsb/fix-litellm-pickle-231

Collaborator

@bsbodden bsbodden commented Mar 20, 2026

Summary

Closes #231

Docket worker tasks that fail with LiteLLM exceptions (rate limit, timeout, connection error, etc.) silently disappear because cloudpickle cannot deserialize them. LiteLLM exception classes require positional args (message, model, llm_provider) in __init__, but deserialization re-invokes the constructor without those arguments (default exception pickling passes only the stored args), raising TypeError.

  • agent_memory_server/litellm_pickle_compat.py — patches __reduce__ on all 15 LiteLLM exception classes to bypass __init__ during deserialization, using Exception.__new__() + __dict__ restoration
  • agent_memory_server/docket_tasks.py — imports the patch module at worker startup so it's applied before any task can fail
  • tests/test_docket_serialization.py — comprehensive tests proving the bug and verifying the fix
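The failure described in the summary can be reproduced without LiteLLM. The following is a minimal sketch, assuming a stand-in class `ProviderError` (hypothetical, not a LiteLLM class) whose `__init__` requires three positional arguments but forwards only the message to `Exception.__init__`. Default exception pickling stores `self.args` and rebuilds the instance as `cls(*args)`, so a constructor that needs more than what ended up in `args` fails on load:

```python
import pickle


class ProviderError(Exception):
    """Hypothetical stand-in for an exception with required positional args."""

    def __init__(self, message, model, llm_provider):
        self.message = message
        self.model = model
        self.llm_provider = llm_provider
        # Only the message reaches Exception, so self.args == (message,)
        super().__init__(message)


def roundtrip_error(exc):
    """Return the exception raised while unpickling `exc`, or None on success."""
    data = pickle.dumps(exc)  # dumping succeeds; the problem appears on load
    try:
        pickle.loads(data)
    except TypeError as err:
        return err
    return None


failure = roundtrip_error(ProviderError("rate limited", "gpt-4o", "openai"))
print(type(failure).__name__, failure)
```

Serialization itself succeeds, which is why the problem only surfaces when the result is read back.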

Test plan

  • TestTaskArgumentSerialization — verifies MemoryRecord, MemoryMessage, and lists serialize through cloudpickle
  • TestExceptionSerializationBaseline — verifies standard Python/httpx exceptions serialize correctly (baseline sanity)
  • TestLiteLLMExceptionBugProof — proves the underlying bug: LiteLLM __init__ requires positional args that cloudpickle doesn't preserve
  • TestLiteLLMExceptionPatched — verifies all 15 LiteLLM exception classes roundtrip through cloudpickle after patching, preserving message, model, llm_provider, and status_code
  • Full test suite passes (805 passed, 105 skipped)

Note

Medium Risk
Medium risk because it monkey-patches third-party LiteLLM exception classes globally via __reduce__, which could affect serialization/representation across the process. The change is scoped to error (de)serialization paths and is covered by targeted tests.

Overview
Fixes a Docket failure mode where tasks that raise LiteLLM exceptions can’t be deserialized from the result queue (due to required __init__ args), causing failures to disappear.

Adds agent_memory_server/litellm_pickle_compat.py to monkey-patch litellm.exceptions.* with a custom __reduce__ that reconstructs via Exception.__new__, restores safe picklable state (and preserves chaining when possible), and auto-applies on import; docket_tasks.py imports this module to enable the patch for workers.

Introduces tests/test_docket_serialization.py to validate cloudpickle round-tripping for key task args and to prove/guard the LiteLLM exception serialization fix across the supported exception types.

Written by Cursor Bugbot for commit 708f91a.

Copilot AI review requested due to automatic review settings March 20, 2026 18:53

jit-ci bot commented Mar 20, 2026

🛡️ Jit Security Scan Results

✅ No security findings were detected in this PR


Security scan by Jit

@github-actions github-actions bot left a comment

Review completed. I found one potential issue with exception chaining that should be addressed.


🤖 Automated review complete. Please react with 👍 or 👎 on the individual review comments to provide feedback on their usefulness.

        except (TypeError, pickle.PicklingError):
            # Fall back to string representation for unpicklable attributes
            state[key] = repr(value)
    return (_reconstruct_litellm_exception, (type(self), state))

Potential Bug: Exception chaining attributes are not preserved

The current implementation only captures __dict__ attributes, but Python exceptions have special attributes for exception chaining that are not stored in __dict__:

  • __cause__ (set by raise ... from ...)
  • __context__ (implicit exception context)
  • __traceback__ (traceback object)

When these exceptions are pickled and unpickled, this chaining information will be lost, making debugging more difficult.

Suggested improvement:

def _litellm_reduce(self):
    """Custom __reduce__ for litellm exceptions.

    Captures __dict__ state, filtering out any attributes that themselves
    cannot be pickled (e.g. httpx connection pools from real responses).
    Also preserves exception chaining attributes.
    """
    import pickle

    state = {}
    for key, value in self.__dict__.items():
        try:
            pickle.dumps(value)
            state[key] = value
        except (TypeError, pickle.PicklingError):
            # Fall back to string representation for unpicklable attributes
            state[key] = repr(value)
    
    # Preserve exception chaining and args
    state['_exc_args'] = self.args
    state['_exc_cause'] = self.__cause__
    state['_exc_context'] = self.__context__
    # Note: __traceback__ is typically not pickled as it contains frame objects
    
    return (_reconstruct_litellm_exception, (type(self), state))

And update the reconstruction function:

def _reconstruct_litellm_exception(cls, state):
    """Reconstruct a litellm exception without calling __init__.

    Uses Exception.__new__ to create the instance, then restores __dict__
    and sets Exception.args for proper str() representation.
    """
    exc = Exception.__new__(cls)
    
    # Extract special exception attributes before updating __dict__
    exc_args = state.pop('_exc_args', (state.get('message', ''),))
    exc_cause = state.pop('_exc_cause', None)
    exc_context = state.pop('_exc_context', None)
    
    exc.__dict__.update(state)
    exc.args = exc_args
    exc.__cause__ = exc_cause
    exc.__context__ = exc_context
    
    return exc

This ensures that exception chains are preserved, which is important for debugging background task failures.

Contributor

Copilot AI left a comment

Pull request overview

This PR addresses a Docket/cloudpickle interoperability issue where LiteLLM exception instances can’t be reliably deserialized (due to required __init__ positional args), causing failed worker tasks to effectively “disappear” instead of being reported.

Changes:

  • Adds a monkey-patch module that implements a custom pickle reduction path for LiteLLM exception classes (bypassing __init__ on unpickle).
  • Ensures the patch is applied by importing it during Docket task module import.
  • Introduces a dedicated test suite to validate cloudpickle serialization of task arguments and multiple exception types (including patched LiteLLM exceptions).

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| agent_memory_server/litellm_pickle_compat.py | Introduces the LiteLLM exception __reduce__ patch + reconstruction helper for cloudpickle round-tripping. |
| agent_memory_server/docket_tasks.py | Imports the compat module so the patch is applied when the worker loads task definitions. |
| tests/test_docket_serialization.py | Adds tests for argument serialization, baseline exception serialization, and the LiteLLM exception bug + patch verification. |

@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.


@bsbodden bsbodden force-pushed the bsb/fix-litellm-pickle-231 branch from 0ac9bc9 to 708f91a March 20, 2026 19:14
@bsbodden bsbodden requested a review from Copilot March 20, 2026 22:10
@bsbodden bsbodden self-assigned this Mar 20, 2026
@bsbodden bsbodden requested a review from abrookins March 20, 2026 22:13
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 8 comments.


Comment on lines +18 to +33
_LITELLM_EXCEPTION_CLASSES = [
    litellm.exceptions.AuthenticationError,
    litellm.exceptions.NotFoundError,
    litellm.exceptions.BadRequestError,
    litellm.exceptions.UnprocessableEntityError,
    litellm.exceptions.Timeout,
    litellm.exceptions.PermissionDeniedError,
    litellm.exceptions.RateLimitError,
    litellm.exceptions.ContextWindowExceededError,
    litellm.exceptions.ContentPolicyViolationError,
    litellm.exceptions.ServiceUnavailableError,
    litellm.exceptions.BadGatewayError,
    litellm.exceptions.InternalServerError,
    litellm.exceptions.APIError,
    litellm.exceptions.APIConnectionError,
    litellm.exceptions.APIResponseValidationError,

Copilot AI Mar 20, 2026

_LITELLM_EXCEPTION_CLASSES is built via direct attribute access at import time. Because pyproject.toml allows litellm>=1.80.11 (no upper bound), a future LiteLLM release that renames/removes any of these exception classes would raise AttributeError during module import and could prevent the API server/worker from starting. Consider building this list defensively (e.g., getattr with a default + filter Nones, or discovering exception subclasses dynamically) so missing classes don’t crash startup.

Suggested change
-_LITELLM_EXCEPTION_CLASSES = [
-    litellm.exceptions.AuthenticationError,
-    litellm.exceptions.NotFoundError,
-    litellm.exceptions.BadRequestError,
-    litellm.exceptions.UnprocessableEntityError,
-    litellm.exceptions.Timeout,
-    litellm.exceptions.PermissionDeniedError,
-    litellm.exceptions.RateLimitError,
-    litellm.exceptions.ContextWindowExceededError,
-    litellm.exceptions.ContentPolicyViolationError,
-    litellm.exceptions.ServiceUnavailableError,
-    litellm.exceptions.BadGatewayError,
-    litellm.exceptions.InternalServerError,
-    litellm.exceptions.APIError,
-    litellm.exceptions.APIConnectionError,
-    litellm.exceptions.APIResponseValidationError,
+_LITELLM_EXCEPTION_CLASS_NAMES = [
+    "AuthenticationError",
+    "NotFoundError",
+    "BadRequestError",
+    "UnprocessableEntityError",
+    "Timeout",
+    "PermissionDeniedError",
+    "RateLimitError",
+    "ContextWindowExceededError",
+    "ContentPolicyViolationError",
+    "ServiceUnavailableError",
+    "BadGatewayError",
+    "InternalServerError",
+    "APIError",
+    "APIConnectionError",
+    "APIResponseValidationError",
+]
+_LITELLM_EXCEPTION_CLASSES = [
+    cls
+    for name in _LITELLM_EXCEPTION_CLASS_NAMES
+    for cls in (getattr(litellm.exceptions, name, None),)
+    if isinstance(cls, type) and issubclass(cls, Exception)

Copilot uses AI. Check for mistakes.
Comment on lines +65 to +75
    import pickle

    state = {}
    for key, value in self.__dict__.items():
        try:
            pickle.dumps(value)
            state[key] = value
        except Exception:
            # Fall back to string representation for any unpicklable attributes
            state[key] = repr(value)

Copilot AI Mar 20, 2026

The picklability check uses stdlib pickle.dumps(), but Docket uses cloudpickle. Some values may be picklable by cloudpickle but not by pickle, causing avoidable loss of state by converting them to repr(). Using cloudpickle.dumps() for the check (or otherwise matching the actual serializer) would preserve more exception context.
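One way to keep the picklability probe aligned with the real serializer is to pass the `dumps` callable in rather than hard-coding it. The following is a minimal sketch, assuming a hypothetical helper named `picklable_state` that is not part of the PR. It defaults to stdlib `pickle.dumps` so it runs without extra dependencies; a worker that serializes with cloudpickle would pass `cloudpickle.dumps` instead.

```python
import pickle
import threading


def picklable_state(attrs, dumps=pickle.dumps):
    """Return a copy of `attrs` where values `dumps` rejects become repr() strings.

    `dumps` should be the serializer the transport actually uses, so values
    it can handle are not downgraded to strings unnecessarily.
    """
    state = {}
    for key, value in attrs.items():
        try:
            dumps(value)
            state[key] = value
        except Exception:
            # Fall back to a string representation for unpicklable values.
            state[key] = repr(value)
    return state


# A lock is a typical example of a value that cannot be pickled.
attrs = {"message": "boom", "status_code": 429, "lock": threading.Lock()}
state = picklable_state(attrs)
print(state["message"], state["status_code"], type(state["lock"]).__name__)
```

With the stdlib default, an attribute holding a lambda would also be reduced to its `repr()`, whereas cloudpickle can serialize lambdas, which is the kind of avoidable loss the comment describes.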

Comment on lines +46 to +55
    # Extract special exception attributes before updating __dict__
    exc_args = state.pop("_exc_args", (state.get("message", ""),))
    exc_cause = state.pop("_exc_cause", None)
    exc_context = state.pop("_exc_context", None)

    exc.__dict__.update(state)
    exc.args = exc_args
    exc.__cause__ = exc_cause
    exc.__context__ = exc_context
    return exc

Copilot AI Mar 20, 2026

Chaining restoration handles __cause__ and __context__, but it doesn’t preserve __suppress_context__. If __suppress_context__ was set on the original exception, that semantic will be lost after unpickling. Consider capturing/restoring __suppress_context__ alongside the other chaining attributes.
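A minimal sketch of capturing and restoring all three chaining attributes, using a hypothetical stand-in class `ProviderError` and an illustrative helper `_rebuild` rather than the PR's actual module. Assigning `__cause__` implicitly sets `__suppress_context__` to `True`, so the sketch restores the flag last:

```python
import pickle


class ProviderError(Exception):
    """Hypothetical stand-in for an exception with required positional args."""

    def __init__(self, message, model):
        self.message = message
        self.model = model
        super().__init__(message)

    def __reduce__(self):
        state = dict(self.__dict__)
        chain = (self.__cause__, self.__context__, self.__suppress_context__)
        return (_rebuild, (type(self), self.args, state, chain))


def _rebuild(cls, args, state, chain):
    """Recreate the exception without calling cls.__init__."""
    exc = Exception.__new__(cls)
    exc.args = args
    exc.__dict__.update(state)
    cause, context, suppress = chain
    exc.__cause__ = cause  # side effect: __suppress_context__ becomes True
    exc.__context__ = context
    exc.__suppress_context__ = suppress  # restore the original flag last
    return exc


try:
    try:
        raise KeyError("missing key")
    except KeyError:
        # Implicit chaining: __context__ is set, __cause__ stays None.
        raise ProviderError("rate limited", "gpt-4o")
except ProviderError as err:
    restored = pickle.loads(pickle.dumps(err))

print(repr(restored.__context__), restored.__suppress_context__)
```

Without the final assignment, the restored exception would report `__suppress_context__` as `True` even though the original was raised with implicit chaining, and the context would be hidden when the traceback is rendered.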

Comment on lines +95 to +97
        if not hasattr(cls, "_pickle_patched"):
            cls.__reduce__ = _litellm_reduce
            cls._pickle_patched = True

Copilot AI Mar 20, 2026

patch() skips patching if the class merely has an attribute named _pickle_patched, even if it’s False or set by LiteLLM for another purpose. Using getattr(cls, "_pickle_patched", False) (or a less collision-prone sentinel name) would make the idempotence check more reliable.

Suggested change
-        if not hasattr(cls, "_pickle_patched"):
-            cls.__reduce__ = _litellm_reduce
-            cls._pickle_patched = True
+        # Use a less collision-prone attribute name and a value-based check.
+        if getattr(cls, "_docket_pickle_patched", False):
+            continue
+        cls.__reduce__ = _litellm_reduce
+        cls._docket_pickle_patched = True

Comment on lines 7 to 10
from docket import Docket

import agent_memory_server.litellm_pickle_compat # noqa: F401 — patches litellm exceptions for cloudpickle
from agent_memory_server.config import settings

Copilot AI Mar 20, 2026

Importing litellm_pickle_compat at module import time means the LiteLLM monkey-patch runs whenever agent_memory_server.docket_tasks is imported (agent_memory_server/main.py imports register_tasks unconditionally), not just in the Docket worker. If the intent is to scope this to workers or only when settings.use_docket is enabled, consider moving this import/patch inside the worker entrypoint and/or inside register_tasks() after the use_docket guard.

Comment on lines +118 to +128
        try:
            MemoryMessage(role=123, content=456)  # type: ignore
        except Exception as e:
            try:
                data = cloudpickle.dumps(e)
                cloudpickle.loads(data)
            except TypeError as pickle_err:
                pytest.fail(
                    f"MemoryMessage validation exception cannot be pickled: {pickle_err}"
                )

Copilot AI Mar 20, 2026

If MemoryMessage(role=123, content=456) does not raise (e.g., due to coercion changes in Pydantic), this test will silently pass without asserting anything. Consider using pytest.raises(...) to assert the validation error occurs, then verify that exception can be cloudpickled.

Suggested change
-        try:
-            MemoryMessage(role=123, content=456)  # type: ignore
-        except Exception as e:
-            try:
-                data = cloudpickle.dumps(e)
-                cloudpickle.loads(data)
-            except TypeError as pickle_err:
-                pytest.fail(
-                    f"MemoryMessage validation exception cannot be pickled: {pickle_err}"
-                )
+        with pytest.raises(Exception) as excinfo:
+            MemoryMessage(role=123, content=456)  # type: ignore
+        e = excinfo.value
+        try:
+            data = cloudpickle.dumps(e)
+            cloudpickle.loads(data)
+        except TypeError as pickle_err:
+            pytest.fail(
+                f"MemoryMessage validation exception cannot be pickled: {pickle_err}"
+            )

Comment on lines +203 to +204
            with pytest.raises(TypeError, match="missing.*required"):
                exc_class()

Copilot AI Mar 20, 2026

This assertion hard-codes the current LiteLLM behavior that these exception classes require positional args. Since the dependency is unbounded (litellm>=1.80.11), an upstream fix that makes __init__ no-arg-safe would cause this test to fail even though the serialization behavior is fine. Consider skipping/relaxing this test when the signature indicates defaultable args, or pinning an upper bound on LiteLLM if this behavior is relied upon.
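One way to relax the bug-proof test is to inspect the constructor signature and skip when no required arguments remain. The following is a sketch, assuming a hypothetical helper `requires_init_args`; the two stand-in classes are illustrative and are not LiteLLM classes.

```python
import inspect


def requires_init_args(exc_class):
    """Return True if exc_class() cannot be called without arguments."""
    params = inspect.signature(exc_class.__init__).parameters.values()
    for param in params:
        if param.name == "self":
            continue
        # *args and **kwargs never force the caller to pass anything.
        if param.kind in (param.VAR_POSITIONAL, param.VAR_KEYWORD):
            continue
        if param.default is inspect.Parameter.empty:
            return True
    return False


class StrictError(Exception):
    """Stand-in: all constructor arguments are required."""

    def __init__(self, message, model, llm_provider):
        super().__init__(message)


class LenientError(Exception):
    """Stand-in: every constructor argument has a default."""

    def __init__(self, message="", model=None, llm_provider=None):
        super().__init__(message)


print(requires_init_args(StrictError), requires_init_args(LenientError))
```

Inside the test, a guard such as `if not requires_init_args(exc_class): pytest.skip(...)` would let the suite keep passing if an upstream release gives these arguments defaults.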

        exc = _make_litellm_exc(exc_class)
        data = cloudpickle.dumps(exc)
        restored = cloudpickle.loads(data)
        assert isinstance(restored, Exception)

Copilot AI Mar 20, 2026

Since the patch reconstructs using type(self), the restored exception should be an instance of exc_class. Asserting only Exception is too weak and could miss regressions where the wrong class is reconstructed; consider asserting isinstance(restored, exc_class).

Suggested change
-        assert isinstance(restored, Exception)
+        assert isinstance(restored, exc_class)

Contributor

@tylerhutcherson tylerhutcherson left a comment

This might be too complex here. Is there a way we could catch LiteLLM exceptions in the worker process, raise our own exception class (with an encoded message), and pass that back to the task managers instead of hacking the litellm exception serializers and using cloudpickle? Also, for failures on LiteLLM/LLM extraction, where do those surface to clients of AMS?
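The alternative raised here could look roughly like the following. This is a sketch only, assuming a hypothetical `TaskFailure` exception and `pickle_safe_errors` decorator (neither exists in the PR), that task functions are coroutines, and that the summarized fields are JSON-serializable. The worker catches the problematic exception types and re-raises a plain exception whose single argument is a JSON-encoded summary, so default pickling works without patching third-party classes.

```python
import asyncio
import functools
import json
import pickle


class TaskFailure(Exception):
    """Hypothetical pickle-safe error: one string arg holding a JSON summary."""

    @property
    def details(self):
        return json.loads(self.args[0])


def pickle_safe_errors(*unsafe_types):
    """Decorator: convert the given exception types into TaskFailure."""

    def decorator(task):
        @functools.wraps(task)
        async def wrapper(*args, **kwargs):
            try:
                return await task(*args, **kwargs)
            except unsafe_types as exc:
                summary = {
                    "type": type(exc).__name__,
                    "message": str(exc),
                    "status_code": getattr(exc, "status_code", None),
                }
                # `from None` drops the reference to the unpicklable original.
                raise TaskFailure(json.dumps(summary)) from None

        return wrapper

    return decorator


class ProviderError(Exception):
    """Stand-in for a third-party exception that does not unpickle."""

    def __init__(self, message, model):
        self.status_code = 429
        super().__init__(message)


@pickle_safe_errors(ProviderError)
async def extract_memories():
    raise ProviderError("rate limited", "gpt-4o")


try:
    asyncio.run(extract_memories())
except TaskFailure as failure:
    restored = pickle.loads(pickle.dumps(failure))

print(restored.details)
```

The trade-off is that the original exception class is lost on the receiving side, so any retry logic keyed on exception type would have to read `details["type"]` instead.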

Development

Successfully merging this pull request may close these issues.

Docket worker cannot serialize litellm exceptions after task failure

3 participants