Make exporter survive connection interruption #357
Conversation
Walkthrough
The changes in the pull request focus on the exporter's connection handling, adding retry logic with backoff to its listen and status streams.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Controller
    participant Exporter
    participant ListenStream
    Controller->>Exporter: Sends request
    Exporter->>ListenStream: Push request into listen_tx
    ListenStream-->>Exporter: Emit request via listen_rx
    alt Handling error
        Exporter->>Exporter: Log error and retry with backoff
    end
```
```mermaid
sequenceDiagram
    participant Controller
    participant Exporter
    participant StatusStream
    Controller->>Exporter: Sends status update
    Exporter->>StatusStream: Push update into status_tx
    StatusStream-->>Exporter: Emit update via status_rx
    alt Handling error
        Exporter->>Exporter: Log error and retry with backoff
    end
```
✅ Deploy Preview for jumpstarter-docs ready!
Actionable comments posted: 0
🧹 Nitpick comments (3)
packages/jumpstarter/jumpstarter/exporter/exporter.py (3)
40-45: Connection handling wrapped in `try`/`except`. A nice addition for basic error handling. However, consider whether logging at error level (`logger.error`) would be more appropriate for visibility, or whether you want to re-raise in certain failure scenarios.
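A minimal sketch of that suggestion, using a hypothetical `connect` coroutine rather than the exporter's actual connection code:

```python
import logging

logger = logging.getLogger(__name__)

async def connect_with_visibility(connect):
    """Run a (hypothetical) connection coroutine, surfacing failures loudly."""
    try:
        await connect()
    except Exception as e:
        # error-level logging keeps the failure visible in normal log output
        logger.error("Exporter connection failed: %s", e)
        # re-raise so unrecoverable failures are not silently swallowed
        raise
```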
71-85: Introducing a long-lived `listen` loop with retries. Good approach to continuously watch for incoming requests. However, this may loop infinitely if errors persist. Consider adding a maximum retry count or an exponential backoff to avoid a potential hard loop.
Example (adding a brief exponential backoff note):

```diff
  except Exception as e:
      logger.info("Listen stream interrupted, restarting: {}".format(e))
+     # Consider an exponential backoff or maximum retry limit here
      await sleep(3)
```
98-107: Similar retry logic for the `status` routine. As with the `listen` loop, use caution regarding infinite retry. You may want to introduce adjustable sleep intervals (e.g., exponential backoff) or a maximum retry limit.
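For both loops, one way to combine a retry ceiling with exponential backoff is sketched below; the helper name and parameters are illustrative and not taken from the actual exporter code:

```python
import logging

from anyio import sleep

logger = logging.getLogger(__name__)

async def run_with_backoff(stream_once, max_retries=5, base_delay=3, max_delay=60):
    """Re-run a streaming coroutine after failures, with capped exponential backoff."""
    attempt = 0
    while True:
        try:
            await stream_once()
            attempt = 0  # a clean run resets the retry budget
        except Exception as e:
            attempt += 1
            if attempt > max_retries:
                logger.error("Giving up after %s retries: %s", max_retries, e)
                raise
            delay = min(base_delay * 2 ** (attempt - 1), max_delay)
            logger.info("Stream interrupted, retrying in %ss: %s", delay, e)
            await sleep(delay)
```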
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- packages/jumpstarter-cli-admin/jumpstarter_cli_admin/get_test.py (2 hunks)
- packages/jumpstarter/jumpstarter/exporter/exporter.py (3 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (11)
- GitHub Check: Redirect rules - jumpstarter-docs
- GitHub Check: Header rules - jumpstarter-docs
- GitHub Check: Pages changed - jumpstarter-docs
- GitHub Check: e2e
- GitHub Check: pytest-matrix (3.11)
- GitHub Check: pytest-matrix (3.12)
- GitHub Check: pytest-matrix (3.13)
- GitHub Check: build-and-push-image (jumpstarter-dev/jumpstarter-dev .devfile/Containerfile)
- GitHub Check: build-and-push-image (jumpstarter-dev/jumpstarter-utils Dockerfile.utils)
- GitHub Check: build-and-push-image (jumpstarter-dev/jumpstarter Dockerfile)
- GitHub Check: build-and-push-image (jumpstarter-dev/jumpstarter-devspace .devfile/Containerfile.client)
🔇 Additional comments (6)
packages/jumpstarter-cli-admin/jumpstarter_cli_admin/get_test.py (2)
767-767: Added blank line before decorator for improved readability. No functional changes introduced; this is purely a readability enhancement.
1145-1145: Added blank line before decorator for improved readability. No functional changes introduced; this is purely a readability enhancement.
packages/jumpstarter/jumpstarter/exporter/exporter.py (4)
7-7: Importing `sleep` from anyio. This import is used for the new retry logic. Good addition for asynchronously handling retries.
87-87: Asynchronous iteration over `listen_rx`. This leverages the new `listen_tx`/`listen_rx` streams seamlessly. No issues spotted.
96-97: Creating memory object streams for status updates. Good pattern to handle asynchronous status messages.
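For context, the producer/consumer pattern behind these streams looks roughly like this generic anyio sketch (the message payloads here are placeholders, not the exporter's actual status types):

```python
import anyio
from anyio import create_memory_object_stream

async def main():
    send, receive = create_memory_object_stream()

    async def producer():
        async with send:
            for i in range(3):
                await send.send(f"status-{i}")  # push updates into the stream

    async with anyio.create_task_group() as tg:
        tg.start_soon(producer)
        async with receive:
            async for item in receive:  # consume updates as they arrive
                print("received", item)

anyio.run(main)
```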
109-110: Starting the `status` coroutine within the task group. This is consistent with the overall async pattern in the exporter. Nicely done.
```python
listen_tx, listen_rx = create_memory_object_stream()

async def listen():
    while True:
```
I am worried about our previous experiences with re-connections in a long-running service...
Can we add a limit to retries? And maybe a flag to enable this?
Implemented the retry limit, yet to add a flag.
Force-pushed from ebadc18 to 1d0fd22.
Actionable comments posted: 1
🧹 Nitpick comments (2)
packages/jumpstarter/jumpstarter/exporter/exporter.py (2)
99-100: Consider adding error handling for the consumer loop. While you've added robust error handling for the producer side (the `listen` function), the consumer loop doesn't have equivalent error handling. This might be intentional if errors here should cause termination, but for consistency consider adding error handling.

```diff
- async for request in listen_rx:
+ try:
+     async for request in listen_rx:
+         logger.info("Handling new connection request on lease %s", lease_name)
+         tg.start_soon(
+             self.__handle, path, request.router_endpoint, request.router_token, self.tls, self.grpc_options
+         )
+ except Exception as e:
+     logger.error("Error consuming connection requests: %s", e)
+     # Handle consumer errors appropriately
```
110-132: Refactor duplicate retry logic into a helper function. The retry logic in `status` is identical to the logic in `listen`. Consider extracting this pattern into a helper function to avoid code duplication and ensure consistent behavior.

```diff
  # Add at the beginning of the class or as a standalone function:
+ async def _stream_with_retry(self, stream_func, stream_tx, retries=5, backoff=3):
+     """Helper function for streaming with retry logic."""
+     retries_left = retries
+     while True:
+         try:
+             async for item in stream_func():
+                 await stream_tx.send(item)
+         except Exception as e:
+             if retries_left > 0:
+                 retries_left -= 1
+                 logger.info(
+                     "Stream interrupted, restarting in %ss, %s retries left: %s",
+                     backoff, retries_left, e
+                 )
+                 await sleep(backoff)
+             else:
+                 logger.error("Maximum retries exceeded, waiting for manual intervention: %s", e)
+                 await sleep(60)
+                 retries_left = retries
+         else:
+             retries_left = retries

  # Then in handle method:
+ async def listen():
+     controller = jumpstarter_pb2_grpc.ControllerServiceStub(self.channel_factory())
+     await self._stream_with_retry(
+         lambda: controller.Listen(jumpstarter_pb2.ListenRequest(lease_name=lease_name)),
+         listen_tx,
+         self.max_retries if self.retry_enabled else 0,
+         self.retry_backoff
+     )

  # And in serve method:
+ async def status():
+     controller = jumpstarter_pb2_grpc.ControllerServiceStub(self.channel_factory())
+     await self._stream_with_retry(
+         lambda: controller.Status(jumpstarter_pb2.StatusRequest()),
+         status_tx,
+         self.max_retries if self.retry_enabled else 0,
+         self.retry_backoff
+     )
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- packages/jumpstarter/jumpstarter/exporter/exporter.py (3 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (11)
- GitHub Check: Redirect rules - jumpstarter-docs
- GitHub Check: Header rules - jumpstarter-docs
- GitHub Check: Pages changed - jumpstarter-docs
- GitHub Check: e2e
- GitHub Check: pytest-matrix (3.13)
- GitHub Check: pytest-matrix (3.12)
- GitHub Check: pytest-matrix (3.11)
- GitHub Check: build-and-push-image (jumpstarter-dev/jumpstarter-dev .devfile/Containerfile)
- GitHub Check: build-and-push-image (jumpstarter-dev/jumpstarter Dockerfile)
- GitHub Check: build-and-push-image (jumpstarter-dev/jumpstarter-devspace .devfile/Containerfile.client)
- GitHub Check: build-and-push-image (jumpstarter-dev/jumpstarter-utils Dockerfile.utils)
🔇 Additional comments (5)
packages/jumpstarter/jumpstarter/exporter/exporter.py (5)
41-46: Good addition of error handling. The try-except block for handling connection failures is a good start for making the exporter more robust. This implementation will properly log errors when connections fail.
73-95: Consider making retry parameters configurable. While adding retry logic with backoff is an excellent improvement, hardcoding the retry count and backoff duration limits flexibility. This addresses part of the previous comment about adding retry limits, but doesn't provide a way to configure or disable this behavior.
Consider making these parameters configurable through class initialization:
```diff
 @dataclass(kw_only=True)
 class Exporter(AbstractAsyncContextManager, Metadata):
     channel_factory: Callable[[], grpc.aio.Channel]
     device_factory: Callable[[], Driver]
     lease_name: str = field(init=False, default="")
     tls: TLSConfigV1Alpha1 = field(default_factory=TLSConfigV1Alpha1)
     grpc_options: dict[str, str] = field(default_factory=dict)
+    max_retries: int = field(default=5)
+    retry_backoff: int = field(default=3)
+    retry_enabled: bool = field(default=True)
```

Then update the `listen` function signature:

```diff
- async def listen(retries=5, backoff=3):
+ async def listen():
+     retries = self.max_retries if self.retry_enabled else 0
+     backoff = self.retry_backoff
```
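If these proposed fields were adopted, the retry behaviour could be tuned or disabled per exporter instance (for example via `retry_enabled=False`) without touching the loop code itself, which would also cover the earlier request in this thread for a flag to enable or disable reconnection.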
96-96: LGTM: Task startup looks good. The use of task groups to manage the asynchronous listen operation is properly implemented.
135-135: Good job using memory streams for decoupling producers and consumers. Using memory object streams to separate the producer (status) from the consumer loop improves code modularity and error isolation, which is a good design practice.
133-135: ❓ Verification inconclusive
Consider synchronizing cancellation across both status and listen streams.
The task group is properly set up, but there's a potential race condition if a status change occurs while the `status` function is in a retry backoff. Make sure both streams are properly cancelled when the service state changes.
Verify how task cancellation propagates to inner functions during backoff periods. Does the cancellation immediately terminate the sleeping tasks?
🏁 Script executed:
```bash
#!/bin/bash
# Search for anyio sleep cancellation behavior
rg -A 3 -B 3 "cancel.*sleep|sleep.*cancel" --glob "*.py"

# Check for explicit cancellation handling with sleep
rg -A 3 -B 3 "try.*sleep.*except.*CancelledError|CancelledError.*sleep" --glob "*.py"
```

Length of output: 146
I'll perform an additional investigation into the cancellation behavior during backoff periods in the exporter. To help clarify whether sleeping tasks (such as those in a retry backoff) are immediately cancelled, please run the following scripts:
If these searches still produce no output, it might indicate that cancellation or backoff logic is managed indirectly (for example, via anyio’s built-in mechanisms). In that case—and because the absence of explicit sleep cancellation handling makes it unclear if backoff sleeping tasks are immediately terminated—I recommend that you manually verify the cancellation propagation (e.g., by checking anyio’s documentation or your test suite behavior).
Action Required: Verify Cancellation Synchronization for Backoff Tasks
The current task group setup in exporter.py (lines 133-135) is correct, but a potential race condition remains if a status change occurs while the `status` function is in a retry backoff. Specifically, it's unclear if sleeping tasks within backoff periods are immediately cancelled. Please ensure that cancellation propagates correctly across both the status and listen streams when the service state changes.
🏁 Script executed:
```bash
#!/bin/bash
echo "Searching exporter.py for sleep usage to uncover backoff/sleep patterns..."
rg -A 5 -B 5 "sleep(" packages/jumpstarter/jumpstarter/exporter/exporter.py

echo "Searching exporter.py for CancelledError handling..."
rg -A 5 -B 5 "CancelledError" packages/jumpstarter/jumpstarter/exporter/exporter.py

echo "Searching globally for backoff usage in Python files..."
rg -A 3 -B 3 "backoff" --glob "*.py"
```

Length of output: 3866
Ensure proper cancellation propagation during retry backoff
- The exporter's `status` (and similarly `listen`) function uses `await sleep(backoff)` within a try/except block. Since AnyIO's sleep is cancellation-friendly, a running backoff should be interrupted by a CancelledError when the task group is cancelled (see the sketch below).
- However, there's no explicit synchronization between the streams outside the task group context. Please verify (e.g., via tests or documentation review) that cancellation indeed terminates the sleeping tasks immediately when the service state changes.
- If cancellation is not prompt during backoff, consider adding explicit cancellation checks before or after the sleep to avoid any potential race condition.
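As a quick, self-contained illustration of the behaviour the review asks to verify (not code from the exporter), the following anyio sketch cancels a task group while one task is sleeping; the sleep is interrupted immediately rather than running out its full backoff:

```python
import anyio

async def main():
    async def backoff_sleeper():
        try:
            await anyio.sleep(60)  # stands in for a long retry backoff
        except anyio.get_cancelled_exc_class():
            print("backoff sleep was cancelled promptly")
            raise  # cancellation must always be re-raised

    async with anyio.create_task_group() as tg:
        tg.start_soon(backoff_sleeper)
        await anyio.sleep(0.1)    # give the sleeper time to start
        tg.cancel_scope.cancel()  # simulate the service state changing

anyio.run(main)
```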