[fix] Modify the exception message when storage strategy does not support a certain type of data#120

Open: dpj135 wants to merge 2 commits into Ascend:main from dpj135:fix_yr_for_0.1.8

@dpj135 dpj135 commented Jun 8, 2026

Main changes

  • Modify the exception message to make it more understandable.
  • Add a solution for the "Cannot retrieve stored data" error to FAQ.
  • Add two environment variables, DS_D2H_MEMCPY_POLICY and DS_H2D_MEMCPY_POLICY, to datasystem workers.

@ascend-robot

CLA Signature Pass

dpj135, thanks for your pull request. All authors of the commits have signed the CLA. 👍

Signed-off-by: dpj135 <958208521@qq.com>
@dpj135 dpj135 force-pushed the fix_yr_for_0.1.8 branch from c5ec9c7 to 30b6ba5 Compare June 8, 2026 06:34

Copilot AI left a comment

Pull request overview

This PR aims to improve the diagnosability of Yuanrong DataSystem backend routing failures by updating exception messaging, documenting the common “Cannot retrieve stored data” scenario in the FAQ, and configuring datasystem worker environment variables to enforce direct memcpy policies.

Changes:

  • Updated Yuanrong storage strategy routing to emit a clearer error when no backend matches (especially for backend_meta).
  • Added an FAQ entry describing causes/solutions for the “Cannot retrieve stored data” error.
  • Set DS_D2H_MEMCPY_POLICY / DS_H2D_MEMCPY_POLICY to direct for datasystem workers when device IDs are specified.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| transfer_queue/storage/clients/yuanrong_client.py | Improves routing error messaging and adds an item_label to distinguish user values vs backend metadata. |
| transfer_queue/storage/bootstrap/yuanrong_bootstrap.py | Sets datasystem worker env vars for memcpy policy when remote device IDs are configured. |
| docs/storage_backends/openyuanrong_datasystem.md | Adds FAQ guidance for the new “Cannot retrieve stored data” error. |

Comment on lines +483 to +488
raise ValueError(
    "Cannot retrieve stored data because the backend that originally "
    "stored it is unavailable in the current process or node. Please "
    "check that the configuration and NPU resource availability are "
    "consistent across all processes and nodes."
)
Comment on lines +431 to 435
strategy_tags,
lambda strategy_, item_: strategy_.supports_clear(item_),
ignore_unmatched=True,
item_label="backend_meta",
)

This distinguishes between two cases when ignore_unmatched is True: keys that have never been put into the datasystem, and keys that were put into the datasystem by a different storage client.

Comment on lines +123 to +125
# Ensure direct copy for specified devices
env["DS_D2H_MEMCPY_POLICY"] = "direct"
env["DS_H2D_MEMCPY_POLICY"] = "direct"
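For context, the change above can be read as the following minimal sketch. The function name `build_worker_env` and the parameter `remote_device_ids` are hypothetical and chosen only for illustration; the actual bootstrap code in `yuanrong_bootstrap.py` may be structured differently. The sketch assumes the policy variables are only set when device IDs are specified, as the PR overview describes.

```python
import os
from typing import Mapping, Optional, Sequence


def build_worker_env(
    remote_device_ids: Optional[Sequence[int]],
    base_env: Optional[Mapping[str, str]] = None,
) -> dict:
    """Return a copy of the environment for a datasystem worker.

    Hypothetical helper: when device IDs are given, force the
    device-to-host and host-to-device memcpy policies to "direct".
    """
    env = dict(os.environ if base_env is None else base_env)
    if remote_device_ids:
        # Ensure direct copy for specified devices
        env["DS_D2H_MEMCPY_POLICY"] = "direct"
        env["DS_H2D_MEMCPY_POLICY"] = "direct"
    return env
```

Working on a copy keeps the parent process environment unchanged, so only the launched worker sees the policy variables.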

done

Comment on lines +473 to +477
This occurs when `kv_batch_get` or `kv_batch_clear` cannot find the storage backend that originally handled the data. The most common cause is a mismatch between the process that originally `put` the data and the process performing `get`/`clear`, such as:

- Different `enable_yr_npu_transport` settings across processes or nodes (e.g., `true` vs `false`).
- NPU hardware or CANN/torch-npu unavailable on the `get`/`clear` process or node, even though the configuration is identical.
- When running inside Ray actors, the actor may not be assigned NPU resources (e.g., missing `"NPU": 1` in `.options(resources=...)`), preventing the NPU transport backend from initializing.
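One way to catch the first two causes before calling `get`/`clear` is a small pre-flight check, sketched below. This is not part of the PR: `find_transport_mismatches` and the report keys `npu_available` are hypothetical, and the sketch assumes each process can report its own `enable_yr_npu_transport` setting and whether an NPU is usable.

```python
from typing import Dict, List


def find_transport_mismatches(reports: Dict[str, Dict[str, bool]]) -> List[str]:
    """Return human-readable problems found in per-process reports.

    Each report is assumed to hold two booleans:
    "enable_yr_npu_transport" and "npu_available" (hypothetical keys).
    """
    problems = []
    settings = {name: r["enable_yr_npu_transport"] for name, r in reports.items()}
    if len(set(settings.values())) > 1:
        problems.append(
            f"enable_yr_npu_transport differs across processes: {settings}"
        )
    for name, report in reports.items():
        if report["enable_yr_npu_transport"] and not report["npu_available"]:
            problems.append(f"{name}: NPU transport is enabled but no NPU is available")
    return problems
```

An empty list means the reports are consistent; any entry points at a process whose backend set is likely to differ from the one that performed the `put`.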

done.

Comment on lines +482 to +490
if item_label == "backend_meta":
    raise ValueError(
        "Cannot retrieve stored data because the backend that originally "
        "stored it is unavailable in the current process or node. Please "
        "check that the configuration and NPU resource availability are "
        "consistent across all processes and nodes."
    )
else:
    raise ValueError(f"No strategy can handle {item_label} of type {type(item).__name__}.")
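To show where this branch sits in a routing loop, here is a self-contained sketch. `TypeStrategy` and `pick_strategy` are hypothetical stand-ins rather than the classes in `yuanrong_client.py`, and the first-match selection is an assumption. The `ignore_unmatched` handling is deliberately left out because its exact semantics are not visible in the snippet.

```python
from typing import Any, Callable, Sequence, Tuple

BACKEND_UNAVAILABLE_MSG = (
    "Cannot retrieve stored data because the backend that originally "
    "stored it is unavailable in the current process or node. Please "
    "check that the configuration and NPU resource availability are "
    "consistent across all processes and nodes."
)


class TypeStrategy:
    """Hypothetical strategy that accepts items of the given types."""

    def __init__(self, name: str, accepted: Tuple[type, ...]):
        self.name = name
        self.accepted = accepted

    def supports(self, item: Any) -> bool:
        return isinstance(item, self.accepted)


def pick_strategy(
    item: Any,
    strategies: Sequence[TypeStrategy],
    selector: Callable[[TypeStrategy, Any], bool],
    item_label: str = "value",
) -> TypeStrategy:
    """Return the first strategy the selector accepts, else raise ValueError."""
    for strategy in strategies:
        if selector(strategy, item):
            return strategy
    if item_label == "backend_meta":
        # Metadata exists but no local backend recognises it.
        raise ValueError(BACKEND_UNAVAILABLE_MSG)
    raise ValueError(
        f"No strategy can handle {item_label} of type {type(item).__name__}."
    )
```

With this split, a failure on user data names the unsupported type, while a failure on backend metadata points the user at configuration and NPU availability instead.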

done.

@dpj135 dpj135 force-pushed the fix_yr_for_0.1.8 branch from c9f4de1 to b76fd84 Compare June 9, 2026 03:25

Signed-off-by: dpj135 <958208521@qq.com>
@dpj135 dpj135 force-pushed the fix_yr_for_0.1.8 branch from b76fd84 to 333bb13 Compare June 9, 2026 11:43