[fix] Modify the exception message when storage strategy does not support a certain type of data#120
dpj135 wants to merge 2 commits into
CLA Signature Pass
dpj135, thanks for your pull request. All authors of the commits have signed the CLA. 👍
Signed-off-by: dpj135 <958208521@qq.com>
Pull request overview
This PR aims to improve the diagnosability of Yuanrong DataSystem backend routing failures by updating exception messaging, documenting the common “Cannot retrieve stored data” scenario in the FAQ, and configuring datasystem worker environment variables to enforce direct memcpy policies.
Changes:
- Updated Yuanrong storage strategy routing to emit a clearer error when no backend matches (especially for `backend_meta`).
- Added an FAQ entry describing causes/solutions for the “Cannot retrieve stored data” error.
- Set `DS_D2H_MEMCPY_POLICY`/`DS_H2D_MEMCPY_POLICY` to `direct` for datasystem workers when device IDs are specified.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| transfer_queue/storage/clients/yuanrong_client.py | Improves routing error messaging and adds an item_label to distinguish user values vs backend metadata. |
| transfer_queue/storage/bootstrap/yuanrong_bootstrap.py | Sets datasystem worker env vars for memcpy policy when remote device IDs are configured. |
| docs/storage_backends/openyuanrong_datasystem.md | Adds FAQ guidance for the new “Cannot retrieve stored data” error. |
    raise ValueError(
        "Cannot retrieve stored data because the backend that originally "
        "stored it is unavailable in the current process or node. Please "
        "check that the configuration and NPU resource availability are "
        "consistent across all processes and nodes."
    )
        strategy_tags,
        lambda strategy_, item_: strategy_.supports_clear(item_),
        ignore_unmatched=True,
        item_label="backend_meta",
    )
Distinguishes between two cases when `ignore_unmatched` is `True`: keys that were never put into the datasystem (ds), and keys that were put into it by a different storage client.
    # Ensure direct copy for specified devices
    env["DS_D2H_MEMCPY_POLICY"] = "direct"
    env["DS_H2D_MEMCPY_POLICY"] = "direct"
This occurs when `kv_batch_get` or `kv_batch_clear` cannot find the storage backend that originally handled the data. The most common cause is a mismatch between the process that originally `put` the data and the process performing `get`/`clear`, such as:

- Different `enable_yr_npu_transport` settings across processes or nodes (e.g., `true` vs `false`).
- NPU hardware or CANN/torch-npu unavailable on the `get`/`clear` process or node, even though the configuration is identical.
- When running inside Ray actors, the actor may not be assigned NPU resources (e.g., missing `"NPU": 1` in `.options(resources=...)`), preventing the NPU transport backend from initializing.
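One way to check for the first cause is to compare the `enable_yr_npu_transport` value that each process actually loaded. The sketch below is illustrative only: `find_setting_mismatches` and the process names are hypothetical, and it assumes each process can report its configuration as a plain dictionary. It does not detect the other two causes (missing hardware or missing Ray NPU resources).

```python
from typing import Dict, List, Mapping


def find_setting_mismatches(
    configs: Mapping[str, Mapping[str, object]],
    key: str = "enable_yr_npu_transport",
) -> Dict[object, List[str]]:
    """Group process names by their value for `key` (hypothetical helper).

    Returns an empty dict when every process agrees, otherwise a mapping
    from each observed value to the processes that use it.
    """
    groups: Dict[object, List[str]] = {}
    for process_name, config in configs.items():
        # A missing key is reported as None so it shows up as its own group.
        groups.setdefault(config.get(key), []).append(process_name)
    return groups if len(groups) > 1 else {}


if __name__ == "__main__":
    reported = {
        "producer": {"enable_yr_npu_transport": True},
        "consumer": {"enable_yr_npu_transport": False},
    }
    print(find_setting_mismatches(reported))
```

A non-empty result points at the processes whose configuration differs from the one that performed the `put`.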
    if item_label == "backend_meta":
        raise ValueError(
            "Cannot retrieve stored data because the backend that originally "
            "stored it is unavailable in the current process or node. Please "
            "check that the configuration and NPU resource availability are "
            "consistent across all processes and nodes."
        )
    else:
        raise ValueError(f"No strategy can handle {item_label} of type {type(item).__name__}.")
Signed-off-by: dpj135 <958208521@qq.com>
Main changes
- `DS_D2H_MEMCPY_POLICY` and `DS_H2D_MEMCPY_POLICY` to datasystem workers