refactor: improve error handling with custom exceptions #414

Open
abeiabeiqq wants to merge 1 commit into alibaba:main from abeiabeiqq:improve-error-handling

abeiabeiqq commented Apr 7, 2026

Summary

This PR improves error handling across the ROLL codebase by replacing assert statements with custom exception classes, providing better error messages with context and suggestions for debugging.

Changes

New Exception System (roll/utils/exceptions.py)

A comprehensive exception hierarchy with error codes:

| Exception | Code | Purpose |
|---|---|---|
| RollConfigValidationError | 1001 | Configuration field validation failures |
| RollConfigConflictError | 1003 | Configuration field conflicts |
| RollDistributedError | 2000+ | Distributed system errors |
| RollModelError | 3000+ | Model-related errors |
| RollDataError | 4000+ | Data-related errors |
| RollPipelineError | 5000+ | Pipeline errors |
| RollEnvironmentError | 6000+ | Environment errors |
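
The contents of roll/utils/exceptions.py are not reproduced in this description. As a rough illustration only, a hierarchy with the codes from the table above might be laid out as follows. The sketch assumes a common RollError base class (the name appears in the commit message); the `code` attribute, the optional per-instance override, the simplified message-only constructors, and the `category` helper are all hypothetical.

```python
# Hypothetical sketch: class names and code ranges come from the table above.
# The base-class layout, attribute names, and constructors are assumptions.


class RollError(Exception):
    """Assumed common base class carrying a numeric error code."""

    code = 0  # placeholder; subclasses override this

    def __init__(self, message, code=None):
        super().__init__(message)
        self.message = message
        if code is not None:
            self.code = code  # a more specific code within the class's range

    def __str__(self):
        return f"[{self.code}] {self.message}"


class RollConfigValidationError(RollError):
    code = 1001


class RollConfigConflictError(RollError):
    code = 1003


class RollDistributedError(RollError):
    code = 2000  # the table lists "2000+"; the start of the range is used here


class RollModelError(RollError):
    code = 3000


class RollDataError(RollError):
    code = 4000


class RollPipelineError(RollError):
    code = 5000


class RollEnvironmentError(RollError):
    code = 6000


def category(err):
    """Map an error's code to the start of its thousand-wide range."""
    return (err.code // 1000) * 1000
```

With this layout, a caller could catch `RollError` to handle every framework error in one place, or catch a subclass to handle a single category.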

Modified Files

| Module | Files | Changes |
|---|---|---|
| roll/configs/ | base_config.py | 5 assert → exceptions |
| roll/utils/ | exceptions.py | New file (535 lines) |

Example

Before:

assert self.response_length or self.sequence_length, "response_length or sequence_length must be set"

After:

if not (self.response_length or self.sequence_length):
    raise RollConfigValidationError(
        field_name="response_length/sequence_length",
        expected_type="at least one must be set",
        actual_value=f"response_length={self.response_length}, sequence_length={self.sequence_length}",
        message="Either response_length or sequence_length must be set"
    )
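
To show how a caller might consume the structured fields, here is a self-contained sketch built around the same check. The `RollConfigValidationError` class below is a hypothetical stand-in that only accepts the four keyword arguments used in the example above; the real class in roll/utils/exceptions.py may differ. `check_lengths` and `describe_failure` are illustrative helpers, not part of ROLL.

```python
class RollConfigValidationError(Exception):
    """Hypothetical stand-in for the class in roll/utils/exceptions.py."""

    def __init__(self, field_name, expected_type, actual_value, message):
        self.field_name = field_name
        self.expected_type = expected_type
        self.actual_value = actual_value
        self.message = message
        super().__init__(
            f"{message} (field={field_name}, expected={expected_type}, "
            f"actual={actual_value})"
        )


def check_lengths(response_length, sequence_length):
    """Same check as above, written as a free function for illustration."""
    if not (response_length or sequence_length):
        raise RollConfigValidationError(
            field_name="response_length/sequence_length",
            expected_type="at least one must be set",
            actual_value=(
                f"response_length={response_length}, "
                f"sequence_length={sequence_length}"
            ),
            message="Either response_length or sequence_length must be set",
        )


def describe_failure(response_length, sequence_length):
    """Return the offending field name, or None when the check passes."""
    try:
        check_lengths(response_length, sequence_length)
    except RollConfigValidationError as err:
        return err.field_name
    return None
```

Unlike the assert version, the caller can branch on `err.field_name` instead of parsing a message string. As in the original assert, a value of 0 is treated as unset because the check relies on truthiness.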

Benefits

  1. Better Error Messages: Each exception carries an error code, context, and suggestions for debugging
  2. Structured Logging: A to_dict() method supports structured log integration
  3. Error Classification: Distinct exception types separate the error categories
  4. Debugging Friendly: Clear context helps identify issues faster
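
The structured-logging benefit can be sketched as follows. Only the `to_dict()` method name comes from this description; the constructor parameters (`code`, `context`, `suggestion`), the dictionary keys, and the `log_error` helper are assumptions made for illustration.

```python
import json
import logging


class RollError(Exception):
    """Hypothetical base class; field names are assumptions."""

    def __init__(self, message, code, context=None, suggestion=None):
        super().__init__(message)
        self.message = message
        self.code = code
        self.context = dict(context or {})
        self.suggestion = suggestion

    def to_dict(self):
        """Return a JSON-serializable view of the error."""
        return {
            "error_type": type(self).__name__,
            "code": self.code,
            "message": self.message,
            "context": self.context,
            "suggestion": self.suggestion,
        }


def log_error(logger, err):
    """Emit the error as a single JSON line and return that line."""
    line = json.dumps(err.to_dict(), sort_keys=True)
    logger.error(line)
    return line
```

Emitting one JSON object per error lets a log pipeline filter on `code` or `error_type` without parsing free-form messages.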

Testing

  • ✅ Python syntax compilation passed
  • ✅ Module import tests passed
  • ✅ Exception class tests passed

Future Work

This PR establishes the exception framework. Additional modules can be improved incrementally:

  • roll/pipeline/rlvr/ - 8 remaining asserts
  • roll/pipeline/dpo/ - 4 remaining asserts
  • roll/pipeline/agentic/ - 18 remaining asserts
  • roll/distributed/strategy/ - 15 remaining asserts
  • roll/distributed/executor/ - 11 remaining asserts
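
One way to track the remaining work is to count assert statements per file with the standard-library `ast` module. This is a sketch and not part of the PR; the helper names `count_asserts` and `count_asserts_in_tree` are made up for illustration.

```python
import ast
from pathlib import Path


def count_asserts(source):
    """Count assert statements in a string of Python source."""
    tree = ast.parse(source)
    return sum(isinstance(node, ast.Assert) for node in ast.walk(tree))


def count_asserts_in_tree(root):
    """Return {relative path: assert count} for files with at least one assert."""
    root = Path(root)
    counts = {}
    for path in sorted(root.rglob("*.py")):
        n = count_asserts(path.read_text(encoding="utf-8"))
        if n:
            counts[str(path.relative_to(root))] = n
    return counts
```

For example, `count_asserts_in_tree("roll/pipeline/rlvr")` would list the files under that directory that still contain asserts. Parsing the syntax tree avoids false positives from the word "assert" inside strings or comments.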

CLAassistant commented Apr 7, 2026

CLA assistant check
All committers have signed the CLA.

abeiabeiqq force-pushed the improve-error-handling branch from a3e89fe to cdb5947 on April 7, 2026 09:20
- Replace assert statements with custom RollError exceptions
- Add RollConfigValidationError, RollConfigConflictError for config errors
- Add RollDistributedError for distributed system errors
- Add RollModelError for model-related errors
- Add RollDataError for data-related errors
- Improve error messages with context and suggestions
- 63 assert statements replaced in total across 14 files

Modified files:
- roll/configs/base_config.py, worker_config.py, data_args.py
- roll/pipeline/rlvr/rlvr_pipeline.py, rlvr_config.py
- roll/pipeline/dpo/dpo_pipeline.py, dpo_config.py
- roll/pipeline/agentic/agentic_pipeline.py, agentic_config.py
- roll/distributed/strategy/strategy.py, vllm_strategy.py, sglang_strategy.py
- roll/distributed/executor/cluster.py, model_update_group.py
- roll/utils/exceptions.py (new file)
abeiabeiqq force-pushed the improve-error-handling branch from cdb5947 to dcd82f8 on April 7, 2026 11:09