
[Bug] Agentic Training Failed with "rock: command not found" #412

@KunWuLuan

Description

I am trying to use ROLL for RL training, and I see the following error after the model service is installed. This is strange because the model service has already been installed and started.

(screenshot of the error: `rock: command not found`)
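For reference, here is a rough diagnostic sketch I would run inside the container to rule out an installation problem (the binary name `rock` is inferred from the error message; `rl-rock` is the pip package from the reproduce steps below):

```shell
# Rough diagnostic: is a `rock` executable on PATH, and is the
# rl-rock pip package installed? (binary name inferred from the error)
if command -v rock >/dev/null 2>&1; then
    echo "rock found at: $(command -v rock)"
else
    echo "rock NOT on PATH"
fi
pip show rl-rock >/dev/null 2>&1 && echo "rl-rock installed" || echo "rl-rock missing"
```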

This is my config:

    defaults:
      - ../config/traj_envs@_here_
      - ../config/deepspeed_zero@_here_
      - ../config/deepspeed_zero2@_here_
      - ../config/deepspeed_zero3@_here_
      - ../config/deepspeed_zero3_cpuoffload@_here_
    
    hydra:
      run:
        dir: .
      output_subdir: null
    
    exp_name: "agentic_rollout_swe"
    seed: 42
    
    logging_dir: ./output/logs
    output_dir: ./output
    model_name: ${exp_name}-${now:%Y%m%d_%H%M%S}
    rollout_dump_dir: ./output/rollout_dump
    system_envs:
      USE_MODELSCOPE: '1'
    
    num_gpus_per_node: 8 # change this
    rpc_timeout: 72000
    
    max_steps: 10
    save_steps: 10
    logging_steps: 1
    eval_steps: 2
    resume_from_checkpoint: false
    
    rollout_batch_size: 1
    val_batch_size: 1
    sequence_length: 65536
    
    max_tokens_per_step: 4096
    
    advantage_clip: 0.2
    ppo_epochs: 1
    adv_estimator: "step_reinforce"
    batch_adjust_mode: "random_sample"
    step_reward_gamma: 1.0
    
    #pg_clip: 0.1
    #dual_clip_loss: True
    init_kl_coef: 0.0
    whiten_advantages: true
    entropy_loss_coef: 0
    max_grad_norm: 1.0
    
    
    pretrain: /var/model/Qwen2.5-7B-Instruct # change this
    reward_pretrain: /var/model/Qwen2.5-7B-Instruct # change this
    
    actor_train:
      model_args:
        flash_attn: fa2
        disable_gradient_checkpointing: false
        dtype: bf16
        model_type: ~
    actor_infer:
      model_args:
        flash_attn: fa2
        disable_gradient_checkpointing: true
        dtype: bf16
      generating_args:
        max_new_tokens: ${max_tokens_per_step} # single-turn response length
        top_p: 1.0
        top_k: 50
        num_beams: 1
        temperature: 1.0
        num_return_sequences: 1
        stop_strings: ["</tool_call>","</tool_call>\n","\n</tool_call>\n","\n</function>"]
        include_stop_str_in_output: true
      data_args:
        template: qwen3_coder
      strategy_args:
        strategy_name: vllm
        strategy_config:
          gpu_memory_utilization: 0.8
          block_size: 16
          load_format: auto
          tensor_parallel_size: 1
      device_mapping: list(range(1,2))
    
    reward_normalization:
      grouping: traj_group_id # tags(env_type)/traj_group_id(group)/batch(rollout_batch)... group_by reward/adv
      method: mean
      # norm_mean_type: batch
      # norm_std_type: group
    
    train_env_manager:
      max_env_num_per_worker: 1
      num_env_groups: 1
      # under the same group, the env config and env seed are ensured to be equal
      group_size: 1
      tags: [swebench_native_verified]
      num_groups_partition: [1] # If not set, all env names divide nums equally. Under the same group, the env config and env seed (prompt) are equal in each generation
      system_envs:
        # if you cannot get a python env in rock due to a connection error, try this workaround (it may expire in the future)
        ROCK_RTENV_PYTHON_V31114_INSTALL_CMD: '[ -f cpython31115.tar.gz ] && rm cpython31115.tar.gz; [ -d python ] && rm -rf python; wget -q -O cpython31115.tar.gz https://mirror.nju.edu.cn/github-release/astral-sh/python-build-standalone/20260303/cpython-3.11.15+20260303-x86_64-unknown-linux-gnu-install_only.tar.gz && tar -xzf cpython31115.tar.gz && mv python runtime-env'
    val_env_manager:
      max_env_num_per_worker: 1
      num_env_groups: 1
      group_size: 1 # should be set to 1 because val temperature is set to 0 and same prompt leads to same output
      tags: [swebench_native_verified]
      num_groups_partition: [1] # TODO: If not set, all env names divide nums equally. Under the same group, the env config and env seed (prompt) are equal in each generation
      system_envs:
        # if you cannot get a python env in rock due to a connection error, try this workaround (it may expire in the future)
        ROCK_RTENV_PYTHON_V31114_INSTALL_CMD: '[ -f cpython31115.tar.gz ] && rm cpython31115.tar.gz; [ -d python ] && rm -rf python; wget -q -O cpython31115.tar.gz https://mirror.nju.edu.cn/github-release/astral-sh/python-build-standalone/20260303/cpython-3.11.15+20260303-x86_64-unknown-linux-gnu-install_only.tar.gz && tar -xzf cpython31115.tar.gz && mv python runtime-env'
    
    max_actions_per_traj: 60
    env_manager_cls: roll.pipeline.agentic.env_manager.agent_native_env_manager.AgentNativeStepEnvManager
    
    agent_config_common:
      agent_type: "default"
      
      # Startup command; placeholders (e.g., <<PROMPT>>) are parsed in the code
      run_cmd: 'iflow -p <<PROMPT>> --yolo'
      
      # Dependency pre-installation; modify based on your sandbox image
      pre_init_cmds:
        - command: "apt-get update"
          timeout_seconds: 600
        - command: "apt-get install -y curl git wget xz-utils"
          timeout_seconds: 600
        - command: "apt-get install -y build-essential libc6-dev patch procps npm"
          timeout_seconds: 600
        # Install helper tools like 'uv'
        - command: "wget -q https://xrl-sandbox-bucket.oss-cn-hangzhou.aliyuncs.com/uv-files/uv-x86_64-unknown-linux-gnu.tar.gz && tar -xzf uv-x86_64-unknown-linux-gnu.tar.gz --strip-components=1 -C /usr/local/bin && uv --version"
          timeout_seconds: 600 
    
      model_service_config: 
        type: "local"
        enabled: True
      
      # Runtime environment
      runtime_env_config:
        type: node
        npm_registry: "https://registry.npmmirror.com"
        # Install specific iflow versions as needed
        custom_install_cmd: "wget --retry-connrefused --tries=10 --waitretry=2 -O ~/iflow-cli.tgz 'http://cloud.iflow.cn/iflow-cli/iflow-ai-iflow-cli-for-roll-0-4-4-v5.tgz' && npm i -g ~/iflow-cli.tgz"
      
      env:
        # Configure iflow parameters as needed
        IFLOW_apiKey: "test"
        IFLOW_baseUrl: "http://localhost:8080/v1"
        IFLOW_modelName: "ROME"
        IFLOW_searchApiKey: "88888888"
        IFLOW_selectedAuthType: "openai-compatible"
        IFLOW_disableAutoUpdate: "true"
        IFLOW_tokensLimit: "128000"
        IFLOW_shellTimeout: "360000"
        IFLOW_coreTools: "Edit,exit_plan_mode,glob,list_directory,multi_edit,plan,read plan,read_file,read_many_files,save_memory,Search,Shell,task,web_fetch,web_search,write_file,xml_escape"

    custom_envs:
      swebench_native_verified:
        env_type: "rock_tb_native_env"
        max_steps: ${max_actions_per_traj}
        max_tokens_per_step: ${max_tokens_per_step}
        env_manager_cls: ${env_manager_cls}
        agent_system_template: "agent_system_template placeholder"
        agent_template: "agent_template placeholder"
        env_config:
          dataset_name: /workspace/ROLL/data/swe_bench_verified_example.jsonl # change to your own data path
          tools: ~
          max_steps: ${max_actions_per_traj}
          mode: "val"
          sandbox_base_url: http://rock:8080 # change to your own service address if needed
          user_id: "xxx"
          experiment_id: "test_tb_native"
          test_files: ["/var/model-dataset/terminal-bench-datasets/datasets/swebench-verified/"]
          agent_config: ${agent_config_common}
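To rule out a networking issue, I also checked whether the sandbox service itself is reachable (a quick sketch; the hostname `rock` and port 8080 are taken from `sandbox_base_url` in the config above, so adjust them if yours differ):

```shell
# Quick reachability check for sandbox_base_url (http://rock:8080).
# Hostname and port come from the config above; adjust if yours differ.
if curl -sf --max-time 5 http://rock:8080/ >/dev/null 2>&1; then
    echo "sandbox reachable"
else
    echo "sandbox NOT reachable at http://rock:8080"
fi
```

Note that the error in the screenshot looks like a missing `rock` *command*, not an unreachable HTTP service, so this check passing would point at the CLI installation rather than the network.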

This is the script:

    #!/bin/bash
    set +x
    
    CONFIG_PATH=$(basename $(dirname $0))
    export PYTHONPATH="/workspace/ROLL:$PYTHONPATH"
    python /workspace/ROLL/examples/start_agentic_rollout_pipeline.py --config_path agentic_demo  --config_name agent_rollout_rock_swe_ack

Reproduce:

Image: `roll-registry-vpc.cn-hangzhou.cr.aliyuncs.com/roll/pytorch:nvcr-25.06-py3-torch280-vllm0102`

    git clone https://github.com/alibaba/ROLL.git
    cd ROLL
    cp /var/roll-config/* /workspace/ROLL/examples/
    cp /var/roll-config/requirements_ack.txt /workspace/ROLL/
    pip install -r requirements_ack.txt -i https://mirrors.aliyun.com/pypi/simple/
    pip install rl-rock "protobuf<4.0.0" -i https://mirrors.aliyun.com/pypi/simple/
    # note: ray must be started without --block (or in the background),
    # otherwise the next command never runs
    ulimit -n 65536 && ray start --head --dashboard-agent-listen-port=52365 --dashboard-host=0.0.0.0 --memory=274877906944 --metrics-export-port=8080 --num-cpus=64
    bash /workspace/ROLL/examples/run_agentic_rollout_pipeline_rock_swe_ack.sh
