LLaMA-7B SFT died with <Signals.SIGABRT: 6> #539

@PussyCat0700

Setup: a single A100 GPU.
Fine-tuning fails with a SIGABRT: 6 error.

  • Error message
  File "/home/yfliu/anaconda3/envs/oneflow/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/yfliu/anaconda3/envs/oneflow/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/yfliu/anaconda3/envs/oneflow/lib/python3.8/site-packages/oneflow/distributed/launch.py", line 240, in <module>
    main()
  File "/home/yfliu/anaconda3/envs/oneflow/lib/python3.8/site-packages/oneflow/distributed/launch.py", line 228, in main
    sigkill_handler(signal.SIGTERM, None)
  File "/home/yfliu/anaconda3/envs/oneflow/lib/python3.8/site-packages/oneflow/distributed/launch.py", line 196, in sigkill_handler
    raise subprocess.CalledProcessError(
subprocess.CalledProcessError: Command '['/home/yfliu/anaconda3/envs/oneflow/bin/python3', '-u', 'projects/Llama/train_net.py', '--config-file', 'projects/Llama/configs/llama_sft.py']' died with <Signals.SIGABRT: 6>.
  • My script
set -e
if [ -z "$1" ]; then
    echo "Usage: $0 <number>"
    exit 1
fi
libai_path=../libai
cd "$libai_path"
# scripts split in case blocks.
case $1 in
1)
# See https://github.com/Oneflow-Inc/libai/tree/main/projects/Llama for reference
# Notes:
# 1. Make sure destination_path and checkpoint_dir are set up.
# For example, our checkpoint_dir is /data1/yfliu/models/LLaMA2/LLaMA2_hf_7B, downloaded from https://llama.meta.com/llama-downloads/
# and our destination dir is /data1/yfliu/alpaca
# 2. You should also modify the corresponding entries in projects/Llama/configs/llama_config.py
python projects/Llama/utils/prepare_alpaca.py
;;
2)
# full finetune
# Please set the finetuning parameters in projects/Llama/configs/llama_sft.py, such as dataset_path and pretrained_model_path
# Type python3 -m oneflow.distributed.launch -h for more usage
FILE=projects/Llama/train_net.py
CONFIG=projects/Llama/configs/llama_sft.py
GPUS=1
NODE=1
NODE_RANK=0
ADDR=127.0.0.1
PORT=12345
LOGDIR=/home/yfliu/horizontal/oneflowtest/runs/llama2/oneflow

export ONEFLOW_FUSE_OPTIMIZER_UPDATE_CAST=true

python3 -m oneflow.distributed.launch \
--nproc_per_node $GPUS --nnodes $NODE --node_rank $NODE_RANK --master_addr $ADDR --master_port $PORT --logdir $LOGDIR --redirect_stdout_and_stderr \
$FILE --config-file $CONFIG
;;
esac
  • How I run the script

bash llama_sft.sh 2

The error is raised while running SFT training, and I cannot pinpoint where the problem originates.
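
One way to narrow this down: the traceback above comes from the launcher process, not from train_net.py itself. Because the script passes --logdir and --redirect_stdout_and_stderr, the output of the aborted training process was presumably written to files somewhere under LOGDIR rather than to the terminal. The sketch below makes no assumption about how those files are named or nested and simply prints the tail of every file it finds. `show_worker_logs` is a hypothetical helper name, not part of OneFlow or LiBai.

```shell
#!/usr/bin/env bash
# Print the last N lines of every file under a log directory.
# Assumption: the launcher wrote the worker's stdout/stderr under this
# directory; the exact layout is unknown, so the search is recursive.
show_worker_logs() {
    local logdir="$1"
    local lines="${2:-50}"   # number of trailing lines per file, default 50

    if [ ! -d "$logdir" ]; then
        echo "log directory not found: $logdir" >&2
        return 1
    fi

    # -print0 / read -d '' keeps file names with spaces intact.
    find "$logdir" -type f -print0 | while IFS= read -r -d '' f; do
        echo "==> $f <=="
        tail -n "$lines" "$f"
    done
}
```

With the paths from the script above, this would be called as `show_worker_logs /home/yfliu/horizontal/oneflowtest/runs/llama2/oneflow 100`. Whatever the training process printed just before aborting (for example a failed check or an out-of-memory message) should appear near the end of one of the listed files, if it was captured at all.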
