Issue of evaluating T5 model #17

@zzhendong

When I tried to evaluate T5 with t5_4gpus.sh on the osdi24ae branch, this error occurred:

Traceback (most recent call last):
  File "/local/home/zzhendong2/projects/nnscaler/cupilot/examples/t5/train.py", line 221, in <module>
    train()
  File "/local/home/zzhendong2/projects/nnscaler/cupilot/examples/t5/train.py", line 174, in train
    def train_iter(model, dataloader):
  File "/local/home/zzhendong2/projects/nnscaler/cube/compiler.py", line 204, in decorator
    graph = PAS(graph, resource)
  File "/local/home/zzhendong2/projects/nnscaler/cupilot/cupilot/policy.py", line 303, in policy
    sched = OrderSolver().solve(graph, nmicros, config.order_plan)
  File "/local/home/zzhendong2/projects/nnscaler/cupilot/cupilot/solver/order.py", line 51, in solve
    sched = self.sched_tessel(graph, nmicros, sched_file)
  File "/local/home/zzhendong2/projects/nnscaler/cupilot/cupilot/solver/order.py", line 114, in sched_tessel
    tsched = TSched.load(load_sched_file)
  File "/local/home/zzhendong2/projects/nnscaler/Tessel/tessel/schedule/schedplan.py", line 574, in load
    with open(filename, 'r') as f:
FileNotFoundError: [Errno 2] No such file or directory: 'mllm.4stages.sched.json'
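
The traceback ends in a plain `open(filename, 'r')` on a bare file name, so Python resolves it against the current working directory of the process. A minimal sketch for checking where such a file would have to live is below. The helper `find_sched_file` and the idea of searching extra directories are hypothetical, for illustration only; they are not part of cupilot or Tessel.

```python
from pathlib import Path


def find_sched_file(name, search_dirs=()):
    """Return the first existing path for `name`, or None if it is not found.

    Hypothetical helper: it tries `name` relative to the current working
    directory first, then inside each directory in `search_dirs`.
    """
    candidates = [Path(name)] + [Path(d) / name for d in search_dirs]
    for candidate in candidates:
        if candidate.is_file():
            return candidate.resolve()
    return None


if __name__ == "__main__":
    # Report whether the schedule file is visible from where the script runs.
    found = find_sched_file("mllm.4stages.sched.json")
    print("cwd:", Path.cwd())
    print("schedule file:", found if found else "not found")
```

Running this from the same directory as t5_4gpus.sh shows whether the file is simply in a different location or absent from the checkout altogether.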

It seems the file "mllm.4stages.sched.json" is missing. If I comment out the line "--order-plan mllm.4stages.sched.json" in t5_4gpus.sh, it runs into another error:

Traceback (most recent call last):
  File "/local/home/zzhendong2/projects/nnscaler/cupilot/examples/t5/train.py", line 221, in <module>
    train()
  File "/local/home/zzhendong2/projects/nnscaler/cupilot/examples/t5/train.py", line 174, in train
    def train_iter(model, dataloader):
  File "/local/home/zzhendong2/projects/nnscaler/cube/compiler.py", line 204, in decorator
    graph = PAS(graph, resource)
  File "/local/home/zzhendong2/projects/nnscaler/cupilot/cupilot/policy.py", line 303, in policy
    sched = OrderSolver().solve(graph, nmicros, config.order_plan)
  File "/local/home/zzhendong2/projects/nnscaler/cupilot/cupilot/solver/order.py", line 49, in solve
    sched = self.sched_1f1b(graph, nmicros)
  File "/local/home/zzhendong2/projects/nnscaler/cupilot/cupilot/solver/order.py", line 80, in sched_1f1b
    sched.add_segment(stage, mb_idx, step)
  File "/local/home/zzhendong2/projects/nnscaler/cube/graph/schedule/schedplan.py", line 177, in add_segment
    self.add_block(block, step)
  File "/local/home/zzhendong2/projects/nnscaler/cube/graph/schedule/schedplan.py", line 151, in add_block
    raise RuntimeError(
RuntimeError: inserting confict at device 1 of time step 2: cannot execute multiple blocks at a same time step
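
To make the second error concrete: the message says two blocks were placed on the same device at the same time step. The toy class below sketches that invariant under my own assumptions. `ToySchedule` and its fields are hypothetical and do not reproduce the actual `add_block` in cube/graph/schedule/schedplan.py.

```python
class ToySchedule:
    """Toy schedule table: at most one block per (device, time step) slot."""

    def __init__(self):
        # Maps (device, step) -> name of the block occupying that slot.
        self.occupied = {}

    def add_block(self, name, devices, step):
        # Check every device first so a failed insert leaves no partial state.
        for dev in devices:
            if (dev, step) in self.occupied:
                raise RuntimeError(
                    f"inserting conflict at device {dev} of time step {step}: "
                    f"slot already holds {self.occupied[(dev, step)]!r}"
                )
        for dev in devices:
            self.occupied[(dev, step)] = name


sched = ToySchedule()
sched.add_block("stage0/mb0", devices=[0], step=0)
sched.add_block("stage1/mb0", devices=[1], step=2)
try:
    # A second block that touches device 1 at step 2 is rejected.
    sched.add_block("stage2/mb0", devices=[1, 2], step=2)
except RuntimeError as err:
    print(err)
```

In this toy version the conflict appears whenever two blocks that share a device are assigned the same step, which is the situation the error message describes.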

How can I solve this issue?
Thank you!
