When I tried to run the T5 evaluation script t5_4gpus.sh on the osdi24ae branch, the following error occurred:
Traceback (most recent call last):
  File "/local/home/zzhendong2/projects/nnscaler/cupilot/examples/t5/train.py", line 221, in <module>
    train()
  File "/local/home/zzhendong2/projects/nnscaler/cupilot/examples/t5/train.py", line 174, in train
    def train_iter(model, dataloader):
  File "/local/home/zzhendong2/projects/nnscaler/cube/compiler.py", line 204, in decorator
    graph = PAS(graph, resource)
  File "/local/home/zzhendong2/projects/nnscaler/cupilot/cupilot/policy.py", line 303, in policy
    sched = OrderSolver().solve(graph, nmicros, config.order_plan)
  File "/local/home/zzhendong2/projects/nnscaler/cupilot/cupilot/solver/order.py", line 51, in solve
    sched = self.sched_tessel(graph, nmicros, sched_file)
  File "/local/home/zzhendong2/projects/nnscaler/cupilot/cupilot/solver/order.py", line 114, in sched_tessel
    tsched = TSched.load(load_sched_file)
  File "/local/home/zzhendong2/projects/nnscaler/Tessel/tessel/schedule/schedplan.py", line 574, in load
    with open(filename, 'r') as f:
FileNotFoundError: [Errno 2] No such file or directory: 'mllm.4stages.sched.json'
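The traceback ends in `open(filename, 'r')` with a bare relative name, which Python resolves against the current working directory. A minimal sketch for checking whether the file exists anywhere under a set of directories follows. `find_plan` is a hypothetical helper, not part of cupilot or Tessel, and the search roots are assumptions to be replaced with real checkout paths.

```python
from pathlib import Path


def find_plan(name, roots):
    """Return every file called `name` found under any directory in `roots`.

    Hypothetical diagnostic helper; it only searches the filesystem and does
    not use any nnscaler, cupilot, or Tessel API.
    """
    matches = []
    for root in roots:
        root = Path(root)
        if root.is_dir():  # silently skip roots that do not exist
            matches.extend(sorted(root.rglob(name)))
    return matches
```

For example, calling `find_plan("mllm.4stages.sched.json", ["cupilot", "Tessel"])` from the repository root (directory names taken from the traceback paths) returns an empty list if the schedule file is not shipped with either checkout.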
It seems that the file "mllm.4stages.sched.json" is missing. If I comment out the line "--order-plan mllm.4stages.sched.json" in t5_4gpus.sh, I run into another error:
Traceback (most recent call last):
  File "/local/home/zzhendong2/projects/nnscaler/cupilot/examples/t5/train.py", line 221, in <module>
    train()
  File "/local/home/zzhendong2/projects/nnscaler/cupilot/examples/t5/train.py", line 174, in train
    def train_iter(model, dataloader):
  File "/local/home/zzhendong2/projects/nnscaler/cube/compiler.py", line 204, in decorator
    graph = PAS(graph, resource)
  File "/local/home/zzhendong2/projects/nnscaler/cupilot/cupilot/policy.py", line 303, in policy
    sched = OrderSolver().solve(graph, nmicros, config.order_plan)
  File "/local/home/zzhendong2/projects/nnscaler/cupilot/cupilot/solver/order.py", line 49, in solve
    sched = self.sched_1f1b(graph, nmicros)
  File "/local/home/zzhendong2/projects/nnscaler/cupilot/cupilot/solver/order.py", line 80, in sched_1f1b
    sched.add_segment(stage, mb_idx, step)
  File "/local/home/zzhendong2/projects/nnscaler/cube/graph/schedule/schedplan.py", line 177, in add_segment
    self.add_block(block, step)
  File "/local/home/zzhendong2/projects/nnscaler/cube/graph/schedule/schedplan.py", line 151, in add_block
    raise RuntimeError(
RuntimeError: inserting confict at device 1 of time step 2: cannot execute multiple blocks at a same time step
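To illustrate what this kind of error means, here is a toy sketch of a schedule that allows at most one block per device per time step. It is not the cube `schedplan.py` implementation: `ToySchedule`, `place_forward_wave`, and the "stage s, micro-batch m runs at step s + m" placement rule are all assumptions made for illustration, and the example is not a diagnosis of the failure above.

```python
class ToySchedule:
    """Toy stand-in for a schedule plan: one block per (device, step) slot."""

    def __init__(self):
        self.slots = {}  # maps (device, step) -> name of the block placed there

    def add_block(self, name, devices, step):
        # Check every device first so a failed insert leaves the plan unchanged.
        for dev in devices:
            if (dev, step) in self.slots:
                raise RuntimeError(
                    f"conflict at device {dev}, time step {step}: "
                    f"{self.slots[(dev, step)]!r} vs {name!r}"
                )
        for dev in devices:
            self.slots[(dev, step)] = name


def place_forward_wave(stage_devices, nmicros):
    """Place forward blocks only, assuming stage s of micro-batch m runs at step s + m."""
    sched = ToySchedule()
    for mb in range(nmicros):
        for stage, devices in enumerate(stage_devices):
            sched.add_block(f"fwd(stage={stage}, mb={mb})", devices, step=stage + mb)
    return sched


# One device per stage: every (device, step) slot is used at most once.
ok = place_forward_wave([[0], [1], [2]], nmicros=2)

# Two stages sharing device 1: a fixed linear placement rule collides.
try:
    place_forward_wave([[0], [1], [1]], nmicros=2)
except RuntimeError as err:
    print(err)
```

In this toy, a placement rule that assumes each stage owns its devices breaks as soon as two stages share a device, which is one way a conflict of this shape can arise.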
How can I solve this issue?
Thank you!