Error with ILP Solver When Compiling Llama3-Mini with nnscaler #13

@CraneQinghe

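I encountered an error from the ILP solver while compiling Llama3-Mini with nnscaler. CBC reports duplicate row and objective entries while reading the MPS file generated by PuLP, marks the model as not valid, and PuLP then raises a PulpSolverError. Can you help me understand why this is happening?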

2024-11-03 08:56:37 | INFO | nnscaler.autodist.spmd_solver | finish spmd solver initializetion
Welcome to the CBC MILP Solver 
Version: 2.10.3 
Build Date: Dec 15 2019 

command line - /opt/conda/envs/andy/nnscaler/lib/python3.10/site-packages/pulp/solverdir/cbc/linux/64/cbc /tmp/fd2f9f31b8df4c399e4cf1c02094c54b-pulp.mps -sec 600 -threads 104 -timeMode elapsed -branch -printingOptions all -solution /tmp/fd2f9f31b8df4c399e4cf1c02094c54b-pulp.sol (default strategy 1)
At line 2 NAME          MODEL
At line 3 ROWS
At line 2660 COLUMNS
Duplicate row C0001129 at line 2757 <     X0000017  C0001129   1.000000000000e+00 >
Duplicate row C0001130 at line 2758 <     X0000017  C0001130   1.000000000000e+00 >
Duplicate row C0001132 at line 2759 <     X0000017  C0001132   1.000000000000e+00 >
Duplicate row C0001134 at line 2760 <     X0000017  C0001134   1.000000000000e+00 >
Duplicate row C0001135 at line 2761 <     X0000017  C0001135   1.000000000000e+00 >
Duplicate row C0001137 at line 2762 <     X0000017  C0001137   1.000000000000e+00 >
Duplicate row C0001129 at line 2773 <     X0000019  C0001129   1.000000000000e+00 >
Duplicate row C0001130 at line 2774 <     X0000019  C0001130   1.000000000000e+00 >
Duplicate row C0001133 at line 2775 <     X0000019  C0001133   1.000000000000e+00 >
Duplicate row C0001134 at line 2776 <     X0000019  C0001134   1.000000000000e+00 >
Duplicate row C0001135 at line 2777 <     X0000019  C0001135   1.000000000000e+00 >
Duplicate row C0001138 at line 2778 <     X0000019  C0001138   1.000000000000e+00 >
Duplicate row C0001129 at line 2790 <     X0000021  C0001129   1.000000000000e+00 >
Duplicate row C0001131 at line 2791 <     X0000021  C0001131   1.000000000000e+00 >
Duplicate row C0001132 at line 2792 <     X0000021  C0001132   1.000000000000e+00 >
Duplicate row C0001134 at line 2793 <     X0000021  C0001134   1.000000000000e+00 >
Duplicate row C0001136 at line 2794 <     X0000021  C0001136   1.000000000000e+00 >
Duplicate row C0001137 at line 2795 <     X0000021  C0001137   1.000000000000e+00 >
Duplicate objective at line 2796 <     X0000021  OBJ        8.920574188232e-03 >
Duplicate row C0001129 at line 2807 <     X0000023  C0001129   1.000000000000e+00 >
Duplicate row C0001131 at line 2808 <     X0000023  C0001131   1.000000000000e+00 >
Duplicate row C0001133 at line 2809 <     X0000023  C0001133   1.000000000000e+00 >
Duplicate row C0001134 at line 2810 <     X0000023  C0001134   1.000000000000e+00 >
Duplicate row C0001136 at line 2811 <     X0000023  C0001136   1.000000000000e+00 >
Duplicate row C0001138 at line 2812 <     X0000023  C0001138   1.000000000000e+00 >
Duplicate row C0001666 at line 7085 <     X0000779  C0001666   1.000000000000e+00 >
Duplicate row C0001667 at line 7086 <     X0000779  C0001667   1.000000000000e+00 >
Duplicate row C0001669 at line 7087 <     X0000779  C0001669   1.000000000000e+00 >
Duplicate row C0001671 at line 7088 <     X0000779  C0001671   1.000000000000e+00 >
Duplicate row C0001672 at line 7089 <     X0000779  C0001672   1.000000000000e+00 >
Duplicate row C0001674 at line 7090 <     X0000779  C0001674   1.000000000000e+00 >
Duplicate row C0001666 at line 7101 <     X0000781  C0001666   1.000000000000e+00 >
Duplicate row C0001667 at line 7102 <     X0000781  C0001667   1.000000000000e+00 >
Duplicate row C0001670 at line 7103 <     X0000781  C0001670   1.000000000000e+00 >
Duplicate row C0001671 at line 7104 <     X0000781  C0001671   1.000000000000e+00 >
Duplicate row C0001672 at line 7105 <     X0000781  C0001672   1.000000000000e+00 >
Duplicate row C0001675 at line 7106 <     X0000781  C0001675   1.000000000000e+00 >
Duplicate row C0001666 at line 7118 <     X0000783  C0001666   1.000000000000e+00 >
Duplicate row C0001668 at line 7119 <     X0000783  C0001668   1.000000000000e+00 >
Duplicate row C0001669 at line 7120 <     X0000783  C0001669   1.000000000000e+00 >
Duplicate row C0001671 at line 7121 <     X0000783  C0001671   1.000000000000e+00 >
Duplicate row C0001673 at line 7122 <     X0000783  C0001673   1.000000000000e+00 >
Duplicate row C0001674 at line 7123 <     X0000783  C0001674   1.000000000000e+00 >
Duplicate objective at line 7124 <     X0000783  OBJ        8.920574188232e-03 >
Duplicate row C0001666 at line 7135 <     X0000785  C0001666   1.000000000000e+00 >
Duplicate row C0001668 at line 7136 <     X0000785  C0001668   1.000000000000e+00 >
Duplicate row C0001670 at line 7137 <     X0000785  C0001670   1.000000000000e+00 >
Duplicate row C0001671 at line 7138 <     X0000785  C0001671   1.000000000000e+00 >
Duplicate row C0001673 at line 7139 <     X0000785  C0001673   1.000000000000e+00 >
Duplicate row C0001675 at line 7140 <     X0000785  C0001675   1.000000000000e+00 >
Duplicate row C0002203 at line 11222 <     X0001509  C0002203   1.000000000000e+00 >
Duplicate row C0002204 at line 11223 <     X0001509  C0002204   1.000000000000e+00 >
Duplicate row C0002206 at line 11224 <     X0001509  C0002206   1.000000000000e+00 >
Duplicate row C0002208 at line 11225 <     X0001509  C0002208   1.000000000000e+00 >
Duplicate row C0002209 at line 11226 <     X0001509  C0002209   1.000000000000e+00 >
Duplicate row C0002211 at line 11227 <     X0001509  C0002211   1.000000000000e+00 >
Duplicate row C0002203 at line 11238 <     X0001511  C0002203   1.000000000000e+00 >
Duplicate row C0002204 at line 11239 <     X0001511  C0002204   1.000000000000e+00 >
Duplicate row C0002207 at line 11240 <     X0001511  C0002207   1.000000000000e+00 >
Duplicate row C0002208 at line 11241 <     X0001511  C0002208   1.000000000000e+00 >
Duplicate row C0002209 at line 11242 <     X0001511  C0002209   1.000000000000e+00 >
Duplicate row C0002212 at line 11243 <     X0001511  C0002212   1.000000000000e+00 >
Duplicate row C0002203 at line 11255 <     X0001513  C0002203   1.000000000000e+00 >
Duplicate row C0002205 at line 11256 <     X0001513  C0002205   1.000000000000e+00 >
Duplicate row C0002206 at line 11257 <     X0001513  C0002206   1.000000000000e+00 >
Duplicate row C0002208 at line 11258 <     X0001513  C0002208   1.000000000000e+00 >
Duplicate row C0002210 at line 11259 <     X0001513  C0002210   1.000000000000e+00 >
Duplicate row C0002211 at line 11260 <     X0001513  C0002211   1.000000000000e+00 >
Duplicate objective at line 11261 <     X0001513  OBJ        8.920574188232e-03 >
Duplicate row C0002203 at line 11272 <     X0001515  C0002203   1.000000000000e+00 >
Duplicate row C0002205 at line 11273 <     X0001515  C0002205   1.000000000000e+00 >
Duplicate row C0002207 at line 11274 <     X0001515  C0002207   1.000000000000e+00 >
Duplicate row C0002208 at line 11275 <     X0001515  C0002208   1.000000000000e+00 >
Duplicate row C0002210 at line 11276 <     X0001515  C0002210   1.000000000000e+00 >
Duplicate row C0002212 at line 11277 <     X0001515  C0002212   1.000000000000e+00 >
Duplicate row C0000590 at line 13735 <     X0001952  C0000590   1.000000000000e+00 >
Duplicate row C0000591 at line 13736 <     X0001952  C0000591   1.000000000000e+00 >
Duplicate row C0000593 at line 13737 <     X0001952  C0000593   1.000000000000e+00 >
Duplicate row C0000595 at line 13738 <     X0001952  C0000595   1.000000000000e+00 >
Duplicate row C0000596 at line 13739 <     X0001952  C0000596   1.000000000000e+00 >
Duplicate row C0000598 at line 13740 <     X0001952  C0000598   1.000000000000e+00 >
Duplicate row C0000590 at line 13751 <     X0001954  C0000590   1.000000000000e+00 >
Duplicate row C0000591 at line 13752 <     X0001954  C0000591   1.000000000000e+00 >
Duplicate row C0000594 at line 13753 <     X0001954  C0000594   1.000000000000e+00 >
Duplicate row C0000595 at line 13754 <     X0001954  C0000595   1.000000000000e+00 >
Duplicate row C0000596 at line 13755 <     X0001954  C0000596   1.000000000000e+00 >
Duplicate row C0000599 at line 13756 <     X0001954  C0000599   1.000000000000e+00 >
Duplicate row C0000590 at line 13768 <     X0001956  C0000590   1.000000000000e+00 >
Duplicate row C0000592 at line 13769 <     X0001956  C0000592   1.000000000000e+00 >
Duplicate row C0000593 at line 13770 <     X0001956  C0000593   1.000000000000e+00 >
Duplicate row C0000595 at line 13771 <     X0001956  C0000595   1.000000000000e+00 >
Duplicate row C0000597 at line 13772 <     X0001956  C0000597   1.000000000000e+00 >
Duplicate row C0000598 at line 13773 <     X0001956  C0000598   1.000000000000e+00 >
Duplicate objective at line 13774 <     X0001956  OBJ        8.920574188232e-03 >
Duplicate row C0000590 at line 13785 <     X0001958  C0000590   1.000000000000e+00 >
Duplicate row C0000592 at line 13786 <     X0001958  C0000592   1.000000000000e+00 >
Duplicate row C0000594 at line 13787 <     X0001958  C0000594   1.000000000000e+00 >
Duplicate row C0000595 at line 13788 <     X0001958  C0000595   1.000000000000e+00 >
Duplicate row C0000597 at line 13789 <     X0001958  C0000597   1.000000000000e+00 >
At line 22450 RHS
At line 25106 BOUNDS
At line 28108 ENDATA
Problem MODEL has 2655 rows, 2988 columns and 11651 elements
Coin0008I MODEL read with 116 errors
There were 116 errors on input
seconds was changed from 1e+100 to 600
threads was changed from 0 to 104
Option for timeMode changed from cpu to elapsed
** Current model not valid
Option for printingOptions changed from normal to all
** Current model not valid
No match for /tmp/fd2f9f31b8df4c399e4cf1c02094c54b-pulp.sol - ? for list of commands
Total time (CPU seconds):       0.01   (Wallclock seconds):       0.02

Traceback (most recent call last):
  File "/data/haiqwa/zevin_nfs/andy/Auto-Parallelization/nnscaler_group1/qinghe/nnscaler-main/examples/llama3_8B_128K/train.py", line 291, in <module>
    main(args)
  File "/data/haiqwa/zevin_nfs/andy/Auto-Parallelization/nnscaler_group1/qinghe/nnscaler-main/examples/llama3_8B_128K/train.py", line 240, in main
    trainer.run()
  File "/data/haiqwa/zevin_nfs/andy/Auto-Parallelization/nnscaler_group1/qinghe/nnscaler-main/nnscaler/cli/trainer.py", line 98, in run
    self._setup()
  File "/data/haiqwa/zevin_nfs/andy/Auto-Parallelization/nnscaler_group1/qinghe/nnscaler-main/nnscaler/cli/trainer.py", line 209, in _setup
    pmodel_class = nnscaler.parallelize(
  File "/data/haiqwa/zevin_nfs/andy/Auto-Parallelization/nnscaler_group1/qinghe/nnscaler-main/nnscaler/parallel.py", line 996, in parallelize
    regen_status = _gencode(
  File "/data/haiqwa/zevin_nfs/andy/Auto-Parallelization/nnscaler_group1/qinghe/nnscaler-main/nnscaler/parallel.py", line 761, in _gencode
    graph = pas_policy(graph, compute_config)
  File "/data/haiqwa/zevin_nfs/andy/Auto-Parallelization/nnscaler_group1/qinghe/nnscaler-main/nnscaler/policies.py", line 307, in pas_autodist
    return parallelize_graph(graph, autodist_cfg)
  File "/data/haiqwa/zevin_nfs/andy/Auto-Parallelization/nnscaler_group1/qinghe/nnscaler-main/nnscaler/autodist/apis.py", line 120, in parallelize_graph
    search_out = calc_parallel_plan(graph, autodist_config)
  File "/data/haiqwa/zevin_nfs/andy/Auto-Parallelization/nnscaler_group1/qinghe/nnscaler-main/nnscaler/autodist/apis.py", line 101, in calc_parallel_plan
    pp_out = calc_optimal_spmd_plan(autodist_graph, autodist_config)
  File "/data/haiqwa/zevin_nfs/andy/Auto-Parallelization/nnscaler_group1/qinghe/nnscaler-main/nnscaler/autodist/spmd_solver.py", line 1516, in calc_optimal_spmd_plan
    spmd_outs = spmd_solver.solve([(0, model_graph.op_num - 1)], 1)[0]
  File "/data/haiqwa/zevin_nfs/andy/Auto-Parallelization/nnscaler_group1/qinghe/nnscaler-main/nnscaler/autodist/spmd_solver.py", line 1385, in solve
    return self.do_ilp(intervals, topk)
  File "/data/haiqwa/zevin_nfs/andy/Auto-Parallelization/nnscaler_group1/qinghe/nnscaler-main/nnscaler/autodist/spmd_solver.py", line 1178, in do_ilp
    solver_out = self._solve_by_ilp(start, end)
  File "/data/haiqwa/zevin_nfs/andy/Auto-Parallelization/nnscaler_group1/qinghe/nnscaler-main/nnscaler/autodist/spmd_solver.py", line 1119, in _solve_by_ilp
    prob.solve(solver)
  File "/opt/conda/envs/andy/nnscaler/lib/python3.10/site-packages/pulp/pulp.py", line 1867, in solve
    status = solver.actualSolve(self, **kwargs)
  File "/opt/conda/envs/andy/nnscaler/lib/python3.10/site-packages/pulp/apis/coin_api.py", line 112, in actualSolve
    return self.solve_CBC(lp, **kwargs)
  File "/opt/conda/envs/andy/nnscaler/lib/python3.10/site-packages/pulp/apis/coin_api.py", line 190, in solve_CBC
    raise PulpSolverError("Pulp: Error while executing " + self.path)
pulp.apis.core.PulpSolverError: Pulp: Error while executing /opt/conda/envs/andy/nnscaler/lib/python3.10/site-packages/pulp/solverdir/cbc/linux/64/cbc
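
To make the report easier to triage, the duplicate entries in the CBC log above can be grouped by column. The following is only a minimal sketch for inspecting the log, not a fix and not part of nnscaler or PuLP: the names `DUP_RE` and `summarize_duplicates` are hypothetical, and the sample text is copied from a few lines of the log above. It assumes the CBC output has been captured as plain text.

```python
import re
from collections import defaultdict

# Matches CBC messages of the two shapes seen in the log, for example:
#   Duplicate row C0001129 at line 2790 <     X0000021  C0001129   1.000000000000e+00 >
#   Duplicate objective at line 2796 <     X0000021  OBJ        8.920574188232e-03 >
DUP_RE = re.compile(
    r"Duplicate (?:row \S+|objective) at line (?P<line>\d+) "
    r"<\s*(?P<col>\S+)\s+(?P<entry>\S+)\s+(?P<value>\S+)\s*>"
)


def summarize_duplicates(log_text):
    """Group duplicated MPS entries by column name.

    Returns a dict mapping each column (e.g. "X0000021") to a list of
    (row_or_OBJ, mps_line_number) tuples, in the order they appear.
    """
    by_column = defaultdict(list)
    for match in DUP_RE.finditer(log_text):
        by_column[match.group("col")].append(
            (match.group("entry"), int(match.group("line")))
        )
    return dict(by_column)


# A few lines copied from the log in this issue.
sample = """\
Duplicate row C0001129 at line 2757 <     X0000017  C0001129   1.000000000000e+00 >
Duplicate row C0001129 at line 2790 <     X0000021  C0001129   1.000000000000e+00 >
Duplicate row C0001131 at line 2791 <     X0000021  C0001131   1.000000000000e+00 >
Duplicate objective at line 2796 <     X0000021  OBJ        8.920574188232e-03 >
"""

summary = summarize_duplicates(sample)
for column, entries in summary.items():
    print(column, entries)
```

Running this over the full log shows which columns of the MPS file CBC flags, which may help map them back to the corresponding variables in the solver.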
