
Llama-3-8 Baseline for Figure 7 and 8b #3

@Jiminator

I have successfully run the baseline planner evaluation experiments from the artifact evaluation using the OPT-350 model. However, I have been unable to do so with the provided LLAMA-3-8 model. When I use the run_homogenous.sh (Figure 7) and run_het.sh (Figure 8b) scripts with LLAMA-3-8, I get different errors depending on which planner I use, and none of them returns a correct result. Sailor hits a segfault (in find_tmp_degrees, per the backtrace below), Metis is missing the profile files it needs (and I expect its runtime to be too slow to use anyway), and Piper returns an empty result. I made as few code changes as possible to avoid introducing bugs: I added branches on args.baselines in run_all_sim.py so that any single planner framework can be selected, created a new LLAMA config file, and added an entry for it in the train_config_files dictionary in run_all_sim.py.

As a sanity test, I also created another llama config file that contains the same values as the OPT-350 config file, except for the "model" and "num_all_layers" entries. The same errors still show up.
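
A comparison like the one described above can be double-checked mechanically. The following is a minimal sketch, not part of the repository: load_config and diff_configs are hypothetical helper names, and the two dicts are deliberately partial, illustrative stand-ins rather than the real config files (the OPT-350 values are not reproduced here).

```python
import json


def load_config(path):
    """Load one training config JSON file into a dict."""
    with open(path) as f:
        return json.load(f)


def diff_configs(reference, candidate):
    """Return {key: (reference_value, candidate_value)} for keys that differ.

    A key that is missing on one side shows up with None on that side.
    """
    keys = sorted(set(reference) | set(candidate))
    return {
        k: (reference.get(k), candidate.get(k))
        for k in keys
        if reference.get(k) != candidate.get(k)
    }


# Illustrative, partial dicts (assumption: not the full contents of either file).
reference = {"model": "OPT-350", "hidden_size": 1024, "heads": 16}
candidate = {
    "model": "LLAMA-3-8",
    "hidden_size": 1024,
    "heads": 16,
    "num_all_layers": 35,
}

for key, (ref_value, cand_value) in diff_configs(reference, candidate).items():
    print(f"{key}: {ref_value!r} -> {cand_value!r}")
```

With real files, the two dicts would come from load_config called on the OPT-350 config and on sanity_check.json; only "model" and "num_all_layers" should then be reported.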

Could you provide some insight into how to run the planner evaluation experiments on LLAMA-3-8B? That would help us understand how to explore other models using Sailor. Please let me know if I have made a mistake or am missing something. I would be happy to provide more details if necessary!

Thank you for reading!

Below are all of the files I added or changed, along with the output and error log from running run_homogenous.sh, to help reproduce and debug:

sailor/Planner/simulations/configs/training_config_llama_3.json
{
  "global_batch_size": 2048,
  "type": "gpt2",
  "hidden_size": 4096,
  "sequence_length": 8192,
  "num_layers": 32,
  "vocab_size": 128256,
  "model": "LLAMA-3-8",
  "optimizer": "Adam",
  "heads": 32,
  "head_dim": 128,
  "max_position_embeddings": 8192,
  "num_all_layers": 35
}
sailor/Planner/simulations/configs/sanity_check.json
{
  "global_batch_size": 1024,
  "type": "gpt2",
  "hidden_size": 1024,
  "sequence_length": 2048,
  "num_layers": 32,
  "vocab_size": 50272,
  "model": "LLAMA-3-8",
  "optimizer": "Adam",
  "heads": 16,
  "head_dim": 64,
  "max_position_embeddings": 2048,
  "num_all_layers": 35
}
ae_scripts/planner/run_homogenous.sh
#!/bin/bash

mkdir -p /root/sailor/ae_results/planner/fig7

python ae_scripts/planner/run_all_sim.py \
--model-name LLAMA-3-8 \
--gpu-type A100-40 \
--trace_file /root/sailor/sailor/Planner/simulations/configs/gpu_trace_scaled.csv \
--basic_cluster_config_json /root/sailor/sailor/Planner/simulations/configs/basic_cluster_config.json \
--simulator_profile_file /root/sailor/sailor/Planner/simulations/profiles_tmp.json \
--simulator_profile_file_op /root/sailor/sailor/Planner/simulations/profiles_tmp_aceso.json \
--quotas_dict /root/sailor/sailor/Planner/sailor_planner/dummy_quotas_dict.json \
--gpus-per-node 4 \
--sailor_path /root \
--res_dir ae_results/planner/test \
--objective throughput \
--baselines sailor
ae_scripts/planner/run_het.sh
#!/bin/bash

mkdir -p ae_results/planner/fig8b

python ae_scripts/planner/run_all_sim.py \
--model-name LLAMA-3-8 \
--gpu-type A100-40 \
--trace_file /root/sailor/sailor/Planner/simulations/configs/gpu_trace_heterogeneous_imbalanced.csv \
--basic_cluster_config_json /root/sailor/sailor/Planner/simulations/configs/basic_cluster_config.json \
--simulator_profile_file /root/sailor/sailor/Planner/simulations/profiles_tmp.json \
--quotas_dict /root/sailor/sailor/Planner/sailor_planner/dummy_quotas_dict.json \
--gpus-per-node 4 \
--sailor_path /root \
--res_dir ae_results/planner/test \
--objective throughput \
--baselines sailor
Diff for ae_scripts/planner/run_all_sim.py
diff --git a/ae_scripts/planner/run_all_sim.py b/ae_scripts/planner/run_all_sim.py
index b182e69..7855c28 100644
--- a/ae_scripts/planner/run_all_sim.py
+++ b/ae_scripts/planner/run_all_sim.py
@@ -8,6 +8,7 @@ def run_all_for_model(args):
     train_config_files = {
         "OPT-350": f"{home_dir}/{sailor_sim_path}/configs/training_config_opt_350.json",
         "GPT-Neo-2.7": f"{home_dir}/{sailor_sim_path}/configs/training_config_gpt_neo27.json",
+        "LLAMA-3-8": f"{home_dir}/{sailor_sim_path}/configs/training_config_llama_3.json",
     }
 
     res_dir = args.res_dir
@@ -44,6 +45,25 @@ def run_all_for_model(args):
         all_cmd = [dtfm_cmd, sailor_cmd]
     elif args.baselines=="cost":
         all_cmd = [galvatron_cmd, amp_cmd, flashflex_cmd, metis_cmd, dtfm_cmd, sailor_cmd]
+    elif args.baselines=="sailor":
+        all_cmd = [sailor_cmd]
+    elif args.baselines=="flashflex":
+        all_cmd = [flashflex_cmd]
+    elif args.baselines=="metis":
+        all_cmd = [metis_cmd]
+    elif args.baselines=="varuna":
+        all_cmd = [varuna_cmd]
+    elif args.baselines=="piper":
+        all_cmd = [piper_cmd]
+    elif args.baselines=="amp":
+        all_cmd = [amp_cmd]
+    elif args.baselines=="galvatron":
+        all_cmd = [galvatron_cmd]
+    elif args.baselines=="dtfm":
+        all_cmd = [dtfm_cmd]
+    elif args.baselines=="aceso":
+        all_cmd = [aceso_cmd]
+    
     for cmd in all_cmd:
         if cmd==aceso_cmd:
             cmd += f" --simulator_profile_file {args.simulator_profile_file_op}"
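
The per-planner branches in the diff above could also be written as a lookup table. This is only a sketch under stated assumptions: select_commands, commands, and groups are hypothetical names that do not exist in run_all_sim.py, and the placeholder strings stand in for the real command strings built there. Only the "cost" group is taken from the diff context.

```python
def select_commands(baseline, commands, groups):
    """Map a --baselines value to the list of planner commands to run.

    `groups` holds named sets of planners (such as "cost");
    `commands` holds one command string per individual planner.
    """
    if baseline in groups:
        return [commands[name] for name in groups[baseline]]
    if baseline in commands:
        return [commands[baseline]]
    # Fail loudly instead of leaving the command list undefined.
    raise ValueError(f"unknown baseline: {baseline!r}")


# Placeholder command strings (assumption: the real ones are built in run_all_sim.py).
planners = [
    "sailor", "flashflex", "metis", "varuna", "piper",
    "amp", "galvatron", "dtfm", "aceso",
]
commands = {name: f"<{name}_cmd>" for name in planners}

# The "cost" group mirrors the list shown in the diff context above.
groups = {"cost": ["galvatron", "amp", "flashflex", "metis", "dtfm", "sailor"]}

print(select_commands("sailor", commands, groups))
print(select_commands("cost", commands, groups))
```

Raising on an unrecognized name is a design choice of this sketch; it makes a misspelled --baselines value visible immediately.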
Sailor Output & Error Log
Error in cpuinfo: prctl(PR_SVE_GET_VL) failed
rm -rf *.o *.so
g++ -O3 -fPIC -I/usr/include/python3.10 -c planner.cpp -ljsoncpp -o planner.o
planner.cpp: In member function ‘void SailorPlanner::get_plans_no_heuristics(std::unordered_map<std::__cxx11::basic_string<char>, std::vector<std::pair<std::__cxx11::basic_string<char>, std::vector<int> > > >&)’:
planner.cpp:903:34: warning: format ‘%d’ expects argument of type ‘int’, but argument 2 has type ‘std::vector<std::__cxx11::basic_string<char> >::size_type’ {aka ‘long unsigned int’} [-Wformat=]
  903 |                 printf("size is %d\n", available_gpu_types.size());
      |                                 ~^     ~~~~~~~~~~~~~~~~~~~~~~~~~~
      |                                  |                             |
      |                                  int                           std::vector<std::__cxx11::basic_string<char> >::size_type {aka long unsigned int}
      |                                 %ld
planner.cpp:922:46: warning: format ‘%d’ expects argument of type ‘int’, but argument 2 has type ‘std::vector<std::pair<std::__cxx11::basic_string<char>, std::vector<std::__cxx11::basic_string<char> > > >::size_type’ {aka ‘long unsigned int’} [-Wformat=]
  922 |                 printf("Permutation size is %d\n", region_list.size());
      |                                             ~^     ~~~~~~~~~~~~~~~~~~
      |                                              |                     |
      |                                              int                   std::vector<std::pair<std::__cxx11::basic_string<char>, std::vector<std::__cxx11::basic_string<char> > > >::size_type {aka long unsigned int}
      |                                             %ld
planner.cpp: In member function ‘void SailorPlanner::get_plans_num_gpus_dp(std::unordered_map<std::__cxx11::basic_string<char>, std::vector<std::pair<std::__cxx11::basic_string<char>, std::vector<int> > > >&)’:
planner.cpp:1123:42: warning: format ‘%d’ expects argument of type ‘int’, but argument 2 has type ‘std::vector<std::pair<std::__cxx11::basic_string<char>, std::vector<std::__cxx11::basic_string<char> > > >::size_type’ {aka ‘long unsigned int’} [-Wformat=]
 1123 |             printf("Permutation size is %d\n", region_list.size());
      |                                         ~^     ~~~~~~~~~~~~~~~~~~
      |                                          |                     |
      |                                          int                   std::vector<std::pair<std::__cxx11::basic_string<char>, std::vector<std::__cxx11::basic_string<char> > > >::size_type {aka long unsigned int}
      |                                         %ld
planner.cpp: In member function ‘ParallelismConfig* SailorPlanner::solve_dp(int, int, int, int, int, int, const string&, const std::vector<std::__cxx11::basic_string<char> >&, std::unordered_map<std::__cxx11::basic_string<char>, std::vector<std::pair<std::__cxx11::basic_string<char>, std::vector<int> > > >&, std::string, std::vector<std::vector<int> >&, std::vector<std::vector<int> >&)’:
planner.cpp:717:31: note: parameter passing for argument of type ‘std::pair<double, double>’ when C++17 is enabled changed to match C++14 in GCC 10.1
  717 |                     id_to_zone);
      |                               ^
g++ -O3 -fPIC -c utils/read_json.cpp -ljsoncpp -o read_json.o
g++ -O3 -fPIC -c training.cpp -ljsoncpp -o training.o
g++ -O3 -fPIC -c planner_utils.cpp -ljsoncpp -o planner_utils.o
planner_utils.cpp: In function ‘std::pair<double, double> find_p2p_time_cost(int, int, int, int, double, std::vector<StageConfig>&, std::unordered_map<std::__cxx11::basic_string<char>, std::unordered_map<int, std::unordered_map<int, std::unordered_map<std::__cxx11::basic_string<char>, std::unordered_map<int, std::unordered_map<int, std::pair<std::vector<double>, double> > > > > > >&, std::unordered_map<long unsigned int, double>&, std::unordered_map<std::__cxx11::basic_string<char>, std::unordered_map<std::__cxx11::basic_string<char>, double> >&, std::vector<std::__cxx11::basic_string<char> >&, bool)’:
planner_utils.cpp:122:20: note: parameter passing for argument of type ‘std::pair<double, double>’ when C++17 is enabled changed to match C++14 in GCC 10.1
  122 |     bool activation)
      |                    ^
g++ -O3 -fPIC -c utils/network_utils.cpp -o network_utils.o
g++ -O3 -fPIC -shared -I/usr/include/python3.10   read_json.o training.o planner_utils.o network_utils.o planner.o -ljsoncpp -o libplanner.cpython-310-aarch64-linux-gnu.so
/root/sailor/sailor/Planner/sailor_planner/profiles/LLAMA-3-8/
/root/sailor/sailor/providers/multizone_bandwidths_het.json
/root/sailor/sailor/Planner/simulations/configs/training_config_llama_3.json
/root/sailor/sailor/Planner/llm_info.json
/root/sailor/sailor/providers/gcp/communication_cost.json
AT QUOTAS DICT: A100-40
/root/sailor/sailor/Planner/sailor_planner/profiles/LLAMA-3-8/A100-40/profile.json
AT QUOTAS DICT: V100-16
/root/sailor/sailor/Planner/sailor_planner/profiles/LLAMA-3-8/V100-16/profile.json
Check for GPU A100-40
Check for GPU V100-16
build_structs took 21849 ms
NUM_VALID_GPUS is 1
hash_string is us-central1-a_A100-40_32_V100-16_0
region: us-central1
zone: us-central1-a
32,0,
-------------------------------------------------------------------------------------------------------- Check for PP: 1
**************************************************** Check for MBS: 1
MIN_TMPS - Check for GPU 0
[]
MIN_TMPS - Check for GPU 1
[]
Permutation size is 1
++++++++++++++++++++++++++++++++++++++++ Check for regions: us-central1, 
MAX_DP is 0
----------- CHECK WITH D 683, max_dp is 0
max cur budget is 0.00
----------------------------------------------- Tpp is 0.000000, solve_dp duration is 23 us, extra_cost is 0
----------- CHECK WITH D 512, max_dp is 0
max cur budget is 0.00
----------------------------------------------- Tpp is 0.000000, solve_dp duration is 1 us, extra_cost is 0
----------- CHECK WITH D 410, max_dp is 0
max cur budget is 0.00
----------------------------------------------- Tpp is 0.000000, solve_dp duration is 13 us, extra_cost is 0
----------- CHECK WITH D 342, max_dp is 0
max cur budget is 0.00
----------------------------------------------- Tpp is 0.000000, solve_dp duration is 15 us, extra_cost is 0
----------- CHECK WITH D 293, max_dp is 0
max cur budget is 0.00
----------------------------------------------- Tpp is 0.000000, solve_dp duration is 12 us, extra_cost is 0
----------- CHECK WITH D 256, max_dp is 0
max cur budget is 0.00
----------------------------------------------- Tpp is 0.000000, solve_dp duration is 13 us, extra_cost is 0
----------- CHECK WITH D 228, max_dp is 0
max cur budget is 0.00
----------------------------------------------- Tpp is 0.000000, solve_dp duration is 13 us, extra_cost is 0
----------- CHECK WITH D 205, max_dp is 0
max cur budget is 0.00
----------------------------------------------- Tpp is 0.000000, solve_dp duration is 1 us, extra_cost is 0
----------- CHECK WITH D 187, max_dp is 0
max cur budget is 0.00
----------------------------------------------- Tpp is 0.000000, solve_dp duration is 1 us, extra_cost is 0
----------- CHECK WITH D 171, max_dp is 0
max cur budget is 0.00
----------------------------------------------- Tpp is 0.000000, solve_dp duration is 1 us, extra_cost is 0
----------- CHECK WITH D 158, max_dp is 0
max cur budget is 0.00
----------------------------------------------- Tpp is 0.000000, solve_dp duration is 2 us, extra_cost is 0
----------- CHECK WITH D 147, max_dp is 0
max cur budget is 0.00
----------------------------------------------- Tpp is 0.000000, solve_dp duration is 1 us, extra_cost is 0
----------- CHECK WITH D 137, max_dp is 0
max cur budget is 0.00
----------------------------------------------- Tpp is 0.000000, solve_dp duration is 1 us, extra_cost is 0
----------- CHECK WITH D 128, max_dp is 0
max cur budget is 0.00
----------------------------------------------- Tpp is 0.000000, solve_dp duration is 1 us, extra_cost is 0
----------- CHECK WITH D 121, max_dp is 0
max cur budget is 0.00
----------------------------------------------- Tpp is 0.000000, solve_dp duration is 1 us, extra_cost is 0
----------- CHECK WITH D 114, max_dp is 0
max cur budget is 0.00
----------------------------------------------- Tpp is 0.000000, solve_dp duration is 1 us, extra_cost is 0
----------- CHECK WITH D 108, max_dp is 0
max cur budget is 0.00
----------------------------------------------- Tpp is 0.000000, solve_dp duration is 1 us, extra_cost is 0
----------- CHECK WITH D 103, max_dp is 0
max cur budget is 0.00
----------------------------------------------- Tpp is 0.000000, solve_dp duration is 1 us, extra_cost is 0
----------- CHECK WITH D 98, max_dp is 0
max cur budget is 0.00
----------------------------------------------- Tpp is 0.000000, solve_dp duration is 1 us, extra_cost is 0
----------- CHECK WITH D 94, max_dp is 0
max cur budget is 0.00
----------------------------------------------- Tpp is 0.000000, solve_dp duration is 1 us, extra_cost is 0
----------- CHECK WITH D 90, max_dp is 0
max cur budget is 0.00
----------------------------------------------- Tpp is 0.000000, solve_dp duration is 1 us, extra_cost is 0
----------- CHECK WITH D 86, max_dp is 0
max cur budget is 0.00
----------------------------------------------- Tpp is 0.000000, solve_dp duration is 2 us, extra_cost is 0
----------- CHECK WITH D 82, max_dp is 0
max cur budget is 0.00
----------------------------------------------- Tpp is 0.000000, solve_dp duration is 2 us, extra_cost is 0
----------- CHECK WITH D 79, max_dp is 0
max cur budget is 0.00
----------------------------------------------- Tpp is 0.000000, solve_dp duration is 1 us, extra_cost is 0
----------- CHECK WITH D 76, max_dp is 0
max cur budget is 0.00
----------------------------------------------- Tpp is 0.000000, solve_dp duration is 1 us, extra_cost is 0
----------- CHECK WITH D 74, max_dp is 0
max cur budget is 0.00
----------------------------------------------- Tpp is 0.000000, solve_dp duration is 1 us, extra_cost is 0
----------- CHECK WITH D 71, max_dp is 0
max cur budget is 0.00
----------------------------------------------- Tpp is 0.000000, solve_dp duration is 1 us, extra_cost is 0
----------- CHECK WITH D 69, max_dp is 0
max cur budget is 0.00
----------------------------------------------- Tpp is 0.000000, solve_dp duration is 1 us, extra_cost is 0
----------- CHECK WITH D 67, max_dp is 0
max cur budget is 0.00
----------------------------------------------- Tpp is 0.000000, solve_dp duration is 1 us, extra_cost is 0
----------- CHECK WITH D 64, max_dp is 0
max cur budget is 0.00
----------------------------------------------- Tpp is 0.000000, solve_dp duration is 1 us, extra_cost is 0
----------- CHECK WITH D 63, max_dp is 0
max cur budget is 0.00
----------------------------------------------- Tpp is 0.000000, solve_dp duration is 1 us, extra_cost is 0
----------- CHECK WITH D 61, max_dp is 0
max cur budget is 0.00
----------------------------------------------- Tpp is 0.000000, solve_dp duration is 1 us, extra_cost is 0
----------- CHECK WITH D 59, max_dp is 0
max cur budget is 0.00
----------------------------------------------- Tpp is 0.000000, solve_dp duration is 1 us, extra_cost is 0
----------- CHECK WITH D 57, max_dp is 0
max cur budget is 0.00
----------------------------------------------- Tpp is 0.000000, solve_dp duration is 1 us, extra_cost is 0
----------- CHECK WITH D 56, max_dp is 0
max cur budget is 0.00
----------------------------------------------- Tpp is 0.000000, solve_dp duration is 4 us, extra_cost is 0
----------- CHECK WITH D 54, max_dp is 0
max cur budget is 0.00
----------------------------------------------- Tpp is 0.000000, solve_dp duration is 2 us, extra_cost is 0
----------- CHECK WITH D 53, max_dp is 0
max cur budget is 0.00
----------------------------------------------- Tpp is 0.000000, solve_dp duration is 1 us, extra_cost is 0
----------- CHECK WITH D 52, max_dp is 0
max cur budget is 0.00
----------------------------------------------- Tpp is 0.000000, solve_dp duration is 1 us, extra_cost is 0
----------- CHECK WITH D 50, max_dp is 0
max cur budget is 0.00
----------------------------------------------- Tpp is 0.000000, solve_dp duration is 1 us, extra_cost is 0
----------- CHECK WITH D 49, max_dp is 0
max cur budget is 0.00
----------------------------------------------- Tpp is 0.000000, solve_dp duration is 1 us, extra_cost is 0
----------- CHECK WITH D 48, max_dp is 0
max cur budget is 0.00
----------------------------------------------- Tpp is 0.000000, solve_dp duration is 1 us, extra_cost is 0
----------- CHECK WITH D 47, max_dp is 0
max cur budget is 0.00
----------------------------------------------- Tpp is 0.000000, solve_dp duration is 1 us, extra_cost is 0
----------- CHECK WITH D 46, max_dp is 0
max cur budget is 0.00
----------------------------------------------- Tpp is 0.000000, solve_dp duration is 6 us, extra_cost is 0
----------- CHECK WITH D 45, max_dp is 0
max cur budget is 0.00
----------------------------------------------- Tpp is 0.000000, solve_dp duration is 1 us, extra_cost is 0
----------- CHECK WITH D 44, max_dp is 0
max cur budget is 0.00
----------------------------------------------- Tpp is 0.000000, solve_dp duration is 1 us, extra_cost is 0
----------- CHECK WITH D 43, max_dp is 0
max cur budget is 0.00
----------------------------------------------- Tpp is 0.000000, solve_dp duration is 1 us, extra_cost is 0
----------- CHECK WITH D 42, max_dp is 0
max cur budget is 0.00
----------------------------------------------- Tpp is 0.000000, solve_dp duration is 1 us, extra_cost is 0
----------- CHECK WITH D 41, max_dp is 0
max cur budget is 0.00
----------------------------------------------- Tpp is 0.000000, solve_dp duration is 1 us, extra_cost is 0
----------- CHECK WITH D 40, max_dp is 0
max cur budget is 0.00
----------------------------------------------- Tpp is 0.000000, solve_dp duration is 1 us, extra_cost is 0
----------- CHECK WITH D 39, max_dp is 0
max cur budget is 0.00
----------------------------------------------- Tpp is 0.000000, solve_dp duration is 1 us, extra_cost is 0
----------- CHECK WITH D 38, max_dp is 0
max cur budget is 0.00
----------------------------------------------- Tpp is 0.000000, solve_dp duration is 1 us, extra_cost is 0
----------- CHECK WITH D 37, max_dp is 0
max cur budget is 0.00
----------------------------------------------- Tpp is 0.000000, solve_dp duration is 1 us, extra_cost is 0
----------- CHECK WITH D 36, max_dp is 0
max cur budget is 0.00
----------------------------------------------- Tpp is 0.000000, solve_dp duration is 1 us, extra_cost is 0
----------- CHECK WITH D 35, max_dp is 0
max cur budget is 0.00
----------------------------------------------- Tpp is 0.000000, solve_dp duration is 1 us, extra_cost is 0
----------- CHECK WITH D 34, max_dp is 0
max cur budget is 0.00
----------------------------------------------- Tpp is 0.000000, solve_dp duration is 1 us, extra_cost is 0
----------- CHECK WITH D 33, max_dp is 0
max cur budget is 0.00
----------------------------------------------- Tpp is 0.000000, solve_dp duration is 1 us, extra_cost is 0
----------- CHECK WITH D 32, max_dp is 0
max cur budget is 0.00
----------------------------------------------- Tpp is 0.000000, solve_dp duration is 1 us, extra_cost is 0
----------- CHECK WITH D 31, max_dp is 0
max cur budget is 0.00
----------------------------------------------- Tpp is 0.000000, solve_dp duration is 1 us, extra_cost is 0
----------- CHECK WITH D 30, max_dp is 0
max cur budget is 0.00
----------------------------------------------- Tpp is 0.000000, solve_dp duration is 4 us, extra_cost is 0
----------- CHECK WITH D 29, max_dp is 0
max cur budget is 0.00
----------------------------------------------- Tpp is 0.000000, solve_dp duration is 2 us, extra_cost is 0
----------- CHECK WITH D 28, max_dp is 0
max cur budget is 0.00
----------------------------------------------- Tpp is 0.000000, solve_dp duration is 1 us, extra_cost is 0
----------- CHECK WITH D 27, max_dp is 0
max cur budget is 0.00
----------------------------------------------- Tpp is 0.000000, solve_dp duration is 1 us, extra_cost is 0
----------- CHECK WITH D 26, max_dp is 0
max cur budget is 0.00
----------------------------------------------- Tpp is 0.000000, solve_dp duration is 1 us, extra_cost is 0
----------- CHECK WITH D 25, max_dp is 0
max cur budget is 0.00
----------------------------------------------- Tpp is 0.000000, solve_dp duration is 1 us, extra_cost is 0
----------- CHECK WITH D 24, max_dp is 0
max cur budget is 0.00
----------------------------------------------- Tpp is 0.000000, solve_dp duration is 1 us, extra_cost is 0
----------- CHECK WITH D 23, max_dp is 0
max cur budget is 0.00
----------------------------------------------- Tpp is 0.000000, solve_dp duration is 1 us, extra_cost is 0
----------- CHECK WITH D 22, max_dp is 0
max cur budget is 0.00
----------------------------------------------- Tpp is 0.000000, solve_dp duration is 1 us, extra_cost is 0
----------- CHECK WITH D 21, max_dp is 0
max cur budget is 0.00
----------------------------------------------- Tpp is 0.000000, solve_dp duration is 1 us, extra_cost is 0
----------- CHECK WITH D 20, max_dp is 0
max cur budget is 0.00
----------------------------------------------- Tpp is 0.000000, solve_dp duration is 1 us, extra_cost is 0
----------- CHECK WITH D 19, max_dp is 0
max cur budget is 0.00
----------------------------------------------- Tpp is 0.000000, solve_dp duration is 2 us, extra_cost is 0
----------- CHECK WITH D 18, max_dp is 0
max cur budget is 0.00
----------------------------------------------- Tpp is 0.000000, solve_dp duration is 2 us, extra_cost is 0
----------- CHECK WITH D 17, max_dp is 0
max cur budget is 0.00
----------------------------------------------- Tpp is 0.000000, solve_dp duration is 2 us, extra_cost is 0
----------- CHECK WITH D 16, max_dp is 0
max cur budget is 0.00
----------------------------------------------- Tpp is 0.000000, solve_dp duration is 2 us, extra_cost is 0
----------- CHECK WITH D 15, max_dp is 0
max cur budget is 0.00
----------------------------------------------- Tpp is 0.000000, solve_dp duration is 2 us, extra_cost is 0
----------- CHECK WITH D 14, max_dp is 0
max cur budget is 0.00
----------------------------------------------- Tpp is 0.000000, solve_dp duration is 2 us, extra_cost is 0
----------- CHECK WITH D 13, max_dp is 0
max cur budget is 0.00
----------------------------------------------- Tpp is 0.000000, solve_dp duration is 2 us, extra_cost is 0
----------- CHECK WITH D 12, max_dp is 0
max cur budget is 0.00
----------------------------------------------- Tpp is 0.000000, solve_dp duration is 2 us, extra_cost is 0
----------- CHECK WITH D 11, max_dp is 0
max cur budget is 0.00
----------------------------------------------- Tpp is 0.000000, solve_dp duration is 2 us, extra_cost is 0
----------- CHECK WITH D 10, max_dp is 0
max cur budget is 0.00
----------------------------------------------- Tpp is 0.000000, solve_dp duration is 2 us, extra_cost is 0
----------- CHECK WITH D 9, max_dp is 0
max cur budget is 0.00
----------------------------------------------- Tpp is 0.000000, solve_dp duration is 2 us, extra_cost is 0
----------- CHECK WITH D 8, max_dp is 0
max cur budget is 0.00
----------------------------------------------- Tpp is 0.000000, solve_dp duration is 2 us, extra_cost is 0
----------- CHECK WITH D 7, max_dp is 0
max cur budget is 0.00
----------------------------------------------- Tpp is 0.000000, solve_dp duration is 2 us, extra_cost is 0
----------- CHECK WITH D 6, max_dp is 0
max cur budget is 0.00
----------------------------------------------- Tpp is 0.000000, solve_dp duration is 2 us, extra_cost is 0
----------- CHECK WITH D 5, max_dp is 0
max cur budget is 0.00
----------------------------------------------- Tpp is 0.000000, solve_dp duration is 2 us, extra_cost is 0
----------- CHECK WITH D 4, max_dp is 0
max cur budget is 0.00
----------------------------------------------- Tpp is 0.000000, solve_dp duration is 2 us, extra_cost is 0
----------- CHECK WITH D 3, max_dp is 0
max cur budget is 0.00
----------------------------------------------- Tpp is 0.000000, solve_dp duration is 2 us, extra_cost is 0
----------- CHECK WITH D 2, max_dp is 0
max cur budget is 0.00
----------------------------------------------- Tpp is 0.000000, solve_dp duration is 2 us, extra_cost is 0
----------- CHECK WITH D 1, max_dp is 0
max cur budget is 0.00
----------------------------------------------- Tpp is 0.000000, solve_dp duration is 2 us, extra_cost is 0
**************************************************** Check for MBS: 2
[1ff2ae71d04c:3173 :0:3173] Caught signal 11 (Segmentation fault: address not mapped to object at address 0xaaa000000029)
==== backtrace (tid:   3173) ====
 0 0x00000000000300c0 find_tmp_degrees()  ???:0
 1 0x00000000000505a4 SailorPlanner::get_plans_num_gpus_dp()  ???:0
 2 0x000000000005450c SailorPlanner::get_sorted_plans()  ???:0
 3 0x000000000008590c pybind11::cpp_function::initialize<pybind11::cpp_function::initialize<std::vector<Config, std::allocator<Config> >, SailorPlanner, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, int, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, int> > >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, int, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, int> > > > > >, float, double, pybind11::name, pybind11::is_method, pybind11::sibling>(std::vector<Config, std::allocator<Config> > (SailorPlanner::*)(std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, int, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, 
std::char_traits<char>, std::allocator<char> > const, int> > >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, int, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, int> > > > > >, float, double), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&)::{lambda(SailorPlanner*, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, int, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, int> > >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, int, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, 
std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, int> > > > > >, float, double)#1}, std::vector<Config, std::allocator<Config> >, SailorPlanner*, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, int, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, int> > >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, int, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, int> > > > > >, float, double, pybind11::name, pybind11::is_method, pybind11::sibling>(pybind11::cpp_function::initialize<std::vector<Config, std::allocator<Config> >, SailorPlanner, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, int, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, 
std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, int> > >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, int, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, int> > > > > >, float, double, pybind11::name, pybind11::is_method, pybind11::sibling>(std::vector<Config, std::allocator<Config> > (SailorPlanner::*)(std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, int, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, int> > >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, int, std::hash<std::__cxx11::basic_string<char, 
std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, int> > > > > >, float, double), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&)::{lambda(SailorPlanner*, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, int, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, int> > >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, int, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, int> > > > > >, float, double)#1}&&, std::vector<Config, std::allocator<Config> > (*)(SailorPlanner*, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, int, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, 
std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, int> > >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, int, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, int> > > > > >, float, double), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN()  :0
 4 0x00000000000791d8 pybind11::cpp_function::dispatcher()  :0
 5 0x0000000000103c34 PyObject_CallFunctionObjArgs()  ???:0
 6 0x00000000000fa240 _PyObject_MakeTpCall()  ???:0
 7 0x0000000000113c9c PyMethod_New()  ???:0
 8 0x00000000000f0d2c _PyEval_EvalFrameDefault()  ???:0
 9 0x00000000001139a8 PyMethod_New()  ???:0
10 0x00000000000ed8fc _PyEval_EvalFrameDefault()  ???:0
11 0x00000000001048c8 _PyFunction_Vectorcall()  ???:0
12 0x00000000000ed8fc _PyEval_EvalFrameDefault()  ???:0
13 0x00000000001048c8 _PyFunction_Vectorcall()  ???:0
14 0x00000000000ec8a4 _PyEval_EvalFrameDefault()  ???:0
15 0x00000000001e8250 PyEval_EvalCode()  ???:0
16 0x00000000001e80d4 PyEval_EvalCode()  ???:0
17 0x000000000021b3ac PyUnicode_Tailmatch()  ???:0
18 0x0000000000213ab8 PyInit__collections()  ???:0
19 0x000000000021b05c PyUnicode_Tailmatch()  ???:0
20 0x000000000021a1c4 _PyRun_SimpleFileObject()  ???:0
21 0x0000000000219d90 _PyRun_AnyFileObject()  ???:0
22 0x000000000020a7b0 Py_RunMain()  ???:0
23 0x00000000001d9208 Py_BytesMain()  ???:0
24 0x00000000000273fc __libc_init_first()  ???:0
25 0x00000000000274cc __libc_start_main()  ???:0
26 0x00000000001d90f0 _start()  ???:0
=================================
python /root/sailor/sailor/Planner/simulations/simulator.py --sailor_path /root --trace_file /root/sailor/sailor/Planner/simulations/configs/gpu_trace_scaled.csv --basic_cluster_config_json /root/sailor/sailor/Planner/simulations/configs/basic_cluster_config.json --training_config_json /root/sailor/sailor/Planner/simulations/configs/training_config_llama_3.json   --result_dir_path ae_results/planner/test --objective throughput --planner SAILOR  --sailor_profile_file_dir /root/sailor/sailor/Planner/sailor_planner/profiles/LLAMA-3-8/ --quotas_dict /root/sailor/sailor/Planner/sailor_planner/dummy_quotas_dict.json --simulator_profile_file /root/sailor/sailor/Planner/simulations/profiles_tmp.json
Piper Output & Error Log
[PIPER] Call the planning function
35 ideals
595 ideal pairs
running DP...
[863] timeSpentInGetAllTPSsForIdealPair = 0.000596
[864] timeSpentInDPLoop = 0.000391
Solution is found   ...
Cannot partition the graph due to mem usage! Return empty result ...
---------------------------------------- Evaluate baselines with {'A100-40_us-central1-a': 32}  --------------------------------
A100-40 us-central1-a
test_cluster_config is {'gpu_type': 'A100-40', 'num_nodes': 8, 'gpus_per_node': 4, 'mem_per_gpu': 42338615296, 'zone': 'us-central1-a'}
Max Devices is 32
Throughput is 0.0, Search_time is 0.22597599029541016
---------------------------------------- Evaluate baselines with {'A100-40_us-central1-a': 80}  --------------------------------
A100-40 us-central1-a
test_cluster_config is {'gpu_type': 'A100-40', 'num_nodes': 20, 'gpus_per_node': 4, 'mem_per_gpu': 42338615296, 'zone': 'us-central1-a'}
Max Devices is 80
Throughput is 0.0, Search_time is 0.05569314956665039
---------------------------------------- Evaluate baselines with {'A100-40_us-central1-a': 128}  --------------------------------
A100-40 us-central1-a
test_cluster_config is {'gpu_type': 'A100-40', 'num_nodes': 32, 'gpus_per_node': 4, 'mem_per_gpu': 42338615296, 'zone': 'us-central1-a'}
Max Devices is 128
Throughput is 0.0, Search_time is 0.056725263595581055
-------------------- Save results in path ae_results/planner/test/Piper_LLAMA-3-8.json
python /root/sailor/sailor/Planner/simulations/simulator.py --sailor_path /root --trace_file /root/sailor/sailor/Planner/simulations/configs/gpu_trace_scaled.csv --basic_cluster_config_json /root/sailor/sailor/Planner/simulations/configs/basic_cluster_config.json --training_config_json /root/sailor/sailor/Planner/simulations/configs/training_config_llama_3.json   --result_dir_path ae_results/planner/test --objective throughput --planner Piper --planner_profile_file /root/sailor/sailor/Planner/baselines/Piper/profiles/LLAMA-3-8/A100-40/profile.json --simulator_profile_file /root/sailor/sailor/Planner/simulations/profiles_tmp.json
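
To make the Piper failure easier to spot across cluster sizes, here is a minimal sketch that scans the simulator output for runs reporting zero throughput. The regular expressions are assumptions based only on the log lines quoted above, and `find_empty_plans` is a hypothetical helper, not part of the Sailor repository.

```python
import re

# Patterns mirror the simulator log lines quoted above (assumption:
# the format is stable across planners; adjust if it differs).
CLUSTER_RE = re.compile(r"Max Devices is (\d+)")
RESULT_RE = re.compile(r"Throughput is ([0-9.eE+-]+), Search_time is ([0-9.eE+-]+)")


def find_empty_plans(log_text):
    """Return (max_devices, search_time) for every run with zero throughput."""
    empty = []
    devices = None  # most recent "Max Devices" value seen
    for line in log_text.splitlines():
        match = CLUSTER_RE.search(line)
        if match:
            devices = int(match.group(1))
            continue
        match = RESULT_RE.search(line)
        if match:
            throughput = float(match.group(1))
            search_time = float(match.group(2))
            if throughput == 0.0:
                empty.append((devices, search_time))
    return empty


# Excerpt copied from the Piper log above.
SAMPLE_LOG = """\
Max Devices is 32
Throughput is 0.0, Search_time is 0.22597599029541016
Max Devices is 80
Throughput is 0.0, Search_time is 0.05569314956665039
Max Devices is 128
Throughput is 0.0, Search_time is 0.056725263595581055
"""

if __name__ == "__main__":
    for devices, search_time in find_empty_plans(SAMPLE_LOG):
        print(f"{devices} GPUs: empty plan (search took {search_time:.3f}s)")
```

On the excerpt above, all three cluster sizes (32, 80, and 128 GPUs) are flagged, which matches the "Return empty result" message printed by Piper.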
