HAS-GPU: Efficient Hybrid Auto-scaling with Fine-grained GPU Allocation for SLO-aware Serverless Inferences
HAS-GPU is an efficient Hybrid Auto-scaling Serverless architecture with fine-grained GPU allocation for deep learning inferences. HAS-GPU proposes an agile scheduler capable of allocating GPU Streaming Multiprocessor (SM) partitions and time quotas with arbitrary granularity and enables significant vertical quota scalability at runtime. To resolve performance uncertainty introduced by massive fine-grained resource configuration spaces, we propose the Resource-aware Performance Predictor (RaPP). Furthermore, we present an adaptive hybrid auto-scaling algorithm with both horizontal and vertical scaling to ensure inference SLOs and minimize GPU costs.
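To illustrate the idea behind the hybrid auto-scaler, here is a simplified decision sketch: vertical scaling first adjusts a pod's GPU time quota, and horizontal scaling adds replicas only once the per-pod quota is exhausted. This is NOT the actual HAS-GPU algorithm; all names, step sizes, and thresholds below are hypothetical placeholders.

```python
# Illustrative sketch of a hybrid (vertical + horizontal) scaling decision.
# NOT the actual HAS-GPU algorithm: names, step sizes, and thresholds are
# hypothetical placeholders.

def scale_decision(predicted_latency_ms, slo_ms, quota, max_quota, replicas):
    """Prefer vertical scaling of the GPU time quota; fall back to
    horizontal replica scaling once the per-pod quota is exhausted."""
    if predicted_latency_ms > slo_ms:  # SLO at risk: scale up
        if quota < max_quota:
            quota = min(max_quota, round(quota + 0.1, 2))  # grow time quota
        else:
            replicas += 1  # quota capped: add a replica
    elif predicted_latency_ms < 0.5 * slo_ms:  # over-provisioned: scale down
        if quota > 0.1:
            quota = round(quota - 0.1, 2)  # shrink time quota
        elif replicas > 1:
            replicas -= 1  # remove a replica
    return quota, replicas

# A predicted latency (e.g., from a performance predictor like RaPP) above
# the SLO grows the quota first, before any new replica is added.
print(scale_decision(120, 100, 0.5, 1.0, 1))  # (0.6, 1)
```

In this sketch, scale-down is also vertical-first, which mirrors the goal of minimizing GPU cost before tearing down replicas.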
- LRZ (Leibniz Supercomputing Centre) Compute Cloud VMs (website);
- NVIDIA Tesla V100 with 16 GB memory (10 VM nodes, each with 1 GPU);
- Ubuntu 20.04 (the only tested OS);
- NVIDIA Driver 535.54.03, CUDA 12.2;
- Helm v3.17.0;
- Kubernetes 1.32.1;
The artifact consists of four components: (Part-1) direct deployment of HAS-GPU, (Part-2) deployment of other systems for comparison experiments, (Part-3) the RaPP module, and (Part-4) a full from-scratch build of HAS-GPU's hybrid auto-scaler.
Part-1 includes all scripts and procedures required to deploy the complete HAS-GPU system. Please follow the instructions and specified software versions carefully when installing the Ubuntu operating system, setting up the Kubernetes (K8s) environment, configuring NVIDIA GPU infrastructure for cloud usage, and installing other required software and dependencies.
The RaPP module's deployment is already integrated in Part-1. However, for detailed operations/information such as training and inference of RaPP models, as well as dataset construction, please refer to the separate RaPP module repository linked in Part-3.
Part-2 provides the deployment configurations for two baseline systems used in the comparative experiments described in the paper. This part is fully independent of HAS-GPU.
Part-4 describes how to build the HAS-GPU auto-scaler from scratch. This step is optional and may be skipped.
All experiments were conducted on the Compute Cloud GPU cluster at LRZ (The Leibniz Supercomputing Centre).
Before running any tests, you need to install the required Python packages:
- Install client requirements:

$ cd validation/client
$ pip install -r requirements.txt
$ cd ../../

- Install model requirements:

$ cd validation
$ bash model_download.sh
$ cd ../

- Prepare other prerequisites, such as the K8s cluster and the cloud environment for NVIDIA GPUs:
Follow the instructions in k8s_infrastructure to create the Kubernetes cluster infrastructure and configure the cloud environment for NVIDIA GPUs.
Clone the fast-gshare repository:
$ cd validation
$ ./clone_fast_gshare.sh
$ ./install_fast_gshare.sh ./FaST-GShare
$ cd ../
If you would like to collect GPU usage data, create the output directory first:
$ mkdir /data/gpu_usage
$ sudo chmod 777 /data/gpu_usage
If you prefer to use our test scripts, you may skip this part and go directly to step 4. Otherwise, run the following command to deploy the HAS-GPU system:
$ make has_deploy
- If you would like to build and push your image, refer to the Part-4 Build HAS-GPU from Scratch section (Optional).
Run tests with HAS:
$ cd validation/HAS
$ python3 has-run-test.py --output-filename <output_name> --pod-prefix <pod_name> --yaml <path_to_yaml> --namespace <namespace> --url <gateway_ip_and_port> --mode <model_mode> --requests-file <path_to_requests_file>
Example:
$ python3 has-run-test.py --output-filename test-run-resnet-1 --pod-prefix resnet --yaml yaml/resnet.yaml --namespace fast-gshare-fn --url {gateway_ip}:{gateway_port} --mode image --requests-file ../client/rps.json --model-name resnet50
Note: The test script automatically handles:
- Deploying the application and sending requests from the requests file
- Running the tests
- Cleaning up resources after tests complete
Gateway URL format: http://<gateway_ip>:<gateway_port>/function/<model_name>/predict, where the gateway port defaults to 8080.
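As a quick sanity check, the endpoint format above can be assembled with a small helper (a sketch; the IP address and model name below are placeholders, and the default port follows the gateway's default of 8080):

```python
def gateway_url(gateway_ip, model_name, gateway_port=8080):
    """Build the function endpoint URL; the gateway port defaults to 8080."""
    return f"http://{gateway_ip}:{gateway_port}/function/{model_name}/predict"

# Placeholder IP and model name, for illustration only.
print(gateway_url("10.0.0.5", "resnet50"))
# http://10.0.0.5:8080/function/resnet50/predict
```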
- Install KServe and its dependencies:

$ ./validation/install-kserve.sh

Note: You may need to manually run shell commands in this script. Refer to the KServe repository for more information.
This script will install:
- Knative Serving components
- Istio as the network layer
- Cert Manager
- KServe components
- Verify the installation:

$ kubectl get pods -n knative-serving
$ kubectl get pods -n kserve

- Run tests with KServe:
$ cd ./validation/kserve
$ python3 kserve-run-test.py --output-filename <output_name> --pod-prefix <pod_name> --yaml <path_to_yaml> --namespace <namespace> --url <service_url> --mode <model_mode> --requests-file <path_to_requests_file>

Example:
$ python3 kserve-run-test.py --output-filename test-run-1 --pod-prefix resnet --yaml yaml/resnet.yaml --namespace default --url {gateway_url}:8080 --mode image --requests-file ../client/rps.json

Note: For tests with KServe, it is recommended to use at least 2 GPUs in the system. SLO violations are very likely with only a single GPU in the cluster.
- Clone the FaST-GShare repository (if not already done):

$ cd ./validation
$ ./clone_fast_gshare.sh
$ ./install_fast_gshare.sh ./FaST-GShare
$ cd ../
$ make fastgshare_deploy

- Run tests with the FaST-GShare default scaler:
$ cd ./validation/fast-gshare
$ python3 fast-gshare-run-test.py --output-filename <output_name> --pod-prefix <pod_name> --yaml <path_to_yaml> --namespace <namespace> --url <gateway_ip_and_port> --mode <model_mode> --requests-file <path_to_requests_file> --model-name <model_name>

Example:
$ python3 fast-gshare-run-test.py --output-filename test-run-1 --pod-prefix resnet-predictor --yaml yaml/resnet.yaml --namespace fast-gshare-fn --url {gateway_ip}:{gateway_port} --mode image --requests-file ../client/rps.json --model-name resnet50

Note: The test script automatically handles:
- Deploying the application and sending requests from the requests file
- Running the tests
- Cleaning up resources after tests complete
- Writing the GPU usage and request results to CSV files
Important: You need to create the namespace first:
$ kubectl create namespace fast-gshare-fn
For detailed information on the RaPP model, including inference, training, dataset construction, and experimental results and data, please refer to the project repository: RaPP: Resource-aware Performance Prediction.
- Build HAS-GPU:

$ make docker-build docker-push IMG=<some-registry>/has-function-autoscaler:<tag>

e.g.:

$ make docker-build docker-push IMG=docker.io/kontonpuku666/has-function-autoscaler:test

Build and update the container in the K8s environment:

$ make docker-clean docker-build docker-push IMG=docker.io/kontonpuku666/has-function-autoscaler:test

- Build FaST-GShare:
Refer to the FaST-GShare repository for more information.
- Build FaST-Manager:

Refer to the FaST-Manager repository for more information.
Depending on which variant you installed, you can use the appropriate uninstallation method:
- Uninstall HAS:

$ kubectl delete -f {path_to_yaml}  # Remove the HAS Function
$ make has_undeploy  # Uninstall the controller from the cluster

- Uninstall FaST-GShare:
$ ./validation/uninstall_fast_gshare.sh ./FaST-GShare

- Uninstall KServe:
$ ./validation/uninstall_kserve.sh
The results from our experiments can be found in the validation/result/ directory. This directory contains the following subdirectories:
- HAS/: Contains results from experiments using our HAS-GPU implementation
- fast-gshare/: Contains results from experiments using the FaST-GShare default scaler
- kserve/: Contains results from experiments using the standard KServe implementation (with 2 GPUs)
Our tests include stress workloads for the following models: BERT, MobileNet_V3, ResNet50, ShuffleNet_V2, and ConvNeXt.
Each result directory contains:
- CSV files with detailed performance metrics
- Log files documenting the experiment execution
- Resource usage statistics in dedicated CSV files
The naming convention for result files follows the pattern: [test-number]-[model]-[test-type].[extension]
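A small helper can split result filenames back into their components for analysis (a sketch assuming the pattern above; the filename `1-resnet50-latency.csv` is a hypothetical example, and models whose names contain hyphens would need a more careful parser):

```python
def parse_result_filename(filename):
    """Split '[test-number]-[model]-[test-type].[extension]' into parts.
    Assumes the model name contains no hyphens (a simplification)."""
    stem, _, extension = filename.rpartition(".")
    test_number, model, test_type = stem.split("-", 2)
    return {"test": test_number, "model": model,
            "type": test_type, "ext": extension}

# Hypothetical example filename following the stated pattern.
print(parse_result_filename("1-resnet50-latency.csv"))
```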
Copyright 2024 HAS-GPU Authors, KontonGu (Jianfeng Gu, email: jianfeng.gu@tum.de), et al., Technical University of Munich, CAPS Cloud Team
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.