dilipwarrier/eval_ai
Steps for running AI eval tests

This repository contains setup instructions for the software stacks used to run and evaluate LLM models.

The instructions below are for a Linux OS.

Python virtual environment setup

If you do not have Python installed already, install it (along with the venv module) as follows.

sudo apt-get install python3 python3-venv

Create a .venv directory and activate your Python virtual environment as follows.

python3 -m venv .venv
source .venv/bin/activate

The virtual environment is now activated. You should see the text (.venv) in parentheses appear on the left of your command prompt.

Always run this activation step before executing any Python scripts in this repository; otherwise you will encounter run-time errors.
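To double-check that the environment is active, you can inspect the VIRTUAL_ENV variable that the activate script sets:

```shell
# After activation, VIRTUAL_ENV points at the .venv directory;
# if it is empty, the virtual environment is not active.
echo "VIRTUAL_ENV is: ${VIRTUAL_ENV:-not set (venv not active)}"
```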

Python package installation

Install all the Python packages from requirements.txt as follows.

python -m pip install --upgrade pip
pip install -r requirements.txt

This will install the necessary Python packages in your virtual environment. Due to the large number of packages, this will take some time.

Running vllm

Currently, this will only work on a machine that has a GPU.
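Before running the script, you can confirm that a GPU is actually visible. This is a quick check, assuming the NVIDIA driver is installed and PyTorch is pulled in by requirements.txt:

```shell
# List the GPUs visible to the NVIDIA driver
nvidia-smi -L

# Confirm that PyTorch can reach CUDA (prints True on a working GPU setup)
python3 -c "import torch; print(torch.cuda.is_available())"
```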

After activating your virtual environment, run the following.

python basic_vllm.py

This script will test a default LLM with some basic prompts.

Installing and running llm-d

First time installation and startup

Check out the v0.5.1-rc.4 release of llm-d.

tag="v0.5.1-rc.4"
git clone https://github.com/llm-d/llm-d.git
(cd llm-d && git fetch --tags -f && git checkout ${tag})

Run the shell script that installs the prerequisites for llm-d. This installs the Kubernetes command-line tool kubectl, the YAML processor yq, and other dependencies.

# Define the client-setup path variable
CLIENT_SETUP_PATH="llm-d/guides/prereq/client-setup"

# Run the script to install all dependencies
(cd "$CLIENT_SETUP_PATH" && ./install-deps.sh)

Once the installation completes, you should see the message “All tools installed successfully.”

Next, install and start Minikube to run the Kubernetes cluster locally inside a Docker runtime.

Note that this step assumes a Linux OS, an AMD64 CPU, and one or more Nvidia GPUs. The commands will be different if you have a different configuration.

curl -LO https://storage.googleapis.com/minikube/releases/latest/minikube-linux-amd64
sudo install minikube-linux-amd64 /usr/local/bin/minikube

Update Docker to recognize the Nvidia GPU on the machine and download the Docker image for llm-d. Note that this is a large download and will take time.

sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
sudo usermod -aG docker $USER && newgrp docker
docker pull ghcr.io/llm-d/llm-d-cuda:v0.5.1-rc.4

Start a Minikube node with a Docker container and access to the GPUs. Push the image you just downloaded from the host Docker engine into Minikube’s Docker node.

minikube start \
  --driver=docker \
  --container-runtime=docker \
  --gpus=all \
  --addons=nvidia-device-plugin \
  --memory=16384 \
  --disk-size=50g \
  --cpus=4

# Save the large llm-d-cuda image locally and copy it into the minikube
# Docker session.
docker save ghcr.io/llm-d/llm-d-cuda:v0.5.1-rc.4 -o llm-d-rc4.tar
eval $(minikube docker-env) && docker load -i llm-d-rc4.tar && eval $(minikube docker-env -u)
rm llm-d-rc4.tar

If everything went well, you should see a Minikube node with status “READY” when you run the following command.

kubectl get nodes

Next, set up an account on huggingface.co and generate an access token.

Save the token in a file for future use.

echo 'export HF_TOKEN=<your_Huggingface_token>' >> ~/.tokens

# Restrict permissions: 600 means Read/Write for owner only (no access for group/others)
chmod 600 ~/.tokens

# Append the sourcing command to your .bashrc
echo 'if [ -f ~/.tokens ]; then source ~/.tokens; fi' >> ~/.bashrc

# Source it now for the current session
source ~/.bashrc

Create the llm-d namespace and add a secret for the Kubernetes cluster using the token.

kubectl create namespace llm-d

export HF_TOKEN_NAME=${HF_TOKEN_NAME:-llm-d-hf-token}
kubectl create secret generic ${HF_TOKEN_NAME} \
    --from-literal="HF_TOKEN=${HF_TOKEN}" \
    --namespace "llm-d" \
    --dry-run=client -o yaml | kubectl apply -f -

Install the Istio gateway provider.

# Define the gateway path variable
GATEWAY_PATH="llm-d/guides/prereq/gateway-provider"

# 1. Run the dependency script to create the
# Custom Resource Definitions (CRDs) for the gateway
(cd "$GATEWAY_PATH" && ./install-gateway-provider-dependencies.sh)

# 2. Apply the helmfile using the path
helmfile --quiet --file "$GATEWAY_PATH/istio.helmfile.yaml" apply

If all went well, the first command below should show at least one API resource each of KIND InferencePool and InferenceObjective. The second command should show a pod with status “Running”.

kubectl api-resources | grep -iE "KIND|inference.networking.*k8s"
kubectl get pods -n istio-system

llm-d expects monitoring resources to be defined. Install the Prometheus operator’s ServiceMonitor and PodMonitor CRDs with the following commands.

kubectl apply -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/main/example/prometheus-operator-crd/monitoring.coreos.com_servicemonitors.yaml
kubectl apply -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/main/example/prometheus-operator-crd/monitoring.coreos.com_podmonitors.yaml

Confirm that two monitoring CRDs have been created.

kubectl get crds | grep monitoring

Now, we are ready to run one of the llm-d services. We will use “Inference Scheduling” as our chosen example.

The default configuration uses a large model and assumes several GPUs are available. We use an updated YAML file that switches to a Qwen2.5 model and scales down the GPU requirements.

This assumes that you have the eval_ai and llm-d repositories checked out at the same level. If not, adjust the path for the diff file accordingly.

pushd llm-d
diff_path="../eval_ai/ms-inference-scheduling-values.diff"

# Check if the hardware patch file exists
if [[ -f "$diff_path" ]]; then
    echo "Applying hardware patches..."
    patch -p1 < "$diff_path"
else
    echo "---------------------------------------------------------"
    echo "ERROR: Patch file not found at: $diff_path"
    echo "Please ensure the .diff file is present in the eval_ai directory."
    echo "---------------------------------------------------------"
fi

popd

Apply the helmfile now to load the inference engine.

# Define the path variable
INFERENCE_GUIDE_PATH="llm-d/guides/inference-scheduling"

# Bring up pods for inference scheduling
helmfile --quiet -f "$INFERENCE_GUIDE_PATH/helmfile.yaml.gotmpl" \
         apply -n llm-d

This process will take a few minutes to load the model parameters into RAM and get ready to serve prompt requests.

Ensure that there are 3 pods showing “1/1” in the READY column and “Running” in the STATUS column when you run the following command.

kubectl get pods -n llm-d
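Rather than polling kubectl get pods by hand, you can have kubectl block until the pods become ready; the 10-minute timeout below is an arbitrary choice, not a value from the llm-d guides:

```shell
# Block until every pod in the llm-d namespace reports Ready, or time out
kubectl wait --for=condition=Ready pod --all -n llm-d --timeout=600s
```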

Finally, install the httproute so that the service can be accessed from an external port.

# Define the path variable
INFERENCE_GUIDE_PATH="llm-d/guides/inference-scheduling"

# Install HTTP routes
kubectl apply -f "$INFERENCE_GUIDE_PATH/httproute.yaml" -n llm-d

Now, set up port forwarding so that requests received on port 8080 on the machine get forwarded to port 80 in the Minikube deployment.

# Start the tunnel in the background
kubectl port-forward -n llm-d svc/infra-inference-scheduling-inference-gateway-istio 8080:80 > /dev/null 2>&1 &

echo "Tunnel established from http://localhost:8080 to port 80 of Minikube"

Run a test prompt against the vLLM chat completions API.

curl -s http://localhost:8080/v1/chat/completions \
     -H "Content-Type: application/json"   -d '{
    "model": "Qwen/Qwen2.5-1.5B-Instruct",
    "messages": [
      {"role": "system", "content": "You are a concise academic historian."},
      {"role": "user", "content": "Write a 100-word essay on the Enlightenment movement."}
    ],
    "max_tokens": 300,
    "temperature": 0.5
  }' | jq -r '.choices[0].message.content'

You should see the prompt response from the LLM.
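Since the gateway serves an OpenAI-compatible API, you can also list the available models to confirm the endpoint is up before sending prompts (this assumes the gateway passes through the standard /v1/models endpoint):

```shell
# List the model IDs served behind the gateway
curl -s http://localhost:8080/v1/models | jq -r '.data[].id'
```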

At this stage, you can run split_vllm.py with the --remote option to generate LLM output.

Clean shutdown

When you’re done with your experiments, run the following to clean up and release resources.

# Kill port-forwarding process
pkill -f "kubectl port-forward"

# Scale down deployment to 0
kubectl scale deployment -n llm-d --all --replicas=0

# Destroy inference resources
INFERENCE_GUIDE_PATH="llm-d/guides/inference-scheduling"
helmfile --quiet --file "$INFERENCE_GUIDE_PATH/helmfile.yaml.gotmpl" destroy -n llm-d

# Destroy gateway resources
GATEWAY_PATH="llm-d/guides/prereq/gateway-provider"
helmfile --quiet --file "$GATEWAY_PATH/istio.helmfile.yaml" destroy

# Release minikube node
minikube stop

Clean startup

After a clean shutdown, run the following for a clean startup.

# Start a Minikube node with a Docker container and access to Nvidia GPUs
minikube start \
         --driver=docker \
         --container-runtime=docker \
         --gpus=all \
         --addons=nvidia-device-plugin \
         --memory=16384 \
         --disk-size=50g \
         --cpus=4

# Save the large llm-d-cuda image locally and copy it into the minikube
# Docker session.
docker save ghcr.io/llm-d/llm-d-cuda:v0.5.1-rc.4 -o llm-d-rc4.tar
eval $(minikube docker-env) && docker load -i llm-d-rc4.tar && eval $(minikube docker-env -u)
rm llm-d-rc4.tar

# Create a new namespace
kubectl create namespace llm-d

# Create a secret for Huggingface in the namespace
export HF_TOKEN_NAME=${HF_TOKEN_NAME:-llm-d-hf-token}
kubectl create secret generic ${HF_TOKEN_NAME} \
        --from-literal="HF_TOKEN=${HF_TOKEN}" \
        --namespace "llm-d" \
        --dry-run=client -o yaml | kubectl apply -f -

# Create the istio gateway pods
GATEWAY_PATH="llm-d/guides/prereq/gateway-provider"

# Run the dependency script to create the
# Custom Resource Definitions (CRDs) for the gateway
(cd "$GATEWAY_PATH" && ./install-gateway-provider-dependencies.sh)

# Apply the helmfile using the path
helmfile --quiet --file "$GATEWAY_PATH/istio.helmfile.yaml" apply

# Setup Prometheus as the monitoring service
kubectl apply -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/main/example/prometheus-operator-crd/monitoring.coreos.com_servicemonitors.yaml
kubectl apply -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/main/example/prometheus-operator-crd/monitoring.coreos.com_podmonitors.yaml

# Define the Inference guide path variable
INFERENCE_GUIDE_PATH="llm-d/guides/inference-scheduling"

# Bring up pods for inference scheduling
helmfile --quiet -f "$INFERENCE_GUIDE_PATH/helmfile.yaml.gotmpl" \
         apply -n llm-d

# Install HTTP routes
kubectl apply -f "$INFERENCE_GUIDE_PATH/httproute.yaml" -n llm-d

# Setup a port forwarding tunnel
kubectl port-forward -n llm-d svc/infra-inference-scheduling-inference-gateway-istio 8080:80 > /dev/null 2>&1 &

echo "Tunnel established from http://localhost:8080 to port 80 of Minikube"

Ensure that there are 3 pods showing “1/1” in the READY column and “Running” in the STATUS column when you run the following command.

kubectl get pods -n llm-d

Now, you’re ready to post prompts as HTTP requests to localhost:8080 and get LLM results.

Switching between the remote and local modes of split_vllm

split_vllm can operate in either local or remote mode. In local mode, it starts an LLM engine from scratch. In remote mode, it relies on an LLM already being served at a known host and port.

llm-d can provide that server through its inference scheduling gateway. To use it, go through the llm-d startup instructions above, then run the following.

python split_vllm.py --remote --prompt "Write a 100-word essay on the Enlightenment movement"

To switch to the local mode, stop port forwarding and scale down the llm-d deployments.

# Kill port-forwarding process
pkill -f "kubectl port-forward"

# Scale down deployment to 0
kubectl scale deployment -n llm-d --all --replicas=0

After that, run the following.

python split_vllm.py --prompt "Write a 100-word essay on the Enlightenment movement"

To switch back to remote mode, scale up the deployments again, start port forwarding, and wait for the inference scheduler pod to get to “1/1” READY state.
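The switch back to remote mode can be sketched as follows; this assumes the deployments were previously scaled to 0 and that a single replica of each is sufficient:

```shell
# Scale the llm-d deployments back up to one replica each
kubectl scale deployment -n llm-d --all --replicas=1

# Wait until the pods are Ready again (timeout is an arbitrary choice)
kubectl wait --for=condition=Ready pod --all -n llm-d --timeout=600s

# Restart the port-forwarding tunnel in the background
kubectl port-forward -n llm-d svc/infra-inference-scheduling-inference-gateway-istio 8080:80 > /dev/null 2>&1 &
```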

Reference links

The instructions in the section above are adapted from the llm-d QuickStart guide.

Promptfoo installation

Install Node.js

We use NVM (Node Version Manager) to handle Node.js versions on Ubuntu WSL.

# Update package list and install curl
sudo apt update && sudo apt install -y curl

# Download and install nvm
curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.40.1/install.sh | bash

# Load nvm into the current session
export NVM_DIR="$HOME/.nvm"
[ -s "$NVM_DIR/nvm.sh" ] && \. "$NVM_DIR/nvm.sh"

# Install the Long Term Support (LTS) version of Node.js
nvm install --lts

Install promptfoo

Next, we install promptfoo globally for CLI access.

# Install promptfoo globally using npm
npm install -g promptfoo

# Verify the installation
promptfoo --version

Ollama installation

Install Ollama and pull a model. In the example below, we pull the Llama 3.1 8B model.

sudo apt-get install zstd
curl -fsSL https://ollama.com/install.sh | sh

# Note that this is the 8B model (5 GB file size)
ollama pull llama3.1:8b
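To confirm the model works before running the test suite, you can send it a single non-interactive prompt:

```shell
# Run one prompt and exit (requires the ollama service to be running
# and the llama3.1:8b model to be pulled)
ollama run llama3.1:8b "Reply with the single word: ready"
```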

Running tests

Run the tests as follows.

promptfoo eval --config basic_local_llama3_tests.yaml

This will run the tests in the named YAML file.

Downloading Llama model files and running sample scripts

Go to https://llama.meta.com/llama-downloads/, accept the terms, and select the Llama 3.2 1B and Llama 3.1 8B models.

You will then receive an email with a URL for each model type. You must use this URL in the following steps within 48 hours, after which it expires.

List the available llama models as follows.

llama model list --show-all

Find the model ID for your model (left-most column in the table). Then, download the appropriate model as follows.

llama model download --source meta --model-id llama3.2-1B

When prompted, enter the URL that you received by email. The download of the requested model will then begin. Note that the model you download must be one you accepted the terms for earlier; otherwise, you will get a download error.

Next, download the llama3.1 model.

llama model download --source meta --model-id llama3.1-8B

The models get downloaded to ~/.llama.

Then, run the sample llama3 chat completion script as follows.

torchrun --nproc_per_node 1 llama3_sample_completion.py ~/.llama/checkpoints/Llama3.2-1B

Installing huggingface CLI

Install the Hugging Face CLI, which is used to download the models that vLLM will run.

curl -LsSf https://hf.co/cli/install.sh | bash

Next, set up an account on huggingface.co and get an access token, if you have not already done so.

Then, type the following and enter your access token when asked.

hf auth login

Download the facebook/opt-125m model as follows.

hf download --repo-type model facebook/opt-125m

If it is correctly downloaded, the following command will show you the model in the hf cache.

hf cache scan
