diff --git a/demos/continuous_batching/long_context/README.md b/demos/continuous_batching/long_context/README.md index 2ab0d64181..7c481e4c2f 100644 --- a/demos/continuous_batching/long_context/README.md +++ b/demos/continuous_batching/long_context/README.md @@ -22,69 +22,83 @@ Compression reduces this memory usage, enabling longer prompts or more parallel Let's demonstrate all the optimizations combined and test it with the real life scenario of sending multiple various questions in the same context. It will illustrate the gain from the prefix caching on the first token latency, improved second token latency thanks to prompt lookup and moderate memory consumption despite very long prompts and parallel execution. -Export the model Qwen/Qwen2.5-7B-Instruct-1M which has the max context length of 1 million tokens! - +Prepare the models directory: ```bash -curl https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/main/demos/common/export_models/export_model.py -o export_model.py -pip3 install -r https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/main/demos/common/export_models/requirements.txt mkdir models -python export_model.py text_generation --source_model Qwen/Qwen2.5-7B-Instruct-1M --weight-format int4 --config_file_path models/config.json --model_repository_path models ``` -Start OVMS: +::::{tab-set} +:::{tab-item} CPU +:sync: CPU ```bash -docker run -it --rm -u $(id -u) -p 8000:8000 -v $(pwd)/models/:/models:rw openvino/model_server:latest --rest_port 8000 --source_model Qwen/Qwen2.5-7B-Instruct-1M --model_repository_path /models --task text_generation --enable_prefix_caching true --kv_cache_precision u8 --target_device CPU +docker run --user $(id -u):$(id -g) -d --rm -v $(pwd)/models:/models:rw -p 8000:8000 openvino/model_server:latest --rest_port 8000 --model_repository_path /models --source_model OpenVINO/gpt-oss-20b-int4-ov --tool_parser gptoss --reasoning_parser gptoss --task text_generation --enable_prefix_caching
true ``` - -## Dataset for experiments - -To test the performance using vllm benchmarking script, let's create a custom dataset with long shared context and a set of questions in each request. That way we can create a dataset with identical very long context with different queries related to the context. That is a common scenario for RAG applications which generates response based on a complete knowledge base. To make this experiment similar to real live, the context is not synthetic but build with the content of Don Quixote story with 10 different questions related to the story. Because the context is reused, it is a perfect case for benefitting from prefix caching. - +::: +:::{tab-item} GPU +:sync: GPU ```bash -curl https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/main/demos/continuous_batching/long_context/custom_dataset.py -o custom_dataset.py -pip install requests transformers -python custom_dataset.py --limit_context_tokens 50000 +docker run --user $(id -u):$(id -g) --device /dev/dri --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) -d --rm -v $(pwd)/models:/models:rw -p 8000:8000 openvino/model_server:latest-gpu --rest_port 8000 --model_repository_path /models --source_model OpenVINO/gpt-oss-20b-int4-ov --tool_parser gptoss --reasoning_parser gptoss --task text_generation --enable_prefix_caching true --target_device GPU ``` - -It will create a file called `dataset.jsonl` with 10 requests of shared context body limited to 50000 tokens. +::: +:::{tab-item} NPU +:sync: NPU +```bash +docker run --user $(id -u):$(id -g) --device /dev/accel -d --rm -v $(pwd)/models:/models:rw -p 8000:8000 openvino/model_server:latest-gpu --rest_port 8000 --model_repository_path /models --source_model OpenVINO/Qwen3-8B-int4-cw-ov --target_device NPU --task text_generation --enable_prefix_caching true --max_prompt_len 16000 --tool_parser hermes3 --plugin_config "{\"NPUW_LLM_PREFILL_ATTENTION_HINT\": \"PYRAMID\"}" +``` +**Note:** It's recommended to set `--max_prompt_len` to the lowest value that fits your longest expected prompt.
This will improve performance, but limits the number of tokens the model will accept. +::: +:::: ## Testing performance -Let's check the performance +Using the `vllm` benchmark tool, it's possible to check the performance of the model with the desired context length. The prefix repetition parameters also make it possible to measure the performance benefit of prefix caching. ```bash -git clone --branch v0.9.1 --depth 1 https://github.com/vllm-project/vllm -cd vllm -pip3 install -r requirements/cpu.txt . --extra-index-url https://download.pytorch.org/whl/cpu -python benchmarks/benchmark_serving.py --host localhost --port 8000 --endpoint /v3/chat/completions --backend openai-chat --model Qwen/Qwen2.5-7B-Instruct-1M --dataset-name custom --dataset-path ../dataset.jsonl --num-prompts 10 --max-concurrency 1 --custom-output-len 50 -============ Serving Benchmark Result ============ -Successful requests: 10 -Benchmark duration (s): 31.44 -Total input tokens: 500414 -Total generated tokens: 500 -Request throughput (req/s): 0.32 -Output token throughput (tok/s): 15.91 -Total Token throughput (tok/s): 15934.81 ----------------Time to First Token---------------- -Mean TTFT (ms): 1551.46 -Median TTFT (ms): 518.46 -P99 TTFT (ms): 3260.48 +pip install vllm --extra-index-url https://wheels.vllm.ai/nightly/cpu +vllm bench serve --backend openai --base-url http://localhost:8000/ --endpoint v3/completions --model OpenVINO/gpt-oss-20b-int4-ov --tokenizer openai/gpt-oss-20b --prefix-repetition-prefix-len 50000 --prefix-repetition-suffix-len 10 --prefix-repetition-output-len 20 --prefix-repetition-num-prefixes 1 --num-prompts 2 --max_concurrency 1 --dataset-name prefix_repetition --num-warmups 1 ``` -The results shown above, despite very long context, have much lower TTFT latency with prefix caching. As long as the beginning of the request prompt is reused, KV cache can be also reused to speed up prompt processing.
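The benchmark above reports aggregate numbers; the prefix caching effect can also be sanity-checked by hand by sending requests that share the same long context but ask different questions — after the first request warms the cache, prompts with a repeated prefix should return the first token much faster. A minimal sketch (the endpoint and model name follow the CPU deployment above; `build_request` and `timed_completion` are illustrative helpers, not part of OVMS or vLLM):

```python
import json
import os.path
import time
import urllib.request

def build_request(context: str, question: str, model: str) -> dict:
    """Build an OpenAI-style chat completions payload with a shared context prefix."""
    return {
        "model": model,
        "messages": [{"role": "user",
                      "content": f"CONTEXT: {context}\nQUESTION: {question}"}],
        "max_tokens": 20,
    }

def timed_completion(url: str, payload: dict) -> float:
    """POST the payload and return wall-clock latency in seconds (server must be running)."""
    req = urllib.request.Request(url, data=json.dumps(payload).encode("utf-8"),
                                 headers={"Content-Type": "application/json"})
    start = time.perf_counter()
    with urllib.request.urlopen(req) as resp:
        resp.read()
    return time.perf_counter() - start

context = "A very long shared document body. " * 1500  # stands in for a real knowledge base
questions = ["Summarize the text.", "What is the main theme?"]
payloads = [build_request(context, q, "OpenVINO/gpt-oss-20b-int4-ov") for q in questions]

# Both prompts share the whole context, so their KV cache blocks can be reused.
shared = os.path.commonprefix([p["messages"][0]["content"] for p in payloads])
print(f"shared prefix length: {len(shared)} characters")

# With the server running, compare latencies (the second call should be much faster):
# for p in payloads:
#     print(timed_completion("http://localhost:8000/v3/chat/completions", p))
```

This is the same idea the `prefix_repetition` dataset automates: many requests reuse one prefix, so only the short unique suffix has to be prefilled.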
- ## Performance Comparison Table +::::{tab-set} +:::{tab-item} CPU +Platform: Intel(R) Xeon(R) Platinum 8480+ +| Context Length (tokens) | TTFT No Caching (ms) | TTFT Prefix Caching (ms) | KV Cache Usage (GB) | +|------------------------|------------------|---------------------|-----------------------| +| 1,000 | 4 420 | 190.84 | 0.03 | +| 2,500 | 9 627 | 272.56 | 0.07 | +| 5,000 | 17 736 | 369.66 | 0.1 | +| 10,000 | 36 684 | 680.28 | 0.2 | +| 25,000 | 100 807 | 1570.07 | 0.6 | +| 50,000 | 287 788 | 5133.87 | 1.3 | + +::: +:::{tab-item} iGPU +:sync: GPU +Platform: Intel(R) Core(TM) Ultra 5 338H | Context Length (tokens) | TTFT No Caching (ms) | TTFT Prefix Caching (ms) | KV Cache Usage (GB) | -|------------------------|----------------------|--------------------------|---------------------| -| 1,000 | 785 | 141 | 0.1 | -| 5,000 | 4160 | 172 | 0.2 | -| 10,000 | 9570 | 217 | 0.4 | -| 50,000 | 152,589 | 795 | 1.5 | -| 100,000 | 624,713 | 1097 | 3.1 | -| 200,000 | | 5406 | 6.2 | - -The results show that the cache usage grows linearly with the context length. -First token generation without prefix caching is growing significantly with the prompt size. +|------------------------|------------------|---------------------|-----------------------| +| 1,000 | 1 729 | 279.75 | 0.03 | +| 2,500 | 3 752 | 367.02 | 0.07 | +| 5,000 | 7 215 | 364.82 | 0.1 | +| 10,000 | 17 380 | 599.86 | 0.2 | +| 25,000 | 59 201 | 991.01 | 0.6 | +| 50,000 | 160 138 | 2305.10 | 1.3 | + +::: +:::{tab-item} NPU +:sync: NPU +Platform: Intel(R) Core(TM) Ultra 5 338H +| Context Length (tokens) | TTFT No Caching (ms) | TTFT Prefix Caching (ms) | +|------------------------|------------------|---------------------| +| 500 | 1521.75 | 1489.22 | +| 1,000 | 3061.18 | 1729.39 | +| 2,000 | 3072.92 | 1806.56 | +| 4,000 | 6697.62 | 2421.26 | +| 8,000 | 16046.92 | 3232.11 | +| 16,000 | 53378.22 | 6585.93 | +::: +:::: + +The results show that the cache usage grows linearly with the context length.
Prefix caching is very effective in reducing the first token generation making the long context calls practical even on slower HW. ## Testing accuracy @@ -92,23 +106,10 @@ Prefix caching is very effective in reducing the first token generation making t Testing accuracy for use cases with long context can be done via [lm-eval_harness](../accuracy/README.md). The only difference is that the configured testing task should include a relevant dataset. -For example: -``` -lm-eval --model local-chat-completions --tasks longbench_gov_report --model_args model=Qwen/Qwen2.5-7B-Instruct-1M,base_url=http://localhost:8000/v3/chat/completions,num_concurrent=10,tokenized_requests=False,timeout=3000 --verbosity DEBUG --seed 1 --apply_chat_template -``` - -Such experiment can confirm the impact on accuracy from the model quantization and KV cache compression. - -## Cache Precision Comparison - -| Cache Precision | Plugin Config | Accuracy (longbench_gov_report, concurrency 50) | Max Cache Usage (GB) | Duration (s for 100 requests) | -|-----------------|--------------|-----------------------------------------------|----------------------|-------------------------------| -| INT8 | "KV_CACHE_PRECISION":"u8" | 0.3374 | 11 | 41m6.993s | -| BF16 | "KV_CACHE_PRECISION":"bf16" | 0.3297 | 20 | 40m15.359s | -| FP32 | "KV_CACHE_PRECISION":"FP32","EXECUTION_MODE_HINT": "ACCURACY" | 0.331 | 37 | 105m15.876s | +## Cache Precision -The results in an experiment captured on Xeon Gen4 server show that KV cache compression has minimal impact on accuracy and significantly reduces memory consumption. -Slower execution with FP32 precision is a result of disabled AMX acceleration. +KV cache compression has minimal impact on accuracy and significantly reduces memory consumption and benchmark time. +It's recommended to use the default KV cache precision, which is INT8.
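The memory side of this trade-off can be reasoned about with a back-of-the-envelope formula: the KV cache holds keys and values for every layer, KV head and token, so its size is roughly `2 * num_layers * num_kv_heads * head_dim * context_len * bytes_per_element`. A sketch with illustrative dimensions (hypothetical values, not the exact architecture of the models used above):

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_element: int) -> int:
    """Approximate KV cache footprint: keys + values for every layer, head and token."""
    return 2 * num_layers * num_kv_heads * head_dim * context_len * bytes_per_element

# Hypothetical model geometry, for illustration only.
layers, kv_heads, head_dim = 32, 8, 128
for precision, nbytes in (("int8", 1), ("bf16", 2)):
    gb = kv_cache_bytes(layers, kv_heads, head_dim, 50_000, nbytes) / 1e9
    print(f"{precision}: {gb:.2f} GB for a 50k-token context")
```

Doubling the context doubles the footprint (consistent with the roughly linear growth of cache usage in the tables above), and halving `bytes_per_element` (BF16 to INT8) halves it, which is why the default INT8 precision roughly doubles the context length or concurrency that fits in a fixed memory budget.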
## Recommendations @@ -116,7 +117,7 @@ Enable prefix caching feature with `--enable_prefix_caching` parameter when you Use KV cache compression as INT8 which is the default setting. -Set the KV cache size via `--cache_size` parameter based on the available memory, expected concurrency and context length. It will improve the performance. +Set the KV cache size via the `--cache_size` parameter based on the available memory, expected concurrency and context length, or use the default value (`0`) to make it dynamic. It will improve the performance. **Note** You can force reducing the concurrency on the server using a parameter `--rest_workers` which by default allows number of connections the same like number of CPU cores. Alternatively the limit can be set on the model level in `--max_num_seqs`. diff --git a/demos/continuous_batching/long_context/custom_dataset.py b/demos/continuous_batching/long_context/custom_dataset.py deleted file mode 100644 index bc166d8978..0000000000 --- a/demos/continuous_batching/long_context/custom_dataset.py +++ /dev/null @@ -1,99 +0,0 @@ -# -# Copyright (c) 2025 Intel Corporation -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License.
-# - - -# This script generates a dataset of long context examples for performance evaluation -import os -import json -import requests -from transformers import AutoTokenizer -import argparse - -# function to download a file from a URL and convert it to text -def download_file(url): - output_path = os.path.basename(url) - headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"} - response = requests.get(url, headers=headers) - if response.status_code == 200: - with open(output_path, "wb") as file: - file.write(response.content) - print(f"File downloaded and saved as {output_path}") - else: - print(f"Failed to download file. Status code: {response.status_code}") - if url.endswith(".txt"): - with open(output_path, "r", encoding="utf-8") as file: - text = file.read() - print(f"Text file read successfully. Length of text: {len(text)} characters") - return text - elif url.endswith(".pdf"): - with open(output_path, "rb") as file: - pdf = PyPDF2.PdfReader(file) - text = "" - for page in pdf.pages: - text += page.extract_text() - return text - else: - raise ValueError("Unsupported file type. 
Only .txt and .pdf files are supported.") - -parser = argparse.ArgumentParser(description="Generate a dataset of long context examples.") -parser.add_argument("--file_url", type=str, default="https://ota.bodleian.ox.ac.uk/repository/xmlui/bitstream/handle/20.500.12024/2011/donquix-2011.txt", help="URL of the file to download") -parser.add_argument("--model_name", type=str, default="Qwen/Qwen2.5-7B-Instruct-1M", help="Model name for the tokenizer") -parser.add_argument("--limit_context_tokens", type=int, default=50000, help="Maximum number of tokens to use for the context") -args = parser.parse_args() - -file_url = args.file_url -model_name = args.model_name -limit_context_tokens = args.limit_context_tokens - -text = download_file(file_url) - -# Initialize the tokenizer -tokenizer = AutoTokenizer.from_pretrained(model_name) - -# Tokenize the text -tokens = tokenizer(text)['input_ids'] -print(f"Number of tokens: {len(tokens)}") - -if limit_context_tokens is not None: - if len(tokens) > limit_context_tokens: - tokens = tokens[:limit_context_tokens] - print(f"Tokens truncated to {limit_context_tokens} tokens") - text = tokenizer.decode(tokens) - -list_of_questions = [ - "Summarize the text in few sentences.", - "What are the main points discussed in the text?", - "What is the main theme of the text?", - "What are the key arguments presented in the text?", - "Who is the main character in the text?", - "Describe shortly the main character.", - "What was the most funny part of the text?", - "What was the most sad part of the text?", - "What was the most interesting part of the text?", - "Summarize shortly the first paragraph of the text.", -] -dataset = "" -for question in list_of_questions: - prompt = f"For the given CONTEXT answer the QUESTION. 
\n CONTEXT: {text}\n QUESTION {question}\n" - item = {"prompt": prompt } - dataset += json.dumps(item, ensure_ascii=False) + "\n" - - -# Save the dataset to a JSON file -output_file = "dataset.jsonl" -with open(output_file, "w", encoding="utf-8") as file: - file.write(dataset) -print(f"Dataset saved to {output_file}")