FEAT: Xavier: Share KV cache between VLLM replicas by ChengjieLi28 · Pull Request #2732 · xorbitsai/inference
Naming
The name is derived from Professor X (Charles Francis Xavier) of the Marvel Comics X-Men series. The project name starts with "X," and, like Professor X, whose powerful mind controls information, it metaphorically refers to the project managing data scheduling in vllm.
Purpose
When running vllm with multiple replicas, long prompts can take a long time to prefill. If another replica has already computed the KV cache for a prompt, that result can be transferred and reused directly instead of being recomputed.
Usage
Simply add the parameter enable_xavier=True when launching a model with the vllm engine.
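As a rough sketch, the launch could look like the following via xinference's Python client. Only the enable_xavier=True flag comes from this PR; the endpoint address, model name, and the other launch parameters are illustrative placeholders, so adjust them to your deployment.

```python
# Hypothetical launch parameters; enable_xavier is the new flag from this PR,
# everything else is a placeholder for your own setup.
launch_kwargs = {
    "model_name": "qwen2.5-instruct",
    "model_engine": "vllm",
    "model_size_in_billions": 7,
    "replica": 2,             # Xavier shares the KV cache across these replicas
    "enable_xavier": True,    # turn on KV cache sharing between replicas
}

# With a running xinference server, the actual call would look roughly like:
# from xinference.client import Client
# client = Client("http://192.168.xx.xx:9997")
# model_uid = client.launch_model(**launch_kwargs)
```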
Test
Use this script to generate a long prompt for the LLM (roughly 9k+ prompt tokens):
```python
from faker import Faker
import pandas as pd


def gen_data(lines: int):
    faker = Faker()
    data = {
        "Name": [faker.name() for _ in range(lines)],
        "Age": [faker.random_int(min=15, max=80) for _ in range(lines)],
        "Occupation": [faker.job() for _ in range(lines)],
        "Country": [faker.country() for _ in range(lines)],
        "Email": [faker.email() for _ in range(lines)],
        "Address": [faker.address() for _ in range(lines)],
        "Phone Number": [faker.phone_number() for _ in range(lines)],
    }
    df = pd.DataFrame(data)
    markdown_table = df.to_markdown(index=False)
    return markdown_table


LONG_PROMPT = (
    "You are a helpful assistant that recognizes the content of tables "
    "in markdown format. Here is a table as follows.\n# Table\n"
    + f"""
{gen_data(100)}
"""
)

q1 = "Question: What is the name and country of ID 23? Your answer: The name and country of ID 23 are "
q2 = "Question: What is the name and country of ID 96? Your answer: The name and country of ID 96 are "
```
Use LONG_PROMPT + q1 and LONG_PROMPT + q2 as the prompts, sending each query to the model separately.
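To measure the end-to-end time of each query, you can wrap whichever client call you use in a small timer. The harness below is generic; the commented-out request shows one plausible way to hit xinference's OpenAI-compatible endpoint, where the URL and model uid are placeholders, not values from this PR.

```python
import time


def timed_query(send_fn, prompt: str):
    """Return (response, elapsed_seconds) for a single end-to-end query.

    `send_fn` is whatever client call reaches the model, e.g. a request to
    xinference's OpenAI-compatible /v1 endpoint (URL and model uid below
    are placeholders):

        # import openai
        # client = openai.OpenAI(base_url="http://192.168.xx.xx:9997/v1",
        #                        api_key="not-needed")
        # send_fn = lambda p: client.completions.create(
        #     model="qwen2.5-instruct", prompt=p, max_tokens=32)
    """
    start = time.perf_counter()
    response = send_fn(prompt)
    return response, time.perf_counter() - start
```

Time LONG_PROMPT + q1 and LONG_PROMPT + q2 separately and compare the two elapsed values.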
Test results:
- Env: two RTX 3090 Ti GPUs with NVLink
- Model: Qwen2.5-instruct 7B with 2 replicas (one replica per card)
- First query (no cache, full prefill), E2E time for LONG_PROMPT + q1: ~2.96 s
- Second query (KV cache transferred), E2E time for LONG_PROMPT + q2: ~1.33 s
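For context, the two measurements above imply roughly a 2.2x end-to-end speedup when the KV cache is transferred instead of recomputed:

```python
# Reported E2E times from the test above (seconds).
prefill_only = 2.96    # first query: no cache, full prefill
with_transfer = 1.33   # second query: KV cache transferred from the other replica

speedup = prefill_only / with_transfer
print(f"{speedup:.2f}x")  # prints "2.23x"
```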
Limitations
- Rollback is not currently supported in xinference (it will be supported in the future).
- Enabling Xavier also enables vllm's enable_prefix_caching; the vllm version needs to be >= 0.6.5.
- Gloo cannot recognize the 0.0.0.0 address, so when starting xinference you need to use the actual IP address, for example: xinference-local -H 192.168.xx.xx.