FEAT: Xavier: Share KV cache between VLLM replicas by ChengjieLi28 · Pull Request #2732 · xorbitsai/inference

Xavier: Share KV cache between VLLM replicas

Naming

The name is derived from Professor X (Charles Francis Xavier) of the Marvel Comics X-Men series. The project name starts with "X," and like Professor X, who possesses a powerful mind that controls information, it metaphorically refers to the project managing KV cache data scheduling across vllm replicas.

Purpose

When vllm runs with multiple replicas, long prompts incur a lengthy prefill time. If another replica has already computed the KV cache for a prompt's prefix, that cache can be transferred and reused directly instead of being recomputed.

Usage

Simply add the parameter enable_xavier=True when starting the vllm model.
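A sketch of a full invocation, assuming a standard xinference CLI deployment where extra `--key value` pairs are forwarded to the engine as kwargs (the model name, replica count, and host below are illustrative; only `enable_xavier` comes from this PR):

```shell
# Start the supervisor with a real IP rather than 0.0.0.0
# (see the Gloo limitation noted under "Limitations").
xinference-local -H 192.168.xx.xx

# Launch the model on the vllm engine with Xavier enabled; enable_xavier=True
# is assumed to be passed through to vllm as an engine kwarg.
xinference launch --model-engine vllm --model-name qwen2.5-instruct \
  --replica 2 --enable_xavier True
```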

Test

Use this script to generate a long prompt for the LLM (about 9k+ prompt tokens):

from faker import Faker
import pandas as pd


def gen_data(lines: int):
    faker = Faker()
    data = {
        "Name": [faker.name() for _ in range(lines)],
        "Age": [faker.random_int(min=15, max=80) for _ in range(lines)],
        "Occupation": [faker.job() for _ in range(lines)],
        "Country": [faker.country() for _ in range(lines)],
        "Email": [faker.email() for _ in range(lines)],
        "Address": [faker.address() for _ in range(lines)],
        "Phone Number": [faker.phone_number() for _ in range(lines)]
    }
    df = pd.DataFrame(data)
    markdown_table = df.to_markdown(index=False)
    return markdown_table

LONG_PROMPT = "You are a helpful assistant that recognizes the content of tables in markdown format. Here is a table as follows.\n# Table\n" + f"""
{gen_data(100)}
"""
q1 = "Question: What is the name and country of ID 23? Your answer: The name and country of ID 23 are "
q2 = "Question: What is the name and country of ID 96? Your answer: The name and country of ID 96 are "

Use LONG_PROMPT+q1 and LONG_PROMPT+q2 as the prompts, sending each query to the model separately.
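The speedup relies on the two prompts sharing LONG_PROMPT as a common prefix, so the prefill KV cache computed by one replica for the first query can be transferred and reused for the second. A standalone sketch (with a placeholder string standing in for the generated table) showing that everything up to the differing ID is shared:

```python
import os

# Placeholder prefix standing in for the ~9k-token markdown table above.
LONG_PROMPT = "You are a helpful assistant...\n# Table\n" + "| row |\n" * 100
q1 = "Question: What is the name and country of ID 23? Your answer: "
q2 = "Question: What is the name and country of ID 96? Your answer: "

# Both requests share everything up to the ID in the question, so the whole
# table portion of the prefill is reusable KV cache.
shared = os.path.commonprefix([LONG_PROMPT + q1, LONG_PROMPT + q2])
assert shared.startswith(LONG_PROMPT)
print(len(shared) - len(LONG_PROMPT))  # extra shared characters from the question text
```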

Test Results:

  • Env: two RTX 3090 Ti cards with NVLink
  • Model: Qwen2.5-instruct 7B with 2 replicas (one replica per card)

First query (no cache, full prefill) E2E time:
LONG_PROMPT+q1: ~2.96 s
Second query (with KV cache transfer) E2E time:
LONG_PROMPT+q2: ~1.33 s
That is an end-to-end speedup of roughly 2.2x (2.96 / 1.33).

Limitations

  • Rollback for xinference is not currently supported (it will be supported in the future).
  • Enabling Xavier also enables vllm's enable_prefix_caching; this requires vllm >= 0.6.5.
  • Gloo cannot resolve the 0.0.0.0 address, so when starting xinference you need to use the actual IP address, for example: xinference-local -H 192.168.xx.xx.