Experiments
- Offline Kernel-Aware Sparse Pattern Search
- MInference Benchmark Experiments
- MInference Downstream Tasks Experiments
Offline Kernel-Aware Sparse Pattern Search
You can use the following script to search for the optimal head sparse pattern:
```bash
cd experiments/infinite_bench
python run_infinitebench.py \
    --task kv_retrieval \
    --model_name_or_path gradientai/Llama-3-8B-Instruct-262k \
    --data_dir ./data \
    --output_dir ./results \
    --max_seq_length 30000 \
    --rewrite \
    --is_search \
    --start_example_id 3 \
    --topk_dims_file_path Llama_3_8B_Instruct_262k_kv_out_v32_fit_o_best_pattern.json \
    --num_eval_examples 20 --topk 1 --starting_layer 0 --attn_type minference
```
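The search ranks candidate sparse patterns per attention head and keeps the top-scoring ones (the `--topk 1` flag above); the winning configuration is then stored in the file passed via `--topk_dims_file_path`. As an illustration only, with made-up scores and pattern names, the per-head selection step amounts to:

```python
# Toy sketch of per-head pattern selection (hypothetical scores;
# not the actual MInference search code).
def search_best_patterns(head_scores, topk=1):
    """head_scores: {head_id: {pattern_name: score}} -> top-k patterns per head."""
    best = {}
    for head, scores in head_scores.items():
        ranked = sorted(scores, key=scores.get, reverse=True)
        best[head] = ranked[:topk]
    return best

# Made-up scores for two heads over three candidate pattern families.
scores = {
    0: {"a_shape": 0.61, "vertical_slash": 0.93, "block_sparse": 0.72},
    1: {"a_shape": 0.88, "vertical_slash": 0.54, "block_sparse": 0.79},
}
print(search_best_patterns(scores))  # {0: ['vertical_slash'], 1: ['a_shape']}
```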
MInference Benchmark Experiments
Note
All experiments were run on a single A100 GPU with 80GB of VRAM.
Environment parameters:
- CUDA 12.3
- Triton 2.1.0
End-to-End Benchmark
To demonstrate the efficiency of our method, we conducted end-to-end latency tests using the LLaMA-3-8B-Instruct-1M model. The prompts were trimmed to different target token numbers, and we measured the pre-filling stage latency without using KV cache.
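At its core, the measurement is a wall-clock timer around the pre-filling forward pass. A dependency-free sketch (the stand-in workload below replaces a real model call; the actual script is `experiments/benchmarks/benchmark_e2e.py`):

```python
import time

def time_prefill(forward_fn):
    """Return (output, latency_s) for one pre-filling pass.
    With CUDA, you would also call torch.cuda.synchronize() before
    reading the clock; omitted here to keep the sketch dependency-free."""
    start = time.perf_counter()
    out = forward_fn()
    return out, time.perf_counter() - start

# Stand-in workload instead of a real model forward pass.
_, latency = time_prefill(lambda: sum(i * i for i in range(100_000)))
print(f"prefill latency: {latency:.4f}s")
```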
Tip
To benefit from fast kernel implementations, we recommend installing SGLang or vLLM.

For SGLang:
```bash
uv pip install "sglang[all]>=0.4.6.post4"
```

For vLLM:
```bash
uv pip install "vllm>=0.9.0"
uv pip install git+https://github.com/vllm-project/flash-attention
```

- Download the prompt:
```bash
wget https://raw.githubusercontent.com/FranxYao/chain-of-thought-hub/main/gsm8k/lib_prompt/prompt_hardest.txt
```
- Run a single context window size test using one method:
```bash
# If the context window is greater than 700K, you need to enable kv_cache_cpu.
python experiments/benchmarks/benchmark_e2e.py --attn_type minference --context_window 1_000_000 --kv_cache_cpu
python experiments/benchmarks/benchmark_e2e.py --attn_type minference_with_dense --context_window 1_000_000 --kv_cache_cpu
python experiments/benchmarks/benchmark_e2e.py --attn_type minference --context_window 500_000
```
- Run all latency experiments using different methods:
```bash
python experiments/benchmarks/benchmark_e2e.py --run_benchmark
```
- After that, you should get end-to-end latency results (in seconds) like this:

| Context Length | FlashAttention-2 | A-Shape | InfLLM | MInference | MInference w/ SGLang |
|---|---|---|---|---|---|
| 1K | 0.54565 | 1.07110 | 2.94495 | 2.96450 | 1.25478 |
| 10K | 0.97590 | 1.18339 | 2.21052 | 2.77618 | 2.43798 |
| 50K | 8.52933 | 5.47972 | 14.63624 | 7.54537 | 6.19598 |
| 100K | 24.88319 | 10.86379 | 27.67215 | 13.98508 | 10.81580 |
| 200K | 79.39184 | 21.61490 | 55.64703 | 26.81303 | 20.62303 |
| 300K | 169.62441 | 32.44844 | 80.74326 | 41.09374 | 31.01629 |
| 500K | 456.78353 | 54.15910 | 167.91472 | 66.27691 | 51.96293 |
| 1000K | 1765.56387 | 107.85639 | 328.58551 | 179.12031 | 112.37610 |
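Reading the results, the speedup of MInference over FlashAttention-2 at each length is simply the ratio of the two latency columns (values copied from above):

```python
# Latencies in seconds, copied from the table above.
flash_attn2 = {"100K": 24.88319, "500K": 456.78353, "1000K": 1765.56387}
minference = {"100K": 13.98508, "500K": 66.27691, "1000K": 179.12031}

for n, base in flash_attn2.items():
    print(f"{n}: {base / minference[n]:.1f}x")  # 1.8x, 6.9x, 9.9x
```

The gap widens with context length because sparse attention scales sub-quadratically while dense attention does not.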
Tip
Based on our tests, a single A100 can support up to 1.8M context prompts during the pre-filling stage using LLaMA-3-8B-4M with bf16.
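The kv_cache_cpu requirement noted earlier follows from back-of-the-envelope KV-cache arithmetic. Assuming a LLaMA-3-8B-class shape (32 layers, 8 KV heads, head dimension 128) in bf16, and keeping in mind that the bf16 weights themselves take roughly 16GB, the cache alone overwhelms an 80GB A100 well before 1M tokens:

```python
def kv_cache_gb(tokens, layers=32, kv_heads=8, head_dim=128, bytes_per=2):
    """Size of the K and V caches in GB (bf16 = 2 bytes per element).
    Leading factor of 2 covers the separate K and V tensors."""
    return 2 * layers * kv_heads * head_dim * bytes_per * tokens / 1e9

for n in (100_000, 700_000, 1_000_000):
    print(f"{n:>9,} tokens -> {kv_cache_gb(n):6.1f} GB")  # 13.1, 91.8, 131.1 GB
```

This is why runs beyond roughly 700K tokens offload the KV cache to CPU memory.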
End-to-End Benchmark using vLLM
We also built an end-to-end benchmark using vLLM. You can run it with the following scripts:
- Download the prompt:
```bash
wget https://raw.githubusercontent.com/FranxYao/chain-of-thought-hub/main/gsm8k/lib_prompt/prompt_hardest.txt
```
- Run a single context window size test using one method:
```bash
python experiments/benchmarks/benchmark_e2e_vllm.py --attn_type minference --context_window 100_000
```
Here are some vLLM latency data (in seconds), for reference, on an A100:

| Context Length | FlashAttention-2 | MInference |
|---|---|---|
| 1K | 0.08062 | 3.01744 |
| 10K | 0.83215 | 2.76216 |
| 50K | 7.71675 | 7.53989 |
| 100K | 21.73080 | 14.08111 |
| 128K | 32.86252 | 18.82662 |
Please note that the current vLLM version only supports MInference and FlashAttention modes. Due to vLLM's PageAttention management of the KV cache, a single A100 can handle a maximum context window size of 130k.
Micro-Benchmark
MInference Downstream Tasks Experiments
Note
All of these experiments were run on a single A100 GPU with 80GB of VRAM. You may need to modify the commands to fit your own computing environment (e.g., changing the batch size, the max memory per GPU, the number of GPUs, etc.).
InfiniteBench
InfiniteBench is a benchmark tailored for evaluating the capabilities of language models to process, understand, and reason over super long contexts.
InfiniteBench consists of the following tasks: kv_retrieval, longbook_choice_eng, math_find, longbook_qa_chn, longbook_qa_eng, longdialogue_qa_eng, code_debug, longbook_sum_eng, number_string, passkey.
- Run InfiniteBench with MInference:
```bash
bash experiments/infinite_bench/run_infinitebench.sh gradientai/Llama-3-8B-Instruct-262k 160000 -1 minference
```
- Experimental results
| Methods | longbook_sum_eng | longbook_qa_eng | longbook_choice_eng | longdialogue_qa_eng | longbook_qa_chn | code_debug | math_find | passkey | number_string | kv_retrieval | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Full Attention | 20.2 | 12.4 | 67.3 | 6.0 | 12.9 | 22.1 | 26.6 | 100.0 | 100.0 | 14.4 | 38.2 |
| Ours | 20.5 | 12.9 | 65.9 | 7.5 | 12.5 | 22.3 | 33.1 | 100.0 | 100.0 | 12.8 | 38.8 |
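The Avg. column is the unweighted mean over the ten tasks. Reproducing it from the rows above (38.19 and 38.75, matching 38.2 and 38.8 after rounding to one decimal):

```python
# Scores copied from the InfiniteBench results table above.
full_attention = [20.2, 12.4, 67.3, 6.0, 12.9, 22.1, 26.6, 100.0, 100.0, 14.4]
ours = [20.5, 12.9, 65.9, 7.5, 12.5, 22.3, 33.1, 100.0, 100.0, 12.8]

for name, row in [("Full Attention", full_attention), ("Ours", ours)]:
    print(f"{name}: {sum(row) / len(row):.2f}")
```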
RULER
The RULER benchmark is a challenging benchmark for long-context LLMs that contains four task categories: Retrieval, Multi-hop Tracing, Aggregation, and Question Answering (QA). The retrieval tasks extend the needle-in-a-haystack (NIAH) test; Multi-hop Tracing involves resolving coreference chains; Aggregation requires identifying frequently occurring words; and QA adapts existing datasets with added distracting information to test models on extended contexts.
To run the RULER benchmark, you first need to install the requirements:
```bash
pip install Cython
pip install nemo-toolkit[all]==1.21 --no-deps
pip install -r experiments/ruler/requirements.txt
```
- Download required data files:
```bash
cd experiments/ruler/data/synthetic/json/
# download Paul Graham Essays for the needle test
python download_paulgraham_essay.py
# download SQuAD and HotpotQA for the QA test
bash download_qa_dataset.sh
# you will need nltk.download('punkt') as well
python -c "import nltk; nltk.download('punkt')"
```
- Run RULER with MInference and Llama-3-8B-Instruct-262k:
```bash
# under the experiments/ruler/ directory
bash run.sh
```
The default output dir is results/ruler; you can use ROOT_DIR in run.sh to change it.
- Experimental results
| Models | Claimed | Effective | 4K | 8K | 16K | 32K | 64K | 128K | Avg. |
|---|---|---|---|---|---|---|---|---|---|
| Full-Attention | 262K | 16K | 97.2 | 91.8 | 87.3 | 80.8 | 77.4 | 72.2 | 84.4 |
| Ours | - | 32K | 97.7 | 91.2 | 88.5 | 85.0 | 82.3 | 77.6 | 87.0 |
PPL
We use continuous 100K-token text chunks sampled from PG-19 for the perplexity evaluation.
- To run the PPL test, simply run:
bash experiments/ppl/run_ppl.sh
The result will be saved at results/long-ppl/, and the visualization will be saved as results/long-ppl/long-ppl-viz.png.
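For reference, perplexity is the exponential of the mean per-token negative log-likelihood over each chunk. A minimal computation with toy loss values (not PG-19 data):

```python
import math

def perplexity(token_nlls):
    """PPL = exp(mean per-token negative log-likelihood, in nats)."""
    return math.exp(sum(token_nlls) / len(token_nlls))

# Toy per-token NLLs; a real run averages over each 100K-token chunk.
print(round(perplexity([2.0, 2.5, 1.5, 2.0]), 3))  # 7.389
```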
- Experimental results
Needle in A Haystack
The Needle In A Haystack test is an evaluation method that randomly inserts key information into long texts to form prompts for large language models (LLMs). The test aims to detect whether large models can extract such key information from extensive texts, thereby assessing the models’ capabilities in processing and understanding long documents.
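The test grid varies both the context length and the relative depth at which the key information is placed. A minimal sketch of the insertion step (hypothetical needle text; the actual driver is run_needle.sh):

```python
def insert_needle(haystack: str, needle: str, depth: float) -> str:
    """Insert the needle at a relative depth in [0, 1] of the haystack."""
    pos = int(len(haystack) * depth)
    return haystack[:pos] + needle + haystack[pos:]

# Hypothetical needle embedded halfway into filler text.
haystack = "lorem ipsum dolor sit amet. " * 500
needle = "The magic number is 42. "
prompt = insert_needle(haystack, needle, depth=0.5)
assert needle in prompt
```

The model is then asked to retrieve the needle, and retrieval accuracy is plotted over the (length, depth) grid.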
- Run the Needle in A Haystack test:
bash experiments/needle_in_a_haystack/run_needle.sh
The results will be saved under the ./needle directory.
- Our experimental results

