Experiments
- Offline Kernel-Aware Sparse Pattern Search
- MInference Benchmark Experiments
- MInference Downstream Tasks Experiments
Offline Kernel-Aware Sparse Pattern Search
You can use the following script to search for the optimal head sparse pattern:
```bash
cd experiments/infinite_bench
python run_infinitebench.py \
    --task kv_retrieval \
    --model_name_or_path gradientai/Llama-3-8B-Instruct-262k \
    --data_dir ./data \
    --output_dir ./results \
    --max_seq_length 30000 \
    --rewrite \
    --is_search \
    --start_example_id 3 \
    --topk_dims_file_path Llama_3_8B_Instruct_262k_kv_out_v32_fit_o_best_pattern.json \
    --num_eval_examples 20 --topk 1 --starting_layer 0 --attn_type minference
```
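The search ranks candidate sparse patterns per attention head and keeps the top-scoring ones (the `--topk 1` flag above); the winning configuration is then stored in the file passed via `--topk_dims_file_path`. As an illustration only, with made-up scores and pattern names, the per-head selection step amounts to:

```python
# Toy sketch of per-head pattern selection (hypothetical scores;
# not the actual MInference search code).
def search_best_patterns(head_scores, topk=1):
    """head_scores: {head_id: {pattern_name: score}} -> top-k patterns per head."""
    best = {}
    for head, scores in head_scores.items():
        ranked = sorted(scores, key=scores.get, reverse=True)
        best[head] = ranked[:topk]
    return best

# Made-up scores for two heads over three candidate pattern families.
scores = {
    0: {"a_shape": 0.61, "vertical_slash": 0.93, "block_sparse": 0.72},
    1: {"a_shape": 0.88, "vertical_slash": 0.54, "block_sparse": 0.79},
}
print(search_best_patterns(scores))  # {0: ['vertical_slash'], 1: ['a_shape']}
```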
MInference Benchmark Experiments
Note
All experiments were run on a single A100 GPU with 80GB of VRAM.
Environment parameters:
- CUDA 12.3
- Triton 2.1.0
End-to-End Benchmark
To demonstrate the efficiency of our method, we conducted end-to-end latency tests using the LLaMA-3-8B-Instruct-1M model. The prompts were trimmed to different target token numbers, and we measured the pre-filling stage latency without using KV cache.
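At its core, the measurement is a wall-clock timer around the pre-filling forward pass. A dependency-free sketch (the stand-in workload below replaces a real model call; the actual script is `experiments/benchmarks/benchmark_e2e.py`):

```python
import time

def time_prefill(forward_fn):
    """Return (output, latency_s) for one pre-filling pass.
    With CUDA, you would also call torch.cuda.synchronize() before
    reading the clock; omitted here to keep the sketch dependency-free."""
    start = time.perf_counter()
    out = forward_fn()
    return out, time.perf_counter() - start

# Stand-in workload instead of a real model forward pass.
_, latency = time_prefill(lambda: sum(i * i for i in range(100_000)))
print(f"prefill latency: {latency:.4f}s")
```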
Tip
To benefit from fast kernel implementations, we recommend installing SGLang or vLLM.

For SGLang:
```bash
uv pip install "sglang[all]>=0.4.6.post4"
```

For vLLM:
```bash
uv pip install "vllm>=0.9.0"
uv pip install git+https://github.com/vllm-project/flash-attention
```

- Download the prompt:
```bash
wget https://raw.githubusercontent.com/FranxYao/chain-of-thought-hub/main/gsm8k/lib_prompt/prompt_hardest.txt
```
- Run a single context window size test using one method:
```bash
# If the context window is greater than 700K, you need to enable kv_cache_cpu.
python experiments/benchmarks/benchmark_e2e.py --attn_type minference --context_window 1_000_000 --kv_cache_cpu
python experiments/benchmarks/benchmark_e2e.py --attn_type minference_with_dense --context_window 1_000_000 --kv_cache_cpu
python experiments/benchmarks/benchmark_e2e.py --attn_type minference --context_window 500_000
```
- Run all latency experiments using different methods:
```bash
python experiments/benchmarks/benchmark_e2e.py --run_benchmark
```
- After that, you should get end-to-end latency results (in seconds) like this:

| Context Length | FlashAttention-2 | A-Shape | InfLLM | MInference | MInference w/ SGLang |
|---|---|---|---|---|---|
| 1K | 0.54565 | 1.07110 | 2.94495 | 2.96450 | 1.25478 |
| 10K | 0.97590 | 1.18339 | 2.21052 | 2.77618 | 2.43798 |
| 50K | 8.52933 | 5.47972 | 14.63624 | 7.54537 | 6.19598 |
| 100K | 24.88319 | 10.86379 | 27.67215 | 13.98508 | 10.81580 |
| 200K | 79.39184 | 21.61490 | 55.64703 | 26.81303 | 20.62303 |
| 300K | 169.62441 | 32.44844 | 80.74326 | 41.09374 | 31.01629 |
| 500K | 456.78353 | 54.15910 | 167.91472 | 66.27691 | 51.96293 |
| 1000K | 1765.56387 | 107.85639 | 328.58551 | 179.12031 | 112.37610 |
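Reading the results, the speedup of MInference over FlashAttention-2 at each length is simply the ratio of the two latency columns (values copied from above):

```python
# Latencies in seconds, copied from the table above.
flash_attn2 = {"100K": 24.88319, "500K": 456.78353, "1000K": 1765.56387}
minference = {"100K": 13.98508, "500K": 66.27691, "1000K": 179.12031}

for n, base in flash_attn2.items():
    print(f"{n}: {base / minference[n]:.1f}x")  # 1.8x, 6.9x, 9.9x
```

The gap widens with context length because sparse attention scales sub-quadratically while dense attention does not.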
Tip
Based on our tests, a single A100 can support up to 1.8M context prompts during the pre-filling stage using LLaMA-3-8B-4M with bf16.
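The kv_cache_cpu requirement noted earlier follows from back-of-the-envelope KV-cache arithmetic. Assuming a LLaMA-3-8B-class shape (32 layers, 8 KV heads, head dimension 128) in bf16, and keeping in mind that the bf16 weights themselves take roughly 16GB, the cache alone overwhelms an 80GB A100 well before 1M tokens:

```python
def kv_cache_gb(tokens, layers=32, kv_heads=8, head_dim=128, bytes_per=2):
    """Size of the K and V caches in GB (bf16 = 2 bytes per element).
    Leading factor of 2 covers the separate K and V tensors."""
    return 2 * layers * kv_heads * head_dim * bytes_per * tokens / 1e9

for n in (100_000, 700_000, 1_000_000):
    print(f"{n:>9,} tokens -> {kv_cache_gb(n):6.1f} GB")  # 13.1, 91.8, 131.1 GB
```

This is why runs beyond roughly 700K tokens offload the KV cache to CPU memory.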
End-to-End Benchmark using vLLM
We also built an end-to-end benchmark using vLLM. You can run it with the following scripts:
- Download the prompt:
```bash
wget https://raw.githubusercontent.com/FranxYao/chain-of-thought-hub/main/gsm8k/lib_prompt/prompt_hardest.txt
```
- Run a single context window size test using one method:
```bash
python experiments/benchmarks/benchmark_e2e_vllm.py --attn_type minference --context_window 100_000
```
Here are some vLLM latency data (in seconds), for reference, on an A100:

| Context Length | FlashAttention-2 | MInference |
|---|---|---|
| 1K | 0.08062 | 3.01744 |
| 10K | 0.83215 | 2.76216 |
| 50K | 7.71675 | 7.53989 |
| 100K | 21.73080 | 14.08111 |
| 128K | 32.86252 | 18.82662 |
Please note that the current vLLM version only supports MInference and FlashAttention modes. Due to vLLM's PageAttention management of the KV cache, a single A100 can handle a maximum context window size of 130k.
Micro-Benchmark
MInference Downstream Tasks Experiments
Note
All of these experiments were run on a single A100 GPU with 80GB of VRAM. You may need to modify the commands to fit your own computing environment (e.g., changing the batch size, the max memory per GPU, the number of GPUs, etc.).
InfiniteBench
InfiniteBench is a benchmark tailored for evaluating the capabilities of language models to process, understand, and reason over super long contexts.
InfiniteBench consists of the following tasks: kv_retrieval, longbook_choice_eng, math_find, longbook_qa_chn, longbook_qa_eng, longdialogue_qa_eng, code_debug, longbook_sum_eng, number_string, passkey.
- Run InfiniteBench with MInference:
```bash
bash experiments/infinite_bench/run_infinitebench.sh gradientai/Llama-3-8B-Instruct-262k 160000 -1 minference
```
- Experimental results
| Methods | longbook_sum_eng | longbook_qa_eng | longbook_choice_eng | longdialogue_qa_eng | longbook_qa_chn | code_debug | math_find | passkey | number_string | kv_retrieval | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Full Attention | 20.2 | 12.4 | 67.3 | 6.0 | 12.9 | 22.1 | 26.6 | 100.0 | 100.0 | 14.4 | 38.2 |
| Ours | 20.5 | 12.9 | 65.9 | 7.5 | 12.5 | 22.3 | 33.1 | 100.0 | 100.0 | 12.8 | 38.8 |
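The Avg. column is the unweighted mean over the ten tasks. Reproducing it from the rows above (38.19 and 38.75, matching 38.2 and 38.8 after rounding to one decimal):

```python
# Scores copied from the InfiniteBench results table above.
full_attention = [20.2, 12.4, 67.3, 6.0, 12.9, 22.1, 26.6, 100.0, 100.0, 14.4]
ours = [20.5, 12.9, 65.9, 7.5, 12.5, 22.3, 33.1, 100.0, 100.0, 12.8]

for name, row in [("Full Attention", full_attention), ("Ours", ours)]:
    print(f"{name}: {sum(row) / len(row):.2f}")
```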
RULER
The RULER benchmark is a challenging benchmark for long-context LLMs that contains four task categories: Retrieval, Multi-hop Tracing, Aggregation, and Question Answering (QA). The retrieval tasks extend the needle-in-a-haystack (NIAH) test; Multi-hop Tracing involves resolving coreference chains; Aggregation requires identifying frequently occurring words; and QA adapts existing datasets with added distracting information to test models on extended contexts.
To run the RULER benchmark, you first need to install the requirements:
```bash
pip install Cython
pip install nemo-toolkit[all]==1.21 --no-deps
pip install -r experiments/ruler/requirements.txt
```
- Download required data files:
```bash
cd experiments/ruler/data/synthetic/json/
# download Paul Graham Essays for the needle test
python download_paulgraham_essay.py
# download SQuAD and HotpotQA for the QA test
bash download_qa_dataset.sh
# you will need nltk.download('punkt') as well
python -c "import nltk; nltk.download('punkt')"
```
- Run RULER with MInference and Llama-3-8B-Instruct-262k:
```bash
# under the experiments/ruler/ directory
bash run.sh
```
The default output dir is results/ruler; you can use ROOT_DIR in run.sh to change it.
- Experimental results
| Models | Claimed | Effective | 4K | 8K | 16K | 32K | 64K | 128K | Avg. |
|---|---|---|---|---|---|---|---|---|---|
| Full-Attention | 262K | 16K | 97.2 | 91.8 | 87.3 | 80.8 | 77.4 | 72.2 | 84.4 |
| Ours | - | 32K | 97.7 | 91.2 | 88.5 | 85.0 | 82.3 | 77.6 | 87.0 |
PPL
We use continuous 100K-token text chunks sampled from PG-19 for the perplexity evaluation.
- To run the PPL test, simply run:
bash experiments/ppl/run_ppl.sh
The result will be saved at results/long-ppl/, and the visualization will be saved as results/long-ppl/long-ppl-viz.png.
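For reference, perplexity is the exponential of the mean per-token negative log-likelihood over each chunk. A minimal computation with toy loss values (not PG-19 data):

```python
import math

def perplexity(token_nlls):
    """PPL = exp(mean per-token negative log-likelihood, in nats)."""
    return math.exp(sum(token_nlls) / len(token_nlls))

# Toy per-token NLLs; a real run averages over each 100K-token chunk.
print(round(perplexity([2.0, 2.5, 1.5, 2.0]), 3))  # 7.389
```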
- Experimental results
Needle in A Haystack
The Needle In A Haystack test is an evaluation method that randomly inserts key information into long texts to form prompts for large language models (LLMs). The test aims to detect whether large models can extract such key information from extensive texts, thereby assessing the models’ capabilities in processing and understanding long documents.
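The test grid varies both the context length and the relative depth at which the key information is placed. A minimal sketch of the insertion step (hypothetical needle text; the actual driver is run_needle.sh):

```python
def insert_needle(haystack: str, needle: str, depth: float) -> str:
    """Insert the needle at a relative depth in [0, 1] of the haystack."""
    pos = int(len(haystack) * depth)
    return haystack[:pos] + needle + haystack[pos:]

# Hypothetical needle embedded halfway into filler text.
haystack = "lorem ipsum dolor sit amet. " * 500
needle = "The magic number is 42. "
prompt = insert_needle(haystack, needle, depth=0.5)
assert needle in prompt
```

The model is then asked to retrieve the needle, and retrieval accuracy is plotted over the (length, depth) grid.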
- Run the Needle in A Haystack test:
bash experiments/needle_in_a_haystack/run_needle.sh
The results will be saved under the ./needle directory.
- Our experimental results

