Python package
KV cache management for efficient attention computation during inference.
This package provides implementations for managing the key-value caches used in transformer models. The paged attention implementation enables efficient memory management by partitioning cache memory into fixed-size pages, which improves memory utilization and supports prefix caching.
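The core idea behind paged allocation can be illustrated with a short sketch. This is a minimal, hypothetical model of the technique, not the package's actual implementation: the cache is divided into fixed-size pages, and each sequence holds a list of page indices rather than one contiguous slab, so freed pages can be reused by any sequence.

```python
class PagePool:
    """Illustrative tracker for fixed-size KV cache pages (not the library's API)."""

    def __init__(self, num_pages: int, page_size: int):
        self.page_size = page_size              # tokens stored per page
        self.free_pages = list(range(num_pages))

    def pages_needed(self, num_tokens: int) -> int:
        # Ceiling division: a partially filled final page still occupies a page.
        return -(-num_tokens // self.page_size)

    def allocate(self, num_tokens: int) -> list[int]:
        n = self.pages_needed(num_tokens)
        if n > len(self.free_pages):
            raise MemoryError("KV cache pool exhausted")
        pages, self.free_pages = self.free_pages[:n], self.free_pages[n:]
        return pages

    def release(self, pages: list[int]) -> None:
        # Returned pages become available to any other sequence.
        self.free_pages.extend(pages)


pool = PagePool(num_pages=8, page_size=16)
seq_pages = pool.allocate(num_tokens=40)   # 40 tokens fit in 3 pages of 16
```

Because a sequence's pages need not be contiguous, prefix caching falls out naturally: two sequences sharing a prompt prefix can reference the same pages for that prefix.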
Functions
load_kv_manager: Load and initialize a KV cache manager.
available_port: Find an available TCP port for transfer engine communication.
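Finding a free TCP port is commonly done by binding to port 0 and letting the OS choose. The following is a generic sketch of that technique; the function name here is hypothetical and its signature is not taken from this package.

```python
import socket

def find_available_port() -> int:
    """Ask the OS for an unused TCP port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))       # port 0 -> OS assigns a free port
        return s.getsockname()[1]      # (host, port) tuple; take the port

port = find_available_port()
```

Note that a port obtained this way can in principle be claimed by another process between the check and its actual use, so callers typically bind it promptly.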
Modules
registry: KV cache manager factory functions and utilities.
Packages
paged_kv_cache: Paged attention KV cache implementation.
Classes
PagedKVCacheManager: Manager for paged KV cache with data and tensor parallelism support.
KVTransferEngine: Manages KV cache transfers between devices in distributed settings.
KVTransferEngineMetadata: Metadata for KV cache transfer engine configuration.
TransferReqData: Data structure for KV cache transfer requests.
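To give a feel for the kind of information a transfer request carries, here is a hypothetical sketch. The class name and every field below are assumptions for illustration only, not the fields of the package's TransferReqData.

```python
from dataclasses import dataclass

@dataclass
class TransferRequestSketch:
    """Hypothetical shape of a KV cache transfer request (illustrative only)."""
    request_id: str        # identifies the inference request being moved
    src_device: int        # device currently holding the cached pages
    dst_device: int        # device that should receive them
    page_indices: list[int]  # which cache pages to transfer

req = TransferRequestSketch("req-1", src_device=0, dst_device=1, page_indices=[3, 7])
```

In a distributed setting, the transfer engine would consume such requests to move the relevant pages between devices without recomputing attention for the transferred tokens.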