Zen Engine
Native AI inference engine for Zen models - Blazingly fast LLM, embedding, and multimodal inference.
| GitHub | Zen AI | HuggingFace |
Zen Engine is a high-performance, Rust-native inference engine powering the entire Zen model family:
- 🚀 Blazingly Fast: Native Rust implementation for maximum performance
- 🔮 All Zen Models: Optimized for zen-nano, zen-eco, zen-agent, zen-director, zen-musician
- 🎯 Multimodal Native: text↔text, text+vision↔text, text→speech, text→music, text→video
- 🌐 APIs: Rust, Python, OpenAI HTTP server (Chat Completions, Embeddings API)
- 🔗 MCP Support: Connect to external tools and services automatically
- ⚡ Performance: ISQ, PagedAttention, FlashAttention, MLX support for Apple Silicon
- 🔧 Default Port: 3690 (Zen model serving)
Quick Start 🚀
Installation
# Clone the repository git clone https://github.com/zenlm/zen-engine.git cd zen-engine # Build the engine cargo build --release # Or install via cargo cargo install --git https://github.com/zenlm/zen-engine
Running Zen Models
# Start zen-nano (0.6B) zen-engine serve --model zenlm/zen-nano-0.6b --port 3690 # Start zen-eco instruct (4B) zen-engine serve --model zenlm/zen-eco-4b-instruct --port 3690 # Start zen-agent with tools (4B) zen-engine serve --model zenlm/zen-agent-4b --enable-mcp --port 3690 # Start zen-musician (7B) for music generation zen-engine serve --model zenlm/zen-musician-7b --port 3690 # With MLX on Apple Silicon zen-engine serve --model zenlm/zen-nano-0.6b --backend mlx # With GGUF quantization zen-engine serve --model zenlm/zen-eco-4b-instruct-Q4_K_M.gguf
Chat Completions API
# Chat with zen-eco curl -X POST http://localhost:3690/v1/chat/completions \ -H "Content-Type: application/json" \ -d '{ "model": "zen-eco-4b-instruct", "messages": [ {"role": "user", "content": "Explain quantum computing"} ], "temperature": 0.7, "max_tokens": 500 }'
Embeddings API (Zen Embedding)
# Generate embeddings with Zen Embedding curl -X POST http://localhost:3690/v1/embeddings \ -H "Content-Type: application/json" \ -d '{ "model": "zen-embedding-8b", "input": "Zen models are fast and efficient" }'
Supported Zen Models
Zen Model Family
zen-nano (0.6B)
- Architecture: Zen MoDE 0.6B
- Formats: PyTorch, MLX, GGUF (Q2_K, Q4_K_M, Q8_0, F16)
- Use Case: Edge devices, mobile, low-latency inference
- Memory: 1-2GB
zen-eco (4B)
- Variants: instruct, thinking, agent
- Architecture: Zen MoDE 4B
- Formats: PyTorch, MLX, GGUF (Q2_K, Q4_K_M, Q5_K_M, Q8_0, F16)
- Use Case: General purpose, balanced performance
- Memory: 4-8GB
zen-agent (4B)
- Architecture: Zen MoDE 4B + tool-calling
- Dataset: Salesforce/xlam-function-calling-60k
- Formats: PyTorch, MLX, GGUF
- Use Case: Tool use, function calling, API integration
- Memory: 4-8GB
zen-director (5B)
- Architecture: Wan2.2-TI2V-5B
- Modality: Text/Image → Video
- Formats: PyTorch (Diffusion)
- Use Case: Video generation
- Memory: 12-16GB
zen-musician (7B)
- Architecture: YuE-s1-7B
- Modality: Lyrics → Music (vocals + accompaniment)
- Formats: PyTorch, LoRA adapters
- Use Case: Music generation
- Memory: 16-24GB
Embedding Models (Optimized)
- zen-embedding-8b: 4096 dims, #1 on MTEB multilingual
- zen-embedding-4b: 2048 dims, balanced performance
- zen-embedding-0.6b: 1024 dims, lightweight
Reranker Models
- zen-reranker-4b: Superior reranking quality
- zen-reranker-0.6b: Lightweight reranking
Model Management
Pull Models
# Pull from HuggingFace zen-engine pull zenlm/zen-nano-0.6b --source huggingface # Pull GGUF format zen-engine pull zenlm/zen-eco-4b-instruct-Q4_K_M.gguf # Pull MLX format (Apple Silicon) zen-engine pull zenlm/zen-nano-0.6b --source mlx # List downloaded models zen-engine list # Delete a model zen-engine delete zen-nano-0.6b
Performance Features
Apple Silicon Optimization (MLX)
- 44K+ tokens/second on M-series chips
- Optimized for zen-nano and zen-eco
- Reduced memory usage
- Native 4.5-bit quantization
CUDA Support
- Multi-GPU inference
- Flash Attention v2
- Tensor parallelism
- Optimized for RTX 3060+
Quantization Support
- GGUF: Q2_K, Q3_K_M, Q4_K_M, Q5_K_M, Q6_K, Q8_0, F16, F32
- Dynamic: Runtime quantization
- ISQ: In-Situ Quantization
- BitDelta: Efficient delta compression
- AWQ/GPTQ: Post-training quantization
API Compatibility
OpenAI Compatible
/v1/chat/completions- Chat with zen models/v1/embeddings- Generate embeddings/v1/models- List available models- Drop-in replacement for OpenAI SDK
Ollama Compatible (Port 11434)
For backward compatibility with Ollama clients:
zen-engine serve --ollama-compat --port 11434
Integration Examples
Python SDK
from zen_engine import ZenEngine # Initialize engine engine = ZenEngine( model="zenlm/zen-eco-4b-instruct", device="cuda", quantization="Q4_K_M" ) # Generate text response = engine.generate( "Explain the theory of relativity", max_tokens=500, temperature=0.7 ) print(response) # Generate embeddings embeddings = engine.embed([ "First sentence", "Second sentence" ])
Rust API
use zen_engine::{ZenEngine, ModelConfig}; #[tokio::main] async fn main() -> anyhow::Result<()> { // Load model let engine = ZenEngine::new(ModelConfig { model: "zenlm/zen-nano-0.6b".to_string(), device: "cuda:0".to_string(), quantization: Some("Q4_K_M".to_string()), ..Default::default() }).await?; // Generate let response = engine.generate( "Write a haiku about AI", None ).await?; println!("{}", response); Ok(()) }
OpenAI SDK (Drop-in)
from openai import OpenAI # Point to zen-engine client = OpenAI( base_url="http://localhost:3690/v1", api_key="not-needed" ) # Use any zen model response = client.chat.completions.create( model="zen-eco-4b-instruct", messages=[ {"role": "user", "content": "Hello!"} ] ) print(response.choices[0].message.content)
Performance Benchmarks
Inference Speed (tokens/second)
| Model | Format | Device | Speed |
|---|---|---|---|
| zen-nano-0.6b | MLX | M3 Max | 44,000 |
| zen-nano-0.6b | GGUF Q4 | M3 Max | 32,000 |
| zen-nano-0.6b | GGUF Q4 | RTX 4090 | 28,000 |
| zen-eco-4b | MLX | M3 Max | 18,000 |
| zen-eco-4b | GGUF Q4 | RTX 4090 | 12,000 |
| zen-eco-4b | GGUF Q4 | RTX 3060 | 5,500 |
Memory Usage
| Model | Format | VRAM/RAM |
|---|---|---|
| zen-nano-0.6b | F16 | 1.2GB |
| zen-nano-0.6b | Q4_K_M | 0.4GB |
| zen-eco-4b | F16 | 8GB |
| zen-eco-4b | Q4_K_M | 2.3GB |
| zen-eco-4b | Q2_K | 1.6GB |
Development
Building from Source
# Clone repository git clone https://github.com/zenlm/zen-engine.git cd zen-engine # Basic build cargo build --release # With CUDA support (Linux) cargo build --release --features "cuda flash-attn cudnn" # With Metal support (macOS) cargo build --release --features metal # With all features cargo build --release --features "cuda metal flash-attn"
Running Tests
# Run all tests cargo test --workspace # Run specific package tests cargo test -p zen-engine-core cargo test -p zen-engine-quant # Run benchmarks cargo bench
Code Quality
# Format code cargo fmt --all # Run clippy cargo clippy --workspace --tests -- -D warnings # Check formatting cargo fmt --all -- --check
Architecture
Workspace Structure
zen-engine-core/- Core inference engine, Zen model implementationszen-engine-server/- CLI binary and HTTP serverzen-engine-python/- Python bindings (PyO3)zen-engine-vision/- Vision model support (zen-director)zen-engine-audio/- Audio processing (zen-musician)zen-engine-quant/- Quantization (GGUF, AWQ, GPTQ, BitDelta)zen-engine-paged-attn/- PagedAttention implementationzen-engine-mcp/- Model Context Protocol (zen-agent)
Key Features
- Unified Model Loading: All Zen models load through standardized pipelines
- Multi-Backend: CUDA, Metal, CPU with automatic device selection
- Zero-Copy: Efficient memory sharing between Rust and Python
- Async First: Tokio-based async runtime for concurrent requests
- Type Safety: Strongly typed Rust with Python bindings
Configuration
Environment Variables
# Server configuration export ZEN_ENGINE_PORT=3690 export ZEN_ENGINE_HOST="0.0.0.0" # Model paths export ZEN_MODELS_PATH="/path/to/models" export ZEN_CACHE_DIR="/path/to/cache" # Device configuration export ZEN_DEVICE="cuda:0" export ZEN_DEVICE_MAP="auto" # Performance export ZEN_BATCH_SIZE=32 export ZEN_MAX_CONCURRENT=10
Config File (zen-engine.toml)
[server] port = 3690 host = "0.0.0.0" max_concurrent = 10 [models] cache_dir = "/path/to/cache" default_quantization = "Q4_K_M" [cuda] devices = [0, 1] flash_attn = true [mlx] enabled = true
Deployment
Docker
# Build image docker build -t zen-engine . # Run container docker run -p 3690:3690 \ -v ./models:/models \ zen-engine serve --model zenlm/zen-eco-4b-instruct
Kubernetes
apiVersion: apps/v1 kind: Deployment metadata: name: zen-engine spec: replicas: 3 template: spec: containers: - name: zen-engine image: zenlm/zen-engine:latest ports: - containerPort: 3690 env: - name: ZEN_DEVICE value: "cuda:0" resources: limits: nvidia.com/gpu: 1
Troubleshooting
Common Issues
Out of Memory
# Use smaller quantization zen-engine serve --model zen-eco-4b-Q2_K.gguf # Enable CPU offload zen-engine serve --model zen-eco-4b --cpu-offload
Slow Inference
# Enable FlashAttention zen-engine serve --model zen-eco-4b --flash-attn # Use MLX on Apple Silicon zen-engine serve --model zen-eco-4b --backend mlx
CUDA Errors
# Check CUDA version nvidia-smi # Rebuild with correct CUDA cargo clean cargo build --release --features "cuda flash-attn"
Contributing
We welcome contributions! See CONTRIBUTING.md for guidelines.
Credits
Zen Engine is built on mistral.rs by Eric Buehler. We thank the mistral.rs team for their excellent work on Rust-native LLM inference.
License
Apache 2.0 License - see LICENSE for details.
Citation
@misc{zenengine2025, title={Zen Engine: High-Performance Inference for Zen Models}, author={Zen AI Team}, year={2025}, howpublished={\url{https://github.com/zenlm/zen-engine}} }
Links
- GitHub: https://github.com/zenlm/zen-engine
- Organization: https://github.com/zenlm
- HuggingFace Models: https://huggingface.co/zenlm
- Zen Gym (Training): https://github.com/zenlm/zen-gym
- Zen Musician: https://github.com/zenlm/zen-musician
- Zen 3D: https://github.com/zenlm/zen-3d
Zen Engine - Blazingly fast inference for all Zen models
Part of the Zen AI ecosystem.