Zen Engine

Native AI inference engine for Zen models - Blazingly fast LLM, embedding, and multimodal inference.

Zen Engine is a high-performance, Rust-native inference engine powering the entire Zen model family:

🚀 Blazingly Fast: Native Rust implementation for maximum performance
🔮 All Zen Models: Optimized for zen-nano, zen-eco, zen-agent, zen-director, zen-musician
🎯 Multimodal Native: text↔text, text+vision↔text, text→speech, text→music, text→video
🌐 APIs: Rust, Python, OpenAI-compatible HTTP server (Chat Completions and Embeddings APIs)
🔗 MCP Support: Connect to external tools and services automatically
⚡ Performance: ISQ, PagedAttention, FlashAttention, MLX support for Apple Silicon
🔧 Default Port: 3690 (Zen model serving)

Quick Start 🚀

Installation

# Clone the repository
git clone https://github.com/zenlm/zen-engine.git
cd zen-engine

# Build the engine
cargo build --release

# Or install via cargo
cargo install --git https://github.com/zenlm/zen-engine

Running Zen Models

# Start zen-nano (0.6B)
zen-engine serve --model zenlm/zen-nano-0.6b --port 3690

# Start zen-eco instruct (4B)
zen-engine serve --model zenlm/zen-eco-4b-instruct --port 3690

# Start zen-agent with tools (4B)
zen-engine serve --model zenlm/zen-agent-4b --enable-mcp --port 3690

# Start zen-musician (7B) for music generation
zen-engine serve --model zenlm/zen-musician-7b --port 3690

# With MLX on Apple Silicon
zen-engine serve --model zenlm/zen-nano-0.6b --backend mlx

# With GGUF quantization
zen-engine serve --model zenlm/zen-eco-4b-instruct-Q4_K_M.gguf

Chat Completions API

# Chat with zen-eco
curl -X POST http://localhost:3690/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "zen-eco-4b-instruct",
    "messages": [
      {"role": "user", "content": "Explain quantum computing"}
    ],
    "temperature": 0.7,
    "max_tokens": 500
  }'
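
The same request can be issued from Python without extra dependencies. The sketch below uses only the standard library and assumes the server returns an OpenAI-style response body (`choices[0].message.content`). The helper names `build_chat_request`, `extract_reply`, and `chat` are illustrative and not part of zen-engine.

```python
import json
import urllib.request

BASE_URL = "http://localhost:3690/v1"  # default port from this README


def build_chat_request(prompt, model="zen-eco-4b-instruct",
                       temperature=0.7, max_tokens=500, base_url=BASE_URL):
    """Build the same POST request as the curl example above."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(response_body):
    """Return the assistant text from an OpenAI-style response (assumed shape)."""
    return json.loads(response_body)["choices"][0]["message"]["content"]


def chat(prompt, timeout=60):
    """Send the request; this needs a running zen-engine server."""
    request = build_chat_request(prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return extract_reply(response.read())
```

With a server running on port 3690, `print(chat("Explain quantum computing"))` should print the model's reply.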

Embeddings API (Zen Embedding)

# Generate embeddings with Zen Embedding
curl -X POST http://localhost:3690/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "zen-embedding-8b",
    "input": "Zen models are fast and efficient"
  }'
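
A common next step is comparing two embeddings. This sketch assumes the endpoint returns the OpenAI embeddings response shape (a `data` list whose items carry `index` and `embedding`). The function names are hypothetical, and the 3-dimensional vectors in `SAMPLE_RESPONSE` are toy values, not real model output.

```python
import json
import math

# Toy response in the assumed OpenAI-style shape (real vectors are much longer)
SAMPLE_RESPONSE = json.dumps({
    "object": "list",
    "data": [
        {"index": 1, "embedding": [0.0, 1.0, 0.0]},
        {"index": 0, "embedding": [1.0, 0.0, 0.0]},
    ],
})


def embeddings_from_response(response_body):
    """Return the vectors ordered by their 'index' field."""
    data = json.loads(response_body)["data"]
    return [item["embedding"] for item in sorted(data, key=lambda d: d["index"])]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)
```

Passing a list of strings as `input` and feeding the response body to `embeddings_from_response` yields one vector per input, which can then be compared pairwise.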

Supported Zen Models

Zen Model Family

zen-nano (0.6B)

Architecture: Zen MoDE 0.6B
Formats: PyTorch, MLX, GGUF (Q2_K, Q4_K_M, Q8_0, F16)
Use Case: Edge devices, mobile, low-latency inference
Memory: 1-2GB

zen-eco (4B)

Variants: instruct, thinking, agent
Architecture: Zen MoDE 4B
Formats: PyTorch, MLX, GGUF (Q2_K, Q4_K_M, Q5_K_M, Q8_0, F16)
Use Case: General purpose, balanced performance
Memory: 4-8GB

zen-agent (4B)

Architecture: Zen MoDE 4B + tool-calling
Dataset: Salesforce/xlam-function-calling-60k
Formats: PyTorch, MLX, GGUF
Use Case: Tool use, function calling, API integration
Memory: 4-8GB

zen-director (5B)

Architecture: Wan2.2-TI2V-5B
Modality: Text/Image → Video
Formats: PyTorch (Diffusion)
Use Case: Video generation
Memory: 12-16GB

zen-musician (7B)

Architecture: YuE-s1-7B
Modality: Lyrics → Music (vocals + accompaniment)
Formats: PyTorch, LoRA adapters
Use Case: Music generation
Memory: 16-24GB

Embedding Models (Optimized)

zen-embedding-8b: 4096 dims, #1 on MTEB multilingual
zen-embedding-4b: 2048 dims, balanced performance
zen-embedding-0.6b: 1024 dims, lightweight

Reranker Models

zen-reranker-4b: Superior reranking quality
zen-reranker-0.6b: Lightweight reranking

Model Management

Pull Models

# Pull from HuggingFace
zen-engine pull zenlm/zen-nano-0.6b --source huggingface

# Pull GGUF format
zen-engine pull zenlm/zen-eco-4b-instruct-Q4_K_M.gguf

# Pull MLX format (Apple Silicon)
zen-engine pull zenlm/zen-nano-0.6b --source mlx

# List downloaded models
zen-engine list

# Delete a model
zen-engine delete zen-nano-0.6b

Performance Features

Apple Silicon Optimization (MLX)

Up to 44,000 tokens/second on M-series chips (zen-nano-0.6b, MLX, M3 Max; see Performance Benchmarks)
Optimized for zen-nano and zen-eco
Reduced memory usage
Native 4.5-bit quantization

CUDA Support

Multi-GPU inference
Flash Attention v2
Tensor parallelism
Optimized for RTX 3060 and newer GPUs

Quantization Support

GGUF: Q2_K, Q3_K_M, Q4_K_M, Q5_K_M, Q6_K, Q8_0, F16, F32
Dynamic: Runtime quantization
ISQ: In-Situ Quantization
BitDelta: Efficient delta compression
AWQ/GPTQ: Post-training quantization
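
To make the idea behind weight quantization concrete, here is a minimal sketch of symmetric absmax quantization over one block of weights. It is a generic illustration only: it is not the GGUF K-quant layout, nor the ISQ, AWQ, GPTQ, or BitDelta algorithms, and the function names are made up for this example.

```python
def quantize_block(values, bits=8):
    """Map a block of floats to signed integers sharing one scale factor."""
    qmax = 2 ** (bits - 1) - 1            # 127 for 8 bits, 7 for 4 bits
    absmax = max(abs(v) for v in values)
    scale = absmax / qmax if absmax > 0 else 1.0
    # Round to the nearest integer step and clamp to the representable range
    quantized = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return quantized, scale


def dequantize_block(quantized, scale):
    """Recover approximate floats; each value is off by at most scale / 2."""
    return [q * scale for q in quantized]
```

Fewer bits mean a larger `scale` and therefore a larger rounding error per weight, which is the trade-off behind choosing between formats such as Q2_K and Q8_0.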

API Compatibility

OpenAI Compatible

/v1/chat/completions - Chat with zen models
/v1/embeddings - Generate embeddings
/v1/models - List available models
Works as a drop-in endpoint for the OpenAI SDK (point base_url at the zen-engine server)

Ollama Compatible (Port 11434)

For backward compatibility with Ollama clients:

zen-engine serve --ollama-compat --port 11434

Integration Examples

Python SDK

from zen_engine import ZenEngine

# Initialize engine
engine = ZenEngine(
    model="zenlm/zen-eco-4b-instruct",
    device="cuda",
    quantization="Q4_K_M"
)

# Generate text
response = engine.generate(
    "Explain the theory of relativity",
    max_tokens=500,
    temperature=0.7
)
print(response)

# Generate embeddings
embeddings = engine.embed([
    "First sentence",
    "Second sentence"
])

Rust API

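// Besides the engine crate, this needs `anyhow` and `tokio`
// (with the "macros" and "rt-multi-thread" features) in Cargo.toml
use zen_engine::{ZenEngine, ModelConfig};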

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    // Load model
    let engine = ZenEngine::new(ModelConfig {
        model: "zenlm/zen-nano-0.6b".to_string(),
        device: "cuda:0".to_string(),
        quantization: Some("Q4_K_M".to_string()),
        ..Default::default()
    }).await?;

    // Generate
    let response = engine.generate(
        "Write a haiku about AI",
        None
    ).await?;

    println!("{}", response);
    Ok(())
}

OpenAI SDK (Drop-in)

from openai import OpenAI

# Point to zen-engine
client = OpenAI(
    base_url="http://localhost:3690/v1",
    api_key="not-needed"
)

# Use any zen model
response = client.chat.completions.create(
    model="zen-eco-4b-instruct",
    messages=[
        {"role": "user", "content": "Hello!"}
    ]
)
print(response.choices[0].message.content)

Performance Benchmarks

Inference Speed (tokens/second)

Model	Format	Device	Speed (tokens/s)
zen-nano-0.6b	MLX	M3 Max	44,000
zen-nano-0.6b	GGUF Q4	M3 Max	32,000
zen-nano-0.6b	GGUF Q4	RTX 4090	28,000
zen-eco-4b	MLX	M3 Max	18,000
zen-eco-4b	GGUF Q4	RTX 4090	12,000
zen-eco-4b	GGUF Q4	RTX 3060	5,500

Memory Usage

Model	Format	VRAM/RAM
zen-nano-0.6b	F16	1.2GB
zen-nano-0.6b	Q4_K_M	0.4GB
zen-eco-4b	F16	8GB
zen-eco-4b	Q4_K_M	2.3GB
zen-eco-4b	Q2_K	1.6GB
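
The figures above roughly follow from parameter count times bits per weight. The sketch below is a back-of-the-envelope estimate for weights only, ignoring KV cache and activations. The bits-per-weight values for the quantized formats are approximate assumptions, not numbers taken from this README.

```python
# Approximate bits per weight. F16 is exact; the quantized values are
# rough assumptions used for illustration only.
BITS_PER_WEIGHT = {
    "F16": 16.0,
    "Q8_0": 8.5,
    "Q4_K_M": 4.85,
    "Q2_K": 3.0,
}


def estimate_weight_memory_gb(params_billions, fmt):
    """Estimate weight storage in GB: parameters x bits per weight / 8."""
    return params_billions * BITS_PER_WEIGHT[fmt] / 8.0


if __name__ == "__main__":
    for name, size in (("zen-nano-0.6b", 0.6), ("zen-eco-4b", 4.0)):
        for fmt in ("F16", "Q4_K_M"):
            print(f"{name} {fmt}: {estimate_weight_memory_gb(size, fmt):.2f} GB")
```

For zen-eco-4b this gives 8.0 GB at F16 and roughly 2.4 GB at Q4_K_M, close to the 8GB and 2.3GB listed in the table.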

Development

Building from Source

# Clone repository
git clone https://github.com/zenlm/zen-engine.git
cd zen-engine

# Basic build
cargo build --release

# With CUDA support (Linux)
cargo build --release --features "cuda flash-attn cudnn"

# With Metal support (macOS)
cargo build --release --features metal

# With all features
cargo build --release --features "cuda metal flash-attn"

Running Tests

# Run all tests
cargo test --workspace

# Run specific package tests
cargo test -p zen-engine-core
cargo test -p zen-engine-quant

# Run benchmarks
cargo bench

Code Quality

# Format code
cargo fmt --all

# Run clippy
cargo clippy --workspace --tests -- -D warnings

# Check formatting
cargo fmt --all -- --check

Architecture

Workspace Structure

zen-engine-core/ - Core inference engine, Zen model implementations
zen-engine-server/ - CLI binary and HTTP server
zen-engine-python/ - Python bindings (PyO3)
zen-engine-vision/ - Vision model support (zen-director)
zen-engine-audio/ - Audio processing (zen-musician)
zen-engine-quant/ - Quantization (GGUF, AWQ, GPTQ, BitDelta)
zen-engine-paged-attn/ - PagedAttention implementation
zen-engine-mcp/ - Model Context Protocol (zen-agent)

Key Features

Unified Model Loading: All Zen models load through standardized pipelines
Multi-Backend: CUDA, Metal, CPU with automatic device selection
Zero-Copy: Efficient memory sharing between Rust and Python
Async First: Tokio-based async runtime for concurrent requests
Type Safety: Strongly typed Rust with Python bindings
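
On the client side, concurrent requests are usually capped so the server is not flooded. The following is a minimal asyncio sketch of that pattern. `run_bounded` and its `send` parameter are hypothetical names, with `send` standing in for any coroutine that performs one request against the server.

```python
import asyncio


async def run_bounded(prompts, send, max_concurrent=10):
    """Run send(prompt) for every prompt with at most max_concurrent in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def guarded(prompt):
        # Wait for a free slot before issuing the request
        async with semaphore:
            return await send(prompt)

    # gather preserves the order of the input prompts
    return await asyncio.gather(*(guarded(p) for p in prompts))
```

Results come back in the same order as `prompts`, regardless of which request finishes first.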

Configuration

Environment Variables

# Server configuration
export ZEN_ENGINE_PORT=3690
export ZEN_ENGINE_HOST="0.0.0.0"

# Model paths
export ZEN_MODELS_PATH="/path/to/models"
export ZEN_CACHE_DIR="/path/to/cache"

# Device configuration
export ZEN_DEVICE="cuda:0"
export ZEN_DEVICE_MAP="auto"

# Performance
export ZEN_BATCH_SIZE=32
export ZEN_MAX_CONCURRENT=10
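
A wrapper script can read the same variables before launching or calling the server. This is a sketch under assumptions: only the port default (3690) is stated in this README, the other fallback values simply mirror the example above, and `load_settings` is a hypothetical helper rather than part of zen-engine.

```python
import os

# Fallbacks used when a variable is unset. Only the port default is
# documented; the rest mirror the example values and are assumptions.
DEFAULTS = {
    "ZEN_ENGINE_PORT": "3690",
    "ZEN_ENGINE_HOST": "0.0.0.0",
    "ZEN_DEVICE": "cuda:0",
    "ZEN_BATCH_SIZE": "32",
    "ZEN_MAX_CONCURRENT": "10",
}


def load_settings(environ=None):
    """Read settings from the environment and convert numeric values."""
    env = os.environ if environ is None else environ
    raw = {name: env.get(name, default) for name, default in DEFAULTS.items()}
    settings = {
        "port": int(raw["ZEN_ENGINE_PORT"]),
        "host": raw["ZEN_ENGINE_HOST"],
        "device": raw["ZEN_DEVICE"],
        "batch_size": int(raw["ZEN_BATCH_SIZE"]),
        "max_concurrent": int(raw["ZEN_MAX_CONCURRENT"]),
    }
    if not 1 <= settings["port"] <= 65535:
        raise ValueError(f"invalid port: {settings['port']}")
    return settings
```

Passing a plain dict as `environ` makes the function easy to exercise without touching the real environment.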

Config File (zen-engine.toml)

[server]
port = 3690
host = "0.0.0.0"
max_concurrent = 10

[models]
cache_dir = "/path/to/cache"
default_quantization = "Q4_K_M"

[cuda]
devices = [0, 1]
flash_attn = true

[mlx]
enabled = true

Deployment

Docker

# Build image
docker build -t zen-engine .

# Run container
docker run -p 3690:3690 \
  -v "$(pwd)/models:/models" \
  zen-engine serve --model zenlm/zen-eco-4b-instruct

Kubernetes

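apiVersion: apps/v1
kind: Deployment
metadata:
  name: zen-engine
spec:
  replicas: 3
  # apps/v1 requires a selector that matches the pod template labels
  selector:
    matchLabels:
      app: zen-engine
  template:
    metadata:
      labels:
        app: zen-engine
    spec:
      containers:
      - name: zen-engine
        image: zenlm/zen-engine:latest
        ports:
        - containerPort: 3690
        env:
        - name: ZEN_DEVICE
          value: "cuda:0"
        resources:
          limits:
            nvidia.com/gpu: 1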

Troubleshooting

Common Issues

Out of Memory

# Use smaller quantization
zen-engine serve --model zen-eco-4b-Q2_K.gguf

# Enable CPU offload
zen-engine serve --model zen-eco-4b --cpu-offload

Slow Inference

# Enable FlashAttention
zen-engine serve --model zen-eco-4b --flash-attn

# Use MLX on Apple Silicon
zen-engine serve --model zen-eco-4b --backend mlx

CUDA Errors

# Check CUDA version
nvidia-smi

# Rebuild with correct CUDA
cargo clean
cargo build --release --features "cuda flash-attn"

Contributing

We welcome contributions! See CONTRIBUTING.md for guidelines.

Credits

Zen Engine is built on mistral.rs by Eric Buehler. We thank the mistral.rs team for their excellent work on Rust-native LLM inference.

License

Apache 2.0 License - see LICENSE for details.

Citation

@misc{zenengine2025,
  title={Zen Engine: High-Performance Inference for Zen Models},
  author={Zen AI Team},
  year={2025},
  howpublished={\url{https://github.com/zenlm/zen-engine}}
}