dehora/mlx-server

Local LLM inference server for Apple Silicon using vllm-mlx. Serves MLX-quantized models via an OpenAI-compatible API.

Setup

Requires Python 3.13+ and uv.
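
A minimal setup sketch, assuming the repository ships a standard uv project file (pyproject.toml); the clone URL follows from the repo name above:

```shell
# Clone the repo and install dependencies into a local virtualenv with uv.
git clone https://github.com/dehora/mlx-server
cd mlx-server
uv sync   # assumes a pyproject.toml at the repo root
```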

Usage

The server starts on port 8082 by default, serving mlx-community/Qwen3.5-27B-6bit.

Override with environment variables:

MLX_MODEL=mlx-community/Qwen3.5-27B-4bit MLX_PORT=8080 ./serve.sh
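
Since the server exposes an OpenAI-compatible API, it can be queried with any OpenAI-style client. A minimal stdlib sketch, assuming the default port and model above and the standard chat completions path:

```python
# Build and send a chat completion request to the local server.
# Endpoint path and payload shape follow the OpenAI chat completions
# convention; model and port are the server defaults described above.
import json
import urllib.request

payload = {
    "model": "mlx-community/Qwen3.5-27B-6bit",
    "messages": [{"role": "user", "content": "Hello"}],
}
req = urllib.request.Request(
    "http://localhost:8082/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# Uncomment with the server running:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```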

Benchmarks

Qwen3.5-27B generation throughput, averaged over three isolated runs on an M3 Max with 96 GB:

Backend   Quantization   tok/s
MLX       4bit           16.8
MLX       6bit           11.8
MLX       mxfp8           9.4
Ollama    Q4_K_M          9.6

MLX 6bit is the default: it offers the best balance of quality and throughput, with +23% tok/s over Ollama Q4_K_M.