Local LLM inference server for Apple Silicon using vllm-mlx. Serves MLX-quantized models via an OpenAI-compatible API.
## Setup
Requires Python 3.13+ and uv.
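A typical bootstrap with uv might look like the following. This is a sketch, not a verbatim recipe — the exact commands depend on the repo's `pyproject.toml`:

```sh
uv python install 3.13   # fetch a compatible interpreter if one isn't present
uv sync                  # create .venv and install the locked dependencies
```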
## Usage
The server starts on port 8082 by default, serving mlx-community/Qwen3.5-27B-6bit.
Override the defaults with environment variables:

```sh
MLX_MODEL=mlx-community/Qwen3.5-27B-4bit MLX_PORT=8080 ./serve.sh
```
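Because the API is OpenAI-compatible, any OpenAI client can talk to the server. A minimal request sketch using only the Python standard library, assuming the default port and model and the standard `/v1/chat/completions` path:

```python
import json
import urllib.request

# Defaults from serve.sh; adjust if MLX_MODEL or MLX_PORT were overridden.
payload = {
    "model": "mlx-community/Qwen3.5-27B-6bit",
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 64,
}

req = urllib.request.Request(
    "http://localhost:8082/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# With the server running, send the request and read the reply:
# body = json.load(urllib.request.urlopen(req))
# print(body["choices"][0]["message"]["content"])
```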
## Benchmarks
Qwen3.5-27B generation throughput (average of three isolated runs, M3 Max, 96 GB):
| Backend | Quantization | tok/s |
|---|---|---|
| MLX | 4bit | 16.8 |
| MLX | 6bit | 11.8 |
| MLX | mxfp8 | 9.4 |
| Ollama | Q4_K_M | 9.6 |
MLX 6bit is the default: the best balance of quality and throughput (+23% over Ollama Q4_K_M).