Local LLM inference server for Apple Silicon using vllm-mlx. Serves MLX-quantized models via an OpenAI-compatible API.
## Setup
Requires Python 3.13+ and uv.
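A typical bootstrap with uv might look like the following. This is a sketch, not a verbatim recipe — the exact commands depend on the repo's `pyproject.toml`:

```sh
uv python install 3.13   # fetch a compatible interpreter if one isn't present
uv sync                  # create .venv and install the locked dependencies
```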
## Usage
The server starts on port 8082 by default, serving mlx-community/Qwen3.5-27B-6bit.
Override the defaults with environment variables:

```sh
MLX_MODEL=mlx-community/Qwen3.5-27B-4bit MLX_PORT=8080 ./serve.sh
```
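Because the API is OpenAI-compatible, any OpenAI client can talk to the server. A minimal request sketch using only the Python standard library, assuming the default port and model and the standard `/v1/chat/completions` path:

```python
import json
import urllib.request

# Defaults from serve.sh; adjust if MLX_MODEL or MLX_PORT were overridden.
payload = {
    "model": "mlx-community/Qwen3.5-27B-6bit",
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 64,
}

req = urllib.request.Request(
    "http://localhost:8082/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# With the server running, send the request and read the reply:
# body = json.load(urllib.request.urlopen(req))
# print(body["choices"][0]["message"]["content"])
```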
## Benchmarks
Qwen3.5-27B generation throughput (average of three isolated runs, M3 Max, 96 GB):
| Backend | Quantization | tok/s |
|---|---|---|
| MLX | 4bit | 16.8 |
| MLX | 6bit | 11.8 |
| MLX | mxfp8 | 9.4 |
| Ollama | Q4_K_M | 9.6 |
MLX 6bit is the default: the best balance of quality and throughput (+23% over Ollama Q4_K_M).