# API Server Example

Start an OpenAI-compatible HTTP inference server powered by Zerfoo.
## Prerequisites
- Go 1.25+
- A GGUF model file (e.g., Gemma 3 1B or Llama 3.2 1B)
## Downloading a test model

```sh
pip install huggingface-hub
huggingface-cli download google/gemma-3-1b-it-qat-q4_0-gguf \
  --local-dir ./models
```
## Build

```sh
go build -o api-server ./examples/api-server/
```
## Run

```sh
./api-server ./models/gemma-3-1b-it-qat-q4_0.gguf
```

With a custom port:

```sh
./api-server -port 9090 ./models/gemma-3-1b-it-qat-q4_0.gguf
```

With GPU acceleration:

```sh
./api-server -device cuda ./models/gemma-3-1b-it-qat-q4_0.gguf
```
## Testing with curl
### Chat completion

```sh
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-3-1b-it",
    "messages": [{"role": "user", "content": "What is 2+2?"}],
    "temperature": 0.7,
    "max_tokens": 128
  }' | jq .
```
### Text completion

```sh
curl -s http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-3-1b-it",
    "prompt": "The capital of France is",
    "max_tokens": 64
  }' | jq .
```
### Streaming

```sh
curl -N http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-3-1b-it",
    "messages": [{"role": "user", "content": "Write a haiku about Go."}],
    "stream": true
  }'
```
### List models

```sh
curl -s http://localhost:8080/v1/models | jq .
```
## Endpoints

| Method | Path | Description |
|---|---|---|
| POST | /v1/chat/completions | Chat completion (OpenAI-compatible) |
| POST | /v1/completions | Text completion |
| POST | /v1/embeddings | Text embeddings |
| GET | /v1/models | List loaded models |
| GET | /openapi.yaml | OpenAPI specification |
| GET | /metrics | Prometheus metrics |