mLLM - Local Voice Assistant for iOS
A privacy-focused, offline-first voice assistant app for iOS that runs language models locally on your device.
Features
- Voice Input: Tap-to-talk with live transcription using Apple's Speech Framework
- Local LLM Inference: Run TinyLLaMA or compatible models directly on device
- Voice Output: Natural text-to-speech using Apple's AVSpeechSynthesizer
- Tool Support: Wikipedia search integration for information retrieval
- Multi-Language: German (de-DE) and English (en-US) support
- Privacy First: All processing happens locally except optional web search
Screenshots
The app features:
- Chat interface with user/assistant message bubbles
- Animated microphone button with state indicators
- Real-time status display (Listening/Thinking/Speaking/Idle)
- Settings panel for language and configuration
Requirements
- iOS 17.0+
- Xcode 15.0+
- iPhone with A12 chip or newer (Metal 3 support)
- ~2GB free storage for model files
- ~6GB available RAM recommended
Quick Start
1. Clone the Repository
```sh
git clone https://github.com/Benjamin1333/mLLM.git
cd mLLM
```
2. Open in Xcode
Open `mLLM.xcodeproj` in Xcode.
3. Download a Model
Download a compatible GGUF model (recommended: TinyLlama 1.1B Q4_K_M):
```sh
# Example: Download TinyLlama 1.1B Chat Q4_K_M (~670MB)
curl -L -o model.gguf https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF/resolve/main/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf
```
4. Install the Model
Option A: Via Files App (Recommended)
- Build and run the app on your device
- Open the Files app on iOS
- Navigate to: On My iPhone → Voice Assistant
- Copy `model.gguf` to this folder
Option B: Via Xcode
- Connect your device
- Open Window → Devices and Simulators
- Select your device and the mLLM app
- Drag the model file into the app's Documents folder
5. Run the App
- Select your development team in Xcode
- Build and run on your device (Cmd+R)
- Grant microphone and speech recognition permissions
- Tap "Load Model" in settings if not auto-loaded
- Tap the microphone and start talking!
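The permission prompts in the steps above only appear if the usage-description keys are present in Info.plist. The keys below are the standard iOS ones; the description strings are illustrative:

```xml
<key>NSMicrophoneUsageDescription</key>
<string>Needed to capture your voice commands.</string>
<key>NSSpeechRecognitionUsageDescription</key>
<string>Needed to transcribe your speech on-device.</string>
```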
Supported Model Formats
| Format | Extension | Backend | Recommended Quantization |
|---|---|---|---|
| GGUF | .gguf | llama.cpp | Q4_K_M, Q5_K_M |
Recommended Models
| Model | Size | RAM Usage | Quality |
|---|---|---|---|
| TinyLlama 1.1B Q4_K_M | ~670MB | ~1.5GB | Good for basic tasks |
| TinyLlama 1.1B Q5_K_M | ~800MB | ~1.8GB | Better quality |
| Phi-2 Q4_K_M | ~1.6GB | ~3GB | Higher quality |
Project Structure
```
mLLM/
├── mLLM.xcodeproj/
├── mLLM/
│   ├── App/
│   │   └── mLLMApp.swift              # App entry point
│   ├── Views/
│   │   ├── ContentView.swift          # Main view
│   │   ├── ChatView.swift             # Message list
│   │   ├── MicrophoneButton.swift     # Animated mic button
│   │   ├── StatusView.swift           # State indicator
│   │   └── SettingsView.swift         # Settings & model picker
│   ├── ViewModels/
│   │   └── AssistantViewModel.swift   # Main coordinator
│   ├── Services/
│   │   ├── SpeechToTextService.swift
│   │   ├── TextToSpeechService.swift
│   │   └── LLMService/
│   │       ├── LLMBackend.swift       # Backend protocol
│   │       ├── LlamaCppBackend.swift  # Primary backend
│   │       ├── MLXBackend.swift       # Stub for future
│   │       ├── PromptBuilder.swift
│   │       └── ToolRouter.swift
│   ├── Tools/
│   │   ├── Tool.swift                 # Tool protocol
│   │   └── WikipediaSearchTool.swift
│   ├── Models/
│   │   ├── ChatMessage.swift
│   │   └── AppState.swift
│   └── Resources/
│       ├── Info.plist
│       └── Assets.xcassets/
└── README.md
```
MLX on iOS: Status & Fallback
Current Status
MLX is NOT available on iOS. MLX (Apple's machine learning framework) is designed exclusively for macOS with Apple Silicon. Key limitations:
- MLX Swift bindings target macOS 14+ only
- iOS lacks the unified memory architecture MLX requires
- No official iOS binaries from Apple
Our Approach
This project uses llama.cpp with Metal acceleration as the primary backend:
- ✅ Well-tested on iOS devices
- ✅ Efficient Metal GPU acceleration
- ✅ Supports GGUF quantized models
- ✅ Active community maintenance
The code is structured with a LLMBackend protocol, allowing future MLX integration if Apple releases iOS support:
```swift
protocol LLMBackend {
    func loadModel(from path: String, config: LLMConfiguration) async throws
    func generate(prompt: String, maxTokens: Int, ...) async throws -> String
    func cancelGeneration()
}
```
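As a sketch of how a conformer plugs into this protocol, here is a minimal "echo" backend useful for exercising the UI without a model file. The `LLMConfiguration` fields and the elided `generate` parameters (a `temperature` argument here) are assumptions for illustration, not the project's actual signatures:

```swift
import Foundation

// Hypothetical configuration type; the real LLMConfiguration lives in LLMService.
struct LLMConfiguration {
    var contextSize: Int = 2048
    var temperature: Double = 0.7
}

// The backend protocol, with the elided generate() parameters filled in
// by an assumed `temperature` argument so this sketch compiles.
protocol LLMBackend {
    func loadModel(from path: String, config: LLMConfiguration) async throws
    func generate(prompt: String, maxTokens: Int, temperature: Double) async throws -> String
    func cancelGeneration()
}

// A trivial backend that echoes the prompt back; no Metal, no weights.
final class EchoBackend: LLMBackend {
    private(set) var isLoaded = false

    func loadModel(from path: String, config: LLMConfiguration) async throws {
        // A real backend would mmap the GGUF file here.
        isLoaded = true
    }

    func generate(prompt: String, maxTokens: Int, temperature: Double) async throws -> String {
        return "echo: \(prompt)"
    }

    func cancelGeneration() {}
}
```

Swapping `EchoBackend` for `LlamaCppBackend` behind the protocol is what would make a later MLX backend a drop-in addition.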
Tool System
The assistant supports tools for extended capabilities:
Available Tools
| Tool | Description | Network Required |
|---|---|---|
| `web_search` | Wikipedia article summaries | Yes |
Tool Call Format
The LLM can request tools using this format:
```
<tool_call>{"name":"web_search","args":{"query":"Berlin history"}}</tool_call>
```
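Extracting that payload from model output is a matter of slicing between the tags and decoding the JSON. A hypothetical parser (the `ToolCall` type and function name are illustrative, not the project's `ToolRouter` API):

```swift
import Foundation

// Parsed form of a <tool_call> request emitted by the model.
struct ToolCall {
    let name: String
    let args: [String: Any]
}

// Returns nil if the tags are absent or the JSON between them is malformed.
func parseToolCall(from output: String) -> ToolCall? {
    guard let start = output.range(of: "<tool_call>"),
          let end = output.range(of: "</tool_call>"),
          start.upperBound <= end.lowerBound else { return nil }
    let json = String(output[start.upperBound..<end.lowerBound])
    guard let data = json.data(using: .utf8),
          let obj = (try? JSONSerialization.jsonObject(with: data)) as? [String: Any],
          let name = obj["name"] as? String else { return nil }
    let args = obj["args"] as? [String: Any] ?? [:]
    return ToolCall(name: name, args: args)
}
```

Returning `nil` rather than throwing lets the caller fall back to treating the text as a plain reply when no tool was requested.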
Adding Custom Tools
Implement the Tool protocol:
```swift
protocol Tool {
    var name: String { get }
    var description: String { get }
    var requiresNetwork: Bool { get }
    func run(args: [String: Any]) async throws -> String
}
```
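For example, a hypothetical offline calculator tool might conform like this (the tool itself is not part of the project; the protocol is re-declared so the sketch is self-contained):

```swift
import Foundation

// The Tool protocol from above, repeated here for a standalone example.
protocol Tool {
    var name: String { get }
    var description: String { get }
    var requiresNetwork: Bool { get }
    func run(args: [String: Any]) async throws -> String
}

enum ToolError: Error { case missingArgument(String) }

// Illustrative tool: adds two numbers passed in the args dictionary.
struct CalculatorTool: Tool {
    let name = "calculator"
    let description = "Adds two numbers passed as 'a' and 'b'."
    let requiresNetwork = false

    func run(args: [String: Any]) async throws -> String {
        guard let a = args["a"] as? Double, let b = args["b"] as? Double else {
            throw ToolError.missingArgument("'a' and 'b' must be numbers")
        }
        return String(a + b)
    }
}
```

The `description` string is what the prompt builder would show the model so it knows when to emit a matching `<tool_call>`.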
Memory & Performance
RAM Usage Guidelines
| Model Size | Quantization | Approx. RAM | Context Size |
|---|---|---|---|
| 1.1B | Q4_K_M | ~1.5GB | 2048 |
| 1.1B | Q5_K_M | ~1.8GB | 2048 |
| 2.7B | Q4_K_M | ~2.5GB | 1024 |
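These figures can be sanity-checked with a back-of-the-envelope formula (an assumption for illustration, not how llama.cpp actually accounts memory): resident RAM is roughly the quantized file size plus an f16 KV cache of 2 tensors × layers × context × embedding dim × 2 bytes.

```swift
import Foundation

// Rough upper-bound RAM estimate: quantized weights plus an f16 KV cache.
// Ignores grouped-query attention (which shrinks the KV cache) and runtime
// overhead (which adds some back). Illustrative numbers only.
func estimateRAMBytes(modelFileBytes: Int, layers: Int,
                      embeddingDim: Int, contextSize: Int) -> Int {
    let kvCacheBytes = 2 * layers * contextSize * embeddingDim * 2
    return modelFileBytes + kvCacheBytes
}
```

For TinyLlama 1.1B (22 layers, 2048-dim embeddings) at 2048 context, this gives about 0.37GB of KV cache on top of the ~0.67GB file, in the same ballpark as the ~1.5GB table figure once runtime overhead is included.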
Optimization Tips
- Use Q4_K_M quantization for best memory/quality balance
- Limit context size to 1024-2048 tokens
- Clear conversation periodically to free memory
- Close background apps before intensive sessions
Token Generation Speed
Expected performance on modern iPhones (A15+):
- TinyLlama 1.1B Q4: ~15-25 tokens/sec
- Phi-2 Q4: ~8-15 tokens/sec
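Measuring throughput on your own device is just tokens emitted divided by wall-clock seconds; a tiny helper (hypothetical, not part of the project) keeps the division-by-zero guard in one place:

```swift
import Foundation

// Generation speed: tokens emitted over elapsed wall-clock seconds.
// Returns 0 for a non-positive interval instead of dividing by zero.
func tokensPerSecond(tokens: Int, elapsed: TimeInterval) -> Double {
    guard elapsed > 0 else { return 0 }
    return Double(tokens) / elapsed
}
```

Wrap a `generate` call with `Date()` timestamps and pass the difference as `elapsed` to compare your device against the figures above.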
Troubleshooting
Permissions Issues
Microphone access denied:
- Go to Settings → Privacy & Security → Microphone
- Enable access for "Voice Assistant"
Speech recognition denied:
- Go to Settings → Privacy & Security → Speech Recognition
- Enable access for "Voice Assistant"
Audio Issues
No sound during TTS:
- Check device is not in silent mode
- Verify volume is turned up
- Check Settings → Accessibility → Spoken Content
Recording conflicts:
- Close other apps using the microphone
- Restart the app if audio session fails
Model Loading Issues
"Model file not found":
- Ensure the file is named `model.gguf`
- Verify it's in the app's Documents folder
- Check file isn't corrupted (compare file size)
"Out of memory":
- Use a smaller model (Q4 instead of Q5)
- Reduce context size in settings
- Restart the app to clear memory
"Invalid model format":
- Only GGUF format is supported
- Ensure compatible quantization (Q4_K_M, Q5_K_M, Q8_0)
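A quick way to rule out a mislabeled download: every GGUF file begins with the 4-byte ASCII magic `GGUF`. A sketch of a header check (this only validates the magic, not the quantization type):

```swift
import Foundation

// Returns true if the file starts with the GGUF magic bytes.
// Reads only the first 4 bytes, so it is safe for multi-GB model files.
func looksLikeGGUF(at url: URL) -> Bool {
    guard let handle = try? FileHandle(forReadingFrom: url) else { return false }
    defer { try? handle.close() }
    guard let magic = try? handle.read(upToCount: 4), magic.count == 4 else { return false }
    return magic == Data("GGUF".utf8)
}
```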
Performance Issues
Slow generation:
- Use Q4_K_M quantization
- Reduce max tokens in settings
- Close background apps
App freezes:
- The first load may take 10-30 seconds
- Tap "Stop" to cancel long operations
Integrating llama.cpp
To enable actual LLM inference (instead of simulation mode), integrate llama.cpp:
Option 1: Swift Package Manager
Add to your Package.swift or Xcode:
https://github.com/ggerganov/llama.cpp
Option 2: Manual Integration
- Clone llama.cpp repository
- Build the iOS framework
- Add to Xcode project
- Define the `LLAMA_CPP_AVAILABLE` compilation flag
See llama.cpp iOS instructions for details.
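The flag gates the real backend behind conditional compilation, which is what keeps simulation mode building when the framework is absent. A sketch of the pattern (the `runLlamaCpp` call is a hypothetical wrapper, not a real API):

```swift
import Foundation

// When LLAMA_CPP_AVAILABLE is not defined, fall back to a simulated reply
// so the UI remains testable without the llama.cpp framework linked in.
func generateReply(prompt: String) -> String {
#if LLAMA_CPP_AVAILABLE
    // Hypothetical call into the real llama.cpp wrapper.
    return runLlamaCpp(prompt: prompt)
#else
    return "simulated reply to: \(prompt)"
#endif
}
```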
Privacy
- Speech Recognition: Processed on-device using Apple's Speech Framework
- LLM Inference: 100% local, no data sent to servers
- Web Search: Only when explicitly using the search tool
- No Analytics: No tracking or telemetry
Contributing
Contributions welcome! Please:
- Fork the repository
- Create a feature branch
- Submit a pull request
License
MIT License - see LICENSE for details.