mLLM - Local Voice Assistant for iOS
A privacy-focused, offline-first voice assistant app for iOS that runs language models locally on your device.
Features
- Voice Input: Tap-to-talk with live transcription using Apple's Speech Framework
- Local LLM Inference: Run TinyLLaMA or compatible models directly on device
- Voice Output: Natural text-to-speech using Apple's AVSpeechSynthesizer
- Tool Support: Wikipedia search integration for information retrieval
- Multi-Language: German (de-DE) and English (en-US) support
- Privacy First: All processing happens locally except optional web search
Screenshots
The app features:
- Chat interface with user/assistant message bubbles
- Animated microphone button with state indicators
- Real-time status display (Listening/Thinking/Speaking/Idle)
- Settings panel for language and configuration
Requirements
- iOS 17.0+
- Xcode 15.0+
- iPhone with A12 chip or newer (Metal 3 support)
- ~2GB free storage for model files
- ~6GB available RAM recommended
Quick Start
1. Clone the Repository
```sh
git clone https://github.com/Benjamin1333/mLLM.git
cd mLLM
```
2. Open in Xcode
Open `mLLM.xcodeproj` in Xcode.
3. Download a Model
Download a compatible GGUF model (recommended: TinyLlama 1.1B Q4_K_M):
```sh
# Example: Download TinyLlama 1.1B Chat Q4_K_M (~670MB)
curl -L -o model.gguf https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF/resolve/main/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf
```
4. Install the Model
Option A: Via Files App (Recommended)
- Build and run the app on your device
- Open the Files app on iOS
- Navigate to: On My iPhone → Voice Assistant
- Copy `model.gguf` to this folder
Option B: Via Xcode
- Connect your device
- Open Window → Devices and Simulators
- Select your device and the mLLM app
- Drag the model file into the app's Documents folder
5. Run the App
- Select your development team in Xcode
- Build and run on your device (Cmd+R)
- Grant microphone and speech recognition permissions
- Tap "Load Model" in settings if not auto-loaded
- Tap the microphone and start talking!
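The permission prompts in the steps above only appear if the usage-description keys are present in Info.plist. The keys below are the standard iOS ones; the description strings are illustrative:

```xml
<key>NSMicrophoneUsageDescription</key>
<string>Needed to capture your voice commands.</string>
<key>NSSpeechRecognitionUsageDescription</key>
<string>Needed to transcribe your speech on-device.</string>
```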
Supported Model Formats
| Format | Extension | Backend | Recommended Quantization |
|---|---|---|---|
| GGUF | .gguf | llama.cpp | Q4_K_M, Q5_K_M |
Recommended Models
| Model | Size | RAM Usage | Quality |
|---|---|---|---|
| TinyLlama 1.1B Q4_K_M | ~670MB | ~1.5GB | Good for basic tasks |
| TinyLlama 1.1B Q5_K_M | ~800MB | ~1.8GB | Better quality |
| Phi-2 Q4_K_M | ~1.6GB | ~3GB | Higher quality |
Project Structure
```
mLLM/
├── mLLM.xcodeproj/
├── mLLM/
│   ├── App/
│   │   └── mLLMApp.swift              # App entry point
│   ├── Views/
│   │   ├── ContentView.swift          # Main view
│   │   ├── ChatView.swift             # Message list
│   │   ├── MicrophoneButton.swift     # Animated mic button
│   │   ├── StatusView.swift           # State indicator
│   │   └── SettingsView.swift         # Settings & model picker
│   ├── ViewModels/
│   │   └── AssistantViewModel.swift   # Main coordinator
│   ├── Services/
│   │   ├── SpeechToTextService.swift
│   │   ├── TextToSpeechService.swift
│   │   └── LLMService/
│   │       ├── LLMBackend.swift       # Backend protocol
│   │       ├── LlamaCppBackend.swift  # Primary backend
│   │       ├── MLXBackend.swift       # Stub for future
│   │       ├── PromptBuilder.swift
│   │       └── ToolRouter.swift
│   ├── Tools/
│   │   ├── Tool.swift                 # Tool protocol
│   │   └── WikipediaSearchTool.swift
│   ├── Models/
│   │   ├── ChatMessage.swift
│   │   └── AppState.swift
│   └── Resources/
│       ├── Info.plist
│       └── Assets.xcassets/
└── README.md
```
MLX on iOS: Status & Fallback
Current Status
MLX is NOT available on iOS. MLX (Apple's machine learning framework) is designed exclusively for macOS with Apple Silicon. Key limitations:
- MLX Swift bindings target macOS 14+ only
- iOS lacks the unified memory architecture MLX requires
- No official iOS binaries from Apple
Our Approach
This project uses llama.cpp with Metal acceleration as the primary backend:
- ✅ Well-tested on iOS devices
- ✅ Efficient Metal GPU acceleration
- ✅ Supports GGUF quantized models
- ✅ Active community maintenance
The code is structured with a LLMBackend protocol, allowing future MLX integration if Apple releases iOS support:
```swift
protocol LLMBackend {
    func loadModel(from path: String, config: LLMConfiguration) async throws
    func generate(prompt: String, maxTokens: Int, ...) async throws -> String
    func cancelGeneration()
}
```
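As a sketch of how a conformer plugs into this protocol, here is a minimal "echo" backend useful for exercising the UI without a model file. The `LLMConfiguration` fields and the elided `generate` parameters (a `temperature` argument here) are assumptions for illustration, not the project's actual signatures:

```swift
import Foundation

// Hypothetical configuration type; the real LLMConfiguration lives in LLMService.
struct LLMConfiguration {
    var contextSize: Int = 2048
    var temperature: Double = 0.7
}

// The backend protocol, with the elided generate() parameters filled in
// by an assumed `temperature` argument so this sketch compiles.
protocol LLMBackend {
    func loadModel(from path: String, config: LLMConfiguration) async throws
    func generate(prompt: String, maxTokens: Int, temperature: Double) async throws -> String
    func cancelGeneration()
}

// A trivial backend that echoes the prompt back; no Metal, no weights.
final class EchoBackend: LLMBackend {
    private(set) var isLoaded = false

    func loadModel(from path: String, config: LLMConfiguration) async throws {
        // A real backend would mmap the GGUF file here.
        isLoaded = true
    }

    func generate(prompt: String, maxTokens: Int, temperature: Double) async throws -> String {
        return "echo: \(prompt)"
    }

    func cancelGeneration() {}
}
```

Swapping `EchoBackend` for `LlamaCppBackend` behind the protocol is what would make a later MLX backend a drop-in addition.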
Tool System
The assistant supports tools for extended capabilities:
Available Tools
| Tool | Description | Network Required |
|---|---|---|
| `web_search` | Wikipedia article summaries | Yes |
Tool Call Format
The LLM can request tools using this format:
```
<tool_call>{"name":"web_search","args":{"query":"Berlin history"}}</tool_call>
```
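Extracting that payload from model output is a matter of slicing between the tags and decoding the JSON. A hypothetical parser (the `ToolCall` type and function name are illustrative, not the project's `ToolRouter` API):

```swift
import Foundation

// Parsed form of a <tool_call> request emitted by the model.
struct ToolCall {
    let name: String
    let args: [String: Any]
}

// Returns nil if the tags are absent or the JSON between them is malformed.
func parseToolCall(from output: String) -> ToolCall? {
    guard let start = output.range(of: "<tool_call>"),
          let end = output.range(of: "</tool_call>"),
          start.upperBound <= end.lowerBound else { return nil }
    let json = String(output[start.upperBound..<end.lowerBound])
    guard let data = json.data(using: .utf8),
          let obj = (try? JSONSerialization.jsonObject(with: data)) as? [String: Any],
          let name = obj["name"] as? String else { return nil }
    let args = obj["args"] as? [String: Any] ?? [:]
    return ToolCall(name: name, args: args)
}
```

Returning `nil` rather than throwing lets the caller fall back to treating the text as a plain reply when no tool was requested.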
Adding Custom Tools
Implement the Tool protocol:
```swift
protocol Tool {
    var name: String { get }
    var description: String { get }
    var requiresNetwork: Bool { get }
    func run(args: [String: Any]) async throws -> String
}
```
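For example, a hypothetical offline calculator tool might conform like this (the tool itself is not part of the project; the protocol is re-declared so the sketch is self-contained):

```swift
import Foundation

// The Tool protocol from above, repeated here for a standalone example.
protocol Tool {
    var name: String { get }
    var description: String { get }
    var requiresNetwork: Bool { get }
    func run(args: [String: Any]) async throws -> String
}

enum ToolError: Error { case missingArgument(String) }

// Illustrative tool: adds two numbers passed in the args dictionary.
struct CalculatorTool: Tool {
    let name = "calculator"
    let description = "Adds two numbers passed as 'a' and 'b'."
    let requiresNetwork = false

    func run(args: [String: Any]) async throws -> String {
        guard let a = args["a"] as? Double, let b = args["b"] as? Double else {
            throw ToolError.missingArgument("'a' and 'b' must be numbers")
        }
        return String(a + b)
    }
}
```

The `description` string is what the prompt builder would show the model so it knows when to emit a matching `<tool_call>`.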
Memory & Performance
RAM Usage Guidelines
| Model Size | Quantization | Approx. RAM | Context Size |
|---|---|---|---|
| 1.1B | Q4_K_M | ~1.5GB | 2048 |
| 1.1B | Q5_K_M | ~1.8GB | 2048 |
| 2.7B | Q4_K_M | ~2.5GB | 1024 |
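These figures can be sanity-checked with a back-of-the-envelope formula (an assumption for illustration, not how llama.cpp actually accounts memory): resident RAM is roughly the quantized file size plus an f16 KV cache of 2 tensors × layers × context × embedding dim × 2 bytes.

```swift
import Foundation

// Rough upper-bound RAM estimate: quantized weights plus an f16 KV cache.
// Ignores grouped-query attention (which shrinks the KV cache) and runtime
// overhead (which adds some back). Illustrative numbers only.
func estimateRAMBytes(modelFileBytes: Int, layers: Int,
                      embeddingDim: Int, contextSize: Int) -> Int {
    let kvCacheBytes = 2 * layers * contextSize * embeddingDim * 2
    return modelFileBytes + kvCacheBytes
}
```

For TinyLlama 1.1B (22 layers, 2048-dim embeddings) at 2048 context, this gives about 0.37GB of KV cache on top of the ~0.67GB file, in the same ballpark as the ~1.5GB table figure once runtime overhead is included.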
Optimization Tips
- Use Q4_K_M quantization for best memory/quality balance
- Limit context size to 1024-2048 tokens
- Clear conversation periodically to free memory
- Close background apps before intensive sessions
Token Generation Speed
Expected performance on modern iPhones (A15+):
- TinyLlama 1.1B Q4: ~15-25 tokens/sec
- Phi-2 Q4: ~8-15 tokens/sec
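Measuring throughput on your own device is just tokens emitted divided by wall-clock seconds; a tiny helper (hypothetical, not part of the project) keeps the division-by-zero guard in one place:

```swift
import Foundation

// Generation speed: tokens emitted over elapsed wall-clock seconds.
// Returns 0 for a non-positive interval instead of dividing by zero.
func tokensPerSecond(tokens: Int, elapsed: TimeInterval) -> Double {
    guard elapsed > 0 else { return 0 }
    return Double(tokens) / elapsed
}
```

Wrap a `generate` call with `Date()` timestamps and pass the difference as `elapsed` to compare your device against the figures above.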
Troubleshooting
Permissions Issues
Microphone access denied:
- Go to Settings → Privacy & Security → Microphone
- Enable access for "Voice Assistant"
Speech recognition denied:
- Go to Settings → Privacy & Security → Speech Recognition
- Enable access for "Voice Assistant"
Audio Issues
No sound during TTS:
- Check device is not in silent mode
- Verify volume is turned up
- Check Settings → Accessibility → Spoken Content
Recording conflicts:
- Close other apps using the microphone
- Restart the app if audio session fails
Model Loading Issues
"Model file not found":
- Ensure the file is named `model.gguf`
- Verify it's in the app's Documents folder
- Check file isn't corrupted (compare file size)
"Out of memory":
- Use a smaller model (Q4 instead of Q5)
- Reduce context size in settings
- Restart the app to clear memory
"Invalid model format":
- Only GGUF format is supported
- Ensure compatible quantization (Q4_K_M, Q5_K_M, Q8_0)
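A quick way to rule out a mislabeled download: every GGUF file begins with the 4-byte ASCII magic `GGUF`. A sketch of a header check (this only validates the magic, not the quantization type):

```swift
import Foundation

// Returns true if the file starts with the GGUF magic bytes.
// Reads only the first 4 bytes, so it is safe for multi-GB model files.
func looksLikeGGUF(at url: URL) -> Bool {
    guard let handle = try? FileHandle(forReadingFrom: url) else { return false }
    defer { try? handle.close() }
    guard let magic = try? handle.read(upToCount: 4), magic.count == 4 else { return false }
    return magic == Data("GGUF".utf8)
}
```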
Performance Issues
Slow generation:
- Use Q4_K_M quantization
- Reduce max tokens in settings
- Close background apps
App freezes:
- The first load may take 10-30 seconds
- Tap "Stop" to cancel long operations
Integrating llama.cpp
To enable actual LLM inference (instead of simulation mode), integrate llama.cpp:
Option 1: Swift Package Manager
Add to your Package.swift or Xcode:
https://github.com/ggerganov/llama.cpp
Option 2: Manual Integration
- Clone llama.cpp repository
- Build the iOS framework
- Add to Xcode project
- Define the `LLAMA_CPP_AVAILABLE` compilation flag
See llama.cpp iOS instructions for details.
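The flag gates the real backend behind conditional compilation, which is what keeps simulation mode building when the framework is absent. A sketch of the pattern (the `runLlamaCpp` call is a hypothetical wrapper, not a real API):

```swift
import Foundation

// When LLAMA_CPP_AVAILABLE is not defined, fall back to a simulated reply
// so the UI remains testable without the llama.cpp framework linked in.
func generateReply(prompt: String) -> String {
#if LLAMA_CPP_AVAILABLE
    // Hypothetical call into the real llama.cpp wrapper.
    return runLlamaCpp(prompt: prompt)
#else
    return "simulated reply to: \(prompt)"
#endif
}
```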
Privacy
- Speech Recognition: Processed on-device using Apple's Speech Framework
- LLM Inference: 100% local, no data sent to servers
- Web Search: Only when explicitly using the search tool
- No Analytics: No tracking or telemetry
Contributing
Contributions welcome! Please:
- Fork the repository
- Create a feature branch
- Submit a pull request
License
MIT License - see LICENSE for details.