mLLM - Local Voice Assistant for iOS

A privacy-focused, offline-first voice assistant app for iOS that runs language models locally on your device.

Features

  • Voice Input: Tap-to-talk with live transcription using Apple's Speech Framework
  • Local LLM Inference: Run TinyLLaMA or compatible models directly on device
  • Voice Output: Natural text-to-speech using Apple's AVSpeechSynthesizer
  • Tool Support: Wikipedia search integration for information retrieval
  • Multi-Language: German (de-DE) and English (en-US) support
  • Privacy First: All processing happens locally except optional web search

Screenshots

The app features:

  • Chat interface with user/assistant message bubbles
  • Animated microphone button with state indicators
  • Real-time status display (Listening/Thinking/Speaking/Idle)
  • Settings panel for language and configuration

Requirements

  • iOS 17.0+
  • Xcode 15.0+
  • iPhone with A12 chip or newer (Metal GPU acceleration)
  • ~2GB free storage for model files
  • ~6GB available RAM recommended

Quick Start

1. Clone the Repository

git clone https://github.com/Benjamin1333/mLLM.git
cd mLLM

2. Open in Xcode

3. Download a Model

Download a compatible GGUF model (recommended: TinyLlama 1.1B Q4_K_M):

# Example: Download TinyLlama 1.1B Chat Q4_K_M (~670MB)
curl -L -o model.gguf https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF/resolve/main/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf

4. Install the Model

Option A: Via Files App (Recommended)

  1. Build and run the app on your device
  2. Open the Files app on iOS
  3. Navigate to: On My iPhone → Voice Assistant
  4. Copy model.gguf to this folder

Option B: Via Xcode

  1. Connect your device
  2. Open Window → Devices and Simulators
  3. Select your device and the mLLM app
  4. Drag the model file into the app's Documents folder

5. Run the App

  1. Select your development team in Xcode
  2. Build and run on your device (Cmd+R)
  3. Grant microphone and speech recognition permissions
  4. Tap "Load Model" in settings if not auto-loaded
  5. Tap the microphone and start talking!

Supported Model Formats

Format   Extension   Backend     Recommended Quantization
GGUF     .gguf       llama.cpp   Q4_K_M, Q5_K_M

Recommended Models

Model                   Size     RAM Usage   Quality
TinyLlama 1.1B Q4_K_M   ~670MB   ~1.5GB      Good for basic tasks
TinyLlama 1.1B Q5_K_M   ~800MB   ~1.8GB      Better quality
Phi-2 Q4_K_M            ~1.6GB   ~3GB        Higher quality

Project Structure

mLLM/
├── mLLM.xcodeproj/
├── mLLM/
│   ├── App/
│   │   └── mLLMApp.swift           # App entry point
│   ├── Views/
│   │   ├── ContentView.swift       # Main view
│   │   ├── ChatView.swift          # Message list
│   │   ├── MicrophoneButton.swift  # Animated mic button
│   │   ├── StatusView.swift        # State indicator
│   │   └── SettingsView.swift      # Settings & model picker
│   ├── ViewModels/
│   │   └── AssistantViewModel.swift # Main coordinator
│   ├── Services/
│   │   ├── SpeechToTextService.swift
│   │   ├── TextToSpeechService.swift
│   │   └── LLMService/
│   │       ├── LLMBackend.swift    # Backend protocol
│   │       ├── LlamaCppBackend.swift # Primary backend
│   │       ├── MLXBackend.swift    # Stub for future
│   │       ├── PromptBuilder.swift
│   │       └── ToolRouter.swift
│   ├── Tools/
│   │   ├── Tool.swift              # Tool protocol
│   │   └── WikipediaSearchTool.swift
│   ├── Models/
│   │   ├── ChatMessage.swift
│   │   └── AppState.swift
│   └── Resources/
│       ├── Info.plist
│       └── Assets.xcassets/
└── README.md

MLX on iOS: Status & Fallback

Current Status

MLX is not currently usable for this project on iOS. MLX (Apple's machine learning framework) is designed for Apple Silicon Macs. Key limitations at the time of writing:

  1. The MLX Swift bindings officially target macOS 14+
  2. iOS enforces strict per-app memory limits that make large-model inference impractical
  3. Apple publishes no official iOS binaries

Our Approach

This project uses llama.cpp with Metal acceleration as the primary backend:

  • ✅ Well-tested on iOS devices
  • ✅ Efficient Metal GPU acceleration
  • ✅ Supports GGUF quantized models
  • ✅ Active community maintenance

The code is structured with a LLMBackend protocol, allowing future MLX integration if Apple releases iOS support:

protocol LLMBackend {
    func loadModel(from path: String, config: LLMConfiguration) async throws
    func generate(prompt: String, maxTokens: Int, ...) async throws -> String
    func cancelGeneration()
}
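For reference, a minimal conforming backend might look like the sketch below. The `LLMConfiguration` struct, the simplified `generate` signature (the project's real one takes additional parameters, elided above), and the echo behavior are illustrative assumptions, not the project's actual implementation:

```swift
import Foundation

// Illustrative configuration type; the project's actual LLMConfiguration may differ.
struct LLMConfiguration {
    var contextSize: Int = 2048
    var temperature: Double = 0.7
}

protocol LLMBackend {
    func loadModel(from path: String, config: LLMConfiguration) async throws
    func generate(prompt: String, maxTokens: Int) async throws -> String
    func cancelGeneration()
}

// A stand-in backend, useful for exercising the UI before llama.cpp is integrated.
final class SimulatedBackend: LLMBackend {
    private var modelPath: String?

    func loadModel(from path: String, config: LLMConfiguration) async throws {
        // A real backend would memory-map the GGUF file here.
        modelPath = path
    }

    func generate(prompt: String, maxTokens: Int) async throws -> String {
        guard modelPath != nil else {
            throw NSError(domain: "LLM", code: 1,
                          userInfo: [NSLocalizedDescriptionKey: "Model not loaded"])
        }
        // Echo a canned reply instead of running inference.
        return "Simulated reply to: \(prompt)"
    }

    func cancelGeneration() {
        // No-op: the simulated backend has nothing to cancel.
    }
}
```

Because the view models depend only on the protocol, swapping this stub for `LlamaCppBackend` (or a future MLX backend) requires no UI changes.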

Tool System

The assistant supports tools for extended capabilities:

Available Tools

Tool         Description                   Network Required
web_search   Wikipedia article summaries   Yes

Tool Call Format

The LLM can request tools using this format:

<tool_call>{"name":"web_search","args":{"query":"Berlin history"}}</tool_call>
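On the Swift side, a call in this format can be extracted with plain string scanning plus `JSONSerialization`. The sketch below is illustrative (the function name and tuple return type are assumptions, not the project's `ToolRouter` API):

```swift
import Foundation

/// Illustrative parser: extracts the first <tool_call>…</tool_call> payload
/// from model output and decodes its name and arguments.
func parseToolCall(_ output: String) -> (name: String, args: [String: Any])? {
    guard let start = output.range(of: "<tool_call>"),
          let end = output.range(of: "</tool_call>"),
          start.upperBound <= end.lowerBound else { return nil }
    let json = String(output[start.upperBound..<end.lowerBound])
    guard let data = json.data(using: .utf8),
          let obj = try? JSONSerialization.jsonObject(with: data) as? [String: Any],
          let name = obj["name"] as? String,
          let args = obj["args"] as? [String: Any] else { return nil }
    return (name, args)
}
```

If parsing fails (small models often emit malformed JSON), returning `nil` lets the assistant fall back to treating the output as a normal reply.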

Adding Custom Tools

Implement the Tool protocol:

protocol Tool {
    var name: String { get }
    var description: String { get }
    var requiresNetwork: Bool { get }
    func run(args: [String: Any]) async throws -> String
}
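As a concrete (hypothetical) example, a fully offline unit-converter tool could conform like this; the tool name, conversion logic, and error type below are illustrative, not part of the project:

```swift
import Foundation

protocol Tool {
    var name: String { get }
    var description: String { get }
    var requiresNetwork: Bool { get }
    func run(args: [String: Any]) async throws -> String
}

/// Hypothetical example tool: converts kilometers to miles, fully offline.
struct UnitConverterTool: Tool {
    let name = "unit_convert"
    let description = "Converts a distance in kilometers to miles."
    let requiresNetwork = false

    func run(args: [String: Any]) async throws -> String {
        guard let km = args["km"] as? Double else {
            throw NSError(domain: "Tool", code: 1,
                          userInfo: [NSLocalizedDescriptionKey: "Missing 'km' argument"])
        }
        let miles = km * 0.621371
        return String(format: "%.2f km is %.2f miles", km, miles)
    }
}
```

A new tool becomes available to the model once it is registered with the router and its name and description are included in the system prompt.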

Memory & Performance

RAM Usage Guidelines

Model Size   Quantization   Approx. RAM   Context Size
1.1B         Q4_K_M         ~1.5GB        2048
1.1B         Q5_K_M         ~1.8GB        2048
2.7B         Q4_K_M         ~2.5GB        1024

Optimization Tips

  1. Use Q4_K_M quantization for best memory/quality balance
  2. Limit context size to 1024-2048 tokens
  3. Clear conversation periodically to free memory
  4. Close background apps before intensive sessions

Token Generation Speed

Expected performance on modern iPhones (A15+):

  • TinyLlama 1.1B Q4: ~15-25 tokens/sec
  • Phi-2 Q4: ~8-15 tokens/sec

Troubleshooting

Permissions Issues

Microphone access denied:

  1. Go to Settings → Privacy & Security → Microphone
  2. Enable access for "Voice Assistant"

Speech recognition denied:

  1. Go to Settings → Privacy & Security → Speech Recognition
  2. Enable access for "Voice Assistant"

Audio Issues

No sound during TTS:

  • Check device is not in silent mode
  • Verify volume is turned up
  • Check Settings → Accessibility → Spoken Content

Recording conflicts:

  • Close other apps using the microphone
  • Restart the app if audio session fails

Model Loading Issues

"Model file not found":

  • Ensure file is named model.gguf
  • Verify it's in the app's Documents folder
  • Check file isn't corrupted (compare file size)

"Out of memory":

  • Use a smaller model (Q4 instead of Q5)
  • Reduce context size in settings
  • Restart the app to clear memory

"Invalid model format":

  • Only GGUF format is supported
  • Ensure compatible quantization (Q4_K_M, Q5_K_M, Q8_0)

Performance Issues

Slow generation:

  • Use Q4_K_M quantization
  • Reduce max tokens in settings
  • Close background apps

App freezes:

  • The first load may take 10-30 seconds
  • Tap "Stop" to cancel long operations

Integrating llama.cpp

To enable actual LLM inference (instead of simulation mode), integrate llama.cpp:

Option 1: Swift Package Manager

Add to your Package.swift or Xcode:

https://github.com/ggerganov/llama.cpp

Option 2: Manual Integration

  1. Clone llama.cpp repository
  2. Build the iOS framework
  3. Add to Xcode project
  4. Define LLAMA_CPP_AVAILABLE compilation flag

See llama.cpp iOS instructions for details.
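The compilation flag from step 4 can then gate backend selection at compile time. A minimal sketch (the selection function itself is illustrative; only the flag name comes from the steps above):

```swift
// Compile with -DLLAMA_CPP_AVAILABLE (or add it under Active Compilation
// Conditions in Xcode) once the llama.cpp framework is linked.
func activeBackendName() -> String {
    #if LLAMA_CPP_AVAILABLE
    return "llama.cpp"   // real inference via the llama.cpp backend
    #else
    return "simulated"   // fallback simulation mode
    #endif
}
```

This keeps the project buildable for contributors who have not yet set up the llama.cpp framework.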

Privacy

  • Speech Recognition: Processed on-device using Apple's Speech Framework
  • LLM Inference: 100% local, no data sent to servers
  • Web Search: Only when explicitly using the search tool
  • No Analytics: No tracking or telemetry

Contributing

Contributions welcome! Please:

  1. Fork the repository
  2. Create a feature branch
  3. Submit a pull request

License

MIT License - see LICENSE for details.

Acknowledgments

  • llama.cpp - LLM inference engine
  • TinyLlama - Compact language model
  • Apple Speech Framework - Speech recognition
  • Wikipedia API - Knowledge retrieval