Complete Local RAG Setup Guide - Cross-Platform (Windows/Mac/Linux)
What We Built
A 100% local, ZERO-COST RAG system that replaces:
- OpenAI Embeddings ($0.13/million tokens) → Ollama/SentenceTransformers (FREE)
- Anthropic Claude ($0.25-1.25/million tokens) → Ollama Local LLMs (FREE)
- ChromaDB → LanceDB (10x faster)
Prerequisites Installed
- Python 3.9+ (via system or UV)
- 8GB RAM minimum (16GB recommended, 32GB optimal)
- OS: Windows 10/11, macOS 10.15+, Linux (Ubuntu 20.04+)
- Shell: PowerShell 5.1+ (Windows) or Bash (Mac/Linux)
Step-by-Step Setup Process
Quick Start (One Command)
Windows (PowerShell)
Mac/Linux (Bash)
chmod +x setup_complete.sh ./setup_complete.sh
Manual Setup Process
1. Install UV Package Manager
Windows (PowerShell)
Mac/Linux (Bash)
./setup_uv.sh # Or directly: curl -LsSf https://astral.sh/uv/install.sh | sh
2. Install Ollama
Windows
# Download and run installer: # https://github.com/ollama/ollama/releases/latest/download/OllamaSetup.exe # Or use script: .\install_ollama_windows.ps1
Mac
# Using Homebrew brew install ollama # Or using script ./install_ollama_unix.sh
Linux
# Official installer curl -fsSL https://ollama.ai/install.sh | sh # Or using script ./install_ollama_unix.sh
3. Start Ollama Service
Windows (PowerShell)
# Direct path & "$env:LOCALAPPDATA\Programs\Ollama\ollama.exe" serve # Or if in PATH ollama serve # Or use batch file .\run_ollama_direct.bat
Mac/Linux (Bash)
ollama serve # Or in background nohup ollama serve > /tmp/ollama.log 2>&1 & # Or use script ./run_ollama_direct.sh
Keep this running in a separate terminal!
4. Pull Required Models
All Platforms
# Pull embedding model (274MB) ollama pull nomic-embed-text # Pull LLM based on your RAM: # 8GB RAM - Mistral 7B (4.4GB) ollama pull mistral # 16GB RAM - Llama 2 13B ollama pull llama2:13b # 32GB+ RAM - Mixtral (best quality) ollama pull mixtral
Windows (if ollama not in PATH)
$ollama = "$env:LOCALAPPDATA\Programs\Ollama\ollama.exe" & $ollama pull nomic-embed-text & $ollama pull mistral
5. Install Local RAG Dependencies
Using UV (recommended - all platforms)
uv pip install lancedb pyarrow sentence-transformers
Using pip (fallback)
pip install lancedb pyarrow sentence-transformers
6. Test Everything Works
All Platforms
# Run comprehensive test
python test_local_rag.pyTest Ollama API
# List models curl http://localhost:11434/api/tags # Test embedding curl -X POST http://localhost:11434/api/embeddings \ -H "Content-Type: application/json" \ -d '{"model":"nomic-embed-text","prompt":"test"}' # Test LLM curl -X POST http://localhost:11434/api/generate \ -H "Content-Type: application/json" \ -d '{"model":"mistral","prompt":"Hello","stream":false}'
File Structure Created
RAG Shit/
├── src/
│ ├── vector_store_lancedb.py # LanceDB implementation
│ ├── embeddings_local.py # Ollama/ST embeddings
│ ├── llm_local.py # Ollama LLM wrapper
│ └── rag_pipeline_local.py # Complete local pipeline
├── config/
│ ├── local_rag_config.yaml # Configuration file
│ └── settings.py # Python settings
├── Windows Scripts/
│ ├── install_ollama_windows.ps1 # Ollama installer
│ ├── setup_ollama_models.ps1 # Model downloader
│ ├── setup_complete.ps1 # One-click setup
│ ├── run_ollama_direct.bat # Direct launcher
│ └── setup_uv.ps1 # UV installer
├── Unix Scripts/
│ ├── install_ollama_unix.sh # Ollama installer
│ ├── setup_ollama_models.sh # Model downloader
│ ├── setup_complete.sh # One-click setup
│ ├── run_ollama_direct.sh # Direct launcher
│ └── setup_uv.sh # UV installer
├── test_local_rag.py # Test suite
├── LOCAL_RAG_SETUP.md # Complete guide
├── QUICKSTART_LOCAL.md # Quick start
└── MIGRATION_GUIDE.md # Migration from APIs
Configuration Files
1. Ollama Models Configuration
Models are stored in: %USERPROFILE%\.ollama\models
Available models after setup:
nomic-embed-text- 768-dim embeddingsmistral:7b- 7B parameter LLM
2. LanceDB Configuration
Data stored in: ./data/lancedb/
Features enabled:
- Vector search with ANN index
- Hybrid search capability
- Automatic versioning
3. Environment Variables
No API keys needed! But if you want fallback to APIs:
# .env file (OPTIONAL - only for fallback) OPENAI_API_KEY=sk-... ANTHROPIC_API_KEY=sk-ant-...
Usage Examples
Basic Usage
from src.rag_pipeline_local import LocalRAGPipeline # Initialize pipeline rag = LocalRAGPipeline( llm_model="mistral:7b", embedding_model="nomic-embed-text" ) # Add documents docs = ["Your document text here..."] rag.add_documents(docs) # Query - ZERO COST! response = rag.query("What is this about?") print(f"Answer: {response.answer}") print(f"Cost: ${response.cost}") # Always $0.00!
With Sentence Transformers (No Ollama needed for embeddings)
rag = LocalRAGPipeline( llm_model="mistral:7b", embedding_model="all-MiniLM-L6-v2", use_sentence_transformers=True )
Troubleshooting
Issue: "ollama: command not found"
Solution: Use full path or add to PATH:
$env:Path += ";$env:LOCALAPPDATA\Programs\Ollama"
Issue: "Ollama service not running"
Solution: Start in new window:
Start-Process "$env:LOCALAPPDATA\Programs\Ollama\ollama.exe" -ArgumentList "serve"
Issue: "Model not found"
Solution: Pull the model:
& "$env:LOCALAPPDATA\Programs\Ollama\ollama.exe" pull mistral
Issue: "UV command not found"
Solution: Run setup and restart PowerShell:
.\setup_uv.ps1
# Then restart PowerShellPerformance Metrics
On 32GB RAM system with Mistral 7B:
- Embedding generation: ~0.2s per document
- Vector search: ~0.05s for 10k documents
- LLM generation: ~2-5s for 500 tokens
- Total query time: ~3-6s
- Cost: $0.00
Model Recommendations by RAM
| RAM | Embedding Model | LLM Model | Quality |
|---|---|---|---|
| 4GB | all-MiniLM-L6-v2 (ST) | phi-2 | Basic |
| 8GB | nomic-embed-text | mistral:7b | Good |
| 16GB | nomic-embed-text | llama2:13b | Excellent |
| 32GB+ | mxbai-embed-large | mixtral:8x7b | Best |
Cost Comparison
| Operation | Old (APIs) | New (Local) | Savings |
|---|---|---|---|
| 1M embeddings | $130 | $0 | 100% |
| 1M LLM tokens | $250-1250 | $0 | 100% |
| Monthly (100 queries/day) | ~$11.40 | $0 | 100% |
| Yearly | ~$137 | $0 | 100% |
Next Steps
-
Optimize for your hardware:
- GPU? Install CUDA-enabled llama.cpp
- More RAM? Use larger models
- SSD? Move LanceDB to SSD for faster search
-
Production deployment:
- Set up Ollama as Windows service
- Add monitoring/logging
- Implement caching layer
-
Advanced features:
- Implement reranking
- Add hybrid search
- Use quantized models for speed
The Bottom Line
You now have a complete local RAG system that:
- Costs nothing to run
- Protects your data (100% local)
- Runs offline (no internet needed)
- Has no rate limits
- Performs comparably to cloud solutions
Stop paying for API calls. This is the way.