System Overview
graph TB
    subgraph "Document Ingestion"
        A[Documents] --> B[Chunking]
        B --> C[Embeddings]
        C --> D[Vector Store]
    end
    subgraph "Query Processing"
        E[User Query] --> F[Query Embedding]
        F --> G[Vector Search]
        G --> H[Context Retrieval]
    end
    subgraph "Response Generation"
        H --> I[Context + Query]
        I --> J[Local LLM]
        J --> K[Response]
    end
    D -.-> G
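The same flow expressed in code, as a minimal sketch assuming the LocalRAGPipeline API used in the deployment scenarios below (add_documents is an assumed method name; response.answer and response.sources follow the API example later in this document):
# Minimal end-to-end sketch of the three stages above; add_documents is
# an assumption, not a verified method name.
from src.rag_pipeline_local import LocalRAGPipeline

rag = LocalRAGPipeline()

# Ingestion: chunk, embed, and store documents in the vector store
rag.add_documents(["Local RAG keeps every byte on your own machine."])

# Query processing + response generation: embed the query, retrieve
# context, and answer with the local LLM
response = rag.query("Where does the data live?")
print(response.answer)
print(response.sources)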
Component Architecture
Core Components
python_example/
├── src/
│   ├── rag_pipeline_local.py    # Main orchestrator
│   ├── vector_store_lancedb.py  # Vector storage
│   ├── embeddings_local.py      # Text embeddings
│   ├── llm_local.py             # LLM generation
│   └── chunking.py              # Document processing
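How these modules could fit together, as a rough sketch; the class and argument names are inferred from the file names and the data-flow diagrams below, not verified against the source:
# Rough wiring sketch; TextChunker, OllamaEmbeddings, and
# LanceDBVectorStore appear in the data-flow section below, while
# OllamaLLM is an assumed name for the generator in llm_local.py.
from src.chunking import TextChunker
from src.embeddings_local import OllamaEmbeddings
from src.vector_store_lancedb import LanceDBVectorStore
from src.llm_local import OllamaLLM

chunker = TextChunker(chunk_size=512, chunk_overlap=50)    # document processing
embedder = OllamaEmbeddings(model="nomic-embed-text")      # text embeddings
store = LanceDBVectorStore(data_dir="./data/lancedb")      # vector storage
llm = OllamaLLM(model="mistral:latest")                    # LLM generation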
Deployment Scenarios
Scenario 1: Personal Desktop (Most Common)
Hardware Requirements:
- CPU: 4+ cores
- RAM: 8-16GB
- Storage: 20GB SSD
- GPU: Optional (speeds up inference)
Configuration:
rag = LocalRAGPipeline(
    llm_model="mistral:latest",                # 7B model
    embedding_model="nomic-embed-text:latest",
    chunk_size=512,
    use_sentence_transformers=False
)
Performance:
- Embedding: ~0.2s/doc
- Search: <0.05s
- Generation: 2-5s
- Total query: 3-6s
Scenario 2: High-Performance Workstation
Hardware Requirements:
- CPU: 8+ cores
- RAM: 32GB+
- Storage: 100GB NVMe SSD
- GPU: NVIDIA RTX 3060+ or Apple M1+
Configuration:
rag = LocalRAGPipeline(
    llm_model="mixtral:8x7b",                  # MoE model
    embedding_model="mxbai-embed-large",
    chunk_size=1024,
    chunk_overlap=100
)

# Enable GPU acceleration
llm = LlamaCppLLM(
    model_path="models/mixtral.gguf",
    n_gpu_layers=35                            # Offload to GPU
)
Performance:
- Embedding: ~0.1s/doc
- Search: <0.02s
- Generation: 1-3s
- Total query: 2-4s
Scenario 3: Lightweight Laptop
Hardware Requirements:
- CPU: 2+ cores
- RAM: 4-8GB
- Storage: 10GB
- GPU: Not required
Configuration:
# Use sentence transformers (no Ollama needed)
rag = LocalRAGPipeline(
    llm_model="phi",                           # 2.7B model
    embedding_model="all-MiniLM-L6-v2",
    chunk_size=256,
    use_sentence_transformers=True
)
Performance:
- Embedding: ~0.1s/doc
- Search: <0.1s
- Generation: 5-10s
- Total query: 6-12s
Scenario 4: Server/Docker Deployment
Hardware Requirements:
- CPU: 8+ cores
- RAM: 16GB+
- Storage: 50GB
- GPU: Optional
Docker Compose Configuration:
version: '3.8'

services:
  ollama:
    image: ollama/ollama:latest
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    deploy:
      resources:
        limits:
          memory: 8G
    command: serve

  rag-api:
    build: .
    ports:
      - "8000:8000"
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
      - LANCEDB_DATA_DIR=/data/lancedb
    volumes:
      - ./data:/data
    depends_on:
      - ollama

volumes:
  ollama_data:
API Configuration:
# FastAPI wrapper for RAG
from fastapi import FastAPI

from src.rag_pipeline_local import LocalRAGPipeline

app = FastAPI()
rag = LocalRAGPipeline()

@app.post("/query")
async def query(text: str):
    response = rag.query(text)
    return {"answer": response.answer, "sources": response.sources}
Scenario 5: Edge Device (Raspberry Pi)
Hardware Requirements:
- CPU: ARM64 4+ cores
- RAM: 4-8GB
- Storage: 32GB SD card
- GPU: Not applicable
Configuration:
# Ultra-lightweight setup
rag = LocalRAGPipeline(
    llm_model=None,                            # Use only retrieval
    embedding_model="all-MiniLM-L6-v2",
    chunk_size=128,
    use_sentence_transformers=True
)

# Retrieval-only mode
def retrieve_only(query):
    results = rag.vector_store.search(query, top_k=3)
    return results  # Return relevant chunks without generation
Data Flow Architecture
1. Document Ingestion Pipeline
Documents
    ↓
TextChunker/MarkdownChunker
    ├── chunk_size: 512
    ├── chunk_overlap: 50
    └── separators: ["\n\n", "\n", ". ", " "]
    ↓
Embeddings (OllamaEmbeddings/SentenceTransformers)
    ├── model: nomic-embed-text
    ├── dimensions: 768
    └── cache: ./data/embedding_cache/
    ↓
LanceDBVectorStore
    ├── storage: ./data/lancedb/
    ├── index: IVF-PQ
    └── schema: dynamic
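The chunking step is the only part of this stage with non-obvious mechanics. The snippet below is a self-contained illustration of fixed-size chunking with overlap, not the project's TextChunker, which also splits on the separators listed above:
# Simplified stand-in for the overlap-aware chunking step.
def chunk_text(text: str, chunk_size: int = 512, chunk_overlap: int = 50):
    """Split text into fixed-size windows that overlap by chunk_overlap characters."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - chunk_overlap, 1), step)]

chunks = chunk_text("A" * 1200)
print(len(chunks), [len(c) for c in chunks])   # 3 chunks, each at most 512 chars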
2. Query Processing Pipeline
User Query
    ↓
Query Embedding
    ├── same model as documents
    └── cached if repeated
    ↓
Vector Search
    ├── metric: L2/cosine
    ├── top_k: 5
    └── hybrid: optional
    ↓
Context Assembly
    ├── ranked by similarity
    └── metadata included
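The search step amounts to ranking stored vectors against the query vector. A brute-force illustration of the cosine variant follows; the real search goes through LanceDB's ANN index, which performs the same ranking approximately but far faster:
# Brute-force illustration of top_k cosine search over stand-in vectors.
import numpy as np

def top_k_cosine(query_vec, doc_vecs, top_k=5):
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q
    order = np.argsort(-scores)[:top_k]
    return order, scores[order]

doc_vecs = np.random.rand(100, 768)    # stand-in document embeddings
query_vec = np.random.rand(768)        # stand-in query embedding
idx, sims = top_k_cosine(query_vec, doc_vecs)
print(idx, sims)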
3. Response Generation Pipeline
Context + Query
    ↓
Prompt Template
    ├── system_prompt: customizable
    ├── context: retrieved chunks
    └── query: user question
    ↓
Local LLM (Ollama)
    ├── model: mistral/llama2/mixtral
    ├── temperature: 0.7
    └── max_tokens: 1000
    ↓
Response
    ├── answer: generated text
    ├── sources: chunk references
    └── metadata: timing, model
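The prompt-template step can be pictured as simple string assembly. The wording below is an assumption; only the structure (system prompt + retrieved context + question) comes from the diagram:
# Illustrative prompt assembly; the actual template text is configurable
# and may differ from this wording.
def build_prompt(system_prompt: str, chunks: list, question: str) -> str:
    context = "\n\n".join(chunks)
    return (
        f"{system_prompt}\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

prompt = build_prompt(
    "Answer using only the provided context.",
    ["LanceDB stores vectors under ./data/lancedb/."],
    "Where are embeddings persisted?",
)
print(prompt)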
Storage Architecture
File System Layout
data/
├── lancedb/                 # Vector database
│   ├── documents.lance/     # Main collection
│   └── _versions/           # Version history
├── embedding_cache/         # Cached embeddings
│   └── *.json               # MD5-named cache files
├── models/                  # Optional local models
│   └── *.gguf               # Quantized models
└── logs/                    # Application logs
Database Schema
LanceDB Table Structure:
{
"id": str, # UUID
"text": str, # Original chunk text
"vector": List[float], # Embedding vector
"metadata": str, # JSON metadata
"timestamp": datetime # Creation time
}
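A sketch of creating a table with this shape through the standard LanceDB Python API; the field values are placeholders and the table is inferred from a single example row:
# Placeholder row matching the schema above; LanceDB infers the Arrow
# schema from the data passed to create_table.
import uuid
from datetime import datetime, timezone

import lancedb

db = lancedb.connect("./data/lancedb")
table = db.create_table(
    "documents",
    data=[{
        "id": str(uuid.uuid4()),
        "text": "example chunk text",
        "vector": [0.0] * 768,                    # embedding vector
        "metadata": '{"source": "example.md"}',   # JSON-encoded metadata
        "timestamp": datetime.now(timezone.utc),
    }],
)
print(table.count_rows())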
Performance Optimization
1. Caching Strategy
# Three-tier caching
L1: In-memory LRU cache (recent queries)
L2: Embedding cache (disk-based)
L3: LanceDB built-in caching
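A minimal sketch of the first two tiers (the third is internal to LanceDB), assuming the MD5-named JSON cache files described in the storage layout; the helper names and the stand-in embedding function are illustrative only:
# Illustrative two-tier embedding cache: functools LRU in memory,
# MD5-named JSON files on disk. embed_fn stands in for the real backend.
import hashlib
import json
from functools import lru_cache
from pathlib import Path

CACHE_DIR = Path("./data/embedding_cache")
CACHE_DIR.mkdir(parents=True, exist_ok=True)

def disk_cached_embedding(text: str, embed_fn):
    """L2: disk cache keyed by the MD5 hash of the text."""
    key = hashlib.md5(text.encode("utf-8")).hexdigest()
    path = CACHE_DIR / f"{key}.json"
    if path.exists():
        return json.loads(path.read_text())
    vector = embed_fn(text)
    path.write_text(json.dumps(vector))
    return vector

@lru_cache(maxsize=1024)
def cached_embedding(text: str):
    """L1: in-memory LRU for recently seen texts."""
    return tuple(disk_cached_embedding(text, embed_fn=lambda t: [0.0] * 768))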
2. Batching Strategy
# Document batching
BATCH_SIZE = 100        # Documents per batch
EMBEDDING_BATCH = 32    # Parallel embeddings
SEARCH_BATCH = 10       # Concurrent searches
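How a batch size like EMBEDDING_BATCH is typically applied, shown as an illustrative helper with a dummy backend:
# Illustrative batched embedding; embed_batch_fn stands in for the
# configured embedding backend.
def embed_in_batches(texts, embed_batch_fn, batch_size=32):
    """Embed texts batch_size at a time instead of one call per text."""
    vectors = []
    for start in range(0, len(texts), batch_size):
        vectors.extend(embed_batch_fn(texts[start:start + batch_size]))
    return vectors

dummy_backend = lambda batch: [[0.0] * 768 for _ in batch]
print(len(embed_in_batches(["chunk"] * 100, dummy_backend)))   # 100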
3. Index Configuration
# ANN index for fast search
index_config = {
    "type": "IVF_PQ",
    "num_partitions": 256,
    "num_sub_vectors": 96,
    "metric": "L2",
    "nprobes": 20
}
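Applied through the LanceDB Python API this looks roughly like the following; note that nprobes is a query-time setting rather than part of the index build:
# Rough sketch against the LanceDB Python API; parameter values mirror
# index_config above, and nprobes is applied per search.
import lancedb

db = lancedb.connect("./data/lancedb")
table = db.open_table("documents")

# Build the IVF-PQ index after bulk ingestion (see Best Practices)
table.create_index(metric="L2", num_partitions=256, num_sub_vectors=96)

# Trade recall for speed at query time
results = table.search([0.0] * 768).nprobes(20).limit(5).to_list()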
Scaling Architecture
Horizontal Scaling
# Multiple RAG instances with shared storage
instances = []
for i in range(num_workers):
    rag = LocalRAGPipeline(
        collection_name=f"worker_{i}",
        shared_cache=True
    )
    instances.append(rag)

# Load balancer
def route_query(query):
    worker = hash(query) % num_workers
    return instances[worker].query(query)
Vertical Scaling
# GPU acceleration
if torch.cuda.is_available():
    device = "cuda"
    n_gpu_layers = 35
elif torch.backends.mps.is_available():
    device = "mps"
    n_gpu_layers = 1
else:
    device = "cpu"
    n_gpu_layers = 0
Security Architecture
Data Privacy
All data stays local:
- No external API calls
- No telemetry
- No cloud storage
- Encrypted cache (optional)
Access Control
# Simple authentication wrapper
from functools import wraps

def require_auth(f):
    @wraps(f)
    def decorated(*args, **kwargs):
        # Check local auth token
        if not verify_token():
            raise Unauthorized()
        return f(*args, **kwargs)
    return decorated
Monitoring Architecture
Metrics Collection
metrics = {
    "queries_per_second": 0,
    "avg_response_time": 0,
    "cache_hit_rate": 0,
    "documents_indexed": 0,
    "storage_used_gb": 0
}
Health Checks
def health_check():
    checks = {
        "ollama": check_ollama_service(),
        "lancedb": check_vector_store(),
        "disk_space": check_disk_space(),
        "memory": check_memory_usage()
    }
    return all(checks.values())
Deployment Patterns
1. Standalone Application
- Single user
- Desktop GUI
- Local storage
2. API Service
- Multiple users
- REST/GraphQL API
- Shared storage
3. Embedded System
- IoT devices
- Edge computing
- Minimal resources
4. Distributed System
- Multiple nodes
- Load balancing
- Fault tolerance
Technology Stack
Core Technologies
- Python 3.9+: Main language
- Ollama: LLM inference
- LanceDB: Vector storage
- PyArrow: Data processing
- Sentence Transformers: Embeddings
Optional Technologies
- FastAPI: API framework
- Docker: Containerization
- Redis: Additional caching
- PostgreSQL: Metadata storage
- Nginx: Reverse proxy
Best Practices
- Model Selection: Choose based on available RAM
- Chunk Size: Adjust based on document type
- Caching: Enable for production use
- Indexing: Create after bulk ingestion
- Monitoring: Track performance metrics
- Backup: Regular LanceDB backups
- Updates: Keep Ollama models updated
Future Architecture Considerations
Planned Enhancements
- Multi-modal support (images, audio)
- Streaming responses
- Real-time document updates
- Federated learning
- Model fine-tuning
Potential Integrations
- Langchain compatibility
- LlamaIndex support
- Gradio UI
- Streamlit dashboard
- Jupyter integration
Key Insight: This architecture prioritizes zero cost, complete privacy, and maximum flexibility while maintaining production-grade performance. Every design decision supports these goals.