GitHub - SSaksit23/vectorize_db

Unified Knowledge Base Implementation

Project Overview

This is a comprehensive organizational knowledge base that centralizes all internal company knowledge, including process documentation, best practices, project archives, and employee expertise. The system provides advanced search capabilities using both traditional keyword search and modern semantic search powered by LLM embeddings.

Features

🔍 Advanced Search

  • Keyword Search: Traditional text matching with fuzzy search
  • Semantic Search: AI-powered understanding using sentence embeddings
  • Hybrid Search: Combines both approaches for optimal results
  • Real-time search suggestions and auto-complete
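The hybrid mode can be sketched with reciprocal rank fusion (RRF), one common way to merge a keyword result list with a semantic one. This is an illustrative sketch, not the repository's actual implementation; the function name and constant `k=60` are assumptions:

```python
from collections import defaultdict

def reciprocal_rank_fusion(result_lists, k=60):
    """Merge several ranked result lists into one.

    Each input list holds document IDs ordered best-first.
    A document's fused score is the sum of 1 / (k + rank) over
    every list it appears in, so items ranked highly by both
    keyword and semantic search float to the top.
    """
    scores = defaultdict(float)
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# "doc2" ranks well in both lists, so it comes out on top.
keyword_hits = ["doc1", "doc2", "doc3"]
semantic_hits = ["doc2", "doc4", "doc1"]
fused = reciprocal_rank_fusion([keyword_hits, semantic_hits])
```

A weighted score sum works too, but RRF needs no score normalization between Elasticsearch relevance scores and embedding distances, which is why it is a popular default for hybrid ranking.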

📊 Data Sources

  • Confluence pages and spaces
  • Jira tickets and projects
  • SharePoint documents
  • Network drives and file shares
  • Git repositories (README files and wikis)
  • Internal SQL databases

🚀 Key Capabilities

  • Natural language querying
  • Content versioning and history
  • User and group-based access control
  • Analytics and usage reporting
  • Automated content synchronization
  • Intelligent content recommendations

Technology Stack

Backend

  • API: FastAPI (Python)
  • Database: PostgreSQL with pgvector extension
  • Search Engine: Elasticsearch
  • Cache: Redis
  • ETL: Apache Airflow
  • LLM: OpenAI API + Sentence Transformers

Frontend

  • Framework: React.js
  • Styling: CSS3 with modern design
  • State Management: React Context/Hooks
  • HTTP Client: Axios

Infrastructure

  • Containerization: Docker & Docker Compose
  • Orchestration: Kubernetes
  • Cloud: AWS/GCP ready
  • Infrastructure as Code: Terraform
  • Monitoring: Prometheus + Grafana

Quick Start

Prerequisites

  • Docker and Docker Compose
  • Node.js 18+ (for frontend development)
  • Python 3.9+ (for backend development)

1. Clone and Setup

git clone https://github.com/SSaksit23/vectorize_db.git knowledge-base
cd knowledge-base
cp .env.example .env
# Edit .env with your configuration

2. Start Services

# Start all services
docker-compose up -d

# Check service status
docker-compose ps

3. Initialize Database

# Run database migrations
docker-compose exec backend alembic upgrade head

# Create initial admin user
docker-compose exec backend python scripts/create_admin.py

4. Access Applications

  • Frontend: http://localhost:3000
  • Backend API docs: http://localhost:8000/docs
  • Airflow UI: http://localhost:8080

Development Setup

Backend Development

cd backend
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install -r requirements.txt
uvicorn app:app --reload --host 0.0.0.0 --port 8000

Frontend Development

cd frontend
npm install
npm start

ETL Development

cd etl
# Set up Airflow
export AIRFLOW_HOME=$(pwd)
airflow db init
airflow users create --username admin --password admin --firstname Admin --lastname User --role Admin --email admin@example.com
airflow webserver -p 8080

Project Structure

knowledge-base/
├── README.md
├── requirements.txt
├── docker-compose.yml
├── .env.example
├── config/                  # Configuration files
├── backend/                 # FastAPI backend
│   ├── models/             # Database models
│   ├── api/                # API endpoints
│   ├── services/           # Business logic
│   └── utils/              # Utility functions
├── etl/                    # Data pipeline
│   ├── dags/               # Airflow DAGs
│   ├── extractors/         # Data extractors
│   └── transformers/       # Data transformers
├── frontend/               # React frontend
│   └── src/
│       ├── components/     # React components
│       ├── services/       # API services
│       └── utils/          # Utility functions
├── infrastructure/         # Deployment configs
│   ├── terraform/          # Infrastructure as Code
│   └── kubernetes/         # K8s manifests
└── tests/                  # Test suites
    ├── unit/
    ├── integration/
    └── e2e/

Configuration

Environment Variables

Copy .env.example to .env and configure:

# Database
DATABASE_URL=postgresql://kb_user:kb_password@localhost:5432/knowledge_base

# Elasticsearch
ELASTICSEARCH_URL=http://localhost:9200

# Redis
REDIS_URL=redis://localhost:6379

# LLM API Keys
OPENAI_API_KEY=your_openai_api_key_here

# Data Source Credentials
CONFLUENCE_URL=https://your-company.atlassian.net
CONFLUENCE_USERNAME=your_username
CONFLUENCE_API_TOKEN=your_api_token

JIRA_URL=https://your-company.atlassian.net
JIRA_USERNAME=your_username
JIRA_API_TOKEN=your_api_token

SHAREPOINT_URL=https://your-company.sharepoint.com
SHAREPOINT_CLIENT_ID=your_client_id
SHAREPOINT_CLIENT_SECRET=your_client_secret
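A minimal sketch of how the backend might read these variables, with the `.env.example` local-development values as fallbacks. The `load_config` helper is hypothetical; a real FastAPI project would more likely use a Pydantic settings class:

```python
import os

def load_config(env=os.environ):
    """Read knowledge-base settings from the environment,
    falling back to the local-development defaults shown above."""
    return {
        "database_url": env.get(
            "DATABASE_URL",
            "postgresql://kb_user:kb_password@localhost:5432/knowledge_base",
        ),
        "elasticsearch_url": env.get("ELASTICSEARCH_URL", "http://localhost:9200"),
        "redis_url": env.get("REDIS_URL", "redis://localhost:6379"),
        "openai_api_key": env.get("OPENAI_API_KEY", ""),
    }

# Passing a plain dict makes the loader easy to test without
# touching the real process environment.
config = load_config({"OPENAI_API_KEY": "sk-test"})
```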

Data Pipeline

ETL Process

  1. Extract: Pull data from various sources (Confluence, Jira, etc.)
  2. Transform: Clean, normalize, and chunk content
  3. Load: Store in PostgreSQL and index in Elasticsearch
  4. Embed: Generate vector embeddings for semantic search
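The chunking in the Transform step can be sketched as fixed-size windows with overlap, so a sentence that straddles a boundary stays visible in both neighbouring chunks. The sizes here are illustrative assumptions, not the pipeline's actual parameters:

```python
def chunk_text(text, chunk_size=500, overlap=100):
    """Split text into overlapping chunks for embedding.

    Each chunk is `chunk_size` characters; consecutive chunks
    share `overlap` characters, which helps semantic recall at
    chunk boundaries.
    """
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks

# 1200 characters with 500-char chunks and 100-char overlap
# yields three chunks: [0:500], [400:900], [800:1200].
text = "".join(str(i % 10) for i in range(1200))
chunks = chunk_text(text)
```

Token-based chunking (counting model tokens rather than characters) is the more common choice when the embeddings come from the OpenAI API, since its limits are expressed in tokens.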

Scheduling

  • Full Sync: Weekly (Sundays at 2 AM)
  • Incremental Sync: Every 4 hours
  • Real-time: Webhook-triggered updates

API Documentation

Search Endpoints

  • POST /api/search/ - Run a keyword, semantic, or hybrid search
  • GET /api/search/suggestions - Get search suggestions
  • GET /api/search/analytics - Search analytics

Document Endpoints

  • GET /api/documents/ - List documents
  • GET /api/documents/{id} - Get document details
  • POST /api/documents/ - Create new document
  • PUT /api/documents/{id} - Update document

Authentication

  • POST /api/auth/login - User login
  • POST /api/auth/logout - User logout
  • GET /api/auth/me - Get current user
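A client call to the search endpoint might be assembled as below. The request body schema (`query`, `search_type`, `limit`) is an assumption for illustration; check the live docs at `/docs` for the real contract:

```python
import json

def build_search_request(query, search_type="hybrid", limit=10):
    """Build the URL and JSON body for POST /api/search/.

    The field names here are assumed, not taken from the actual
    API schema; they mirror the three search modes the service
    advertises (keyword, semantic, hybrid).
    """
    if search_type not in {"keyword", "semantic", "hybrid"}:
        raise ValueError(f"unknown search_type: {search_type}")
    return {
        "url": "http://localhost:8000/api/search/",
        "body": json.dumps(
            {"query": query, "search_type": search_type, "limit": limit}
        ),
    }

req = build_search_request("onboarding checklist", search_type="semantic")
```

The resulting `body` string can be sent with any HTTP client (e.g. `requests.post(req["url"], data=req["body"], headers={"Content-Type": "application/json"})`).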

Deployment

Production Deployment

# Build and deploy
docker-compose -f docker-compose.prod.yml up -d

# Or use Kubernetes
kubectl apply -f infrastructure/kubernetes/

Monitoring

  • Health checks: /health
  • Metrics: /metrics (Prometheus format)
  • Logs: Centralized logging with structured JSON
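The `/health` endpoint pattern can be sketched as an aggregation of per-dependency probes (Postgres, Elasticsearch, Redis). The function and payload shape are hypothetical, not the service's actual response format:

```python
def health_check(probes):
    """Aggregate dependency probes into a /health payload.

    `probes` maps a dependency name to a zero-argument callable
    that returns True when the dependency is reachable; any
    exception from a probe also counts as "down".
    """
    results = {}
    for name, probe in probes.items():
        try:
            results[name] = "up" if probe() else "down"
        except Exception:
            results[name] = "down"
    status = "ok" if all(v == "up" for v in results.values()) else "degraded"
    return {"status": status, "dependencies": results}

# With one failing probe the overall status degrades.
report = health_check({"postgres": lambda: True, "redis": lambda: False})
```

Returning "degraded" rather than failing outright lets orchestrators like Kubernetes distinguish a partially impaired pod from a dead one.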

Testing

Run Tests

# Backend tests
cd backend
pytest tests/

# Frontend tests
cd frontend
npm test

# Integration tests
docker-compose -f docker-compose.test.yml up --abort-on-container-exit

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests
  5. Submit a pull request

Project Phases

Phase 1: Foundation (Weeks 1-3) ✅

  • Infrastructure setup
  • Database schema design
  • Basic LLM integration

Phase 2: Core Services (Weeks 4-7) 🔄

  • Embedding service
  • Natural language query parser
  • Vector search service

Phase 3: Advanced Features (Weeks 8-11) ⏳

  • Intelligent query router
  • Content generation engine
  • Real-time embedding updates

Phase 4: Production (Weeks 12-15) ⏳

  • Performance optimization
  • Security hardening
  • Production deployment

Support

For questions and support:

  • Technical Issues: Create an issue in this repository
  • Documentation: Check the /docs folder
  • API Questions: Browse the interactive API docs at http://localhost:8000/docs while the backend is running

License

This project is proprietary software for internal company use only.