GitHub - SSaksit23/vectorize_db

Unified Knowledge Base Implementation

Project Overview

This is a comprehensive organizational knowledge base that centralizes all internal company knowledge, including process documentation, best practices, project archives, and employee expertise. The system provides advanced search capabilities using both traditional keyword search and modern semantic search powered by LLM embeddings.

Features

🔍 Advanced Search

  • Keyword Search: Traditional text matching with fuzzy search
  • Semantic Search: AI-powered understanding using sentence embeddings
  • Hybrid Search: Combines both approaches for optimal results
  • Real-time search suggestions and auto-complete
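The hybrid mode can be sketched with reciprocal rank fusion (RRF), one common way to merge a keyword result list with a semantic one. This is an illustrative sketch, not the repository's actual implementation; the function name and constant `k=60` are assumptions:

```python
from collections import defaultdict

def reciprocal_rank_fusion(result_lists, k=60):
    """Merge several ranked result lists into one.

    Each input list holds document IDs ordered best-first.
    A document's fused score is the sum of 1 / (k + rank) over
    every list it appears in, so items ranked highly by both
    keyword and semantic search float to the top.
    """
    scores = defaultdict(float)
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# "doc2" ranks well in both lists, so it comes out on top.
keyword_hits = ["doc1", "doc2", "doc3"]
semantic_hits = ["doc2", "doc4", "doc1"]
fused = reciprocal_rank_fusion([keyword_hits, semantic_hits])
```

A weighted score sum works too, but RRF needs no score normalization between Elasticsearch relevance scores and embedding distances, which is why it is a popular default for hybrid ranking.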

📊 Data Sources

  • Confluence pages and spaces
  • Jira tickets and projects
  • SharePoint documents
  • Network drives and file shares
  • Git repositories (README files and wikis)
  • Internal SQL databases

🚀 Key Capabilities

  • Natural language querying
  • Content versioning and history
  • User and group-based access control
  • Analytics and usage reporting
  • Automated content synchronization
  • Intelligent content recommendations

Technology Stack

Backend

  • API: FastAPI (Python)
  • Database: PostgreSQL with pgvector extension
  • Search Engine: Elasticsearch
  • Cache: Redis
  • ETL: Apache Airflow
  • LLM: OpenAI API + Sentence Transformers

Frontend

  • Framework: React.js
  • Styling: CSS3 with modern design
  • State Management: React Context/Hooks
  • HTTP Client: Axios

Infrastructure

  • Containerization: Docker & Docker Compose
  • Orchestration: Kubernetes
  • Cloud: AWS/GCP ready
  • Infrastructure as Code: Terraform
  • Monitoring: Prometheus + Grafana

Quick Start

Prerequisites

  • Docker and Docker Compose
  • Node.js 18+ (for frontend development)
  • Python 3.9+ (for backend development)

1. Clone and Setup

git clone https://github.com/SSaksit23/vectorize_db.git knowledge-base
cd knowledge-base
cp .env.example .env
# Edit .env with your configuration

2. Start Services

# Start all services
docker-compose up -d

# Check service status
docker-compose ps

3. Initialize Database

# Run database migrations
docker-compose exec backend alembic upgrade head

# Create initial admin user
docker-compose exec backend python scripts/create_admin.py

4. Access Applications

  • Frontend: http://localhost:3000
  • Backend API docs: http://localhost:8000/docs
  • Airflow UI: http://localhost:8080

Development Setup

Backend Development

cd backend
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install -r requirements.txt
uvicorn app:app --reload --host 0.0.0.0 --port 8000

Frontend Development

cd frontend
npm install
npm start

ETL Development

cd etl
# Set up Airflow
export AIRFLOW_HOME=$(pwd)
airflow db init
airflow users create --username admin --password admin --firstname Admin --lastname User --role Admin --email admin@example.com
airflow webserver -p 8080

Project Structure

knowledge-base/
├── README.md
├── requirements.txt
├── docker-compose.yml
├── .env.example
├── config/                  # Configuration files
├── backend/                 # FastAPI backend
│   ├── models/             # Database models
│   ├── api/                # API endpoints
│   ├── services/           # Business logic
│   └── utils/              # Utility functions
├── etl/                    # Data pipeline
│   ├── dags/               # Airflow DAGs
│   ├── extractors/         # Data extractors
│   └── transformers/       # Data transformers
├── frontend/               # React frontend
│   └── src/
│       ├── components/     # React components
│       ├── services/       # API services
│       └── utils/          # Utility functions
├── infrastructure/         # Deployment configs
│   ├── terraform/          # Infrastructure as Code
│   └── kubernetes/         # K8s manifests
└── tests/                  # Test suites
    ├── unit/
    ├── integration/
    └── e2e/

Configuration

Environment Variables

Copy .env.example to .env and configure:

# Database
DATABASE_URL=postgresql://kb_user:kb_password@localhost:5432/knowledge_base

# Elasticsearch
ELASTICSEARCH_URL=http://localhost:9200

# Redis
REDIS_URL=redis://localhost:6379

# LLM API Keys
OPENAI_API_KEY=your_openai_api_key_here

# Data Source Credentials
CONFLUENCE_URL=https://your-company.atlassian.net
CONFLUENCE_USERNAME=your_username
CONFLUENCE_API_TOKEN=your_api_token

JIRA_URL=https://your-company.atlassian.net
JIRA_USERNAME=your_username
JIRA_API_TOKEN=your_api_token

SHAREPOINT_URL=https://your-company.sharepoint.com
SHAREPOINT_CLIENT_ID=your_client_id
SHAREPOINT_CLIENT_SECRET=your_client_secret
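A minimal sketch of how the backend might read these variables, with the `.env.example` local-development values as fallbacks. The `load_config` helper is hypothetical; a real FastAPI project would more likely use a Pydantic settings class:

```python
import os

def load_config(env=os.environ):
    """Read knowledge-base settings from the environment,
    falling back to the local-development defaults shown above."""
    return {
        "database_url": env.get(
            "DATABASE_URL",
            "postgresql://kb_user:kb_password@localhost:5432/knowledge_base",
        ),
        "elasticsearch_url": env.get("ELASTICSEARCH_URL", "http://localhost:9200"),
        "redis_url": env.get("REDIS_URL", "redis://localhost:6379"),
        "openai_api_key": env.get("OPENAI_API_KEY", ""),
    }

# Passing a plain dict makes the loader easy to test without
# touching the real process environment.
config = load_config({"OPENAI_API_KEY": "sk-test"})
```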

Data Pipeline

ETL Process

  1. Extract: Pull data from various sources (Confluence, Jira, etc.)
  2. Transform: Clean, normalize, and chunk content
  3. Load: Store in PostgreSQL and index in Elasticsearch
  4. Embed: Generate vector embeddings for semantic search
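The chunking in the Transform step can be sketched as fixed-size windows with overlap, so a sentence that straddles a boundary stays visible in both neighbouring chunks. The sizes here are illustrative assumptions, not the pipeline's actual parameters:

```python
def chunk_text(text, chunk_size=500, overlap=100):
    """Split text into overlapping chunks for embedding.

    Each chunk is `chunk_size` characters; consecutive chunks
    share `overlap` characters, which helps semantic recall at
    chunk boundaries.
    """
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks

# 1200 characters with 500-char chunks and 100-char overlap
# yields three chunks: [0:500], [400:900], [800:1200].
text = "".join(str(i % 10) for i in range(1200))
chunks = chunk_text(text)
```

Token-based chunking (counting model tokens rather than characters) is the more common choice when the embeddings come from the OpenAI API, since its limits are expressed in tokens.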

Scheduling

  • Full Sync: Weekly (Sundays at 2 AM)
  • Incremental Sync: Every 4 hours
  • Real-time: Webhook-triggered updates

API Documentation

Search Endpoints

  • POST /api/search/ - Run a keyword, semantic, or hybrid search
  • GET /api/search/suggestions - Get search suggestions
  • GET /api/search/analytics - Search analytics

Document Endpoints

  • GET /api/documents/ - List documents
  • GET /api/documents/{id} - Get document details
  • POST /api/documents/ - Create new document
  • PUT /api/documents/{id} - Update document

Authentication

  • POST /api/auth/login - User login
  • POST /api/auth/logout - User logout
  • GET /api/auth/me - Get current user
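A client call to the search endpoint might be assembled as below. The request body schema (`query`, `search_type`, `limit`) is an assumption for illustration; check the live docs at `/docs` for the real contract:

```python
import json

def build_search_request(query, search_type="hybrid", limit=10):
    """Build the URL and JSON body for POST /api/search/.

    The field names here are assumed, not taken from the actual
    API schema; they mirror the three search modes the service
    advertises (keyword, semantic, hybrid).
    """
    if search_type not in {"keyword", "semantic", "hybrid"}:
        raise ValueError(f"unknown search_type: {search_type}")
    return {
        "url": "http://localhost:8000/api/search/",
        "body": json.dumps(
            {"query": query, "search_type": search_type, "limit": limit}
        ),
    }

req = build_search_request("onboarding checklist", search_type="semantic")
```

The resulting `body` string can be sent with any HTTP client (e.g. `requests.post(req["url"], data=req["body"], headers={"Content-Type": "application/json"})`).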

Deployment

Production Deployment

# Build and deploy
docker-compose -f docker-compose.prod.yml up -d

# Or use Kubernetes
kubectl apply -f infrastructure/kubernetes/

Monitoring

  • Health checks: /health
  • Metrics: /metrics (Prometheus format)
  • Logs: Centralized logging with structured JSON
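The `/health` endpoint pattern can be sketched as an aggregation of per-dependency probes (Postgres, Elasticsearch, Redis). The function and payload shape are hypothetical, not the service's actual response format:

```python
def health_check(probes):
    """Aggregate dependency probes into a /health payload.

    `probes` maps a dependency name to a zero-argument callable
    that returns True when the dependency is reachable; any
    exception from a probe also counts as "down".
    """
    results = {}
    for name, probe in probes.items():
        try:
            results[name] = "up" if probe() else "down"
        except Exception:
            results[name] = "down"
    status = "ok" if all(v == "up" for v in results.values()) else "degraded"
    return {"status": status, "dependencies": results}

# With one failing probe the overall status degrades.
report = health_check({"postgres": lambda: True, "redis": lambda: False})
```

Returning "degraded" rather than failing outright lets orchestrators like Kubernetes distinguish a partially impaired pod from a dead one.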

Testing

Run Tests

# Backend tests
cd backend
pytest tests/

# Frontend tests
cd frontend
npm test

# Integration tests
docker-compose -f docker-compose.test.yml up --abort-on-container-exit

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests
  5. Submit a pull request

Project Phases

Phase 1: Foundation (Weeks 1-3) ✅

  • Infrastructure setup
  • Database schema design
  • Basic LLM integration

Phase 2: Core Services (Weeks 4-7) 🔄

  • Embedding service
  • Natural language query parser
  • Vector search service

Phase 3: Advanced Features (Weeks 8-11) ⏳

  • Intelligent query router
  • Content generation engine
  • Real-time embedding updates

Phase 4: Production (Weeks 12-15) ⏳

  • Performance optimization
  • Security hardening
  • Production deployment

Support

For questions and support:

  • Technical Issues: Create an issue in this repository
  • Documentation: Check the /docs folder
  • API Questions: Browse the interactive API docs at http://localhost:8000/docs while the backend is running

License

This project is proprietary software for internal company use only.