SuData - Comprehensive Data Pipeline System
Overview
SuData is a comprehensive data pipeline system that automates the collection, processing, and monitoring of streaming data from multiple sources. Built with modern DevOps practices, it features automated CI/CD pipelines, multi-environment deployments, and robust monitoring capabilities.
Architecture
    +-------------------+    +-------------------+    +-------------------+
    | Telegram Scraper  |    |  TikTok Scraper   |    |  YouTube Scraper  |
    |                   |    |                   |    |                   |
    +---------+---------+    +---------+---------+    +---------+---------+
              |                        |                        |
              | (Raw Data)             | (Raw Data)             | (Raw Data)
              v                        v                        v
    +---------------------------------------------------------------------+
    |                      Refinery Service (Python)                      |
    |                     (Processes & Refines Data)                      |
    +---------------------------------------------------------------------+
                                       |
                                       | (Refined Data)
                                       v
    +---------------------------------------------------------------------+
    |                          Output / Storage                           |
    +---------------------------------------------------------------------+
              |                                                 ^
              |                                                 | (Health Status)
              v                                                 |
    +---------------------------------------------------------------------+
    |            SuData Symphony (Orchestration & Monitoring)             |
    |               (Starts, Stops, Manages Health Checks)                |
    +---------------------------------------------------------------------+
                                       ^
                                       | (Health Status)
                                       |
                             +-------------------+
                             |     Dashboard     |
                             |                   |
                             +-------------------+
Features
Core Functionality
- Multi-Source Data Collection: Automated scraping from Telegram, TikTok, and YouTube
- Real-time Processing: Intelligent data refinement and transformation
- Web Dashboard: Interactive monitoring and visualization interface
- Health Monitoring: Comprehensive system health checks and alerts
DevOps & CI/CD
- Automated Testing: Unit, integration, and E2E test suites
- Multi-Environment Support: Development, staging, and production environments
- Docker Containerization: Consistent deployment across environments
- Automated Deployments: GitHub Actions-powered CI/CD pipeline
- Rollback Capabilities: Automated rollback on deployment failures
- Quality Gates: Code quality, security scanning, and coverage requirements
Technology Stack
Frontend
- Next.js 14: React-based web framework
- TypeScript: Type-safe JavaScript development
- Tailwind CSS: Utility-first CSS framework
- Chart.js: Data visualization library
Backend Services
- Python 3.11+: Core processing language
- FastAPI: High-performance API framework
- AsyncIO: Asynchronous programming support
- Pandas: Data manipulation and analysis
Virtual Environments & Large Files
Each Python service (Telegram, TikTok, YouTube Scrapers, and Refinery Service) utilizes its own dedicated virtual environment (e.g., tele-venv, tt-venv, yt-venv, .venv respectively). These virtual environments, along with other large binary files (like dnnl.lib and torch_cpu.dll), are intentionally excluded from Git via the .gitignore file to prevent issues with GitHub's file size limits. When setting up the project, pnpm install will handle the creation and population of these virtual environments.
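The interpreter inside each virtual environment lives at a platform-dependent path (Scripts/python.exe on Windows, bin/python on Linux/Mac, as used in the Troubleshooting section). A minimal sketch of resolving that path, assuming the standard venv layout; `venv_python` is a hypothetical helper name, not something taken from the repository:

```python
import os
from pathlib import Path


def venv_python(venv_dir: str, os_name: str = os.name) -> Path:
    """Return the interpreter path inside a virtual environment.

    Hypothetical helper: assumes the standard venv layout, which places the
    interpreter in Scripts/ on Windows (os.name == "nt") and bin/ elsewhere.
    """
    root = Path(venv_dir)
    if os_name == "nt":
        return root / "Scripts" / "python.exe"
    return root / "bin" / "python"


if __name__ == "__main__":
    # "tele-venv" is the Telegram Scraper environment named above.
    print(venv_python("tele-venv"))
```

Passing `os_name` explicitly makes the function easy to exercise for both layouts on a single machine.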
DevOps & Infrastructure
- Docker: Containerization platform
- GitHub Actions: CI/CD automation
- GitHub Container Registry: Docker image storage
- pnpm: Fast, disk space efficient package manager
Testing & Quality
- Vitest: Unit testing for frontend
- Jest: JavaScript testing framework
- Pytest: Python testing framework
- Playwright: End-to-end testing
- ESLint: JavaScript/TypeScript linting
- Flake8: Python code linting
Quick Start
Prerequisites
- Node.js 18+
- Python 3.11+
- Docker & Docker Compose
- pnpm
Installation
- Clone the repository

      git clone https://github.com/O96a/SuData.git
      cd SuData

- Install dependencies

      pnpm install

- Start services using SuData Symphony (Recommended)

  Navigate to the project root and run:

      python scripts/sudata-symphony.py

  Note: SuData Symphony now includes automated port cleanup at startup to prevent conflicts. The --no-confirm flag is enabled by default for non-interactive startup.
Service Management & Monitoring
SuData Symphony - Service Orchestration
The SuData Symphony is our production-ready orchestration system that manages all services with proper startup order, health monitoring, and graceful shutdown.
Features:
- Smart Startup: Automatic service discovery and ordered startup
- Health Monitoring: Real-time health checks for all services
- User-Friendly: Interactive confirmation and colored logging
- Graceful Shutdown: Proper cleanup and resource management
- Easy Configuration: YAML-based service configuration
- Auto-Recovery: Automatic restart of failed services
- Process Monitoring: Continuous health checks and status tracking
Quick Commands:
    # Start all services
    python scripts/sudata-symphony.py

    # Start with specific log level
    python scripts/sudata-symphony.py --log-level INFO

    # To gracefully stop all services, press Ctrl+C in the terminal running SuData Symphony.
    # Service status can be monitored via the Dashboard.
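Ordered startup can be pictured as a topological sort over service dependencies. The sketch below is illustrative only: the dependency map is an assumption loosely based on the architecture diagram, and the service keys are hypothetical names rather than the ones Symphony uses.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each service lists what must start before it.
DEPENDS_ON = {
    "refinery": [],
    "telegram-scraper": ["refinery"],
    "tiktok-scraper": ["refinery"],
    "youtube-scraper": ["refinery"],
    "dashboard": ["telegram-scraper", "tiktok-scraper", "youtube-scraper"],
}


def startup_order(depends_on):
    """Return service names so that every dependency precedes its dependents."""
    return list(TopologicalSorter(depends_on).static_order())


def shutdown_order(depends_on):
    """Graceful shutdown walks the startup order in reverse."""
    return list(reversed(startup_order(depends_on)))


if __name__ == "__main__":
    print(startup_order(DEPENDS_ON))
```

`TopologicalSorter` raises `graphlib.CycleError` on circular dependencies, which is a convenient way to reject an invalid configuration before anything is started.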
Service Manager
The Service Manager handles the lifecycle of all services with the following capabilities:
- Service Lifecycle Management: Start, stop, and restart services
- Health Monitoring: Continuous health checks via /health endpoints
- Automatic Recovery: Configurable auto-restart for failed services
- Logging: Centralized logging for all services
- Cross-Platform: Works on both Windows and Unix-based systems
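One common way to make auto-restart configurable is a bounded retry budget with exponential backoff. The following is a sketch under that assumption; `RestartPolicy` and its fields are hypothetical and do not describe how the Service Manager is actually implemented.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RestartPolicy:
    """Hypothetical auto-restart policy; field names are illustrative."""

    max_restarts: int = 3
    base_delay_s: float = 1.0
    max_delay_s: float = 30.0

    def delay_for(self, attempt: int) -> Optional[float]:
        """Seconds to wait before restart number `attempt` (1-based).

        Returns None once the restart budget is exhausted, signalling that
        the service should be left stopped and reported as failed.
        """
        if attempt < 1 or attempt > self.max_restarts:
            return None
        # Double the delay on each attempt, capped at max_delay_s.
        return min(self.base_delay_s * 2 ** (attempt - 1), self.max_delay_s)


if __name__ == "__main__":
    policy = RestartPolicy()
    print([policy.delay_for(n) for n in range(1, 5)])
```

Capping both the number of restarts and the delay keeps a crash-looping service from consuming resources indefinitely.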
Service Configuration
Services are configured in scripts/config/services.yaml with options for:
- Virtual environment management
- Custom startup commands
- Environment variables
- Health check endpoints
- Auto-recovery settings
For detailed configuration options, see Service Manager Documentation.
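To illustrate how the options listed above might map onto a typed structure once the YAML is loaded, here is a small sketch. The key names (`command`, `venv`, `env`, `health_url`, `auto_restart`) are assumptions made for this example; the actual schema of scripts/config/services.yaml is described in the Service Manager Documentation.

```python
import shlex
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ServiceConfig:
    """Hypothetical in-memory form of one service entry."""

    name: str
    command: List[str]
    venv: Optional[str] = None
    env: Dict[str, str] = field(default_factory=dict)
    health_url: Optional[str] = None
    auto_restart: bool = True


def parse_service(name: str, raw: dict) -> ServiceConfig:
    """Validate one already-loaded mapping and convert it to ServiceConfig."""
    command = raw.get("command")
    if not command:
        raise ValueError(f"service {name!r} needs a non-empty 'command'")
    # Accept either a shell-style string or an explicit argument list.
    args = shlex.split(command) if isinstance(command, str) else list(command)
    return ServiceConfig(
        name=name,
        command=args,
        venv=raw.get("venv"),
        env={str(k): str(v) for k, v in raw.get("env", {}).items()},
        health_url=raw.get("health_url"),
        auto_restart=bool(raw.get("auto_restart", True)),
    )


if __name__ == "__main__":
    entry = {"command": "python main.py", "venv": "tele-venv",
             "health_url": "http://localhost:3005/health"}
    print(parse_service("telegram-scraper", entry))
```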
Monitoring Dashboard
Access the monitoring dashboard at http://localhost:3007 to view:
- Service status and health
- Resource usage (CPU, memory)
- Logs and error messages
- Performance metrics
Service Ports
| Service | Port | Description |
|---|---|---|
| TikTok Scraper | 3000 | TikTok data collection |
| YouTube Scraper | 3002 | YouTube data collection |
| Dashboard | 3007 | Monitoring interface |
| Refinery Service | 3004 | Data processing |
| Telegram Scraper | 3005 | Telegram data collection |
Health Checks
Each service exposes a health check endpoint at /health that returns:
- Service status (healthy/unhealthy)
- Uptime
- Version information
- Dependencies status
Example health check response:
    {
      "status": "healthy",
      "timestamp": "2023-01-01T00:00:00Z",
      "version": "1.0.0",
      "details": {
        "database": "connected",
        "disk_space": "sufficient"
      }
    }

Logs
Service logs are available in the following locations:
- Console Output: Direct output from each service
- Log Files: Stored in the logs/ directory at the project root (e.g., /SuData/logs/telegram_scraper.log)
- Dashboard: Real-time log viewing in the monitoring interface
Alerts
Configure alerts for:
- Service failures
- Resource constraints
- Performance degradation
- Failed health checks
Alerts can be sent to:
- Slack/Teams
- Discord
- Custom webhooks
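As an illustration of how one alert could be shaped for different targets, the sketch below builds a JSON body per destination. Slack incoming webhooks accept a `text` field and Discord webhooks accept a `content` field; the custom-webhook schema and the `build_alert` helper are hypothetical, and Teams is left out because its payload format differs.

```python
import json


def build_alert(target: str, service: str, reason: str) -> bytes:
    """Return a JSON request body describing one alert.

    `target` is "slack", "discord", or anything else for a custom webhook.
    The custom schema below is an assumption made for this example.
    """
    message = f"[SuData] {service}: {reason}"
    if target == "slack":
        payload = {"text": message}
    elif target == "discord":
        payload = {"content": message}
    else:
        payload = {"service": service, "reason": reason}
    return json.dumps(payload).encode("utf-8")


if __name__ == "__main__":
    print(build_alert("slack", "Refinery Service", "failed health check"))
```

The returned bytes would be sent with an HTTP POST and a `Content-Type: application/json` header to the webhook URL configured for that target.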
- Access the dashboard at http://localhost:3007 (the DASHBOARD_PORT value used in the development, staging, and production configurations)
Development Setup
- Install Python and JavaScript dependencies

      # From the project root, pnpm will install all dependencies for all services
      pnpm install

- Run tests
- Start development servers
Testing
Test Suites
- Unit Tests: pnpm run test:unit
- Integration Tests: pnpm run test:integration
- End-to-End Tests: pnpm run test:e2e
- Performance Tests: pnpm run test:performance
- Coverage Reports: pnpm run test:coverage
Test Coverage
Deployment
Deployment Process
Production deployment is currently a manual process, while staging deploys automatically. GitHub Actions pipelines handle continuous integration and testing.
- Staging Deployment (Automatic)
  - Triggered on merge to main branch
  - Automated build, test, and deploy
  - Health checks and notifications
- Production Deployment (Manual)
  - Manual workflow dispatch
  - Requires confirmation input
  - Comprehensive health checks
  - Automated rollback on failure
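The "health checks, then rollback on failure" flow can be sketched as a small control loop. This is not the project's GitHub Actions workflow; `deploy_with_rollback` and its callables are hypothetical stand-ins for whatever steps the real pipeline runs.

```python
import time
from typing import Callable


def deploy_with_rollback(
    deploy: Callable[[], None],
    health_check: Callable[[], bool],
    rollback: Callable[[], None],
    attempts: int = 3,
    delay_s: float = 1.0,
) -> bool:
    """Deploy, poll the health check, and roll back if it never passes.

    Returns True when the deployment is healthy, False after a rollback.
    """
    deploy()
    for attempt in range(1, attempts + 1):
        if health_check():
            return True
        # Give the service time to warm up before the next probe.
        if attempt < attempts:
            time.sleep(delay_s)
    rollback()
    return False


if __name__ == "__main__":
    ok = deploy_with_rollback(
        deploy=lambda: print("deploying"),
        health_check=lambda: True,
        rollback=lambda: print("rolling back"),
    )
    print("healthy" if ok else "rolled back")
```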
Monitoring & Health Checks
Health API Endpoints
Each service provides comprehensive health monitoring endpoints:
| Service | Port | Endpoints | Status |
|---|---|---|---|
| Telegram Scraper | 3005 | /health, /health/detailed, / | Implemented |
| TikTok Scraper | 3000 | /health, /health/detailed, / | Implemented |
| YouTube Scraper | 3002 | /health, /health/detailed, / | Implemented |
| Refinery Service | 3004 | /health | Available |
| Dashboard | 3007 | Dashboard UI + service monitoring | Running |
Health API Features
Each health endpoint provides:
- Basic Health: Service operational status (healthy/error/stopped)
- Detailed Statistics:
- Number of monitored channels/streamers
- Recent output files (last 24 hours)
- Error count and activity metrics
- File details (size, modification time, age)
- Service Information: Version, uptime, and configuration details
Dashboard Integration
The web dashboard automatically monitors all service health endpoints:
- 🟢 Green: Service running and healthy
- 🟡 Amber: Service status unknown (health endpoint unavailable)
- 🔴 Red: Service error or health check failed
- ⚪ Gray: Service stopped
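The color legend above amounts to a small mapping from reported status to indicator. A sketch of that mapping, assuming the status strings healthy/error/stopped from the Health API Features list and using `None` to stand for an unreachable health endpoint; `Indicator` and `indicator_for` are hypothetical names, not the dashboard's actual code:

```python
from enum import Enum
from typing import Optional


class Indicator(Enum):
    GREEN = "green"
    AMBER = "amber"
    RED = "red"
    GRAY = "gray"


def indicator_for(status: Optional[str]) -> Indicator:
    """Map a reported status to a dashboard color.

    None means the health endpoint could not be reached, so the state is
    unknown. Any unrecognized status is treated as an error.
    """
    if status is None:
        return Indicator.AMBER
    if status == "healthy":
        return Indicator.GREEN
    if status == "stopped":
        return Indicator.GRAY
    return Indicator.RED


if __name__ == "__main__":
    for value in ("healthy", None, "error", "stopped"):
        print(value, "->", indicator_for(value).value)
```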
Health Check Examples
    # Check Telegram scraper health
    curl http://localhost:3005/health

    # Check TikTok scraper detailed stats
    curl http://localhost:3000/health/detailed

    # Check YouTube scraper status
    curl http://localhost:3002/health

    # View all services in dashboard
    open http://localhost:3007
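The same checks can be scripted with the Python standard library. This is a minimal sketch that assumes the response is JSON with a `status` field, as in the example response in the Health Checks section; `parse_health` and `check_service` are hypothetical helper names.

```python
import json
import urllib.request


def parse_health(body: str) -> bool:
    """Return True when a JSON body reports status == "healthy"."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return False
    return isinstance(payload, dict) and payload.get("status") == "healthy"


def check_service(port: int, timeout: float = 3.0) -> bool:
    """Query http://localhost:<port>/health and report whether it is healthy."""
    url = f"http://localhost:{port}/health"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return parse_health(resp.read().decode("utf-8", errors="replace"))
    except OSError:
        # Covers connection refused, timeouts, and HTTP error responses.
        return False


if __name__ == "__main__":
    # Ports taken from the Health API Endpoints table.
    for name, port in [("Telegram", 3005), ("TikTok", 3000), ("YouTube", 3002)]:
        print(name, "healthy" if check_service(port) else "unhealthy or unreachable")
```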
Monitoring Features
- Service Health: Real-time health status monitoring with automatic refresh
- Performance Metrics: File processing statistics and error tracking
- Resource Utilization: Service uptime and activity monitoring
- Error Tracking: Recent error count and log analysis
- Deployment Status: Service availability and health endpoint monitoring
Configuration
Environment Variables
Development

    NODE_ENV=development
    LOG_LEVEL=DEBUG
    DASHBOARD_PORT=3007

Staging

    NODE_ENV=staging
    LOG_LEVEL=DEBUG
    DASHBOARD_PORT=3007
    REFINERY_PORT=3004

Production

    NODE_ENV=production
    LOG_LEVEL=INFO
    DASHBOARD_PORT=3007
    REFINERY_PORT=3004
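A Python service could read these variables into a typed settings object along the following lines. This is a sketch: the `Settings` class and the fallback defaults (the development values above, plus port 3004 for the Refinery Service from the ports table) are assumptions, not the project's actual loader.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    node_env: str
    log_level: str
    dashboard_port: int
    refinery_port: int


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Build Settings from environment variables, with assumed defaults."""
    source = os.environ if env is None else env
    return Settings(
        node_env=source.get("NODE_ENV", "development"),
        log_level=source.get("LOG_LEVEL", "DEBUG"),
        # int() raises ValueError early if a port is not numeric.
        dashboard_port=int(source.get("DASHBOARD_PORT", "3007")),
        refinery_port=int(source.get("REFINERY_PORT", "3004")),
    )


if __name__ == "__main__":
    print(load_settings())
```

Accepting an explicit mapping makes the loader testable without touching the real process environment.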
Documentation
Architecture & Design
Operations & Deployment
Development
Security
Security Features
- Dependency Scanning: Automated vulnerability detection
- Container Security: Image scanning and security policies
- Secrets Management: Secure handling of sensitive configuration
- Access Control: Role-based access and authentication
Security Practices
- No hardcoded secrets in code
- Regular dependency updates
- Container image scanning
- Secure environment variable management
Contributing
Development Workflow
- Fork the repository
- Create a feature branch
- Make changes with tests
- Submit a pull request
- CI pipeline validates changes
- Code review and merge
Pull Request Requirements
- All tests pass
- Code coverage ≥ 70%
- Linting and formatting checks pass
- Security scans pass
- Documentation updated
Performance
Benchmarks
- Dashboard Load Time: < 2 seconds
- API Response Time: < 1 second
- Data Processing: 500+ records/minute
- System Uptime: 99.9% target
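To put two of these targets in concrete terms, the arithmetic below converts the uptime target into an allowed-downtime budget and the throughput figure into a per-record time budget. The 30-day window is an assumption chosen for illustration.

```python
def downtime_budget_minutes(uptime_target: float, days: int = 30) -> float:
    """Minutes of downtime allowed over `days` at the given uptime fraction."""
    return (1.0 - uptime_target) * days * 24 * 60


def per_record_budget_ms(records_per_minute: float) -> float:
    """Average milliseconds available per record at the given throughput."""
    return 60_000 / records_per_minute


if __name__ == "__main__":
    # 99.9% uptime over 30 days allows roughly 43.2 minutes of downtime.
    print(round(downtime_budget_minutes(0.999), 1))
    # 500 records/minute leaves 120 ms per record on average.
    print(per_record_budget_ms(500))
```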
Optimization Features
- Docker multi-stage builds
- Efficient dependency caching
- Parallel test execution
- Resource optimization
Troubleshooting
Common Issues
- Services Not Starting

  If services are not starting, check the individual service logs in the logs/ directory at the project root for specific error messages.

- Health Check Failures

  If services are failing health checks, ensure they are running and accessible on their configured ports. You can manually check a service's health endpoint using curl:

      curl http://localhost:[PORT]/health

  Replace [PORT] with the service's actual port (e.g., 3005 for Telegram Scraper).

- Telegram Scraper Authentication Prompt

  If the Telegram Scraper prompts for a phone number or bot token, the authentication session file is missing or invalid. Perform a one-time interactive authentication:

  - Ensure sudata-symphony.py is not running.
  - Navigate to apps/telegram-scraper in your terminal.
  - Run tele-venv/Scripts/python.exe main.py (Windows) or ./tele-venv/bin/python main.py (Linux/Mac).
  - Follow the prompts to enter your phone number, verification code, and 2FA password (if applicable).
  - Once authenticated, a session file will be created. You can then restart sudata-symphony.py.

- Dashboard Showing Red/Errors Despite Services Running

  If sudata-symphony.py reports services as healthy but the dashboard shows errors, the cause may be a caching issue or a mismatch in the dashboard's internal health check URLs. Try:

  - Hard refreshing your browser (Ctrl+F5 or Cmd+Shift+R) for http://localhost:3007/.
  - Clearing your browser's cache for http://localhost:3007/.

- Port Conflicts

  sudata-symphony.py now includes automated port cleanup at startup. If you still encounter Address already in use errors, manually identify and terminate the conflicting process (the commands below are for Windows):

      # Find process using a specific port (e.g., 3005)
      netstat -ano | findstr :3005

      # Terminate the process using its PID (replace <PID>)
      taskkill /PID <PID> /F
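For a platform-independent way to see whether something is already listening on a port, a short standard-library probe is enough. This sketch only detects a listener; it does not identify or terminate the owning process, and `port_in_use` is a hypothetical helper rather than part of Symphony's port cleanup.

```python
import socket


def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(0.5)
        # connect_ex returns 0 on success and an error code otherwise.
        return sock.connect_ex((host, port)) == 0


if __name__ == "__main__":
    # Ports taken from the Service Ports table.
    for port in (3000, 3002, 3004, 3005, 3007):
        print(port, "in use" if port_in_use(port) else "free")
```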
Support Resources
Project Status
Current Version: v1.0.0
Recent Updates
- CI/CD Pipeline Implementation
- Multi-Environment Deployment
- Automated Health Checks
- Rollback Capabilities
- Comprehensive Documentation
Upcoming Features
- Advanced Monitoring Dashboard
- Enhanced Security Features
- Performance Optimization
- Additional Data Sources
License
This project is licensed under the MIT License - see the LICENSE file for details.
SuData 2025