sudaverse/sudaverse: The Sudaverse ecosystem - Building Sudanese Arabic into the Heart of AI

PRESERVING SUDANESE HERITAGE IN THE DIGITAL AGE

An open-source AI ecosystem dedicated to developing, training, evaluating, and scaling models that understand and generate Sudanese dialect — built for culture, community, and the future.

🎯 Vision & Mission

🌟 Vision

A digital future where Sudanese dialect is fully represented across AI systems — understood, generated, and respected by modern language technologies.

🎯 Mission

To build the world's most comprehensive open-source infrastructure for Sudanese dialect NLP, ensuring linguistic preservation and technological advancement.

🔑 Core Objectives

Objective	Description
📦 Corpus Building	Develop the largest open-source Sudanese dialect dataset
🧹 Data Quality	Normalize and clean dialect data using advanced AI tools
🤖 Synthetic Generation	Create high-quality synthetic Sudanese text at scale
🧠 Model Training	Train and fine-tune state-of-the-art LLMs for Sudanese dialect
📊 Benchmarking	Establish comprehensive evaluation standards for dialect models
👥 Community	Foster an active community of developers, linguists, and contributors

✨ Key Features

🎯 Feature	📝 Description
🌍 Dialect-Aware	Purpose-built for Sudanese dialect orthography, phonology, and syntax
🔓 Open Source	Fully transparent, MIT-licensed, community-driven development
🔄 End-to-End Pipeline	Complete workflow from raw text to production-ready models
📚 Educational Focus	Curriculum-anchored tools like SudaTutor for learning applications
🧪 Research-Grade	Rigorous benchmarking and evaluation frameworks
🚀 Production-Ready	Docker support, CI/CD pipelines, and deployment guides

🏗️ Ecosystem Architecture

graph TB
    A[Raw Sudanese Text] --> B[SuData Pipeline]
    B --> C{Clean Corpus}
    C --> D[Corpus Refinery]
    D --> E[Refined Dataset]
    C --> F[Synthetic Data Gen]
    F --> E
    E --> G[Model Training]
    G --> H[Sudanese LLM]
    H --> I[Benchmark Suite]
    I --> J{Evaluation Results}
    J --> K[SudaTutor App]
    J --> L[Production Deploy]
    
    style A fill:#e1f5ff
    style C fill:#fff4e1
    style E fill:#ffe1f5
    style H fill:#e1ffe1
    style J fill:#f5e1ff

📚 Ecosystem Components

🎓 Production Systems

Project	Description	Links
🎓 SudaTutor	Bilingual educational platform covering 117+ subjects, with retrieval-augmented generation (RAG), citations, and source-grounded answers.	📦 Repo • 📖 Docs
🧹 SuData	Data normalization & curation pipeline for Sudanese dialect text. Noise removal, PII filtering.	📦 Repo • 📖 Docs
🔧 Corpus Refinery (LLMCorpusKit)	Large-scale corpus cleaning with AI-powered semantic repairs, sentence fixing.	📦 Repo • 📖 Docs
📖 Normalizer	Sudanese dialect text normalization toolkit. Dialect-aware spelling, punctuation repair.	📦 Repo • 📖 Docs
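To make the normalization and PII-filtering steps in the table above more concrete, here is a minimal sketch of generic Arabic-script preprocessing. It is not the SuData or Normalizer implementation: the function names, the regular expressions, and the choice of rules (diacritic stripping, tatweel removal, alef unification) are all assumptions, and whether each rule suits Sudanese dialect orthography is a decision for the project itself.

```python
import re
import unicodedata

# Arabic diacritics (tashkeel) U+064B..U+0652 plus superscript alef U+0670.
DIACRITICS = re.compile("[\u064B-\u0652\u0670]")
TATWEEL = "\u0640"  # elongation character used for visual stretching
# Map hamza/madda alef variants to bare alef (a common but lossy convention).
ALEF_VARIANTS = str.maketrans(
    {"\u0623": "\u0627", "\u0625": "\u0627", "\u0622": "\u0627"}
)

# Deliberately simple, illustrative PII patterns; real filters need more care.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE = re.compile(r"\+?\d[\d\s-]{7,}\d")


def normalize(text):
    """Apply generic Arabic-script normalization to a string."""
    text = unicodedata.normalize("NFKC", text)  # fold presentation forms
    text = DIACRITICS.sub("", text)
    text = text.replace(TATWEEL, "")
    text = text.translate(ALEF_VARIANTS)
    return re.sub(r"\s+", " ", text).strip()


def mask_pii(text):
    """Replace e-mail addresses and phone-like digit runs with placeholders."""
    text = EMAIL.sub("<EMAIL>", text)
    return PHONE.sub("<PHONE>", text)


print(mask_pii("contact: test@example.com"))
```

Masking e-mail addresses before phone numbers keeps digits inside an address from being treated as a phone number first.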

🔬 Research & Development

Project	Description
🎲 Synthetic Data Generator	Regional Sudanese dialect text generator (Khartoum, Darfur, East, South)
📊 Dialect Benchmark	Comprehensive tokenizer & model benchmark for Sudanese dialect
📦 Data Hub	Central registry for Sudanese dialect datasets with metadata
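One metric a tokenizer benchmark of this kind might report is fertility, the average number of tokens produced per whitespace-separated word. The sketch below is an assumption about what such a measurement could look like, not the Dialect Benchmark's actual method. The two toy tokenizers and the sample sentences are placeholders and do not come from any Sudaverse dataset.

```python
def fertility(tokenize, sentences):
    """Average number of tokens per whitespace-separated word."""
    words = sum(len(s.split()) for s in sentences)
    tokens = sum(len(tokenize(s)) for s in sentences)
    return tokens / words if words else 0.0


def whitespace_tokenizer(text):
    """Baseline: one token per word."""
    return text.split()


def char_tokenizer(text):
    """Worst case: one token per non-space character."""
    return [ch for ch in text if not ch.isspace()]


# Placeholder sentences for illustration only.
SAMPLE = ["الخرطوم مدينة كبيرة", "السلام عليكم"]

for name, tok in [("whitespace", whitespace_tokenizer), ("char", char_tokenizer)]:
    print(name, round(fertility(tok, SAMPLE), 2))
```

A real subword tokenizer would fall somewhere between these two extremes. Lower fertility on dialect text generally means fewer tokens per sentence for the model to process.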

🚀 Quick Start

Prerequisites

Python 3.8+
Git
Docker (optional)

Installation

# Clone the main ecosystem repository
git clone https://github.com/sudaverse/sudaverse.git
cd sudaverse

# Set up virtual environment
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

Component Installation

📊 SuData Setup:

cd sudata
pip install -r requirements.txt
python run.py --input ../data/raw/ --output ../data/clean/
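
For readers who want a feel for what a script with this command-line shape might do, here is a hypothetical sketch. It is not the contents of SuData's `run.py`: only the `--input` and `--output` flags come from the command above, while `clean_line`, `clean_directory`, the `*.txt` file pattern, and the whitespace-only cleaning step are assumptions made for illustration.

```python
import argparse
from pathlib import Path


def clean_line(line):
    """Placeholder cleaning step: collapse runs of whitespace."""
    return " ".join(line.split())


def clean_directory(input_dir, output_dir):
    """Clean every .txt file in input_dir and write the result to output_dir."""
    src, dst = Path(input_dir), Path(output_dir)
    dst.mkdir(parents=True, exist_ok=True)
    written = []
    for path in sorted(src.glob("*.txt")):
        lines = path.read_text(encoding="utf-8").splitlines()
        cleaned = [c for c in map(clean_line, lines) if c]  # drop empty lines
        target = dst / path.name
        target.write_text("\n".join(cleaned) + "\n", encoding="utf-8")
        written.append(target)
    return written


def main(argv=None):
    """Parse --input/--output and run the cleaning pass."""
    parser = argparse.ArgumentParser(description="Hypothetical cleaning CLI")
    parser.add_argument("--input", required=True, help="directory of raw .txt files")
    parser.add_argument("--output", required=True, help="directory for cleaned files")
    args = parser.parse_args(argv)
    for target in clean_directory(args.input, args.output):
        print(f"wrote {target}")
```

Calling `main(["--input", "../data/raw/", "--output", "../data/clean/"])` mirrors the command shown above. Consult the SuData repository for the options the real script supports.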

🎓 SudaTutor Setup:

git clone https://github.com/sudaverse/sudatutor.git
cd sudatutor
# Follow repository README for full installation

📊 Project Status

🗺️ Roadmap

2025 Q4 🚧

Release Synthetic Data Generator v1.0
Launch Sudanese Dialect Benchmark suite
Complete SudaTutor Roadmap Project
Expand documentation and tutorials

2026 Q1 📋

Develop Data Hub infrastructure
Release first fine-tuned Sudanese dialect LLM
Organize virtual community summit
Launch contributor recognition program

🤝 Contributing

We welcome contributions from developers, linguists, researchers, and Sudanese dialect speakers worldwide!

How to Contribute

💻 Code: Bug fixes, features, performance improvements, documentation
📝 Data: Raw dialect text, dialect samples, annotations, quality reviews
📚 Research: Linguistic analysis, benchmark design, evaluation protocols

Contribution Workflow

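# 1. Fork the repository on GitHub (or with the GitHub CLI: gh repo fork sudaverse/sudaverse),
#    then clone your fork
git clone https://github.com/<your-username>/sudaverse.git
cd sudaverse

# 2. Create a feature branch
git checkout -b feature/your-amazing-feature

# 3. Stage your changes and commit
git add -A
git commit -m "feat: add incredible new feature"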

# 4. Push and open a Pull Request
git push origin feature/your-amazing-feature

💬 Community

Platform	Purpose	Link
💬 Discord	Real-time chat, collaboration	Join Discord
📧 Email	Official communications	info@sudaverse.com
🌐 Website	Documentation, resources	sudaverse.com

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

Citation

@misc{sudaverse2024,
  title={Sudaverse: An Open-Source Ecosystem for Sudanese Dialect NLP},
  author={Sudaverse Contributors},
  year={2024},
  url={https://github.com/sudaverse/sudaverse}
}