GitHub - sudaverse/sudaverse: The Sudaverse ecosystem - Building Sudanese Arabic into the Heart of AI

PRESERVING SUDANESE HERITAGE IN THE DIGITAL AGE

License: MIT

An open-source AI ecosystem dedicated to developing, training, evaluating, and scaling models that understand and generate Sudanese dialect, built for culture, community, and the future.


🎯 Vision & Mission

🌟 Vision

A digital future where Sudanese dialect is fully represented across AI systems: understood, generated, and respected by modern language technologies.

🎯 Mission

To build the world's most comprehensive open-source infrastructure for Sudanese dialect NLP, ensuring linguistic preservation and technological advancement.

🔑 Core Objectives

| Objective | Description |
| --- | --- |
| 📦 Corpus Building | Develop the largest open-source Sudanese dialect dataset |
| 🧹 Data Quality | Normalize and clean dialect data using advanced AI tools |
| 🤖 Synthetic Generation | Create high-quality synthetic Sudanese text at scale |
| 🧠 Model Training | Train and fine-tune state-of-the-art LLMs for Sudanese dialect |
| 📊 Benchmarking | Establish comprehensive evaluation standards for dialect models |
| 👥 Community | Foster an active community of developers, linguists, and contributors |

✨ Key Features

| 🎯 Feature | 📝 Description |
| --- | --- |
| 🌍 Dialect-Aware | Purpose-built for Sudanese dialect orthography, phonology, and syntax |
| 🔓 Open Source | Fully transparent, MIT-licensed, community-driven development |
| 🔄 End-to-End Pipeline | Complete workflow from raw text to production-ready models |
| 📚 Educational Focus | Curriculum-anchored tools like SudaTutor for learning applications |
| 🧪 Research-Grade | Rigorous benchmarking and evaluation frameworks |
| 🚀 Production-Ready | Docker support, CI/CD pipelines, and deployment guides |

🏗️ Ecosystem Architecture

graph TB
    A[Raw Sudanese Text] --> B[SuData Pipeline]
    B --> C{Clean Corpus}
    C --> D[Corpus Refinery]
    D --> E[Refined Dataset]
    C --> F[Synthetic Data Gen]
    F --> E
    E --> G[Model Training]
    G --> H[Sudanese LLM]
    H --> I[Benchmark Suite]
    I --> J{Evaluation Results}
    J --> K[SudaTutor App]
    J --> L[Production Deploy]
    
    style A fill:#e1f5ff
    style C fill:#fff4e1
    style E fill:#ffe1f5
    style H fill:#e1ffe1
    style J fill:#f5e1ff
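
The stage ordering implied by the diagram can be checked with a small script. The sketch below is illustrative only and is not part of any Sudaverse repository: the edge list is copied from the diagram labels, while the function name `stage_order` and the use of Kahn's algorithm are assumptions made for this example.

```python
from collections import deque

# Edges copied from the architecture diagram; node names are the diagram labels.
EDGES = [
    ("Raw Sudanese Text", "SuData Pipeline"),
    ("SuData Pipeline", "Clean Corpus"),
    ("Clean Corpus", "Corpus Refinery"),
    ("Corpus Refinery", "Refined Dataset"),
    ("Clean Corpus", "Synthetic Data Gen"),
    ("Synthetic Data Gen", "Refined Dataset"),
    ("Refined Dataset", "Model Training"),
    ("Model Training", "Sudanese LLM"),
    ("Sudanese LLM", "Benchmark Suite"),
    ("Benchmark Suite", "Evaluation Results"),
    ("Evaluation Results", "SudaTutor App"),
    ("Evaluation Results", "Production Deploy"),
]


def stage_order(edges):
    """Return stages so that each one appears after all of its inputs (Kahn's algorithm)."""
    successors = {}
    indegree = {}
    for src, dst in edges:
        successors.setdefault(src, []).append(dst)
        successors.setdefault(dst, [])
        indegree.setdefault(src, 0)
        indegree[dst] = indegree.get(dst, 0) + 1

    # Start from stages with no inputs and release successors as inputs complete.
    ready = deque(node for node, degree in indegree.items() if degree == 0)
    order = []
    while ready:
        node = ready.popleft()
        order.append(node)
        for nxt in successors[node]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)

    if len(order) != len(indegree):
        raise ValueError("the stage graph contains a cycle")
    return order


if __name__ == "__main__":
    print(" -> ".join(stage_order(EDGES)))
```

Running the script prints one valid execution order. Note that "Refined Dataset" has two inputs, so it only becomes ready once both the Corpus Refinery and the Synthetic Data Gen stages have finished.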

📚 Ecosystem Components

🎓 Production Systems

| Project | Description | Status | Links |
| --- | --- | --- | --- |
| 🎓 SudaTutor | Educational platform with 117+ subjects. Bilingual with RAG, citations, source-grounded answers. | Production | 📦 Repo • 📖 Docs |
| 🧹 SuData | Data normalization & curation pipeline for Sudanese dialect text. Noise removal, PII filtering. | Active | 📦 Repo • 📖 Docs |
| 🔧 Corpus Refinery (LLMCorpusKit) | Large-scale corpus cleaning with AI-powered semantic repairs, sentence fixing. | Stable | 📦 Repo • 📖 Docs |
| 📖 Normalizer | Sudanese dialect text normalization toolkit. Dialect-aware spelling, punctuation repair. | Beta | 📦 Repo • 📖 Docs |
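
To give a feel for the kind of cleanup described for SuData and the Normalizer, here is a minimal sketch of noise removal and PII masking for Arabic-script text. It is not taken from either repository: the function name `clean_line`, the placeholder tokens `<EMAIL>` and `<PHONE>`, and the regular expressions are all assumptions chosen for illustration, and real dialect-aware normalization would need far more care.

```python
import re

# Arabic diacritics (tashkeel), U+064B..U+0652, plus superscript alef U+0670.
DIACRITICS = re.compile(r"[\u064B-\u0652\u0670]")
# Tatweel (kashida), a purely decorative elongation character.
TATWEEL = re.compile(r"\u0640")
# Deliberately simple PII patterns; they are illustrative, not exhaustive.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE = re.compile(r"\+?\d[\d\s-]{7,}\d")
WHITESPACE = re.compile(r"\s+")


def clean_line(text: str) -> str:
    """Mask simple PII, strip diacritics and tatweel, and collapse whitespace."""
    text = EMAIL.sub("<EMAIL>", text)   # mask PII before touching the characters
    text = PHONE.sub("<PHONE>", text)
    text = DIACRITICS.sub("", text)
    text = TATWEEL.sub("", text)
    return WHITESPACE.sub(" ", text).strip()


if __name__ == "__main__":
    print(clean_line("مـــرحبا   يا زول، راسلني على a.b@example.com"))
```

Masking PII before character-level normalization keeps the email and phone patterns from being affected by later substitutions. Whether to strip diacritics at all is a design choice that depends on the downstream task.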

🔬 Research & Development

| Project | Description | Status |
| --- | --- | --- |
| 🎲 Synthetic Data Generator | Regional Sudanese dialect text generator (Khartoum, Darfur, East, South) | Dev |
| 📊 Dialect Benchmark | Comprehensive tokenizer & model benchmark for Sudanese dialect | Dev |
| 📦 Data Hub | Central registry for Sudanese dialect datasets with metadata | Planned |

🚀 Quick Start

Prerequisites

  • Python 3.8+
  • Git
  • Docker (optional)

Installation

# Clone the main ecosystem repository
git clone https://github.com/sudaverse/sudaverse.git
cd sudaverse

# Set up virtual environment
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

Component Installation

📊 SuData Setup:

cd sudata
pip install -r requirements.txt
python run.py --input ../data/raw/ --output ../data/clean/

🎓 SudaTutor Setup:

git clone https://github.com/sudaverse/sudatutor.git
cd sudatutor
# Follow repository README for full installation


🗺️ Roadmap

2025 Q4 🚧

  • Release Synthetic Data Generator v1.0
  • Launch Sudanese Dialect Benchmark suite
  • Complete SudaTutor Roadmap Project
  • Expand documentation and tutorials

2026 Q1 📋

  • Develop Data Hub infrastructure
  • Release first fine-tuned Sudanese dialect LLM
  • Organize virtual community summit
  • Launch contributor recognition program

🀝 Contributing

We welcome contributions from developers, linguists, researchers, and Sudanese dialect speakers worldwide!

How to Contribute

💻 Code: Bug fixes, features, performance improvements, documentation
📝 Data: Raw dialect text, dialect samples, annotations, quality reviews
📚 Research: Linguistic analysis, benchmark design, evaluation protocols

Contribution Workflow

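# 1. Fork the repository on GitHub (Fork button), then clone your fork
#    (<your-username> is a placeholder for your own GitHub account)
git clone https://github.com/<your-username>/sudaverse.git
cd sudaverse

# 2. Create a feature branch
git checkout -b feature/your-amazing-feature

# 3. Make changes, stage them, and commit
git add .
git commit -m "feat: add incredible new feature"

# 4. Push and open a Pull Request
git push origin feature/your-amazing-feature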

💬 Community

| Platform | Purpose | Link |
| --- | --- | --- |
| 💬 Discord | Real-time chat, collaboration | Join Discord |
| 📧 Email | Official communications | info@sudaverse.com |
| 🌐 Website | Documentation, resources | sudaverse.com |


📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

Citation

@misc{sudaverse2024,
  title={Sudaverse: An Open-Source Ecosystem for Sudanese Dialect NLP},
  author={Sudaverse Contributors},
  year={2024},
  url={https://github.com/sudaverse/sudaverse}
}