PRESERVING SUDANESE HERITAGE IN THE DIGITAL AGE
An open-source AI ecosystem dedicated to developing, training, evaluating, and scaling models that understand and generate Sudanese dialect, built for culture, community, and the future.
Quick Start • Documentation • Contributing • Community
Table of Contents
- Vision & Mission
- Key Features
- Ecosystem Architecture
- Ecosystem Components
- Quick Start
- Project Status
- Roadmap
- Contributing
- Community
- License
Vision & Mission
Core Objectives
| Objective | Description |
|---|---|
| Corpus Building | Develop the largest open-source Sudanese dialect dataset |
| Data Quality | Normalize and clean dialect data using advanced AI tools |
| Synthetic Generation | Create high-quality synthetic Sudanese text at scale |
| Model Training | Train and fine-tune state-of-the-art LLMs for Sudanese dialect |
| Benchmarking | Establish comprehensive evaluation standards for dialect models |
| Community | Foster an active community of developers, linguists, and contributors |
Key Features
| Feature | Description |
|---|---|
| Dialect-Aware | Purpose-built for Sudanese dialect orthography, phonology, and syntax |
| Open Source | Fully transparent, MIT-licensed, community-driven development |
| End-to-End Pipeline | Complete workflow from raw text to production-ready models |
| Educational Focus | Curriculum-anchored tools like SudaTutor for learning applications |
| Research-Grade | Rigorous benchmarking and evaluation frameworks |
| Production-Ready | Docker support, CI/CD pipelines, and deployment guides |
Ecosystem Architecture
graph TB
A[Raw Sudanese Text] --> B[SuData Pipeline]
B --> C{Clean Corpus}
C --> D[Corpus Refinery]
D --> E[Refined Dataset]
C --> F[Synthetic Data Gen]
F --> E
E --> G[Model Training]
G --> H[Sudanese LLM]
H --> I[Benchmark Suite]
I --> J{Evaluation Results}
J --> K[SudaTutor App]
J --> L[Production Deploy]
style A fill:#e1f5ff
style C fill:#fff4e1
style E fill:#ffe1f5
style H fill:#e1ffe1
style J fill:#f5e1ff
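The data flow in the diagram can also be read as a dependency graph. The following is a minimal sketch, not part of any Sudaverse repository: it copies the edges from the diagram and derives one valid stage order with Kahn's algorithm. The names `EDGES` and `stage_order` are hypothetical and exist only for this illustration.

```python
from collections import deque

# Edges copied from the architecture diagram; nothing else is assumed
# about how the real components invoke each other.
EDGES = [
    ("Raw Sudanese Text", "SuData Pipeline"),
    ("SuData Pipeline", "Clean Corpus"),
    ("Clean Corpus", "Corpus Refinery"),
    ("Corpus Refinery", "Refined Dataset"),
    ("Clean Corpus", "Synthetic Data Gen"),
    ("Synthetic Data Gen", "Refined Dataset"),
    ("Refined Dataset", "Model Training"),
    ("Model Training", "Sudanese LLM"),
    ("Sudanese LLM", "Benchmark Suite"),
    ("Benchmark Suite", "Evaluation Results"),
    ("Evaluation Results", "SudaTutor App"),
    ("Evaluation Results", "Production Deploy"),
]


def stage_order(edges):
    """Return the stages so that each one appears after all of its inputs."""
    successors = {}
    indegree = {}
    for src, dst in edges:
        successors.setdefault(src, []).append(dst)
        successors.setdefault(dst, [])
        indegree.setdefault(src, 0)
        indegree[dst] = indegree.get(dst, 0) + 1

    # Start from stages with no inputs and release a stage once all
    # of its inputs have been emitted (Kahn's algorithm).
    ready = deque(node for node, deg in indegree.items() if deg == 0)
    order = []
    while ready:
        node = ready.popleft()
        order.append(node)
        for nxt in successors[node]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)

    if len(order) != len(indegree):
        raise ValueError("cycle detected in pipeline graph")
    return order


if __name__ == "__main__":
    for step, stage in enumerate(stage_order(EDGES), start=1):
        print(f"{step:2d}. {stage}")
```

The ordering makes one property of the diagram explicit: the refined dataset depends on both the Corpus Refinery and the synthetic data generator, so model training cannot start until both branches have finished.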
Ecosystem Components
Production Systems
| Project | Description | Status | Links |
|---|---|---|---|
| SudaTutor | Educational platform with 117+ subjects. Bilingual with RAG, citations, source-grounded answers. | | Repo • Docs |
| SuData | Data normalization & curation pipeline for Sudanese dialect text. Noise removal, PII filtering. | | Repo • Docs |
| Corpus Refinery (LLMCorpusKit) | Large-scale corpus cleaning with AI-powered semantic repairs, sentence fixing. | | Repo • Docs |
| Normalizer | Sudanese dialect text normalization toolkit. Dialect-aware spelling, punctuation repair. | | Repo • Docs |
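To make "noise removal", "punctuation repair", and "PII filtering" more concrete, here is a rough sketch of what such steps might look like for Arabic-script text. It is an assumption-laden illustration, not the actual SuData or Normalizer code: the function names (`normalize`, `mask_pii`), the regular expressions, and the `<EMAIL>` / `<PHONE>` placeholders are all hypothetical, and real dialect-aware spelling rules would need linguistic input that this sketch does not attempt.

```python
import re
import unicodedata

# All names and patterns below are hypothetical, chosen for illustration only.
TATWEEL = "\u0640"  # Arabic elongation character (kashida)
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_RE = re.compile(r"\+?\d[\d\s-]{7,}\d")
# Runs of the same mark: ! ? , plus Arabic comma, semicolon, question mark.
REPEATED_PUNCT_RE = re.compile(r"([!?,\u060C\u061B\u061F])\1+")


def strip_diacritics(text):
    """Drop combining marks (Unicode category Mn), which includes Arabic harakat."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")


def normalize(text):
    """Remove tatweel and diacritics, collapse repeated punctuation and whitespace."""
    text = text.replace(TATWEEL, "")
    text = strip_diacritics(text)
    text = REPEATED_PUNCT_RE.sub(r"\1", text)
    return re.sub(r"\s+", " ", text).strip()


def mask_pii(text):
    """Replace email addresses and phone-like digit runs with placeholders."""
    text = EMAIL_RE.sub("<EMAIL>", text)
    return PHONE_RE.sub("<PHONE>", text)


if __name__ == "__main__":
    # The phone number is a made-up placeholder, not real contact data.
    sample = "Contact  info@sudaverse.com or +000 000 000 000!!!"
    print(mask_pii(normalize(sample)))
```

Normalizing before masking keeps the PII patterns simple, since whitespace and punctuation runs are already collapsed by the time they are applied. A production filter would likely need broader patterns (names, addresses, national IDs) and human review.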
Research & Development
Quick Start
Prerequisites
- Python 3.8+
- Git
- Docker (optional)
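Before installing, it can help to confirm that the prerequisites above are available. The snippet below is a small, unofficial sketch using only the standard library; `check_prerequisites` is a hypothetical helper and is not shipped with the project.

```python
import shutil
import sys

MIN_PYTHON = (3, 8)  # minimum version from the prerequisites list


def check_prerequisites(version_info=sys.version_info, which=shutil.which):
    """Return a dict mapping each prerequisite to True (found) or False (missing).

    `version_info` and `which` are injectable so the check can be tested
    without depending on the local machine.
    """
    return {
        "python>=3.8": tuple(version_info[:2]) >= MIN_PYTHON,
        "git": which("git") is not None,
        "docker (optional)": which("docker") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'found' if ok else 'missing'}")
```

The check only verifies that the `git` and `docker` executables are on the `PATH`; it does not validate their versions or whether the Docker daemon is running.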
Installation
# Clone the main ecosystem repository
git clone https://github.com/sudaverse/sudaverse.git
cd sudaverse

# Set up virtual environment
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate
Component Installation
SuData Setup:
cd sudata
pip install -r requirements.txt
python run.py --input ../data/raw/ --output ../data/clean/
SudaTutor Setup:
git clone https://github.com/sudaverse/sudatutor.git
cd sudatutor
# Follow repository README for full installation
Project Status
Roadmap
2025 Q4
- Release Synthetic Data Generator v1.0
- Launch Sudanese Dialect Benchmark suite
- Complete SudaTutor Roadmap Project
- Expand documentation and tutorials
2026 Q1
- Develop Data Hub infrastructure
- Release first fine-tuned Sudanese dialect LLM
- Organize virtual community summit
- Launch contributor recognition program
Contributing
We welcome contributions from developers, linguists, researchers, and Sudanese dialect speakers worldwide!
How to Contribute
Code: Bug fixes, features, performance improvements, documentation
Data: Raw dialect text, dialect samples, annotations, quality reviews
Research: Linguistic analysis, benchmark design, evaluation protocols
Contribution Workflow
# 1. Fork the repository on GitHub, then clone your fork
#    (replace <your-username> with your GitHub username)
git clone https://github.com/<your-username>/sudaverse.git
cd sudaverse

# 2. Create a feature branch
git checkout -b feature/your-amazing-feature

# 3. Make changes, stage them, and commit
git add .
git commit -m "feat: add incredible new feature"

# 4. Push and open a Pull Request
git push origin feature/your-amazing-feature
Community
| Platform | Purpose | Link |
|---|---|---|
| Discord | Real-time chat, collaboration | Join Discord |
| Email | Official communications | info@sudaverse.com |
| Website | Documentation, resources | sudaverse.com |
Resources
- Documentation
- Tutorials & Guides
- Issue Tracker
- Project Board
License
This project is licensed under the MIT License - see the LICENSE file for details.
Citation
@misc{sudaverse2024,
  title={Sudaverse: An Open-Source Ecosystem for Sudanese Dialect NLP},
  author={Sudaverse Contributors},
  year={2024},
  url={https://github.com/sudaverse/sudaverse}
}
