If you like our project, please give us a star on GitHub to get the latest updates.
1. News
2. Overview
The DataFlow series is a data preparation and training system designed to parse, generate, process, and evaluate high-quality data from noisy sources (PDF, plain text, low-quality QA). The resulting data improves the performance of large language models (LLMs) in specific domains, either through targeted training (pre-training, supervised fine-tuning, RL training) or through RAG over a cleaned knowledge base.
Specifically, we are constructing diverse operators leveraging rule-based methods, deep learning models, LLMs, and LLM APIs. These operators are systematically integrated into distinct pipelines, collectively forming the comprehensive DataFlow system. Additionally, we develop an intelligent DataFlow-agent capable of dynamically assembling new pipelines by recombining existing operators on demand.
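The operator and pipeline idea described above can be sketched in a few lines of Python. This is a conceptual illustration only: `Record`, `Operator`, `Pipeline`, `drop_short_text`, and `normalize_whitespace` are hypothetical names invented for this sketch and are not part of the DataFlow API.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical types for illustration only; these are not DataFlow classes.
Record = dict
Operator = Callable[[list[Record]], list[Record]]


def normalize_whitespace(records: list[Record]) -> list[Record]:
    """Rule-based refiner: collapse runs of whitespace in each record's text."""
    return [{**r, "text": " ".join(r["text"].split())} for r in records]


def drop_short_text(min_chars: int) -> Operator:
    """Rule-based filter: keep records whose text has at least min_chars characters."""
    def run(records: list[Record]) -> list[Record]:
        return [r for r in records if len(r["text"].strip()) >= min_chars]
    return run


@dataclass
class Pipeline:
    """An ordered list of operators applied one after another."""
    operators: list[Operator]

    def run(self, records: list[Record]) -> list[Record]:
        for op in self.operators:
            records = op(records)
        return records


if __name__ == "__main__":
    noisy = [{"text": "  What   is\tRAG? "}, {"text": "ok"}]
    pipeline = Pipeline([normalize_whitespace, drop_short_text(min_chars=5)])
    print(pipeline.run(noisy))
```

Because every operator shares one signature, recombining operators into a new pipeline amounts to building a different list, which is roughly the kind of reassembly an agent could automate.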
DataFlow-MM is the multimodal extension of the DataFlow repository.
Quick Start
Installation
First, clone the repository and install DataFlow-MM in editable mode:
cd ./DataFlow-MM
conda create -n dataflow-mm python=3.12
conda activate dataflow-mm
pip install -e .
Optional Dependencies
Install additional dependencies based on your use case:
Audio environment
pip install -e ".[audio]"
Image environment
pip install -e ".[image]"
Initialize a DataFlow Workspace
Create and initialize a DataFlow-MM workspace:
mkdir test_dataflow
cd test_dataflow
dataflowmm init
This command generates the basic directory structure and configuration files required to run DataFlow-MM pipelines.
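If you prefer to script the workspace setup, the steps above could be wrapped as follows. This is a minimal sketch that assumes the `dataflowmm` command is on your PATH (that is, the `dataflow-mm` environment is active); the helper name `init_workspace` is hypothetical and not part of DataFlow-MM.

```python
import shutil
import subprocess
from pathlib import Path


def init_workspace(workspace: Path, cli: str = "dataflowmm") -> bool:
    """Create the workspace directory and run `<cli> init` inside it.

    Returns False without running the command if the CLI is not on PATH.
    """
    workspace.mkdir(parents=True, exist_ok=True)
    if shutil.which(cli) is None:
        return False
    # check=True raises CalledProcessError if the command exits non-zero.
    subprocess.run([cli, "init"], cwd=workspace, check=True)
    return True


if __name__ == "__main__":
    if init_workspace(Path("test_dataflow")):
        print("workspace initialized")
    else:
        print("dataflowmm not found; activate the dataflow-mm environment first")
```

Checking for the executable first gives a clearer message than a failed subprocess call when the environment has not been activated.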
Demo Data
To run the image, video, audio, or image generation examples, please download the corresponding demo datasets from Hugging Face (GitHub is not suitable for hosting large files):
- Image Examples: https://huggingface.co/datasets/OpenDCAI/dataflow-demo-image
- Video Examples: https://huggingface.co/datasets/OpenDCAI/dataflow-demo-video
- Audio Examples: https://huggingface.co/datasets/OpenDCAI/dataflow-demo-audio
- Image Generation Examples: https://huggingface.co/datasets/OpenDCAI/dataflow-demo-image-gen
After downloading, place the data in the "test_dataflow/example" directory as instructed in each example.
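One possible way to fetch the datasets listed above is the `huggingface_hub` package (`pip install huggingface_hub`), sketched below. The repository IDs come from the URLs above; the per-dataset subdirectory names under `test_dataflow/example` and the helper names `plan_downloads` and `download` are assumptions made for this sketch, so adjust the target paths to match the instructions in each example.

```python
import sys
from pathlib import Path

# Dataset repositories from the list above, keyed by a short label.
# The labels double as subdirectory names, which is an assumption of this
# sketch; follow each example's instructions for the exact layout.
DEMO_DATASETS = {
    "image": "OpenDCAI/dataflow-demo-image",
    "video": "OpenDCAI/dataflow-demo-video",
    "audio": "OpenDCAI/dataflow-demo-audio",
    "image-gen": "OpenDCAI/dataflow-demo-image-gen",
}


def plan_downloads(names, root=Path("test_dataflow/example")):
    """Map each requested label to (repo_id, local target directory)."""
    unknown = [n for n in names if n not in DEMO_DATASETS]
    if unknown:
        raise KeyError(f"unknown demo dataset(s): {unknown}")
    return {n: (DEMO_DATASETS[n], root / n) for n in names}


def download(names, root=Path("test_dataflow/example")):
    """Download the requested demo datasets with huggingface_hub."""
    # Imported lazily so the planning step works without the package installed.
    from huggingface_hub import snapshot_download

    for repo_id, target in plan_downloads(names, root).values():
        snapshot_download(repo_id=repo_id, repo_type="dataset", local_dir=target)


if __name__ == "__main__":
    args = [a for a in sys.argv[1:] if a != "--download"]
    names = args or list(DEMO_DATASETS)
    if "--download" in sys.argv:
        download(names)
    else:
        # Dry run by default: show what would be fetched and where.
        for name, (repo_id, target) in plan_downloads(names).items():
            print(f"{name}: {repo_id} -> {target}")
```

Saved under a file name of your choice and run from the directory that contains `test_dataflow`, the script prints the planned targets by default; adding `--download` (optionally with labels such as `image audio`) performs the actual download.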

