🎉 [2025-06-28] We’re excited to announce that DataFlow, our Data-centric AI system, is now released! Stay tuned for future updates.
DataFlow is a data preparation and training system designed to parse, generate, process, and evaluate high-quality data from noisy sources (PDFs, plain text, low-quality QA pairs), thereby improving the performance of large language models (LLMs) in specific domains through targeted training (pre-training, supervised fine-tuning, RL training) or through RAG with knowledge-base cleaning. DataFlow has been empirically validated to improve the performance of domain-oriented LLMs in fields such as healthcare, finance, and law.
Specifically, we construct diverse operators leveraging rule-based methods, deep learning models, LLMs, and LLM APIs. These operators are systematically integrated into distinct pipelines, which together form the comprehensive DataFlow system. Additionally, we develop an intelligent DataFlow Agent capable of dynamically assembling new pipelines by recombining existing operators on demand.
The current pipelines in DataFlow are as follows:
In this framework, operators are grouped into categories such as Fundamental Operators, Generic Operators, Domain-Specific Operators, and Evaluation Operators, supporting both data processing and evaluation. Please refer to the documentation for details.
- DataFlow Agent: An intelligent assistant that performs data analysis, writes custom operators, and automatically orchestrates them into pipelines based on specific task objectives.
For environment setup and installation, please use the following commands 👇
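A minimal setup in a fresh conda environment (the PyPI package name `open-dataflow` is assumed here):

```shell
# create and activate an isolated environment (DataFlow requires Python >= 3.10)
conda create -n dataflow python=3.10
conda activate dataflow

# install DataFlow from PyPI (package name assumed to be open-dataflow)
pip install open-dataflow
```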
If you want to run inference locally on your own GPU, please use:
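Assuming the local-inference backend is packaged as a `vllm` extra:

```shell
# install DataFlow together with the vLLM backend for local GPU inference
pip install "open-dataflow[vllm]"
```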
DataFlow supports Python >= 3.10.
You can use the following command to check whether the installation succeeded:
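Assuming the package installs a `dataflow` command-line entry point:

```shell
dataflow -v
```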
You should see output similar to the following:
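Illustrative output (the exact wording and version numbers depend on the installed release; `x.y.z` is a placeholder):

```
open-dataflow codebase version: x.y.z
        Checking for updates...
        Local version:  x.y.z
        PyPI newest version:  x.y.z
You are using the latest version: x.y.z.
```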
For the quick-start guide, please visit our documentation.
For detailed experiment settings, please visit our documentation.
The pre-training data processing pipeline was applied to randomly sampled data from the RedPajama dataset, resulting in a final data retention rate of 13.65%. The analysis results using QuratingScorer are shown in the figure. As can be seen, the filtered pre-training data significantly outperforms the original data across four scoring dimensions: writing style, required expert knowledge, factual content, and educational value, demonstrating the effectiveness of DataFlow's pre-training data processing.
We filtered 3k records from the Alpaca dataset and compared them against 3k randomly selected Alpaca records by fine-tuning Qwen2.5-7B on each. Results are:
We verify our Reasoning Pipeline by SFT of Qwen2.5-32B-Instruct on data synthesized by the pipeline, generating 1k and 5k SFT data pairs. Results are:
We fine-tuned the Qwen2.5-Coder-14B model on the Bird dataset using both Supervised Fine-tuning (SFT) and Reinforcement Learning (RL), with data constructed via the DataFlow-Text2SQL Pipeline. Results are:
We sincerely appreciate MinerU's outstanding contribution, particularly its robust text extraction capabilities from PDFs and documents, which greatly facilitates data loading.
Join the DataFlow open-source community to ask questions, share ideas, and collaborate with other developers!
• 📮 GitHub Issues: Report bugs or suggest features
• 🔧 GitHub Pull Requests: Contribute code improvements
• 💬 Join our community groups to connect with us and other contributors!
If you use DataFlow in your research, please consider citing us.