Show HN: I made an open-source synthetic text datasets generator



Generate high-quality and diverse synthetic text datasets in minutes, not weeks.

  • Get initial evaluation text data instead of starting your LLM project blind.
  • Increase diversity and coverage of an existing dataset by generating more data.
  • Quickly experiment with and test PoCs of LLM-based applications.
  • Make your own datasets to fine-tune and evaluate language models for your application.

🌟 Star this repo if you find this useful!

  • ✅ Text Classification Dataset
  • ✅ Raw Text Generation Dataset
  • ✅ Instruction Dataset (Ultrachat-like)
  • ✅ Multiple Choice Question (MCQ) Dataset
  • ✅ Preference Dataset
  • ⏳ more to come...

Currently we support the following LLM providers:

  • ✔︎ OpenAI
  • ✔︎ Anthropic
  • ✔︎ Google Gemini
  • ✔︎ Ollama (local LLM server)
  • ⏳ more to come...

Try it in Colab.

Make sure you have created a secrets.env file with your API keys. An HF token is needed only if you want to push the dataset to your HF hub; the other keys depend on which LLM providers you use.

```
GEMINI_API_KEY=XXXX
OPENAI_API_KEY=sk-XXXX
ANTHROPIC_API_KEY=sk-ant-XXXXX
HF_TOKEN=hf_XXXXX
```
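As an optional, stdlib-only sanity check (not part of Datafast; the key names simply mirror the example above), you can parse `secrets.env` and report which expected keys are missing before running any generation:

```python
import os

REQUIRED_KEYS = ["OPENAI_API_KEY", "ANTHROPIC_API_KEY", "GEMINI_API_KEY", "HF_TOKEN"]

def parse_env_file(path: str) -> dict[str, str]:
    # Minimal KEY=VALUE parser; skips blank lines and comments
    env = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            env[key.strip()] = value.strip()
    return env

# Write a toy secrets file, then check which expected keys are missing
with open("secrets.env", "w", encoding="utf-8") as f:
    f.write("OPENAI_API_KEY=sk-XXXX\nHF_TOKEN=hf_XXXXX\n")

env = parse_env_file("secrets.env")
missing = [k for k in REQUIRED_KEYS if k not in env]
```

In practice you would load the same file with `python-dotenv` (as the snippets below do) and only need the keys for the providers you actually use.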
```python
from datafast.datasets import ClassificationDataset
from datafast.schema.config import ClassificationDatasetConfig, PromptExpansionConfig
from datafast.llms import OpenAIProvider, AnthropicProvider, GeminiProvider
from dotenv import load_dotenv

# Load environment variables
load_dotenv("secrets.env")  # <--- your API keys
```
```python
# Configure the dataset for text classification
config = ClassificationDatasetConfig(
    classes=[
        {"name": "positive", "description": "Text expressing positive emotions or approval"},
        {"name": "negative", "description": "Text expressing negative emotions or criticism"}
    ],
    num_samples_per_prompt=5,
    output_file="outdoor_activities_sentiments.jsonl",
    languages={"en": "English", "fr": "French"},
    prompts=[
        (
            "Generate {num_samples} reviews in {language_name} which are diverse "
            "and representative of a '{label_name}' sentiment class. "
            "{label_description}. The reviews should be {{style}} and in the "
            "context of {{context}}."
        )
    ],
    expansion=PromptExpansionConfig(
        placeholders={
            "context": ["hike review", "speedboat tour review", "outdoor climbing experience"],
            "style": ["brief", "detailed"]
        },
        combinatorial=True
    )
)
```
```python
# Create LLM providers
providers = [
    OpenAIProvider(model_id="gpt-4.1-mini-2025-04-14"),
    AnthropicProvider(model_id="claude-3-5-haiku-latest"),
    GeminiProvider(model_id="gemini-2.0-flash")
]
```

Generate and push the dataset:

```python
# Generate dataset and save it locally
dataset = ClassificationDataset(config)
dataset.generate(providers)

# Optional: Push to Hugging Face Hub
dataset.push_to_hub(
    repo_id="YOUR_USERNAME/YOUR_DATASET_NAME",
    train_size=0.6
)
```
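The output file is JSONL: one JSON record per line. A minimal, library-independent sketch of reading such a file back for inspection (the `text`/`label` field names here are illustrative assumptions, not Datafast's documented schema):

```python
import json

def load_jsonl(path: str) -> list[dict]:
    # Each non-empty line of a JSONL file is one standalone JSON record
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# Example: write two fake records, then read them back
sample = [
    {"text": "Great hike, stunning views!", "label": "positive"},
    {"text": "The speedboat tour was a letdown.", "label": "negative"},
]
with open("sample.jsonl", "w", encoding="utf-8") as f:
    for row in sample:
        f.write(json.dumps(row) + "\n")

rows = load_jsonl("sample.jsonl")
```

Spot-checking a few generated rows this way is a quick way to catch prompt problems before pushing anything to the Hub.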

Key features:

  • Easy-to-use and simple interface 🚀
  • Multilingual dataset generation 🌍
  • Multiple LLMs used to boost dataset diversity 🤖
  • Flexible prompts: use our default prompts or provide your own custom prompts 📝
  • Prompt expansion: Combinatorial variation of prompts to maximize diversity 🔄
  • Hugging Face Integration: Push generated datasets to the Hub 🤗
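The combinatorial prompt expansion listed above can be sketched with `itertools.product`; this is an illustrative re-implementation of the idea, not Datafast's actual code:

```python
from itertools import product

def expand_prompt(template: str, placeholders: dict[str, list[str]]) -> list[str]:
    # Produce one prompt per combination of placeholder values (combinatorial expansion)
    keys = list(placeholders)
    return [
        template.format(**dict(zip(keys, combo)))
        for combo in product(*(placeholders[k] for k in keys))
    ]

template = "Write a {style} review in the context of {context}."
prompts = expand_prompt(template, {
    "style": ["brief", "detailed"],
    "context": ["hike review", "speedboat tour review"],
})
# 2 styles x 2 contexts -> 4 distinct prompts
```

With `combinatorial=True`, the number of generated prompts is the product of the placeholder list sizes, which is what drives the diversity of the resulting dataset.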

Warning

This library is in its early stages of development and might change significantly.

Roadmap:

  • RAG datasets
  • Personas
  • Seeds
  • More types of instructions datasets (not just ultrachat)
  • More LLM providers
  • Deduplication, filtering
  • Dataset cards generation

Made with ❤️ by Patrick Fleith.


This is volunteer work; star this repo to show your support! 🙏

  • Status: Work in Progress (APIs may change)
  • License: Apache 2.0