Authored by: Behrooz Azarkhalili
This notebook demonstrates how to use DSPy’s GEPA (Genetic-Pareto) optimizer to improve language model performance on mathematical reasoning tasks. We’ll work with the NuminaMath-1.5 dataset and show how GEPA can boost accuracy through automated, error-driven prompt optimization.
What you’ll learn:
- Setting up DSPy with language models (OpenRouter)
- Processing and filtering mathematical problem datasets
- Building a baseline Chain-of-Thought reasoning program
- Optimizing prompts with GEPA using error-driven feedback
- Evaluating improvements in model accuracy
GEPA works by analyzing errors, generating targeted feedback, and automatically refining prompts to address common failure patterns. This makes it particularly effective for complex reasoning tasks where prompt quality significantly impacts performance.
Installation and Setup
Install required dependencies and import libraries for DSPy, dataset processing, and model configuration.
Installation Options:
- uv - Fast Python package installer (documentation)
- pip - Traditional Python package manager
Key Dependencies:
- dspy - DSPy framework for language model programming
- datasets - Hugging Face datasets library for loading NuminaMath-1.5
- python-dotenv - Environment variable management for API keys
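A minimal setup cell might look like the sketch below; using python-dotenv to load an OPENROUTER_API_KEY from a local .env file is an assumption about how the API keys are managed.

```python
# Install first (either command works):
#   uv pip install dspy datasets python-dotenv
#   pip install dspy datasets python-dotenv
import os

import dspy
from datasets import load_dataset
from dotenv import load_dotenv

# Load API keys (e.g. OPENROUTER_API_KEY) from a local .env file into os.environ.
load_dotenv()
```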
Language Model Configuration
Configure your language model - either local (Ollama) or cloud-based (OpenRouter) - for use with DSPy.
Model Selection Rationale
Main LM: openrouter/openai/gpt-4.1-nano
Primary Role: High-volume inference during baseline evaluation and GEPA optimization iterations
Key Selection Criteria:
- Cost Efficiency - $0.10/M input tokens, $0.40/M output tokens (~90% cheaper than GPT-4.1 or Claude)
- Low Latency - Fastest GPT-4.1 variant, enables rapid iteration with 16-32 parallel threads
- Adequate Performance - 60-65% baseline accuracy (MMLU: 80.1%, GPQA: 50.3%)
- Context Window - 1M tokens for long chain-of-thought reasoning
Reflection LM: openrouter/qwen/qwen3-next-80b-a3b-thinking
Primary Role: Deep error analysis and prompt improvement during GEPA’s reflection phase
Key Selection Criteria:
- Advanced Reasoning - “Thinking” variant specialized for analytical reasoning and pattern identification
- Quality Over Speed - ~16 reflection calls vs. 2,000+ inference calls, so we can afford a slower, higher-quality model
- Context Handling - Long native context window (262K tokens) for processing multiple training examples per reflection call
- Cost Trade-off - More expensive per token but negligible total cost due to low volume
Architecture Philosophy: Use a cheap, fast model for high-volume inference (99% of calls) and a smart, analytical model for low-volume reflection (1% of calls). This asymmetric design optimizes for both cost efficiency and learning quality.
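As a sketch, the two models described above can be configured as shown below; the max_tokens values are illustrative assumptions, not settings taken from the original notebook.

```python
import os

import dspy

# Student LM: cheap, fast model used for high-volume inference.
lm = dspy.LM(
    "openrouter/openai/gpt-4.1-nano",
    api_key=os.environ["OPENROUTER_API_KEY"],
    max_tokens=16000,
)

# Reflection LM: stronger "thinking" model, used only during GEPA's reflection phase.
reflection_lm = dspy.LM(
    "openrouter/qwen/qwen3-next-80b-a3b-thinking",
    api_key=os.environ["OPENROUTER_API_KEY"],
    max_tokens=32000,
)

dspy.configure(lm=lm)  # the reflection LM is handed to GEPA later
```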
Understanding GEPA’s Two-Model Architecture
A central design choice in GEPA is its dual-model approach to reflective prompt optimization, which differs fundamentally from traditional single-model optimizers.
Why Two Models?
Traditional prompt optimizers rely on scalar metrics (accuracy scores) to guide improvements, essentially using trial and error without understanding why predictions fail. GEPA instead separates concerns:
1. Student LM (Inference Model)
- Role: Primary model that executes tasks and generates predictions
- Characteristics: Fast, cost-efficient, handles high-volume inference
- Usage Pattern: ~90-95% of all API calls during optimization
- In This Notebook: openrouter/openai/gpt-4.1-nano
2. Reflection LM (Meta-Cognitive Model)
- Role: Analyzes failures, identifies patterns, and generates prompt improvements
- Characteristics: Stronger reasoning, analytical depth, interpretability
- Usage Pattern: ~5-10% of API calls (only during reflection phases)
- In This Notebook: openrouter/qwen/qwen3-next-80b-a3b-thinking
The Reflective Optimization Cycle: see the step-by-step breakdown under “How GEPA Works: Error-Driven Prompt Improvement” below.
Research Foundation:
This approach is detailed in the paper “GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning”, which demonstrates that reflective optimization with textual feedback can outperform reinforcement learning approaches on complex reasoning tasks.
Dataset Preparation Functions
Helper functions to process the dataset, split it into train/val/test sets, and preview examples.
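A possible implementation of these helpers is sketched below; the dataset column names (problem, solution, answer), the sample_fraction default, and the 50/10/40 split ratios are assumptions chosen to stay consistent with the example counts reported later in this notebook (112 train and 90 test examples out of ~224 sampled).

```python
import dspy
from datasets import load_dataset


def load_numinamath(sample_fraction: float = 0.00025, seed: int = 0):
    """Load, filter, subsample, and split NuminaMath-1.5 into train/val/test."""
    ds = load_dataset("AI-MO/NuminaMath-1.5", split="train")
    # Keep only problems that come with a clean, checkable final answer.
    ds = ds.filter(lambda row: row["answer"] not in (None, "", "proof", "notfound"))
    # Subsample a small fraction for a cheap demonstration run.
    ds = ds.shuffle(seed=seed).select(range(max(1, int(len(ds) * sample_fraction))))

    examples = [
        dspy.Example(
            problem=row["problem"], answer=row["answer"], solution=row["solution"]
        ).with_inputs("problem")
        for row in ds
    ]

    # Roughly 50% train / 10% val / 40% test of the sampled subset.
    n_train, n_val = int(0.5 * len(examples)), int(0.1 * len(examples))
    return (
        examples[:n_train],
        examples[n_train : n_train + n_val],
        examples[n_train + n_val :],
    )


def preview(example: dspy.Example) -> None:
    """Print a short preview of a single example."""
    print(example.problem[:300], "\nAnswer:", example.answer)


train_set, val_set, test_set = load_numinamath()
```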
Baseline Chain-of-Thought Program
Create a simple baseline using DSPy’s Chain-of-Thought module to establish initial performance.
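A minimal baseline could look like the sketch below; the signature field names (problem, answer) are assumptions that match the evaluation table columns described in the next section.

```python
import dspy


class SolveMathProblem(dspy.Signature):
    """Solve the math problem and report the final answer."""

    problem: str = dspy.InputField(desc="a math word or competition problem")
    answer: str = dspy.OutputField(desc="the final answer as a bare number or expression")


baseline_program = dspy.ChainOfThought(SolveMathProblem)

# Quick smoke test on one training example.
pred = baseline_program(problem=train_set[0].problem)
print(pred.reasoning)
print(pred.answer)
```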
Evaluation Metric
Define the evaluation metric to compare model predictions against ground truth answers.
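A simple exact-match metric is sketched below; the normalization steps are assumptions about how answer strings are compared.

```python
def normalize_answer(text: str) -> str:
    """Lightly normalize an answer string before exact-match comparison."""
    return str(text).strip().lower().replace(" ", "").rstrip(".")


def answer_match(example, pred, trace=None) -> bool:
    """Return True when the predicted answer matches the ground-truth answer."""
    return normalize_answer(pred.answer) == normalize_answer(example.answer)
```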
Baseline Evaluation
Evaluate the baseline Chain-of-Thought program to establish our starting accuracy before optimization.
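With dspy.Evaluate, the baseline run might look like this sketch; the thread count mirrors the num_threads=32 setting used for GEPA later.

```python
evaluator = dspy.Evaluate(
    devset=test_set,
    metric=answer_match,
    num_threads=32,
    display_progress=True,
    display_table=True,  # renders the per-example table discussed below
)

baseline_result = evaluator(baseline_program)
print(baseline_result)
```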
Understanding the Baseline Results
The evaluation table shows our model’s performance on 90 test problems:
Table Columns:
- problem: The mathematical question from NuminaMath-1.5
- example_answer: Ground truth answer
- reasoning: Model’s chain-of-thought reasoning process
- pred_answer: Model’s final prediction
- metric: ✔️ indicates correct answer
Key Observations:
- Baseline Accuracy: ~52% - The model gets roughly half the problems correct
- Reasoning Quality: The model generates coherent step-by-step reasoning (see the reasoning column)
- Common Failures:
- Calculation errors (e.g., row 0: predicted 4.8 minutes vs correct 14.4 minutes)
- Misinterpreting problem statements
Why This Matters: This baseline performance demonstrates that while GPT-4.1 Nano has reasonable mathematical reasoning capability, there’s significant room for improvement. GEPA will analyze these errors and automatically refine the prompt to address common failure patterns, potentially boosting accuracy by 10-20 percentage points.
GEPA Optimization
Apply GEPA optimizer with error-driven feedback to automatically improve the prompt and boost performance.
How GEPA Works: Error-Driven Prompt Improvement
GEPA (Genetic-Pareto) is an automatic prompt optimization technique that learns from the model’s mistakes to improve performance. Here’s how it works:
The GEPA Optimization Cycle:
- Evaluation Phase - Run the model on training examples and collect predictions
- Error Analysis - Identify which problems the model got wrong
- Feedback Generation - Create detailed feedback explaining:
- What the correct answer should be
- Why the model’s answer was wrong
- The complete step-by-step solution
- Reflection Phase - Use the reflection LM (Qwen3 Thinking) to:
- Analyze patterns across multiple failed examples
- Identify common failure modes (e.g., “model miscalculates ratios”, “model misinterprets word problems”)
- Generate improved prompt instructions to address these patterns
- Prompt Update - Modify the system prompt with new guidelines
- Validation - Test the updated prompt on validation set
- Iteration - Repeat the cycle, keeping only improvements that boost validation accuracy
Why We Need metric_with_feedback:
Unlike a standard metric that just returns 0 or 1 (correct/incorrect), metric_with_feedback returns:
- Score: 0 or 1 for correctness
- Feedback: Rich textual explanation including the ground truth solution
This feedback is crucial because GEPA’s reflection model needs to understand why predictions failed to generate better prompts. The more detailed the feedback, the better GEPA can identify patterns and create targeted improvements.
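A sketch of such a metric is shown below. GEPA’s feedback metric receives extra arguments (pred_name, pred_trace) and can return a dspy.Prediction carrying both a score and a feedback string; the exact feedback wording here is an assumption, and gold.solution refers to the solution field assumed in the dataset sketch above.

```python
def metric_with_feedback(gold, pred, trace=None, pred_name=None, pred_trace=None):
    """Score a prediction and explain the outcome for GEPA's reflection LM."""
    correct = normalize_answer(pred.answer) == normalize_answer(gold.answer)
    if correct:
        feedback = f"Correct: the answer {gold.answer} matches the ground truth."
    else:
        feedback = (
            f"Incorrect: the prediction was {pred.answer}, but the correct answer "
            f"is {gold.answer}. Reference solution:\n{gold.solution}"
        )
    return dspy.Prediction(score=float(correct), feedback=feedback)
```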
Key Parameters:
- auto="light": Controls optimization intensity (light/medium/heavy)
- reflection_minibatch_size=16: Number of examples analyzed together in each reflection step
- reflection_lm: The smarter model used for analyzing errors and improving prompts
- num_threads=32: Parallel evaluation for faster optimization
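Putting those parameters together, the optimization call might look like the following sketch (reflection_lm is the Qwen “thinking” model configured earlier; track_stats is an optional extra not listed above).

```python
from dspy import GEPA

gepa = GEPA(
    metric=metric_with_feedback,
    auto="light",
    reflection_minibatch_size=16,
    reflection_lm=reflection_lm,
    num_threads=32,
    track_stats=True,
)

optimized_program = gepa.compile(
    baseline_program,
    trainset=train_set,
    valset=val_set,
)
```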
Optimized Program Evaluation
Evaluate the GEPA-optimized program to measure the improvement in accuracy and effectiveness.
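Re-using the evaluator from the baseline run, the final check might look like this; the attribute path into ChainOfThought’s inner predictor is an assumption about DSPy internals.

```python
optimized_result = evaluator(optimized_program)
print(optimized_result)

# Inspect the prompt instructions GEPA evolved for the signature.
print(optimized_program.predict.signature.instructions)
```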
Understanding the Optimization Results
Performance Improvement:
- Baseline Accuracy: 52.2% (47/90 correct)
- Optimized Accuracy: 57.8% (52/90 correct)
- Improvement: +5.6 percentage points (~11% relative improvement)
What Changed: See the instruction GEPA developed above.
Why the Modest Improvement?
A gain of roughly 6 percentage points is expected given:
- Small Training Set: Only 112 training examples (0.025% of full dataset)
- Light Optimization: Using auto="light" for faster iteration
- Simple Baseline: Chain-of-Thought already provides decent reasoning structure
- Model Limitations: GPT-4.1 Nano’s mathematical capabilities are the ceiling
Cost Efficiency:
This entire experiment (baseline evaluation, GEPA optimization, and final evaluation on 224 examples) cost less than $0.50 thanks to:
- GPT-4.1 Nano’s low pricing ($0.10/M input, $0.40/M output)
- Asymmetric architecture (cheap model for 99% of calls, smart model for 1%)
- Small sample size for demonstration purposes
Key Takeaway:
Even with limited data and light optimization, GEPA successfully identified failure patterns and generated targeted prompt improvements. With more training data (sample_fraction=0.01 or higher) and heavier optimization (auto="medium" or "heavy"), we’d expect substantially larger gains, potentially reaching 65-70% accuracy.
Learn More
This notebook introduced DSPy’s GEPA optimizer for automated prompt improvement. Here are additional resources to deepen your understanding:
DSPy Framework
- DSPy Documentation - Official documentation and guides
- DSPy GitHub Repository - Source code and examples
- DSPy Research Paper - “DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines”
- DSPy Tutorial Series - Step-by-step learning path
Prompt Optimization
- GEPA Optimizer Documentation - Technical details on GEPA
- Chain-of-Thought Prompting - Foundational paper on CoT reasoning
- Automatic Prompt Engineering - “Large Language Models Are Human-Level Prompt Engineers”
- DSPy Optimizers Comparison - Overview of different optimization strategies
Mathematical Reasoning
- NuminaMath Dataset - The dataset used in this notebook
- GSM8K Dataset - Grade school math word problems benchmark
- MATH Dataset - Competition-level mathematics problems
- Mathematical Reasoning with LLMs - Survey of techniques
Related Techniques
- Few-Shot Learning - “Language Models are Few-Shot Learners” (GPT-3 paper)
- Self-Consistency - Improving reasoning via multiple sampling paths
- ReAct Prompting - Reasoning and Acting in language models
Tools and Platforms
- OpenRouter - Unified API for multiple LLM providers
- Hugging Face Datasets - Dataset loading and processing
- DSPy Optimizers Guide - Deep dive into optimization strategies