A cost forecasting tool for LLM API calls, implementing research-based prediction algorithms to estimate token usage and costs before execution.
This project addresses the challenge of unpredictable LLM API costs in large-scale applications. The estimates are not exact, but they advance preflight cost estimation by applying insights from recent academic research on LLM response length prediction.
The tool implements a three-tier prediction system that combines heuristic analysis, statistical modeling, and research-informed algorithms to provide cost estimates with confidence intervals.
The implementation draws from several key papers:
- Response Length Perception and Sequence Scheduling (Zheng et al., 2023) - 86% throughput improvement through length perception
- Emergent Response Planning in LLMs (Dong et al., ICML 2025) - Hidden state encoding of global response attributes
- Precise Length Control in Large Language Models (Butcher et al., 2024) - LDPE achieving <3 token precision
- Zero-Shot Strategies for Length-Controllable Summarization (Retkowski & Waibel, NAACL 2025) - Length approximation strategies
| Component | Purpose | Technology |
|---|---|---|
| Template Sampler | Generate prompt variations | Jinja templates + CSV/JSON data |
| Tokenizer Engine | Count tokens accurately | tiktoken with model-specific encoders |
| Prediction Engine | Estimate completion length | 3-tier cascade system |
| Statistical Analysis | Quantify uncertainty | Bootstrap confidence intervals |
| Pricing Engine | Calculate costs | Multi-provider pricing with auto-updates |
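A minimal sketch of how these components could fit together. All names here are illustrative, not the project's actual API; a crude whitespace tokenizer stands in for tiktoken, and the prediction step is a placeholder for the 3-tier cascade:

```python
from dataclasses import dataclass

@dataclass
class Estimate:
    prompt_tokens: int
    predicted_completion_tokens: int
    cost_usd: float

def count_tokens(text: str) -> int:
    # Stand-in for tiktoken's model-specific encoders: a whitespace
    # split, good enough to show the data flow end to end.
    return len(text.split())

def predict_completion(prompt_tokens: int) -> int:
    # Placeholder for the 3-tier cascade: assume the completion
    # is roughly twice the prompt length (an arbitrary guess).
    return 2 * prompt_tokens

def estimate_cost(prompt: str, in_price: float, out_price: float) -> Estimate:
    """Prices are USD per 1M tokens (illustrative values)."""
    p = count_tokens(prompt)
    c = predict_completion(p)
    cost = (p * in_price + c * out_price) / 1_000_000
    return Estimate(p, c, cost)

est = estimate_cost("Summarize this document in one paragraph.", 2.50, 10.00)
print(est.prompt_tokens, est.predicted_completion_tokens)
```

In the real pipeline the tokenizer and predictor are pluggable, which is what lets each tier be enabled or disabled independently.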
Tier 1: Enhanced Heuristics
- Response type classification (8 categories)
- Length complexity analysis
- Controlled variance injection
Tier 2: Emergent Regression
- Multi-dimensional feature extraction
- L2-regularized optimization
- Historical data learning
Tier 3: Hidden State Analysis
- Global attribute encoding
- LDPE-inspired corrections
- Weighted confidence scoring
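The cascade above can be organized as a fall-through chain, where each tier returns an estimate only when it has enough signal. This is a hypothetical sketch (function names, the history threshold, and the length ratios are all assumptions, not the project's real logic):

```python
from typing import Callable, Optional

def tier3_hidden_state(prompt: str) -> Optional[float]:
    # Requires access to model hidden states; unavailable for
    # closed API models, so this tier frequently returns None.
    return None

def tier2_regression(prompt: str, history_size: int = 0) -> Optional[float]:
    # L2-regularized regression needs historical data to fit.
    if history_size < 30:  # illustrative minimum-sample threshold
        return None
    return 120.0  # a fitted prediction would go here

def tier1_heuristics(prompt: str) -> float:
    # Always available: scale by prompt length.
    # The 1.5 ratio is an illustrative guess, not a calibrated value.
    return 1.5 * len(prompt.split())

def predict_length(prompt: str) -> float:
    tiers: list[Callable[[str], Optional[float]]] = [
        tier3_hidden_state, tier2_regression, tier1_heuristics,
    ]
    for tier in tiers:
        est = tier(prompt)
        if est is not None:
            return est
    raise RuntimeError("tier1 should always produce an estimate")

print(predict_length("Explain bootstrap confidence intervals"))  # falls through to tier 1
```

The fall-through design means a prediction is always produced, with the more research-informed tiers taking over as hidden-state access or history becomes available.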
- Variables: JSON object with fixed or list values
- CSV/JSON Files: Real data for template rendering
- Variable Lengths: Synthetic text generation by character count
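The fixed-or-list variable expansion could work like this. The project uses Jinja templates; `string.Template` stands in here to keep the sketch dependency-free:

```python
import itertools
from string import Template

def sample_prompts(template: str, variables: dict) -> list[str]:
    """Expand a template over the cross-product of list-valued variables.

    Scalar values are held fixed; list values generate one variant each.
    """
    keys = list(variables)
    value_lists = [v if isinstance(v, list) else [v] for v in variables.values()]
    return [
        Template(template).substitute(dict(zip(keys, combo)))
        for combo in itertools.product(*value_lists)
    ]

prompts = sample_prompts(
    "Summarize the $doc_type in $tone tone.",
    {"doc_type": "report", "tone": ["formal", "casual"]},
)
print(prompts)
# → ['Summarize the report in formal tone.', 'Summarize the report in casual tone.']
```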
- Accuracy Target: MAPE threshold (0.08-0.25)
- Confidence Level: Statistical confidence (0.95-0.999)
- Tier Selection: Enable/disable prediction methods
- Bootstrap Samples: Statistical robustness (200-1000)
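The MAPE accuracy target can be checked with a few lines (mean of |actual − predicted| / |actual| over the evaluation set):

```python
def mape(actual: list[float], predicted: list[float]) -> float:
    """Mean absolute percentage error; actual values must be non-zero."""
    return sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted)) / len(actual)

# A run meets the standard accuracy target when MAPE <= 0.25.
score = mape([100, 200, 400], [90, 230, 380])
print(round(score, 4))  # → 0.1
```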
Providers: OpenAI, Anthropic, Google
Auto-pricing: Weekly updates from vendor APIs
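The cost calculation itself is a simple per-token multiplication. The prices and model names below are placeholders for illustration; real values come from the auto-updated pricing cache:

```python
# Per-1M-token prices in USD. Placeholder numbers only — actual
# pricing is loaded from the auto-updated prices.yaml cache.
PRICES = {
    ("openai", "gpt-4o"): {"input": 2.50, "output": 10.00},
    ("anthropic", "claude-sonnet"): {"input": 3.00, "output": 15.00},
}

def call_cost(provider: str, model: str,
              prompt_tokens: int, completion_tokens: int) -> float:
    p = PRICES[(provider, model)]
    return (prompt_tokens * p["input"] + completion_tokens * p["output"]) / 1_000_000

print(f"${call_cost('openai', 'gpt-4o', 1200, 400):.6f}")
```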
| Metric | Target | Status |
|---|---|---|
| Processing Speed | <300ms for 1000 rows | ✅ |
| Accuracy (Standard) | ≤25% MAPE | ✅ |
| Accuracy (Enhanced) | ≤15% MAPE | ✅ |
| Precision Control | <3 token variance | ⚪ |
| Memory Usage | O(n) complexity | ✅ |
The tool maintains local caches in `~/.preflightllmcost/`:
- `prices.yaml` - Model pricing data (auto-updated)
- `history.db` - Historical usage for regression learning
- Bootstrap Confidence Intervals: Non-parametric estimation
- Multi-metric Validation: MAPE + variance stability
- Worst-case Analysis: Conservative μ + 2σ projections
- Adaptive Regression: Improves with historical data
- Feature Engineering: Multi-dimensional token relationships
- Correlation Thresholds: Quality-controlled model selection
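The bootstrap interval and the conservative μ + 2σ projection can be sketched with the standard library. The sample data, the 95% level, and the fixed seed below are illustrative choices:

```python
import random
import statistics

def bootstrap_ci(samples, n_boot=1000, level=0.95, seed=0):
    """Non-parametric confidence interval for the mean via bootstrap resampling."""
    rng = random.Random(seed)
    means = sorted(
        statistics.fmean(rng.choices(samples, k=len(samples)))
        for _ in range(n_boot)
    )
    lo = means[int((1 - level) / 2 * n_boot)]
    hi = means[int((1 + level) / 2 * n_boot) - 1]
    return lo, hi

# Observed completion lengths (tokens) from repeated sampling — made-up data.
lengths = [180, 210, 195, 240, 172, 205, 188, 230, 199, 215]
lo, hi = bootstrap_ci(lengths)
mu, sigma = statistics.fmean(lengths), statistics.stdev(lengths)
worst_case = mu + 2 * sigma  # conservative projection described above
print(f"95% CI: [{lo:.1f}, {hi:.1f}], worst case: {worst_case:.1f}")
```

Resampling with replacement avoids any normality assumption, which is why the interval is described as non-parametric.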
- Predictions are estimates based on statistical patterns
- Accuracy depends on prompt similarity to training patterns
- New model variants may require calibration period
- Complex reasoning tasks show higher variance
MIT License - See LICENSE file for details.


