Tired of shipping Gen AI features based on gut feelings and vibes?
SigmaEval is a Python framework for the statistical evaluation of Gen AI apps, agents, and bots that helps you move from "it seems to work" to making statistically rigorous statements about your AI's quality. It allows you to set and enforce objective quality bars by making statements like:
"We are confident that at least 90% of user issues coming into our customer support chatbot will be resolved with a quality score of 8/10 or higher."
"With a high degree of confidence, the median response time of our new AI-proposal generator will be lower than our 5-second SLO."
"For our internal HR bot, we confirmed that it will likely succeed in answering benefits-related questions in fewer than 4 turns in a typical conversation."
Testing Gen AI apps is challenging due to their non-deterministic outputs and the infinite space of possible user inputs. SigmaEval addresses this by replacing simple pass/fail checks with statistical evaluation. It uses an AI User Simulator to test your app against a wide variety of inputs, and then applies statistical methods to quantify your AI's performance with confidence. This is like a clinical drug trial: the goal isn't to guarantee a specific outcome for every individual but to ensure the treatment is effective for a significant portion of the population (within a certain risk tolerance).
This process transforms subjective assessments into quantitative, data-driven conclusions, giving you a reliable framework for building high-quality AI apps.
At its core, SigmaEval uses two AI agents to automate evaluation: an AI User Simulator that realistically tests your application, and an AI Judge that scores its performance. The process is as follows:
- Define "Good": You start by defining a test scenario in plain language, including the user's goal and a clear description of the successful outcome you expect. This becomes your objective quality bar.
- Simulate and Collect Data: The AI User Simulator acts as a test user, interacting with your application based on your scenario. It runs these interactions many times to collect a robust dataset of conversations.
- Judge and Analyze: The AI Judge scores each conversation against your definition of success. SigmaEval then applies statistical methods to these scores to determine if your quality bar has been met with a specified level of confidence.
Install the latest release from PyPI with pip install sigmaeval-framework, or install from source by cloning the repository and installing it with pip.
Here is a minimal, complete example of how to use SigmaEval. First, run pip install sigmaeval-framework and set an environment variable with an API key for your chosen model (e.g., GEMINI_API_KEY or OPENAI_API_KEY). SigmaEval uses LiteLLM to support 100+ LLM providers.
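The script below is a minimal sketch of such an example, based on the builder API and assertions described later in this README; the import path, the app_handler signature, and the exact evaluate() call shape are assumptions, so check the API Reference for the authoritative signatures.

```python
from sigmaeval import SigmaEval, ScenarioTest, assertions  # import path assumed

def app_handler(message, state=None):
    # Your application under test. The handler signature here is illustrative:
    # it receives the simulated user's message and returns your app's reply.
    return "Hello! How can I help you today?"

scenario = (
    ScenarioTest("Friendly greeting")
    .given("A first-time user opening a chat with the assistant")
    .when("The user says 'Hello'")
    .expect_behavior(
        "The assistant replies with a friendly, relevant greeting",
        criteria=assertions.scores.proportion_gte(min_score=7, proportion=0.75),
    )
    .sample_size(20)
)

sigma_eval = SigmaEval(
    judge_model="gemini/gemini-2.5-flash",
    significance_level=0.05,
)

result = sigma_eval.evaluate(scenario, app_handler)  # call shape assumed
assert result.passed
```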
When you run this script, SigmaEval will:
- Generate a Rubric: Based on the expect_behavior, it will create a 1-10 scoring rubric for the Judge LLM.
- Simulate Conversations: It will call your app_handler 20 times (sample_size=20), each time simulating a user saying "Hello".
- Judge the Responses: For each of the 20 conversations, the judge_model will score your app's response against the rubric.
- Perform Statistical Analysis: SigmaEval will then run a hypothesis test to determine if it can be concluded, with 95% confidence (significance_level=0.05), that at least 75% of the responses scored a 7 or higher.
- Determine Pass/Fail: The script will exit with a pass or fail status based on the final assertion.
- Installation
- Hello World
- Core Concepts
- Supported LLMs
- API Reference
- Guides
- Appendix
- Development
- License
- Contributing
SigmaEval combines inferential statistics, AI-driven user simulation, and LLM-as-a-Judge evaluation. This powerful combination allows you to move beyond simple pass/fail tests and gain statistical confidence in your AI's performance.
Each scenario is defined using a ScenarioTest object with a fluent builder API. The test has three main parts that follow the familiar Given-When-Then pattern:
- .given(): This method establishes the prerequisite state and context for the User Simulator LLM. This can include the persona of the user (e.g., a new user, an expert user), the context of the conversation (e.g., a customer's order number), or any other background information.
- .when(): This method describes the specific goal or action the User Simulator LLM will try to achieve. SigmaEval uses this to guide the simulation.
- .expect_behavior() / .expect_metric(): These methods (the "Then" part of the pattern) specify the expected outcomes. Use .expect_behavior() for qualitative checks evaluated by an LLM judge, or .expect_metric() for quantitative checks on objective metrics. Both methods accept criteria to perform the statistical analysis.
This approach allows for a robust, automated evaluation of the AI's behavior against clear, human-readable standards. The full evaluation process for a ScenarioTest unfolds in three main phases: Test Setup, Data Collection, and Statistical Analysis.
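As an illustration, a ScenarioTest for a support bot that should be able to describe its own capabilities might look like the sketch below (the import path, the criteria parameter name, and the metric attribute name are assumptions based on the API described in this section):

```python
from sigmaeval import ScenarioTest, assertions, metrics  # import path assumed

scenario = (
    ScenarioTest("Bot explains its capabilities")
    .given("A new user who has never interacted with the support bot before")
    .when("The user asks what the bot can help them with")
    .expect_behavior(
        "The bot clearly and concisely lists all of the functions it supports",
        criteria=assertions.scores.proportion_gte(min_score=8, proportion=0.75),
    )
    .expect_metric(
        metrics.per_turn.response_latency,  # metric attribute name assumed
        criteria=assertions.metrics.proportion_lt(threshold=1.5, proportion=0.95),
    )
)
```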
SigmaEval provides different statistical criteria to evaluate your AI's performance based on the 1-10 scores from the Judge LLM or on objective metrics. You can choose the one that best fits your scenario. All assertions are available under the assertions object.
All statistical tests require a significance_level (alpha), which can be provided to the SigmaEval constructor as a default, or on a per-assertion basis. This value, typically set to 0.05, represents the probability of rejecting the null hypothesis when it is actually true (a Type I error).
This criterion helps you answer the question: "Is my AI's performance good enough, most of the time?"
It performs a one-sided hypothesis test to verify that a desired proportion of your app's responses meet a minimum quality bar.
Specifically, it checks if there is enough statistical evidence to conclude that the true proportion of scores greater than or equal to your min_score is at least your specified proportion.
This is useful for setting quality targets. For example, assertions.scores.proportion_gte(min_score=8, proportion=0.75) lets you test the hypothesis: "Are at least 75% of our responses scoring an 8 or higher?". The test passes if the collected data supports this claim with statistical confidence.
This criterion helps you answer the question: "Is the typical user experience good?" The median represents the middle-of-the-road experience, so this test is robust to a few unusually bad outcomes.
It performs a one-sided bootstrap hypothesis test to determine if the true median score is statistically greater than or equal to your threshold. Because the median is the 50th percentile, passing this test means you can be confident that at least half of all responses will meet the quality bar.
This is particularly useful for subjective qualities like helpfulness or tone. For example, assertions.scores.median_gte(threshold=8.0) tests the hypothesis: "Is the typical score at least an 8?".
This criterion is used for "lower is better" metrics like response latency. It performs a one-sided hypothesis test to verify that a desired proportion of your app's responses are fast enough.
Specifically, it checks if there is enough statistical evidence to conclude that the true proportion of metric values less than your threshold is at least your specified proportion.
This is useful for setting performance targets (e.g., Service Level Objectives). For example, assertions.metrics.proportion_lt(threshold=1.5, proportion=0.95) lets you test the hypothesis: "Are at least 95% of our responses faster than 1.5 seconds?". The test passes if the collected data supports this claim with statistical confidence.
This criterion helps you answer the question: "Is the typical performance efficient?" for "lower is better" metrics like latency or turn count. The median is robust to a few unusually slow or long-running outcomes.
It performs a one-sided bootstrap hypothesis test to determine if the true median of a metric is statistically lower than your threshold.
This is useful for evaluating the typical efficiency of your system. For example, when applied to turn count, assertions.metrics.median_lt(threshold=3.0) tests the hypothesis: "Does a typical conversation wrap up in fewer than 3 turns?".
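Taken together, the four criteria look like this in code (argument names are the ones used in the examples above; the import path is assumed):

```python
from sigmaeval import assertions  # import path assumed

# Qualitative criteria on the 1-10 judge scores
assertions.scores.proportion_gte(min_score=8, proportion=0.75)  # at least 75% score 8 or higher
assertions.scores.median_gte(threshold=8.0)                     # the typical score is at least 8

# Quantitative, "lower is better" criteria on objective metrics
assertions.metrics.proportion_lt(threshold=1.5, proportion=0.95)  # 95% of responses under 1.5 s
assertions.metrics.median_lt(threshold=3.0)                       # typical conversation under 3 turns
```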
SigmaEval provides several built-in metrics to measure objective, quantitative aspects of your AI's performance. All metrics are available under the metrics object and are namespaced by their scope: per_turn or per_conversation.
- Per-Turn Metrics: Collected for each assistant response within a conversation.
- Per-Conversation Metrics: Collected once for the entire conversation.
- Description: Measures the time (in seconds) between the application receiving a user's message and sending its response.
- Scope: Per-Turn
- Use Case: Ensuring the application feels responsive and meets performance requirements (e.g., "95% of responses should be under 1.5 seconds").
- Description: The number of characters in an assistant's response.
- Scope: Per-Turn
- Use Case: Enforcing conciseness in individual responses to prevent overly long messages (e.g., "90% of responses must be under 1000 characters").
- Description: The total number of assistant responses in a conversation.
- Scope: Per-Conversation
- Use Case: Measuring the efficiency of the AI. A lower turn count to resolve an issue is often better (e.g., "The median conversation should be less than 4 turns").
- Description: The total time (in seconds) the assistant spent processing responses for the entire conversation. This is the sum of all response latencies.
- Scope: Per-Conversation
- Use Case: Evaluating the total computational effort of the assistant over a conversation, useful for monitoring cost and overall performance.
- Description: The total number of characters in all of the assistant's responses in a conversation.
- Scope: Per-Conversation
- Use Case: Measuring the overall verbosity of the assistant. This is useful for ensuring that the total amount of text a user has to read is not excessive.
SigmaEval is agnostic to the specific model/provider used by the application under test. For the LLM-as-a-Judge component, SigmaEval uses the LiteLLM library under the hood, which provides a unified interface to many providers and models (OpenAI, Anthropic, Google, Ollama, etc.).
Each SigmaEval run performs multiple LLM calls for rubric generation, user simulation, and judging, which has direct cost implications based on the models and sample_size you choose. While thorough evaluation requires investment, you can manage costs effectively:
- Use Different Models for Different Roles: The quality of the judge is critical for reliable scores, so it's best to use a relatively powerful model (e.g., openai/gpt-5-mini or gemini/gemini-2.5-flash) for the judge_model. The user simulation, however, is often less demanding. You can use a smaller, faster, and cheaper model (e.g., openai/gpt-5-nano, gemini/gemini-2.5-flash-lite, or a local model like ollama/llama2) for the user_simulator_model to significantly reduce costs without compromising the quality of the evaluation.

  ```python
  sigma_eval = SigmaEval(
      judge_model="gemini/gemini-2.5-flash",
      user_simulator_model="gemini/gemini-2.5-flash-lite",
      # ... other settings
  )
  ```

- Start with a Small sample_size: During iterative development and debugging, use a small sample_size (e.g., 5-10) to get a quick signal on performance. This allows you to fail fast and fix issues without incurring high costs. Once you are ready for a final, statistically rigorous validation (e.g., before a release), you can increase the sample_size to a larger number (e.g., 30-100+) to achieve higher statistical confidence.
Ultimately, the cost of evaluation should be seen as a trade-off. A small investment in automated, statistical evaluation can prevent the much higher costs associated with shipping a low-quality, unreliable AI product.
To better address the "infinite input space" problem, SigmaEval's user simulator can be configured to adopt a wide variety of writing styles. This feature helps ensure your application is robust to the many ways real users communicate.
By default, for each of the sample_size evaluation runs, the user simulator will randomly adopt a different writing style by combining four independent axes:
- proficiency: The user's grasp of grammar and vocabulary (e.g., "Middle-school level," "Flawless grammar and sophisticated vocabulary").
- tone: The user's emotional disposition (e.g., "Polite and friendly," "Impatient and slightly frustrated").
- verbosity: The length and detail of the user's messages (e.g., "Terse and to-the-point," "Verbose and descriptive").
- formality: The user's adherence to formal language conventions (e.g., "Formal and professional," "Casual with slang").
This behavior is on by default and can be configured or disabled via the WritingStyleConfig object passed to the SigmaEval constructor.
This system ensures that the Given (persona) and When (goal) clauses of your ScenarioTest are always prioritized. The writing style adds a layer of realistic, stylistic variation without overriding the core of the test scenario.
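As a sketch, disabling or tuning this behavior might look like the following; the constructor parameter name and the WritingStyleConfig field shown here are hypothetical placeholders, so consult the API Reference for the actual options:

```python
from sigmaeval import SigmaEval, WritingStyleConfig  # import path assumed

sigma_eval = SigmaEval(
    judge_model="gemini/gemini-2.5-flash",
    # Both the parameter name and the field below are illustrative placeholders.
    writing_style_config=WritingStyleConfig(enabled=False),
)
```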
SigmaEval uses Python's standard logging module to provide visibility into the evaluation process. You can control the verbosity by passing a log_level to the SigmaEval constructor.
- logging.INFO (default): Provides a high-level overview, including a progress bar for data collection.
- logging.DEBUG: Offers detailed output for troubleshooting, including LLM prompts, conversation transcripts, and judge's reasoning.
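For example, to turn on detailed troubleshooting output (the import path is assumed; the log_level parameter is the one described above):

```python
import logging
from sigmaeval import SigmaEval  # import path assumed

sigma_eval = SigmaEval(
    judge_model="gemini/gemini-2.5-flash",
    log_level=logging.DEBUG,  # use logging.INFO (the default) for a quieter run
)
```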
To improve robustness against transient network or API issues, SigmaEval automatically retries failed LLM calls using an exponential backoff strategy (powered by the Tenacity library). This also includes retries for malformed or unparsable LLM responses. This applies to rubric generation, user simulation, and judging calls.
The retry behavior can be customized by passing a RetryConfig object to the SigmaEval constructor. If no configuration is provided, default settings are used.
You can also run a full suite of tests by passing a list of ScenarioTest objects to the evaluate method. The tests will be run concurrently.
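For instance, continuing the earlier example (the exact evaluate() signature is assumed):

```python
# scenario_a, scenario_b, and scenario_c are ScenarioTest objects built as shown above.
results = sigma_eval.evaluate([scenario_a, scenario_b, scenario_c], app_handler)  # call shape assumed
assert all(r.passed for r in results)
```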
For more comprehensive validation, SigmaEval supports testing multiple conditions and assertions within a single ScenarioTest. This allows you to check for complex behaviors and verify multiple statistical properties in an efficient manner.
You can call .expect_behavior() or .expect_metric() multiple times on a ScenarioTest to add multiple expectations. The test will only pass if all expectations are met. Each expectation is evaluated independently (behavioral expectations get their own rubric), but they all share the same sample_size. This is useful for testing complex behaviors that have multiple success criteria.
For efficiency, the user simulation is run only once to generate a single set of conversations. This same set of conversations is then judged against each expectation, making this approach ideal for evaluating multiple facets of a single interaction. When using multiple expectations, you can provide an optional label to each one to easily identify it in the results.
You can also specify a list of criteria in a single .expect_behavior() or .expect_metric() call. The test will only pass if all assertions are met. This is useful for checking multiple statistical properties of the same set of scores or metric values.
For efficiency, the user simulation and judging are run only once to generate a single set of scores. This same set of scores is then evaluated against each criterion.
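A sketch combining both patterns is shown below; the import path, the label parameter name, and the metric attribute name are assumptions:

```python
from sigmaeval import ScenarioTest, assertions, metrics  # import path assumed

scenario = (
    ScenarioTest("Refund request handling")
    .given("A customer who wants to return a damaged item")
    .when("The user asks how to get a refund")
    # Multiple expectations: each gets its own rubric, all share one set of conversations.
    .expect_behavior(
        "The bot explains the refund process accurately",
        criteria=assertions.scores.proportion_gte(min_score=7, proportion=0.8),
        label="accuracy",  # parameter name assumed
    )
    .expect_behavior(
        "The bot maintains a polite, empathetic tone",
        criteria=assertions.scores.median_gte(threshold=8.0),
        label="tone",
    )
    # Multiple criteria in a single call: the same metric values are checked against each one.
    .expect_metric(
        metrics.per_turn.response_latency,  # metric attribute name assumed
        criteria=[
            assertions.metrics.proportion_lt(threshold=1.5, proportion=0.95),
            assertions.metrics.median_lt(threshold=0.8),
        ],
    )
)
```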
The evaluate method returns a ScenarioTestResult object (or a list of them) that contains all the information about the test run.
For a quick check, you can inspect the passed property:
Printing the result object provides a comprehensive, human-readable summary of the outcomes, which is ideal for logs:
For more detailed programmatic analysis, the object gives you full access to the nested expectation_results (including scores and reasoning) and the complete conversations list.
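Continuing the Hello World example, a sketch of inspecting the result looks like this (attribute names follow the description above; the exact shapes of the nested objects may differ):

```python
result = sigma_eval.evaluate(scenario, app_handler)  # call shape assumed

if not result.passed:
    print(result)  # human-readable summary of every expectation and assertion

for expectation in result.expectation_results:
    ...  # inspect scores, judge reasoning, and per-assertion outcomes here

for conversation in result.conversations:
    ...  # full simulated transcripts collected during data collection
```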
SigmaEval is designed to integrate seamlessly with standard Python testing libraries like pytest and unittest. Since the evaluate method returns a result object with a simple .passed boolean property, you can easily use it within your existing test suites.
Here's an example of how to use SigmaEval with pytest:
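The sketch below shows one way this could look; the import path, the criteria parameter name, and the evaluate() call shape are assumptions, and app_handler stands in for your own application callback (the module it is imported from here is hypothetical):

```python
from sigmaeval import SigmaEval, ScenarioTest, assertions  # import path assumed

from myapp.testing import app_handler  # hypothetical module exposing your app callback

def test_bot_greets_new_users():
    sigma_eval = SigmaEval(
        judge_model="gemini/gemini-2.5-flash",
        significance_level=0.05,
    )
    scenario = (
        ScenarioTest("Friendly greeting")
        .given("A first-time user opening a chat with the assistant")
        .when("The user says 'Hello'")
        .expect_behavior(
            "The assistant replies with a friendly, relevant greeting",
            criteria=assertions.scores.proportion_gte(min_score=7, proportion=0.75),
        )
        .sample_size(20)
    )

    result = sigma_eval.evaluate(scenario, app_handler)  # call shape assumed
    assert result.passed
```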
This allows you to incorporate rigorous, statistical evaluation of your AI's behavior directly into your CI/CD pipelines.
The sample_size determines the number of conversations to simulate for each ScenarioTest. It can be set globally in the SigmaEval constructor or on a per-scenario basis using the .sample_size() method. The scenario-specific value takes precedence.
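For example (the import path is assumed; the global default and the per-scenario override follow the precedence rule above):

```python
from sigmaeval import SigmaEval, ScenarioTest  # import path assumed

# Global default, applied to every ScenarioTest in the run.
sigma_eval = SigmaEval(judge_model="gemini/gemini-2.5-flash", sample_size=30)

# Per-scenario override; this value takes precedence over the global default.
scenario = (
    ScenarioTest("Quick smoke check")
    # ... .given(), .when(), and .expect_behavior() as usual ...
    .sample_size(10)
)
```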
It is important to note that the sample_size plays a crucial role in the outcome of the hypothesis tests used in criteria like assertions.scores.proportion_gte. A larger sample size provides more statistical evidence, making it easier to detect a true effect. With very small sample sizes (e.g., fewer than 10), a test might not reach statistical significance, and therefore might not pass, even if the observed success rate in the sample is 100%. This is the expected and correct behavior: there simply isn't enough data to confidently conclude that the true success rate for the entire user population is above the minimum threshold.
To ensure robust and reliable conclusions, SigmaEval uses established statistical hypothesis tests tailored to the type of evaluation being performed.
- For Proportion-Based Criteria (e.g., proportion_gte): The framework employs a one-sided binomial test. This test is ideal for scenarios where each data point can be classified as a binary outcome (e.g., "success" or "failure," like a score being above or below a threshold). It directly evaluates whether the observed proportion of successes in your sample provides enough statistical evidence to conclude that the true proportion for all possible interactions meets your specified minimum target.
- For Median-Based Criteria (e.g., median_gte): The framework uses a bootstrap hypothesis test. The median is a robust measure of central tendency, but its theoretical sampling distribution can be complex. Bootstrapping is a powerful, non-parametric resampling method that avoids making assumptions about the underlying distribution of the scores or metric values. By repeatedly resampling the collected data, it constructs an empirical distribution of the median, which is then used to determine if the observed median provides statistically significant evidence for the hypothesis.
This approach ensures that the framework's conclusions are statistically sound without imposing rigid assumptions on the nature of your AI's performance data.
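To make the two approaches concrete, here is a small, standalone illustration of the underlying statistical ideas using NumPy and SciPy. This is not SigmaEval's internal code, and the bootstrap formulation shown is just one common variant:

```python
import numpy as np
from scipy.stats import binomtest

rng = np.random.default_rng(0)
scores = np.array([8, 9, 7, 8, 6, 9, 8, 7, 9, 8, 7, 8, 9, 6, 8, 9, 8, 7, 8, 9])

# Proportion-based criterion, e.g. "at least 75% of scores are >= 7":
# a one-sided binomial test on the number of successes in the sample.
successes = int((scores >= 7).sum())
p_value = binomtest(successes, n=len(scores), p=0.75, alternative="greater").pvalue
print("binomial test p-value:", p_value)  # the criterion passes if p_value < significance_level

# Median-based criterion, e.g. "the true median is >= 8": resample the data many
# times and look at where the bootstrap medians fall relative to the threshold.
threshold = 8.0
boot_medians = np.array([
    np.median(rng.choice(scores, size=len(scores), replace=True))
    for _ in range(10_000)
])
# The fraction of bootstrap medians below the threshold serves as an approximate
# p-value against the claim "median >= threshold"; small values support the claim.
p_value_boot = float((boot_medians < threshold).mean())
print("bootstrap p-value (approx.):", p_value_boot)
```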
For the ScenarioTest defined in the "Core Concepts" section, SigmaEval might generate the following 1-10 rubric for the Judge LLM:
1: Bot gives no answer or ignores the question.
2: Bot answers irrelevantly, with no mention of its functions.
3: Bot gives vague or incomplete information, missing most functions.
4: Bot names one correct function but misses the rest.
5: Bot names some functions but omits key ones or adds irrelevant ones.
6: Bot names most functions but in unclear or confusing language.
7: Bot names all required functions but with weak clarity or order.
8: Bot names all required functions clearly but without polish or flow.
9: Bot names all required functions clearly, concisely, and in a logical order.
10: Bot names all required functions clearly, concisely, in order, and with natural, helpful phrasing.
Install development dependencies:
Run tests:
Format code:
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
Contributions are welcome! Please feel free to submit a Pull Request.