Show HN: AgentCheck – Snapshot and Replay AI Agents Like Real Software
agentcheck: Trace ⋅ Replay ⋅ Test your AI agents like real software.
AgentCheck is a minimal but complete toolkit for tracing, replaying, diffing, and testing AI agent executions. Think of it as version control and testing for your AI agents.
export OPENAI_API_KEY=sk-...
# 1️⃣ Capture baseline trace
python demo/demo_agent.py --output baseline.json
# 2️⃣ Modify the prompt inside demo_agent.py (e.g. change tone)
# 3️⃣ Replay with new code/model
agentcheck replay baseline.json --output new.json
# 4️⃣ See what changed
agentcheck diff baseline.json new.json
# 5️⃣ Assert the new output still mentions the user's name
agentcheck assert new.json --contains "John Doe"
# 🆕 6️⃣ Test deterministic behavior
python demo/demo_deterministic.py
Test behavioral consistency of non-deterministic agents
(Python API only)
@agentcheck.deterministic_replay()
🆕 Analytics Dashboard
Beautiful web GUI for trace analysis and testing insights
python launch_dashboard.py
Web interface
import agentcheck
import openai

@agentcheck.trace(output="my_trace.json")
def my_agent(user_input: str) -> str:
    response = openai.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": user_input}]
    )
    return response.choices[0].message.content

# Automatically traces execution and saves to my_trace.json
result = my_agent("Hello, world!")
🆕 Deterministic Replay Testing
The Problem: AI agents are non-deterministic: they can produce different outputs for identical inputs, which makes traditional exact-match testing unreliable.
The Solution: AgentCheck's deterministic replay testing learns your agent's behavioral patterns and detects when behavior changes unexpectedly.
import agentcheck
import openai

@agentcheck.deterministic_replay(
    consistency_threshold=0.8,  # 80% behavioral consistency required
    baseline_runs=5,            # Run 5 times to establish baseline
    baseline_name="my_agent"    # Name for this baseline
)
def my_agent(user_input: str) -> str:
    with agentcheck.trace() as trace:
        response = openai.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": user_input}],
            temperature=0.7  # Non-deterministic!
        )
        # Record the LLM call
        trace.add_llm_call(
            messages=[{"role": "user", "content": user_input}],
            response={"content": response.choices[0].message.content},
            model="gpt-4o-mini"
        )
        return response.choices[0].message.content

# Step 1: Establish behavioral baseline
replayer = my_agent._deterministic_replayer
test_inputs = ["What is Python?", "How do I install packages?"]
replayer.establish_baseline(
    agent_func=my_agent,
    test_inputs=test_inputs,
    baseline_name="my_agent"
)

# Step 2: Test current agent against baseline
failures = replayer.test_consistency(
    agent_func=my_agent,
    test_inputs=test_inputs,
    baseline_name="my_agent"
)

if failures:
    print(f"❌ {len(failures)} tests failed - agent behavior changed!")
    for failure in failures:
        print(f"  Input: {failure.input_data}")
        print(f"  Consistency Score: {failure.consistency_score:.3f}")
else:
    print("✅ All tests passed - agent behavior is consistent!")
What it detects:
Changes in reasoning patterns
Different tool usage sequences (see the illustrative sketch after this list)
Altered response structures
Performance regressions
Error rate changes
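To make the consistency score concrete, here is an illustrative sketch of how tool-usage similarity between a baseline run and a current run could be scored. This is not AgentCheck's actual implementation; the tool sequences, field names, and threshold are assumptions:

from difflib import SequenceMatcher

# Hypothetical tool-call sequences extracted from two traces of the same input;
# AgentCheck's real trace schema and scoring may differ.
baseline_tools = ["search_docs", "summarize", "format_answer"]
current_tools = ["search_docs"]  # the summarize and format_answer steps disappeared

score = SequenceMatcher(None, baseline_tools, current_tools).ratio()
print(f"Tool-usage consistency: {score:.3f}")  # 0.500

if score < 0.8:  # mirrors the consistency_threshold in the example above
    print("Behavior changed: tool usage diverged from the baseline")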
Perfect for:
Regression testing after prompt changes
Model version upgrades
Code refactoring validation
CI/CD pipeline integration (see the pytest sketch below)
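For the CI/CD case, a plain pytest test can gate merges on behavioral consistency. This is a sketch built on the same Python API shown above (the import path is hypothetical), not an official integration:

# test_agent_consistency.py: a CI gate sketch using the API shown above.
# Assumes `my_agent` is decorated with @agentcheck.deterministic_replay and that
# a baseline named "my_agent" has already been established and committed.
from my_package.agent import my_agent  # hypothetical import path

TEST_INPUTS = ["What is Python?", "How do I install packages?"]

def test_agent_behavior_is_consistent():
    replayer = my_agent._deterministic_replayer
    failures = replayer.test_consistency(
        agent_func=my_agent,
        test_inputs=TEST_INPUTS,
        baseline_name="my_agent",
    )
    drifted = ", ".join(str(f.input_data) for f in failures)
    assert not failures, f"{len(failures)} inputs drifted from the baseline: {drifted}"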
Get beautiful insights into your agents' performance with the built-in web dashboard:
# Launch the dashboard
python launch_dashboard.py
# Or manually with streamlit
pip install streamlit plotly pandas numpy
streamlit run agentcheck_dashboard.py
Dashboard Features:
📊 Overview: Key metrics, traces over time, model usage distribution
🧪 Deterministic Testing: Baseline management and consistency trends
💰 Cost Analysis: Cost breakdowns by model and time periods
What you can track:
Total traces and execution costs
Error rates and failure patterns
LLM model usage and performance
Behavioral consistency trends
Cost optimization opportunities
The dashboard automatically loads data from your traces/ and baselines/ directories and provides real-time analytics as you develop and test your agents.
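If you want similar numbers outside the GUI, a rough cost roll-up over the traces/ directory might look like the sketch below; the per-trace field names used here ("llm_calls", "model", "cost_usd") are assumptions about the trace schema, not a documented format:

import json
from collections import defaultdict
from pathlib import Path

# Sum estimated spend per model across all saved traces.
# The trace fields below are assumed, not AgentCheck's documented schema.
costs_by_model: dict[str, float] = defaultdict(float)

for trace_file in Path("traces").glob("*.json"):
    trace = json.loads(trace_file.read_text())
    for call in trace.get("llm_calls", []):
        costs_by_model[call.get("model", "unknown")] += call.get("cost_usd", 0.0)

for model, cost in sorted(costs_by_model.items(), key=lambda kv: -kv[1]):
    print(f"{model:20s} ${cost:.4f}")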
# Install in development mode
git clone https://github.com/agentcheck/agentcheck
cd agentcheck
pip install -e ".[dev]"
# Run tests
pytest
# Format code
ruff format .
# Type check
mypy agentcheck/
🔧 Core Framework Improvements
Enhanced Tracing & Observability
Multi-Agent Tracing: Support for complex agent orchestrations and conversations
Real-time Streaming: Live trace streaming for long-running agents
Custom Metrics: User-defined KPIs and business metrics tracking
Performance Profiling: Detailed timing analysis and bottleneck detection
Memory Usage Tracking: Monitor agent memory consumption and optimization
Advanced Testing Capabilities
Property-Based Testing: Generate test cases automatically based on agent specifications
Mutation Testing: Automatically modify prompts/code to test robustness
Load Testing: Concurrent agent execution testing with performance metrics
A/B Testing Framework: Built-in support for comparing agent variants
Regression Test Suite: Automated detection of performance and quality regressions