Show HN: AgentCheck – Snapshot and Replay AI Agents Like Real Software

4 months ago 6

agentcheck: Trace ⋅ Replay ⋅ Test your AI agents like real software.

AgentCheck is a minimal but complete toolkit for tracing, replaying, diffing, and testing AI agent executions. Think of it as version control and testing for your AI agents.

export OPENAI_API_KEY=sk-... # 1️⃣ Capture baseline trace python demo/demo_agent.py --output baseline.json # 2️⃣ Modify the prompt inside demo_agent.py (e.g. change tone) # 3️⃣ Replay with new code/model agentcheck replay baseline.json --output new.json # 4️⃣ See what changed agentcheck diff baseline.json new.json # 5️⃣ Assert the new output still mentions the user's name agentcheck assert new.json --contains "John Doe" # 🆕 6️⃣ Test deterministic behavior python demo/demo_deterministic.py

Or run the complete demo:

Feature Description CLI Command Python API

Trace	Capture agent execution (prompts, outputs, costs, timing)	agentcheck trace <command>	@agentcheck.trace()
Replay	Re-run trace against current code/model	agentcheck replay trace.json	agentcheck.replay_trace()
Diff	Compare traces and highlight changes	agentcheck diff trace_a.json trace_b.json	agentcheck.diff_traces()
Assert	Test trace contents (CI-friendly)	agentcheck assert trace.json --contains "foo"	agentcheck.assert_trace()
🆕 Deterministic Testing	Test behavioral consistency of non-deterministic agents	(Python API only)	@agentcheck.deterministic_replay()
🆕 Analytics Dashboard	Beautiful web GUI for trace analysis and testing insights	python launch_dashboard.py	Web interface

import agentcheck import openai @agentcheck.trace(output="my_trace.json") def my_agent(user_input: str) -> str: response = openai.chat.completions.create( model="gpt-4o-mini", messages=[{"role": "user", "content": user_input}] ) return response.choices[0].message.content # Automatically traces execution and saves to my_trace.json result = my_agent("Hello, world!")

🆕 Deterministic Replay Testing

The Problem: AI agents are non-deterministic - they produce different outputs for identical inputs, making traditional testing impossible.

The Solution: AgentCheck's deterministic replay testing learns your agent's behavioral patterns and detects when behavior changes unexpectedly.

import agentcheck import openai @agentcheck.deterministic_replay( consistency_threshold=0.8, # 80% behavioral consistency required baseline_runs=5, # Run 5 times to establish baseline baseline_name="my_agent" # Name for this baseline ) def my_agent(user_input: str) -> str: with agentcheck.trace() as trace: response = openai.chat.completions.create( model="gpt-4o-mini", messages=[{"role": "user", "content": user_input}], temperature=0.7 # Non-deterministic! ) # Record the LLM call trace.add_llm_call( messages=[{"role": "user", "content": user_input}], response={"content": response.choices[0].message.content}, model="gpt-4o-mini" ) return response.choices[0].message.content # Step 1: Establish behavioral baseline replayer = my_agent._deterministic_replayer test_inputs = ["What is Python?", "How do I install packages?"] replayer.establish_baseline( agent_func=my_agent, test_inputs=test_inputs, baseline_name="my_agent" ) # Step 2: Test current agent against baseline failures = replayer.test_consistency( agent_func=my_agent, test_inputs=test_inputs, baseline_name="my_agent" ) if failures: print(f"❌ {len(failures)} tests failed - agent behavior changed!") for failure in failures: print(f"Input: {failure.input_data}") print(f"Consistency Score: {failure.consistency_score:.3f}") else: print("✅ All tests passed - agent behavior is consistent!")

What it detects:

Changes in reasoning patterns
Different tool usage sequences
Altered response structures
Performance regressions
Error rate changes

Perfect for:

Regression testing after prompt changes
Model version upgrades
Code refactoring validation
CI/CD pipeline integration

Get beautiful insights into your agent performance with the built-in web dashboard:

# Launch the dashboard python launch_dashboard.py # Or manually with streamlit pip install streamlit plotly pandas numpy streamlit run agentcheck_dashboard.py

Dashboard Features:

📊 Overview: Key metrics, traces over time, model usage distribution
🔍 Trace Analysis: Detailed step-by-step execution analysis
🧪 Deterministic Testing: Baseline management and consistency trends
💰 Cost Analysis: Cost breakdowns by model and time periods

What you can track:

Total traces and execution costs
Error rates and failure patterns
LLM model usage and performance
Behavioral consistency trends
Cost optimization opportunities

The dashboard automatically loads data from your traces/ and baselines/ directories and provides real-time analytics as you develop and test your agents.

Tracing with Context Manager

import agentcheck with agentcheck.Trace(output="trace.json") as t: # Your agent code here messages = [{"role": "user", "content": "Hello"}] # Manually add LLM calls to trace response = openai.chat.completions.create( model="gpt-4o-mini", messages=messages ) t.add_llm_call( messages=messages, response={"content": response.choices[0].message.content, "usage": response.usage}, model="gpt-4o-mini" )

# Trace a Python script agentcheck trace "python my_agent.py" --output trace.json # Replay a trace with a different model agentcheck replay trace.json --model gpt-4 --output new_trace.json # Compare two traces agentcheck diff baseline.json new_trace.json # Assert trace contains expected content agentcheck assert trace.json --contains "expected output" # Assert with JSONPath agentcheck assert trace.json --jsonpath "$.steps[-1].output.content" --contains "John" # Assert cost and step constraints agentcheck assert trace.json --max-cost 0.05 --min-steps 1 --max-steps 10 # Pretty-print a trace agentcheck show trace.json

┌─────────────────┐ ┌──────────────┐ ┌─────────────────┐ │ Your Agent │───▶│ agentcheck │───▶│ trace.json │ │ │ │ tracer │ │ │ └─────────────────┘ └──────────────┘ └─────────────────┘ │ ┌─────────────────────┼─────────────────────┐ ▼ ▼ ▼ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │ replay │ │ diff │ │ assert │ │ (re-execute) │ │ (compare) │ │ (test) │ └─────────────────┘ └─────────────────┘ └─────────────────┘ │ ┌─────────────────────┐ │ 🆕 deterministic │ │ behavioral testing │ └─────────────────────┘

AgentCheck uses a standardized JSON schema for traces:

{ "trace_id": "uuid", "version": "1.0", "start_time": "2024-01-01T12:00:00Z", "end_time": "2024-01-01T12:00:05Z", "metadata": { "total_cost": 0.0023, "function_name": "my_agent" }, "steps": [ { "step_id": "uuid", "type": "llm_call", "start_time": "2024-01-01T12:00:01Z", "end_time": "2024-01-01T12:00:04Z", "input": { "messages": [...], "model": "gpt-4o-mini" }, "output": { "content": "Agent response...", "usage": {"prompt_tokens": 10, "completion_tokens": 20}, "cost": 0.0023 } } ] }

🧪 Testing & CI Integration

AgentCheck is designed for CI/CD pipelines:

# Traditional trace testing agentcheck replay baseline_trace.json --output ci_trace.json agentcheck assert ci_trace.json --contains "expected behavior" --max-cost 0.10 # 🆕 Deterministic behavioral testing python -c " import agentcheck from my_agent import my_agent replayer = my_agent._deterministic_replayer test_inputs = ['test1', 'test2', 'test3'] failures = replayer.test_consistency( agent_func=my_agent, test_inputs=test_inputs, baseline_name='production' ) if failures: print(f'❌ {len(failures)} behavioral consistency tests failed') exit(1) else: print('✅ All behavioral tests passed') exit(0) " # Exit codes # 0 = success # 1 = assertion failed or error

# Install in development mode git clone https://github.com/agentcheck/agentcheck cd agentcheck pip install -e ".[dev]" # Run tests pytest # Format code ruff format . # Type check mypy agentcheck/

🔧 Core Framework Improvements

Enhanced Tracing & Observability

Multi-Agent Tracing: Support for complex agent orchestrations and conversations
Real-time Streaming: Live trace streaming for long-running agents
Custom Metrics: User-defined KPIs and business metrics tracking
Performance Profiling: Detailed timing analysis and bottleneck detection
Memory Usage Tracking: Monitor agent memory consumption and optimization

Advanced Testing Capabilities

Property-Based Testing: Generate test cases automatically based on agent specifications
Mutation Testing: Automatically modify prompts/code to test robustness
Load Testing: Concurrent agent execution testing with performance metrics
A/B Testing Framework: Built-in support for comparing agent variants
Regression Test Suite: Automated detection of performance and quality regressions

CI/CD Plugins: Native GitHub Actions, GitLab CI, Jenkins integrations
Database Backends: PostgreSQL, MongoDB support for large-scale trace storage
SSO & RBAC: Enterprise authentication and role-based access control
Audit Logging: Comprehensive audit trails for compliance requirements
API Gateway: REST/GraphQL APIs for enterprise system integration

Multi-Modal Agent Support

Vision Agent Testing: Image/video input tracing and consistency testing
Audio Agent Testing: Speech-to-text and text-to-speech agent validation
Document Processing: PDF, Word, Excel agent testing capabilities
Code Generation: Specialized testing for code-generating agents

Advanced Analytics & Insights

Predictive Analytics: ML-powered prediction of agent behavior changes
Anomaly Detection: Automatic detection of unusual agent behaviors
Cost Optimization: AI-powered recommendations for cost reduction
Quality Scoring: Automated quality assessment and improvement suggestions
Behavioral Clustering: Group similar agent behaviors for pattern analysis

IDE Extensions: VS Code, PyCharm plugins for inline testing
Interactive Debugging: Step-through debugging for agent executions
Visual Flow Builder: Drag-and-drop agent testing pipeline creation
Template Library: Pre-built testing templates for common agent patterns
Auto-Documentation: Generate testing documentation from traces

🚀 Platform & Infrastructure

AgentCheck Cloud: Hosted platform for teams and enterprises
Kubernetes Operator: Native Kubernetes deployment and scaling
Docker Compose: One-click local development environment
Serverless Support: AWS Lambda, Azure Functions, Google Cloud Functions
Edge Computing: Testing for edge-deployed agents

LangChain Integration: Native support for LangChain agents and chains
AutoGen Integration: Multi-agent conversation testing
CrewAI Integration: Specialized crew-based agent testing
Custom Framework Support: Plugin system for any agent framework

🏢 Enterprise Testing Standards

How AgentCheck Achieves Enterprise-Grade Testing

1. Compliance & Governance

# Regulatory compliance testing @agentcheck.compliance_test( standards=["SOX", "GDPR", "HIPAA"], audit_trail=True, data_retention_days=2555 # 7 years ) def financial_advisor_agent(query: str) -> str: # Agent implementation pass # Test for compliance violations failures = agentcheck.test_compliance( agent_func=financial_advisor_agent, test_cases=load_compliance_test_cases(), regulations=["financial_advice_disclosure", "data_privacy"] )

2. Quality Assurance Framework

# Multi-dimensional quality testing quality_metrics = agentcheck.QualityFramework([ agentcheck.AccuracyMetric(threshold=0.95), agentcheck.SafetyMetric(harmful_content_threshold=0.0), agentcheck.BiasMetric(demographic_fairness=True), agentcheck.LatencyMetric(max_response_time_ms=2000), agentcheck.CostMetric(max_cost_per_request=0.10), agentcheck.ConsistencyMetric(behavioral_threshold=0.85) ]) # Enterprise-grade testing pipeline test_results = quality_metrics.evaluate( agent_func=my_agent, test_dataset=enterprise_test_dataset, environments=["staging", "production"] )

3. Security & Safety Testing

# Comprehensive security testing security_tests = agentcheck.SecurityTestSuite([ agentcheck.PromptInjectionTest(), agentcheck.DataLeakageTest(), agentcheck.AdversarialInputTest(), agentcheck.AuthorizationTest(), agentcheck.PIIDetectionTest() ]) # Red team testing red_team_results = security_tests.run_red_team_scenarios( agent_func=my_agent, attack_vectors=["jailbreaking", "data_extraction", "privilege_escalation"] )

4. Performance & Scalability Testing

# Load testing with realistic scenarios load_test = agentcheck.LoadTest( concurrent_users=1000, ramp_up_time=300, # 5 minutes test_duration=3600, # 1 hour realistic_user_behavior=True ) performance_results = load_test.run( agent_func=my_agent, user_scenarios=enterprise_user_scenarios ) # SLA validation sla_compliance = agentcheck.validate_sla( results=performance_results, requirements={ "p95_latency_ms": 1500, "availability_percent": 99.9, "error_rate_percent": 0.1, "throughput_rps": 100 } )

5. Continuous Monitoring & Alerting

# Production monitoring monitor = agentcheck.ProductionMonitor( alert_channels=["slack", "email", "pagerduty"], thresholds={ "error_rate": 0.01, # 1% error rate "latency_p99": 3000, # 3 second P99 latency "cost_per_hour": 50.0, # $50/hour cost limit "behavioral_drift": 0.2 # 20% behavior change } ) # Real-time alerts monitor.start_monitoring( agent_func=my_agent, baseline_name="production_v1.0" )

Enterprise Implementation Checklist

Behavioral Consistency: ≥85% consistency across test runs
Performance SLAs: P95 latency <2s, 99.9% availability
Cost Controls: Automated cost monitoring and alerts
Security Validation: Regular red team testing and vulnerability scans
Compliance Testing: Automated regulatory compliance validation
Quality Gates: Multi-stage testing pipeline with approval gates

📊 Monitoring & Observability

Real-time Dashboards: Executive and operational dashboards
Automated Alerting: PagerDuty/Slack integration for critical issues
Audit Trails: Complete audit logs for all agent interactions
Performance Baselines: Established performance benchmarks
Business Metrics: Custom KPIs aligned with business objectives

Access Controls: Role-based access to testing and monitoring
Data Protection: Encryption at rest and in transit
Incident Response: Automated incident detection and response
Change Management: Controlled deployment with rollback capabilities
Documentation: Comprehensive testing and operational documentation

ROI Metrics for Enterprise Adoption

Risk Reduction:

90% reduction in production agent failures
75% faster incident detection and resolution
60% reduction in compliance violations

Cost Optimization:

40% reduction in LLM API costs through optimization
50% reduction in manual testing effort
30% faster time-to-market for new agent features

Quality Improvement:

95% improvement in agent response consistency
80% reduction in customer complaints
99.9% uptime achievement for critical agent services

MIT License - see LICENSE file.

Contributions welcome! Please see CONTRIBUTING.md for guidelines.

Built for the era of AI agents 🤖✨

Read Entire Article