Show HN: Open-source testing framework for AI agents with semantic validation


SemanticTest

A composable, pipeline-based testing framework for AI systems and APIs.
Build complex test scenarios using simple, reusable blocks with semantic validation.

npm install @blade47/semantic-test

Testing AI systems is hard. Responses are non-deterministic, you need to validate tool usage, and semantic meaning matters more than exact text matching.

SemanticTest solves this with:

  • Composable blocks for HTTP, parsing, validation, and AI evaluation
  • Pipeline architecture where data flows through named slots
  • LLM Judge to evaluate responses semantically using GPT-4
  • JSON test definitions that are readable and version-controllable
{
  "name": "API Test",
  "version": "1.0.0",
  "context": { "BASE_URL": "https://api.example.com" },
  "tests": [
    {
      "id": "get-user",
      "name": "Get User",
      "pipeline": [
        {
          "id": "request",
          "block": "HttpRequest",
          "input": { "url": "${BASE_URL}/users/1", "method": "GET" },
          "output": "response"
        },
        {
          "id": "parse",
          "block": "JsonParser",
          "input": "${response.body}",
          "output": "user"
        },
        {
          "id": "validate",
          "block": "ValidateContent",
          "input": { "from": "user.parsed.name", "as": "text" },
          "config": { "contains": "John" },
          "output": "validation"
        }
      ],
      "assertions": {
        "response.status": 200,
        "user.parsed.id": 1,
        "validation.passed": true
      }
    }
  ]
}

Tests are pipelines of blocks that execute in sequence:

HttpRequest → JsonParser → Validate → Assert

Each block:

  • Reads inputs from named slots
  • Does one thing well
  • Writes outputs to named slots
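The execution model above can be sketched as a plain loop over a shared slot map. This is a simplified illustration of the idea, not the library's actual implementation; the block shapes and helper names here are assumptions.

```javascript
// Sketch: run pipeline steps in sequence over a shared slot map.
// `blocks` maps block names to async functions (illustrative only).
async function runPipeline(steps, blocks) {
  const slots = {};
  for (const step of steps) {
    const run = blocks[step.block];               // look up the block implementation
    const output = await run(step.input, slots);  // each block does one thing
    slots[step.output ?? step.id] = output;       // default slot name is the block id
  }
  return slots;
}
```

Each step reads what it needs, produces one output, and the slot map carries state to the next block.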

Data flows through a DataBus with named slots:

{
  "pipeline": [
    {
      "id": "fetch",
      "block": "HttpRequest",
      "output": "response"          // Writes to 'response' slot
    },
    {
      "id": "parse",
      "block": "JsonParser",
      "input": "${response.body}",  // Reads from 'response.body'
      "output": "data"              // Writes to 'data' slot
    }
  ]
}

There are three ways to specify a block's input:

1. String - becomes { body: value }

"input": "${response.body}"

2. From/As - maps slot to parameter

"input": { "from": "response.body", "as": "text" }

3. Object - deep resolves all values

"input": {
  "url": "${BASE_URL}/api",
  "method": "POST",
  "headers": { "Authorization": "Bearer ${token}" }
}

A block's output can be mapped to slots in three ways:

1. String - stores entire output

2. Object - maps output fields to slots

"output": { "parsed": "data", "error": "parseError" }

3. Default - uses block ID

{ "id": "parse" }  // Output stored in 'parse' slot
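The `${...}` references above boil down to resolving dotted paths against the slot store. A minimal sketch of how that resolution might work (the helper names are illustrative, not the framework's internals):

```javascript
// Resolve a dotted path like "response.body" against the slot store.
function resolvePath(slots, path) {
  return path.split('.').reduce((obj, key) => (obj == null ? undefined : obj[key]), slots);
}

// Replace every ${...} reference in a string with its resolved slot value.
function interpolate(slots, template) {
  return template.replace(/\$\{([^}]+)\}/g, (_, path) => resolvePath(slots, path));
}
```

For example, `interpolate({ BASE_URL: "https://api.example.com" }, "${BASE_URL}/users/1")` yields `"https://api.example.com/users/1"`.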

HttpRequest - Make HTTP requests

{
  "block": "HttpRequest",
  "input": {
    "url": "https://api.example.com/users",
    "method": "POST",
    "headers": { "Authorization": "Bearer ${token}" },
    "body": { "name": "John Doe" },
    "timeout": 5000
  }
}

JsonParser - Parse JSON

{ "block": "JsonParser", "input": "${response.body}" }

StreamParser - Parse streaming responses

{
  "block": "StreamParser",
  "input": "${response.body}",
  "config": {
    "format": "sse-vercel"  // or "sse-openai", "sse"
  }
}

Outputs: text, toolCalls, chunks, metadata
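At its core, SSE parsing is line-oriented: the parser walks `data:` lines and accumulates payloads. A minimal sketch of the general shape (the real `sse-vercel` and `sse-openai` formats differ in payload structure; this only illustrates the idea):

```javascript
// Minimal sketch of parsing a raw SSE body into text chunks.
function parseSse(body) {
  const chunks = [];
  for (const line of body.split('\n')) {
    if (!line.startsWith('data:')) continue;   // only data fields carry payload
    const payload = line.slice(5).trim();
    if (payload === '[DONE]') break;           // common end-of-stream sentinel
    chunks.push(payload);
  }
  return { chunks, text: chunks.join('') };
}
```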

ValidateContent - Validate text

{
  "block": "ValidateContent",
  "input": { "from": "data.message", "as": "text" },
  "config": {
    "contains": ["success", "confirmed"],
    "notContains": ["error", "failed"],
    "minLength": 10,
    "maxLength": 1000,
    "matches": "^[A-Z].*"
  }
}

ValidateTools - Validate AI tool usage

{
  "block": "ValidateTools",
  "input": { "from": "parsed.toolCalls", "as": "toolCalls" },
  "config": {
    "expected": ["search_database", "send_email"],
    "forbidden": ["delete_all"],
    "order": ["search_database", "send_email"],
    "minTools": 1,
    "maxTools": 5,
    "validateArgs": {
      "send_email": { "to": "[email protected]" }
    }
  }
}

LLMJudge - Semantic evaluation with GPT-4

{
  "block": "LLMJudge",
  "input": {
    "text": "${response.text}",
    "toolCalls": "${response.toolCalls}",
    "expected": {
      "expectedBehavior": "Should greet the user and offer to help with their calendar"
    }
  },
  "config": {
    "model": "gpt-4o-mini",
    "criteria": { "accuracy": 0.4, "completeness": 0.3, "relevance": 0.3 }
  }
}

Returns: score (0-1), reasoning, shouldContinue, nextPrompt

Loop - Loop back to previous blocks

{ "block": "Loop", "config": { "target": "retry-request", "maxIterations": 3 } }

Organize multiple tests with shared setup/teardown:

{
  "name": "User API Tests",
  "version": "1.0.0",
  "context": {
    "BASE_URL": "${env.API_URL}",
    "API_KEY": "${env.API_KEY}"
  },
  "setup": [
    {
      "id": "auth",
      "block": "HttpRequest",
      "input": {
        "url": "${BASE_URL}/auth/login",
        "method": "POST",
        "body": { "username": "test", "password": "test123" }
      },
      "output": "auth"
    }
  ],
  "tests": [
    {
      "id": "create-user",
      "name": "Create User",
      "pipeline": [
        {
          "id": "request",
          "block": "HttpRequest",
          "input": {
            "url": "${BASE_URL}/users",
            "method": "POST",
            "headers": { "Authorization": "Bearer ${auth.body.token}" },
            "body": { "name": "Jane Doe" }
          },
          "output": "createResponse"
        }
      ],
      "assertions": { "createResponse.status": 201 }
    },
    {
      "id": "get-user",
      "name": "Get User",
      "pipeline": [
        {
          "id": "request",
          "block": "HttpRequest",
          "input": {
            "url": "${BASE_URL}/users/${createResponse.body.id}",
            "method": "GET",
            "headers": { "Authorization": "Bearer ${auth.body.token}" }
          },
          "output": "getResponse"
        }
      ],
      "assertions": {
        "getResponse.status": 200,
        "getResponse.body.name": "Jane Doe"
      }
    }
  ],
  "teardown": [
    {
      "id": "cleanup",
      "block": "HttpRequest",
      "input": {
        "url": "${BASE_URL}/users/${createResponse.body.id}",
        "method": "DELETE",
        "headers": { "Authorization": "Bearer ${auth.body.token}" }
      }
    }
  ]
}

Validate final results with operators:

{
  "assertions": {
    "response.status": 200,                     // Equality
    "data.count": { "gt": 10, "lt": 100 },      // Greater than / less than
    "data.message": { "contains": "success" },  // Contains
    "data.email": { "matches": ".*@.*\\.com" }  // Regex
  }
}

Note that operators on the same path combine into one object ("data.count" above checks both bounds); repeating a key in a JSON object would silently discard the first value.
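The operator semantics are straightforward: a plain value means strict equality, and an object applies each named comparison. A sketch of how such an assertion might be evaluated (illustrative only, not the framework's actual code):

```javascript
// Evaluate one asserted value: an operator object or a literal.
function checkAssertion(actual, expected) {
  if (expected !== null && typeof expected === 'object') {
    if ('gt' in expected && !(actual > expected.gt)) return false;
    if ('lt' in expected && !(actual < expected.lt)) return false;
    if ('contains' in expected && !String(actual).includes(expected.contains)) return false;
    if ('matches' in expected && !new RegExp(expected.matches).test(String(actual))) return false;
    return true;  // all present operators passed
  }
  return actual === expected;  // plain value: strict equality
}
```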

Use .env file:

API_URL=https://api.example.com
API_KEY=secret123
OPENAI_API_KEY=sk-...

Reference in tests:

{
  "context": {
    "BASE_URL": "${env.API_URL}",
    "API_KEY": "${env.API_KEY}"
  }
}
{
  "name": "AI Chat Tests",
  "context": {
    "CHAT_URL": "${env.CHAT_API_URL}",
    "API_KEY": "${env.API_KEY}"
  },
  "tests": [
    {
      "id": "chat-test",
      "name": "Chat with Tool Usage",
      "pipeline": [
        {
          "id": "chat",
          "block": "HttpRequest",
          "input": {
            "url": "${CHAT_URL}",
            "method": "POST",
            "headers": { "Authorization": "Bearer ${API_KEY}" },
            "body": {
              "messages": [
                { "role": "user", "content": "Search for users named John" }
              ]
            }
          },
          "output": "chatResponse"
        },
        {
          "id": "parse",
          "block": "StreamParser",
          "input": "${chatResponse.body}",
          "config": { "format": "sse-vercel" },
          "output": "parsed"
        },
        {
          "id": "validate-tools",
          "block": "ValidateTools",
          "input": { "from": "parsed.toolCalls", "as": "toolCalls" },
          "config": { "expected": ["search_users"] },
          "output": "toolValidation"
        },
        {
          "id": "judge",
          "block": "LLMJudge",
          "input": {
            "text": "${parsed.text}",
            "toolCalls": "${parsed.toolCalls}",
            "expected": {
              "expectedBehavior": "Should use search_users tool and confirm searching for John"
            }
          },
          "config": { "model": "gpt-4o-mini" },
          "output": "judgement"
        }
      ],
      "assertions": {
        "chatResponse.status": 200,
        "toolValidation.passed": true,
        "judgement.score": { "gt": 0.7 }
      }
    }
  ]
}

AI outputs vary. Exact text matching fails. Instead, use another LLM to evaluate semantic meaning:

  • "2:00 PM", "2 PM", "14:00" are all acceptable
  • Focuses on intent and helpfulness
  • Provides reasoning for failures
  • Configurable scoring criteria
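Under the hood, a judge of this kind amounts to asking a second model to grade the first against the expected behavior and weighted criteria. A sketch of how such a grading prompt might be assembled (purely illustrative; the library's actual prompt is not shown here):

```javascript
// Build a grading prompt for the judge model.
// Criteria weights are assumed to sum to 1, mirroring the LLMJudge config.
function buildJudgePrompt({ text, expectedBehavior, criteria }) {
  const criteriaList = Object.entries(criteria)
    .map(([name, weight]) => `- ${name} (weight ${weight})`)
    .join('\n');
  return [
    'You are grading an AI assistant response.',
    `Expected behavior: ${expectedBehavior}`,
    `Actual response: ${text}`,
    'Score each criterion from 0 to 1:',
    criteriaList,
    'Return JSON: {"score": <weighted average>, "reasoning": "<why>"}',
  ].join('\n');
}
```

The judge model's JSON reply is then parsed into the score/reasoning fields asserted on in the examples above.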
// blocks/custom/MyBlock.js
import { Block } from '@blade47/semantic-test';

export class MyBlock extends Block {
  static get inputs() {
    return { required: ['data'], optional: ['config'] };
  }

  static get outputs() {
    return { produces: ['result', 'metadata'] };
  }

  async process(inputs, context) {
    const { data, config } = inputs;
    // Your logic
    const result = await processData(data, config);
    return { result, metadata: { timestamp: Date.now() } };
  }
}
import { blockRegistry } from '@blade47/semantic-test';
import { MyBlock } from './blocks/custom/MyBlock.js';

blockRegistry.register('MyBlock', MyBlock);
{
  "block": "MyBlock",
  "input": {
    "data": "${previous.output}",
    "config": { "mode": "fast" }
  },
  "output": "myResult"
}

See blocks/examples/ for complete examples.

# Run single test
npx semtest test.json

# Run multiple tests
npx semtest tests/*.json

# Generate HTML report
npx semtest test.json --html

# Custom output file
npx semtest test.json --html --output report.html

# Debug mode
LOG_LEVEL=DEBUG npx semtest test.json
import { PipelineBuilder } from '@blade47/semantic-test';
import fs from 'fs/promises';

const testDef = JSON.parse(await fs.readFile('test.json', 'utf-8'));
const pipeline = PipelineBuilder.fromJSON(testDef);
const result = await pipeline.execute();

if (result.success) {
  console.log('Test passed!');
} else {
  console.error('Test failed:', result.error);
}

See examples/ directory:

  • simple-api-test.json - Basic REST API testing
  • test-llm-judge.json - AI response evaluation
  • test-error-reporting.json - Error handling
  • test-reporting.json - Rich output formatting
{
  "block": "LLMJudge",
  "input": {
    "text": "${response.text}",
    "history": [
      { "role": "user", "content": "Hello" },
      { "role": "assistant", "content": "Hi there!" },
      { "role": "user", "content": "What's the weather?" }
    ]
  },
  "config": {
    "continueConversation": true,
    "maxTurns": 5
  }
}
import { StreamParser } from '@blade47/semantic-test';

function myCustomParser(body) {
  // Parse your custom format
  return {
    text: extractedText,
    toolCalls: extractedTools,
    chunks: allChunks,
    metadata: { format: 'custom' }
  };
}

StreamParser.register('my-format', myCustomParser);

Use it:

{ "block": "StreamParser", "config": { "format": "my-format" } }
{
  "pipeline": [
    {
      "id": "attempt",
      "block": "HttpRequest",
      "input": { "url": "${API_URL}" }
    },
    {
      "id": "check",
      "block": "ValidateContent",
      "input": { "from": "attempt.body", "as": "text" },
      "config": { "contains": "success" }
    },
    {
      "id": "retry",
      "block": "Loop",
      "config": { "target": "attempt", "maxIterations": 3 }
    }
  ]
}

1. Use Meaningful Slot Names

// Good
"output": "userProfile"
"output": "authToken"

// Bad
"output": "data"
"output": "result"
2. Validate Before Expensive Operations

{
  "pipeline": [
    { "block": "HttpRequest", "output": "response" },
    { "block": "JsonParser", "output": "data" },
    { "block": "ValidateContent" },  // Validate before expensive operations
    { "block": "LLMJudge" }          // Expensive: calls GPT-4
  ]
}

3. Clean Up Test Data

Always clean up test data:

{
  "setup": [
    { "id": "create-test-data", "block": "..." }
  ],
  "tests": [ /* ... */ ],
  "teardown": [
    { "id": "delete-test-data", "block": "..." }
  ]
}

4. Semantic Validation for AI

Don't match exact text:

// Bad - too brittle
{
  "assertions": {
    "response.text": "The meeting is scheduled for 2:00 PM"
  }
}

// Good - semantic validation
{
  "block": "LLMJudge",
  "input": {
    "expected": {
      "expectedBehavior": "Should confirm meeting is scheduled for 2 PM"
    }
  }
}
git clone https://github.com/blade47/semantic-test.git
cd semantic-test
npm install
npm test
  1. Create block in blocks/[category]/YourBlock.js
  2. Add tests in tests/unit/blocks/YourBlock.test.js
  3. Register in src/core/BlockRegistry.js
  4. Document in README
npm test                  # All tests
npm run test:unit         # Unit tests only
npm run test:integration  # Integration tests
npm run test:watch        # Watch mode

MIT


Built for testing AI systems that don't play by traditional rules.
