Since the release of the Kimi K2 model, we have received a great deal of feedback on the precision of Kimi K2's tool calls. Given that K2 focuses on the agentic loop, the reliability of tool calling is of utmost importance.
We have observed significant differences in tool call performance across various open-source solutions and vendors. When selecting a provider, users often prioritize lower latency and cost, but may inadvertently overlook subtler yet critical differences in model accuracy.
These inconsistencies not only affect the user experience but also impact K2's results across various benchmarks. To mitigate these problems, we are launching K2 Vendor Verifier (K2VV) to monitor and improve the quality of all K2 APIs.
We hope K2VV helps ensure that everyone can access a consistent and high-performing Kimi K2 model.
Test Time: 2025-09-22
| Model | Provider | Count of Finish Reason: stop | Count of Finish Reason: tool_calls | Count of Finish Reason: others | Schema Validation Error Count | Successful Tool Call Count | Similarity to Official API |
| --- | --- | --- | --- | --- | --- | --- | --- |
| kimi-k2-0905-preview | MoonshotAI | 1437 | 522 | 41 | 0 | 522 | - |
| kimi-k2-0905-preview | Moonshot AI Turbo | 1441 | 513 | 46 | 0 | 513 | 99.29% |
| kimi-k2-0905-preview | NovitaAI | 1483 | 514 | 3 | 10 | 504 | 96.82% |
| kimi-k2-0905-preview | SiliconFlow | 1408 | 553 | 39 | 46 | 507 | 96.78% |
| kimi-k2-0905-preview | Volc | 1423 | 516 | 61 | 40 | 476 | 96.70% |
| kimi-k2-0905-preview | DeepInfra | 1455 | 545 | 0 | 42 | 503 | 96.59% |
| kimi-k2-0905-preview | Fireworks | 1483 | 511 | 6 | 39 | 472 | 95.68% |
| kimi-k2-0905-preview | Infinigence | 1484 | 467 | 49 | 0 | 467 | 95.44% |
| kimi-k2-0905-preview | Baseten | 1777 | 217 | 6 | 9 | 208 | 72.23% |
| kimi-k2-0905-preview | Together | 1866 | 134 | 0 | 8 | 126 | 64.89% |
| kimi-k2-0905-preview | AtlasCloud | 1906 | 94 | 0 | 4 | 90 | 61.55% |
The detailed evaluation metrics are as follows:
| Metric | Description |
| --- | --- |
| Count of Finish Reason: stop | Number of responses where `finish_reason` is `"stop"`. |
| Count of Finish Reason: tool_calls | Number of responses where `finish_reason` is `"tool_calls"`. |
| Count of Finish Reason: others | Number of responses where `finish_reason` is neither `"stop"` nor `"tool_calls"`. |
| Schema Validation Error Count | Among `"tool_calls"` responses, the number that failed schema validation. |
| Successful Tool Call Count | Among `"tool_calls"` responses, the number that passed schema validation. |
| Similarity to Official API | `1 - d / 2000`, where `d` is the Euclidean distance between a provider's count metrics and those of the official Moonshot AI API, and 2000 is the estimated maximum distance (see the sketch below). |
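To make the similarity formula concrete, here is a minimal sketch in Python. The function and constant names are our own; the count vectors are taken from the results table above, and the NovitaAI example reproduces the 96.82% shown there:

```python
import math

# The five count metrics, in table order:
# (stop, tool_calls, others, schema errors, successful tool calls)
OFFICIAL = (1437, 522, 41, 0, 522)  # official Moonshot AI API row
ESTIMATED_MAX_DISTANCE = 2000

def similarity(provider_counts, official=OFFICIAL):
    """1 - (Euclidean distance to the official counts) / estimated max distance."""
    d = math.dist(provider_counts, official)
    return 1 - d / ESTIMATED_MAX_DISTANCE

# NovitaAI's counts from the table -> prints "96.82%"
print(f"{similarity((1483, 514, 3, 10, 504)):.2%}")
```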
We evaluate tool call responses over a set of 2,000 requests. Each provider's responses are collected and compared against those of the official Moonshot AI API. K2 providers are re-evaluated periodically. If you are not on the list and would like to be included, feel free to contact us.
Sample Data: We have provided detailed sample data in `samples.jsonl`.
To run the evaluation tool with sample data, use the following command:
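A minimal sketch of such an invocation, assuming the evaluation script's entry point is `main.py` (the script name, endpoint, and key variable are assumptions; substitute your provider's values, using the flags documented below):

```bash
# Run the 2,000-request test set against a provider endpoint.
python main.py samples.jsonl \
  --model kimi-k2-0905-preview \
  --base-url https://api.moonshot.ai/v1 \
  --api-key "$MOONSHOT_API_KEY" \
  --concurrency 5 \
  --output results.jsonl \
  --summary summary.json
```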
- `samples.jsonl`: Path to the test set file in JSONL format (positional argument)
- `--model`: Model name (e.g., `kimi-k2-0905-preview`)
- `--base-url`: API endpoint URL
- `--api-key`: API key for authentication (or set the `OPENAI_API_KEY` environment variable)
- `--concurrency`: Maximum number of concurrent requests (default: 5)
- `--output`: Path to save detailed results (default: `results.jsonl`)
- `--summary`: Path to save aggregated summary (default: `summary.json`)
- `--timeout`: Per-request timeout in seconds (default: 600)
- `--retries`: Number of retries on failure (default: 3)
- `--extra-body`: Extra JSON body as a string to merge into each request payload (e.g., `'{"temperature":0.5}'`)
- `--incremental`: Incremental mode; only rerun failed requests (see the example after this list)
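For instance, a rerun that retries only the failed requests from a previous run, with a custom temperature merged into each payload (a sketch under the same `main.py` assumption; it presumes the previous `results.jsonl` is present so failed requests can be identified):

```bash
# Rerun only the failed requests from results.jsonl, at temperature 0.5.
python main.py samples.jsonl \
  --model kimi-k2-0905-preview \
  --base-url https://api.moonshot.ai/v1 \
  --api-key "$MOONSHOT_API_KEY" \
  --extra-body '{"temperature":0.5}' \
  --incremental
```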
For testing other providers via OpenRouter:
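One way to do this, sketched under the same assumptions: point `--base-url` at OpenRouter and pin a single upstream provider via OpenRouter's provider routing options in `--extra-body` (confirm the exact model slug and provider names against OpenRouter's listings):

```bash
# Evaluate a specific upstream provider through OpenRouter.
python main.py samples.jsonl \
  --model moonshotai/kimi-k2-0905 \
  --base-url https://openrouter.ai/api/v1 \
  --api-key "$OPENROUTER_API_KEY" \
  --extra-body '{"provider": {"only": ["DeepInfra"]}}'
```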
If you have any questions or concerns, please reach out to us at [email protected].