K2VV: Wild Precision Gaps Across "Kimi K2" API Vendors

Since the release of the Kimi K2 model, we have received a great deal of feedback about its tool-call precision. Because K2 is built for the agentic loop, reliable tool calling is of utmost importance.

We have observed significant differences in tool-call performance across open-source deployments and API vendors. When selecting a provider, users often prioritize lower latency and cost, but may inadvertently overlook subtler yet critical differences in model accuracy.

These inconsistencies not only affect the user experience but also skew K2's results across benchmarks. To mitigate these problems, we are launching K2 Vendor Verifier (K2VV) to monitor and improve the quality of all K2 APIs.

We hope K2VV helps ensure that everyone has access to a consistent and high-performing Kimi K2 model.

Test Time: 2025-09-22

| Model Name | Provider | Count of finish_reason: stop | Count of finish_reason: tool_calls | Count of finish_reason: others | Schema Validation Error Count | Successful Tool Call Count | Similarity Compared to the Official Implementation |
|---|---|---|---|---|---|---|---|
| kimi-k2-0905-preview | MoonshotAI | 1437 | 522 | 41 | 0 | 522 | - |
| kimi-k2-0905-preview | Moonshot AI Turbo | 1441 | 513 | 46 | 0 | 513 | 99.29% |
| kimi-k2-0905-preview | NovitaAI | 1483 | 514 | 3 | 10 | 504 | 96.82% |
| kimi-k2-0905-preview | SiliconFlow | 1408 | 553 | 39 | 46 | 507 | 96.78% |
| kimi-k2-0905-preview | Volc | 1423 | 516 | 61 | 40 | 476 | 96.70% |
| kimi-k2-0905-preview | DeepInfra | 1455 | 545 | 0 | 42 | 503 | 96.59% |
| kimi-k2-0905-preview | Fireworks | 1483 | 511 | 6 | 39 | 472 | 95.68% |
| kimi-k2-0905-preview | Infinigence | 1484 | 467 | 49 | 0 | 467 | 95.44% |
| kimi-k2-0905-preview | Baseten | 1777 | 217 | 6 | 9 | 208 | 72.23% |
| kimi-k2-0905-preview | Together | 1866 | 134 | 0 | 8 | 126 | 64.89% |
| kimi-k2-0905-preview | AtlasCloud | 1906 | 94 | 0 | 4 | 90 | 61.55% |

Each provider was run against the same set of 2,000 tool-call requests, so the three finish_reason counts in every row sum to 2,000.

The detailed evaluation metrics are as follows:

| Metric Name | Description |
|---|---|
| Count of finish_reason: stop | Number of responses where finish_reason is "stop". |
| Count of finish_reason: tool_calls | Number of responses where finish_reason is "tool_calls". |
| Count of finish_reason: others | Number of responses where finish_reason is neither "stop" nor "tool_calls". |
| Schema Validation Error Count | Among "tool_calls" responses, the number that failed schema validation. |
| Successful Tool Call Count | Among "tool_calls" responses, the number that passed schema validation. |
| Similarity to Official API | 1 - (Euclidean distance between a provider's metric values and those of the official Moonshot AI API) / estimated_max_distance, with estimated_max_distance = 2000. |
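
As a sanity check, the similarity figures can be reproduced from the counts above. The sketch below assumes the metric vector is the five per-provider counts (stop, tool_calls, others, schema validation errors, successful tool calls); that reading is our assumption, but under it the computation matches the published 99.29% for Moonshot AI Turbo.

    import math

    # Metric vector per provider:
    # (stop, tool_calls, others, schema validation errors, successful tool calls)
    official = (1437, 522, 41, 0, 522)  # MoonshotAI, the reference
    turbo = (1441, 513, 46, 0, 513)     # Moonshot AI Turbo

    ESTIMATED_MAX_DISTANCE = 2000

    def similarity(provider, reference):
        """1 minus the Euclidean distance, normalized by the estimated maximum."""
        return 1 - math.dist(provider, reference) / ESTIMATED_MAX_DISTANCE

    print(f"{similarity(turbo, official):.2%}")  # -> 99.29%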

We test tool-call responses over a set of 2,000 requests. Each provider's responses are collected and compared against those of the official Moonshot AI API. K2 providers are re-evaluated periodically. If you are not on the list and would like to be included, feel free to contact us.
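
The schema validation behind the two tool-call metrics is conceptually simple: a tool call counts as successful only if its arguments parse as JSON and conform to the JSON Schema declared in the request's tool definition. Below is a minimal sketch of that check, not the harness's actual code; the get_weather tool is hypothetical.

    import json

    import jsonschema  # pip install jsonschema

    # Hypothetical parameter schema, as declared in a request's "tools" list.
    get_weather_params = {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    }

    def tool_call_passes(tool_call: dict, params_schema: dict) -> bool:
        """True iff the call's arguments are valid JSON and satisfy the schema."""
        try:
            args = json.loads(tool_call["function"]["arguments"])
            jsonschema.validate(args, params_schema)
            return True
        except (json.JSONDecodeError, jsonschema.ValidationError):
            return False

    ok = {"function": {"name": "get_weather", "arguments": '{"city": "Beijing"}'}}
    bad = {"function": {"name": "get_weather", "arguments": '{"city": 42}'}}
    print(tool_call_passes(ok, get_weather_params))   # True  -> successful call
    print(tool_call_passes(bad, get_weather_params))  # False -> schema error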

Sample Data: We have provided detailed sample data in samples.jsonl.
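
Each line of samples.jsonl holds one test request. Purely as an illustration (the file's actual field layout may differ), a line could be an OpenAI-style chat-completions body carrying a tools array:

    import json

    # Hypothetical samples.jsonl line (illustrative only; real fields may differ).
    raw = ('{"messages": [{"role": "user", "content": "Weather in Beijing?"}],'
           ' "tools": [{"type": "function", "function": {"name": "get_weather",'
           ' "parameters": {"type": "object", "properties":'
           ' {"city": {"type": "string"}}, "required": ["city"]}}}]}')

    request = json.loads(raw)
    print(request["tools"][0]["function"]["name"])  # -> get_weather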

To run the evaluation tool with sample data, use the following command:

    python tool_calls_eval.py samples.jsonl \
      --model kimi-k2-0905-preview \
      --base-url https://api.moonshot.cn/v1 \
      --api-key YOUR_API_KEY \
      --concurrency 5 \
      --output results.jsonl \
      --summary summary.json
  • samples.jsonl: Path to the test set file in JSONL format
  • --model: Model name (e.g., kimi-k2-0905-preview)
  • --base-url: API endpoint URL
  • --api-key: API key for authentication (or set OPENAI_API_KEY environment variable)
  • --concurrency: Maximum number of concurrent requests (default: 5)
  • --output: Path to save detailed results (default: results.jsonl)
  • --summary: Path to save aggregated summary (default: summary.json)
  • --timeout: Per-request timeout in seconds (default: 600)
  • --retries: Number of retries on failure (default: 3)
  • --extra-body: Extra JSON body as string to merge into each request payload (e.g., '{"temperature":0.5}')
  • --incremental: Incremental mode; only reruns requests that previously failed

For testing other providers via OpenRouter:

    python tool_calls_eval.py samples.jsonl \
      --model kimi-k2-0905-preview \
      --base-url https://openrouter.ai/api/v1 \
      --api-key YOUR_OPENROUTER_API_KEY \
      --concurrency 5 \
      --extra-body '{"provider": {"only": ["YOUR_DESIGNATED_PROVIDER"]}}'
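
The --extra-body string is plain JSON merged into every request payload; with OpenRouter it carries the provider-routing field that pins a single upstream vendor. A rough illustration of the effect, assuming a shallow top-level merge (the tool's actual merge logic may differ):

    import json

    # A chat-completions payload as the evaluator might build it.
    payload = {
        "model": "kimi-k2-0905-preview",
        "messages": [{"role": "user", "content": "Weather in Beijing?"}],
    }

    # Parse the --extra-body string and overlay it on the payload.
    extra = json.loads('{"provider": {"only": ["YOUR_DESIGNATED_PROVIDER"]}}')
    payload.update(extra)  # assumed shallow merge: --extra-body keys win

    print(json.dumps(payload, indent=2))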

If you have any questions or concerns, please reach out to us at [email protected].
