Since the release of the Kimi K2 model, we have received a great deal of feedback on the precision of Kimi K2's tool calls. Given that K2 focuses on the agentic loop, the reliability of tool calling is of utmost importance.
We have observed significant differences in tool call performance across various open-source solutions and vendors. When selecting a provider, users often prioritize lower latency and cost, but may inadvertently overlook subtler yet critical differences in model accuracy.
These inconsistencies not only affect the user experience but also impact K2's results across various benchmarks. To mitigate these problems, we are launching K2 Vendor Verifier (K2VV) to monitor and improve the quality of all K2 APIs.
We hope K2VV helps ensure that everyone can access a consistent and high-performing Kimi K2 model.
Test Time: 2025-09-22
| Model | Provider | Count of Finish Reason: stop | Count of Finish Reason: tool_calls | Count of Finish Reason: others | Schema Validation Error Count | Successful Tool Call Count | Similarity to Official API |
| --- | --- | --- | --- | --- | --- | --- | --- |
| kimi-k2-0905-preview | MoonshotAI | 1437 | 522 | 41 | 0 | 522 | - |
| kimi-k2-0905-preview | Moonshot AI Turbo | 1441 | 513 | 46 | 0 | 513 | 99.29% |
| kimi-k2-0905-preview | NovitaAI | 1483 | 514 | 3 | 10 | 504 | 96.82% |
| kimi-k2-0905-preview | SiliconFlow | 1408 | 553 | 39 | 46 | 507 | 96.78% |
| kimi-k2-0905-preview | Volc | 1423 | 516 | 61 | 40 | 476 | 96.70% |
| kimi-k2-0905-preview | DeepInfra | 1455 | 545 | 0 | 42 | 503 | 96.59% |
| kimi-k2-0905-preview | Fireworks | 1483 | 511 | 6 | 39 | 472 | 95.68% |
| kimi-k2-0905-preview | Infinigence | 1484 | 467 | 49 | 0 | 467 | 95.44% |
| kimi-k2-0905-preview | Baseten | 1777 | 217 | 6 | 9 | 208 | 72.23% |
| kimi-k2-0905-preview | Together | 1866 | 134 | 0 | 8 | 126 | 64.89% |
| kimi-k2-0905-preview | AtlasCloud | 1906 | 94 | 0 | 4 | 90 | 61.55% |
The detailed evaluation metrics are as follows:
| Metric | Description |
| --- | --- |
| Count of Finish Reason: stop | Number of responses where `finish_reason` is `"stop"`. |
| Count of Finish Reason: tool_calls | Number of responses where `finish_reason` is `"tool_calls"`. |
| Count of Finish Reason: others | Number of responses where `finish_reason` is neither `"stop"` nor `"tool_calls"`. |
| Schema Validation Error Count | Among `"tool_calls"` responses, the number that failed schema validation. |
| Successful Tool Call Count | Among `"tool_calls"` responses, the number that passed schema validation. |
| Similarity to Official API | `1 - d / 2000`, where `d` is the Euclidean distance between a provider's count metrics and those of the official Moonshot AI API, and 2000 is the estimated maximum distance (see the sketch below). |
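To make the similarity formula concrete, here is a minimal sketch in Python. The function and constant names are our own; the count vectors are taken from the results table above, and the NovitaAI example reproduces the 96.82% shown there:

```python
import math

# The five count metrics, in table order:
# (stop, tool_calls, others, schema errors, successful tool calls)
OFFICIAL = (1437, 522, 41, 0, 522)  # official Moonshot AI API row
ESTIMATED_MAX_DISTANCE = 2000

def similarity(provider_counts, official=OFFICIAL):
    """1 - (Euclidean distance to the official counts) / estimated max distance."""
    d = math.dist(provider_counts, official)
    return 1 - d / ESTIMATED_MAX_DISTANCE

# NovitaAI's counts from the table -> prints "96.82%"
print(f"{similarity((1483, 514, 3, 10, 504)):.2%}")
```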
We evaluate tool call responses over a set of 2,000 requests. Each provider's responses are collected and compared against those of the official Moonshot AI API. K2 providers are re-evaluated periodically. If you are not on the list and would like to be included, feel free to contact us.
Sample Data: We have provided detailed sample data in `samples.jsonl`.
To run the evaluation tool with sample data, use the following command:
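A minimal sketch of such an invocation, assuming the evaluation script's entry point is `main.py` (the script name, endpoint, and key variable are assumptions; substitute your provider's values, using the flags documented below):

```bash
# Run the 2,000-request test set against a provider endpoint.
python main.py samples.jsonl \
  --model kimi-k2-0905-preview \
  --base-url https://api.moonshot.ai/v1 \
  --api-key "$MOONSHOT_API_KEY" \
  --concurrency 5 \
  --output results.jsonl \
  --summary summary.json
```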
- `samples.jsonl`: Path to the test set file in JSONL format (positional argument)
- `--model`: Model name (e.g., `kimi-k2-0905-preview`)
- `--base-url`: API endpoint URL
- `--api-key`: API key for authentication (or set the `OPENAI_API_KEY` environment variable)
- `--concurrency`: Maximum number of concurrent requests (default: 5)
- `--output`: Path to save detailed results (default: `results.jsonl`)
- `--summary`: Path to save aggregated summary (default: `summary.json`)
- `--timeout`: Per-request timeout in seconds (default: 600)
- `--retries`: Number of retries on failure (default: 3)
- `--extra-body`: Extra JSON body as a string to merge into each request payload (e.g., `'{"temperature":0.5}'`)
- `--incremental`: Incremental mode; only rerun failed requests (see the example after this list)
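For instance, a rerun that retries only the failed requests from a previous run, with a custom temperature merged into each payload (a sketch under the same `main.py` assumption; it presumes the previous `results.jsonl` is present so failed requests can be identified):

```bash
# Rerun only the failed requests from results.jsonl, at temperature 0.5.
python main.py samples.jsonl \
  --model kimi-k2-0905-preview \
  --base-url https://api.moonshot.ai/v1 \
  --api-key "$MOONSHOT_API_KEY" \
  --extra-body '{"temperature":0.5}' \
  --incremental
```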
For testing other providers via OpenRouter:
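One way to do this, sketched under the same assumptions: point `--base-url` at OpenRouter and pin a single upstream provider via OpenRouter's provider routing options in `--extra-body` (confirm the exact model slug and provider names against OpenRouter's listings):

```bash
# Evaluate a specific upstream provider through OpenRouter.
python main.py samples.jsonl \
  --model moonshotai/kimi-k2-0905 \
  --base-url https://openrouter.ai/api/v1 \
  --api-key "$OPENROUTER_API_KEY" \
  --extra-body '{"provider": {"only": ["DeepInfra"]}}'
```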
If you have any questions or concerns, please reach out to us at [email protected].