Real-world LLM performance ratings from production usage on janitorai.com
About janitorBench
janitorBench is a new benchmark for evaluating chatbot models based on real conversations with real users.
On janitorAI, millions of users chat with characters powered by AI models. The majority of our users use the free model we provide, JLLM, but we also allow users to connect third-party models through proxies.
After each AI message, users can leave a 1-5 star rating. These ratings measure what matters most in storytelling contexts: how much users actually like a model's performance in long, multi-turn conversations.
Our large scale and third-party model support give us a unique dataset for evaluating a wide range of AI models in real time. Now we're sharing that data with the community.
Methodology
We track message ratings (1-5 stars), engagement time, and behavioral signals across millions of conversations. Right now, scores are based on average ratings alone; over time they will expand to incorporate other signals. Scores are normalized to 0-100 with statistical safeguards against manipulation.
Updated every 12 hours • Minimum 5000 ratings required • No raw data published
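To make the scoring concrete, here is a minimal sketch of how a 1-5 star average could be turned into a 0-100 score with a minimum-volume cutoff. The shrinkage toward a global mean and the `PRIOR_WEIGHT` constant are illustrative assumptions standing in for the unpublished anti-manipulation safeguards; only the 5000-rating minimum and the 0-100 scale come from the leaderboard itself.

```python
from dataclasses import dataclass

# Hypothetical parameters -- janitorBench does not publish its exact formula.
MIN_RATINGS = 5000   # minimum rating count stated on the leaderboard
PRIOR_WEIGHT = 500   # assumed shrinkage strength toward the global mean

@dataclass
class ModelRatings:
    name: str
    ratings: list[int]   # each entry is a 1-5 star rating


def normalized_score(model: ModelRatings, global_mean: float) -> float | None:
    """Map a model's average star rating onto a 0-100 scale.

    Returns None if the model has not met the minimum rating count.
    The shrinkage toward the global mean is an assumed stand-in for the
    real "statistical safeguards against manipulation".
    """
    n = len(model.ratings)
    if n < MIN_RATINGS:
        return None
    raw_mean = sum(model.ratings) / n
    # Shrink the average toward the global mean so a burst of coordinated
    # 5-star (or 1-star) ratings moves the score less than organic volume.
    adjusted = (raw_mean * n + global_mean * PRIOR_WEIGHT) / (n + PRIOR_WEIGHT)
    # Linearly rescale the 1-5 star range to 0-100.
    return (adjusted - 1.0) / 4.0 * 100.0
```

In this sketch, a model with exactly the global average rating lands in the middle of the scale regardless of volume, while models with few ratings beyond the cutoff are pulled toward the pack until their sample size justifies a more extreme score.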
Coming Soon
- Category-specific leaderboards (storytelling, NSFW/SFW)
- Engagement metrics: session time, retention, conversation depth
- A/B testing on identical prompts for head-to-head comparison
- Model provider dashboards
For model providers
Connect your model: coming soon.