(Daily Point) — Anthropic’s flagship artificial intelligence model, Claude 3 Opus, has claimed the top position on the Chatbot Arena leaderboard, unseating OpenAI’s GPT-4 for the first time since the leaderboard launched last year.
The LMSYS Chatbot Arena departs from conventional AI benchmarks by putting human judgment at the center: participants submit a prompt, receive responses from two anonymous models, and vote for the one they prefer in a blind head-to-head comparison.
OpenAI’s GPT-4 has dominated this benchmark for so long that any contender approaching its performance is often dubbed “GPT-4 class,” which makes Claude 3’s achievement particularly noteworthy.
That said, Claude 3’s margin over GPT-4 is narrow, and its reign at the top may be short-lived if the rumored GPT-4.5 arrives soon.
Administered by the Large Model Systems Organization (LMSYS), the Chatbot Arena pits a diverse array of large language models against one another in anonymous, randomized battles. Since its launch last year, the benchmark has amassed over 400,000 user votes and consistently features models from OpenAI, Google, and Anthropic, as well as emerging contenders such as Mistral and Alibaba.
The benchmark scores models using the Elo rating system familiar from chess and e-sports. Here, though, the rated players are not the humans interacting with the chatbots but the AI models themselves: each human vote counts as the outcome of a match between two models.
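To make the mechanism concrete, here is a minimal sketch of a single Elo update after one battle. The K-factor and starting rating are illustrative assumptions for this example, not LMSYS's actual parameters, and the real leaderboard computation is more involved.

```python
# Minimal Elo update for one Chatbot Arena-style "battle".
# K-factor and base rating here are illustrative assumptions;
# LMSYS's actual computation differs in its details.

def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def elo_update(rating_a: float, rating_b: float,
               score_a: float, k: float = 32.0) -> tuple[float, float]:
    """Return both models' updated ratings after one battle.

    score_a is 1.0 if the human voter preferred model A,
    0.0 if they preferred model B, and 0.5 for a tie.
    """
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b

# Example: two models start at 1000 and the voter prefers model A.
a, b = elo_update(1000.0, 1000.0, score_a=1.0)
print(a, b)  # 1016.0 984.0 — A gains exactly the points B loses
```

Because the update is zero-sum and weighted by the expected outcome, an upset win over a higher-rated model moves the ratings more than a predictable one, which is what lets the leaderboard converge on a stable ranking from hundreds of thousands of individual votes.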
Claude 3 Opus, the flagship model in the Claude 3 lineup, took the top position on the leaderboard after an influx of over 70,000 new votes. Remarkably, the smaller Claude 3 models have also performed admirably. Claude 3 Haiku, the smallest and fastest variant in the series, occupying a lightweight tier similar to Google’s Gemini Nano, delivers impressive results despite being far smaller in scale than GPT-4 or Claude 3 Opus.
All three Claude 3 models appear in the leaderboard’s top 10: Opus leads the pack, Sonnet shares fourth place with Gemini Pro, and Haiku sits sixth alongside an earlier version of GPT-4.