Benchmarking ChatGPT's capabilities against alternatives including Anthropic's Claude 2, Google's Bard, and Meta's Llama2

As previously reported, new research reveals inconsistencies in ChatGPT models over time. A Stanford and UC Berkeley study analyzed March and June versions of GPT-3.5 and GPT-4 on diverse tasks. The results show significant drifts in performance, even over just a few months.

GPT-4 vs GPT-3.5 performance (Source: Stanford University & UC Berkeley)

For example, GPT-4’s prime number accuracy plunged from 97.6% to 2.4% between March and June due to issues following step-by-step reasoning. GPT-4 also grew more reluctant to answer sensitive questions directly, with response rates dropping from 21% to 5%, and it provided less rationale when it refused.

Both GPT-3.5 and GPT-4 generated buggier code in June compared to March. The percentage of directly executable Python snippets dropped substantially because of extra non-code text.
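A rough way to picture the "directly executable" measure is sketched below. It assumes, for illustration only, that the extra non-code text is a Markdown fence wrapped around the snippet, and it uses a syntax check as a proxy for executability. The helper names are hypothetical and are not taken from the study.

```python
import ast
import re

# A literal Markdown fence, built indirectly so it does not clash with this listing.
FENCE = "`" * 3

def is_directly_executable(response: str) -> bool:
    """Return True if the raw reply parses as Python without any cleanup."""
    try:
        ast.parse(response)
        return True
    except SyntaxError:
        return False

def strip_markdown_fence(response: str) -> str:
    """Return the body of the first fenced block, or the reply unchanged."""
    pattern = re.compile(FENCE + r"[a-zA-Z]*\n(.*?)" + FENCE, re.DOTALL)
    match = pattern.search(response)
    return match.group(1) if match else response

bare = "def add(a, b):\n    return a + b\n"
wrapped = FENCE + "python\n" + bare + FENCE  # same code with non-code text around it

print(is_directly_executable(bare))                           # True
print(is_directly_executable(wrapped))                        # False
print(is_directly_executable(strip_markdown_fence(wrapped)))  # True
```

The code itself is identical in both replies; only the wrapper changes whether it can be pasted and run as-is.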

While visual reasoning improved slightly overall, generations for the same puzzles changed unpredictably between dates. The considerable inconsistencies over short periods raise concerns about relying on these models for sensitive or mission-critical uses without ongoing testing.

The researchers concluded the findings highlight the need for continuous monitoring of ChatGPT models as their behavior evolves across metrics like accuracy, safety, and robustness.

The opaque update process makes rigorous testing important for understanding shifts in performance over time.
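One way such ongoing testing might look is a fixed question set scored against each model snapshot. The sketch below is an assumption-laden illustration: the `ask` callables stand in for real model calls, the question set beyond 17077 is made up for the example, and the answer parsing is a naive heuristic.

```python
from typing import Callable, Dict, List, Tuple

# (prompt, expected answer) pairs. 17077 comes from the article; the other
# numbers are illustrative additions (97 and 101 are prime, 17069 = 13 x 1313).
CASES: List[Tuple[str, str]] = [
    ("Is 17077 a prime number?", "yes"),
    ("Is 97 a prime number?", "yes"),
    ("Is 101 a prime number?", "yes"),
    ("Is 17069 a prime number?", "no"),
]

def first_verdict(response: str) -> str:
    """Reduce a free-text reply to 'yes', 'no', or 'unknown' (naive heuristic)."""
    words = response.lower().replace(",", " ").replace(".", " ").split()
    for word in words:
        if word in ("yes", "no"):
            return word
    return "unknown"

def accuracy(ask: Callable[[str], str]) -> float:
    """Fraction of CASES for which the model's verdict matches the expected one."""
    correct = sum(first_verdict(ask(prompt)) == expected for prompt, expected in CASES)
    return correct / len(CASES)

def drift_report(snapshots: Dict[str, Callable[[str], str]]) -> Dict[str, float]:
    """Score every snapshot on the same cases so the results are comparable."""
    return {name: accuracy(ask) for name, ask in snapshots.items()}

# Stand-in "models" so the sketch runs offline; real code would call a model API.
def always_yes(prompt: str) -> str:
    return "Yes, it is prime."

def always_no(prompt: str) -> str:
    return "No. It is not a prime number."

report = drift_report({"snapshot_a": always_yes, "snapshot_b": always_no})
print(report)  # {'snapshot_a': 0.75, 'snapshot_b': 0.25}
```

Rerunning the same report after each model update would surface a shift like the one the researchers describe.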

Is ChatGPT worse than competitors now?

CryptoSlate conducted a small internal experiment with ChatGPT Plus (GPT-4), the OpenAI API (GPT-4), Anthropic (Claude 2), and Google (Bard), using the basic prompt from part of the research:

‘Is 17077 a prime number?’

The prompt was used on each model with additional reflection prompts as described below.

ChatGPT & OpenAI API

When given the prompt, both ChatGPT and the OpenAI API responded ‘no’ and hallucinated the math. The image below details the conversation, with the model unable to identify 17077 as a prime number even after several reflections.

To be clear, 13 x 1313 is 17,069.

The OpenAI GPT4 API was unable to reach this conclusion until specifically asked to calculate 13 x 1313 to discover the answer is not 17077 as it stated.
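The arithmetic behind that correction is easy to confirm. The short check below uses only the numbers quoted above; the variable names are illustrative.

```python
claimed_product = 17077  # the value the model said 13 x 1313 equals
a, b = 13, 1313

actual = a * b
print(actual)                     # 17069
print(actual == claimed_product)  # False
print(claimed_product % a)        # 8, so 13 does not divide 17077
```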

Anthropic’s Claude 2

By contrast, Anthropic’s Claude 2 showed its problem-solving process, performing calculations before providing the correct response.

CryptoSlate then asked Claude 2, in a fresh chat window, to perform the same task without showing its working. Claude 2 gave a solid answer, refusing to commit while offering additional insight into the solution.

“Unfortunately I cannot determine if 17077 is prime without showing some working. However, I can confirm that 17077 is not divisible by any prime number less than 121, which strongly suggests it may be prime.”
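Claude 2's hedge is reasonable, because a trial-division check has to reach the square root of the number, and for 17077 that is slightly above 121. The sketch below, with an illustrative helper name, works out the bound and the primes left between 121 and that bound.

```python
import math

n = 17077
bound = math.isqrt(n)  # largest integer whose square does not exceed n
print(bound)  # 130

def is_small_prime(k: int) -> bool:
    """Plain trial division, adequate for the handful of small values checked here."""
    return k >= 2 and all(k % d for d in range(2, math.isqrt(k) + 1))

# Primes from 121 up to the bound that a "less than 121" check would not cover.
remaining = [p for p in range(121, bound + 1) if is_small_prime(p)]
print(remaining)                   # [127]
print([n % p for p in remaining])  # [59]
```

Only 127 remains, and it leaves a remainder of 59, so the number is indeed prime.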

Google Bard

Google Bard tackled the question with a similar strategy to Claude 2. However, instead of walking through the problem with text, it ran some basic Python code. Further, it appears Bard used information from a prime number website and Wikipedia in its solution. Interestingly, the page cited from the prime number site, primenumbers.info, included only information about other prime numbers, not 17077.

Meta’s Llama 2

Interestingly, Meta’s recently released 70 billion parameter open-sourced model Llama2 performed similarly to GPT4 in CryptoSlate’s limited testing.

Yet, when asked to reflect and show its working, Llama2 could decipher that 17077 is a prime number, unlike GPT4 versions currently available.

However, the caveat is that Llama used an incomplete method to check for prime numbers. It did not test every prime divisor up to the square root of 17077 (about 130.7).

Therefore, technically Llama failed successfully.
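For comparison, a complete check tests every candidate divisor up to the square root. The following is a minimal sketch of standard trial division, not a reconstruction of what any of the tested models ran.

```python
import math

def is_prime(n: int) -> bool:
    """Trial division: test 2, then every odd divisor up to the integer square root."""
    if n < 2:
        return False
    if n < 4:
        return True  # 2 and 3
    if n % 2 == 0:
        return False
    for d in range(3, math.isqrt(n) + 1, 2):
        if n % d == 0:
            return False
    return True

print(is_prime(17077))  # True
print(is_prime(17069))  # False, since 17069 = 13 x 1313
```

Testing odd non-primes such as 9 or 15 as divisors is redundant but harmless, and it keeps the loop simple.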

GPT4-0613 version June 13, 2023

CryptoSlate also tested the math puzzle against the GPT4-0613 model (June version) and received the same result. The model suggested 17077 is not a prime number in its first response. When asked to show its working, it eventually gave up, concluding that 17077 must be divisible by the next reasonable number and was therefore not a prime number.

Thus, it appears the task was not within GPT4’s capabilities going back to June 13. Older versions of GPT4 are currently unavailable to the public but were included in the research paper.

Code Interpreter

Interestingly, ChatGPT, with the ‘Code Interpreter’ feature, answered correctly on its first try in CryptoSlate’s testing.

OpenAI Response & model impact

The Economic Times reported that OpenAI’s VP of Product, Peter Welinder, denied claims that OpenAI’s models are degrading, asserting that each new version is smarter than the previous one. He proposed that heavier usage could lead to the perception of decreased effectiveness as more issues are noticed over time.

Interestingly, another study from Stanford researchers published in JAMA Internal Medicine found that the latest version of ChatGPT significantly outperformed medical students on challenging clinical reasoning exam questions.

The AI chatbot scored over 4 points higher on average than first- and second-year students on open-ended, case-based questions that require parsing details and composing thorough answers.

Thus, the apparent decline in ChatGPT’s performance on specific tasks highlights the challenges of relying solely on large language models without ongoing rigorous testing. While the exact causes remain uncertain, it underscores the need for continuous monitoring and benchmarking as these AI systems rapidly evolve.

As advancements continue to improve the stability and consistency of these AI models, users should maintain a balanced perspective on ChatGPT, acknowledging its strengths while staying aware of its limitations.

Liam 'Akiba' Wright

Senior Editor at CryptoSlate

Also known as "Akiba," Liam is a reporter, editor and podcast producer at CryptoSlate. He believes that decentralized technology has the potential to make widespread positive change.

News Desk

Editor at CryptoSlate

