AI companies are increasingly testing experimental models under odd pseudonyms on the LMSYS Chatbot Arena and quietly deploying them without any release notes. Case in point: since last week, X users have been discussing improved ChatGPT performance on both coding and creative tasks. Many believed a new OpenAI model was behind it, likely related to Project Strawberry, the rumored advanced reasoning engine.
OpenAI has since confirmed that ChatGPT is indeed running a new model. Known as chatgpt-4o-latest, it is an updated version of GPT-4o optimized specifically for chat interactions. According to OpenAI, this iteration has been fine-tuned on qualitative feedback and experimental results to improve its performance.
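For developers, the same chatgpt-4o-latest identifier is also available through the OpenAI API. Here is a minimal sketch using the official openai Python SDK, assuming an OPENAI_API_KEY is set in your environment:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Query the chat-optimized GPT-4o snapshot by name.
response = client.chat.completions.create(
    model="chatgpt-4o-latest",
    messages=[
        {"role": "user", "content": "Write a haiku about strawberries."},
    ],
)

print(response.choices[0].message.content)
```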
OpenAI also stated that it continues to remove bad data from its training datasets while adding high-quality data and “experimenting with new research methods.” This raises questions about Project Strawberry, which is reportedly designed to introduce a new post-training method aimed at improving reasoning. Is the new ChatGPT model already using this advanced reasoning engine?
Many users on X have noted that ChatGPT now works through problems in multiple steps. With this approach, the model generates several step-by-step rationales before committing to an answer, which tends to lead to more accurate conclusions.
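That description resembles chain-of-thought prompting combined with self-consistency: sample several independent rationales, then keep the answer most of them agree on. OpenAI has not disclosed what the deployed model actually does, so the sketch below is purely illustrative; ask_model is a hypothetical callable standing in for any LLM call that returns a (rationale, final_answer) pair.

```python
from collections import Counter

def self_consistent_answer(ask_model, question: str, n: int = 5) -> str:
    """Sample n step-by-step rationales and majority-vote the final answers.

    ask_model is a hypothetical stand-in for an LLM call returning
    (rationale, final_answer). This illustrates the general technique,
    not OpenAI's actual, undisclosed implementation.
    """
    finals = [ask_model(question)[1] for _ in range(n)]
    return Counter(finals).most_common(1)[0][0]
```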
Understanding the Impact of GPT-4o
The GPT-4o model has garnered significant attention, especially after being tested under the pseudonym “anonymous-chatbot” on LMSYS. It received over 11,000 votes, outperformed rival models from Google, Anthropic, and Meta, and became the first model to reach a score of 1314 in the LMSYS Arena.
As users continue to share their experiences, the new GPT-4o model appears to have made a notable impact. For instance, one user recently remarked on social media about the improved “vibes” of GPT-4o's outputs compared with a competing model, Anthropic's Claude 3.5 Sonnet. This kind of feedback is valuable because it helps developers gauge user satisfaction and identify areas for improvement.
The Vibe Test: Analyzing User Experience
To assess the effectiveness of the updated ChatGPT model, I ran a few reasoning prompts and found the performance surprisingly consistent with the previous version. For example, when asked which number is larger, 9.11 or 9.9, it correctly answered 9.9, just as before.
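For reference, the arithmetic is straightforward once the decimals are aligned: 9.9 is 9.90, which exceeds 9.11. A one-line check confirms it:

```python
# Padded to two decimal places, 9.9 is 9.90, so it is greater than 9.11.
print(9.9 > 9.11)  # True
```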
However, some prompts revealed that the model still struggles with certain tasks. In one instance, it suggested stacking nine eggs on top of a bottle, which is impossible. Such errors highlight the ongoing challenges in AI reasoning capabilities.
Common Errors in Reasoning
Another notable mistake occurred when I asked how many “R”s are in the word “strawberry.” The model incorrectly stated there were only two; the correct answer is three. Errors like this suggest that while improvements are evident, the update may not yet have rolled out everywhere.
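A quick script makes the correct count unambiguous:

```python
# s-t-r-a-w-b-e-r-r-y: the letter "r" appears three times.
print("strawberry".count("r"))  # 3
```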
Future Prospects and Expectations
Looking ahead, it is reasonable to expect OpenAI's ongoing work on ChatGPT to yield further improvements: the combination of new post-training methods and continuous data refinement should contribute to better reasoning and output quality.
With the improvements seen in the GPT-4o model, users can anticipate a more sophisticated interaction experience with ChatGPT. If you have any questions or experiences to share, feel free to leave a comment below!