
ChatGPT-4o Review: OpenAI's Next-Level Multimodal AI Model


ChatGPT-4o: OpenAI's Latest Advanced Model

After a year-long wait, OpenAI has introduced the latest model in its GPT series, dubbed GPT-4o (the “o” stands for “omni”).

This model is exceptionally quick at processing text, audio, images, and video, demonstrating significant advancements in coding and multimodal reasoning while also supporting new capabilities like 3D rendering.

According to lmsys.org’s Chatbot Arena, ChatGPT-4o has already emerged as the top model overall, having been tested there anonymously as the well-known gpt2-chatbot discussed earlier.

Interestingly, the motivation behind this release focuses less on “pushing the veil of ignorance forward,” as stated by Sam Altman, and more on making cutting-edge AI accessible to billions at no cost.

Here’s what you need to know about ChatGPT-4o.

Considering AI Newsletters

You might be tired of AI newsletters that endlessly talk about recent events. While these updates are plentiful, they often lack depth, offering limited value and exaggerated hype.

Conversely, newsletters that forecast future developments are rare gems. If you seek clear insights into the future of AI, the TheTechOasis newsletter could be just what you need.

Subscribe today below:

TheTechOasis

The newsletter to stay ahead of the curve in AI (thetechoasis.beehiiv.com)

The Challenge of True Multimodality

Although Multimodal Large Language Models (MLLMs) are not new, GPT-4o stands out as the first to seamlessly integrate four modalities: audio, video, images, and text.

  • Models like Gemini 1.5 appeared multimodal across three of these categories, but native audio was notably absent.
  • GPT-4V offered audio processing and image generation, but only by integrating separate models such as Whisper and DALL·E 3.

In contrast, ChatGPT-4o is an all-in-one model, functioning natively across all modalities mentioned.

What does this mean?

I will elaborate more in my free newsletter this Thursday (see above), but the essence is that ChatGPT-4o transcends the traditional concept of being “just a Large Language Model.”

Typically, Large Language Models (LLMs) are designed for sequence-to-sequence tasks, which means they take in text and produce text in return.

When paired with image encoders, they can also handle images, and the same principle extends to other modalities. In most cases, however, these components are external: they are bolted onto the model rather than being part of it. As a result, while such LLMs can accept various data types, their ability to reason across modalities is limited.
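To make the distinction concrete, here is a minimal sketch of the older, pipelined approach, assuming the current OpenAI Python SDK and a hypothetical audio file. It illustrates the hand-off problem, not OpenAI's actual internal wiring:

```python
# Illustrative sketch of a pipelined "voice" setup (assumes openai>=1.0):
# a separate speech model transcribes the audio, and only the resulting text
# reaches the LLM, so tone, pauses, and emotion are lost in the hand-off.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Step 1: a dedicated speech-to-text model (Whisper) turns audio into plain text.
with open("question.mp3", "rb") as audio_file:  # hypothetical input file
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
    )

# Step 2: a text-only LLM answers, seeing nothing but the transcription.
reply = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": transcript.text}],
)
print(reply.choices[0].message.content)
```

A natively multimodal model collapses both steps into a single call that consumes the audio directly, which is exactly the cross-modal reasoning the pipelined version cannot perform.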

What does this imply?

As emphasized by Mira Murati during the official presentation, speech carries more than mere words; it encompasses tone, emotion, pauses, and other cues that contribute to the speaker's message.

For instance, the phrase “I’m going to kill you!” could convey very different meanings depending on the speaker's tone or context, such as whether they are serious or laughing.

Previously, ChatGPT processed only the transcription of speech, omitting these essential cues. Consequently, it struggled to interpret spoken language accurately.

Now, with ChatGPT-4o, all the necessary components to process and generate text, images, audio, and video (aside from video generation) are integrated into one model. Essentially, GPT-4o is the first AI that combines these modalities and reasons across them, mimicking human-like understanding.

With this context, what new capabilities were unveiled yesterday?

An All-Encompassing Model

Despite the presentation lasting only 30 minutes, a wealth of significant features was showcased.

In fact, ChatGPT-4o possesses many attributes that could elevate it from a tool used by millions to one utilized by billions.

A Remarkable Demonstration

One of the standout features is ChatGPT's ability to perform real-time video recognition, a capability that Google claimed for Gemini but failed to deliver.

In another instance, an audience member suggested real-time translation, which ChatGPT-4o executed flawlessly, thanks to its impressive human-level latency.

If you're curious about the drastic reduction in latency, it likely stems from the model no longer needing to send data to external models, as everything is now processed within the same framework.
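For illustration, here is a minimal, text-only sketch of that translator behavior through the public API. The live demo used speech in and speech out, so this only exercises the text side of gpt-4o, and the system prompt is my own wording:

```python
# Text-only sketch of the real-time interpreter demo (assumes openai>=1.0).
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "system",
            "content": (
                "You are a live interpreter. Translate English input into "
                "Italian and Italian input into English, preserving tone."
            ),
        },
        {"role": "user", "content": "Hey, how has your week been going?"},
    ],
)
print(response.choices[0].message.content)
```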

Another noteworthy application for voice assistants like ChatGPT-4o is in education, as a patient AI can aid students in mastering complex subjects:

The extent of the model's intelligence in educational contexts remains uncertain. OpenAI mentioned that ChatGPT-4o possesses “GPT-4-level intelligence,” so it's prudent to view this demonstration as a glimpse into the future rather than the current reality.

Memory was another intriguing feature that gained attention during video demonstrations. In one video, OpenAI President Greg Brockman encountered an “intruder” in his frame, which the model initially disregarded.

However, when Greg prompted it to respond, the model recalled the earlier interaction, indicating two things:

  1. The model seems to retain memory of prior events.
  2. It appears to have a system that allows it to focus on specific tasks while ignoring others. This could suggest that OpenAI has developed an advanced video encoding mechanism with a tiered focus.

In simpler terms, the model effectively prioritizes relevant details while disregarding distractions, enhancing overall efficiency and likely contributing to its remarkably low latency even with high-resolution video.

Naturally, excitement ensued on X, leading to numerous fascinating discussions. One particularly impressive thread came from Will Depue of OpenAI, showcasing various instances that demonstrated GPT-4o’s native multimodality.

The model appears to maintain character consistency across multiple generations without relying on ControlNet-style image conditioning:

Demonstration of character consistency in GPT-4o

With ControlNet, a sketch conditions the generative diffusion process to guide new images; here, simply prompting the model to keep the same girl and dog across images is enough.

The model can even take a single image and produce alternative 3D perspectives that can be compiled into an actual 3D rendering.

Generation of alternative 3D views by GPT-4o

It’s important to clarify that GPT-4o is not text-to-3D. However, it can generate various perspectives that can be used as inputs for 3D rendering software, contrary to claims made on X.

Aside from demonstrations, the model appears to dominate benchmark tests as well.

More Intelligent, but Not AGI

As suspected and confirmed by researchers at OpenAI, the “im-also-a-good-gpt2-chatbot” was, in fact, ChatGPT-4o all along.

Results shared by lmsys.org show that the gpt2-chatbot variants, now revealed to be GPT-4o, significantly outperform both GPT-4 and Claude 3 Opus in overall Elo (a relative quality rating).

Performance comparison of GPT-4o with previous models

There’s also a notable enhancement in coding, with a gain of roughly 100 Elo points over the previous best. For reference, a 100-point gap means the lower-rated model is expected to win only about a third of head-to-head comparisons.

In other words, ChatGPT-4o is preferred roughly two-thirds of the time (about 64%) over the previous best.
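As a sanity check, the standard Elo expected-score formula (the general chess formula, not anything lmsys-specific) gives roughly those numbers for a 100-point gap:

```latex
E_{\text{higher}} = \frac{1}{1 + 10^{(R_{\text{lower}} - R_{\text{higher}})/400}}
                  = \frac{1}{1 + 10^{-100/400}} \approx 0.64,
\qquad
E_{\text{lower}} = 1 - E_{\text{higher}} \approx 0.36
```

So the lower-rated model wins only about a third of the matchups.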

In the realm of coding, one significant announcement was the introduction of the ChatGPT desktop app, which will provide full-screen access to the model for tasks like debugging, as illustrated in this video.

Additionally, the announcement included substantial language improvements.

Serving 97% of the Global Population

OpenAI seems to have significantly enhanced the model’s tokenizer, particularly for non-English languages, claiming it can now cater to 97% of the world's population, which is quite an assertion.

To support this, they shared a table indicating a considerable reduction in tokens across various languages.

Token reduction across different languages

If you're curious why this compression is important, it signifies not only faster and more efficient outputs (fewer tokens to generate leads to better performance) but also showcases enhanced “language intelligence.”

In simpler terms, the fewer tokens needed to represent a language, the better the tokenizer has captured that language's patterns, and the less work the model has to do per sentence.

For example, “ing” is one of the most frequent three-letter combinations in English.

Consequently, as the tokenizer improves, it captures this pattern and, instead of generating it as several separate tokens, treats “ing” as a single token when generating English.
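You can observe this compression yourself with the tiktoken library: GPT-4o ships with the new o200k_base encoding, while GPT-4 used cl100k_base. The sample sentence below is my own choice (Hindi is one of the languages OpenAI highlighted), and the exact counts will vary with the text:

```python
# Compare token counts between the GPT-4 and GPT-4o tokenizers (tiktoken >= 0.7).
import tiktoken

gpt4_enc = tiktoken.get_encoding("cl100k_base")   # used by GPT-4 / GPT-4 Turbo
gpt4o_enc = tiktoken.get_encoding("o200k_base")   # used by GPT-4o

sample = "नमस्ते, आप कैसे हैं?"  # Hindi: "Hello, how are you?"
print(len(gpt4_enc.encode(sample)), "tokens with cl100k_base")
print(len(gpt4o_enc.encode(sample)), "tokens with o200k_base")
```

On text like this, the new encoding typically needs noticeably fewer tokens, which is what the table above quantifies.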

But is ChatGPT-4o genuinely that superior, marking a substantial leap in intelligence?

The answer is no.

Hold Your Horses, This Isn’t AGI

As shown in a graph shared by OpenAI, the model undoubtedly leads the pack, but its improvements in intelligence over others are slight.

Graph showing intelligence improvements of ChatGPT-4o

While this release may seem underwhelming in terms of intelligence advancements, I strongly disagree with that framing: this launch is not about reaching the next significant milestone.

However, what’s the purpose if the model hasn’t become smarter?

OpenAI’s True Intentions

In my view, this release has three key objectives:

  1. To buy time for the anticipated launch of the next AI frontier, potentially named GPT-5.
  2. To distract from Google’s I/O conference happening today.
  3. To secure a partnership with Apple.

Let’s explore each of these points.

The Next Frontier is Approaching, but Not Yet Here

OpenAI’s CTO, Mira Murati, explicitly addressed this issue. GPT-4o does not represent a significant leap in intelligence; they noted it retains “GPT-4-level intelligence.”

Furthermore, they indicated that we should expect updates regarding the next frontier soon.

As I have suggested before, I suspect this next phase will merge MLLMs with search algorithms, enabling the model to explore various solution paths before delivering an answer.

This shift would be considerably more resource-intensive, which could clarify why these companies are increasingly interested in acquiring more computing power despite the general trend toward smaller models.

Google’s Biggest Challenge

If you want to predict when OpenAI will release something new, pay attention to Google’s actions.

For instance, after Google announced its impressive million-token context window for Gemini 1.5, a significant upgrade in MLLM capabilities, OpenAI completely changed the narrative by unveiling Sora, its video generation model.

Now, just ahead of Google’s highly anticipated I/O conference today, OpenAI conducted its own presentation the day before, setting exceedingly high expectations for analysts regarding Google.

Simply put, this isn't just about Google showcasing its new AI features; it’s now a matter of “let’s see how Google responds to OpenAI’s announcements.”

We can debate Sam Altman’s aggressive competitive strategy, but it is undeniably effective. Few would choose him as an opponent.

Lastly, given the speculation surrounding the competition between Google and OpenAI for control over Siri, we need to consider Apple.

The Ultimate Goal?

If we analyze the potential benefits of partnering with Apple, securing the Siri contract may have been OpenAI's underlying intention all along.

With exceptional latency, engaging voice capabilities, and impressive performance across various data types, OpenAI is positioning itself as a partner to enhance Siri.

Unless Apple develops an outstanding on-device model, users will likely compare Siri unfavorably against state-of-the-art alternatives.

Consequently, this isn’t an ideal public-relations situation for Apple. Despite its enormous cash reserves, evidenced by the largest stock buyback in history at $110 billion, the company has little room for error.

As a result, there will be strong motivation to invest in GPT-4o (or whatever Google showcases today) while they address their internal AI challenges and work toward better products.

If this scenario unfolds, we must speculate on how this partnership could take shape.

Apple is known for its commitment to user privacy, which may conflict with utilizing a cloud-based LLM solution for Siri, particularly from a company that has previously faced allegations of copyright and security violations.

However, when financial interests clash with ethical considerations, companies often prioritize profit over principles.

Perhaps OpenAI could deliver a streamlined ChatGPT model capable of achieving similar results to ChatGPT-4o, but this remains speculative.

From Millions to Billions

Overall, OpenAI continues to impress. Yet, this time, their intentions may not be as clear-cut as in the past.

Generative AI products have a reputation for falling short of expectations, a reality that even applies to ChatGPT. Challenges such as latency and inadequate cross-modal reasoning persist.

Now, OpenAI believes it has the product that aligns AI with the high expectations often associated with the “biggest discovery since the Internet.”

While it’s premature to determine if this is the case, it certainly provides OpenAI with the tools to challenge Google and buy time for their most significant release yet—the next frontier of AI.

Nonetheless, GPT-4o still has limitations, and it does not bring us any closer to AGI than GPT-4 did.

However, it does bring Generative AI closer to society by making a powerful AI widely accessible (the product will be free), potentially reaching billions if the Siri partnership materializes. That kind of reach is precisely what AI needs to begin fulfilling its promises.

On a final note, if you enjoyed this article, I share similar insights in a more detailed and accessible manner for free on my LinkedIn.

If you prefer, you can connect with me on X.

I look forward to connecting with you.