Revealing the Truth About GPT-4: The Expert Model Unveiled

GPT-4: The Hidden Structure Emerges

The GPT-4 model has revolutionized the field and is accessible to the public, either for free or through a commercial platform still in beta. It has sparked numerous project ideas among entrepreneurs, but the secrecy surrounding its parameters has frustrated many who anticipated a single model boasting anywhere from 1 trillion to 100 trillion parameters!

The Secret is Out

On June 20th, George Hotz, the founder of Comma.ai, disclosed that GPT-4 is not a singular, dense model like its predecessors, GPT-3 and GPT-3.5. Instead, it comprises a mixture of eight models, each with 220 billion parameters. This revelation was later corroborated by Soumith Chintala, co-founder of PyTorch at Meta, and hinted at by Mikhail Parakhin, lead of Microsoft Bing AI.

GPT-4: An Ensemble of Models

The key takeaway from these discussions is that GPT-4 functions not as a monolith but as an ensemble of eight smaller models, each contributing its own expertise. This approach relies on a well-established technique known as the "mixture of experts" paradigm, reminiscent of a mythological multi-headed creature.

Understanding the Mixture of Experts Paradigm

The "Mixture of Experts" (MoE) is a specialized ensemble learning technique tailored for neural networks. Unlike traditional ensemble methods, MoE effectively breaks down tasks into subtasks, assigning experts to tackle each one. This strategy allows for a more refined model for each sub-task, where a meta-model determines which expert is most suited for a specific task.

The Essence of Mixture of Experts

- Task division: the main problem is segmented into manageable subtasks, often using domain-specific knowledge.
- Expert development: each subtask is handled by an expert model, typically a neural network trained to make predictions for it.
- Gating mechanism: a gating model evaluates the predictions from each expert and decides which one to trust for a given input.
- Pooling predictions: the outputs from the experts are combined to produce a final prediction.

[Figure: architecture of the mixture-of-experts approach]
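To make these four pieces concrete, here is a minimal mixture-of-experts layer sketched in PyTorch. It illustrates the paradigm only and is not GPT-4's actual implementation; the expert count, layer sizes, and the simple feed-forward experts are placeholder assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixtureOfExperts(nn.Module):
    """Toy MoE layer: experts, a gating model, and pooling of predictions."""
    def __init__(self, d_in, d_out, num_experts=8):
        super().__init__()
        # Expert development: one small feed-forward network per expert.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_in, 4 * d_in), nn.ReLU(), nn.Linear(4 * d_in, d_out))
            for _ in range(num_experts)
        )
        # Gating mechanism: scores the experts for each individual input.
        self.gate = nn.Linear(d_in, num_experts)

    def forward(self, x):  # x: (batch, d_in)
        weights = F.softmax(self.gate(x), dim=-1)                    # (batch, num_experts)
        outputs = torch.stack([e(x) for e in self.experts], dim=1)   # (batch, num_experts, d_out)
        # Pooling: combine expert predictions, weighted by the gate.
        return torch.einsum("be,beo->bo", weights, outputs)

layer = MixtureOfExperts(d_in=512, d_out=512)
y = layer(torch.randn(4, 512))  # y has shape (4, 512)
```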

How the Eight Models of GPT-4 Collaborate

The first step in this approach is to divide the predictive problem into subtasks, which can be informed by domain knowledge. For instance, an image might be analyzed by separating elements such as the background, objects, and colors.

The Role of Expert Models

Each expert model receives the same input and makes its own prediction. Though neural networks are traditionally used, any predictive model can be plugged in.
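As a toy illustration of that pluggability, the sketch below feeds the same input to two hypothetical experts that share nothing but a predict interface; the class names, shapes, and rules are invented for the example.

```python
import numpy as np

class LinearExpert:
    """A plain linear model acting as an expert."""
    def __init__(self, w, b=0.0):
        self.w, self.b = w, b
    def predict(self, x):
        return x @ self.w + self.b

class ThresholdExpert:
    """A simple rule-based expert: no neural network involved."""
    def predict(self, x):
        return (x.sum(axis=1) > 0).astype(float)

experts = [LinearExpert(np.ones(3)), ThresholdExpert()]
x = np.random.randn(5, 3)
predictions = [expert.predict(x) for expert in experts]  # every expert sees the same input
```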

Function of the Gating Model

The gating model interprets the predictions from each expert and determines which expert's output should be prioritized for the input at hand. Its decision adjusts dynamically to the specific features of the input data.
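One common way to make such a gate sparse and input-dependent is the top-k variant popularized by sparsely gated MoE layers; the sketch below is an assumption about how a gate of that kind can be written, not a confirmed detail of GPT-4.

```python
import torch
import torch.nn.functional as F

def top_k_gate(x, gate_weights, k=2):
    """x: (batch, d_in); gate_weights: (d_in, num_experts)."""
    logits = x @ gate_weights                      # score every expert for every input
    top_vals, top_idx = logits.topk(k, dim=-1)     # keep only the k best experts
    sparse = torch.full_like(logits, float("-inf"))
    sparse.scatter_(-1, top_idx, top_vals)         # mask out the rest
    return F.softmax(sparse, dim=-1)               # unchosen experts get exactly zero weight

weights = top_k_gate(torch.randn(4, 512), torch.randn(512, 8), k=2)  # (4, 8), two nonzeros per row
```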

Pooling Mechanism

The final prediction is produced by an aggregation step, which might select the output of the most confident expert or compute a weighted sum of all predictions.
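Both aggregation strategies can be written in a few lines; the tensor shapes below are placeholders for illustration.

```python
import torch

expert_outputs = torch.randn(4, 8, 512)                  # (batch, num_experts, d_out)
gate_weights = torch.softmax(torch.randn(4, 8), dim=-1)  # per-input confidence in each expert

# Option 1: trust only the most confident expert for each input.
best = gate_weights.argmax(dim=-1)                        # (batch,)
selected = expert_outputs[torch.arange(4), best]          # (batch, d_out)

# Option 2: weighted sum of all expert predictions.
pooled = torch.einsum("be,beo->bo", gate_weights, expert_outputs)
```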

Switch Routing vs. Mixture of Experts

OpenAI appears to have utilized a switch routing mechanism, which differs from the traditional MoE: each token is routed to a single expert model, streamlining computation and reducing overall complexity, as sketched after the list below.

Advantages of switch routing:

1. Reduced router computation, since each token is evaluated by only one expert.
2. The batch size (expert capacity) of each expert can be at least halved.
3. Simplified routing implementation and reduced communication costs.
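Here is a rough sketch of switch-style top-1 routing, written in the spirit of the Switch Transformer idea rather than as GPT-4's actual code; each token is dispatched to exactly one expert, so only that expert's forward pass runs for it.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwitchRouter(nn.Module):
    """Toy switch routing: top-1 expert per token."""
    def __init__(self, d_model, num_experts=8):
        super().__init__()
        self.experts = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(num_experts))
        self.router = nn.Linear(d_model, num_experts)

    def forward(self, tokens):  # tokens: (num_tokens, d_model)
        probs = F.softmax(self.router(tokens), dim=-1)
        weight, choice = probs.max(dim=-1)           # a single expert per token
        out = torch.zeros_like(tokens)
        for i, expert in enumerate(self.experts):
            mask = choice == i
            if mask.any():                           # run each expert only on its own tokens
                out[mask] = weight[mask].unsqueeze(-1) * expert(tokens[mask])
        return out

y = SwitchRouter(d_model=512)(torch.randn(16, 512))  # (16, 512)
```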

Insights and Future Directions

It's essential to approach these revelations with caution, as they stem from unofficial sources. However, if true, OpenAI's strategy of keeping this architecture under wraps has generated significant buzz and may have helped the company maintain a competitive edge.

While the performance of GPT-4 is impressive, its architecture does not represent a groundbreaking innovation but rather an astute application of existing techniques. OpenAI has neither confirmed nor denied these claims, but the evidence suggests that this model could indeed reflect the reality of GPT-4's design.

Credit is due to Alberto Romero for his investigative work that brought these insights to light.

[Figure: visualization of switch routing dynamics]