
Building a Large Language Model from the Ground Up


This article serves as the sixth installment in a series that delves into practical applications of large language models (LLMs). Earlier discussions centered on utilizing pre-trained LLMs through techniques like prompt engineering and fine-tuning. While these methods are effective for most LLM applications, certain scenarios may warrant the creation of an LLM from the ground up. This piece will examine essential elements involved in building a foundational LLM, drawing insights from the development of models such as GPT-3, Llama, and Falcon.


Historically, training large-scale language models (with over 10 billion parameters) was primarily the domain of AI researchers. However, following the surge of interest in AI and LLMs post-ChatGPT, many organizations are now exploring the development of custom LLMs from scratch. While this might not be essential for over 99% of LLM applications, understanding the process of developing these large models is valuable.

What Are the Costs Involved?

Before examining the technical intricacies of LLM development, it is beneficial to estimate the associated financial costs. For instance, Meta's Llama 2 models required approximately 180,000 GPU hours to train the 7 billion parameter model and around 1.7 million GPU hours for the 70 billion parameter model. As an order-of-magnitude extrapolation, a model with roughly 10 billion parameters can be expected to need on the order of 100,000 GPU hours, while a 100 billion parameter model needs on the order of 1 million GPU hours.

In terms of commercial cloud computing expenses, renting an Nvidia A100 GPU—which was employed for training Llama 2—costs about $1 to $2 per GPU per hour. Consequently, training a model with around 10 billion parameters could cost approximately $150,000, while a model with 100 billion parameters could reach around $1.5 million.

Alternatively, acquiring the GPUs could be an option for those who prefer not to rent. In that case, the total training expense includes the purchase price of the A100 GPUs plus energy costs. A single A100 GPU is priced around $10,000, so a cluster of 1,000 GPUs represents a hardware expenditure of approximately $10 million. Furthermore, training a 100 billion parameter model consumes on the order of 1,000 megawatt hours of energy; at roughly $100 per megawatt hour, that adds an energy cost of around $100,000.
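
To make the arithmetic concrete, here is a small back-of-the-envelope script in Python. The GPU-hour counts, the $1.50/hour rental rate, and the ~1,000 megawatt hour energy figure are the rough assumptions quoted above, not measured values.

```python
# Back-of-the-envelope LLM training cost estimates (rough assumptions from the text above)
gpu_hours = {"~10B parameters": 100_000, "~100B parameters": 1_000_000}
rental_rate = 1.50  # assumed mid-point of the $1-$2 per A100-hour range

for model, hours in gpu_hours.items():
    print(f"{model}: {hours:,} GPU-hours -> ~${hours * rental_rate:,.0f} rented")

# Owning the hardware instead: ~1,000 A100s at ~$10,000 each, plus energy
hardware_cost = 1_000 * 10_000      # ~$10M up front
energy_cost = 1_000 * 100           # ~1,000 MWh at ~$100/MWh -> ~$100k
print(f"Purchased cluster: ~${hardware_cost:,} hardware + ~${energy_cost:,} energy")
```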

It is crucial to note that these calculations exclude the cost of hiring a team consisting of machine learning engineers, data engineers, data scientists, and others involved in the model development process, which can easily escalate to $1 million.

In summary, training an LLM from scratch represents a substantial investment, particularly at this stage. Thus, there should be a compelling reason for organizations to pursue this route instead of utilizing prompt engineering or fine-tuning existing models.

Four Essential Steps

If you are reconsidering the idea of training an LLM from the ground up (or perhaps you're still determined), let's outline the model development process. This can be broken down into four fundamental steps:

  1. Data Curation
  2. Model Architecture
  3. Training at Scale
  4. Evaluation

While each of these steps involves intricate technical details, this discussion will remain at a high level, highlighting a few key points. For a more in-depth understanding of any aspect, readers are encouraged to refer to the cited resources.

#### Step 1: Data Curation

Machine learning models are a product of their training data, meaning the quality of your model is directly linked to the quality of your data. This presents a significant challenge for LLMs, given the vast amounts of data required. For context, here are the training set sizes for some well-known foundational models:

  • GPT-3 175b: 0.5 trillion tokens
  • Llama 70b: 2 trillion tokens
  • Falcon 180b: 3.5 trillion tokens

This signifies approximately a trillion words of text, equating to around a million novels or one billion news articles.

The primary source for this data is the internet, encompassing a wide range of text sources, including webpages, books, scholarly articles, codebases, and conversational data. Numerous open datasets are available for training LLMs, such as Common Crawl and its filtered variants, The Pile (a diverse 825 GB dataset), and others accessible via platforms like Hugging Face.

Alternatively, existing LLMs can be utilized to generate high-quality training text. For example, researchers at Stanford employed GPT-3 to create the training corpus for Alpaca, an LLM designed with an instruction-input-output format.
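
For reference, an Alpaca-style training example follows the instruction-input-output layout mentioned above. The record below is an illustrative, made-up example of that format rather than an entry from the actual Alpaca dataset.

```python
# Illustrative Alpaca-style record (instruction-input-output format); contents are made up
example = {
    "instruction": "Summarize the following passage in one sentence.",
    "input": "Large language models are trained on trillions of tokens of text ...",
    "output": "LLMs learn language patterns from very large text corpora.",
}
```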

Regardless of the source, diversity is critical in a training dataset, as it enhances model generalization for downstream tasks. Most popular foundational models exhibit a degree of training data diversity.

Comparison of training data diversity across foundational models

##### Preparing the Data

Collecting a vast array of text data is only the first step; ensuring the quality of that data is paramount. Numerous techniques can be employed, but here are four essential text preprocessing steps:

  • Quality Filtering: This involves eliminating low-quality text from the dataset, which can include nonsensical content, toxic comments, or irrelevant characters. Techniques can be classifier-based or heuristic-based.
  • De-duplication: This step is crucial to avoid biases introduced by repeated instances of the same text, which can distort the training process.
  • Privacy Redaction: Sensitive personal information must be removed to prevent the model from inadvertently revealing confidential data.
  • Tokenization: Language models operate on numerical data, so the training data must be converted into numerical form through a process called tokenization. The byte pair encoding (BPE) algorithm is a popular method for this; a brief tokenizer-training sketch follows this list.
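
To make the tokenization step concrete, here is a minimal sketch that trains a small BPE tokenizer with the Hugging Face tokenizers library. The two-sentence corpus, the vocabulary size, and the special tokens are placeholder choices; a real run would iterate over the curated dataset and use a vocabulary in the tens of thousands.

```python
# Minimal BPE tokenizer training sketch using the Hugging Face `tokenizers` library
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

corpus = [
    "I hit the baseball with a bat.",
    "Large language models are trained on trillions of tokens.",
]  # placeholder corpus; in practice, iterate over the curated dataset

tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.BpeTrainer(vocab_size=1_000, special_tokens=["[UNK]", "[PAD]"])
tokenizer.train_from_iterator(corpus, trainer)

print(tokenizer.encode("I hit the bat with a baseball.").ids)  # token IDs the model sees
```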

#### Step 2: Model Architecture

Transformers have established themselves as the leading architecture for language modeling. While this sets boundaries for model design, several high-level decisions can still be made within this framework.

##### What Is a Transformer?

A transformer is a neural network architecture that employs attention mechanisms to relate inputs and outputs. Attention mechanisms learn the dependencies between different elements in a sequence based on their content and position, highlighting the importance of context in language.

For example, in the sentence "I hit the baseball with a bat," the word "baseball" indicates that "bat" refers to a baseball bat. Conversely, if the sentence were rearranged to "I hit the bat with a baseball," the meaning changes entirely.

The attention mechanism allows the neural network to consider both content and position, a concept long recognized in machine learning. However, the transformative aspect of the Transformer is its ability to conduct computations in parallel, yielding significant speed advantages over recurrent neural networks.
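
To make the mechanism concrete, below is a minimal single-head scaled dot-product attention sketch in PyTorch. The tensor shapes and the absence of multiple heads, learned projections, and masking are simplifications for illustration.

```python
# Minimal single-head scaled dot-product attention sketch (PyTorch)
import math
import torch

def attention(q, k, v):
    # q, k, v: (batch, seq_len, d_model)
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))  # content-based similarity
    weights = torch.softmax(scores, dim=-1)                   # how strongly each token attends to the others
    return weights @ v                                        # weighted mix of value vectors

x = torch.randn(1, 7, 64)   # e.g., the 7 tokens of "I hit the baseball with a bat"
out = attention(x, x, x)    # self-attention: queries, keys, and values come from the same sequence
print(out.shape)            # torch.Size([1, 7, 64])
```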

##### Types of Transformers

Transformers consist of two main components: an encoder and a decoder. These modules can function independently or in combination, leading to three types of transformers:

  • Encoder-only: This type translates tokens into semantically meaningful numerical representations using self-attention. It excels in tasks like text classification. Google’s BERT is a well-known example.
  • Decoder-only: This type also translates tokens into numerical representations, but masks self-attention so that tokens cannot attend to future elements of the sequence, making it suitable for text generation tasks (a causal-mask sketch follows this list). Most LLMs, including GPT-3 and Llama, fall into this category.
Illustration of self-attention and masked self-attention weight matrices
  • Encoder-Decoder: This combines both modules, allowing for cross-attention, which learns dependencies between tokens in different sequences. This architecture is useful for tasks requiring input, such as translation or summarization.
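
The practical difference between encoder-style self-attention and decoder-style masked self-attention is a causal mask that blocks attention to future positions. The sketch below shows one common way to build and apply such a mask; shapes and values are illustrative.

```python
# Masked (causal) self-attention: each token may only attend to itself and earlier tokens
import math
import torch

seq_len, d = 7, 64
x = torch.randn(1, seq_len, d)

scores = x @ x.transpose(-2, -1) / math.sqrt(d)
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))  # lower-triangular mask
scores = scores.masked_fill(~causal_mask, float("-inf"))                  # future positions get zero weight after softmax
weights = torch.softmax(scores, dim=-1)
out = weights @ x
```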

##### Additional Design Considerations

  • Residual Connections (RC) help improve training stability by allowing intermediate values to bypass certain layers.
  • Layer Normalization (LN) standardizes values between layers to enhance training stability and speed.
  • Activation Functions (AF) introduce non-linearities that help the model capture complex relationships between inputs and outputs.
  • Position Embedding (PE) captures token position information, which is vital for understanding sequences in language. A minimal block combining these pieces is sketched after this list.
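
Below is a minimal sketch of how these pieces typically fit together in a single pre-norm decoder-style block. The layer sizes, the pre-norm placement, and the GELU activation are common but by no means the only choices, and position embeddings are assumed to have been added to the input beforehand.

```python
# Minimal pre-norm transformer block: layer norm, self-attention, residuals, and a GELU MLP
import torch
import torch.nn as nn

class Block(nn.Module):
    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)                                     # layer normalization (LN)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(                                            # feed-forward with a non-linearity (AF)
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out                 # residual connection (RC) around attention
        x = x + self.mlp(self.ln2(x))    # residual connection (RC) around the MLP
        return x

x = torch.randn(1, 7, 256)  # (batch, seq_len, d_model); position embeddings (PE) would be added to x beforehand
print(Block()(x).shape)
```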

##### Determining Model Size

Finding the right balance between model size, training duration, and dataset size is crucial. Models that are too large may overfit, while those that are too small may underperform. Research on compute-optimal scaling (the Chinchilla analysis) suggests pairing model size with training data at roughly 20 tokens per model parameter; for example, a 10 billion parameter model would call for on the order of 200 billion training tokens.
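
As a rough worked example of this balance, the snippet below applies two widely cited rules of thumb: about 20 training tokens per parameter (from the Chinchilla analysis) and about 6 × parameters × tokens floating-point operations for training. Both are approximations, not exact prescriptions.

```python
# Rough compute-optimal sizing using two common rules of thumb (approximations only)
def plan(n_params: float, tokens_per_param: float = 20.0):
    tokens = tokens_per_param * n_params   # ~20 tokens per parameter (Chinchilla heuristic)
    train_flops = 6 * n_params * tokens    # ~6 * N * D FLOPs for one training run
    return tokens, train_flops

for n in (10e9, 100e9):
    tokens, flops = plan(n)
    print(f"{n / 1e9:.0f}B params -> ~{tokens / 1e12:.1f}T tokens, ~{flops:.1e} training FLOPs")
```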

#### Step 3: Training at Scale

LLMs are typically trained through self-supervised learning, which involves predicting the next token in a sequence. However, scaling up to models with tens or hundreds of billions of parameters presents challenges.

To optimize training, several strategies can be employed:

  • Mixed Precision Training utilizes both 32-bit and 16-bit floating-point types to reduce computational costs, memory requirements, and training time.
  • 3D Parallelism combines multiple parallelization techniques to distribute training effectively across computational resources.
  • Zero Redundancy Optimizer (ZeRO) minimizes data redundancy in the optimizer state and gradient, enhancing efficiency.

These techniques are integrated into libraries like DeepSpeed, which streamline the process of large-scale model training.
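
As one concrete example, mixed precision training in plain PyTorch looks roughly like the loop below; the tiny linear model, random data, and optimizer settings are placeholders. Libraries such as DeepSpeed wrap this, along with the parallelism and ZeRO machinery, behind a configuration file.

```python
# Mixed precision training sketch with PyTorch AMP (placeholder model, data, and settings; requires a GPU)
import torch

model = torch.nn.Linear(512, 512).cuda()     # stand-in for a real LLM
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
scaler = torch.cuda.amp.GradScaler()         # scales the loss to avoid half-precision underflow

for step in range(100):
    x = torch.randn(8, 512, device="cuda")
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast():          # run eligible ops in reduced precision
        loss = model(x).pow(2).mean()        # placeholder loss
    scaler.scale(loss).backward()            # backward pass on the scaled loss
    scaler.step(optimizer)                   # unscales gradients, then updates parameters
    scaler.update()
```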

##### Training Stability

Ensuring training stability is another challenge; stable training is characterized by a smooth, steady reduction in training loss. Strategies to maintain stability, illustrated in the sketch after this list, include:

  • Checkpointing: Saves model states to allow resumption of training from a previous point in case of failure.
  • Weight Decay: Regularizes model parameters to prevent overfitting.
  • Gradient Clipping: Rescales gradients to mitigate the exploding gradient problem.
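
These stability measures map onto a few lines of standard PyTorch, sketched below. The weight decay value, clipping threshold, checkpoint interval, and file path are illustrative defaults rather than recommendations.

```python
# Stability measures in a plain PyTorch loop: weight decay, gradient clipping, and checkpointing
import torch

model = torch.nn.Linear(512, 512)            # stand-in for a real LLM
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)  # weight decay

for step in range(1_000):
    loss = model(torch.randn(8, 512)).pow(2).mean()   # placeholder loss
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # gradient clipping
    optimizer.step()

    if step % 500 == 0:   # checkpointing: save enough state to resume after a failure
        torch.save({"step": step,
                    "model": model.state_dict(),
                    "optimizer": optimizer.state_dict()},
                   f"checkpoint_{step}.pt")
```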

##### Hyperparameters

Hyperparameters govern model training processes and include:

  • Batch Size: The number of samples processed before updating parameters.
  • Learning Rate: The step size for optimization updates.
  • Optimizer: Determines how model parameters are updated.
  • Dropout: Randomly zeros out a portion of parameters during training to avoid overfitting.

As training an LLM is resource-intensive, understanding the trade-offs between model size, training time, and performance is beneficial.

#### Step 4: Evaluation

Once a model is trained, the evaluation process becomes crucial. This typically involves assessing the model's performance on various tasks. Common benchmarks for LLM evaluation include:

  • ARC: A question-answering dataset with grade-school science questions.
  • Hellaswag: A commonsense natural language inference dataset designed to challenge machines.
  • MMLU: Evaluates world knowledge across multiple subjects.
  • TruthfulQA: Assesses the truthfulness of model responses.

For multiple-choice tasks, performance can be evaluated using prompt templates. More open-ended tasks, however, may require manual evaluation or the use of NLP metrics to quantify the model's output.
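
One common way to score a multiple-choice benchmark with a decoder-only model is to compare the model's log-likelihood of each candidate answer given the question and pick the highest. The sketch below does this with the Hugging Face transformers library; the model name, question, and choices are placeholders, and real evaluation harnesses add prompt templates, length normalization, and few-shot examples.

```python
# Score multiple-choice answers by log-likelihood under a causal LM (placeholder model and question)
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"   # placeholder; substitute the model being evaluated
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

question = "Q: Which gas do plants absorb during photosynthesis? A:"
choices = [" Carbon dioxide", " Oxygen", " Nitrogen", " Helium"]

def answer_loglik(question: str, choice: str) -> float:
    prompt_len = tok(question, return_tensors="pt").input_ids.size(1)
    full_ids = tok(question + choice, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)      # predictions for tokens 2..N
    token_ll = log_probs.gather(-1, full_ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    return token_ll[:, prompt_len - 1:].sum().item()           # sum over the answer tokens only

scores = [answer_loglik(question, c) for c in choices]
print(choices[max(range(len(choices)), key=scores.__getitem__)])
```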

What Lies Ahead?

While this article provides a foundational overview of constructing a large language model from scratch, further exploration into the mentioned topics is encouraged. Regardless of whether you opt for a pre-existing foundational model or choose to build your own, it is essential to recognize that base models often serve as starting points for AI solutions rather than final products.

For further insights into LLMs, consider exploring articles on prompt engineering and fine-tuning.

Resources

  • Connect: My website | Book a call | Ask me anything
  • Socials: YouTube | LinkedIn | Twitter
  • Support: Buy me a coffee

[1] BloombergGPT | Paper
[2] Llama 2 Paper
[3] LLM Energy Costs
[4] arXiv:2005.14165 [cs.CL]
[5] Falcon 180b Blog
[6] arXiv:2101.00027 [cs.CL]
[7] Alpaca Repo
[8] arXiv:2303.18223 [cs.CL]
[9] arXiv:2112.11446 [cs.CL]
[10] arXiv:1508.07909 [cs.CL]
[11] SentencePiece Repo
[12] Tokenizers Doc
[13] arXiv:1706.03762 [cs.CL]
[14] Andrej Karpathy Lecture
[15] Hugging Face NLP Course
[16] arXiv:1810.04805 [cs.CL]
[17] arXiv:1910.13461 [cs.CL]
[18] arXiv:1603.05027 [cs.CV]
[19] arXiv:1607.06450 [stat.ML]
[20] arXiv:1803.02155 [cs.CL]
[21] arXiv:2203.15556 [cs.CL]
[22] Trained with Mixed Precision Nvidia Doc
[23] DeepSpeed Doc
[24] Weight Decay Methodology
[25] Gradient Clipping Explanation
[26] Scaling Laws for Language Models
[27] ARC Dataset
[28] Hellaswag Dataset
[29] MMLU Dataset
[30] TruthfulQA Benchmark
[31] Evaluating MMLU Leaderboard