Understanding Different Types of Position Embeddings in LLMs

Position embeddings have become increasingly important in recent large language models (LLMs). This article delves into the underlying concepts of position embeddings, exploring their various forms and distinguishing features.

What are Position Embeddings?

In recurrent neural networks (RNNs), the hidden state is updated based on the current and previous states. In contrast, transformers do not inherently capture the sequence order within sentences. This limitation arises because the attention mechanism assesses relationships among tokens, with each token attending to all others in the sequence without considering their sequential positions.

This means that, without positional information, a transformer treats a sentence and a reordered version of it as the same bag of tokens: for example, "the dog chased the cat" and "the cat chased the dog" would be interpreted identically.
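As a minimal NumPy sketch (not tied to any particular framework; the toy dimensions are arbitrary), plain scaled dot-product self-attention produces the same output vectors for a sequence and for a shuffled copy of it, only reordered:

```python
import numpy as np

def attention(x):
    """Single-head scaled dot-product self-attention with identity projections."""
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)                        # query-key similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # softmax over keys
    return weights @ x                                   # weighted sum of values

rng = np.random.default_rng(0)
tokens = rng.normal(size=(5, 8))                         # 5 "token embeddings" of dim 8
perm = rng.permutation(5)

out = attention(tokens)
out_shuffled = attention(tokens[perm])

# The outputs are the same vectors, just reordered to match the shuffle:
print(np.allclose(out[perm], out_shuffled))              # True
```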

To remedy this, researchers have proposed ways to inject order information, known as position embeddings. These are vectors added to the token embeddings that encode where each token sits in the sequence. Absolute positional embeddings, introduced in the original transformer paper, are the earliest example.

What are Absolute Positional Embeddings?

Absolute positional embeddings are vectors that match the dimensionality of the corresponding word embeddings. Each vector denotes the position of a word within a sentence, ensuring that each word has a distinct absolute embedding representing its location.

To generate these absolute embeddings, two primary approaches can be utilized:

  1. Learning from Data: In this method, position embeddings are treated as learnable parameters. A unique vector embedding is assigned to each position, constrained by a predetermined maximum length.
  2. Sinusoidal Functions: This approach computes each position's vector from fixed mathematical functions. The original transformer paper alternates sine and cosine across the embedding dimensions, keeping the same dimension as the word embedding, with each dimension pair oscillating at a different frequency so that every position receives a unique pattern (see the sketch after this list).
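A minimal NumPy sketch of the sinusoidal variant (the formula follows the original transformer paper; the function name and shapes are illustrative). The learned variant would simply replace this fixed table with a trainable embedding matrix of the same shape:

```python
import numpy as np

def sinusoidal_position_embeddings(max_len: int, d_model: int) -> np.ndarray:
    """Return a (max_len, d_model) matrix of fixed sinusoidal position embeddings.

    Even dimensions use sin, odd dimensions use cos, and each dimension pair
    has its own frequency, as in "Attention Is All You Need".
    """
    positions = np.arange(max_len)[:, None]              # (max_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]             # (1, d_model / 2)
    angles = positions / (10000 ** (dims / d_model))     # (max_len, d_model / 2)

    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# The embedding for position p is simply added to the token embedding at p:
pe = sinusoidal_position_embeddings(max_len=512, d_model=64)
print(pe.shape)  # (512, 64)
```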

Both methods exhibit comparable performance in experiments.

The Disadvantages of Absolute Positional Embeddings

One drawback of absolute embeddings is that the transformer merely memorizes the position vectors without learning anything about relative positioning. The model has no built-in notion that positions 1 and 2 are closer together than positions 1 and 500; any such pattern has to be inferred from the data.

Additionally, absolute positional embeddings struggle with positions not encountered during training. If a model trained on sequences of 512 tokens is later presented with a longer sequence, the embedding for, say, position 520 is one the model has never seen, so predictions beyond the training length degrade.

The extrapolation results reported in the "Train Short, Test Long" paper show how perplexity escalates once the number of inference tokens exceeds the 512 seen during training: sinusoidal embeddings deteriorate sharply, the rotary and T5 bias variants hold up better, and ALiBi embeddings, which will be discussed in a subsequent article, yield the best results.

What Are Relative Positional Embeddings?

Relative positional embeddings do not encode the position of each individual token; instead, they encode the distance between pairs of tokens, telling the model how far apart any two tokens in the sentence are.

Since relative positional embeddings are computed between two vectors, they cannot be directly added to the token vector. Instead, they are incorporated into the self-attention mechanism.

Initially, a matrix that indicates the distances between tokens is constructed.

Next, each distance in this matrix indexes a learned relative-position embedding. One set of these embeddings is added to the value vectors, modifying the weighted sum that produces the attention output.

A second set is added to the key vectors, so the positional information also enters the multiplication of query and key vectors that produces the attention scores.
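In the notation of the Shaw et al. paper, the attention logits and outputs become

$$e_{ij} = \frac{x_i W^Q \left(x_j W^K + a^K_{ij}\right)^{\top}}{\sqrt{d_z}}, \qquad z_i = \sum_j \alpha_{ij}\left(x_j W^V + a^V_{ij}\right),$$

where $a^K_{ij}$ and $a^V_{ij}$ are the relative-position embeddings looked up from the distance between positions $i$ and $j$, and $\alpha_{ij}$ is the softmax of $e_{ij}$ over $j$.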

The paper on relative positional encoding also introduces clipping: beyond a certain distance, new relative positions add little information, so the embedding for the maximum distance is simply reused. For instance, the relationship between tokens 1 and 499 is unlikely to differ meaningfully from the relationship between tokens 1 and 500.

The clipping parameter K, which indicates when to start reusing values, is a hyperparameter that can be selected. This method effectively addresses issues in processing long sequences but also has its limitations.
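A minimal NumPy sketch of the clipped distance matrix (the function name and shapes are illustrative, not taken from any specific codebase):

```python
import numpy as np

def relative_position_matrix(seq_len: int, k: int) -> np.ndarray:
    """Pairwise relative distances j - i, clipped to the range [-k, k].

    The clipped distances (shifted to be non-negative) can index a table of
    2k + 1 learned relative-position embeddings shared across long distances.
    """
    positions = np.arange(seq_len)
    distances = positions[None, :] - positions[:, None]   # (seq_len, seq_len)
    return np.clip(distances, -k, k) + k                  # indices in [0, 2k]

print(relative_position_matrix(seq_len=5, k=2))
# [[2 3 4 4 4]
#  [1 2 3 4 4]
#  [0 1 2 3 4]
#  [0 0 1 2 3]
#  [0 0 0 1 2]]
```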

The Disadvantages of Relative Positional Embeddings

Because the relative positional information has to be injected in two places inside the attention computation, it adds overhead during both training and inference.

The comparison reported in the paper "Train Short, Test Long" contrasts different positional-embedding methods, including the relative positional embeddings used by the T5 bias variant.

It is evident that relative positional embeddings are slower than other types during both training and inference. Additionally, each time a new token is added, new key-value pairs must be computed, which prevents caching.

What Are Rotary Positional Embeddings?

Rotary positional embeddings (RoPE) take a different approach: instead of adding a positional vector, they rotate the token's embedding. The rotation angle is determined by the token's position in the sequence, so a token at position m is rotated by m times a fixed base angle.

This method ensures that the angle between tokens remains consistent when adding tokens at either end, thus maintaining relative positional information.

For a 2D scenario, the following equation applies:
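In the RoFormer paper's notation, the projected query or key at position $m$ is rotated by the angle $m\theta$:

$$f_{\{q,k\}}(x_m, m) = \begin{pmatrix} \cos m\theta & -\sin m\theta \\ \sin m\theta & \cos m\theta \end{pmatrix} W_{\{q,k\}}\, x_m$$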

The first matrix is the rotation operation, applied to the projected query or key (the weight matrix times the token vector, the last factor in the product). Because both queries and keys are rotated before they are multiplied together, their dot product depends only on the difference between their positions, so the attention score is unchanged when both tokens are shifted by the same amount m.

In higher dimensions, the equation is as follows:
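Following the RoFormer paper, the rotation becomes a block-diagonal matrix, one 2×2 rotation per pair of dimensions:

$$f_{\{q,k\}}(x_m, m) = R^{d}_{\Theta,m}\, W_{\{q,k\}}\, x_m, \qquad
R^{d}_{\Theta,m} = \begin{pmatrix}
\cos m\theta_1 & -\sin m\theta_1 & \cdots & 0 & 0 \\
\sin m\theta_1 & \cos m\theta_1 & \cdots & 0 & 0 \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & \cdots & \cos m\theta_{d/2} & -\sin m\theta_{d/2} \\
0 & 0 & \cdots & \sin m\theta_{d/2} & \cos m\theta_{d/2}
\end{pmatrix}$$

with $\theta_i = 10000^{-2(i-1)/d}$ for $i = 1, \dots, d/2$.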

Since a rotation matrix acts on two dimensions at a time, the d-dimensional vector is split into d/2 pairs, and each pair of values is rotated independently.

For a given token, the position index m is the same for every pair, while the base angle θ differs for each dimension pair. Because the full rotation matrix is sparse (block-diagonal), the rotation can be computed efficiently element-wise, following the equation below:
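In the RoFormer paper this element-wise form is

$$R^{d}_{\Theta,m}\, x =
\begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \\ \vdots \\ x_{d-1} \\ x_d \end{pmatrix}
\odot
\begin{pmatrix} \cos m\theta_1 \\ \cos m\theta_1 \\ \cos m\theta_2 \\ \cos m\theta_2 \\ \vdots \\ \cos m\theta_{d/2} \\ \cos m\theta_{d/2} \end{pmatrix}
+
\begin{pmatrix} -x_2 \\ x_1 \\ -x_4 \\ x_3 \\ \vdots \\ -x_d \\ x_{d-1} \end{pmatrix}
\odot
\begin{pmatrix} \sin m\theta_1 \\ \sin m\theta_1 \\ \sin m\theta_2 \\ \sin m\theta_2 \\ \vdots \\ \sin m\theta_{d/2} \\ \sin m\theta_{d/2} \end{pmatrix}$$

A minimal NumPy sketch of this computation (function name and dimensions are illustrative); the final check confirms that dot products between rotated queries and keys depend only on the relative position:

```python
import numpy as np

def rope(x: np.ndarray, m: int, base: float = 10000.0) -> np.ndarray:
    """Rotate a d-dimensional query/key vector x for position m (d must be even)."""
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)       # one base angle per dimension pair
    angles = m * theta                              # position scales every angle
    cos = np.repeat(np.cos(angles), 2)              # (cos t1, cos t1, cos t2, cos t2, ...)
    sin = np.repeat(np.sin(angles), 2)
    # (-x2, x1, -x4, x3, ...): the "rotated half" companion of x
    x_rot = np.stack([-x[1::2], x[0::2]], axis=-1).reshape(-1)
    return x * cos + x_rot * sin

# Shifting both positions by the same offset leaves the dot product unchanged:
rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)
print(np.isclose(rope(q, 3) @ rope(k, 7), rope(q, 13) @ rope(k, 17)))  # True
```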

Results

RoPE embeddings emphasize nearby tokens: the contribution of a token pair decays as the distance between the tokens grows (the long-term decay property illustrated in the RoFormer paper), which is a favorable trait.

The findings suggest that RoFormer performs marginally better (by 0.2) than the standard transformer in translation tasks. This slight enhancement may be attributable to factors beyond RoPE embeddings.

RoFormer demonstrates superior performance over the BERT variant on certain tasks but not on all, indicating that RoPE embeddings do not universally outperform all other embedding types.

Training curves reported in the RoFormer paper also show that transformer variants using RoPE embeddings reach convergence in fewer training steps than those without them.

Takeaways

  • Transformers require position embeddings as they do not inherently recognize the order of tokens in a sequence.
  • Absolute embeddings assign distinct values to each token using either learned parameters or sinusoidal functions, but they cannot generalize beyond the training sequence length.
  • Relative embeddings create a matrix to compute weights for distances between tokens and are integrated into the attention mechanism. However, they do not allow for caching of keys and values, leading to slower training and inference.
  • Rotary embeddings rotate token embeddings based on their sequence position, offering faster convergence and better generalization for long sequences, although they do not consistently outperform all other embedding types in every task.

Papers And Resources

  • Self-Attention with Relative Position Representations (Shaw et al., 2018): [arxiv.org](https://arxiv.org/abs/1803.02155)
  • Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation (Press et al., 2021): [arxiv.org](https://arxiv.org/abs/2108.12409)
  • RoFormer: Enhanced Transformer with Rotary Position Embedding (Su et al., 2021): [arxiv.org](https://arxiv.org/abs/2104.09864)
