About this Article

  • Authors: Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin
  • Venue: Advances in Neural Information Processing Systems (NIPS 2017)
  • Year: 2017
  • Official Citation: Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., … & Polosukhin, I. (2017). Attention is all you need. In Advances in neural information processing systems (pp. 5998-6008).


Accomplishments

Ended the era of sequential models such as RNNs by demonstrating that an attention-only architecture could achieve superior translation quality with significantly less training cost, setting a new standard for modern AI.


Key Points

1. Building Blocks

  1. Scaled Dot-Product Attention
    • Purpose: To produce, for each word, a new vector that reflects its contextual relationships with the other words in the sentence.
    • Terms:
      • Query (Q): The word currently being processed; the information we want to find.
      • Key (K): An index or keyword representing each of the other words.
      • Value (V): The actual information we ultimately want to retrieve.
    • Formula: $Attention(Q,K,V)=softmax(\frac{QK^{T}}{\sqrt{d_{k}}})V$
    • Steps (a code sketch follows this list): find the matching keys first, then use them to retrieve the corresponding values.
      1. Compute the similarity of the query (Q) and the keys (K) as $QK^{T}$.
      2. Divide by $\sqrt{d_{k}}$ so that large dot products do not push the softmax into regions with vanishingly small gradients.
      3. Apply softmax to obtain the attention weights.
      4. Multiply the weights by the values (V) to produce the final vector.
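
To make these four steps concrete, here is a minimal NumPy sketch of the formula above. It is not the authors' implementation; the shapes in the docstring assume a single, unbatched sentence.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v) -> output of shape (n_q, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                           # steps 1-2: similarity, scaled by sqrt(d_k)
    scores = scores - scores.max(axis=-1, keepdims=True)      # numerical stability for the softmax
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)   # step 3: softmax attention weights
    return weights @ V                                        # step 4: weighted sum of the values
```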

  2. Multi-Head Attention
    • Purpose: To run several scaled dot-product attentions in parallel, allowing the model to jointly attend to information from different representation subspaces at different positions.
    • Formula: $MultiHead(Q,K,V)=Concat(head_{1},…,head_{h})W^{O}$ where $head_{i}=Attention(QW_{i}^{Q},KW_{i}^{K},VW_{i}^{V})$
    • Steps (a code sketch follows this list):
      1. Project Q, K, and V h times with different, learned linear projections, giving each head its own lower-dimensional version of the inputs.
      2. Because each head uses different weight matrices, each head views the sequence from a different perspective.
      3. Perform scaled dot-product attention for each head independently and in parallel.
      4. Concatenate the outputs of all heads.
      5. Multiply the concatenated result by another weight matrix $W^{O}$ to produce the final vector.
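
A minimal NumPy sketch of the five steps above (not the authors' code; passing the per-head weight matrices as plain lists is an illustrative simplification):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(Q, K, V, Wq, Wk, Wv, Wo):
    """Wq, Wk, Wv: lists of h per-head projection matrices; Wo: (h*d_v, d_model)."""
    heads = []
    for Wq_i, Wk_i, Wv_i in zip(Wq, Wk, Wv):
        q, k, v = Q @ Wq_i, K @ Wk_i, V @ Wv_i             # steps 1-2: per-head projections
        weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # step 3: scaled dot-product attention
        heads.append(weights @ v)
    return np.concatenate(heads, axis=-1) @ Wo             # steps 4-5: concatenate, project with W^O
```

In the paper, $d_{model}=512$ and $h=8$, so each head works in $d_{k}=d_{v}=64$ dimensions and the total cost stays close to that of single-head attention with full dimensionality.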

2. Full Architecture

  1. Encoder (left)
    • Transforms the input sentence into a sequence of representations (vectors) that reflect context and meaning.
    • Composed of a stack of 6 identical layers.
    • Each layer has two sub-layers: Multi-Head Attention and a simple position-wise Feed Forward Network.
    • Employs a residual connection around each of the two sub-layers, followed by layer normalization (see the sketch after this list).
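
A minimal sketch of how one encoder layer composes its sub-layers in the $LayerNorm(x + Sublayer(x))$ pattern. The position-wise FFN formula $FFN(x)=\max(0, xW_{1}+b_{1})W_{2}+b_{2}$ is from the paper; self_attention is a placeholder callable, and the learned layer-norm gain/bias parameters are omitted for brevity:

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Layer normalization, omitting the learned gain/bias parameters."""
    return (x - x.mean(axis=-1, keepdims=True)) / (x.std(axis=-1, keepdims=True) + eps)

def feed_forward(x, W1, b1, W2, b2):
    """Position-wise feed-forward network: FFN(x) = max(0, xW1 + b1)W2 + b2."""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

def encoder_layer(x, self_attention, ffn_params):
    """x: (n, d_model); self_attention is a placeholder mapping (n, d_model) -> (n, d_model)."""
    x = layer_norm(x + self_attention(x))             # sub-layer 1: self-attention + residual + norm
    x = layer_norm(x + feed_forward(x, *ffn_params))  # sub-layer 2: position-wise FFN + residual + norm
    return x
```

The full encoder simply applies six such layers in sequence.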

  2. Decoder (right)
    • Generates the output sentence based on the representation produced by the encoder.
    • Composed of a stack of 6 identical layers.
    • Each layer has three sub-layers: Masked Multi-Head Attention, Encoder-Decoder Attention, and a Feed Forward Network.
    • The Masked Multi-Head Attention ensures that the prediction for a position can depend only on the known outputs at earlier positions (a masking sketch follows this list).
    • The Encoder-Decoder Attention takes its keys and values (K, V) from the encoder output and its queries (Q) from the previous decoder layer.
    • Employs residual connections and layer normalization, as in the encoder.
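
The masking can be sketched with an illustrative helper (not from the paper): attention logits for future positions are set to a large negative value, so they receive (almost) zero weight after the softmax.

```python
import numpy as np

def apply_causal_mask(scores):
    """scores: (n, n) attention logits for the decoder's masked self-attention."""
    n = scores.shape[-1]
    future = np.triu(np.ones((n, n), dtype=bool), k=1)  # True strictly above the diagonal
    return np.where(future, -1e9, scores)               # position i can only attend to positions <= i
```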

  3. Positional Encoding
    • As the model contains no recurrence or convolution, positional encodings are added to give the model information about the relative or absolute position of the tokens in the sequence.
    • The authors used sine and cosine functions of different frequencies (see the sketch after this list).
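
Specifically, the paper defines $PE_{(pos,2i)}=\sin(pos/10000^{2i/d_{model}})$ and $PE_{(pos,2i+1)}=\cos(pos/10000^{2i/d_{model}})$, and the resulting vectors are added to the input embeddings. A small NumPy sketch (assuming an even $d_{model}$):

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encodings of shape (seq_len, d_model); assumes d_model is even."""
    pos = np.arange(seq_len)[:, None]              # (seq_len, 1) positions
    i = np.arange(0, d_model, 2)[None, :]          # even dimension indices 0, 2, 4, ...
    angles = pos / np.power(10000.0, i / d_model)  # one frequency per pair of dimensions
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                   # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)                   # odd dimensions use cosine
    return pe                                      # added element-wise to the token embeddings
```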

3. Advantages of Self-Attention

Self-Attention was evaluated against Recurrent and Convolutional layers on three criteria (a summary of the paper's Table 1 follows this list):

  • Computational complexity per layer: a self-attention layer costs $O(n^{2}\cdot d)$ versus $O(n\cdot d^{2})$ for a recurrent layer, so self-attention is faster whenever the sequence length n is smaller than the representation dimension d, which is typically the case for sentence representations.
  • Sequential operations: Self-Attention has an O(1) number of sequential operations, allowing for much more parallelization than the O(n) of Recurrent layers.
  • Maximum path length: Self-Attention offers a constant O(1) path length between any two tokens, making it easier to learn long-range dependencies compared to the O(n) path length in Recurrent layers.
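
For reference, the asymptotic comparison from Table 1 of the paper, with n the sequence length, d the representation dimension, and k the convolution kernel width:

```latex
% Table 1 of the paper (restated): per-layer complexity, sequential operations, maximum path length
\begin{array}{lccc}
\text{Layer type}     & \text{Complexity per layer} & \text{Sequential ops} & \text{Max path length} \\
\text{Self-Attention} & O(n^{2}\cdot d)             & O(1)                  & O(1) \\
\text{Recurrent}      & O(n\cdot d^{2})             & O(n)                  & O(n) \\
\text{Convolutional}  & O(k\cdot n\cdot d^{2})      & O(1)                  & O(\log_{k}(n)) \\
\end{array}
```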

These advantages are why the Transformer has become the predominant architecture: it handles long sentences well, exploits massive GPU parallelism, and trains quickly. Additionally, self-attention can yield more interpretable models, since the attention distributions can be inspected directly.


Possibility for Further Research

  • Apply to other types of data: The authors anticipated applying the model to other types of data, such as images, audio, and video.
  • Making generation less sequential: Investigating ways to make the auto-regressive generation process more parallel.
