About this Article

  • Authors: Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever
  • Journal: OpenAI
  • Year: 2018
  • Official Citation: Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. (2018). Improving language understanding by generative pre-training. OpenAI.


Accomplishments

  • Introduces the Generative Pre-Training (GPT) model.


Key Points

1. Architecture

The architecture is based on the Transformer. Unlike the original encoder-decoder Transformer, the Generative Pre-Training (GPT) model uses only the decoder stack (a decoder-only model).
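
What "decoder-only" implies in practice is that each position may attend only to itself and to earlier positions (masked self-attention). The following is a minimal NumPy sketch of that causal mask, not the paper's implementation; the function names `causal_mask` and `masked_attention_weights` are hypothetical and the attention scores are dummy values.

```python
import numpy as np

def causal_mask(seq_len: int) -> np.ndarray:
    """Boolean mask: entry [i, j] is True when position i may attend to position j."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def masked_attention_weights(scores: np.ndarray) -> np.ndarray:
    """Hide future positions in raw attention scores, then softmax each row."""
    seq_len = scores.shape[-1]
    masked = np.where(causal_mask(seq_len), scores, -np.inf)
    masked = masked - masked.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(masked)                              # exp(-inf) becomes 0
    return weights / weights.sum(axis=-1, keepdims=True)

# Dummy scores (all zeros) for a 4-token sequence: each row spreads its
# attention uniformly over the positions it is allowed to see.
weights = masked_attention_weights(np.zeros((4, 4)))
```

With uniform scores, the first row attends only to position 0 and the second row splits its weight evenly between positions 0 and 1, so no information flows from later tokens to earlier ones.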

2. Training Steps

Training consists of two parts: unsupervised pre-training and supervised fine-tuning.

1. Unsupervised Pre-Training

  • Purpose: Learning a universal representation that transfers to a wide range of tasks with little adaptation.
  • Objective function: Maximize the following log-likelihood, which sums over positions i the log conditional probability of the i-th token u_i given only the k previous tokens u_{i-k},…,u_{i-1} (k is the size of the context window and Θ denotes the model parameters).

    $L_{1}(\mathbb{U})=\sum_{i}\log P(u_{i}\mid u_{i-k},\ldots,u_{i-1};\Theta).$
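  • Specific Steps: The context tokens U are embedded with the token embedding matrix $W_{e}$ and the position embedding matrix $W_{P}$, passed through n transformer blocks, and projected back onto the vocabulary with $W_{e}^{T}$: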

    $h_{0}=U~W_{e}+W_{P}$

    $h_{l}=\text{transformer block}(h_{l-1})~\forall l \in [1,n]$

    $P(u)=\text{softmax}(h_{n}W_{e}^{T})$
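
The pre-training steps above can be sketched in NumPy as follows. This is a toy illustration under stated assumptions, not the paper's model: `toy_block` is a hypothetical stand-in for a transformer block (single-head causal attention with a residual connection, without the feed-forward layer or layer normalization), all sizes and weights are arbitrary, and the context window is simply the whole prefix.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def toy_block(h, W_q, W_k, W_v):
    """Stand-in for one transformer block: causal self-attention + residual."""
    T, d = h.shape
    scores = (h @ W_q) @ (h @ W_k).T / np.sqrt(d)
    mask = np.tril(np.ones((T, T), dtype=bool))      # no attention to the future
    attn = softmax(np.where(mask, scores, -np.inf))
    return h + attn @ (h @ W_v)

def lm_forward(tokens, W_e, W_p, blocks):
    """h_0 = U W_e + W_p;  h_l = block(h_{l-1});  P(u) = softmax(h_n W_e^T)."""
    h = W_e[tokens] + W_p[: len(tokens)]   # row lookup equals one-hot U times W_e
    for params in blocks:
        h = toy_block(h, *params)
    return softmax(h @ W_e.T)              # output projection reuses W_e

def lm_log_likelihood(tokens, W_e, W_p, blocks):
    """L_1: sum of log P(u_i | previous tokens); the first token is not predicted."""
    probs = lm_forward(tokens[:-1], W_e, W_p, blocks)  # position t predicts token t+1
    targets = tokens[1:]
    return np.log(probs[np.arange(len(targets)), targets]).sum()

# Arbitrary toy sizes and random weights (assumptions for illustration only).
rng = np.random.default_rng(0)
vocab, d_model, context, n_layers = 10, 8, 6, 2
W_e = rng.normal(scale=0.1, size=(vocab, d_model))
W_p = rng.normal(scale=0.1, size=(context, d_model))
blocks = [
    tuple(rng.normal(scale=0.1, size=(d_model, d_model)) for _ in range(3))
    for _ in range(n_layers)
]
tokens = np.array([1, 4, 2, 7, 3])
probs = lm_forward(tokens, W_e, W_p, blocks)
log_lik = lm_log_likelihood(tokens, W_e, W_p, blocks)
```

Training would adjust `W_e`, `W_p`, and the block parameters to increase `log_lik`; the sketch only evaluates the objective once.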

2. Supervised Fine-Tuning

  • Purpose: Adapting the pre-trained model to a specific task.
  • Specific Steps: Fine-tuning adds one extra layer. (1) A labeled input sequence $x^{1},…,x^{m}$ is passed through the pre-trained model. (2) The final transformer block's activation $h_{l}^{m}$ is taken as the output vector; it carries the contextual meaning of the input (sentence). (3) This vector is fed into an added output layer with parameters $W_{y}$, which is specific to the task. (4) A softmax over that layer's output gives the predicted label y.
  • Computing probability: In fine-tuning, the goal is to predict the label y of a task correctly from the input tokens. The probability is computed as:

    $P(y\mid x^{1},\ldots,x^{m})=\text{softmax}(h_{l}^{m}W_{y}).$
  • Objective function: Maximize the following sum over labeled examples (x, y), which is the log-probability of the label y given the whole input sequence.

    $L_{2}(\mathbb{G})=\sum_{(x,y)}\log P(y\mid x^{1},\ldots,x^{m}).$
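  • Final objective function with auxiliary objective: To help the model retain the general knowledge gained from pre-training, the language modeling objective is kept as an auxiliary objective during fine-tuning (the paper reports that this improves generalization and speeds up convergence). The final objective is therefore the fine-tuning objective plus the pre-training objective weighted by λ.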

    $L_{3}(\mathbb{G})=L_{2}(\mathbb{G})+\lambda*L_{1}(\mathbb{G})$

  • Task-specific input transformations: Structured inputs (for example sentence pairs, or a document with a question and candidate answers) are converted into a single token sequence, so the model architecture does not need to change across task types.
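
The fine-tuning probability and the combined objective described above can be sketched as follows. This is a minimal illustration, assuming the final activations $h_{l}^{m}$ are already available; the random activations, the labels, and the value used for $L_{1}$ are placeholders, and the function names are hypothetical. The default λ = 0.5 is the value reported in the paper.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def classification_log_prob(h_last, W_y, label):
    """log P(y | x^1..x^m) = log softmax(h_l^m W_y)[y] for one example."""
    return np.log(softmax(h_last @ W_y)[label])

def combined_objective(l2, l1, lam=0.5):
    """L_3 = L_2 + lambda * L_1 (both terms are log-likelihoods to maximize)."""
    return l2 + lam * l1

rng = np.random.default_rng(0)
d_model, n_classes = 8, 3
W_y = rng.normal(scale=0.1, size=(d_model, n_classes))  # the only new parameters

# Placeholder batch: final-token activations h_l^m and gold labels.
H_last = rng.normal(size=(4, d_model))
labels = np.array([0, 2, 1, 2])

l2 = sum(classification_log_prob(h, W_y, y) for h, y in zip(H_last, labels))
l1 = -35.0  # placeholder for the language modeling log-likelihood on the same inputs
l3 = combined_objective(l2, l1, lam=0.5)
```

In an actual fine-tuning run both terms would be computed from the same forward pass, and gradients of `l3` would update the pre-trained parameters together with `W_y`.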

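The task-specific input transformations mentioned above can be illustrated with a short sketch. The layouts follow the paper's description (start token, delimiter, extract token; both sentence orders for similarity; one sequence per candidate answer), but the token strings and function names here are placeholders chosen for illustration.

```python
# Placeholder special tokens; the actual model uses learned embeddings for them.
START, DELIM, EXTRACT = "<s>", "$", "<e>"

def format_classification(text):
    """Single text: [start; text; extract]."""
    return [[START, *text, EXTRACT]]

def format_entailment(premise, hypothesis):
    """Sentence pair with a fixed order: [start; premise; delim; hypothesis; extract]."""
    return [[START, *premise, DELIM, *hypothesis, EXTRACT]]

def format_similarity(a, b):
    """No inherent order, so both orderings are produced and processed separately."""
    return [
        [START, *a, DELIM, *b, EXTRACT],
        [START, *b, DELIM, *a, EXTRACT],
    ]

def format_multiple_choice(context, answers):
    """One sequence per candidate answer: [start; context; delim; answer; extract]."""
    return [[START, *context, DELIM, *ans, EXTRACT] for ans in answers]

pair = format_entailment(["a", "dog", "runs"], ["an", "animal", "moves"])
both = format_similarity(["hello"], ["hi", "there"])
choices = format_multiple_choice(["who", "ran", "?"], [["the", "dog"], ["the", "cat"]])
```

Each function returns a list of token sequences, so every task reduces to running the same model over one or more flat sequences.
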

Figures & Table Explanation

1. Table 2: Results of ‘Natural Language Inference (NLI)’ experiments

  • The goal of the Natural Language Inference (NLI) experiments is to infer the logical relationship between two sentences.
  • "Finetuned Transformer LM" in the table refers to the GPT model.

2. Table 3: Results of ‘Question answering and commonsense reasoning’ experiments

  • In question answering and commonsense reasoning, the model reads an English passage and answers questions about it, similar to middle or high school exams.
  • "Finetuned Transformer LM" in the table refers to the GPT model.

3. Table 4: Results of ‘Semantic Similarity’ and ‘Classification’ experiments

  • Semantic Similarity: Determine whether two structurally different sentences have the same meaning.
  • Classification: Judging whether a sentence is grammatical, and sentiment (positive or negative) analysis.
  • "Finetuned Transformer LM" in the table refers to the GPT model.

4. Figure 2: Analysis of GPT model

  • Left: Effect of the number of layers transferred from the pre-trained model. The more layers were transferred, the better the performance, which suggests that each layer of the pre-trained model holds functionality useful for the target tasks.
  • Right: Zero-shot performance of the GPT model. As pre-training progressed, performance on specific tasks also increased.

5. Table 5: Ablation study results.

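  • Three components were ablated: the auxiliary objective, the Transformer architecture (replaced by an LSTM), and pre-training. Removing pre-training hurt every task, the LSTM scored lower on average than the Transformer, and the auxiliary objective helped mainly on the larger datasets.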
