[Articles] Improving Language Understanding by Generative Pre-Training

About this Article

Authors: Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever
Journal: OpenAI
Year: 2018
Official Citation: Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. (2018). Improving language understanding by generative pre-training. OpenAI.

The architecture is based on the Transformer. However, the Generative Pre-Training (GPT) model uses only the decoder stack, which makes it a decoder-only model.
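
To make "decoder-only" concrete, the following is a minimal sketch of the causal (left-to-right) attention mask that such a model relies on: each position may attend only to itself and to earlier positions. The function name `causal_mask` is hypothetical, and plain Python lists are used purely for illustration.

```python
def causal_mask(seq_len):
    """Return a seq_len x seq_len matrix where entry [i][j] is True
    if position i is allowed to attend to position j (that is, j <= i)."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


mask = causal_mask(4)
for row in mask:
    # Print 1 for "may attend" and 0 for "masked out".
    print(" ".join("1" if allowed else "0" for allowed in row))
```

The lower-triangular pattern is what allows the model to be trained to predict each token from the preceding tokens only.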

Training consists of two parts: unsupervised pre-training and supervised fine-tuning.

Purpose of pre-training: Learn a universal representation that transfers to a wide range of tasks with little adaptation.
Objective function: Maximize the following log-likelihood, which scores each token u_i conditioned only on the k previous tokens u_{i-k},…,u_{i-1}. Here k is the size of the context window and Θ denotes the model parameters.

$L_{1}(\mathbb{U})=\sum_{i}\log P(u_{i}\mid u_{i-k},\ldots,u_{i-1};\Theta).$
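
As a rough illustration of this objective, the sketch below sums the log-probability of each token given at most k preceding tokens. The names `lm_log_likelihood`, `prob_fn`, and `uniform_model` are hypothetical; `prob_fn` stands in for the neural network with parameters Θ and is replaced here by a toy uniform distribution.

```python
import math


def lm_log_likelihood(tokens, k, prob_fn):
    """Sum of log P(u_i | u_{i-k}, ..., u_{i-1}) over the sequence.

    tokens  : list of token ids (the corpus U)
    k       : context window size
    prob_fn : hypothetical callable (context, token) -> probability,
              standing in for the network with parameters Theta
    """
    total = 0.0
    for i, token in enumerate(tokens):
        context = tokens[max(0, i - k):i]  # at most k previous tokens
        total += math.log(prob_fn(context, token))
    return total


def uniform_model(context, token, vocab_size=4):
    """Toy stand-in model: every token is equally likely."""
    return 1.0 / vocab_size


corpus = [0, 2, 1, 3, 2]
print(lm_log_likelihood(corpus, k=3, prob_fn=uniform_model))
```

Pre-training would adjust the model parameters to make this sum as large as possible; the uniform model is only a placeholder that makes the arithmetic easy to check.
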
Specific Steps:

$h_{0}=U~W_{e}+W_{P}$

$h_{l}=\text{transformer block}(h_{l-1}) \quad \forall l \in [1,n]$

$P(u)=\text{softmax}(h_{n}W_{e}^{T})$
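
The three steps above can be sketched in a few lines of dependency-free Python. This is only an illustration under stated assumptions: `placeholder_block` is a hypothetical stand-in that returns its input unchanged (a real transformer block applies masked self-attention and a feed-forward layer), the weights are made-up toy values, and the position matrix is written `W_p`.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def softmax(row):
    """Numerically stable softmax over one row of logits."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def placeholder_block(h):
    """Stand-in for a real transformer block; returns h unchanged."""
    return h


def forward(token_ids, W_e, W_p, n_layers, block=placeholder_block):
    # h_0 = U W_e + W_p: taking row t of W_e equals multiplying a
    # one-hot row of U by W_e; W_p[pos] is the position embedding.
    h = [[e + p for e, p in zip(W_e[t], W_p[pos])]
         for pos, t in enumerate(token_ids)]
    # h_l = transformer_block(h_{l-1}) for l = 1..n
    for _ in range(n_layers):
        h = block(h)
    # P(u) = softmax(h_n W_e^T): the output weights are tied to W_e.
    logits = matmul(h, transpose(W_e))
    return [softmax(row) for row in logits]


# Toy sizes: vocabulary of 3 tokens, embedding width 2, 4 positions.
W_e = [[0.1, 0.3], [0.2, -0.1], [-0.4, 0.5]]
W_p = [[0.0, 0.1], [0.1, 0.0], [0.05, 0.05], [0.0, 0.0]]
probs = forward([0, 2, 1], W_e, W_p, n_layers=2)
print(len(probs), len(probs[0]))
```

Each row of `probs` is a distribution over the vocabulary for one input position. Reusing `W_e` for the output projection mirrors the $W_{e}^{T}$ term in the last equation.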

Purpose of fine-tuning: Adapt the pre-trained model to a specific task.
Specific Steps: Fine-tuning adds one extra layer. (1) A labeled input sequence from the dataset is fed into the pre-trained model. (2) The model produces the output vector $h_{l}^{m}$, the final transformer block's activation at the last token, which encodes the contextual meaning of the input. (3) This vector is passed to a task-specific linear layer with parameters $W_{y}$. (4) A softmax over the result gives the predicted label y.
Computing probability: Fine-tuning trains the model to predict the task label y. The probability of a label is computed as:

$P(y\mid x^{1},\ldots,x^{m})=\text{softmax}(h_{l}^{m}W_{y}).$
Objective function: Maximize the following log-likelihood of the label y given the whole input sequence, summed over all labeled examples.

$L_{2}(\mathbb{G})=\sum_{(x,y)}\log P(y\mid x^{1},\ldots,x^{m}).$
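
The label probability and the supervised objective can be sketched as follows. This is an illustration rather than the paper's implementation: the function names are hypothetical, the weights are toy values, and each example is assumed to already carry the vector $h_{l}^{m}$ that the pre-trained model would produce.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def label_probs(h_last, W_y):
    """P(y | x^1..x^m) = softmax(h_l^m W_y).

    h_last : final-block activation at the last token (length d)
    W_y    : task head weights, d rows by num_labels columns
    """
    logits = [sum(h * w for h, w in zip(h_last, col)) for col in zip(*W_y)]
    return softmax(logits)


def supervised_objective(examples, W_y):
    """L_2: sum of log P(y | x) over (h_last, y) pairs."""
    return sum(math.log(label_probs(h, W_y)[y]) for h, y in examples)


W_y = [[0.5, -0.5], [-0.2, 0.3]]                # d = 2, two labels
examples = [([1.0, 0.0], 0), ([0.0, 1.0], 1)]   # toy (h_last, label) pairs
print(supervised_objective(examples, W_y))
```

Fine-tuning would update both `W_y` and the pre-trained parameters so that this sum increases.
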
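Final objective function with auxiliary objective: Keeping the language-modeling objective as an auxiliary term during fine-tuning helps the model retain the knowledge gained in pre-training; the paper reports that it improves generalization of the supervised model and speeds up convergence. The final objective is therefore a weighted sum (weight λ) of the two objectives above, both evaluated on the labeled dataset.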

$L_{3}(\mathbb{G})=L_{2}(\mathbb{G})+\lambda*L_{1}(\mathbb{G})$
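
A small worked example of this weighted sum, using made-up values for the two terms and an illustrative weight (the function name `combined_objective` is hypothetical):

```python
def combined_objective(l2_value, l1_value, lam):
    """L_3 = L_2 + lambda * L_1, both evaluated on the labeled dataset."""
    return l2_value + lam * l1_value


# Illustrative numbers only: both terms are log-likelihoods (<= 0).
l2 = -1.2    # supervised log-likelihood of the labels
l1 = -35.0   # language-modeling log-likelihood of the same inputs
print(combined_objective(l2, l1, lam=0.5))
```

Setting `lam=0` recovers the purely supervised objective. Because both terms are maximized, a gradient-descent implementation would typically minimize the negative of this value.
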
Task-specific input transformations: Structured inputs are converted into a single token sequence, so the model architecture does not need to change when it is applied to different types of tasks.
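
A minimal sketch of such transformations, following the formats described in the original paper (start token, delimiter, and extract token around the text parts). The token strings and function names below are hypothetical placeholders, and inputs are assumed to be pre-tokenized lists of strings.

```python
START, DELIM, EXTRACT = "<s>", "$", "<e>"  # hypothetical token names


def classification_input(text):
    """Single text: wrap it in start and extract tokens."""
    return [START] + text + [EXTRACT]


def entailment_input(premise, hypothesis):
    """Premise and hypothesis joined by a delimiter."""
    return [START] + premise + [DELIM] + hypothesis + [EXTRACT]


def similarity_inputs(a, b):
    """The two sentences have no inherent order, so both orderings
    are produced and processed independently."""
    return [[START] + a + [DELIM] + b + [EXTRACT],
            [START] + b + [DELIM] + a + [EXTRACT]]


def multiple_choice_inputs(context, question, answers):
    """One sequence per candidate answer."""
    return [[START] + context + question + [DELIM] + ans + [EXTRACT]
            for ans in answers]


print(entailment_input(["it", "rains"], ["the", "ground", "is", "wet"]))
```

In every case the model itself is unchanged; only the way the input is laid out as one sequence differs between tasks.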

Natural Language Inference (NLI): The experiments test whether the model can infer the logical relationship between two sentences.
In the results, "Finetuned Transformer LM" refers to the GPT model.

Question answering and commonsense reasoning: The model reads English passages and answers questions about them, much like middle or high school exams.

Semantic Similarity: Determine whether two sentences with different structure have the same meaning.
Classification: Judge whether a sentence is grammatical, and analyze sentiment (positive or negative).

Left: Effect of the number of layers transferred from pre-training. The more layers are transferred, the better the performance, which shows that the layers of the pre-trained model contain functionality useful for the target tasks.
Right: Zero-shot performance of the GPT model. As pre-training progressed, performance on specific tasks also increased.

The ablation study targeted three components: the auxiliary objective, the Transformer architecture, and pre-training. Each of them contributed to the final performance.