Vision Transformer

The first application of the attention mechanism beyond NMT was image captioning with visual attention: at each time step (i.e., for each generated word), the decoder uses an attention model to focus on the relevant part of the image. This also lets us check whether the basis for a decision was reasonable (a property called explainability).

Since then, there has been a lot of progress in applying attention mechanisms to vision:

  1. Vision Transformer (ViT)
    A vision model based entirely on the transformer. The image is split into 16 x 16 patches, and the sequence of patches is treated like a sequence of words. The resulting vector sequence can then be processed like a word embedding sequence (add positional embeddings and feed it to the transformer); see the patch-embedding sketch after this list.

  2. Data-efficient Image Transformers (DeiT)
    Does not require additional training data. Similar to ViT, but uses distillation to transfer knowledge from a state-of-the-art CNN model to DeiT.

  3. Perceiver
    A multimodal transformer that can take almost any type of data as input (text, image, audio, etc.). It addresses the problem that an attention layer over M tokens must compute a huge M x M matrix: the long input instead cross-attends to a relatively short latent representation of N tokens, so the matrix is only M x N. In addition, the weights of consecutive cross-attention layers are shared; see the cross-attention sketch after this list.

  4. DINO
    Self-supervised learning without labels, able to perform semantic segmentation with high accuracy. The model is duplicated during training: one copy is the teacher and the other is the student. Gradients only update the student, and the teacher's weights are an exponential moving average of the student's weights. The student is trained to produce the same predictions as the teacher; this is called self-distillation. See the EMA update sketch after this list.

  5. Flamingo
    One model that can be used for many different tasks.

  6. GATO
    A multimodal model in which the same transformer performs a variety of tasks.
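
To make the ViT patching described in item 1 concrete, here is a minimal sketch (not the official ViT implementation) of turning a batch of images into a sequence of patch embeddings with TensorFlow; the patch size, embedding dimension, and helper name are illustrative assumptions.

import tensorflow as tf

PATCH_SIZE = 16      # assumed ViT-style patch size
EMBED_DIM = 768      # assumed embedding dimension

def patch_embeddings(images):
    """images: float tensor of shape (batch, height, width, channels)."""
    # Cut each image into non-overlapping 16 x 16 patches.
    patches = tf.image.extract_patches(
        images=images,
        sizes=[1, PATCH_SIZE, PATCH_SIZE, 1],
        strides=[1, PATCH_SIZE, PATCH_SIZE, 1],
        rates=[1, 1, 1, 1],
        padding='VALID')
    num_patches = patches.shape[1] * patches.shape[2]
    patch_dim = patches.shape[-1]
    # Flatten the grid of patches into a sequence, like a sequence of words.
    patches = tf.reshape(patches, [-1, num_patches, patch_dim])
    # Linearly project each flattened patch to the embedding dimension
    # (in a real model these layers would be created once and reused).
    embeddings = tf.keras.layers.Dense(EMBED_DIM)(patches)
    # Add learnable positional embeddings before feeding the transformer.
    positions = tf.range(num_patches)
    pos_embeddings = tf.keras.layers.Embedding(num_patches, EMBED_DIM)(positions)
    return embeddings + pos_embeddings

# A batch of two 224 x 224 RGB images -> sequence of shape (2, 196, 768)
print(patch_embeddings(tf.random.uniform([2, 224, 224, 3])).shape)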

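A rough sketch of the Perceiver's cross-attention trick from item 3, using Keras MultiHeadAttention and illustrative sizes (not the actual Perceiver code): the queries come from a short learnable latent array of N tokens, while the keys and values come from the long input of M tokens, so the attention matrix is N x M (equivalently M x N) instead of M x M.

import tensorflow as tf

M, N, D = 10_000, 256, 512   # assumed sizes: M input tokens, N latent tokens, width D

inputs = tf.random.uniform([1, M, D])               # long (multimodal) input sequence
latents = tf.Variable(tf.random.normal([1, N, D]))  # short learnable latent array

cross_attention = tf.keras.layers.MultiHeadAttention(num_heads=8, key_dim=64)

# Queries = latents, keys/values = inputs, so the attention weights are N x M
# rather than M x M; the Perceiver also shares the weights of consecutive
# cross-attention layers like this one.
latents_out = cross_attention(query=latents, value=inputs, key=inputs)
print(latents_out.shape)   # (1, 256, 512)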

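And a tiny sketch of the DINO-style teacher update from item 4, with a hypothetical toy backbone and an assumed momentum value: gradients only update the student, while the teacher's weights track an exponential moving average of the student's weights.

import tensorflow as tf

def make_backbone():
    # Hypothetical toy backbone (the real DINO uses a ViT for both networks).
    return tf.keras.Sequential([tf.keras.Input(shape=(4,)), tf.keras.layers.Dense(8)])

student = make_backbone()
teacher = make_backbone()
teacher.set_weights(student.get_weights())   # start from the same weights

momentum = 0.996   # assumed EMA momentum

def update_teacher():
    # No gradients flow to the teacher; its weights are an exponential
    # moving average of the student's weights (the self-distillation target).
    for w_t, w_s in zip(teacher.weights, student.weights):
        w_t.assign(momentum * w_t + (1.0 - momentum) * w_s)

update_teacher()
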
Hugging Face

The core of the Hugging Face ecosystem is the ability to easily download a pre-trained model together with a compatible tokenizer. If needed, the model can then be fine-tuned on your own dataset using the Transformers library.

The easiest way to use the Transformers library is to call transformers.pipeline() and specify the desired task.

>>> from transformers import pipeline
>>> classifier = pipeline('sentiment-analysis')     # desired task
>>> result = classifier('The actors were very convinving.')
>>> result
[{'label': 'NEGATIVE', 'score': 0.9610099792480469}]

Passing a batch of sentences is also possible, as in the sketch below.
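
For example, a minimal sketch with two hypothetical sentences (the labels and scores returned will depend on the model):

>>> results = classifier(['The actors were very convincing.',
...                       'The plot was full of holes.'])
>>> len(results)    # one {'label': ..., 'score': ...} dict per input sentence
2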

For text classification, the default model is 'distilbert-base-uncased-finetuned-sst-2-english'. Manually specifying another model is also possible.

model_name = 'huggingface/distilbert-base-uncased-finetuned-mnli'
classifier_mnli = pipeline('text-classification', model=model_name)
classifier_mnli('She loves me. [SEP] She loves me not.')

The Transformers library provides many kinds of tokenizers, models, configurations, and callbacks.

The following is an example of using AutoTokenizer (tokenizer) and TFAutoModelForSequenceClassification (model).

from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = TFAutoModelForSequenceClassification.from_pretrained(model_name)

Using the tokenizer, we can tokenize sentences; padding the sequences and choosing the returned tensor type are also possible.

>>> tokens_ids = tokenizer(["I like soccer. [SEP] We all love soccer!",
                       "Joe lived for a very long time. [SEP] Joe is old."], padding=True, return_tensors='tf')
>>> tokens_ids
{'input_ids': <tf.Tensor: shape=(2, 15), dtype=int32, numpy=
array([[ 101, 1045, 2066, 4715, 1012,  102, 2057, 2035, 2293, 4715,  999,
         102,    0,    0,    0],
       [ 101, 3533, 2973, 2005, 1037, 2200, 2146, 2051, 1012,  102, 3533,
        2003, 2214, 1012,  102]], dtype=int32)>, 'attention_mask': <tf.Tensor: shape=(2, 15), dtype=int32, numpy=
array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0],
       [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], dtype=int32)>}

Then we can pass the tokens to the model. The returned object is a TFSequenceClassifierOutput containing the logits for each class. We can convert the logits to probabilities with softmax and pick the class with the highest probability using argmax(), as in the sketch below.
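
A sketch of this step, assuming tokens_ids and model are the objects created above:

>>> import tensorflow as tf
>>> outputs = model(tokens_ids)    # TFSequenceClassifierOutput with a .logits attribute
>>> y_probas = tf.keras.activations.softmax(outputs.logits)   # logits -> class probabilities
>>> y_pred = tf.argmax(y_probas, axis=1)    # index of the most likely class per sentence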


Go to the Code



All images, except those with separate source indications, are excerpted from lecture materials.
