Multi-modal representative works

Figure 2. Four categories of vision-and-language models. The height of each rectangle denotes its relative computational size. VE, TE, and MI are short for visual embedder, textual embedder, and modality interaction, respectively.
Pasted image 20240529213153.png

  1. Category A: Models like VSE++ and SCAN, whose visual embedders are much heavier than their textual embedders.
  2. Category B: CLIP, which employs equally heavy transformer embedders for both visual and textual modalities.
  3. Category C: Recent VLP models that utilize deep transformers for modality interaction but still rely on convolutional networks for visual embedding.
  4. Category D: The proposed ViLT model, which simplifies the embedding of raw pixels and focuses computation on modality interactions.

ViLT

ViLT belongs to Category D; it removes the region-feature embedder and adopts a simple linear patch projection instead.
ViLT runs fast, but its performance is actually not as good as Category C models.
Pasted image 20240529211055.png
Pasted image 20240529212309.png
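
As a rough illustration of what "patch projection" means here, below is a minimal PyTorch sketch (module names, dimensions, and sequence lengths are hypothetical, not ViLT's exact configuration): image patches pass through a single linear projection, text tokens through an embedding table, and the two sequences are concatenated with modality-type embeddings so that one shared transformer handles all modality interaction.

```python
import torch
import torch.nn as nn

class ViLTStyleEmbedder(nn.Module):
    """Sketch of ViLT-style input: no CNN/region features, just a linear
    patch projection for images (all sizes here are illustrative)."""
    def __init__(self, vocab_size=30522, dim=768, patch=32, img_size=224, max_len=40):
        super().__init__()
        self.text_emb = nn.Embedding(vocab_size, dim)
        # A strided conv is equivalent to a linear projection of each flattened patch.
        self.patch_proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.type_emb = nn.Embedding(2, dim)  # 0 = text, 1 = image
        n_patches = (img_size // patch) ** 2
        self.img_pos = nn.Parameter(torch.zeros(1, n_patches, dim))
        self.txt_pos = nn.Parameter(torch.zeros(1, max_len, dim))
        layer = nn.TransformerEncoderLayer(dim, nhead=12, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=12)

    def forward(self, token_ids, images):
        t = self.text_emb(token_ids) + self.txt_pos[:, : token_ids.size(1)]
        t = t + self.type_emb(torch.zeros_like(token_ids))
        v = self.patch_proj(images).flatten(2).transpose(1, 2) + self.img_pos
        v = v + self.type_emb(torch.ones(v.shape[:2], dtype=torch.long, device=v.device))
        x = torch.cat([t, v], dim=1)   # one joint sequence
        return self.encoder(x)         # all modality interaction happens here

tokens = torch.randint(0, 30522, (2, 40))
images = torch.randn(2, 3, 224, 224)
out = ViLTStyleEmbedder()(tokens, images)  # shape (2, 40 + 49, 768)
```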

ALBEF

Pasted image 20240529220042.png

Unlike traditional methods that rely on pre-trained object detectors and high-resolution images, ALBEF uses a detector-free image encoder and a text encoder to independently process images and texts before fusing them with a multimodal encoder.
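
A minimal sketch of the image-text contrastive (ITC) part of "align before fuse" (shapes are toy values; ALBEF additionally uses a momentum encoder and soft pseudo-targets, which are omitted here):

```python
import torch
import torch.nn.functional as F

def itc_loss(img_cls, txt_cls, temperature=0.07):
    """Image-text contrastive loss on unimodal [CLS] features (a sketch;
    ALBEF also uses a momentum encoder and soft pseudo-targets)."""
    img = F.normalize(img_cls, dim=-1)
    txt = F.normalize(txt_cls, dim=-1)
    logits = img @ txt.t() / temperature             # (B, B) similarity matrix
    targets = torch.arange(img.size(0), device=img.device)
    loss_i2t = F.cross_entropy(logits, targets)      # image -> matching text
    loss_t2i = F.cross_entropy(logits.t(), targets)  # text -> matching image
    return (loss_i2t + loss_t2i) / 2

# Toy usage: a batch of 8 image and 8 text [CLS] embeddings of dimension 256.
loss = itc_loss(torch.randn(8, 256), torch.randn(8, 256))
```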

VLMo

Pasted image 20240529221423.png
Pasted image 20240529222652.png

Each paradigm (dual encoder vs. fusion encoder) has its pros and cons, so naturally they proposed MoME (Mixture-of-Modality-Experts).
A Transformer block carries little inductive bias, so it can adapt to multiple modalities.
Pasted image 20240529223531.png
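
A minimal sketch of a MoME-style block, assuming a simplified routing interface (class and argument names are illustrative, not VLMo's code): self-attention is shared across modalities, while each input is routed to a modality-specific feed-forward expert (vision, language, or vision-language).

```python
import torch
import torch.nn as nn

class MoMEBlock(nn.Module):
    """Mixture-of-Modality-Experts block sketch: shared self-attention,
    modality-specific feed-forward experts chosen by a routing key."""
    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.experts = nn.ModuleDict({
            name: nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for name in ("vision", "language", "vl")
        })

    def forward(self, x, modality="vl"):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # shared across modalities
        x = x + self.experts[modality](self.norm2(x))       # per-modality expert FFN
        return x

block = MoMEBlock()
image_tokens = torch.randn(2, 49, 768)
out = block(image_tokens, modality="vision")  # an image-only pass uses the vision expert
```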

At that time, large image-text datasets such as LAION and CLIP's WIT had not been released; another benefit is that MoME can also use single-modality (image-only and text-only) data to train the model via stagewise pre-training.
Pasted image 20240529223620.png
Losses: same as ALBEF (ITC, ITM, MLM).

BLIP

Also from Salesforce Research, like ALBEF.

BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation

Bootstrapping:
Pasted image 20240529230113.png

A captioner generates new captions to substitute for the noisy web alt-text, and a filter removes mismatched image-text pairs (CapFilt).
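
A sketch of this bootstrapping (CapFilt) loop; `captioner.generate` and `filter_model.match_score` are hypothetical interfaces standing in for BLIP's fine-tuned captioner and ITM-based filter:

```python
def bootstrap_captions(pairs, captioner, filter_model, threshold=0.5):
    """CapFilt sketch: `pairs` is an iterable of (image, web_alt_text).
    `captioner.generate` and `filter_model.match_score` are hypothetical
    stand-ins for BLIP's fine-tuned captioner and ITM-based filter."""
    cleaned = []
    for image, alt_text in pairs:
        synthetic = captioner.generate(image)  # write a new caption for the image
        for text in (alt_text, synthetic):
            # Keep only the texts the filter judges as matching the image.
            if filter_model.match_score(image, text) > threshold:
                cleaned.append((image, text))
    return cleaned
```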

Unified:
Pasted image 20240529231008.png

Compared with VLMo, BLIP likewise shares parameters across branches, but adds an extra decoder.
For the decoder, a language modeling (LM) loss and causal self-attention are the natural choices.
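
A small sketch contrasting the two self-attention patterns mentioned above: the image-grounded text encoder (trained with ITM) uses bidirectional attention, while the image-grounded text decoder (trained with the LM loss) uses a causal mask.

```python
import torch

def self_attention_mask(seq_len, causal):
    """True = the position may be attended to. Bidirectional mask for the
    ITM encoder branch, lower-triangular (causal) mask for the LM decoder."""
    if not causal:
        return torch.ones(seq_len, seq_len, dtype=torch.bool)
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

print(self_attention_mask(4, causal=False))  # encoder: every token sees every token
print(self_attention_mask(4, causal=True))   # decoder: each token sees only its prefix
```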

CoCa

Pasted image 20240529234232.png

  1. Image features are aggregated with attentional pooling, which the paper's experiments show to be effective (see the sketch after this list).

  2. The ITM loss was removed to speed up training. Originally the text had to be forwarded 2-3 times; after removing ITM, only one forward pass is needed. In ALBEF, ITM needs the complete text while MLM needs the masked text, so there are two text inputs. In BLIP, ITC is one pass, ITM is a separate pass because a new cross-attention module is inserted into the text model, and LM is yet another pass because it uses both the new module and causal self-attention. In CoCa, a single forward pass suffices for both the captioning loss and the ITC loss: as in GPT, placing the cls token at the end of the causal sequence lets it see the whole text, yielding the global representation used for ITC.
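
A sketch of the attentional pooler mentioned in point 1 (sizes and the number of queries are illustrative; CoCa uses a single query for the contrastive embedding and a longer set of queries for the captioning cross-attention):

```python
import torch
import torch.nn as nn

class AttentionalPooler(nn.Module):
    """Pool a variable number of image tokens into a fixed set of learned
    queries via cross-attention (sizes here are illustrative)."""
    def __init__(self, dim=768, num_queries=1, heads=12):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, num_queries, dim))
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, image_tokens):
        q = self.queries.expand(image_tokens.size(0), -1, -1)
        pooled, _ = self.attn(q, image_tokens, image_tokens)
        return pooled  # (batch, num_queries, dim)

image_tokens = torch.randn(2, 196, 768)                            # e.g. ViT patch features
itc_embed = AttentionalPooler(num_queries=1)(image_tokens)         # for the contrastive loss
caption_keys = AttentionalPooler(num_queries=256)(image_tokens)    # for decoder cross-attention
```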

BEiT-3

Image as a Foreign Language: images are tokenized and modeled just like text.
Same structure as VLMo (Multiway / MoME Transformer).
Only one loss: masked data modeling.
Mask anything: text, images, or image-text pairs.
Pasted image 20240529235504.png
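
A minimal sketch of what a single masked-data-modeling loss looks like (vocabulary size, sequence length, and masking ratio are toy values, not BEiT-3's): whatever the modality, some tokens are masked and one cross-entropy loss reconstructs them.

```python
import torch
import torch.nn.functional as F

def masked_data_modeling_loss(logits, targets, mask):
    """One loss for everything: cross-entropy on the masked positions only.
    logits: (B, L, V) predictions over a text or visual token vocabulary,
    targets: (B, L) original token ids, mask: (B, L) True where masked."""
    return F.cross_entropy(logits[mask], targets[mask])

# Toy usage (vocab size, length, and masking ratio are illustrative):
logits = torch.randn(2, 10, 100)
targets = torch.randint(0, 100, (2, 10))
mask = torch.rand(2, 10) < 0.4
loss = masked_data_modeling_loss(logits, targets, mask)
```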

  1. Backbone Architecture: This involves the design of the underlying network that supports both vision and language tasks. A unified architecture allows for seamless integration and processing of multimodal data.

  2. Pretraining Task: The tasks used during pretraining are crucial. They should be designed in a way that the model can learn generalizable features from both images and text that can be applied to various downstream tasks.

  3. Model Scaling Up: This refers to increasing the size and capacity of the model to handle more complex tasks and larger datasets, which can lead to better performance and generalization.

Pasted image 20240529235139.png

LLaVA