CLIP and variants

```mermaid
mindmap
  root((CLIP))
    Multimodal
      Vision-language downstream tasks
    Image generation
      CLIPasso
      VQGAN-CLIP
      CLIPDraw
    Other domains
      DepthCLIP - depth and optical flow
      PointCLIP - 3D point clouds
      AudioCLIP - audio
    Segmentation
      GroupViT
      LSeg
    Object detection
      ViLD
      GLIP v1 and v2
    Video
      VideoCLIP
      CLIP4Clip
      ActionCLIP
```

All of these papers use CLIP in one of three ways:

1. CLIP as a feature provider (its frozen features feed a downstream model).
2. CLIP as a teacher (knowledge distillation).
3. CLIP as a method (multi-modal contrastive learning applied to a new domain).

CLIP pre-training pseudocode (numpy-style, as given in the paper):

```python
# image_encoder - ResNet or Vision Transformer
# text_encoder  - CBOW or Text Transformer
# I[n, h, w, c] - minibatch of aligned images
# T[n, l]       - minibatch of aligned texts
# W_i[d_i, d_e] - learned proj of image to embed
# W_t[d_t, d_e] - learned proj of text to embed
# t             - learned temperature parameter

# extract feature representations of each modality
I_f = image_encoder(I)  # [n, d_i]
T_f = text_encoder(T)   # [n, d_t]

# joint multimodal embedding [n, d_e]
I_e = l2_normalize(np.dot(I_f, W_i), axis=1)
T_e = l2_normalize(np.dot(T_f, W_t), axis=1)

# scaled pairwise cosine similarities [n, n]
logits = np.dot(I_e, T_e.T) * np.exp(t)

# symmetric loss function
labels = np.arange(n)
loss_i = cross_entropy_loss(logits, labels, axis=0)
loss_t = cross_entropy_loss(logits, labels, axis=1)
loss = (loss_i + loss_t) / 2
```
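For reference, here is a minimal runnable sketch of just the loss computation, with the two encoders replaced by random features. The helper names (`l2_normalize`, `cross_entropy_rows`) and the example sizes are mine, not from the paper:

```python
import numpy as np

def l2_normalize(x, axis=1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def cross_entropy_rows(logits, labels):
    # standard cross-entropy where row i holds the class scores for example i
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

n, d_e = 8, 64
rng = np.random.default_rng(0)
I_e = l2_normalize(rng.standard_normal((n, d_e)))  # stand-in for projected image embeddings
T_e = l2_normalize(rng.standard_normal((n, d_e)))  # stand-in for projected text embeddings
t = np.log(1 / 0.07)                               # log of the paper's initial temperature

logits = I_e @ T_e.T * np.exp(t)   # [n, n]; logits[i, j] = scaled sim(image i, text j)
labels = np.arange(n)              # matching pairs sit on the diagonal
loss_image = cross_entropy_rows(logits, labels)    # each image classifies among all texts
loss_text = cross_entropy_rows(logits.T, labels)   # each text classifies among all images
loss = (loss_image + loss_text) / 2
print(loss)
```

Running the cross-entropy once over the rows and once over the transposed matrix is the same symmetric image-to-text / text-to-image objective that the paper's `axis=0` / `axis=1` calls express.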

ViLD

Open-vocabulary object detection via knowledge distillation: CLIP is the teacher, and the detector's region embeddings are trained both to match CLIP text embeddings of class names and to mimic CLIP image embeddings of cropped proposals.
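A rough sketch of those two losses, assuming the frozen CLIP text/image embeddings are already computed; the `vild_losses` helper, tensor names, and shapes are illustrative, not the paper's code:

```python
import torch
import torch.nn.functional as F

def vild_losses(region_emb, clip_crop_emb, text_emb, labels, tau=0.01):
    # region_emb:    [M, d] detector head output for M proposals (student)
    # clip_crop_emb: [M, d] CLIP image embeddings of the cropped proposals (frozen teacher)
    # text_emb:      [C, d] CLIP text embeddings of the C base-class names (frozen)
    # labels:        [M]    ground-truth class index per proposal
    region_emb = F.normalize(region_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # "ViLD-text": classify regions against the class-name embeddings
    loss_text = F.cross_entropy(region_emb @ text_emb.t() / tau, labels)

    # "ViLD-image": distill the CLIP image embedding into the region embedding
    loss_distill = F.l1_loss(region_emb, F.normalize(clip_crop_emb, dim=-1))

    return loss_text + loss_distill
```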

GLIP

Phrase grounding: object detection is reformulated as grounding the words of a text prompt, so region features are scored against word features instead of a fixed set of class logits.

GLIPv2 extends the same grounding formulation to more tasks.
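A minimal sketch of that grounding-style head, assuming region features from a detector and token features from a text encoder; names and shapes are illustrative:

```python
import torch
import torch.nn.functional as F

def grounding_logits(region_feats, token_feats):
    # region_feats: [M, d] visual features for M candidate boxes
    # token_feats:  [L, d] text features for the L tokens of a prompt
    #               such as "person. bicycle. car."
    region_feats = F.normalize(region_feats, dim=-1)
    token_feats = F.normalize(token_feats, dim=-1)
    return region_feats @ token_feats.t()  # [M, L] region-word alignment scores
```

New classes then only require new words in the prompt, not a new classifier head.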

CLIPasso

Turns photos into simple sketches made of a few strokes (SIGGRAPH 2022 Best Paper). See its citation [13] for why CLIP is robust enough to compare photos with sketches.
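A hand-wavy sketch of the loss structure: optimize stroke parameters through a differentiable rasterizer against a CLIP-space semantic loss plus a geometric loss on intermediate CLIP features. `render`, `clip_embed`, and `clip_features` are placeholders for the rasterizer and CLIP encoder calls, not real APIs:

```python
import torch
import torch.nn.functional as F

def clipasso_loss(stroke_params, image, render, clip_embed, clip_features):
    sketch = render(stroke_params)  # differentiable rasterization of the strokes
    # semantic loss: the sketch and the photo should have similar CLIP embeddings
    sem = 1 - F.cosine_similarity(clip_embed(sketch), clip_embed(image), dim=-1).mean()
    # geometric loss: early CLIP feature maps should also match (preserves layout)
    geo = sum(F.mse_loss(fs, fi)
              for fs, fi in zip(clip_features(sketch), clip_features(image)))
    return sem + geo
```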

CLIP4Clip

Video-text retrieval: once you can score the similarity between a text and a video, you can do ranking, matching, and retrieval.

An empirical study of how to use temporal information, i.e. how to turn per-frame CLIP features into one video representation: the simple way (parameter-free mean pooling), a sequence transformer, or early fusion (the "tight" type).
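A sketch of the two lighter aggregation options over per-frame CLIP embeddings (mean pooling and a small temporal transformer); module names and sizes are illustrative:

```python
import torch
import torch.nn as nn

class MeanPool(nn.Module):
    """Parameter-free: average the frame embeddings."""
    def forward(self, frame_emb):          # frame_emb: [B, T, d] per-frame CLIP embeddings
        return frame_emb.mean(dim=1)       # [B, d] video embedding

class TemporalTransformer(nn.Module):
    """Let frames attend to each other before pooling."""
    def __init__(self, d=512, layers=2, heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, frame_emb):                    # [B, T, d]
        return self.encoder(frame_emb).mean(dim=1)   # [B, d]
```

The tight/early-fusion variant instead feeds frame and text tokens into one cross-modal transformer, so it is not a drop-in pooling module like these two.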

ActionCLIP

Parameter-efficient fine-tuning: prompt tuning, adapters, LoRA.
Just like CLIP4Clip, it needs a way to convert frame (image) embeddings into a video embedding, with the same three options.
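A sketch of prompt-based action classification in the CLIP style, assuming a CLIP text encoder and a frame-aggregating video encoder are available as callables; `text_encode`, `video_encode`, and the prompt template are placeholders:

```python
import torch
import torch.nn.functional as F

def classify_action(video, labels, video_encode, text_encode,
                    template="a video of a person {}"):
    # video_encode(video) -> [d] aggregated video embedding
    # text_encode(str)    -> [d] CLIP text embedding of one prompt
    v = F.normalize(video_encode(video), dim=-1)
    prompts = [template.format(label) for label in labels]
    t = F.normalize(torch.stack([text_encode(p) for p in prompts]), dim=-1)  # [C, d]
    scores = t @ v                       # cosine similarity to each label prompt
    return labels[scores.argmax().item()]
```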