Contrastive learning in CV
```mermaid
mindmap
  root((Self-Supervised Learning))
    First Phase
      InstDisc
        note: Proposed instance discrimination, using a memory bank for negative samples
      InvaSpread
        note: End-to-end learning without external structures, limited by small batch size
      CPCv1
        note: Introduced the InfoNCE objective, versatile for images, audio, video, text, and RL
      CMC
        note: Extended two-view tasks to multiple views, laying groundwork for multi-view and multi-modal contrastive learning
    Second Phase
      MoCov1
        note: Extended InstDisc by replacing the memory bank with a queue and adding a momentum encoder for pre-training
      SimCLRv1
        note: Similar to InvaSpread but with a larger batch size, more data augmentations, a projection head, and longer training
      CPCv2
        note: Applied these newer techniques to CPCv1, significantly improving performance
      Infomin
        note: Introduced the InfoMin principle, arguing that views should share only task-relevant mutual information
      MoCov2
        note: Applied SimCLRv1 techniques to MoCov1, achieving better results
      SimCLRv2
        note: Enhanced SimCLR for semi-supervised learning
      SwAV
        note: Combined clustering with contrastive learning and introduced the multi-crop technique
    Third Phase
      BYOL
        note: Removed negative samples, using a simple MSE loss for self-supervised learning
      BN Blog
        note: Suggested BYOL works because batch norm provides implicit negative samples
      BYOLv2
        note: Refuted the BN Blog, showing model initialization matters more than batch norm statistics
      SimSiam
        note: Simplified the setup, showing the stop-gradient operation is critical for avoiding model collapse
      Barlow Twins
        note: New objective that pushes the cross-correlation matrix between the embeddings of two views toward the identity matrix
    Fourth Phase
      note: Backbone changed from ResNet to Vision Transformers, addressing training stability issues
      MoCov3
        note: Froze the patch projection layer to stabilize ViT training
      DINO
        note: Applied centering to the teacher network output to avoid collapse
```
- InstDisc: Inspired by results from supervised learning, it treats each image as its own class (instance discrimination) and uses a convolutional neural network (CNN) to encode images into discriminative low-dimensional features. It trains with contrastive learning over positive and negative samples, storing the negatives in a memory bank because there are too many to recompute at every step.
- InvaSpread: A precursor to SimCLR, it uses the other images in the same mini-batch as negative samples and trains a single encoder end to end, without any external memory structure.
- Contrastive Predictive Coding (CPC): A generic framework that handles audio, images, text, and reinforcement learning. An autoregressive model summarizes past inputs into a context that is used to predict the features of future inputs, trained with the InfoNCE loss.
- Contrastive Multiview Coding (CMC): Learns features that are robust and invariant across different views of the same scene, such as visual and auditory signals, by maximizing the mutual information between views.
- Momentum Contrast (MoCo): Introduces a queue and a momentum encoder to build a large and consistent dictionary of negative samples for contrastive learning (see the momentum-update sketch after this list).
- A Simple Framework for Contrastive Learning of Visual Representations (SimCLR): Uses a large batch size, strong data augmentation, and a projection head; two augmented views of the same image are positives and all other images in the batch are negatives, optimized with the NT-Xent form of the InfoNCE loss (sketched after this list).
- Bootstrap Your Own Latent (BYOL): A self-supervised approach that uses no negative samples at all; an online network with an extra predictor learns to match the output of a slowly updated target network, which avoids model collapse.
- Exploring Simple Siamese Representation Learning (SimSiam): Achieves good results without negative samples, large batch sizes, or momentum encoders; a stop-gradient operation on one branch is the critical ingredient that prevents collapse (see the sketch after this list).
- Vision Transformers: Combining Vision Transformers with self-supervised learning led to methods such as MoCo v3 and DINO, which explore how to train ViT backbones stably in a self-supervised setting.
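To make the MoCo description above concrete, here is a minimal PyTorch-style sketch of its two key pieces: the momentum (EMA) update of the key encoder and an InfoNCE loss computed against a queue of past keys. The function names, default hyperparameters, and queue handling are illustrative assumptions, not the official implementation.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def momentum_update(encoder_q, encoder_k, m=0.999):
    """EMA update: the key encoder slowly tracks the query encoder."""
    for pq, pk in zip(encoder_q.parameters(), encoder_k.parameters()):
        pk.data.mul_(m).add_(pq.data, alpha=1.0 - m)

def moco_loss(q, k, queue, temperature=0.07):
    """InfoNCE over one positive key per query and K queued negatives.

    q: [N, D] query features, k: [N, D] key features (computed without grad),
    queue: [D, K] features of past keys acting as negatives.
    """
    q = F.normalize(q, dim=1)
    k = F.normalize(k, dim=1)
    l_pos = torch.einsum("nd,nd->n", q, k).unsqueeze(1)   # [N, 1] positive logits
    l_neg = torch.einsum("nd,dk->nk", q, queue)           # [N, K] negative logits
    logits = torch.cat([l_pos, l_neg], dim=1) / temperature
    labels = torch.zeros(q.size(0), dtype=torch.long, device=q.device)  # positive is index 0
    return F.cross_entropy(logits, labels)
```

In the full method, the new keys are enqueued after each step and the oldest keys are dequeued, so the dictionary stays large while its entries remain consistent thanks to the slowly moving key encoder.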
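The SimCLR objective (NT-Xent, a normalized temperature-scaled form of InfoNCE) fits in a few lines. Below is a minimal sketch assuming `z1` and `z2` are the projected embeddings of two augmentations of the same batch of images; the function name and temperature value are placeholders.

```python
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, temperature=0.5):
    """NT-Xent loss for two batches of projected views, each of shape [N, D]."""
    z1 = F.normalize(z1, dim=1)
    z2 = F.normalize(z2, dim=1)
    z = torch.cat([z1, z2], dim=0)           # [2N, D]
    sim = z @ z.t() / temperature            # [2N, 2N] scaled cosine similarities
    sim.fill_diagonal_(float("-inf"))        # an embedding is never its own negative
    n = z1.size(0)
    # The positive for row i is its other view: i + N in the first half, i - N in the second.
    targets = torch.cat([torch.arange(n) + n, torch.arange(n)]).to(z.device)
    return F.cross_entropy(sim, targets)
```

Every other embedding in the doubled batch serves as a negative, which is why SimCLR benefits from very large batch sizes.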
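The stop-gradient mechanism shared by BYOL and SimSiam can be sketched as follows: one branch is detached so that gradients flow only through the predictor branch, and the loss is a symmetrized negative cosine similarity. `encoder` and `predictor` are placeholder modules; BYOL additionally maintains a momentum-updated target encoder, which is omitted here.

```python
import torch.nn.functional as F

def neg_cosine(p, z):
    """Negative cosine similarity; z is detached (stop-gradient)."""
    p = F.normalize(p, dim=1)
    z = F.normalize(z.detach(), dim=1)
    return -(p * z).sum(dim=1).mean()

def simsiam_loss(x1, x2, encoder, predictor):
    """Symmetrized loss for two augmented views x1, x2 of the same images."""
    z1, z2 = encoder(x1), encoder(x2)
    p1, p2 = predictor(z1), predictor(z2)
    return 0.5 * (neg_cosine(p1, z2) + neg_cosine(p2, z1))
```

Removing the stop-gradient (or the predictor) makes the representations collapse to a constant, which is SimSiam's central empirical finding.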
The "BN Blog" node in the diagram refers to the imbue post "Understanding self-supervised and contrastive learning with 'Bootstrap Your Own Latent' (BYOL)".